Jun 26, 2020

Scaling Release Systems

As products grow, so does the need to ship more often, which is desirable because it enables continuous feature delivery with minimal friction.

Getting to the point where you can release changes multiple times a day, sometimes even multiple times an hour, all without having to worry about breaking production requires a few conditions to be met upfront.

🪐 Ensure Reproducible Builds and Deployments

Whether your code is running locally, in test environments, or in production, it should behave the same. This is often easier said than done: environments might differ in how deployments are performed, how services are set up, or in the availability of third-party services. Any piece that differs between environments, though, can lead to unexpected behavior for your users.

Based on this, it's recommended to keep environments aligned and as consistent as possible. Sure, you might not be able to run the full range of AWS services locally (on that note, you're not the first person with this problem, and there are plenty of tools and resources to solve it). Especially when it comes to service configuration, the fewer variations there are, the easier it gets to comprehend what gets deployed where and how, and to debug across environments if something goes wrong.
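One practical way to keep the variations in check is to funnel every environment-specific value through a single, explicit configuration surface, so the code path itself is identical everywhere. Here's a minimal sketch in TypeScript; the variable names are made up for illustration:

```typescript
// config.ts: one code path for every environment, only the values differ.
// All names here (PAYMENTS_API_URL etc.) are hypothetical examples.

interface Config {
  port: number;
  databaseUrl: string;
  paymentsApiUrl: string; // point this at a local stub in development
}

function requireEnv(name: string): string {
  const value = process.env[name];
  if (!value) {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value;
}

export const config: Config = {
  port: Number(process.env.PORT ?? "3000"),
  databaseUrl: requireEnv("DATABASE_URL"),
  paymentsApiUrl: requireEnv("PAYMENTS_API_URL"),
};
```

Failing loudly on missing values means a misconfigured environment surfaces at startup instead of as a mysterious runtime difference.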

What this boils down to is that simplicity is key. Using platform defaults, such as AWS roles passed into your environment instead of generated API credentials that have to be declared explicitly every time, also helps to declutter your configuration.
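To make that concrete, here's a rough sketch using the AWS SDK for JavaScript v3 (the region and bucket name are placeholders): no credentials appear in code or config, because the SDK's default provider chain resolves them from the environment, for example from an IAM role attached to the instance or task.

```typescript
import { S3Client, ListObjectsV2Command } from "@aws-sdk/client-s3";

// No access keys anywhere: the SDK's default credential chain picks them
// up from the environment (IAM role, env variables, shared config file).
const s3 = new S3Client({ region: "eu-central-1" });

export async function listUploads() {
  // "my-uploads-bucket" is a placeholder name for illustration.
  return s3.send(new ListObjectsV2Command({ Bucket: "my-uploads-bucket" }));
}
```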

For truly reproducible deployments, it's also important that your build doesn't change mid-way: when using containers, this is easily achieved by distributing and running the exact image in production that was verified to work in staging or development environments. This makes sure you're running the same artifact across deployments. Using dependency locking mechanisms, whether it's go.sum for Go, yarn.lock or package-lock.json for Node.js, or whichever method your language and tooling provide, can also help to prevent third-party code from changing underneath you.
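One lightweight way to confirm the same build really made it through every stage is to let each build identify itself. A small sketch, assuming a GIT_COMMIT value is baked into the image at build time (for example via a build argument):

```typescript
import { createServer } from "http";

// GIT_COMMIT is assumed to be injected at build time. If every environment
// reports the same value, the exact same build is running everywhere.
const commit = process.env.GIT_COMMIT ?? "unknown";

createServer((req, res) => {
  if (req.url === "/version") {
    res.setHeader("Content-Type", "application/json");
    res.end(JSON.stringify({ commit }));
    return;
  }
  res.statusCode = 404;
  res.end();
}).listen(3000);
```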

In the end, we want to make sure that there's no way for a change to sneak in as it moves through our stages from development to production. Code that was tested and ran one way in development should execute the exact same way in production if we want the validation and testing work to pay off. Otherwise, we'd be flying blind, allowing breaking changes to be introduced without us knowing, or unexpected configuration to mess with our expectations.

🕙 Automate Tests to Check for Breaking Changes

So now that we've ensured that our tests are valid for all environments, we need to make sure they run before we release. Whether it's for every pushed commit or just at the pull-request level before merging in changes, running our tests allows us to spot failing changes quickly.

Sometimes, of course, we can't just run all tests for every change: especially long-running integration tests might be scheduled to run only once a day. Often you're able to run those locally, though, which might solve the issue. It's still highly important to make sure critical paths are defined and tested so your most-used features don't break. It might just be a collection of focused browser tests that do the job.
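As an illustration of what such a focused browser test could look like, here's a sketch using Playwright (the URL, selectors, and credentials are placeholders for your application):

```typescript
import { test, expect } from "@playwright/test";

// One critical path, tested end to end: existing users can sign in.
test("existing users can sign in", async ({ page }) => {
  await page.goto("https://staging.example.com/login");
  await page.fill('input[name="email"]', "test@example.com");
  await page.fill('input[name="password"]', "a-test-password");
  await page.click('button[type="submit"]');
  await expect(page.locator("text=Dashboard")).toBeVisible();
});
```

A handful of these, covering sign-in, checkout, or whatever your product can't afford to break, already catches a large share of regressions.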

Guillermo Rauch of Vercel (formerly ZEIT) recently published a great post outlining how modern end-to-end tests can be used to build a fast and inexpensive testing foundation for your web applications.

⏮ Add Rollback Capabilities (To Undo a Failing Release)

Now that we've tested and deployed our code with confidence, we've arrived at the last point I want to emphasize in this post.

Everything seems to be running smoothly, but suddenly you hear a flood of notification sounds coming in. A customer notifies you that your product is breaking for them. Then comes another. And another. Some of the customers need this resolved urgently; you might have enraged a high-profile enterprise account that bought an SLA and expects you to deliver.

This is probably the worst feeling of all. No matter the precautions, something just broke production. It shouldn't have happened, but it did. So now we need to keep a cool head and roll back the changes we just deployed (this might even be completely automated), while simultaneously assuring the ever-growing crowd of customers that you're working on it and a fix will arrive soon.

Shipping a real fix, however, might take too long, as you'd have to assess what broke, investigate why it did, and then test and ship another iteration. So sometimes we need to roll back for good, undoing the changes that led up to this.

If we spent some time streamlining our deployment process, this might be as simple as clicking a few buttons; in other cases, it might require more work than that. Planning for this scenario upfront will save you a lot of stress later on, even if you're lucky enough to never need it.
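To show what those few buttons might hide, here's one classic approach as a rough sketch: every deploy lands in its own release directory, and a symlink decides which one is live, so rolling back just re-points the symlink. This assumes timestamp-sorted release directories with the newest one currently live; the paths are illustrative.

```typescript
import * as fs from "fs";
import * as path from "path";

// Layout: /srv/app/releases/<timestamp> per deploy, /srv/app/current
// is a symlink to the live release. The old build stays on disk, so a
// rollback is a pointer swap instead of a rebuild and redeploy.
const releasesDir = "/srv/app/releases";
const currentLink = "/srv/app/current";

export function rollback(): void {
  // Timestamped directory names sort chronologically; the newest release
  // is assumed to be the one currently live.
  const releases = fs.readdirSync(releasesDir).sort();
  if (releases.length < 2) {
    throw new Error("No previous release to roll back to");
  }
  const previous = path.join(releasesDir, releases[releases.length - 2]);

  // Create the new symlink next to the old one, then rename it over the
  // existing link, which replaces it atomically on POSIX systems.
  const tmpLink = `${currentLink}.tmp`;
  fs.symlinkSync(previous, tmpLink);
  fs.renameSync(tmpLink, currentLink);
}
```

Because the previous build is still sitting on disk, the rollback itself takes a moment rather than a full build-and-deploy cycle.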


In conclusion, when building products and services used by people all over the globe, in different locations and timezones, you'll have to create workflows that scale with this requirement. Being able to roll out changes quickly, ensuring that you don't ship broken code, and making your infrastructure easy to grasp will help you get there.

As with all things, it's often necessary to experience this first-hand, but maybe this post has given you some ideas for improving your team's processes. If that's the case, or if you've got feedback, please reach out to me on Twitter or by mail.