Sep 19, 2021

Cutting Production Release Duration By 80%

The faster you can release changes, the faster you can ship new features, fix bugs, and resolve incidents. But there's more: the less time your engineers spend thinking about deploying to production, the more time they get back for meaningful tasks, or for anything else.

Viewing the problem from this angle shows how valuable it can be to reduce the time it takes from a pull request (substitute this with your respective concept if you have a different flow) to production services running on the new version, with every infrastructure component in place.

Before we get to the solution, we'll look at how we got here. If you can't wait to see what we came up with, and how we improved our time from PR to production by 80%, skip to Making it really fast.

The old days: Manual Labour

In the old days, when infrastructure was a different story, you provisioned most resources by hand. This would work fine when you didn't regularly change the services your product was using, for example setting up an S3 bucket once.

Back then, you might have set up workloads on your own infrastructure as well, putting it all together with some scripts or a deployment system.

These times are long gone now, and we usually deploy containers in some managed environment, like AWS ECS or Google Cloud Run. To expose our services to the internet, we need to deploy load balancers and configure how they connect to our service. And that doesn't include our database yet. If we run a multi-tenant setup with multiple regions, we might have to replicate these setup steps multiple times, making sure they match up.

Doing this by hand is almost impossible to get right. That's one of the reasons why a new category of tooling has emerged in recent years: you define which services you want to use, and the tooling takes care of the rest. Enter infrastructure-as-code (IaC for short).

A better approach: Infrastructure-as-Code

When using multiple interconnected services from one or multiple cloud providers, infrastructure-as-code tooling is what will keep you sane.

The concept is simple:

  1. you declare the services you want to use in some form of configuration
  2. you run the IaC tool
  3. the tool compares the actual infrastructure with your declaration
  4. the tool tries to reach the state you described, without any further input
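The compare step in the list above can be sketched in a few lines of Go. This is a toy model, not how Pulumi or Terraform work internally: the `State` and `Plan` types and the `Diff` function are hypothetical names, and each resource is reduced to a single configuration string.

```go
package main

import (
	"fmt"
	"sort"
)

// State is a hypothetical, heavily simplified view of infrastructure:
// resource name -> configuration.
type State map[string]string

// Plan lists the actions needed to get from the actual to the desired state.
type Plan struct {
	Create []string
	Update []string
	Delete []string
}

// Diff compares the declared (desired) state with the actual one.
func Diff(desired, actual State) Plan {
	var p Plan
	for name, want := range desired {
		have, exists := actual[name]
		switch {
		case !exists:
			p.Create = append(p.Create, name)
		case have != want:
			p.Update = append(p.Update, name)
		}
	}
	for name := range actual {
		if _, declared := desired[name]; !declared {
			p.Delete = append(p.Delete, name)
		}
	}
	// Map iteration order is random, so sort for stable output.
	sort.Strings(p.Create)
	sort.Strings(p.Update)
	sort.Strings(p.Delete)
	return p
}

func main() {
	desired := State{"bucket": "private", "service": "v2"}
	actual := State{"service": "v1", "old-queue": "standard"}
	fmt.Printf("%+v\n", Diff(desired, actual))
}
```

A real tool then executes the plan against the cloud provider's API and records the result, which is the part that takes the time.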

The most popular tool is Terraform by HashiCorp. It uses their in-house configuration language HCL, which I found quite hard to wrap my head around when I started. Fortunately, another provider tackles the same problem in a more developer-friendly way. Pulumi allows you to declare the infrastructure as actual code, be it JavaScript/TypeScript, Python, or Go, and then takes care of the rest.

Early last year, my team at Hygraph switched to writing infrastructure as code with Pulumi and since then, we have imported most of our existing infrastructure to be managed with Pulumi.

For a number of reasons, we maintain a lot of stacks, each describing an instance of our complete production infrastructure. Whenever we want to deploy a release, these stacks receive the version we want to run and apply those changes to the upstream infrastructure.

We could run those updates manually, but let's face it, doing this with more than a couple of stacks gets really tedious and slow. We needed a faster solution.

Even better: Applying infrastructure updates in CI

Pulumi is completely portable and provides everything you need to run it in your CI system. We use GitHub Actions for everything, to have an all-in-one workflow.

When we open a pull request to release a new change to our infrastructure, we'll get a preview of changes to be applied. This is important because it gives us confidence that nothing will break. Once we're ready to go, we'll merge the pull request, which results in a push to the main branch. This is detected by another GitHub Actions workflow that applies the changes we saw in the preview.

When we started using Pulumi, we had less knowledge and fewer guides to work with, so we ran Pulumi in Docker within GitHub Actions, supplying the command and args like this:

- uses: docker://pulumi/actions
  with:
    args: preview --stack ...

This worked out as expected, but it was relatively slow. Each pulumi up usually took between 7 and 15 minutes, and as the number of stacks grew, we ended up with cumulative times of more than four hours of CI time for every release.

Not only could we have invested the time to build and improve the product, but every minute of that long-running release was also billed.

And if that wasn't enough, we deployed more than 220 releases (not including infrastructure changes) since automating Pulumi deployments, so you can imagine that we accumulated a lot of CI minutes.

Last week we revisited the setup, and with the help of my amazing colleagues, we made huge improvements.

Making it really fast

Pulumi has seen a lot of changes since we started using it, the release of Pulumi 3.0 being one of them. The GitHub Action maintained by the core team has also changed a lot in the meantime.

Between the lines of huge changelogs we repeatedly saw performance improvements and decided that to keep our setup maintainable and speed up deployment time, there was no way around upgrading Pulumi.

And so we did just that, thinking that it'd be done in a couple of hours. It wasn't quite that easy.

Improving the GitHub Action

We started by swapping out the previous GitHub Actions workflow with an improved version that made use of the most recent release of the Pulumi action.

Since we run an Actions job for each stack, we wanted to reduce copied code and found out about composite actions. In short, composite actions let you reuse workflow steps while still accepting dynamic inputs.

We created a local Pulumi composite action that contained everything the official guide and action docs recommended.

inputs:
  command:
    description: 'Pulumi command to run'
  stack:
    description: 'Pulumi stack to use'
runs:
  using: 'composite'
  steps:
    - uses: actions/setup-go@v2
      with:
        go-version: '1.17'
    - run: go mod download
      shell: bash
    - uses: pulumi/actions@v3
      with:
        command: ${{ inputs.command }}
        stack-name: ${{ inputs.stack }}

We added the composite action to the same repository that contains our Pulumi code, so we can use it in any workflow:

steps:
  - uses: actions/checkout@v2
  - uses: ./.github/actions/pulumi
    with:
      command: up
      stack: <stack name>

To speed up the action even further, we made sure to cache not only the Go modules used in our Pulumi code (note: if you use Pulumi with TypeScript, you'd cache your node_modules), but also the Pulumi plugins that are downloaded for the providers you use (e.g. AWS).

- uses: actions/cache@v2
  with:
    path: |
      ~/go/pkg/mod
      ~/.cache/go-build
      ~/.pulumi/plugins
    key: ${{ runner.os }}-go-${{ hashFiles('go.sum') }}
    restore-keys: |
      ${{ runner.os }}-go-
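To see why the key includes a hash of go.sum and why restore-keys is a bare prefix, here is a simplified Go model of the lookup. It is a sketch under assumptions, not GitHub's implementation: `CacheKey` and `Lookup` are hypothetical names, a plain SHA-256 of the file contents stands in for hashFiles (whose real output differs), and the stored keys are assumed to be ordered newest first.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// CacheKey mimics the shape of the key in the workflow: <os>-go-<hash>.
// Assumption: a plain SHA-256 of go.sum stands in for hashFiles().
func CacheKey(runnerOS string, goSum []byte) string {
	sum := sha256.Sum256(goSum)
	return runnerOS + "-go-" + hex.EncodeToString(sum[:])
}

// Lookup returns the cache entry to restore. An exact key match wins;
// otherwise the first stored key that starts with a restore prefix is used.
// Assumption: stored is ordered newest first.
func Lookup(stored []string, key string, restorePrefixes []string) (string, bool) {
	for _, k := range stored {
		if k == key {
			return k, true
		}
	}
	for _, prefix := range restorePrefixes {
		for _, k := range stored {
			if strings.HasPrefix(k, prefix) {
				return k, true
			}
		}
	}
	return "", false
}

func main() {
	oldKey := CacheKey("Linux", []byte("module-a v1.0.0"))
	newKey := CacheKey("Linux", []byte("module-a v1.1.0")) // go.sum changed
	stored := []string{oldKey}

	hit, ok := Lookup(stored, newKey, []string{"Linux-go-"})
	fmt.Println("exact match:", hit == newKey, "restored something:", ok)
}
```

When go.sum changes, the exact key misses, but the prefix still restores the previous cache, so only the changed modules and plugins need to be downloaded again.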

We also made some other minor adjustments to the action, which are unrelated to this guide.

Upgrading major versions

With the new GitHub Actions workflows in place, we still needed to upgrade our Pulumi code to use the newest SDKs. As an example, we're using a lot of AWS services, and so we needed to upgrade the Pulumi AWS SDK (pulumi-aws) from v3 to v4.

Upgrading the SDKs was relatively easy since we merely had to change imports from the old version to the newer one. Once we had all imports swapped out, our go.mod would highlight the old version as unused, and we tidied it up with go mod tidy.

We did not have to fix a lot of places, as most resources did not change. The biggest headache by far was fixing ElastiCache.

ElastiCache trouble

ElastiCache is the AWS solution for running a managed Redis cluster. Unfortunately, between the two major versions we were upgrading across, changes from the upstream Terraform provider had been merged in.

If this doesn't make sense yet, keep on reading: Since we want to declare our infrastructure as code, we need bindings that allow the IaC tool to manage the actual resources running in the respective cloud provider. For almost every big cloud provider out there, the Terraform community already manages so-called providers, which Pulumi uses with some additional code generation instead of rewriting everything from scratch.

This means that when AWS changes something in their APIs, Terraform updates terraform-provider-aws, which Pulumi then uses to generate actual code bindings for these resources.

In our case, ElastiCache changed in two ways.

First, the property snapshotArns, which we did not use, changed in the upstream Terraform provider to be a single-item list of strings. Then Pulumi upgraded the provider, which included the change, and stored snapshotArns as a single string.

Pulumi stores the underlying state of your infrastructure in a backend of your choice, defaulting to the managed service they host. This state can be exported as JSON, and ours still contained snapshotArns as the previous empty list of items.

Whenever we attempted to preview or apply a change, this would fail because Pulumi could not convert the list of items in the state to the string it expected.

The only way to fix this was to edit the state, an operation that should be the last resort, because it can corrupt your infrastructure state.

The other change was a long-standing issue with engineVersion (which defines the Redis version to run). This bug has a long history and is still open. What happened was that AWS decided to accept only the format <majorVersion>.x (e.g. 6.x) when creating an ElastiCache cluster, but would return the exact version.

This resulted in a perpetual diff: the IaC tool saw a mismatch between the expected version 6.x and the existing version 6.0.0 and attempted an update, which AWS would of course not allow.

The only sustainable fix there was to ignore changes to that property, so we just created new clusters on 6.x and told Pulumi not to care about AWS returning another version. Not great, but it could be solved for now.
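In the Pulumi Go SDK, this is done with the `pulumi.IgnoreChanges` resource option, passing the property name (here `engineVersion`). The effect on the diff can be illustrated with a small standalone Go model. This is a sketch, not Pulumi's diffing code: `ChangedProps` is a hypothetical helper, and the node type is just an illustrative value.

```go
package main

import (
	"fmt"
	"sort"
)

// ChangedProps returns the properties whose declared value differs from the
// value reported by the provider, skipping every property listed in ignore.
func ChangedProps(declared, reported map[string]string, ignore []string) []string {
	skip := make(map[string]bool, len(ignore))
	for _, prop := range ignore {
		skip[prop] = true
	}

	var changed []string
	for prop, want := range declared {
		if skip[prop] {
			continue
		}
		if reported[prop] != want {
			changed = append(changed, prop)
		}
	}
	sort.Strings(changed) // map iteration order is random
	return changed
}

func main() {
	declared := map[string]string{"engineVersion": "6.x", "nodeType": "cache.t3.micro"}
	reported := map[string]string{"engineVersion": "6.0.0", "nodeType": "cache.t3.micro"}

	// Without ignoring anything, every run sees a change it can never apply.
	fmt.Println(ChangedProps(declared, reported, nil))

	// Ignoring engineVersion makes the diff empty, so runs become no-ops.
	fmt.Println(ChangedProps(declared, reported, []string{"engineVersion"}))
}
```

The trade-off is that a deliberate change to the ignored property is skipped as well, which is why this feels like a workaround rather than a fix.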

Provider versions

While digging through the state, we saw another interesting behaviour of Pulumi: While we had updated the SDKs to a new major version, the stack wasn't applied yet due to the issues. This also meant that the provider versions were still the previous ones (e.g. aws 3.38.1 instead of 4.20.0).

We found out that applying the changes once would also update the provider version.

It's fast now!

With all the changes in place, each pulumi up takes between 1 and 3 minutes, which is roughly a fifth of our previous duration in most runs, or 80% less time spent on running releases. As a result, cumulative release durations have improved from 4+ hours to 1.5 hours at most.
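The per-run figure can be checked with a quick calculation. The sketch below uses the durations quoted in this post (7 to 15 minutes before, 1 to 3 minutes after); pairing the slowest old run with the slowest new run, and the fastest with the fastest, is an assumption made only for this illustration, and `PercentReduction` is a hypothetical helper.

```go
package main

import "fmt"

// PercentReduction returns how much shorter `after` is than `before`,
// as a percentage of `before`.
func PercentReduction(before, after float64) float64 {
	return (before - after) * 100 / before
}

func main() {
	// Slowest runs: 15 minutes before, 3 minutes after.
	fmt.Printf("15 min -> 3 min: %.0f%% less\n", PercentReduction(15, 3))
	// Fastest runs: 7 minutes before, 1 minute after.
	fmt.Printf("7 min -> 1 min: %.0f%% less\n", PercentReduction(7, 1))
}
```

Under those pairings, a single run is between 80% and roughly 86% faster than before.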

Upgrading the providers improved baseline performance by a lot, and caching Go modules and provider plugins shaved off another two minutes each run.

Now we can release way more frequently, have sane CI bills, and can focus on building and improving our product, rather than spending our time waiting for releases to roll out.

Since all changes result from improvements in the broader ecosystem, expect to hear more about our journey with IaC tooling regularly in the future!


If you found this story exciting and you're interested in scaling the infrastructure that handles billions of requests every month to deliver your content via custom GraphQL APIs, we're hiring!