May 19, 2024

Zero-Downtime Migrations in Producer/Consumer Systems

Imagine you’re building an app and want to ensure that requests don’t feel slow for the end user. Let’s say you’re working on a signup flow that needs to send an activation email and update external systems like billing and a CRM. You don’t really need to block the request until all these steps have run to completion. Instead, you could decouple this workflow into a producer and consumer. When receiving the signup request, your API would simply enqueue a task to some distributed queue, which a worker service could then pick up and process.
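A minimal sketch of that decoupling, using Python's in-memory `queue` as a stand-in for a real distributed queue (SQS, RabbitMQ, and the like) and invented handler names:

```python
import json
import queue

# Stand-in for a distributed queue; in production this would be a
# managed broker, not an in-process object.
signup_queue = queue.Queue()

def handle_signup(email: str) -> None:
    """API handler: enqueue the task and return immediately,
    instead of blocking on email/billing/CRM calls."""
    task = {"type": "signup", "email": email}
    signup_queue.put(json.dumps(task))

def process_next_task() -> dict:
    """Worker: pick up one task and process it."""
    task = json.loads(signup_queue.get())
    # Here the worker would send the activation email and
    # update billing and the CRM.
    return task
```

The request handler only pays for a single enqueue; everything slow happens in the worker.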

A couple of months after you ship the initial version of this workflow, requirements change. Unfortunately, the worker processing the messages needs to be updated to handle a slightly different message format.

You’ve thought long and hard and come up with the following rollout plan: for a short transition period, the system keeps processing both old and new messages. To achieve this, you update the consumer to handle both the new and the old message format, and you update the producer to emit the new format. Once all old messages have been processed, you can remove the code handling the old format.
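During the transition period, the consumer has to accept both formats. A minimal sketch, assuming a hypothetical `version` field added by the new producer and a nested `user` object (both invented for illustration):

```python
def extract_email(msg: dict) -> str:
    """Consumer-side parsing that tolerates both message formats
    during the transition period."""
    if msg.get("version", 1) >= 2:
        # New format (hypothetical): recipient nested under "user".
        return msg["user"]["email"]
    # Old format (hypothetical): flat "email" field, no version key.
    return msg["email"]
```

Once no old-format messages remain in the queue, the `else` branch can be deleted.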

You prepare a pull request including updates to both the producer and consumer, and all tests pass. You feel good about handling the problem and deploy to production. You’re using a managed container service and watch while new instances spin up. Suddenly, your error rate starts spiking. Messages are failing in the consumer. You begin to sweat a little and dig through the logs, finding consumer containers complaining about malformed messages. But that’s exactly what you added graceful handling for in the pull request! You assume that the new code is broken and decide to roll back. For some reason, the errors keep coming in.

What’s going on? Let’s rewind.

Orchestration systems like Kubernetes usually deploy new containers side by side with old instances. This allows for a graceful, zero-downtime rollout. This also means that two versions of your code are running side by side.

Distributed systems are notoriously hard to predict, so you can’t (and shouldn’t) rely on the relative timing of different components within a single deployment. In particular, you can’t be certain that all consumers are deployed before producers start rolling out.

Let’s go back to the time of deploying your changes.

After sifting through the errors some more, you notice that all logs originate from old containers. Suddenly, everything starts to make sense!

You roll forward to the latest version, wait for a couple of minutes until all old pods have terminated, and the errors disappear.

Let’s recap what went wrong here. First, you expected your changes to roll out in a certain order. Most systems do not give you that guarantee. Second, you did not handle the case where old consumers pick up messages enqueued by new producers. You have to ensure both that new consumers handle old messages and that old consumers handle new messages. If either invariant is violated, you’re potentially breaking your system.

So how could this have been avoided?

One possible solution would have been to separate the deployment of producers and consumers. You could have split up your pull request, isolating the changes that add graceful handling of old messages to the consumer. After deploying this and waiting for all old consumers to terminate, you could have followed up with a second deployment updating the producer code to publish new messages. At that point, no new-format message could ever reach an old consumer.

In any case, the system should have retried the failing messages instead of dropping them. After a couple of attempts, a message would have reached a new consumer and succeeded, even if the initial attempt failed. As long as you never drop a message that hasn’t been processed yet, the system is able to recover (albeit with some delay).
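A minimal sketch of that retry behavior, again using Python's in-memory `queue` as a stand-in and a hypothetical `attempts` counter carried inside the message (real brokers typically track delivery counts for you):

```python
import queue

def consume_with_retry(q: "queue.Queue", handler, max_attempts: int = 3):
    """Process one message; on failure, requeue it instead of dropping it,
    so a (possibly newer) consumer can pick it up again later."""
    msg = q.get()
    attempts = msg.get("attempts", 0)
    try:
        return handler(msg)
    except Exception:
        if attempts + 1 < max_attempts:
            # Requeue with an incremented counter; never drop the message.
            q.put({**msg, "attempts": attempts + 1})
            return None
        # Out of attempts: in a real system, route to a dead-letter queue.
        raise
```

An old consumer failing on a new-format message simply puts it back; eventually a new consumer picks it up and succeeds.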

Rolling out changes in distributed systems is a tricky problem because your tests won’t catch this kind of inconsistency. Tests run on the current version of the codebase, which compiles just fine. Ideally, your test runner would know about old code running in production and test producer changes against old and new builds.
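Short of actually testing against old builds, one partial mitigation is to pin old-format messages as fixtures and assert that the current consumer still accepts them. A sketch with invented field names; note that this catches consumer regressions, but not the inverse problem of old consumers receiving new messages:

```python
# Fixtures frozen at the format that shipped before the migration.
OLD_FORMAT_FIXTURES = [
    {"email": "a@example.com"},
]

def extract_email(msg: dict) -> str:
    """Current consumer logic, tolerating both hypothetical formats."""
    if msg.get("version", 1) >= 2:
        return msg["user"]["email"]
    return msg["email"]

def test_old_messages_still_parse():
    # Old-format messages may still be sitting in the queue during a
    # rollout; the consumer must keep accepting them.
    for msg in OLD_FORMAT_FIXTURES:
        assert extract_email(msg)
```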

Instead of writing all of this yourself, you should be using a solution like Inngest, which happens to be my current employer.