Transactions are an amazing concept: You perform some work, and only if everything succeeds, the results get persisted. If something in between fails, the system will simply roll back and thus undo all changes. Until a transaction is committed, no one can see its changes.
Most relational databases support transactions out of the box and if your application performs more than just one operation, it should probably use transactions, too. When using Postgres, every statement runs in a transaction, even if you run just a single query.
If we just interact with a single database, that's great! We can be sure that no remainders of our operation are left when we roll back. But what happens in case another system is combined with our transactions?
What if we need to send a message into a queue at the end of our operation and messages should only be enqueued if, and only if, the transaction was committed? Now we run into the terrain of distributed systems, and there's a lot that can go wrong.
For the remainder of this post, we'll use the example of a service that handles requests and, if a request was successful, enqueues a message to notify another system of the request.
A lot of ways to fail
Thinking about our example, there's more than just one scenario that can occur. If everything works as planned, our service
- performs some changes in the database
- enqueues our notification messages
- commits the transaction and
Let's imagine what a failure in each of these steps could look like.
- If we run into an issue while performing the initial changes, there's no problem. We'll just roll everything back, and return the error. This works because until now, we haven't issued any side-effects (i.e. interacted with other services in ways that would need to be reverted)
- If we cannot enqueue our notification messages, we should roll back. But what if some of the messages were already sent? Also, this order could lead to timing issues where we haven't committed the transaction yet and the first messages were already received. If the receiving service requires our operation changes to be persisted, we've got an issue.
- Okay so far so good, we performed changes and sent out messages. What if the commit fails? Not a single change has been persisted, but we already notified an external system of a successful operation.
Naturally, you might think "let's flip the order so we commit first, then send messages!". Okay let's try that
- perform some changes in the database: This one either works or gets rolled back, easy.
- commit the changes: If we cannot commit, we'll simply return an error. No messages sent yet, so we don't have to worry.
- enqueue our notification messages: We already committed, so the database persisted all changes. If a single message fails to be delivered, we've got a problem. If the external service relies on being notified for every change, for example, a cache that might require invalidation, we cannot guarantee that fact.
As you can see, regardless of the order, we'll run into issues when our process crashes or operations (partially) fail. Of course, some failures might not be a terrible issue for you. If database integrity is important but it's not critical to omit some notifications, the second example will work. If messages sent out don't rely on the data being committed, also fine.
But in the case that it's critically important to both have the changes persisted and sent all messages after that, we need a different solution.
The best side effects are no side effects
In our example, we're limited by one important constraint: Messages must be sent if, and only if, the database changes were committed. If the changes are rolled back, no messages should be sent. If we commit, a service failure should not lead to lost messages.
To solve this, we'll use the (transactional) outbox pattern.
An outbox is a database representation of our list of messages that should be sent once the transaction is committed. We're essentially offloading the complexity of deciding when to send the messages and handling service failures to the database by storing the jobs to be run in a table.
Applying the outbox pattern, our new process flow looks like the following
- In a transaction in our API service, perform database changes related to the current request/operation
- In the same transaction, store notification messages to be sent in the outbox table.
- Commit and respond to client
- In a different service entry point, check the outbox table for messages that need to be sent, fetch them 1 by 1.
- After fetching a message, publish it to the messaging service.
- After publishing, delete the message from the outbox table
This is a great start. We got rid of the side effect by serializing all changes in a single source of truth, the database, leveraging its complex transaction logic.
A second service or entry point will consume and process all outbox messages. Unfortunately, services fail. What if you publish the message and crash before deleting it from the outbox? The service will restart and try to send the message again, that's bad.
This means that systems consuming our messages must be idempotent, i.e. support sending the same message twice while only processing it exactly once. Implementing this can be done in two ways, either the queue supports exactly-once delivery/message deduplication/idempotency itself, or the consuming service does so.
AWS SQS, for example, supports native deduplication by specifying a message deduplication ID. Sending a message with the same ID in a rolling 5-minute interval will accept but not deliver it to any consumer.
Services like Stripe or Shopify also natively support idempotency.
I've implemented a lot of services that would have needed this solution and always wondered if there was a real way to solve it rather than ordering side effects and transaction commits. The outbox pattern does exactly that.
There are different patterns such as distributed transactions (including Two-Phase-Commit), but coordinating multiple services like that just increases complexity, and doesn't necessarily solve our issues. In addition to that, most messaging services don't support distributed transactions.