Transactions are an amazing concept: You perform some work, and only if everything succeeds, the results get persisted. If something in between fails, the system will simply roll back and thus undo all changes. Until a transaction is committed, no one can see its changes.
Most relational databases support transactions out of the box and if your application performs more than just one operation, it should probably use transactions, too. When using Postgres, every statement runs in a transaction, even if you run just a single query.
If we just interact with a single database, that's great! We can be sure that no remainders of our operation are left when we roll back. But what happens in case another system is combined with our transactions?
What if we need to send a message into a queue at the end of our operation and messages should only be enqueued if, and only if, the transaction was committed? Now we run into the terrain of distributed systems, and there's a lot that can go wrong.
For the remainder of this post, we'll use the example of a service that handles requests and, if a request was successful, enqueues a message to notify another system of the request.
Thinking about our example, there's more than just one scenario that can occur. If everything works as planned, our service
Let's imagine what a failure in each of these steps could look like.
Naturally, you might think "let's flip the order so we commit first, then send messages!". Okay let's try that
As you can see, regardless of the order, we'll run into issues when our process crashes or operations (partially) fail. Of course, some failures might not be a terrible issue for you. If database integrity is important but it's not critical to omit some notifications, the second example will work. If messages sent out don't rely on the data being committed, also fine.
But in the case that it's critically important to both have the changes persisted and sent all messages after that, we need a different solution.
In our example, we're limited by one important constraint: Messages must be sent if, and only if, the database changes were committed. If the changes are rolled back, no messages should be sent. If we commit, a service failure should not lead to lost messages.
To solve this, we'll use the (transactional) outbox pattern.
An outbox is a database representation of our list of messages that should be sent once the transaction is committed. We're essentially offloading the complexity of deciding when to send the messages and handling service failures to the database by storing the jobs to be run in a table.
Applying the outbox pattern, our new process flow looks like the following
This is a great start. We got rid of the side effect by serializing all changes in a single source of truth, the database, leveraging its complex transaction logic.
A second service or entry point will consume and process all outbox messages. Unfortunately, services fail. What if you publish the message and crash before deleting it from the outbox? The service will restart and try to send the message again, that's bad.
This means that systems consuming our messages must be idempotent, i.e. support sending the same message twice while only processing it exactly once. Implementing this can be done in two ways, either the queue supports exactly-once delivery/message deduplication/idempotency itself, or the consuming service does so.
AWS SQS, for example, supports native deduplication by specifying a message deduplication ID. Sending a message with the same ID in a rolling 5-minute interval will accept but not deliver it to any consumer.
I've implemented a lot of services that would have needed this solution and always wondered if there was a real way to solve it rather than ordering side effects and transaction commits. The outbox pattern does exactly that.
There are different patterns such as distributed transactions (including Two-Phase-Commit), but coordinating multiple services like that just increases complexity, and doesn't necessarily solve our issues. In addition to that, most messaging services don't support distributed transactions.