Handling Long-Running Jobs with AWS SQS

Decoupling services usually involves some form of communication, preferably asynchronous through queues that are populated by one service and consumed by another. AWS SQS offers regular and FIFO queues that help you with exactly that use case, but, as always, the devil is in the details.

Depending on how long your messages take to be processed, SQS may incorrectly assume a message was dropped and re-deliver it, leading to inconsistencies. Let’s take a step back for a moment and try to understand how SQS works, then figure out how we can process messages over a longer duration that is not known upfront.

Distributed Queues

AWS SQS is a distributed system that manages virtual queues to which you can send messages. Applications actively pull messages from the queue. Whenever your application receives a message, the queue starts an internal timer and hides the message from other consumers; once the timer expires, the message becomes available to be received again. The duration of that timer is the visibility timeout. After a message is processed, it must be deleted from the queue by the consumer.
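The receive, process, delete cycle described above could look roughly like the following sketch. It assumes `sqs` is a boto3 SQS client (for example, `boto3.client("sqs")`) that is passed in by the caller; the function name `poll_once` and the `handler` callback are made up for illustration.

```python
from typing import Any, Callable


def poll_once(sqs: Any, queue_url: str, handler: Callable[[str], None]) -> int:
    """Receive at most one message, process it, and delete it on success.

    `sqs` is assumed to be a boto3 SQS client; `handler` is a hypothetical
    callback that does the actual work. Returns the number of messages handled.
    """
    response = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=1,
        WaitTimeSeconds=20,  # long polling: wait up to 20 seconds for a message
    )
    handled = 0
    # The "Messages" key is missing when nothing was available.
    for message in response.get("Messages", []):
        handler(message["Body"])  # if this raises, the message is not deleted
        # Deleting acts as the acknowledgement. Without it, the message
        # becomes visible again once the visibility timeout expires.
        sqs.delete_message(
            QueueUrl=queue_url,
            ReceiptHandle=message["ReceiptHandle"],
        )
        handled += 1
    return handled
```

If the handler raises, the delete call is skipped on purpose, so the message reappears after the visibility timeout and can be retried.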

Regular queues (AWS calls them standard queues) are optimized for high throughput and do not guarantee a consistent order of messages. This is useful for worker functionality that does not rely on a specific order. Queues of this type provide at-least-once delivery: occasionally, you may receive duplicate messages, so you must handle deduplication in the consumer.

On the other hand, we have FIFO queues: As the name implies, message order matters in FIFO queues. FIFO queues offer exactly-once processing (automatic queue-level deduplication: a message sent with the same deduplication ID within five minutes of an earlier one is not delivered again). This is useful when your application must receive the messages in the same order that you sent them in. Ordering is scoped to message groups, so you have to specify a message group identifier for every message, and the order is kept consistent within that group. Multiple groups may be consumed concurrently, but messages from a specific group will always be received in order.
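To make the two FIFO-specific parameters concrete, here is a minimal sketch of sending a message to a FIFO queue. It assumes `sqs` is a boto3 SQS client passed in by the caller. The function name `send_fifo` and the choice to derive the deduplication ID from a SHA-256 hash of the body are assumptions made for this example; your application may have a more meaningful ID at hand, such as an order number.

```python
import hashlib
from typing import Any


def send_fifo(sqs: Any, queue_url: str, group_id: str, body: str) -> str:
    """Send one message to a FIFO queue and return its message ID.

    `sqs` is assumed to be a boto3 SQS client.
    """
    # Assumption for this sketch: identical bodies count as duplicates,
    # so the deduplication ID is a hash of the body (64 hex characters).
    dedup_id = hashlib.sha256(body.encode("utf-8")).hexdigest()
    response = sqs.send_message(
        QueueUrl=queue_url,
        MessageBody=body,
        MessageGroupId=group_id,  # ordering is kept within this group
        MessageDeduplicationId=dedup_id,  # duplicates within 5 minutes are dropped
    )
    return response["MessageId"]
```

Using something like a customer or tenant ID as the group ID keeps each customer's messages in order while still letting different customers be processed concurrently.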

Why SQS?

It is important to know that AWS SQS is not the only service related to queues and message processing. There are also Amazon MQ (managed ActiveMQ and RabbitMQ), Amazon SNS (Pub/Sub), Amazon Kinesis (streaming), and Amazon MSK (managed Kafka).

If you move away from message-based processing, you might also want to look into batch processing with AWS Batch, which can make more sense for your use case depending on the requirements.

Batch can be especially useful when you want to spin up dedicated workloads for a given task and want just-in-time provisioning of resources rather than always hosting workers waiting for messages, even if no activity is going on. On the flip side, workers will receive messages faster than a batch job is spun up.

We’re going for SQS as the most straightforward service for queues between services. Most of the lessons probably apply to the other related systems as well.

Receiving and Acknowledging Messages

Now that we’ve got a rough grasp of the two queue types available, let us think for a moment about how messages should be processed.

Typically, your application will receive one or more messages from the queue. For regular queues, you’ll receive some of the available messages, in no guaranteed order, up to the specified limit. For FIFO queues, you will also receive up to the specified limit, and if several of the returned messages belong to the same message group, they arrive in order.

When receiving messages for a specific group, no more messages for the same group are returned until you delete the message or it becomes visible after the visibility timeout. This makes sense because you don’t want another consumer to receive new messages for a given group when you’re still processing older messages of the same group.

Consistent Message Ordering

Often, your application will start multiple virtual workers concurrently to process messages (as a way of application-level load-balancing) and this is the first stage where message order could be messed up: If you receive messages from a FIFO queue but start multiple workers, CPU scheduling and other external factors can lead to inconsistent processing order.

To make sure that messages are truly processed in order, even when using a FIFO queue, make sure you understand how every step of the processing workflow behaves under concurrency once the message is received. The slowest but safest way of handling this would probably be to receive just one message, process it, then receive the next. You can also spin up multiple workers, each of which receives and processes a single message at a time, as SQS will not return further messages of the same group until the current one is deleted or becomes visible again.
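As a rough sketch of the multi-worker variant, the following starts several threads that each receive, process, and delete exactly one message at a time. It assumes `sqs` is a boto3 SQS client that is safe to share across threads; the names `start_workers`, `handler`, and `stop` are invented for this example.

```python
import threading
from typing import Any, Callable, List


def start_workers(
    sqs: Any,
    queue_url: str,
    handler: Callable[[str], None],
    worker_count: int,
    stop: threading.Event,
) -> List[threading.Thread]:
    """Start `worker_count` threads that each handle one message at a time.

    `sqs` is assumed to be a boto3 SQS client. Setting `stop` makes every
    worker exit after its current iteration.
    """

    def worker() -> None:
        while not stop.is_set():
            response = sqs.receive_message(
                QueueUrl=queue_url,
                MaxNumberOfMessages=1,  # never hold more than one message
                WaitTimeSeconds=20,
            )
            for message in response.get("Messages", []):
                handler(message["Body"])
                sqs.delete_message(
                    QueueUrl=queue_url,
                    ReceiptHandle=message["ReceiptHandle"],
                )

    threads = [
        threading.Thread(target=worker, daemon=True) for _ in range(worker_count)
    ]
    for thread in threads:
        thread.start()
    return threads
```

Because each worker only asks for one message and deletes it before asking again, ordering within a message group does not depend on how the threads are scheduled. Error handling is left out to keep the sketch short: an exception in the handler ends that worker.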

Re-Deliveries

In case a message is not processed and deleted from the queue within the visibility timeout, it becomes visible to other consumers again. While SQS does not actively (re-)deliver messages, failing to process the message in time will lead to other consumers receiving the message.

There are two scenarios: In case there was a legitimate error on the side of the consumer, you may want to process the message again. On the other hand, the consumer may just take a bit longer to process the message and a second consumer receiving the message while the first one is still working on it can lead to trouble down the road.

We’ll go over ways to solve this issue after a quick introduction to dead-letter queues and redrive policies.

Dead-Letter Queues and Redrive Policies

When messages cannot be processed even in multiple repeated attempts, you may want to move them to a different place for monitoring and debugging purposes. This is exactly what dead-letter queues (DLQs) are for. Dead-letter queues must be created explicitly and configured with the redrive policy, which decides when messages should be moved over to them (and thus deleted from the original/source queue).

Usually, the condition for moving messages to the DLQ is exceeding the defined number of maximum receives. This value must be high enough so that temporary issues do not cause messages to get “lost” in the DLQ, but low enough so that invalid/non-processable messages do not block the queue.

Dead-letter queues must be of the same type, so a FIFO queue gets a FIFO DLQ and regular queues get regular DLQs.
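A redrive policy is a small JSON document stored as an attribute on the source queue. The sketch below shows one way to attach it, assuming `sqs` is a boto3 SQS client and that the dead-letter queue already exists. The function name `attach_dead_letter_queue` and the default of 5 receives are placeholders, not recommendations.

```python
import json
from typing import Any, Dict


def attach_dead_letter_queue(
    sqs: Any,
    source_queue_url: str,
    dlq_arn: str,
    max_receive_count: int = 5,
) -> Dict[str, str]:
    """Configure `source_queue_url` to move failing messages to `dlq_arn`.

    `sqs` is assumed to be a boto3 SQS client. Returns the attributes that
    were sent, which is handy for logging.
    """
    redrive_policy = {
        "deadLetterTargetArn": dlq_arn,
        # Moved to the DLQ once a message has been received this many
        # times without being deleted.
        "maxReceiveCount": str(max_receive_count),
    }
    # SQS expects the policy as a JSON string, not as a nested structure.
    attributes = {"RedrivePolicy": json.dumps(redrive_policy)}
    sqs.set_queue_attributes(QueueUrl=source_queue_url, Attributes=attributes)
    return attributes
```

Note that the policy refers to the dead-letter queue by ARN, while the source queue is addressed by URL.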

Now that we’ve set the stage, we can finally think about long-running processing.

Long-Running Jobs

When you don’t exactly know the duration of processing a specific message upfront, you may get the idea to set a high visibility timeout. Naturally, this will allow you to spend more time on processing, but there are a couple of pitfalls with this approach. For one, when something goes wrong, it will take a considerable amount of time until the next consumer receives the message to retry the operation. In FIFO queues, this can slow things down unnecessarily. In other cases, even your best estimate doesn’t account for unforeseen delays and your message becomes visible and is unexpectedly processed by a second consumer. That’s bad.

You could, of course, create a mechanism that will notice that the first consumer is already processing the message, but once you receive the re-delivery, what do you do with it? You can’t just wait, because the message will keep being re-delivered and potentially moved into a DLQ if you exceed the number of receives. You also cannot put the message back into the queue in case you care about ordering, because there’s no feature to tell SQS “please take the message back and re-send it in 10 minutes”, or is there?

If we think back, we’re running against time here. Once received by a consumer, the message is regarded as in-flight by SQS and the visibility timeout starts ticking down. Once we reach 0, the message is visible to other consumers again. So how do we tell SQS not to make the message visible if we need more time?

Luckily, we can perform the ChangeMessageVisibility action using the receipt handle we get when receiving the message. Invoking this action behaves like a “set visibility timeout to <specified> seconds”, meaning that the current timer is replaced by a completely new timeout you provide (so it does not add the number on top of the current timeout).

From the docs:

The new timeout period takes effect from the time you call the ChangeMessageVisibility action. In addition, the new timeout period applies only to the particular receipt of the message. ChangeMessageVisibility doesn't affect the timeout of later receipts of the message or later queues.

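There’s a hard limit of 12 hours, counted from the moment the message was received, after which the timeout can no longer be extended and the message becomes visible. Extending the timeout does not reset this limit.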

In addition to updating the time needed, we can also tell SQS that we want another consumer to receive the message instead by calling ChangeMessageVisibility with a new visibility timeout of 0. The receive that follows will still count towards the allowed number of receives, so be careful.
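Both uses of ChangeMessageVisibility, replacing the timer and handing the message back, can be sketched as two small helpers. They assume `sqs` is a boto3 SQS client; the names `set_visibility` and `release_message` are made up for this example.

```python
from typing import Any

# The API accepts values between 0 and 43,200 seconds (12 hours).
MAX_VISIBILITY_TIMEOUT = 12 * 60 * 60


def set_visibility(sqs: Any, queue_url: str, receipt_handle: str, seconds: int) -> None:
    """Replace the current visibility timer with `seconds`, counted from now.

    `sqs` is assumed to be a boto3 SQS client. The value is not added to
    whatever time is left on the old timer.
    """
    if not 0 <= seconds <= MAX_VISIBILITY_TIMEOUT:
        raise ValueError("visibility timeout must be between 0 and 43200 seconds")
    sqs.change_message_visibility(
        QueueUrl=queue_url,
        ReceiptHandle=receipt_handle,
        VisibilityTimeout=seconds,
    )


def release_message(sqs: Any, queue_url: str, receipt_handle: str) -> None:
    """Make the message visible to other consumers right away."""
    set_visibility(sqs, queue_url, receipt_handle, 0)
```

The range check only catches obviously invalid values locally. SQS itself still rejects an extension that would push the message past 12 hours since it was received.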

With all this in mind, we can improve the reliability of long-running jobs with a few tweaks.

Receive and process one message at a time. Every consumer should receive just one message that will be processed. You can start multiple consumers in the same process concurrently, just make sure to call ReceiveMessage with MaxNumberOfMessages equal to 1 (the default).

Continuously update the visibility timeout until done. Once you start processing, create a routine in the background that calls ChangeMessageVisibility every minute or two, each time setting the timeout to a value greater than that interval so you get a margin of safety in case an update fails. Don’t choose values that are too high, so that re-deliveries stay quick in case of a real error.

This heartbeat approach is recommended by AWS and allows you to process messages for up to 12 hours (the maximum visibility timeout) even when you’re not sure about the actual processing time requirements upfront.
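A minimal sketch of such a heartbeat, using a background thread, might look like this. It assumes `sqs` is a boto3 SQS client and `message` is one entry from the `Messages` list returned by ReceiveMessage. The function name `process_with_heartbeat` and the defaults (extend to 120 seconds every 60 seconds) are illustrative choices, not values prescribed by AWS.

```python
import logging
import threading
from typing import Any, Callable, Dict


def process_with_heartbeat(
    sqs: Any,
    queue_url: str,
    message: Dict[str, str],
    handler: Callable[[str], None],
    interval: float = 60.0,
    timeout: int = 120,
) -> None:
    """Run `handler` while a background thread keeps the message invisible.

    `sqs` is assumed to be a boto3 SQS client. `timeout` should be larger
    than `interval` so that one missed heartbeat does not expose the message.
    """
    receipt_handle = message["ReceiptHandle"]
    done = threading.Event()

    def heartbeat() -> None:
        # Event.wait returns False when the interval elapsed and True once
        # `done` is set, which ends the loop.
        while not done.wait(interval):
            try:
                sqs.change_message_visibility(
                    QueueUrl=queue_url,
                    ReceiptHandle=receipt_handle,
                    VisibilityTimeout=timeout,
                )
            except Exception:
                # Retried on the next tick; the margin covers this miss.
                logging.exception("heartbeat failed, retrying on next tick")

    thread = threading.Thread(target=heartbeat, daemon=True)
    thread.start()
    try:
        handler(message["Body"])
    finally:
        # Stop the heartbeat whether processing succeeded or failed.
        done.set()
        thread.join()
    # Only reached on success; a failed message reappears after `timeout`.
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=receipt_handle)
```

When the handler raises, the heartbeat stops and the message is not deleted, so it becomes visible again within at most `timeout` seconds. The sketch does not track the 12-hour limit; a job running that long would need extra handling.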

With this solution, the only reason why a message is received by another consumer is that the first consumer failed to process it. This is the expected outcome, as you can be sure the original consumer is no longer working on the message and you can safely retry or discard it as you wish.

This approach also keeps DLQs from filling up with messages whose only problem was that the job did not complete in time.

If your dead-letter queue keeps growing or you notice unwanted re-deliveries, make sure that:

- you aren’t fetching multiple messages while only running a heartbeat for the message currently being processed
- the heartbeat works properly and handles failures with a retry
- the message is deleted after processing

This was a lot, but we’ve learned the most important details about how SQS works, and how we can handle long-running tasks with an unknown duration.