At Inngest, we serve hundreds of millions of events per day. Each event can trigger a function run comprising multiple steps. Users can return blobs and other data from these steps, so reads and writes to our data store need to be incredibly fast to keep latency as low as possible. Function runs can also be delayed by months, so this data needs to be persisted for an arbitrary amount of time.
This immense volume of data is stored in our primary Redis-based data store, presenting significant challenges in terms of throughput and latency due to Redis’ single-threaded design. Recently, our main cluster approached critical capacity limits, with peak CPU utilization and dwindling storage space threatening our system’s stability and performance.
Over the past six weeks, we implemented a series of infrastructure improvements that significantly reduced latency and improved long-term scalability. Impressively, we achieved this without a millisecond of downtime. In this post, I’ll give you an overview of the changes we’ve been working on, and focus on a new proxy service in the critical path.
To avoid degrading end-to-end latency and prevent storage capacity issues, we implemented a multi-stage plan focused on enhancing Inngest’s core components. Our first step was to introduce a unified state interface, consolidating all state operations on the critical path into a single place. This way, any improvement propagates across the entire system.
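To make this concrete, here’s a minimal sketch of what such a unified state interface could look like in Go. The method names and types are illustrative assumptions, not our actual interface:

```go
package state

import "context"

// Manager is a simplified, hypothetical version of a unified state
// interface: every service on the critical path goes through this one
// abstraction instead of talking to the data store directly.
type Manager interface {
	// SaveStepOutput persists the output of a single step in a function run.
	SaveStepOutput(ctx context.Context, runID, stepID string, output []byte) error

	// LoadStepOutput reads a previously persisted step output.
	LoadStepOutput(ctx context.Context, runID, stepID string) ([]byte, error)

	// LoadEvents returns the events that triggered the given function run.
	LoadEvents(ctx context.Context, runID string) ([][]byte, error)
}
```

Because every caller depends on this interface rather than on a concrete store, swapping in the coordinator (or a future backend) only requires a new implementation.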
Next, we introduced a coordination service to decouple our services from direct connections to the primary data store. Instead, requests are routed through a lightweight proxy, which supports sharding and caching for better efficiency. Long-term, this architecture allows us to replace the current state store with a completely new backend optimized for Inngest’s access patterns and SLOs.
Let’s delve into the implementation of the state coordinator.
Implementing the State Coordinator
After defining the new state interface, we immediately started work on the new coordination service, which routes all state interface requests to the current state store and, eventually, to future backends. We designed the interfaces with the new service in mind, so the handover was seamless.
Accessing the state store is a critical operation in our system because every function run involves reading and writing events, step outputs, and other metadata. Failure in these operations could result in significant performance degradation across the system.
We designed the state coordinator to offer high throughput at the lowest possible latency. We implemented resilient error handling that retries transient errors introduced by the additional network roundtrip, and added observability on both the client and server side to give us insight into service performance and surface potential issues as early as possible.
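As a rough sketch of the client-side retry behavior (the backoff values and the transient-error classification below are assumptions, not our production policy):

```go
package state

import (
	"context"
	"errors"
	"net"
	"time"
)

// doWithRetry retries transient failures with capped exponential backoff.
func doWithRetry(ctx context.Context, attempts int, fn func(context.Context) error) error {
	backoff := 10 * time.Millisecond
	var err error
	for i := 0; i < attempts; i++ {
		if err = fn(ctx); err == nil || !isTransient(err) {
			return err
		}
		select {
		case <-time.After(backoff):
		case <-ctx.Done():
			return ctx.Err()
		}
		if backoff < 200*time.Millisecond {
			backoff *= 2
		}
	}
	return err
}

// isTransient is a guess at a reasonable policy: retry timeouts and
// deadline errors, surface everything else immediately.
func isTransient(err error) bool {
	var netErr net.Error
	return errors.Is(err, context.DeadlineExceeded) ||
		(errors.As(err, &netErr) && netErr.Timeout())
}
```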
By having all services connect to very few state coordinator instances, Redis connections are pooled in a single place, making it easy to understand bottlenecks and scale the system accordingly.
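As an illustration of what that looks like (using go-redis here; the pool sizes are placeholders, not our production configuration):

```go
package coordinator

import "github.com/redis/go-redis/v9"

// newStateClient builds the single shared Redis client used by a state
// coordinator instance, so connection pooling happens in one place.
func newStateClient(addr string) *redis.Client {
	return redis.NewClient(&redis.Options{
		Addr:         addr,
		PoolSize:     100, // upper bound on concurrent connections to Redis
		MinIdleConns: 10,  // keep a few warm connections ready
	})
}
```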
The state coordinator was built with robust sharding and caching mechanisms, using consistent hashing or highest random weight (HRW) hashing. This approach ensures uniform data distribution and high cache hit ratios by routing requests to the instances most likely to hold the data in question. Sharding applies the same logic on the client side to determine which group of state coordinators, each group serving a single state store cluster, should be invoked. Adding a new shard is as easy as provisioning another cluster and gracefully reconfiguring the existing services.
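To show how HRW hashing routes a request, here is a minimal sketch; the node list and key format are hypothetical:

```go
package coordinator

import "hash/fnv"

// pickNode implements highest random weight (rendezvous) hashing: hash the
// key together with each node and pick the node with the highest score. Any
// client with the same node list routes a given key to the same instance,
// which keeps cache hit ratios high.
func pickNode(key string, nodes []string) string {
	var best string
	var bestScore uint64
	for _, node := range nodes {
		h := fnv.New64a()
		h.Write([]byte(node))
		h.Write([]byte(key))
		if score := h.Sum64(); best == "" || score > bestScore {
			best, bestScore = node, score
		}
	}
	return best
}
```

A nice property of rendezvous hashing is that adding or removing a node only remaps the keys that scored highest on that node, so provisioning a new shard disturbs as little cached data as possible.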
Rolling out
To ensure a smooth rollout, we devised a strategy to gradually enroll Inngest customers in using the state coordinator. We leveraged the state interface to create a rollout layer using feature flags, enabling controlled, percentage-based rollouts across various customer segments: Internal users, free users, paid customers, and finally, enterprise customers.
To mitigate the risk of unexpected errors, we implemented robust fallback logic, ensuring any failed requests are retried using the previous implementation. In the worst case, the customer wouldn’t even notice that our new service failed, and no data would be lost in the process.
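Putting the rollout check and the fallback together, the wrapper around the state interface looked roughly like this (the flag lookup and the account-ID helper are illustrative stand-ins, not our exact feature-flag setup):

```go
package state

import (
	"context"
	"hash/fnv"
)

// rolloutManager wraps the old and new implementations of the state
// interface and decides per request which one to use. Failed requests on
// the new path fall back to the previous implementation.
type rolloutManager struct {
	oldImpl, newImpl Manager
	percent          func(accountID string) int // feature-flag lookup, assumed
}

// enrolled hashes the account ID into one of 100 buckets and compares it
// against the configured rollout percentage.
func (r *rolloutManager) enrolled(accountID string) bool {
	h := fnv.New32a()
	h.Write([]byte(accountID))
	return int(h.Sum32()%100) < r.percent(accountID)
}

func (r *rolloutManager) LoadStepOutput(ctx context.Context, runID, stepID string) ([]byte, error) {
	if r.enrolled(accountIDFromRun(runID)) {
		if out, err := r.newImpl.LoadStepOutput(ctx, runID, stepID); err == nil {
			return out, nil
		}
		// Fall through: retry the request on the previous implementation.
	}
	return r.oldImpl.LoadStepOutput(ctx, runID, stepID)
}

// accountIDFromRun is a stand-in for however the owning account is resolved
// from a run; purely illustrative.
func accountIDFromRun(runID string) string { return runID }
```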
These safeguards gave us the confidence to roll out the new code to production very quickly, as we could flip a switch and fall back within a second. After thoroughly testing the new service internally, we enrolled all non-enterprise customers within a week. Throughout the rollout, we tracked enrollment progress alongside operation success and latency metrics on a custom dashboard, which made it easy to catch problems such as slow or failing requests. By every measure, the rollout got off to a smooth start.
After continuing to evaluate our system performance and allocating more compute resources to proactively rule out bottlenecks, we enrolled all enterprise customers without a hitch.
The state coordinator has been running in production for over a month now. On average, we are serving 300k rpm, with a p99 of less than 20ms, including network roundtrip latency. This is a fraction of the traffic going to our primary data store, yet it accounts for the largest response sizes due to user-generated state outputs.
Post-Rollout
Through this project, I gained valuable insights into building and deploying critical infrastructure, designing for gradual customer enrollment, creating effective dashboards for monitoring system health, and communicating project updates to the team.
After conducting the rollout and running the state coordinator in production, there are a couple of improvements I’d look into the next time we run a similar project. First, percentage-based rollouts are an amazing tool for slowly ramping up usage and spotting bottlenecks early. Simply flipping a switch and routing all production traffic to a new system without safeguards would be reckless at best.
Percentage-based rollouts get more predictable the more uniformly distributed your enrollment key is. This is similar to sharding, where the wrong shard key can cause a skewed data distribution. If you roll out based on account IDs, you’ll likely encounter a long tail of accounts with little to no usage, with a few accounts making up most of the activity in your system. All account IDs look the same to the rollout system, so increasing your percentage by a microscopic amount can suddenly enroll one of those heavy accounts and produce a huge traffic spike.
Knowing your product usage metrics is crucial for understanding how a given enrollment key will affect the system. One solution could be to create multiple segments based on historical usage and adjust which entities sit in each bucket over time, leading to more predictable ramp-ups.
This has been my first major project to work on over at Inngest, and it’s been incredibly rewarding. Thanks to Jack, Darwin, Tony, and all the other folks on the team for helping out and making this possible!