This post is part of the Behind the Scenes series on building the first iteration of Anzu. While we’re working hard on the upcoming iteration, I felt it would be useful to look back on the achievements and downsides of the first version of our product. If you didn’t get to read my previous post discussing the problem statement we intended to solve, check it out.
Existing IaC tools are distributed as command line clients, which means that you, the developer, need to take care of running deployments, both locally and in CI. Configuring all of this properly takes time that we wanted our users to spend on building their product instead, so we set out to build a managed platform to help teams build faster.
For the first iteration of Anzu, this meant building the system around providers that would supply cloud resources for our users to deploy. Similar to Terraform and Pulumi, these providers could be built and maintained by a community, but since we planned on running everything in a managed environment to guarantee fast deployments, we had to ensure that the experience would be snappy and secure at the same time.
One issue with managed workloads is that you have to fight off bad actors trying to wreak havoc, so it’s always better if you don’t have to come anywhere near running untrusted code. For us, that wasn’t an option, so we looked into ways of running user code on our infrastructure.
While we started with ideas of running workloads isolated within Firecracker micro-VMs or gVisor containers, we decided that the priority was to move fast, and spending the time necessary to create a hardened environment upfront didn’t seem like a value-add, so we continued our exploration.
We knew that deployments would be one-off jobs, not continuously-running services, so running containers in bursts rather than at all times would be completely fine as well, narrowing our options down to on-demand instances and containers.
Almost every public cloud provider offers single-use containers, with additional layers of management. We closely compared Google Cloud Run, AWS ECS (Fargate), and AWS Batch (using AWS ECS with Fargate containers). Google Cloud Run uses gVisor under the hood, while AWS Fargate uses Firecracker.
The main benefits of AWS Batch over pure container hosting services like ECS and Cloud Run are that it supports retrying failed jobs automatically and enqueues jobs to support controlled parallelism.
Implementing the first iteration on AWS Batch was rather easy: every time a deployment was requested, we simply started a container running our deployment worker service. Once started, it would download all required providers from a shared bucket and proceed to run the deployment. Every container would be completely isolated from other workloads to prevent noisy neighbour effects.
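The worker flow described above (fetch providers, then run the deployment once) can be sketched roughly as follows. This is a minimal illustration, not Anzu's actual worker: the function names, the bucket layout, and `deploy_command` are hypothetical, and `s3_client` is assumed to be a boto3 S3 client such as `boto3.client("s3")`.

```python
import subprocess
from pathlib import Path


def fetch_providers(s3_client, bucket, provider_keys, target_dir):
    """Download each provider artifact from the shared bucket into target_dir."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    paths = []
    for key in provider_keys:
        destination = target / Path(key).name
        # download_file(Bucket, Key, Filename) is the boto3 S3 client call.
        s3_client.download_file(bucket, key, str(destination))
        paths.append(destination)
    return paths


def run_deployment(s3_client, bucket, provider_keys, work_dir, deploy_command):
    """Fetch all providers, run the deployment once, and return its exit code."""
    providers = fetch_providers(s3_client, bucket, provider_keys, work_dir)
    # deploy_command is a hypothetical placeholder for whatever executes the
    # deployment; the downloaded provider paths are appended as arguments.
    command = [*deploy_command, *(str(path) for path in providers)]
    result = subprocess.run(command, check=False)
    return result.returncode
```

Because the container exits when `run_deployment` returns, the exit code is what a job scheduler would see when deciding whether to retry.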
This approach worked fairly well, but one issue surfaced almost immediately: considerable startup times. From sending the API request to AWS to the process actually starting, it often took between one and two minutes, a delay we couldn’t influence. This crushed the idea of instantaneous deployments, so we debated whether the queuing and automatic retries that Batch gave us were worth the delay, and agreed that we should evaluate other options.
Starting a container directly on AWS ECS completely eliminated the delay that queuing added in AWS Batch, so we only incurred a couple of seconds to pull the deployment worker image. We added this deployment strategy to the existing Batch-based deployment so that we could decide which launch type to use at run time.
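Choosing the launch type at run time could look roughly like the sketch below. It is an illustration under assumptions, not the original implementation: the `config` keys, the `DEPLOYMENT_ID` variable, and the function names are hypothetical, and the clients are assumed to be boto3 clients (`boto3.client("ecs")` and `boto3.client("batch")`).

```python
def env_list(variables):
    """Convert a dict into the name/value list that ECS and Batch expect."""
    return [{"name": name, "value": value} for name, value in variables.items()]


def start_on_ecs(ecs_client, deployment_id, config):
    """Start the worker directly as a Fargate task: no queue, no automatic retry."""
    response = ecs_client.run_task(
        cluster=config["cluster"],
        taskDefinition=config["task_definition"],
        launchType="FARGATE",
        count=1,
        networkConfiguration={
            "awsvpcConfiguration": {
                "subnets": config["subnets"],
                "assignPublicIp": "DISABLED",
            }
        },
        overrides={
            "containerOverrides": [
                {
                    "name": config["container_name"],
                    "environment": env_list({"DEPLOYMENT_ID": deployment_id}),
                }
            ]
        },
    )
    if not response.get("tasks"):
        # run_task reports placement problems in "failures" instead of raising.
        raise RuntimeError(f"ECS could not start the task: {response.get('failures')}")
    return response["tasks"][0]["taskArn"]


def start_on_batch(batch_client, deployment_id, config):
    """Enqueue the worker as a Batch job, which adds queuing and retries."""
    response = batch_client.submit_job(
        jobName=f"deployment-{deployment_id}",
        jobQueue=config["job_queue"],
        jobDefinition=config["job_definition"],
        retryStrategy={"attempts": config.get("attempts", 3)},
        containerOverrides={
            "environment": env_list({"DEPLOYMENT_ID": deployment_id}),
        },
    )
    return response["jobId"]


def start_deployment(strategy, clients, deployment_id, config):
    """Dispatch to the launch strategy selected at run time."""
    if strategy == "ecs":
        return start_on_ecs(clients["ecs"], deployment_id, config)
    if strategy == "batch":
        return start_on_batch(clients["batch"], deployment_id, config)
    raise ValueError(f"unknown launch strategy: {strategy!r}")
```

Keeping both paths behind one function means the trade-off (fast start versus queuing and retries) can be decided per deployment rather than baked into the platform.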
Looking back, there were more options on the table, including Fly Machines: Firecracker-based micro-VMs as a service that would have given us isolation and immediate startup times, at the cost of another vendor and potentially higher prices (though nothing comes close to AWS in terms of burning money on cloud services).
We could have spent more time on self-hosting untrusted workloads too. gVisor on our own infrastructure could have been a viable combination that would have dramatically lowered per-deployment costs while increasing our baseline cloud spend. On top of that, spending time monitoring clusters and putting out fires wasn’t something we wanted to worry about, so choosing a managed service was more important for us.
While our containers were running in isolation, even a single privilege escalation could have compromised our production account running all workloads. Not only that, simply starting too many containers could have exhausted resource quotas, leaving production deployments unable to start up after a release.
For this reason, we made sure to offload all user-related resources and containers to a separate account, creating another safety barrier, just in case.
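One common way to launch workloads in a separate AWS account is to assume a narrowly scoped role in that account and use the temporary credentials for the container-launching client. The sketch below assumes that approach; the post does not say how the accounts were wired together, so the role ARN, session name, and function name are hypothetical, and `sts_client` is assumed to be `boto3.client("sts")`.

```python
def workload_account_credentials(sts_client, role_arn, session_name):
    """Return short-lived credentials for the account that runs user workloads."""
    response = sts_client.assume_role(
        RoleArn=role_arn,
        RoleSessionName=session_name,
        # 900 seconds is the shortest session STS allows; enough to start a task.
        DurationSeconds=900,
    )
    credentials = response["Credentials"]
    # Keys are named so the dict can be passed as keyword arguments to
    # boto3.client(...), e.g. boto3.client("ecs", **credentials).
    return {
        "aws_access_key_id": credentials["AccessKeyId"],
        "aws_secret_access_key": credentials["SecretAccessKey"],
        "aws_session_token": credentials["SessionToken"],
    }
```

With this split, quotas and permissions in the workload account are independent of production, so a runaway or compromised user container cannot directly touch production resources.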
In today’s world of public clouds and smaller service providers, it’s really easy to run untrusted code on demand. You don’t have to get your hands dirty maintaining endless YAML files and understanding Raft consensus protocols, which is nice. In the end, which provider you choose mostly boils down to your requirements. Do you need to provision long-running or one-off workloads? Do you need to expose services to the internet or do you just need worker-style services? Do you need auto-scaling or persistence? Answer these questions, factor in a rough price you’re willing to pay, and build the rest around that.
But while it’s easy to make user code run in the cloud, you might not need to offer such a powerful tool at all. With the next iteration of Anzu, we’re focusing on helping teams move faster with powerful building blocks they can integrate with their applications. While we might offer managed deployments in the future, we want to let teams pick their preferred hosting provider while we focus on generating value in other parts of the stack. This lets us move much more quickly right now.