Sep 11, 2022

Anzu Behind the Scenes: Why Resource Management Wasn’t It

This post is the first part of the Behind the Scenes series on building the first iteration of Anzu. While we keep on building and improving the next iteration of Anzu, I felt it would be useful to write down what went well with the first iteration and what didn’t. This series covers both the product and the technical work of the past twelve months.

In this first part, I want to set the stage and talk through the events that inspired us to build a resource management platform.

Building (lots of) software products

Over the past three years, my co-founder Tim and I worked on various products, both B2B and B2C software deployed on the web. For these, we went ahead and wrote the fundamentals ourselves, mostly using Node.js, TypeScript, Postgres, and GraphQL. After building the second product, we noticed that even across B2B and B2C software, most requirements were similar.

All applications used similar build and deployment flows, queues for decoupling services, notifications, mailing, analytics, authentication, authorization, logging, and more.

After building these once, copying them over each time we started a new product felt like a regression, so I decided to pull them into a shared library, bricks. This was really nice as a single source of truth, but it came with its own set of drawbacks, most notably that shipping an update for one product required changes to another.

Of course, this was a sign of problematic design decisions and a failure of abstracting the right parts, but paying that price was still much better than copying over or even rewriting the same foundation again and again. Even nowadays I shiver when thinking about building auth, team management, permissions, and all the other parts that go into a B2B SaaS product.

Building Hygraph

At the same time, I was working on what shaped the current infrastructure at Hygraph. This came at a time when our previous infrastructure was grinding to a halt due to scaling problems. Building new features was almost impossible, and we had to work hard to keep the lights on, as most of our infrastructure depended on a third-party multi-tenant GraphQL endpoint-to-database-layer (can you guess which technology I’m talking about?).

We tried to salvage what we could by adding a new caching layer built on Cloudflare Workers (which helped a lot), but with the hardship of the past months in mind, we decided to build a new iteration of our architecture primarily using Go.

I remember spending most of the fall and winter of 2019 reading every page of the Postgres docs and learning Go, after which a team of two backend engineers and I created our core service for handling GraphQL requests to a dynamic API surface, the Content API. You see, every customer can create their own schema in Hygraph, so we had to build a translation layer from GraphQL queries and mutations to the underlying datastore.

We started the project of rebuilding the architecture in late summer of 2019 and launched the new product in spring of 2020. The new system included features that weren’t possible before, like union types (which few, if any, competitors support to date) and proper localization and publishing.

Going into the details of this architecture change would well exceed the scope of this post, so I’ll save it up for another time, but I think it’s critical to understand what I learned and the pace of how I’ve been building products since then.

At Hygraph, we started managing our infrastructure entirely as code using Pulumi, which was an important upgrade from the script-based and very error-prone approach before. From this move, I saw the importance of versioning and formalizing infrastructure, so no one has to click through dashboards to end up with slight differences that break production.
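To make the "slight differences" problem concrete, here is a minimal sketch of drift detection: comparing the configuration declared in code with what is actually running. This is not Pulumi's API or how Hygraph did it. The `Config` type, the `detectDrift` function, and the example keys are all hypothetical names chosen for illustration.

```typescript
// Hypothetical, simplified model: a resource's settings as a flat key/value map.
type ConfigValue = string | number | boolean;
type Config = Record<string, ConfigValue>;

interface Drift {
  key: string;
  declared: ConfigValue | undefined; // value written down in code
  actual: ConfigValue | undefined;   // value observed in the running system
}

// Returns every key whose declared value differs from the observed one,
// including keys that exist on only one side.
function detectDrift(declared: Config, actual: Config): Drift[] {
  const keys = Array.from(
    new Set(Object.keys(declared).concat(Object.keys(actual)))
  ).sort();

  const drift: Drift[] = [];
  for (const key of keys) {
    if (declared[key] !== actual[key]) {
      drift.push({ key, declared: declared[key], actual: actual[key] });
    }
  }
  return drift;
}

// Example: someone disabled versioning in a dashboard after the code was applied.
const declaredBucket: Config = { region: "eu-central-1", versioning: true };
const liveBucket: Config = { region: "eu-central-1", versioning: false };
console.log(detectDrift(declaredBucket, liveBucket));
```

Once infrastructure is declared in versioned code, a check like this becomes possible at all: there is a written-down expectation to compare the live system against.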

Identifying unmet needs

After all of this and building more products, I wondered about the pain points I had until then. Pulumi and Terraform did a great job of allowing people to manage their infrastructure, but they didn’t make it any easier to onboard new team members or to understand what we were running.

This pushed me in the direction of visualizing infrastructure, making it easier for teams to understand which cloud providers and services they were using, at a glance. At the same time, this would eliminate the need for architecture diagrams and would enable different layers of flows, including inspecting and debugging live traffic, messages, and more.

On the other hand, a product purely focused on visualization wasn’t entirely new: many monitoring vendors had already tried a similar approach with service maps and graphs.

The important difference, in my mind, was that the source of truth should be your resource graph. When using a monitoring provider, they have to try to understand what you’re running based on very intrusive approaches like monkey-patching every network call in your application. But if you managed the resources, you would have the important data right at your fingertips.
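As a rough sketch of that idea, the snippet below derives an at-a-glance overview (which providers and services are in use, and how resources connect) directly from a list of managed resources, with no traffic inspection involved. The `Resource` and `ServiceMap` shapes and the `buildServiceMap` function are assumptions made up for this example, not Anzu's actual data model.

```typescript
// Hypothetical shape of a managed resource; not Anzu's real schema.
interface Resource {
  id: string;
  provider: string;    // e.g. "aws"
  service: string;     // e.g. "s3"
  dependsOn: string[]; // ids of resources this one connects to
}

interface ServiceMap {
  providers: Record<string, string[]>; // provider -> sorted services in use
  edges: Array<[string, string]>;      // [from, to] pairs of resource ids
}

// Builds the overview purely from the resource graph the platform already holds.
function buildServiceMap(resources: Resource[]): ServiceMap {
  const knownIds = new Set(resources.map((r) => r.id));
  const servicesByProvider = new Map<string, Set<string>>();
  const edges: Array<[string, string]> = [];

  for (const resource of resources) {
    let services = servicesByProvider.get(resource.provider);
    if (!services) {
      services = new Set<string>();
      servicesByProvider.set(resource.provider, services);
    }
    services.add(resource.service);

    for (const dep of resource.dependsOn) {
      if (!knownIds.has(dep)) {
        throw new Error(`${resource.id} depends on unknown resource ${dep}`);
      }
      edges.push([resource.id, dep]);
    }
  }

  const providers: Record<string, string[]> = {};
  servicesByProvider.forEach((services, provider) => {
    providers[provider] = Array.from(services).sort();
  });
  return { providers, edges };
}

// Example graph: an edge cache in front of a function that reads from a bucket.
const exampleResources: Resource[] = [
  { id: "assets", provider: "aws", service: "s3", dependsOn: [] },
  { id: "api", provider: "aws", service: "lambda", dependsOn: ["assets"] },
  { id: "cache", provider: "cloudflare", service: "workers", dependsOn: ["api"] },
];
console.log(buildServiceMap(exampleResources));
```

The point of the design is that nothing here has to be inferred: the connections are known because the platform created them.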

At the same time, I was wondering about complexity. After years of learning different service providers, database technologies, deployment methods, virtualization and container technologies, build systems, and approximately 100 AWS services with obscure names, I believed that complexity had become a danger to velocity.

I was split about whether the right path was to offer a PaaS-like experience for workloads and services on top, or to build on the fact that teams liked to choose their service providers. Ultimately, I went with the latter.

Building the first iteration

This decision shaped the first iteration of Anzu to be a modular system that managed resources using providers written by the cloud providers or the community. While depending on network effects to build a healthy ecosystem, I believed that providers offered the perfect abstraction to plug different services together, a requirement of today’s world.
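To illustrate what a provider abstraction like this might look like, here is a minimal sketch: a common interface that each provider implements, and a registry that routes a resource request to whichever provider handles that resource type. All names (`Provider`, `ProviderRegistry`, `InMemoryBucketProvider`, the `bucket` type) are hypothetical and do not reflect Anzu's real interfaces. The calls are synchronous to keep the example short; a real provider would talk to a cloud API asynchronously.

```typescript
// Hypothetical types for illustration only.
interface ResourceSpec {
  type: string; // e.g. "bucket"
  name: string;
  config: Record<string, unknown>;
}

interface ManagedResource {
  id: string;
  spec: ResourceSpec;
}

// The contract every provider fulfils, whoever wrote it.
interface Provider {
  readonly name: string;
  readonly resourceTypes: string[];
  create(spec: ResourceSpec): ManagedResource;
  remove(id: string): void;
}

// Stand-in provider that only keeps state in memory.
class InMemoryBucketProvider implements Provider {
  readonly name = "example-storage";
  readonly resourceTypes = ["bucket"];
  private resources = new Map<string, ManagedResource>();

  create(spec: ResourceSpec): ManagedResource {
    const id = `${this.name}/${spec.type}/${spec.name}`;
    if (this.resources.has(id)) throw new Error(`${id} already exists`);
    const resource: ManagedResource = { id, spec };
    this.resources.set(id, resource);
    return resource;
  }

  remove(id: string): void {
    if (!this.resources.delete(id)) throw new Error(`${id} not found`);
  }
}

// Routes each request to the provider registered for its resource type.
class ProviderRegistry {
  private byType = new Map<string, Provider>();

  register(provider: Provider): void {
    for (const type of provider.resourceTypes) {
      if (this.byType.has(type)) {
        throw new Error(`type "${type}" is already handled by another provider`);
      }
      this.byType.set(type, provider);
    }
  }

  create(spec: ResourceSpec): ManagedResource {
    const provider = this.byType.get(spec.type);
    if (!provider) throw new Error(`no provider registered for type "${spec.type}"`);
    return provider.create(spec);
  }
}

const registry = new ProviderRegistry();
registry.register(new InMemoryBucketProvider());
console.log(registry.create({ type: "bucket", name: "assets", config: {} }).id);
```

Because the core only depends on the `Provider` contract, a new service can be plugged in by registering another implementation, which is where the reliance on an ecosystem of provider authors comes from.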

I also decided not to build on top of existing IaC providers like Terraform and Pulumi, but to manage resources as part of Anzu and offer other ways of integrating later down the road. This decision was dangerous in multiple ways: I had to make sure that the resource management experience was on par with products that had existed for years, in terms of both usability and reliability, and I had to ensure that the first users could use providers offered by Anzu to kick-start the ecosystem.

The architecture was simple to understand and yet powerful enough to support simple deployments and complex ones like Hygraph used. The interface, though, was a challenge. What initially started as a simple UI became complex to use for bigger components with lots of connections.

Deploying a bucket was simple, but deploying multiple services, each of which depended on multiple AWS resources, quickly became untenable. None of this was the biggest issue, though.

Talking to customers

By now, you’ve probably got the point that I’m a person who likes to build things: I really enjoy pushing progress forward on products, both on the engineering side and on all other ends.

At the same time, I’ve too often neglected to talk to users and potential customers, being almost comically focused on velocity and shipping fast while ignoring the most important part: no matter how amazing your product is [1], building it is the easiest and least important part. What matters much more is that you talk to people and find your first users. When you have someone using your product, you can focus on gathering feedback and improving, but that implies existing users.

Being an engineer, I often believed that having the problem myself was enough (or even just the hint of one: finding real unsolved problems is harder than you think, so sometimes you just make them up), and I get why: I spent my entire career building, not talking to people.

Don’t get me wrong, I really enjoy building, but on some days, I hate that tendency.

I don’t think that the only products that can sustainably exist are those that solve an unmet need. Improving existing products and creating value elsewhere is feasible too, and even radical innovation can work at times. But all of those depend on talking to users and building a distribution network.

If there’s one thing you take home (figuratively speaking), let it be that you absolutely must talk to people, or else you’ll fool yourself into endless build cycles. On the other hand, if you can get people to commit to paying you for your product, you’re onto something, and then you can focus on shipping, collecting feedback, and improving as fast as possible.

The consequence of not talking to people

Once we started talking to teams, we quickly found out that resource management was much less of an issue than we expected. Teams had adopted Terraform and the cost of switching didn’t justify moving to Anzu anytime soon, even if we could present a beautiful visualization, code generation, and other features.

This hit hard, but it was the best thing that could have happened to Anzu (and us as founders). Rather than blissfully believing we were building the thing, we were forcefully brought back to reality and knew that something had to change.

Moving on

We accepted that resource management was a done deal and that we would need to focus on a different area where we could help teams instead. We went back to the drawing board and went over areas we thought might still contain unsolved problems.

After some time thinking and researching, it turned out that tools had gotten better over the last couple of years. Deploying, monitoring, and operating applications may have gotten more complex, sure, but those areas were already covered by many products and providers.

There was, however, one dimension that most tools didn’t take into account as much as we would have hoped: Velocity. Building products in days to weeks instead of months to years was something we built experience in and believed we could help other teams with.

But as I mentioned above, a vision is rarely sufficient: We are finally talking to teams, building only the first iteration of our building blocks, and (truly) focusing on getting validation for the first time. This journey is far from over, and I’m still forcing myself to stop building and start reaching out to people instead, because four weeks of validation are much more valuable than four months of building.

[1] Unless you’ve previously built a huge product, have a huge Twitter following, or have an existing network of people you can distribute your product to.

Thanks for reading the first part of the Behind the Scenes series on the first iteration of Anzu! In the next parts, I will focus on more of the product and engineering work that created a flexible and reliable resource management platform.