Usually, we go through our day focusing on work, side projects, or whatever else we spend our time on. We rarely stop to think about why we pick certain tools, services, and architectural approaches over others.
We generally trust that the experts at our favorite cloud provider know what they’re preaching and go with their recommended strategy, often ending up with a universe of interconnected services.
So let’s step on the brakes for a moment and take a look at the way we build, deploy, scale, and maintain software and services.
To get a grip on the degree of complexity we’re dealing with, try answering the following for yourself: “How long does it take for you to add a new service to your system?”, “Can you draw up a rough representation of your current infrastructure?”, and “What do you do when users start complaining about failing requests?”.
Choosing the right building blocks
With the rise of the Software-as-a-Service model of distributing software, we have gained access to seemingly limitless options for building our services on top of other systems. Like building blocks, we pick and assemble our stack, piece by piece. We have become used to linking services together by hand, stitching up inconsistencies, and shivering whenever an update is announced, fearing that a change might break our systems.
You might think that most issues come from choosing the wrong interface, or from depending too much on a given service when you could have implemented a vendor-agnostic solution instead. But as always, it’s a matter of balancing trade-offs. We go for managed solutions out of the need to make progress and focus on our product instead of building every tool in our toolbox ourselves (imagine a construction crew forging its own tools before breaking ground).
In the end, a big part of software engineering is making decisions to meet the business requirements while managing technical debt and the maintainability of our software.
Focusing on velocity
Most people would probably advise building products by rapid iteration based on user feedback. This method is practiced industry-wide, with books like The Lean Startup popularizing it.
Knowing that you should iterate often is only one part of the solution, though. Your organization must be set up for this method of product development, right down to your tech stack. This means reducing friction in every part of the workflow: If it takes your team days to publish a new version of your software, you will tend toward a few big releases rather than a stream of small changes that compound over time.
Your customers will notice when you solve their problems and help them use your software more productively. When they report an issue you can solve quickly, your team builds much stronger relationships and can compete with much larger companies that aren’t as flexible.
One invaluable advantage of startups and small companies is that they can outpace incumbents thanks to their lack of bureaucracy. This doesn’t mean that you shouldn’t focus on quality, but if you hold out for the perfect product, you might just end up solving the wrong problem.
Manual labour in the age of automation
We often preach that mundane tasks should be automated, yet we end up writing the same boilerplate code we always do: permissions, user management, billing, database connectivity, service shutdown handling, feature flags, user analytics; the list goes on.
If we want to move fast and build our product, building and maintaining all of these features ourselves can’t be the highest priority. After all, for each of the areas named above, there is probably a company doing nothing but solving that exact problem. Of course, we often tell ourselves that we need more control over our system, and in some cases that might be true.
In the end, adding a new service shouldn’t require hours of integration work: Our codebase should be driven by our infrastructure itself. Systems built around code generation, such as gRPC with Protocol Buffers, Prisma, and even Rails ActiveRecord, have shown that the tedious integration work can be automated away.
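To make that concrete, here is a rough sketch of what a generated database client can look like, assuming a Prisma schema (not shown) with a hypothetical User model that has a createdAt field; the query API below is generated from that schema rather than written by hand.

```typescript
// Sketch: using a Prisma-generated client instead of hand-written SQL and
// mapping code. Assumes a schema with a hypothetical `User` model that has
// a `createdAt` field; `prisma generate` produces the typed client below.
import { PrismaClient } from "@prisma/client";

const prisma = new PrismaClient();

// Fetch users created in the last seven days, fully typed end to end.
async function recentSignups() {
  return prisma.user.findMany({
    where: {
      createdAt: { gte: new Date(Date.now() - 7 * 24 * 60 * 60 * 1000) },
    },
    orderBy: { createdAt: "desc" },
  });
}
```

Whenever the schema changes, regenerating the client surfaces breaking changes at compile time instead of at runtime, which is exactly the kind of integration work we shouldn’t have to do by hand.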
The lack of accessibility in Infrastructure-as-Code
Infrastructure-as-Code has brought a lot of improvements to the engineering landscape: Complex deployment configurations can now be brought up in minutes, completely reproducible and automated. Adding a new service is a matter of writing a couple of lines of code interfacing with your cloud provider of choice. IaC tooling has made extensibility a first-class concern, and pretty much every popular service provider has support for Terraform and Pulumi.
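As a minimal sketch of what those couple of lines can look like, here is a Pulumi program in TypeScript that declares a single storage bucket, assuming the AWS provider and an already configured project; the resource names are purely illustrative.

```typescript
import * as aws from "@pulumi/aws";

// A single storage bucket declared as code; running `pulumi up` reconciles
// the real infrastructure with this description and records the state.
const assets = new aws.s3.Bucket("app-assets", {
  acl: "private",
  tags: { environment: "staging" },
});

// Exported so other stacks or scripts can reference the deployed bucket.
export const assetsBucketName = assets.id;
```

Before anything is applied, `pulumi up` shows a preview of what will change, which keeps the deployment reproducible and reviewable.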
While code makes it easy to declare your infrastructure, the textual format can be hard to comprehend, especially in larger deployments.
While I believe Infrastructure-as-Code is a very important part of the infrastructure management stack, we can build a lot more tooling to enhance the experience. In the end, it probably won’t be just code or just no-code, but a combination of both that draws on the strengths of each medium.
Observability in chaos
When something doesn’t go as planned and your users start reporting errors, you need to assess the situation and find the root cause of the issue. This process usually involves a combination of sifting through logs and request traces and checking reported errors.
For logs, you’ll often end up in one of two extremes: You either have too few or too many logs. When there are too few, you’ll only get some basic information but potentially miss the place where the issue occurred. When there are too many, you’ll drown in an ocean of logs trying to find a relevant piece of information.
Error reports combined with traces are often more helpful, as they point you to where an issue is happening, provided you made sure errors are properly reported in the first place.
Even then, reproducing an issue might be tricky because you’re lacking sufficient context. Often, you end up with a lot of data that doesn’t actually help you when solving issues in production.
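One way to carry that context along is to attach it to the trace itself. The sketch below uses the OpenTelemetry API in TypeScript; the handleCheckout handler and its attributes are made up for illustration, and the SDK and exporter setup is omitted.

```typescript
import { trace, SpanStatusCode } from "@opentelemetry/api";

const tracer = trace.getTracer("checkout-service");

// Hypothetical request handler: the span carries the context (user, order)
// that a bare log line or error report would be missing.
async function handleCheckout(userId: string, orderId: string) {
  return tracer.startActiveSpan("handleCheckout", async (span) => {
    span.setAttribute("user.id", userId);
    span.setAttribute("order.id", orderId);
    try {
      // ... actual business logic would go here ...
      return "ok";
    } catch (err) {
      // Attach the error to the trace so the report links back to the request.
      span.recordException(err as Error);
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw err;
    } finally {
      span.end();
    }
  });
}
```

With the error recorded on the span, a report is no longer an isolated stack trace but part of the request that produced it, which makes reproducing the issue a lot less of a guessing game.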
Bringing back the fun
I think that in the search for scale and clever solutions, we have abandoned much of the simplicity we now wish we still had. In most cases, we probably don’t need infinitely distributed systems; we need a system we can understand and leverage to solve real problems out there.
There’s an elegance in systems that fulfill their purpose and can adapt to evolving requirements. When there’s a problem you want to solve and the path to it is straightforward, it’s simply more fun than fighting a system that works against you.
If I already know that I’ll have to work through an annoying checklist of tasks just to get the basic setup working, I’ll probably do something else instead. Reducing friction is an investment in future work: By removing unwanted distractions, it becomes much easier to actually commit to it.