Fundamental Design Decisions for Scalable Systems
When building customer-facing systems, especially in the realm of SaaS businesses, it's important to get a couple of
design aspects of your system right early on, if you want to plan for scale and grow without unexpected side effects.
Scaling out software infrastructure for large customers, especially enterprise clients, requires careful planning to guarantee
you can meet your contracts, keep your customers happy, and keep your engineering team focused on building the product
instead of putting out fires left and right.
If you get a few crucial steps right from the beginning, though, you will be able to plan ahead
how you grow, rather than getting forced to scale up blindly and losing control, letting SLAs slip and upsetting customers.
🌏 Regional deployments, multi-tenancy and customer isolation
Offering a service that is available globally can be tempting. But unless you're building the next social network,
designing your systems to synchronize across data centers around the globe only exposes you to
risks that will not be worth taking.
There's a reason why the public cloud providers offer their services on a per-region basis: besides data governance,
it is resilience when systems fail. Provisioning your infrastructure regionally will deliver performance results you can
make guarantees on.
In some cases, customers will require that their data be stored in a specific location, especially in regulated industries
where data security is a priority.
Next up is tenancy. When building Software-as-a-Service products, we offer the same product to multiple customers.
We could put all customers on one shared deployment (multi-tenancy) of our infrastructure, but that way we're giving up control: What if a
customer manages to impact the experience of other users in the same deployment? Do we want free or self-service users to be able
to affect the experience of paid or even enterprise customers?
The answer to those questions is, of course, no. If we want to sell a certain experience, we need to ensure that we can deliver it.
Designing our systems to be deployed for fine-grained subsets of our customers, whether per customer or per subscription type,
allows us to isolate users into manageable groups, which can be scaled independently of one another.
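As a rough illustration of how region and subscription type could decide where a customer lives, here is a minimal sketch. The `Tenant` fields, the plan names, and the deployment naming scheme are all assumptions made for this example, not a prescribed layout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tenant:
    name: str
    region: str  # hypothetical region label, e.g. "eu-central"
    plan: str    # assumed plan names: "free", "paid", "enterprise"


def deployment_for(tenant: Tenant) -> str:
    """Return the name of the isolated deployment that serves this tenant."""
    if tenant.plan == "enterprise":
        # Enterprise customers get a dedicated deployment in their region.
        return f"{tenant.region}-dedicated-{tenant.name}"
    # Everyone else shares a deployment with tenants of the same plan and
    # region, so free users cannot affect paid users.
    return f"{tenant.region}-shared-{tenant.plan}"
```

With a mapping like this, each deployment can be sized and scaled for the group it serves, and a noisy free-tier user stays contained in the free-tier deployment of its region.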
🪐 Boring Technologies, Exciting Products
As we've already talked about the importance of minimizing risks connected to our infrastructure, whether by deploying into
isolated geographic regions or separating customers to meet commitments, we can continue with a similar topic: which technologies
to build on. The technologies you want to base your business on are of course highly specific to what you want to build, but
there are some general recommendations that have proven their worth over time.
If you're not building the next cutting-edge technology yourself, make sure to choose a solid foundation. While it might be tempting
to pick database X because it's new, fast, and shiny, chances are that database Y has been around for decades, is used by millions
of people, and will just work. In the majority of cases you will use capabilities that aren't specific to a single product, so
in theory, you're free to switch the underlying implementation without having to rebuild your internals. Sometimes, though,
it does make sense to stick with one solid piece of technology and use it all the way.
While we're at it, transactional guarantees seem to be one of the talking points for both classic database solutions and
up-and-coming products that promise global scale without sacrificing ACID properties. If you really want to go with a geo-distributed
storage option, do your due diligence and verify that it can keep its promises, so you avoid surprises later on.
Most decisions in terms of technology should be the result of a risk assessment process so you're clear on questions like
long-term maintenance and support of critical components, potential alternatives that can be switched to, and vendor lock-in
that might be fine for some things, but not for all areas.
🏛 Enforce strict pagination, never expose unlimited data sets
Now that we've prepared a solid foundation to build on, we need to make sure our service workloads stay stable as well.
This means knowing how your data is accessed, and more importantly, setting limits.
Customer-facing API products that return arbitrary sets of data need to be carefully prepared to perform as expected. For
read access, this means implementing pagination: if we're unsure how many results are available in the system, for example
when requesting user-provided content, it is crucial to set clear boundaries as to how many results will be returned at once,
and how a client will perform requests to fetch all available content in batches.
Pagination needs to be stable as well: entries that are added along the way should be taken into account, while duplicate
results must be prevented. This essentially describes cursor-based pagination on sorted, incremental identifiers. Of course, you can also expose
different pagination strategies, as long as they work for your use case and set clear complexity boundaries.
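To make cursor-based pagination more concrete, here is a minimal sketch. It assumes every row carries an incremental integer `id`, and it uses an in-memory list as a stand-in for a real query such as `WHERE id > :after_id ORDER BY id LIMIT :limit`. The function name, the response shape, and the `MAX_PAGE_SIZE` value are illustrative choices.

```python
from typing import Optional

MAX_PAGE_SIZE = 100  # assumed hard cap, applied no matter what the client requests


def list_page(rows: list[dict], after_id: Optional[int] = None, limit: int = 20) -> dict:
    """Return one page of rows whose id is greater than the cursor `after_id`."""
    # Clamp the requested page size so no request can return an unbounded set.
    limit = max(1, min(limit, MAX_PAGE_SIZE))
    # Only rows after the cursor qualify, in ascending id order.
    candidates = sorted(
        (row for row in rows if after_id is None or row["id"] > after_id),
        key=lambda row: row["id"],
    )
    page = candidates[:limit]
    has_more = len(candidates) > limit
    # The cursor for the next request is the last id of this page.
    return {"items": page, "next_cursor": page[-1]["id"] if has_more else None}
```

A client keeps passing the returned `next_cursor` as `after_id` until it receives `None`. Because the cursor is an identifier rather than an offset, rows inserted while the client is paging show up on later pages without shifting or duplicating earlier results.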
🎛 Define resource limits and service quotas in advance
Limiting the output of our system is only one part of the equation: there are numerous types of usage that should be kept under control.
Whether it's the throughput of requests, size of user-supplied data, or other resources our system works with, defining resource limits
and quotas upfront allows us to keep usage in check, enabling us to plan ahead when it comes to building features and scaling infrastructure.
If we don't define any quotas and instead allow our users to utilize our product in ways we cannot predict, we cannot make any guarantees as
to how the experience will turn out.
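As one possible way to enforce such limits, the following sketch combines a token bucket for request throughput with a size cap for user-supplied data. The class name, the `MAX_UPLOAD_BYTES` value, and the injectable `clock` parameter are assumptions chosen for this example; real limits depend on your product and plans.

```python
import time
from typing import Callable

MAX_UPLOAD_BYTES = 5 * 1024 * 1024  # assumed cap on user-supplied payloads


class TokenBucket:
    """Allow bursts of up to `capacity` requests, refilled at `rate` tokens per second."""

    def __init__(self, capacity: int, rate: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.capacity = capacity
        self.rate = rate
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill for the time that has passed, never exceeding the capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


def check_upload(size_bytes: int) -> None:
    """Reject payloads that exceed the configured size limit."""
    if size_bytes > MAX_UPLOAD_BYTES:
        raise ValueError(f"payload of {size_bytes} bytes exceeds {MAX_UPLOAD_BYTES} bytes")
```

Keeping one bucket per customer or per API key turns the quota into a number you can publish, monitor, and raise deliberately for specific plans.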
🛡 Expose stable APIs, prevent breaking changes
When providing APIs, whether public or internal, we'll have to make design decisions as to how certain resources are exposed.
Some concepts might not change throughout the lifecycle of our product, but in today's world of agile development and continuous
iteration on fully-digital products, chances are that we'll change a lot.
Planning far into the future can be difficult for features that might not even have been realized as an idea, and existing features might change
in scope or get superseded by other solutions. In the end, it becomes complex to maintain an ever-changing API, so we have to find ways
to manage the complexity without getting slowed down.
If a change is required, make sure to introduce it in a non-breaking way. In most cases, we can publish an additive change, deprecate
the previous version and communicate this with our customers, then remove the previous structure at a set point in the future.
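The add, deprecate, then remove cycle could look roughly like the sketch below. It assumes a hypothetical user resource whose single `full_name` field is being replaced by separate `first_name` and `last_name` fields; the field names and the removal date are made up for illustration.

```python
from datetime import date

# Hypothetical date, communicated to customers, after which the old field is dropped.
FULL_NAME_REMOVAL_DATE = date(2030, 1, 1)


def serialize_user(user: dict, today: date) -> dict:
    """Build the API response for a user during and after the deprecation window."""
    body = {
        "id": user["id"],
        # Additive change: the new fields appear alongside the old one.
        "first_name": user["first_name"],
        "last_name": user["last_name"],
    }
    if today < FULL_NAME_REMOVAL_DATE:
        # Deprecated field, kept so existing clients continue to work
        # until the announced removal date.
        body["full_name"] = f'{user["first_name"]} {user["last_name"]}'
    return body
```

Existing clients keep reading `full_name` unchanged during the deprecation window, while new clients can adopt the new fields right away.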
As we want to move fast, iterate, and build our product without worrying about multiple running versions, it's crucial to find a balance
between introducing new features and phasing out deprecated functionality. Communication is key when it comes to keeping customers consuming
our APIs happy.
☑️ Provide pre-defined options over user input where suitable
Last but not least, designing products involves making decisions as to how much power we grant our users.
In some instances, we might think about providing full control, for example when picking a color to be used
throughout the interface. There are two ways to go about this: We could display a color picker and allow the
user to choose any color they like. We could also display a list of colors that we have verified against our design guidelines
and other criteria, such as accessibility.
This simple example shows the importance of deciding who controls the experience of a feature: the user,
or the design and product team that envisioned it in a specific way.
Once again, if we want to remain in control of our product, we need to enforce limits, and in this case, going for the
pre-defined list of options works out considerably better.
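In code, a pre-defined list of options maps naturally onto an enumeration. The sketch below is one way to model the color example; the color names and hex values are placeholders and have not been checked against any design or accessibility guideline.

```python
from enum import Enum


class AccentColor(Enum):
    # Placeholder palette; a real list would come from the design team.
    BLUE = "#1d4ed8"
    GREEN = "#15803d"
    SLATE = "#334155"


def set_accent_color(settings: dict, choice: str) -> dict:
    """Return updated settings, accepting only colors from the approved list."""
    try:
        color = AccentColor[choice.upper()]
    except KeyError:
        allowed = ", ".join(c.name.lower() for c in AccentColor)
        raise ValueError(f"unknown color {choice!r}, expected one of: {allowed}") from None
    return {**settings, "accent_color": color.value}
```

Arbitrary input such as a raw hex code is rejected, so every stored value is one the team has already reviewed.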
🌊 Conclusion
Building products that are able to grow long-term requires careful planning early on. Assessing and minimizing risk when making
decisions concerning infrastructure and technology choices is crucial to ensure a predictable product development lifecycle.
Enforcing limits in every aspect of our system and isolating usage to achieve predictable performance is essential
for making and adhering to commitments to customers.
Build systems you're able to predict, observe, and control end to end.