Sep 01, 2020

Fundamental Design Decisions for Scalable Systems

When building customer-facing systems, especially in the realm of SaaS businesses, it's important to get a couple of design aspects right early on if you want to plan for scale and grow without unexpected side effects.

Scaling out software infrastructure for large customers, especially enterprise clients, requires careful planning so that you can meet your contracts, keep your customers happy, and keep your engineering team focused on building the product rather than putting out fires left and right.

If you get a few crucial steps right from the beginning, though, you will be able to plan ahead how you grow, rather than getting forced to scale up blindly and losing control, letting SLAs slip and upsetting customers.

🌏 Regional deployments, multi-tenancy and customer isolation

Offering a service that is available globally can be tempting. But unless you're building the next social network, designing your systems to synchronize across data centers around the globe exposes you to risks that are not worth taking.

There are reasons why the public cloud providers offer their services on a per-region basis: besides data governance, a key one is resilience when systems fail. Provisioning your infrastructure regionally will deliver performance results you can make guarantees on.

In some cases, customers will require that their data be stored in a specific location, especially in regulated industries where data security is valued.

Next up is tenancy. When building Software-as-a-Service products, we offer the same product to multiple customers. We could put all customers on one shared deployment (multi-tenancy) of our infrastructure, but that way we're giving up control: What if a customer manages to impact the experience of other users in the same deployment? Do we want free or self-service users to be able to affect the experience of paid or even enterprise customers?

The answer to those questions is, of course, no. If we want to sell a certain experience, we need to ensure that it can be delivered. Designing our systems to be deployed for fine-grained subsets of our customers, whether per customer or per subscription type, allows us to isolate users into manageable groups that can be scaled independently of one another.
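As a rough illustration of this kind of isolation, the sketch below maps a customer to a deployment based on region and subscription type. The plan names, the deployment naming scheme, and the `deployment_for` helper are assumptions made up for this example, not part of any particular platform.

```python
from dataclasses import dataclass

# Hypothetical shared deployments, one per self-service subscription type.
SHARED_DEPLOYMENTS = {
    "free": "shared-free",
    "paid": "shared-paid",
}


@dataclass(frozen=True)
class Customer:
    customer_id: str
    plan: str    # assumed values: "free", "paid", or "enterprise"
    region: str  # e.g. "eu-central"


def deployment_for(customer: Customer) -> str:
    """Return the name of the regional deployment that serves this customer."""
    if customer.plan == "enterprise":
        # Enterprise customers get a dedicated deployment per customer.
        return f"{customer.region}/dedicated-{customer.customer_id}"
    if customer.plan not in SHARED_DEPLOYMENTS:
        raise ValueError(f"unknown plan: {customer.plan}")
    # Everyone else shares a deployment only with the same subscription type.
    return f"{customer.region}/{SHARED_DEPLOYMENTS[customer.plan]}"
```

With a mapping like this, free users never land on the same deployment as paid or enterprise customers, and each group can be scaled on its own.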

🪐 Boring Technologies, Exciting Products

As we've already talked about the importance of minimizing risks connected to our infrastructure, whether by deploying into isolated geographic regions or separating customers to meet commitments, we can continue with a similar topic: which technologies to build on. The technologies you want to base your business on are of course highly specific to what you want to build, but there are some general recommendations that have proven their worth over time.

If you're not building the next cutting-edge technology yourself, make sure to choose a solid foundation. While it might be tempting to pick database X because it's new, fast, and shiny, chances are that database Y has been around for decades, used by millions of people and will just work. In the majority of cases you will use capabilities that aren't specific to a single product, so in theory, you're free to switch the implementation below without having to rebuild your internals, but sometimes, it does make sense to stick with one solid piece of technology and use it all the way.

While we're at it, transactional safety seems to be one of the talking points for both classic database solutions and up-and-coming products that offer global scale without sacrificing ACID properties. If you really want to go with a geo-distributed storage option, do your due diligence on whether it can keep its promises, so you avoid surprises later on.

Most decisions in terms of technology should be the result of a risk assessment process so you're clear on questions like long-term maintenance and support of critical components, potential alternatives that can be switched to, and vendor lock-in that might be fine for some things, but not for all areas.

🏛 Enforce strict pagination, never expose unlimited data sets

Now that we've prepared a solid foundation to build on, we need to make sure our service workloads stay stable as well. This means knowing how your data is accessed, and more importantly, setting limits.

Customer-facing API products that return arbitrary sets of data need to be carefully prepared to perform as expected. For read access, this means implementing pagination: if we're unsure how many results are available in the system, for example when requesting user-provided content, it is crucial to set clear boundaries as to how many results will be returned at once, and how a client will perform requests to fetch all available content in batches.

Pagination needs to be stable as well: entries that are added along the way should be taken into account, while duplicate results need to be prevented. This essentially describes cursor-based pagination on sorted, incremental identifiers. Of course, you can also expose different pagination strategies, as long as they work for your use case and set clear complexity boundaries.
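A minimal sketch of cursor-based pagination on incremental identifiers could look like the following. The in-memory list stands in for a database table, and `fetch_page`, `MAX_PAGE_SIZE`, and the response shape are assumptions chosen for illustration.

```python
from typing import Optional

MAX_PAGE_SIZE = 100  # hard upper bound per request (assumed value)


def fetch_page(items: list, after_id: Optional[int] = None, limit: int = 25) -> dict:
    """Return one page of items whose id is greater than after_id.

    `items` stands in for a table with an incremental `id` column.
    """
    if limit < 1:
        raise ValueError("limit must be at least 1")
    limit = min(limit, MAX_PAGE_SIZE)  # never return an unbounded result set

    # Roughly: SELECT ... WHERE id > :after_id ORDER BY id LIMIT :limit + 1
    matching = sorted(
        (item for item in items if after_id is None or item["id"] > after_id),
        key=lambda item: item["id"],
    )
    window = matching[: limit + 1]  # one extra row tells us if more pages exist
    page = window[:limit]
    has_more = len(window) > limit
    return {
        "items": page,
        "next_cursor": page[-1]["id"] if has_more else None,
    }
```

A client keeps passing `next_cursor` back as `after_id` until it receives `None`. Because each page starts strictly after the last identifier already seen, entries added in the meantime show up on later pages and nothing is returned twice.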

🎛 Define resource limits and service quotas in advance

Limiting the output of our system is only one part of the equation: there are numerous types of usage that should be kept under control. Whether it's the throughput of requests, the size of user-supplied data, or other resources our system works with, defining resource limits and quotas upfront allows us to keep usage in check, enabling us to plan ahead when it comes to building features and scaling infrastructure.

If we don't define any quotas and instead allow our users to utilize our product in ways we cannot predict, we cannot make any guarantees as to how the experience will turn out.
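To make this concrete, here is a small sketch that defines quotas per plan and enforces them with a fixed one-minute window. The plan names, the numbers, and the `FixedWindowLimiter` class are hypothetical; a production setup would more likely keep counters in a shared store and evict old windows.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Quota:
    requests_per_minute: int
    max_payload_bytes: int


# Hypothetical plans and limits; real values depend on the product.
QUOTAS = {
    "free": Quota(requests_per_minute=60, max_payload_bytes=1_000_000),
    "enterprise": Quota(requests_per_minute=6_000, max_payload_bytes=50_000_000),
}


class FixedWindowLimiter:
    """Counts requests per customer within one-minute windows (in memory)."""

    def __init__(self) -> None:
        self._counts = {}  # (customer_id, window index) -> requests used

    def allow(self, customer_id: str, plan: str, payload_bytes: int, now: float) -> bool:
        quota = QUOTAS[plan]
        if payload_bytes > quota.max_payload_bytes:
            return False  # reject oversized input before doing any work
        window = int(now // 60)  # index of the current one-minute window
        key = (customer_id, window)
        used = self._counts.get(key, 0)
        if used >= quota.requests_per_minute:
            return False  # quota for this window is exhausted
        self._counts[key] = used + 1
        return True
```

Passing `now` in explicitly (for example from `time.time()`) keeps the limiter easy to test. The point is less the specific algorithm than the fact that every limit is written down before customers start depending on the absence of one.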

🛡 Expose stable APIs, prevent breaking changes

When providing APIs, whether public or internal, we'll have to make design decisions as to how certain resources are exposed. Some concepts might not change throughout the lifecycle of our product, but in today's world of agile development and continuous iteration on fully-digital products, chances are that we'll change a lot.

Planning far into the future can be difficult: some features might not even exist as an idea yet, while existing features might change in scope or get superseded by other solutions. In the end, it becomes complex to maintain an ever-changing API, so we have to find ways to manage the complexity without getting slowed down.

If a change is required, make sure to introduce it in a non-breaking way. In most cases, we can publish an additive change, deprecate the previous version and communicate this to our customers, then remove the previous structure at a set point in the future.
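As an example of such an additive change, the sketch below splits a `name` field into `first_name` and `last_name` while still returning the old field until an announced removal date. The field names, the date, and the `serialize_user` function are invented for illustration.

```python
from datetime import date

# Hypothetical removal date that would be communicated to customers in advance.
NAME_FIELD_REMOVAL = date(2021, 3, 1)


def serialize_user(user: dict, today: date) -> dict:
    """Serialize a user, keeping the deprecated `name` field until its removal date."""
    response = {
        "id": user["id"],
        # New structure, added alongside the old one (additive change).
        "first_name": user["first_name"],
        "last_name": user["last_name"],
    }
    if today < NAME_FIELD_REMOVAL:
        # Deprecated: existing clients that still read `name` keep working.
        response["name"] = f'{user["first_name"]} {user["last_name"]}'
    return response
```

Existing clients see no difference during the deprecation period, new clients can adopt the new fields right away, and the old field disappears on the date everyone was told about.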

As we want to move fast, iterate, and build our product without worrying about multiple running versions, it's crucial to find a balance between introducing new features and phasing out deprecated functionality. Communication is key when it comes to keeping customers consuming our APIs happy.

☑️ Provide pre-defined options over user input where suitable

Last but not least, designing products involves making decisions as to how much power we grant our users. In some instances, we might think about providing full control, for example when picking a color to be used throughout the interface. There are two ways to go about this: We could display a color picker and allow the user to choose any color they like. We could also display a list of colors that we have verified against our design guidelines and other criteria, such as accessibility.

This simple example shows the importance of deciding whether a feature should be exposed in a way that lets the user control the experience, or in the way the design and product team envisioned it. Once again, if we want to remain in control of our product, we need to enforce limits, and in this case, going for the pre-defined list of options works out considerably better.
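On the backend, the pre-defined list from the color example could be enforced with an enumeration, roughly like this. The palette and the `parse_accent_color` helper are hypothetical, and the colors are only assumed to have passed a design and accessibility review.

```python
from enum import Enum


class AccentColor(Enum):
    # Hypothetical palette, assumed to be vetted against design and
    # accessibility guidelines beforehand.
    BLUE = "#1d4ed8"
    GREEN = "#15803d"
    PURPLE = "#7e22ce"


def parse_accent_color(value: str) -> AccentColor:
    """Accept only colors from the pre-defined list, looked up by name."""
    try:
        return AccentColor[value.upper()]
    except KeyError:
        allowed = ", ".join(color.name.lower() for color in AccentColor)
        raise ValueError(
            f"unsupported color {value!r}; choose one of: {allowed}"
        ) from None
```

Anything outside the list, including arbitrary hex values from a color picker, is rejected with a message that names the valid options.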

🌊 Conclusion

Building products that are able to grow long-term requires careful planning early on. Assessing and minimizing risk when making decisions concerning infrastructure and technology choices is crucial to ensure a predictable product development lifecycle. Enforcing limits in every aspect of our system and isolating usage to achieve predictable performance is essential for making and adhering to commitments to customers.

Build systems you're able to predict, observe, and control from end to end.