Aug 21, 2022

Balancing Normalization and Revision History

From early on, we’ve learned to deduplicate data and use concepts like foreign keys to make the most of our relational databases. Cascading operations are great to handle implicit updates and retain relational integrity, but somewhere, there is a case for denormalization.

Recently, while working on Anzu, I came across a design flaw in the initial spec. With Anzu, we performed a lot of diffing operations and kept deployment plans around, with references to resources. If you created a particular bucket, for example, you could go back to the deploy request and/or deployment that provisioned it and head to the resource from there.

Of course, whenever you modified or deleted that resource, you would also modify the deployment logs, which should be immutable. Suddenly, when dropping a complete environment, all previous deployments were empty. This happened because we cascaded the deletes to the deployment steps which referenced the resources that just got dropped.

This is a prime example of a case where a degree of denormalization is needed: At the time of creating or finalizing the deployment, we should have made sure that all data used by the system in the future (for the purpose of reviewing past actions) was stored separately, as a copy with no links back. While this sacrifices referential integrity, we don’t lose it once the resources are dropped.

This is similar to audit logs or webhooks to an external system, both of which transfer data for the purpose of archival.

Another example of solving this issue is when we built versioning at Hygraph. Back in the previous product version, we did not offer versioning at all, but when building the new version, it made sense from a product perspective.

Content editors should be able to go back and see the changes made to the document over time. Of course, the storage representation of a document may vary over time, field names change, and some fields are deleted and added over time, so we could either store the underlying identifiers with the content or try to give our best effort at reconciling the two states in the web application and store the document snapshot as JSON.

We went for the latter, and it’s mostly solid although it runs into the same difficulties as other archival workflows, where you cannot fully anticipate future product changes that may require different data to be stored.

In most cases, versioning is an emergency brake you can use to get an idea of what changed, especially when you performed unexpected changes. You should make sure not to promise indefinite version storage, a retention of a couple of weeks to months is reasonable, while up to a year and beyond should be reserved for high-value customers as you incur the cost of having to maintain these versions.