Querying a database often takes multiple network roundtrips, especially when you have operations that depend on the outcome of earlier queries: you’re dealing with state.
In most cases, those roundtrips are negligible, but what if you could reduce them to a single request containing all queries, answered by a single response? This is where batching comes in: most PostgreSQL drivers offer some way to send all queries to the server at once and then read the response for each statement.
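To make this concrete, here is a rough sketch of what such a batch could look like on the wire, using a made-up account table: the whole block is sent as one request, and the driver afterwards reads back one result per statement.
-- sent as a single request; the server executes the statements in order
-- and the client reads one result set per statement afterwards
BEGIN;
UPDATE "account" SET "balance" = "balance" - 10 WHERE "id" = 1;
UPDATE "account" SET "balance" = "balance" + 10 WHERE "id" = 2;
SELECT "id", "balance" FROM "account" WHERE "id" IN (1, 2);
COMMIT;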
But if we were to go down this route, we couldn’t have any operations depending on earlier ones because we send off all queries at once, without an option to change the control flow later on, right?
This is where a whole world of PostgreSQL tools opens up if you are ready to change your mental model about how database communication should happen, and about the role the database takes in your stack.
Offloading program flow into the database turns it into a place where much more computing happens, compared to just processing queries and handing back results.
This approach naturally has benefits and drawbacks: with fewer network roundtrips, your requests should see a performance gain, though how much depends strongly on your infrastructure setup. On the flip side, moving all the logic into your database not only makes you depend on it more, but also makes some parts much harder to test, debug, and reason about when something goes wrong.
In most cases, I would be careful about making such a decision, but if you want to go all-in on your database’s capabilities, let’s get right into it!
Assessing our options
PostgreSQL has several ways to deal with more stateful computation. For one, there is PL/pgSQL, a procedural language running inside your database instance that allows you to write functions and procedures which can use control structures, perform complex computations, and work with anything in your database.
While we could go all-in on PL/pgSQL, functions usually need to be created in the database in advance. This forces a different model of building and deploying services around the database: you’d need to update the functions in place, which might break existing deployments and cause a lot of headaches along the way.
Another idea is to use temporary tables to store state for later use. While this might sound crazy, temporary tables can be created within a transaction and removed at its end, leaving no trace behind. Furthermore, all queries can be drafted by an external service, sent off to the database instance as one batch, processed in PostgreSQL, and the results returned to the service.
This gives us the best of both worlds: fewer roundtrips by harnessing the power of SQL, and the ability to deploy new services while old ones are still running, without depending on any functions existing in the database in advance.
Side note: You might wonder whether we could just create temporary PL/pgSQL functions instead, and you can in fact run DO to process an anonymous code block. I haven’t used PL/pgSQL a lot, though, so I’ll focus on temporary tables in this guide.
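For reference, a minimal sketch of such an anonymous block might look like this; the procedural logic runs once and leaves nothing behind in the database, and the table it counts is just a placeholder:
DO $$
DECLARE
  post_count bigint;
BEGIN
  -- any procedural logic can go here; this block only counts the rows
  -- of a placeholder table and reports the result as a notice
  SELECT count(*) INTO post_count FROM "post";
  RAISE NOTICE 'found % posts', post_count;
END;
$$;
One caveat: a DO block cannot return rows to the client, which is another reason the temporary-table approach fits this use case better.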
Temporary Tables all the way down
Creating a temporary table within a transaction is as easy as sending
CREATE TEMPORARY TABLE "state" (
...
) ON COMMIT DROP;
This will create a new table called state (make sure you don’t have a table with this name already!), which will be deleted at the end of the transaction block.
For performance reasons, the PostgreSQL documentation recommends running ANALYZE on your temporary table yourself if you’re using it in complex queries, as the autovacuum daemon cannot access temporary tables.
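In practice that is one extra statement right after the temporary table has been filled:
-- refresh planner statistics for the freshly populated temporary table,
-- since autovacuum will never do this for us
ANALYZE "state";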
In addition to this, you should be aware of the number of temporary tables being created in a given transaction: I highly recommend re-using existing tables rather than creating dedicated ones for similar use cases, as PostgreSQL spends a noticeable amount of CPU time creating them. Reducing the number of temporary tables considerably lowers resource consumption, keeping them relatively cheap and viable as a strategy for storing state during a transaction.
A use case for state: Filtering and updating
You might still be wondering what temporary tables could actually be useful for, so let’s explore an example I have worked through many times: updating and later returning one or more rows identified by the very value being updated. Imagine we have the following table and content:
CREATE TABLE "post" (
"id" varchar(32) NOT NULL,
"title" varchar(512) NOT NULL,
CONSTRAINT "post_pkey" PRIMARY KEY ("id")
);
INSERT INTO "post" ("id", "title")
VALUES (
'24jXOt0cBpa1lX90bhwmNoLyEsy',
'📖 Paginating Large, Ordered Data Sets'
);
Now imagine we want to perform the following operations:
UPDATE "post"
SET "title" = 'pagination guide'
WHERE "title" = '📖 Paginating Large, Ordered Data Sets';
SELECT * FROM "post" WHERE "title" = '📖 Paginating Large, Ordered Data Sets';
While the UPDATE statement will work as expected and update the post, the SELECT statement will not return any rows, because the title has changed in the meantime. Of course, we could alter the second statement to
SELECT * FROM "post" WHERE "title" = 'pagination guide';
reflecting the updated value, but replacing these values isn’t always as easy, particularly when dealing with larger filters.
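To see why, picture a made-up, slightly larger filter that references the column we are about to change in more than one condition; after the update, every one of those conditions would need to be rewritten to still find the same row:
-- both conditions match the old title, neither matches 'pagination guide',
-- so simply re-running this filter after the UPDATE finds nothing
SELECT * FROM "post"
WHERE "title" ILIKE '%paginating%'
   OR "title" ILIKE '%ordered%';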
Another approach is to use a temporary table to first look up the post(s) we want to modify and store their IDs, which will never change, then use those IDs to perform both the update and the retrieval. This way, we convert the variable title value into a constant identifier.
BEGIN;
CREATE TEMPORARY TABLE "state" (
"id" varchar(32),
PRIMARY KEY ("id")
) ON COMMIT DROP;
-- resolve filter
INSERT INTO "state" ("id")
SELECT "id" FROM "post" WHERE "title" = '📖 Paginating Large, Ordered Data Sets'
ON CONFLICT ("id") DO NOTHING;
UPDATE "post" SET "title" = 'pagination guide'
WHERE "id" IN (SELECT "id" FROM "state");
SELECT * FROM "post" WHERE "id" IN (SELECT "id" FROM "state");
-- clear temporary table (only necessary when reused)
DELETE FROM "state" WHERE true;
ROLLBACK;
With this, we create one temporary table per transaction for this query, which holds all posts we plan to update and retrieve. We only need to run our filter once (which improves performance considerably for larger filters), and we will always return the posts we modified.
If we plan on using the same temporary table more than once, we can either just drop all rows after we’re done with one part of the query (so any subsequent operations start off from a clean slate), or introduce namespaces if we’re running nested operations:
CREATE TEMPORARY TABLE "resolved_ids" (
"namespace" varchar(64),
"id" varchar(32),
PRIMARY KEY ("namespace", "id")
) ON COMMIT DROP;
INSERT INTO "resolved_ids" ("namespace", "id")
SELECT 'update-1', "id" FROM "post"
WHERE "title" = '📖 Paginating Large, Ordered Data Sets'
ON CONFLICT ("namespace", "id") DO NOTHING;
UPDATE "post" SET "title" = 'pagination guide'
WHERE "id" IN (SELECT "id" FROM "resolved_ids" WHERE "namespace" = 'update-1');
SELECT * FROM "post" WHERE "id" IN (SELECT "id" FROM "resolved_ids"
WHERE "namespace" = 'update-1');
DELETE FROM "resolved_ids" WHERE "namespace" = 'update-1';
With namespaces, multiple operations can share the same temporary table, reducing the number of tables you need to create and, with it, the resource footprint.
Programming in a database
By now, you might be wondering: aren’t we just reinventing, inside the database, the same programming environment we’re used to? And in a sense, that is true. If we take our previous query, in a traditional environment we could have written something like
const ids = resolveIds({ filter: { title: "Paginating Large, Ordered Data Sets" } });
updatePosts({ data: { title: "pagination guide" }, where: { id_in: ids } });
const data = retrievePosts({ where: { id_in: ids } });
With this mental model, temporary tables roughly represent variables (arbitrary state) that are kept around for the remainder of the operation. All operations are then performed on the resolved identifiers rather than evaluating the filter every time, which could return the wrong posts after updating.
I’m still torn between the two approaches to using a database: I usually prefer a boring setup that you can test and debug, but there’s a certain elegance in making the database work for you, rather than just spitting out data and performing some basic modifications. For the sake of your team, I would advise accepting the cost of roundtrips in exchange for more maintainable software; there are probably quite a few queries you can simplify or remove completely before you have to decide to offload more work into the database. But if you’re determined to go down this road, you can build a lot of admittedly cool, but crazy, things.