Dec 11, 2022

Storing User-Defined Attributes in ClickHouse with Nested Types

ClickHouse is an up-and-coming solution for processing large-scale analytics data without the complex architecture that is usually required. Originally built by Yandex for Yandex.Metrica, their web analytics competitor to Google Analytics, ClickHouse has found widespread adoption (Cloudflare, Vercel via Splitbee, Microsoft, Uber, ByteDance, Stellate, and more) for real-time analytics, usually in dashboard-style applications.

While it’s easy to start using ClickHouse (especially with the newly-released managed ClickHouse Cloud), I’ve found the learning curve for proper data modeling and querying steeper than, say, Postgres. Part of that may be that I haven’t used column-oriented databases much, but I also think the developer ecosystem around ClickHouse is still in its early days.

There are many client libraries (usually wrapping the HTTP interface), but no definitive tools that have built a sustained advantage over the others. That makes starting with ClickHouse feel more experimental than it should, because the database itself works quite well.

While I’m getting more exposure to ClickHouse for Anzu (especially our Analytics product), I’m trying to document my experience in as much detail as possible (both for myself and for everyone interested in starting with ClickHouse). In the end, everyone benefits from a wide range of guides and documentation, as well as a robust ecosystem, and I’d really like to see ClickHouse succeed in the long term, especially the open-source side of the product.

But enough with the preface: in this guide, I want to explore how we can model user-defined attributes in analytics settings. Products like Google Analytics and Mixpanel usually allow developers to send a key-value list of attributes, not all of which are known beforehand.

From a product development perspective, you’d probably like to minimize the number of unknown attributes so you can provide a rich experience, but sometimes your customers will need the ability to enrich their events with data from their domain.

Storing user-defined attributes

There are many ways to design our database model such that we can store attributes, but we can discard most of them once we need to support an unknown set of keys and values. When we only have one table for all users, adding 100 columns for different types may work in some cases, but it is inherently limited. We could also generate a table per customer and provision columns dynamically, but that would require more effort than we’re in for.

ClickHouse offers a JSON data type, which works well for semi-structured cases with limited cardinality, but if we expect a wide range of attributes, it’s not the best fit. We could also store attributes as stringified JSON and either use string functions or parse the JSON at query time (which is surprisingly fast, since ClickHouse uses simdjson under the hood), but that feels hacky.

So where does that leave us? A lot of people have recommended nested columns. A nested data structure is almost like embedding a small table in each row. In fact, Nested is shorthand for creating an array column for each column in the nested structure and validating that all arrays are of the same length when inserting a row.
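To make that shape concrete, here is a small plain-Python sketch (illustrative only, not ClickHouse code) of how a key-value attribute map decomposes into the parallel arrays a Nested(key String, value String) column stores:

```python
# A Nested(key String, value String) column behaves like two parallel
# array columns whose lengths must match for every row.
def to_nested_arrays(attributes: dict) -> tuple[list[str], list[str]]:
    keys = list(attributes.keys())
    values = [str(v) for v in attributes.values()]
    assert len(keys) == len(values)  # ClickHouse validates this on insert
    return keys, values

keys, values = to_nested_arrays({"user": "bruno", "action": "account.create"})
# keys   == ["user", "action"]
# values == ["bruno", "account.create"]
```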

Imagine we want to store a table of chronological events in ClickHouse, for the purpose of analyzing user behaviour later on.

A very basic version of this table could look like the following:

create table event (
  id UUID default generateUUIDv4(),
  created_at DateTime default now(),
  attributes Nested(
    key String,
    value String
  )
) engine = MergeTree() primary key (id) order by (id)

We won’t go over the choice of engine, primary key, and order key for the remainder of this guide, as covering these in depth would justify a dedicated guide of its own. For our purposes, it’s sufficient to set those details aside.

Our event table expects a simple tuple of id, creation timestamp, and attributes, which are simply arrays of keys and values of the same length.

We can access both the key and value array for a row by using the dot notation as follows:

select attributes.key, attributes.value from event;

-- ["user","action"]	["bruno","account.create"]
-- ["user","action","feature"]	["bruno","webhook.create","webhooks"]
-- ["user","plan","action"]	["bruno","enterprise-very-expensive","plan.subscribe"]

Next, let’s see how to insert the data that was just returned. Once we know how to insert values for our nested data structure, we can check out some queries that could become important later on.

Inserting data with attributes

Since we defined default values for our id and created_at columns, we can focus on the attributes when inserting data. When using the query console, it’s relatively straightforward to insert data by using array literals:

insert into event (attributes.key, attributes.value) values
	(['user', 'action'], ['bruno', 'account.create']),
	(['user', 'action', 'feature'], ['bruno', 'webhook.create', 'webhooks']),
	(['user', 'plan', 'action'], ['bruno', 'enterprise-very-expensive', 'plan.subscribe'])
;

When building a real application, we have to use a different ingestion mechanism, though. ClickHouse supports a wide range of ways to insert data, including uploading CSV files, importing from S3, using an ETL pipeline, or streaming data with Kafka. Since we want to keep it simple (avoid complexity at all costs), we’ll use parameterized queries with our library (or client) of choice.

Most ClickHouse libraries and clients support parameterized queries (similar to prepared statements), in which column values are given placeholders using a simple name and type combination. Remember that Nested generates arrays of the given type, so we’ll have to address the “virtual” attributes.key and attributes.value columns using string arrays.

insert into event (attributes.key, attributes.value) values
    ({attributes_key:Array(String)}, {attributes_value:Array(String)});

We’ll then supply the parameters to the query call, and our data will be inserted just like that. Next, we’ll go over some queries that could come in handy for our analytics use case.
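From application code, this boils down to flattening the attribute map into the two parallel arrays and handing them to the client together with the query. The helper below is hypothetical glue code; how the parameters dict is actually passed depends on your client library (clickhouse-connect, for example, accepts a parameters dict alongside the query):

```python
# The insert statement with server-side {name:Type} placeholders from above.
INSERT_SQL = (
    "insert into event (attributes.key, attributes.value) values "
    "({attributes_key:Array(String)}, {attributes_value:Array(String)})"
)

def build_insert_params(attributes: dict) -> dict:
    # Split the user-defined attributes into the two parallel string arrays
    # that the Nested(key String, value String) column expects.
    return {
        "attributes_key": list(attributes.keys()),
        "attributes_value": [str(v) for v in attributes.values()],
    }

params = build_insert_params({"user": "bruno", "action": "account.create"})
# params == {"attributes_key": ["user", "action"],
#            "attributes_value": ["bruno", "account.create"]}
```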

Common queries on attributes

When building our analytics product, we may need to display filters on attributes. Things like “only show the events related to this specific user” or “only show the events sent from this browser type” become much easier once we can show the user which attributes were sent previously.

ClickHouse offers a nice bucket of array functions we can make use of. While we’ll operate on the full data set here, you should always add guard rails in production (like limiting the time range). We also haven’t added a scope (team, workspace, project, etc.) to our event table, so you’ll have to make some adjustments for multi-tenant use cases.

Which attribute keys were sent?

If we’re interested in a list of attribute keys sent to our analytics database, array joins will come in useful. Array joins simply “unwrap” array columns for each row, creating a dedicated row for each value.

-- create row for each key value of all rows
select
    attributes.key
from event
array join attributes;
user
action
user
action
feature
user
plan
action

As you can see, every key we sent pops up. This output format is alright, but we can make some improvements! How about we merge all rows into one array column and one result row? groupArray is an aggregate function that will merge all values into one array for each group. Since we haven’t specified a group by clause, all rows will be merged into one array.

-- merge all keys from individual rows into one array again
select
    groupArray(attributes.key)
from event
array join attributes;
["user","action","user","action","feature","user","plan","action"]

Much better! Now we can return a single array instead of having to do the work on the client. But as you can see, we’re currently keeping duplicates. We might not want to show duplicates in a filter selection input, so let’s remove them using the arrayDistinct function.

-- filter out duplicate key values
select
    arrayDistinct(groupArray(attributes.key))
from event
array join attributes;
["user","action","feature","plan"]

Yup, much better.
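If the pipeline of array join → groupArray → arrayDistinct still feels abstract, here is a plain-Python stand-in that mimics the semantics on our three sample rows (illustrative only; ClickHouse does all of this server-side):

```python
# The key arrays of our three sample events.
events = [
    ["user", "action"],
    ["user", "action", "feature"],
    ["user", "plan", "action"],
]

# array join: unwrap the arrays, one result row per element
joined = [key for keys in events for key in keys]

# groupArray: collect all rows into a single array
grouped = list(joined)

# arrayDistinct: drop duplicates, keeping the first occurrence
distinct = list(dict.fromkeys(grouped))
# distinct == ["user", "action", "feature", "plan"]
```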

Which key/value pairs were sent?

Up next, let’s get an overview of all key/value pairs that the analytics database received. Starting simple, we’ll create a tuple from our key and value arrays and unwrap them (so each tuple gets its own row).

-- get key/value tuples as rows
select
    (attributes.key, attributes.value)
from event
array join attributes;
["user","bruno"]
["action","account.create"]
["user","bruno"]
["action","webhook.create"]
["feature","webhooks"]
["user","bruno"]
["plan","enterprise-very-expensive"]
["action","plan.subscribe"]

So far, so good. Every key/value pair of all events takes up a row each. We can perform the same trick as earlier and pull up all values into one array using groupArray.

-- group key/value tuples in one row
select
   groupArray((attributes.key, attributes.value))
from event
array join attributes;
[["user","bruno"],["action","account.create"],["user","bruno"],["action","webhook.create"],["feature","webhooks"],["user","bruno"],["plan","enterprise-very-expensive"],["action","plan.subscribe"]]

Once again, we can remove duplicate values using arrayDistinct.

-- distinct key/value tuples in one row
select
   arrayDistinct(groupArray((attributes.key, attributes.value)))
from event
array join attributes;
[["user","bruno"],["action","account.create"],["action","webhook.create"],["feature","webhooks"],["plan","enterprise-very-expensive"],["action","plan.subscribe"]]
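The same trick works for key/value pairs. In plain Python, zipping the two parallel arrays mirrors what the tuple expression does per row (again just an illustrative model of the query above):

```python
# (keys, values) arrays of our three sample events.
events = [
    (["user", "action"], ["bruno", "account.create"]),
    (["user", "action", "feature"], ["bruno", "webhook.create", "webhooks"]),
    (["user", "plan", "action"], ["bruno", "enterprise-very-expensive", "plan.subscribe"]),
]

# array join + tuple: one (key, value) pair per element
pairs = [(k, v) for keys, values in events for k, v in zip(keys, values)]

# groupArray + arrayDistinct: collect, then deduplicate
distinct_pairs = list(dict.fromkeys(pairs))
# ("user", "bruno") now shows up only once
```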

Nice. Let’s do one more round!

For each key, which values were sent?

This one combines what we learned previously: We’ll retrieve a list of unique (distinct) values for each key. Let’s come up with the solution step by step. First, we need to get all keys.

-- get all keys
select
    attributes.key
from event
array join attributes
group by attributes.key
feature
user
plan
action

We used an array join to unwrap our attributes and grouped by key, which allows us to collect the values for each key in the next step.

To show the values, we need to use an aggregate function that respects our grouping. Since the group is on key, we want a function that gives us all values for the key group. If you thought that groupArray would do just that, you’re right!

-- get all values grouped by key
select
    attributes.key,
    groupArray(attributes.value)
from event
array join attributes
group by attributes.key
feature	["webhooks"]
user	["bruno","bruno","bruno"]
plan	["enterprise-very-expensive"]
action	["account.create","webhook.create","plan.subscribe"]

That’s… interesting! It seems multiple events used the user:bruno attribute. Let’s tidy this up and remove duplicate entries in our grouped values array.

-- get distinct values grouped by key
select
    attributes.key,
    arrayDistinct(groupArray(attributes.value))
from event
array join attributes
group by attributes.key
feature	["webhooks"]
user	["bruno"]
plan	["enterprise-very-expensive"]
action	["account.create","webhook.create","plan.subscribe"]

Much better. ClickHouse will take care of returning our values in no time, and our customers can quickly understand which events were ingested. From here, you can craft more complex queries using different aggregate functions, or process your array data further with other array functions.
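As a final sanity check, the whole “distinct values per key” query can be modeled in a few lines of plain Python (illustrative only; the grouping and deduplication happen inside ClickHouse):

```python
from collections import defaultdict

# (keys, values) arrays of our three sample events.
events = [
    (["user", "action"], ["bruno", "account.create"]),
    (["user", "action", "feature"], ["bruno", "webhook.create", "webhooks"]),
    (["user", "plan", "action"], ["bruno", "enterprise-very-expensive", "plan.subscribe"]),
]

# group by key, then collect distinct values per key
# (arrayDistinct(groupArray(...)) with group by attributes.key)
values_by_key: dict[str, list[str]] = defaultdict(list)
for keys, values in events:
    for k, v in zip(keys, values):
        if v not in values_by_key[k]:
            values_by_key[k].append(v)

# values_by_key["user"]   == ["bruno"]
# values_by_key["action"] == ["account.create", "webhook.create", "plan.subscribe"]
```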