Jun 26, 2022

Improving Cache Hit Ratios with Connected DataLoaders

The DataLoader concept has become popular with GraphQL, combining batching and caching of database requests, API calls, and any other resource fetching implementation.

Usually, DataLoader instances are used to fetch entities from a database, using one or more keys. Some DataLoaders receive a single key and return a single entity or none (load entity by ID); others receive a key and return a list of entities (often receiving the parent identifier and returning all child entities).

Using multiple DataLoaders to access the same entity often leads to a variety of caching-related issues: Stale cache reads can occur when you update the entity but not every DataLoader is invalidated accordingly. You can also run into unnecessary database roundtrips when you don’t prime related DataLoaders after fetching an entity. This can create waterfall-like loading behaviour that DataLoaders were created to prevent in the first place.
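To make the stale-read problem concrete, here is a minimal sketch that uses two plain Map objects as stand-ins for the caches of two DataLoaders. The name field and the key format are assumptions made for illustration only; they are not part of the Member type used later in this post.

```typescript
interface Member {
  id: string;
  team: string;
  account: string;
  name: string; // hypothetical field, only here to make the update visible
}

// Hypothetical stand-ins for the per-request caches of two DataLoaders.
const byIdCache = new Map<string, Member>();
const byTeamAndAccountCache = new Map<string, Member>();

const original: Member = { id: 'm1', team: 't1', account: 'a1', name: 'Ada' };
byIdCache.set(original.id, original);
byTeamAndAccountCache.set(`${original.team}:${original.account}`, original);

// The member is renamed, but only the by-ID cache is refreshed.
const updated: Member = { ...original, name: 'Grace' };
byIdCache.set(updated.id, updated);

const freshRead = byIdCache.get('m1');
const staleRead = byTeamAndAccountCache.get('t1:a1');

console.log(freshRead?.name); // Grace
console.log(staleRead?.name); // Ada (stale: this cache was never invalidated)
```

Both reads happen within the same request, yet they disagree about the same entity, which is exactly the kind of bug that is hard to trace back to a missing invalidation.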

As software changes over time, it’s easy to forget to edit existing DataLoaders and data access methods when introducing a new DataLoader, which leads to more issues down the road.

We ran into similar issues at Hygraph, where we make heavy use of DataLoaders to batch and cache resources for the duration of a single request. Previously, some operations caused hundreds of requests to be sent back and forth between the service and its database.

After improving cache invalidation, enforcing mutual cache priming, and wrapping all changes in a simple API surface so that everyone would enjoy the benefits without having to think about it, we managed to halve database calls.

As refactoring our DataLoaders meant updating all existing places where we access our databases, we carefully planned the move, discussed several solutions and designs, and decided to go with a shared storage DataLoader system that I’ll outline below.

To make an informed decision, we also measured and compared the performance of the old and new implementations. We settled on the number of database calls and the time spent processing them as a proxy, because direct comparisons were complicated by code changes (for example, switching to DataLoaders in places that didn’t use them before led to different access patterns due to batching).

Let’s dive into the changes we made to our DataLoader infrastructure!

An example of the new system

To understand the changes, let’s compare the old workflow with the improved version.

import DataLoader from 'dataloader';

interface Member {
  id: string;
  team: string;
  account: string;
}

describe('dataloader', () => {
  it('should cache members', async () => {
    let byIdLoader: DataLoader<string, Member>;
    let byTeamAndAccountIdLoader: DataLoader<
      { teamId: string; accountId: string },
      Member
    >;

    byIdLoader = new DataLoader<string, Member>(async ids => {
      return ids.map(id => {
        // TODO Perform some actual fetching here
        const someRes: Member = {
          id,
          team: 'batchfn-random-team-id',
          account: 'batchfn-random-account-id'
        };
        byTeamAndAccountIdLoader
          .clear({ accountId: someRes.account, teamId: someRes.team })
          .prime({ accountId: someRes.account, teamId: someRes.team }, someRes);
        return someRes;
      });
    });

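    byTeamAndAccountIdLoader = new DataLoader<
      { teamId: string; accountId: string },
      Member,
      string
    >(
      async ids => {
        return ids.map(({ teamId, accountId }) => {
          // TODO Perform some actual fetching here
          const someRes: Member = {
            id: 'batchfn-random-id',
            account: accountId,
            team: teamId
          };
          byIdLoader.clear(someRes.id).prime(someRes.id, someRes);
          return someRes;
        });
      },
      {
        // Object keys are compared by identity by default, so map them to a
        // string; otherwise priming and clearing with fresh objects never matches.
        cacheKeyFn: ({ teamId, accountId }) => `${teamId}:${accountId}`
      }
    );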

    const loaded = await byIdLoader.load('some-random-id');

    const loaded2 = await byTeamAndAccountIdLoader.load({
      accountId: loaded.account,
      teamId: loaded.team
    });

    expect(loaded).toStrictEqual(loaded2);

    byTeamAndAccountIdLoader.clear({
      accountId: loaded.account,
      teamId: loaded.team
    });
    byIdLoader.clear(loaded.id);

    const loaded3 = await byTeamAndAccountIdLoader.load({
      accountId: loaded.account,
      teamId: loaded.team
    });

    expect(loaded3.id).not.toEqual(loaded.id);
  });
});

In this snippet, we set up two DataLoaders that access members by different attributes. Both are unique key combinations that should return exactly one member. When loading a member by id, we also make sure to cache that entity by teamId and accountId, and vice versa.

While this looks straightforward with two loaders, imagine adding a couple more loaders and methods that all fetch and manipulate members. It’s easy to forget a single .clear().prime() in one place and cause issues that are incredibly hard to debug.

With this in mind, let’s go over the second example that uses the improved DataLoader architecture.

import { DataLoaderStorage } from './dataloader';

interface Member {
  id: string;
  team: string;
  account: string;
}

describe('dataloader', () => {
  it('should use same underlying data store', async () => {
    const storage = new DataLoaderStorage<Member>();

    const byIdLoader = storage.createDataLoader<string>(
      value => value.id,
      async ids => {
        return ids.map(id => {
          const someRes: Member = {
            id,
            team: 'batchfn-random-team-id',
            account: 'batchfn-random-account-id'
          };
          return someRes;
        });
      }
    );

    const byTeamAndAccountId = storage.createDataLoader<{
      teamId: string;
      accountId: string;
    }>(
      value => ({ accountId: value.account, teamId: value.team }),
      async ids => {
        return ids.map(({ accountId, teamId }) => {
          const someRes: Member = {
            id: 'batchfn-random-id',
            account: accountId,
            team: teamId
          };
          return someRes;
        });
      }
    );

    const loaded = await byIdLoader.load('some-random-id');

    const loaded2 = await byTeamAndAccountId.load({
      accountId: loaded.account,
      teamId: loaded.team
    });

    expect(loaded).toStrictEqual(loaded2);

    byTeamAndAccountId.clearValue(loaded);

    const loaded3 = await byTeamAndAccountId.load({
      accountId: loaded.account,
      teamId: loaded.team
    });

    expect(loaded3.id).not.toEqual(loaded.id);
  });
});

In this snippet, all cross-referencing cache invalidation and priming has disappeared, but we still enjoy the benefit of no additional roundtrips after the entity has been loaded once.

We can also easily add and clear cached values by accessing the storage object.

Part 1: A generic map

First, we need a place to store our cached values. As we want to store arbitrary key-value pairs, we cannot assume a specific type for our key, but we always store the same value type (as we use one storage map per entity type).

The map overrides the standard Map methods, transforming the passed key on the fly before delegating the call to the superclass. We do this because we want to store string keys in the map exclusively, which makes it possible to compare non-string keys (such as objects) by value later on.

class StorageMap<Value> extends Map<unknown, Promise<Value>> {
  private transformKey(key: unknown): string {
    if (key === null) {
      throw new Error('Key cannot be null');
    }

    if (typeof key === 'string') {
      return key;
    }

    // Sort object keys
    if (typeof key === 'object') {
      const sortedEntries = Object.entries(key).sort((a, b) =>
        a[0].localeCompare(b[0])
      );
      return JSON.stringify(sortedEntries);
    }

    return JSON.stringify(key);
  }

  // Modified map fns to match generic identifier type onto string map (by transforming non-string keys)
  public get(key: unknown): Promise<Value> | undefined {
    return super.get(this.transformKey(key));
  }

  public set(key: unknown, value: Promise<Value>): this {
    super.set(this.transformKey(key), value);
    return this;
  }

  public clear() {
    super.clear();
  }

  public has(key: unknown): boolean {
    return super.has(this.transformKey(key));
  }

  public delete(key: unknown): boolean {
    return super.delete(this.transformKey(key));
  }
}
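To see what the key transformation buys us, the following sketch copies the logic of transformKey into a standalone function (called normalizeKey here, a name chosen only for this example) and compares a few keys.

```typescript
// Standalone copy of the key transformation, for illustration only.
function normalizeKey(key: unknown): string {
  if (typeof key === 'string') {
    return key;
  }

  if (typeof key === 'object') {
    if (key === null) {
      throw new Error('Key cannot be null');
    }
    // Sort top-level entries so property order does not affect the result
    const sortedEntries = Object.entries(key).sort((a, b) =>
      a[0].localeCompare(b[0])
    );
    return JSON.stringify(sortedEntries);
  }

  return JSON.stringify(key);
}

const keyA = normalizeKey({ teamId: 't1', accountId: 'a1' });
const keyB = normalizeKey({ accountId: 'a1', teamId: 't1' });

console.log(keyA); // [["accountId","a1"],["teamId","t1"]]
console.log(keyA === keyB); // true

// Caveat: the number 1 and the string '1' end up as the same map key.
console.log(normalizeKey(1) === normalizeKey('1')); // true
```

Two objects with the same properties resolve to the same entry regardless of property order. The last line shows a limitation of this scheme that is worth keeping in mind if loaders with numeric and string keys share a map.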

Part 2: A shared storage object

After implementing the map, we can create the core class that connects our DataLoaders. The DataLoaderStorage class holds a list of registered DataLoader instances, which can accept arbitrary keys but offer a key function to retrieve the key for a given entity.

Whenever any connected DataLoader attempts to store a value, it invokes the primeValue function which stores the value for every known key (by invoking the keyFn of each registered DataLoader). This makes sure the entity is immediately accessible for all DataLoaders, reducing further database calls.

export class DataLoaderStorage<Value> {
  private storage = new StorageMap<Value>();
  private registeredLoaders: SharedStorageDataLoader<unknown, Value>[] = [];

  public constructor() {}

  /**
   * Retrieve storage map
   */
  public get() {
    return this.storage;
  }

  private keysFn(value: Value): unknown[] {
    const keys: unknown[] = [];
    for (const registered of this.registeredLoaders) {
      keys.push(registered.keyFn(value));
    }
    return keys;
  }

  /**
   * Create a new DataLoader linked to this storage instance.
   *
   * @param keyFn
   * @param loadFn
   */
  public createDataLoader<Identifier>(
    keyFn: (value: Value) => Identifier,
    loadFn: DataLoader.BatchLoadFn<Identifier, Value>
  ): SharedStorageDataLoader<Identifier, Value> {
    const loader = new SharedStorageDataLoader(this, keyFn, loadFn);
    this.registeredLoaders.push(loader);
    return loader;
  }

  /**
   * Primes given value for every possible key. Will overwrite existing values!
   *
   * @param value
   */
  public primeValue(value: Value) {
    const keys = this.keysFn(value);
    for (const key of keys) {
      this.storage.set(key, Promise.resolve(value));
    }
  }

  /**
   * Removes every given key for the given value.
   *
   * @param value
   */
  public clearValue(value: Value) {
    const keys = this.keysFn(value);
    for (const key of keys) {
      this.storage.delete(key);
    }
  }

  public clearAll() {
    this.storage.clear();
  }
}
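The priming and clearing behaviour described above can be tried out without any DataLoader at all. The sketch below is a deliberately simplified stand-in, not the class shown above: MiniStorage is a hypothetical name, it stores plain values instead of promises, and it registers key functions that return strings instead of registering loaders.

```typescript
interface Member {
  id: string;
  team: string;
  account: string;
}

// Simplified stand-in for the shared storage (assumption: string keys only).
class MiniStorage<Value> {
  private entries = new Map<string, Value>();
  private keyFns: Array<(value: Value) => string> = [];

  register(keyFn: (value: Value) => string): void {
    this.keyFns.push(keyFn);
  }

  // Store the value under every key that a registered key function derives
  primeValue(value: Value): void {
    for (const keyFn of this.keyFns) {
      this.entries.set(keyFn(value), value);
    }
  }

  // Remove the value under every key that a registered key function derives
  clearValue(value: Value): void {
    for (const keyFn of this.keyFns) {
      this.entries.delete(keyFn(value));
    }
  }

  get(key: string): Value | undefined {
    return this.entries.get(key);
  }
}

const storage = new MiniStorage<Member>();
storage.register(m => `id:${m.id}`);
storage.register(m => `team:${m.team}/account:${m.account}`);

const member: Member = { id: 'm1', team: 't1', account: 'a1' };
storage.primeValue(member);

const viaId = storage.get('id:m1');
const viaTeamAndAccount = storage.get('team:t1/account:a1');
console.log(viaId === viaTeamAndAccount); // true

storage.clearValue(member);
const afterClear = storage.get('id:m1');
console.log(afterClear); // undefined
```

A single primeValue call makes the member reachable through both keys, and a single clearValue call removes both entries again, which is the property the real storage class provides to its connected loaders.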

Part 3: Shared storage DataLoaders

Our storage class uses a SharedStorageDataLoader, which extends the classic DataLoader class, requiring a storage and keyFn argument at the time of creation.

Whenever one or more values are to be loaded, we simply delegate the call to the underlying DataLoader, which handles batching and stores the resolved value in our storage map. Then, we invoke primeValue on the connected storage, making sure the resolved entities are accessible by other loaders without any further database calls.

Unfortunately, we still have to implement the clear function, which receives a single key to clear the associated entity. I say unfortunately because it defeats the purpose of our connected system: we always want to invalidate all cached entries in the storage map associated with a given entity, but a single key only lets us clear one entry. A potential workaround is to load the currently cached entity and pass it to clearValue. For now, it’s recommended to use clearValue over clear.
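The workaround mentioned above could be sketched as a small helper. Both clearEntityByKey and the ClearableLoader interface are hypothetical names introduced for this example; the interface only describes the two methods the helper needs, so any loader offering load and clearValue fits.

```typescript
// The slice of the loader API this helper relies on.
interface ClearableLoader<Key, Value> {
  load(key: Key): Promise<Value>;
  clearValue(value: Value): unknown;
}

// Resolve the entity behind `key`, then clear it under every known key.
// Caveat: if the entity is not cached yet, load() fetches it first,
// so this can cost one extra database call.
async function clearEntityByKey<Key, Value>(
  loader: ClearableLoader<Key, Value>,
  key: Key
): Promise<void> {
  const value = await loader.load(key);
  loader.clearValue(value);
}
```

With the loaders from the example earlier in this post, a call would look like await clearEntityByKey(byIdLoader, 'some-random-id'). If the load rejects, the helper rejects as well and nothing is cleared.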

import DataLoader from 'dataloader';

export class SharedStorageDataLoader<Identifier, Value> extends DataLoader<
  Identifier,
  Value
> {
  /**
   *
   * @param storage a DataLoaderStorage instance that accepts a superset
   * of the identifiers this DataLoader deals with and the same value type.
   * @param keyFn
   * @param loadFn
   */
  constructor(
    private storage: DataLoaderStorage<Value>,
    public keyFn: (value: Value) => Identifier,
    loadFn: DataLoader.BatchLoadFn<Identifier, Value>
  ) {
    super(loadFn, {
      cacheMap: storage.get()
    });
  }

  async load(key: Identifier): Promise<Value> {
    const loaded = await super.load(key);
    this.storage.primeValue(loaded);
    return loaded;
  }

  async loadMany(keys: Identifier[]): Promise<(Value | Error)[]> {
    const loaded = await super.loadMany(keys);
    for (const value of loaded) {
      if (!(value instanceof Error)) {
        this.storage.primeValue(value);
      }
    }
    return loaded;
  }

  /**
   * Clears ONLY the cache entry stored under this key; entries for the same entity under other loaders' keys stay cached
   *
   * @deprecated (required for DataLoader interface compatibility): Use clearValue instead
   * @param key
   */
  clear(key: Identifier): this {
    return super.clear(key);
  }

  public clearValue(value: Value): this {
    this.storage.clearValue(value);
    return this;
  }

  prime(key: Identifier, value: Error | Value): this {
    if (value instanceof Error) {
      this.clear(key);
      return this;
    }

    this.storage.primeValue(value);
    return this;
  }
}

With some delegation and a shared map we made sure that all our DataLoaders are connected, priming and invalidating all loaders at once, reducing database calls and making it easier to reason about our code by providing a simple abstraction.

All of that sounds great, but what’s the catch? This idea only works for loaders where unique keys return one entity! Let’s explore this issue in the next part.

Dealing with multi-loaders

In some cases, you might want to load multiple entities by a parent ID.

const byTeamIdLoader = new DataLoader<string, Member[]>(async ids => {
  // TODO Perform some actual fetching here
  const fetched: Member[] = [
    {
      id: 'batchfn-random-id',
      team: ids[0],
      account: 'batchfn-random-account-id'
    }
  ];

  return ids.map(id => fetched.filter(f => f.team === id));
});

In this case, we also want to make sure our other DataLoaders are kept in sync. To do that, we can invoke primeValue on our storage, which will update the storage for all connected DataLoaders.

const byTeamIdLoader = new DataLoader<string, Member[]>(async ids => {
  // TODO Perform some actual fetching here
  const fetched: Member[] = [
    {
      id: 'batchfn-random-id',
      team: ids[0],
      account: 'batchfn-random-account-id'
    }
  ];

  fetched.forEach(f => storage.primeValue(f));

  return ids.map(id => fetched.filter(f => f.team === id));
});

What we cannot do, however, is connect DataLoaders like the one above to our shared storage. The reason is simple: loading one entity doesn’t give us the other entities required to satisfy the request. Put differently, when we load one team member, we can’t guess the rest, so we cannot prime the byTeamIdLoader.

This is something to keep in mind, but usually the data flows in the opposite direction: when you load all team members, you prime the individual members as well, so a subsequent load of a member by ID finds the value already cached. In that case, a single database call can serve multiple use cases.