Apr 03, 2024

Building a Hybrid Search Experience

You might have noticed a recent addition to the top right corner of this blog: I’ve added a global search experience. More specifically, I’ve implemented a hybrid search solution based on OpenAI Embeddings, Supabase, and PostgreSQL full-text search. In this post, I’d like to run through the design decisions I made building this system and explain how the current iteration works end to end.

Previously

Long-time readers may recall previous iterations of this blog that included an Algolia-powered search experience. When I switched out Gatsby in favor of Next.js for the foundation of the current stack in 2020, I removed all the bells and whistles I didn’t want to maintain anymore, including the in-house newsletter and search solutions. I wanted to focus on the essence of the blog, the content.

Over time, discoverability has become a challenge: New readers have a hard time finding content they’re curious about, and I’m losing track of what I wrote about years ago. Introducing a new search experience is one of the first steps I’m taking to improve content discoverability, with more changes to follow in the coming weeks.

When I started exploring the possible ways to implement search, I decided not to run with a managed solution like Algolia. Instead, I wanted to build a search stack tailored to the blog’s needs.

When users search for content, they might remember some detail of a post they read, or they might be interested in a certain technology. They might have come across one of my startups or projects. They might enter an area of work like system architecture, continuous integration, or containers. And they may paraphrase, use synonyms, or other related words.

In short, they will use a thousand possible ways to reach the same goal: finding a post. And I need to supply the right tools to make this task easier.

Keyword or full-text search finds exact matches, for instance returning all posts mentioning CodeTrail, my most recent startup. It breaks down when you diverge from the phrasing I’ve used: replace “VM” with “virtualization” and you’re out of luck.

Semantic search operates on learned representations of concepts in vector space. Imagine having to group similar words in a room: you might put ice closer to cold, and coffee closer to hot and warm. Containers end up close to virtualization and far away from surfboard. So-called embedding models are trained to arrange words in an n-dimensional space, minimizing the distance between tokens that should be semantically similar. Comparing the embedded vector representation of a query string with that of a blog post’s content using cosine similarity ranks results from closely matching (approaching 1) to unrelated (approaching 0). Intuitively, this works because a small angle between two vectors in the learned space corresponds to close semantic similarity.
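To make the idea concrete, here’s a minimal sketch of cosine similarity between two embedding vectors in TypeScript. In the actual setup, this comparison happens inside PostgreSQL via pgvector, so treat this purely as an illustration:

```ts
// Cosine similarity between two embedding vectors of equal length.
// Values close to 1 mean the texts are semantically similar,
// values close to 0 mean they are unrelated.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```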

Semantic search is a powerful tool: it fetches semantically similar results for a query and returns the same results for semantically similar queries.

Semantic search is limited by what the embedding model learned during training, so it might not know about the latest technology. Running “Apple Vision Pro” through an embedding model trained before the last WWDC will most likely not place it close to “virtual reality”. This is why semantic search should be coupled with a more traditional search strategy that respects keywords.

Hybrid search approaches combine semantic search with keyword or full-text search to get the best of both worlds. Reciprocal Rank Fusion (RRF) is an algorithm that combines two ranked result lists into one unified result set. In our implementation, we will run semantic search and full-text search separately and use RRF to create a single list of results.
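Here’s a rough sketch of the RRF scoring idea in TypeScript, using the commonly cited constant k = 60. The real fusion happens inside the database, so the shape of the inputs is illustrative:

```ts
// Reciprocal Rank Fusion: combine two ranked lists of post IDs into one.
// Each result contributes 1 / (k + rank) to its post's fused score;
// k (commonly 60) dampens the influence of lower-ranked results.
function reciprocalRankFusion(
  semanticResults: string[],
  fullTextResults: string[],
  k = 60
): string[] {
  const scores = new Map<string, number>();
  for (const results of [semanticResults, fullTextResults]) {
    results.forEach((id, index) => {
      const rank = index + 1;
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}
```

A post that shows up near the top of both lists accumulates a high fused score, while a post that only appears far down one list barely moves the needle.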

Choosing the stack

The current blog iteration is built with Next.js and deployed on Vercel. Every post is a Markdown file; assets like images are uploaded and distributed separately. Markdown is processed using the unified ecosystem of tools, including remark and rehype, and finally converted to React components and rendered in the browser.

To store full-text search vectors and embedding vectors, I decided to create a PostgreSQL instance managed by Supabase with the pgvector extension enabled, storing vector columns and performing semantic search using cosine similarity operators between query and embedding vectors.

To convert natural language search queries to a suitable representation in vector space, I’m using the OpenAI Embeddings endpoint with the recently launched text-embedding-3-small model.
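A minimal sketch of that call with the openai Node SDK might look like this (the 512-dimension setting is explained further down):

```ts
import OpenAI from "openai";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

// Turn a natural-language string into an embedding vector.
// text-embedding-3-small supports shortening the output via `dimensions`.
async function embed(text: string): Promise<number[]> {
  const response = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: text,
    dimensions: 512,
  });
  return response.data[0].embedding;
}
```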

I’ve built an indexer service running on every push using a GitHub Actions workflow. All search requests are sent to a Vercel Edge Function endpoint. I use PostHog to understand how the search experience is adopted.

Most of the decisions are pretty standard for a small MVP-stage project and will serve well for a while. I’ve deliberately planned this feature as an experiment to address discoverability issues, and I might replace it with a better solution in the future.

End-to-End Implementation

To populate the posts database with new or updated post content, I’ve created a GitHub Actions workflow running an index script on push or manual dispatch.

The indexer loads all Markdown post content from disk, extracts frontmatter, and iterates over each post. It creates a full-text search vector incorporating the post title, slug, keywords, and other metadata. This procedure may change in the future.
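A simplified sketch of that loading step could look like the following; the gray-matter frontmatter parser and the exact field names are assumptions for illustration, not necessarily what the real indexer uses:

```ts
import { readdir, readFile } from "node:fs/promises";
import path from "node:path";
import matter from "gray-matter"; // assumed frontmatter parser

// Load every Markdown post, parse its frontmatter, and build the text
// that later becomes the full-text search vector in Postgres.
async function loadPosts(postsDir: string) {
  const files = (await readdir(postsDir)).filter((file) => file.endsWith(".md"));
  return Promise.all(
    files.map(async (file) => {
      const raw = await readFile(path.join(postsDir, file), "utf8");
      const { data, content } = matter(raw);
      const slug = file.replace(/\.md$/, "");
      return {
        slug,
        content,
        // Title, slug, and keywords feed the full-text search vector.
        searchText: [data.title, slug, ...(data.keywords ?? [])].join(" "),
      };
    })
  );
}
```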

To avoid regenerating vector embeddings for unchanged posts, we calculate a hash of the current file content and compare it to the hash stored with the post’s entry in the database. If the post hasn’t been indexed previously or the hashes don’t match, we create vector embeddings for the post.
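The hash check itself is straightforward; here’s a sketch using Node’s crypto module, with table and column names that are purely illustrative:

```ts
import { createClient } from "@supabase/supabase-js";
import { createHash } from "node:crypto";

const supabase = createClient(process.env.SUPABASE_URL!, process.env.SUPABASE_SERVICE_KEY!);

// Hash the raw Markdown so unchanged posts can be skipped.
function contentHash(markdown: string): string {
  return createHash("sha256").update(markdown).digest("hex");
}

// True when the post is new or its content changed since the last index run.
async function needsReindex(slug: string, markdown: string): Promise<boolean> {
  const { data } = await supabase
    .from("posts")
    .select("content_hash")
    .eq("slug", slug)
    .maybeSingle();
  return data?.content_hash !== contentHash(markdown);
}
```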

To support longer content, we split the Markdown content into chunks using LangChain’s MarkdownTextSplitter. We then create embeddings for each chunk using OpenAI’s text-embedding-3-small model, configured to return just 512 dimensions compared to the legacy text-embedding-ada-002’s 1536, while achieving better performance.
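A sketch of the chunking and embedding step might look like this; the chunk sizes are illustrative, and the import path may differ depending on the langchain version:

```ts
import { MarkdownTextSplitter } from "langchain/text_splitter";
import OpenAI from "openai";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

// Split a post into overlapping Markdown-aware chunks and embed each one.
async function embedPost(markdown: string) {
  const splitter = new MarkdownTextSplitter({ chunkSize: 1000, chunkOverlap: 100 });
  const chunks = await splitter.splitText(markdown);

  const response = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: chunks,
    dimensions: 512, // shortened vectors, down from the model's default 1536
  });

  return response.data.map((item, i) => ({ chunk: chunks[i], embedding: item.embedding }));
}
```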

Finally, we insert post embeddings into the database using the vector column data type provided by pgvector.
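The insert step could be sketched roughly like this with supabase-js; again, table and column names are illustrative:

```ts
import { createClient } from "@supabase/supabase-js";

const supabase = createClient(process.env.SUPABASE_URL!, process.env.SUPABASE_SERVICE_KEY!);

// One row per chunk; `embedding` maps to a pgvector `vector(512)` column.
async function storeEmbeddings(
  slug: string,
  chunks: { chunk: string; embedding: number[] }[]
) {
  const rows = chunks.map(({ chunk, embedding }, index) => ({
    post_slug: slug,
    chunk_index: index,
    content: chunk,
    embedding,
  }));
  const { error } = await supabase
    .from("post_embeddings")
    .upsert(rows, { onConflict: "post_slug,chunk_index" });
  if (error) throw error;
}
```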

At runtime, when a user enters a search query, the Next.js frontend sends a GET request to the search endpoint deployed as a Vercel Edge Function. Where possible, we cache responses for repeated queries to reduce load. The function generates vector embeddings for the query string to calculate cosine similarity later on. We also convert the query into a full-text search query using websearch_to_tsquery.
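Sketched below is roughly what such an edge route could look like in a Next.js App Router setup; the route path, the caching headers, and the hybridSearch helper (sketched after the next paragraph) are assumptions for illustration:

```ts
// app/api/search/route.ts — illustrative route handler for the search endpoint.
import OpenAI from "openai";

export const runtime = "edge";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

export async function GET(request: Request) {
  const query = new URL(request.url).searchParams.get("q") ?? "";
  if (!query) return Response.json({ results: [] });

  // Embed the query string for the semantic half of the hybrid search.
  const { data } = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: query,
    dimensions: 512,
  });

  // hybridSearch is sketched after the next paragraph.
  const results = await hybridSearch(query, data[0].embedding);

  return Response.json(
    { results },
    // Let the CDN cache identical queries for a short while to reduce load.
    { headers: { "Cache-Control": "s-maxage=300, stale-while-revalidate=3600" } }
  );
}
```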

Finally, we invoke the hybrid search routine, which separately finds the best matches for semantic and full-text search and applies RRF to calculate a unified result set. We also apply user filters on keywords, publishing time, etc.
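The call site for that routine might be sketched as a Supabase RPC invocation like the following; the function name, parameters, and filters are illustrative rather than the exact schema I use:

```ts
import { createClient } from "@supabase/supabase-js";

const supabase = createClient(process.env.SUPABASE_URL!, process.env.SUPABASE_ANON_KEY!);

// Invoke a Postgres function (defined separately) that runs full-text and
// semantic search, fuses both rankings with RRF, and applies user filters.
async function hybridSearch(
  query: string,
  queryEmbedding: number[],
  filters: { topics?: string[]; publishedAfter?: string } = {}
) {
  const { data, error } = await supabase.rpc("hybrid_search", {
    query_text: query, // converted to a tsquery via websearch_to_tsquery in SQL
    query_embedding: queryEmbedding,
    match_count: 10,
    topics: filters.topics ?? null,
    published_after: filters.publishedAfter ?? null,
  });
  if (error) throw error;
  return data;
}
```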

Moving to Production

To provide a predictable experience, I’ve instrumented both frontend and backend, adding PostHog to analyze search usage for future improvements. I’ve also added service quotas to prevent adversarial users from exhausting system resources and configured billing limits in third-party services to prevent unexpected overspending.

Iterations

Initially, I created a minimal search experience using only semantic search. This worked well for semantically similar queries but returned suboptimal results when queries included little-known entity names. Adding keyword search and combining results using RRF solved this issue.

To help users find specific content more easily, I enabled filtering support on topics, which are used throughout the blog to tag areas of work, programming languages, tools, and more.

In the future, I’ll probably add more filters and tune hybrid search parameters to perform better. This should make it even easier to sift through all posts to find the needle in the haystack.


Thanks for reading all the way through this! This project has been an interesting challenge; building a usable search experience takes more effort than I initially expected. The first public release is now live, so feel free to try it by clicking the search button in the top right or hitting ⌘K.