Processing Markdown with remark and unified plugins

Markdown is one of the most popular formats for writing on the web, used for everything from documentation and blog posts to nearly all content on GitHub. The CommonMark specification standardized the supported syntax, with readability as the primary goal.

Extending Markdown with new elements, or with functionality specific to you or your organization, is relatively easy as well. GitHub Flavored Markdown (GFM for short) is one of the more popular extensions.

Working with Markdown programmatically, however, requires tools that let you modify a data structure representing your content. Modifying strings is error-prone and quickly leads to unmaintainable logic, which is why the remark Markdown processor was built on top of unified. I'll explain what each of these tools does in a second.

As mentioned, we don't want to operate on raw Markdown strings, so we need to parse them into a more machine-readable structure: an abstract syntax tree (AST). This is where unified comes in, as it provides interfaces for working with syntax trees. The remark processor then builds on this infrastructure to handle Markdown.
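To make the idea of a syntax tree more concrete, here is a rough, hand-written sketch of the tree you would get for a two-line document. Positional information is left out, and the `SketchNode` type and `collectText` helper are assumptions made for illustration, not part of remark or unified.

```typescript
// Simplified node shape for illustration only; real nodes also carry position data.
interface SketchNode {
  type: string;
  value?: string;
  depth?: number;
  children?: SketchNode[];
}

// Hand-written approximation of the tree for this document:
//
//   # Hello
//
//   Some *text*
const sampleTree: SketchNode = {
  type: 'root',
  children: [
    { type: 'heading', depth: 1, children: [{ type: 'text', value: 'Hello' }] },
    {
      type: 'paragraph',
      children: [
        { type: 'text', value: 'Some ' },
        { type: 'emphasis', children: [{ type: 'text', value: 'text' }] },
      ],
    },
  ],
};

// Collect the plain text by walking the tree depth-first.
function collectText(node: SketchNode): string {
  if (node.value !== undefined) {
    return node.value;
  }
  return (node.children || []).map(collectText).join('');
}

console.log(collectText(sampleTree)); // prints "HelloSome text"
```

Every piece of formatting becomes a node with a `type`, and only the leaf nodes hold actual text. Plugins work by reading or rewriting this structure instead of the original string.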

Let's look at an example use case: altering a Markdown document to better suit our needs. A relatively straightforward problem I recently found myself solving is creating excerpts for blog posts based on their content.

import unified from 'unified';
import remark from 'remark';
import parseMarkdown from 'remark-parse';
import githubFlavoredMarkdown from 'remark-gfm';
import stripMarkdown from 'strip-markdown';

export async function createMarkdownExcerpt(content: string) {
  const { contents: excerpt } = await remark()
    .use(parseMarkdown)
    .use(githubFlavoredMarkdown)
    .use(stripMarkdown)
    .process(content);

  // Keep the first 240 characters (slice replaces the deprecated substr)
  const shortened = excerpt.toString().trim().slice(0, 240);

  return `${shortened}...`;
}

As you can see, this solution is really simple, and you might want to customize it depending on your use case. One issue I quickly ran into was that strip-markdown replaces image elements with their alt text. If your post looked something like this

![header](some_file.png)

Content comes here

you would end up with the following excerpt

header Content comes here

This is not useful at all: alt text is really important for accessibility, but it should not be included in the excerpt. Unfortunately, strip-markdown doesn't allow you to drop elements altogether, so we need something different.

Stepping back, we already run a chain of processors on our source data. As you can see in the first code block, we first parse Markdown into an AST that unified can work with, then enable GitHub Flavored Markdown on top of that, and finally strip all formatting to get the raw text for our excerpt.

If we could remove all image nodes in our AST before stripping the tags, they would not get processed and rendered in the first place. And this is the solution I came up with:

import { Node } from 'unist';

/**
 * Cleanup node to remove all matching tags down the tree
 * @param n
 * @param tags
 * @returns processed node
 */
function removeTags(n: Node, tags: string[]) {
  // Read the node's children, if it has any
  const children = n.children;

  // If node doesn't have children, there's nothing to do
  if (!children || !Array.isArray(children)) {
    return n;
  }

  // Store processed children
  const result: Node[] = [];

  // Iterate over children
  for (const child of children) {
    // Child should probably be a node as well
    const childNode = child as Node;

    // If we stumbled over a matching tag, skip
    if (tags.includes(childNode.type)) {
      continue;
    }

    // Make sure to process the child node as well
    const processed = removeTags(childNode, tags);

    // And push to the result list
    result.push(processed);
  }

  // Then update the children
  n.children = result;

  // And return again
  return n;
}

const stripTags: unified.Plugin<[{ tags: string[] }]> = ({ tags }) => {
  return node => {
    // This will only be called once with the whole AST
    if (node.type === 'root') {
      removeTags(node, tags);
    }
    return node;
  };
};

It's quite primitive, but it does the job. Our new stripTags plugin receives an array of element types (or node types) it should remove from the final output.

The transformer receives the root node, which contains all content as children. We then call removeTags recursively: for every node that has children, it drops each child whose type matches and processes the remaining ones. The result is a filtered Markdown tree, which is then passed to the next processor in the chain.
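To see the effect without running the full pipeline, the following sketch applies the same recursive filtering to a hand-built tree that approximates the image example from earlier. `TreeNode` and `removeByType` are hypothetical stand-ins for the `unist` `Node` type and the `removeTags` function, chosen so the snippet runs without any dependencies.

```typescript
// Hypothetical minimal node shape, used here instead of the unist Node type.
interface TreeNode {
  type: string;
  value?: string;
  alt?: string;
  children?: TreeNode[];
}

// Same idea as removeTags: drop matching children, then recurse into the rest.
function removeByType(node: TreeNode, types: string[]): TreeNode {
  if (!node.children) {
    return node;
  }
  node.children = node.children
    .filter(child => types.indexOf(child.type) === -1)
    .map(child => removeByType(child, types));
  return node;
}

// Hand-written approximation of the tree for:
//
//   ![header](some_file.png)
//
//   Content comes here
const postTree: TreeNode = {
  type: 'root',
  children: [
    { type: 'paragraph', children: [{ type: 'image', alt: 'header' }] },
    { type: 'paragraph', children: [{ type: 'text', value: 'Content comes here' }] },
  ],
};

removeByType(postTree, ['image', 'imageReference']);

// The image node is gone, but the paragraph that contained it is still there.
console.log(JSON.stringify(postTree, null, 2));
```

Note that the first paragraph survives with an empty list of children. For an excerpt that is harmless, since an empty paragraph contributes no text, but a stricter variant could also drop parents that end up empty.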

We can integrate our new plugin like this

const { contents: excerpt } = await remark()
  .use(parseMarkdown)
  .use(githubFlavoredMarkdown)
  // We remove all image and imageReference tags
  .use(stripTags, { tags: ['image', 'imageReference'] })
  .use(stripMarkdown)
  .process(content);

This is an easy entry point for working with Markdown: from here you can insert, modify, or delete nodes in the tree. Working on an AST allows you to build plugins that can be tested and understood quite easily.
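Modifying nodes follows the same pattern as deleting them. As a small sketch, the function below swaps every image for a placeholder text node instead of removing it. The `DocNode` type, the `replaceImages` name, and the `[image]` placeholder are all made up for this example.

```typescript
// Hypothetical minimal node shape; real Markdown nodes carry more fields.
interface DocNode {
  type: string;
  value?: string;
  alt?: string;
  children?: DocNode[];
}

// Return a copy of the tree in which every image is replaced by a text node.
function replaceImages(node: DocNode, placeholder: string): DocNode {
  if (node.type === 'image') {
    return { type: 'text', value: placeholder };
  }
  if (!node.children) {
    return node;
  }
  const children = node.children.map(child => replaceImages(child, placeholder));
  return { ...node, children };
}

const input: DocNode = {
  type: 'paragraph',
  children: [
    { type: 'text', value: 'Before ' },
    { type: 'image', alt: 'header' },
  ],
};

const output = replaceImages(input, '[image]');

// Checking the result only means comparing plain objects, no parsing involved.
console.log(JSON.stringify(output.children));
```

Because the function builds new parent nodes instead of mutating them, the input tree stays untouched, which keeps tests for such a transform short and predictable.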

I hope you enjoyed this post. If you have any feedback, questions, or suggestions, send a mail or reach out on Twitter.