Working with Parsers and ASTs

For the last couple of days, I pushed through the tedious task of setting up a Markdown parser and formatter in Java that is consistent with our existing system written in TypeScript. For this first iteration, I decided to customize flexmark-java, a popular Markdown parser library for Java based on the commonmark-java package. CodeTrail stores internal documentation as Markdown files that live in your codebase. For the most part, this is plain old GitHub-flavored Markdown, but we’ve added some custom syntax for our features.

I won’t go through the specifics of this ordeal, not because they aren’t interesting, but quite frankly because I haven’t grokked the parser’s internals well enough to explain them in this post. Instead, I’ll go over the approach I used to understand an unfamiliar parser codebase and reach feature parity across implementations. Most of these techniques are general-purpose engineering knowledge and apply to other problems as well.

In an interesting turn, working on this project reminded me of the early days of Hygraph when I worked together with my colleagues to implement a core service based on a GraphQL parser. We worked on lots of AST magic, and I was able to use some of our insights from back in the day.

Choose the right tools and you’re halfway there

To implement custom parsing and formatting behavior for our custom Markdown blocks, I systematically debugged the parser at every step, figuring out the internal structure and approach used to turn a Markdown document string into an AST data structure. The purpose of this task wasn’t to fully understand the codebase but to find potential extension points to use.

For our reference TypeScript implementation, we’ve set up a range of tests to ensure Markdown processing behavior doesn’t break over time. For the Java version, I set up JUnit and translated some fundamental test cases. Not only does this help with maintaining parity in terms of behavior, but it’s also a great starting point.

Fundamentally, we write tests that start with a source Markdown string, which gets parsed by our customized parser. The resulting AST is checked for correctness, then we run it through a formatter to produce another Markdown string. This is compared to the initial document and should be a match. This end-to-end flow ensures that the entire pipeline works as expected and leads to consistent outputs. We wouldn’t want users to receive slightly different Markdown files every time they saved a document without changing anything.
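To make the round-trip idea concrete, here is a minimal sketch of the check. It deliberately avoids flexmark-java and our real test suite: the tiny parser and formatter below are hypothetical stand-ins that only know headings and paragraphs, and every name in the snippet is made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Toy stand-in for the real pipeline; all names here are hypothetical.
final class RoundTripSketch {

    /** Minimal AST node: a heading (level 1-6) or a paragraph (level 0). */
    static final class Block {
        final int headingLevel;
        final String text;

        Block(int headingLevel, String text) {
            this.headingLevel = headingLevel;
            this.text = text;
        }
    }

    /** Splits the document on blank lines and classifies each chunk. */
    static List<Block> parse(String markdown) {
        List<Block> blocks = new ArrayList<>();
        for (String chunk : markdown.strip().split("\\n\\s*\\n")) {
            String trimmed = chunk.strip();
            if (trimmed.isEmpty()) {
                continue;
            }
            // Count leading '#' characters, up to the maximum heading level.
            int level = 0;
            while (level < trimmed.length() && level < 6 && trimmed.charAt(level) == '#') {
                level++;
            }
            boolean isHeading = level > 0
                    && level < trimmed.length()
                    && trimmed.charAt(level) == ' ';
            blocks.add(isHeading
                    ? new Block(level, trimmed.substring(level + 1).strip())
                    : new Block(0, trimmed));
        }
        return blocks;
    }

    /** Renders the AST back to Markdown in one canonical form. */
    static String format(List<Block> blocks) {
        StringBuilder out = new StringBuilder();
        for (Block block : blocks) {
            if (out.length() > 0) {
                out.append("\n\n");
            }
            if (block.headingLevel > 0) {
                out.append("#".repeat(block.headingLevel)).append(' ');
            }
            out.append(block.text);
        }
        return out.append('\n').toString();
    }

    /** The end-to-end check: parse, format, compare with the source. */
    static boolean roundTrips(String source) {
        return format(parse(source)).equals(source);
    }
}
```

In a JUnit test, the last method would sit behind an assertion such as assertTrue(RoundTripSketch.roundTrips(source)). Checking the AST between the two steps, as described above, is a separate assertion on the parsed blocks.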

Even with a solid testing workflow and the debugger, reading decompiled classes isn’t a great help. Go and modern JavaScript codebases allow you to vendor or “fork” dependencies effortlessly. Unfortunately, I haven’t found a well-documented way to achieve this in Java, so I forked the entire parser for the time being. This way, we get direct access to the Java source and can modify anything we need. This may not be the optimal setup for dependencies with frequent updates, but thankfully, parsers don’t change much.

Are two implementations worth it?

With the Java implementation, we now have a grand total of two aligned Markdown parsers. While they may be consistent for the time being, for any new feature or fixed bug we’ll have to maintain parity. This requires a certain effort we should weigh against alternative implementations.

The biggest benefit of a simple Java implementation is portability: it’ll run anywhere the rest of our Java code already runs. It doesn’t require any downloads, platform-specific code, or other complexity. It’s just Java code.

We’ll evaluate the current system over the coming weeks and carefully analyze whether the benefit is worth the downsides of potentially inconsistent behavior and the effort to keep both implementations aligned. In case we want to opt for an alternative, we’ve already researched a bunch of options that I’ll cover in the next section.

Alternative approaches

While a dedicated implementation in Java is the most straightforward option to plug into our Java codebase, we may decide on a different design in the future.

A hard requirement is to run everything on the developer’s machine. This means we cannot simply host our parser and send requests via HTTP. In an ideal world, the parser could be embedded in our Java application to ensure full portability without any additional required dependencies.

Since we already maintain a reference Markdown parser implementation in TypeScript, which is used by multiple components in the system, it makes sense to reuse that code for our Java codebase.

Unfortunately, due to several dependencies and JavaScript ecosystem quirks it’s not as straightforward as transpiling our JavaScript source to Java. Instead, we can instruct our Java code to invoke a customized entry point for the JavaScript Markdown parser.

This entry point could be a simple CLI. In this case, we’d bundle all parser-related code into one file and make it executable. We don’t really want to force our users to install Node.js on their system, so we’d have to use a tool like bun build --compile, deno compile, pkg, or nexe to create a single executable that includes the JavaScript runtime. These executables are typically on the order of 100 MB each, and all of that needs to be delivered to our users.

If we accept the large binary sizes and the requirement to build and download platform-specific versions of the parser CLI, this approach is pretty smooth.
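On the Java side, calling such a CLI could look roughly like the sketch below. It assumes the executable reads Markdown from standard input and writes its result to standard output. That protocol is an assumption for illustration, not something we’ve settled on, and the class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical wrapper around a bundled parser executable.
final class ParserCliSketch {

    /**
     * Starts the given command, pipes the Markdown to its stdin, and
     * returns whatever the process wrote to stdout.
     */
    static String run(List<String> command, String markdown)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command)
                // Let the child's error output show up in our own stderr.
                .redirectError(ProcessBuilder.Redirect.INHERIT)
                .start();

        byte[] input = markdown.getBytes(StandardCharsets.UTF_8);

        // Write stdin on a separate thread so that a large document cannot
        // deadlock against a full stdout pipe.
        Thread writer = new Thread(() -> {
            try (OutputStream stdin = process.getOutputStream()) {
                stdin.write(input);
            } catch (IOException ignored) {
                // The child closed its stdin early; the exit code reports why.
            }
        });
        writer.start();

        byte[] output = process.getInputStream().readAllBytes();
        writer.join();

        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("parser CLI exited with code " + exitCode);
        }
        return new String(output, StandardCharsets.UTF_8);
    }
}
```

The command list would point at the platform-specific binary shipped with the application. Spawning a process per document adds startup cost, so a real integration might keep one long-running process around instead.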

Alternatively, we could use an embedded JavaScript engine. Java bindings for V8 have existed for some time now and communication is straightforward thanks to the built-in serialization features.

We’re cooking up an incredible new version of CodeTrail as you’re reading this. If you’re curious about the engineering side, don’t forget to check out Tim’s recent post on another area of the codebase. And make sure to follow our blogs so you don’t miss any updates!