Now that everyone and their dog has tried out Generative AI, teams around the globe are wondering how they can incorporate LLMs into their products for an upcoming fundraising effort or, hopefully, to create a great experience for their customers and solve some real problems. In recent months, countless reports on Generative AI adoption have raised broader concerns about getting this new technology rolled out in production.
As a software engineer, I’ve been following the formation of a vibrant ecosystem of projects around LLMs. As a co-founder of Gradients & Grit, I’ve explored production-ready ways of integrating LLMs into existing products.
A lot of progress has been made in the open-source model landscape. Yet, before building system-critical functionality on top of LLMs, teams need to address some underlying issues. In the following, I’ll explore the core areas holding back LLM adoption in production environments.
In the past, it was easy to design systems and trace outputs back to inputs. You could evaluate systems with requirements and compare different implementations. With machine learning and especially deep learning, the connection between inputs and outputs has gotten fuzzy by design. Distributed systems sit at every level, adding randomness and non-determinism to the mix.
With LLMs, differences in architecture, training data, inference, and parameters supplied in each request can lead to wildly different responses, and that’s before accounting for hallucinations. When you’re consuming a model through an API, there’s no guarantee that you’ll receive the same response for the same request twice in a row.
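One way to put a number on this is to send the same prompt several times and check how often the responses agree. The following is a minimal sketch, not a full evaluation harness: `generate` is a hypothetical stand-in for whatever function calls your model, and exact string matching is an assumption that only suits short, constrained answers.

```python
from collections import Counter
from typing import Callable


def agreement_rate(generate: Callable[[str], str], prompt: str, runs: int = 5) -> float:
    """Return the share of runs that produced the most common response.

    `generate` is a hypothetical callable wrapping your model or API client.
    """
    if runs < 1:
        raise ValueError("runs must be at least 1")
    # Strip surrounding whitespace so trivial formatting noise is not counted.
    outputs = [generate(prompt).strip() for _ in range(runs)]
    most_common_count = Counter(outputs).most_common(1)[0][1]
    return most_common_count / runs
```

A rate of 1.0 means every run returned the same text. For free-form answers you would likely swap the exact match for a similarity measure.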
This creates complexity throughout the entire software engineering lifecycle, from development to production. For starters, it’s hard to nail the right prompt for a use case (one metric could be the best performance for the lowest token usage). The prompt isn’t portable either, as you can’t switch the underlying model and expect similar results. You can’t even expect the same model to yield the same results in the future. With open-source models, it should be easier to pin your system to a specific model version, which is great for consistency. Similar to regular dependencies, you’ll have to update or at least revisit prompts once you update the model.
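Treating the prompt and the pinned model version as one versioned unit makes that dependency explicit. Here is a small sketch of the idea. The `PromptSpec` class, its fields, and the example model names are all hypothetical and not tied to any provider.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptSpec:
    """A prompt template bundled with the exact model it was tuned against."""

    model: str            # hypothetical model identifier
    revision: str         # pinned model version or weights revision
    template: str         # prompt text with {placeholders}
    temperature: float = 0.0

    def fingerprint(self) -> str:
        """Short hash that changes whenever the model or the prompt changes."""
        payload = "\n".join(
            [self.model, self.revision, self.template, repr(self.temperature)]
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]

    def render(self, **variables: str) -> str:
        """Fill the template placeholders with concrete values."""
        return self.template.format(**variables)
```

Storing the fingerprint next to every logged response lets you tell which prompt and model combination produced it, and flags the prompts to revisit after a model update.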
If you’re self-hosting the model, you’ll need to worry about serving it in production, and handling inference on a cluster of machines. This is distributed computing with near real-time requirements, which has been a challenge for every other use case in the past. If you’re opting for a managed solution, you can offload some complexity at the cost of giving up control and your data. Later on, we’ll see that the latter is a relevant blocker for enterprise customers and government institutions.
Of course, in traditional software engineering, you had to spend time to get to a good solution while balancing trade-offs, too. Yet, the degree of fuzziness or lack of guarantees with LLMs and machine learning in general has to be accounted for. You need to be aware of the system’s biases leading to unexpected consequences for a wrong prediction. This is why most products have incorporated AI in nice-to-have features with human supervision. Running LLMs in the background without any safeguards may be an expensive mistake.
This is not to say that traditional software isn’t prone to bugs that can destroy livelihoods; it’s just harder to test notoriously unpredictable systems.
Another interesting facet of working with LLMs is that you’re switching between strict and fuzzy logic when crossing boundaries between code and prompts. LLMs accept a modicum of inconsistency when it comes to formatting prompts, so chaining prompts works better than expected. Factual inaccuracy in one response might lead to a game of telephone when passed along a chain of LLMs, though.
Unfortunately, it’s hard to evaluate correctness once you parse the result back into code. Forcing models to output a well-defined format like JSON has been shown to help, and support for steerability is getting better with newer models.
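Even when a model is asked for JSON, it pays to validate the response at the boundary before the rest of the code relies on it. The sketch below uses only the standard library. The expected fields (`sentiment`, `confidence`) are made up for illustration, so adapt them to your own schema.

```python
import json
from typing import Any

# Hypothetical schema: field name -> expected Python type.
REQUIRED_FIELDS = {"sentiment": str, "confidence": float}


def parse_model_output(raw: str) -> dict[str, Any]:
    """Parse a model response and fail loudly if it does not match the schema."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"model did not return valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")

    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in data:
            raise ValueError(f"missing field: {name}")
        value = data[name]
        # Accept JSON integers where a float is expected, but not booleans.
        if expected_type is float and isinstance(value, int) and not isinstance(value, bool):
            value = float(value)
        if not isinstance(value, expected_type):
            raise ValueError(f"field {name!r} has type {type(value).__name__}")
        data[name] = value
    return data
```

A failed validation is a natural point to retry the request or route the case to a human, rather than passing a malformed result further down the chain.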
Let’s assume we’ve found a suitable use case for Generative AI and created a prototype for it. How expensive will it be to run this in production? Since OpenAI is subsidized by Microsoft, they can afford to operate at a loss for the near future, but unless you’ve raised a comfortable round recently, you might need to run a tight ship.
Running your own hardware is expensive. Hosted inference solutions scale with demand, which is great. Tokens can be cheap or expensive, based on the model’s capabilities. Some tasks need complex reasoning, while others can be done with less. Some prompts need a lot of details, some don’t. Figuring out the right balance requires a certain playfulness early on in the development process. Balancing different model characteristics can determine how feasible the business plan is down the road.
It’s important to understand the costs you will incur at different levels of scale. If it’s impossible to achieve positive unit economics, you’ll have a hard time even in the best possible scenario. The bigger the role AI plays in delivering your product, the more your pricing needs to reflect the cost of running it. Traditional compute resources have become commoditized and allow companies to operate on incredibly high margins. Quantization and other efficiency improvements for models will eventually drive resource prices down.
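A back-of-the-envelope calculation is often enough to see whether the unit economics can work. The sketch below assumes per-token pricing billed per 1,000 tokens, with separate rates for input and output. All names are hypothetical, and the prices in the usage note are placeholders, not real quotes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenPricing:
    """Price per 1,000 tokens, in your billing currency (placeholder values)."""

    input_per_1k: float
    output_per_1k: float


def cost_per_request(pricing: TokenPricing, input_tokens: int, output_tokens: int) -> float:
    """Inference cost of a single request."""
    return (
        input_tokens / 1000 * pricing.input_per_1k
        + output_tokens / 1000 * pricing.output_per_1k
    )


def monthly_margin_per_user(
    price_per_user: float, requests_per_user: int, request_cost: float
) -> float:
    """What is left of a subscription after paying for inference alone."""
    return price_per_user - requests_per_user * request_cost
```

For example, with placeholder rates of 0.01 for input and 0.03 for output, a request with 1,500 input tokens and 500 output tokens costs about 0.03. A user paying 10 per month and sending 200 such requests leaves a margin of roughly 4 before any other costs.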
All AI-enabled products I’ve recently worked on sounded simple in principle but required a surprising amount of hardening and preparation for production readiness before launching. Knowing the necessary measures to run and scale a system is as important as designing the product itself.
It’s still early, and operational knowledge is only now being built and iterated on. There are few absolute truths and most teams are learning on a daily basis. Some critics have pointed out that prompt engineering often lacks the engineering part.
As with any new paradigm, teams will slowly try out Generative AI and iterate until they’re confident about delivering a high-quality experience to their customers. It may make sense to build internal tools first and start in the background, then expand to adding AI features to the product. Then, some companies are working at the cutting edge and building their entire business on top of Generative AI.
I’m convinced that the community will share their knowledge and work together so that every team can eventually benefit from the progress made in the biggest companies.
While it’s easy to put together a demo application consuming an API, building production-ready AI-enabled products requires many building blocks. Just as operational knowledge is still being developed, solid tools for building LLM applications have yet to emerge.
From tailoring prompts to analyzing model usage, tracking costs, monitoring errors, and evaluating consistency, the entire LLM lifecycle needs to be covered by the LLMOps movement. It’s already really easy to consume open-source models through APIs and hosted inference endpoints. Frameworks like Haystack and LangChain make it easier to bring your own data, and I imagine more will follow in the months to come.
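Until dedicated tooling matures, even a thin wrapper around the model call covers the basics of usage, latency, and error tracking. This is a minimal in-memory sketch. It assumes your client function returns the response text together with a token count, and `fake_model` is a stand-in I made up so the example runs without any provider.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class UsageLog:
    """Running totals across all tracked model calls."""

    calls: int = 0
    errors: int = 0
    total_tokens: int = 0
    total_seconds: float = 0.0


def tracked(
    call_model: Callable[[str], tuple[str, int]], log: UsageLog
) -> Callable[[str], str]:
    """Wrap a model call so every invocation updates the usage log."""

    def wrapper(prompt: str) -> str:
        started = time.perf_counter()
        log.calls += 1
        try:
            text, tokens = call_model(prompt)
        except Exception:
            log.errors += 1
            raise
        finally:
            # Latency is recorded for failed calls as well.
            log.total_seconds += time.perf_counter() - started
        log.total_tokens += tokens
        return text

    return wrapper


def fake_model(prompt: str) -> tuple[str, int]:
    """Stand-in for a real client: echoes the prompt and counts words as tokens."""
    return prompt.upper(), len(prompt.split())
```

Wrapping a client with `tracked(fake_model, UsageLog())` returns a function you call like the original. In a real system the totals would go to your metrics backend instead of living in memory.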
Depending on your customers, offering a product isn’t as straightforward as plugging into the OpenAI APIs. When you need to guarantee that customer data never leaves the jurisdiction, you’ll have to resort to regional service providers or host the models yourself. The latter is quite expensive, but if you’re aiming for government contracts, trust is an important currency.
The easier it becomes to run models with a smaller resource footprint, the easier it gets to run inference at the edge where it’s needed. Moving all components closer together also decreases latency, so customers will enjoy a snappier experience.
Most current service providers are US-focused and don’t sufficiently cover other geographic areas and jurisdictions. There’s a reason why Germany is seeing a ton of startups offering the same services available abroad for a GDPR-conscious enterprise sector. This may not be the best moat, but it’s certainly a market waiting to be served.
I’m hardly the first person who’s been thinking about running Generative AI in production. There’s a lot of hype and people expect LLMs to play a huge role in the years to come. To make this happen, we need to ensure that systems are reliable, consistent, cost-effective, and safe. Coming up with features to solve problems is one part of the solution, but proper execution requires solid software engineering. LLMs aren’t just used by researchers in labs anymore and the ecosystem has to adapt to this new reality.