Unbowed, Unbent, Unbroken

How to overcome the problems you’ll encounter when working with Large Language Models.

A cheatsheet for teams working on their first LLM applications.

So, your company or organization is building some kind of application that involves a Large Language Model (LLM) and you’ve hit a roadblock.  This is normal.  Breathe out, take a moment and then scan the guide below.  It’s likely that your problem is one of these. 

The purpose of this article is to help point you in the direction of the solutions.  I’ve included links and references for further reading.  Some of them (like RLHF) are a bit finicky to get to work in practice.  In the future, I’ll write up some deep-dives into these areas.

Problem 1. Regulatory restrictions, IP concerns or privacy issues mean we cannot send data to an LLM provider.

You have two options: either you remove the worrisome material before you send your data out, or you never send it out at all.

If you take the former approach, take a look at libraries like Microsoft’s Presidio, which will capably redact many common forms of PII.  Additionally, ask a friendly data scientist to design some classifiers that will catch any forms of sensitive data which Presidio misses.  Alternatively, look at a specialist provider like PredictionGuard – who have put a lot of work into solving this problem already – and integrate their service into your solution.
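
To make the redaction step concrete, here’s a minimal sketch using Presidio’s analyzer and anonymizer engines (the engine names and default behaviour are Presidio’s; the example text and the exact redacted output are made up for illustration):

from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

text = "Contact Jane Doe on +44 7700 900123 about invoice 42."

# Detect PII entities (names, phone numbers, emails, etc.) ...
results = analyzer.analyze(text=text, language="en")

# ... then replace them with placeholders before the text leaves your network.
redacted = anonymizer.anonymize(text=text, analyzer_results=results)
print(redacted.text)   # e.g. "Contact <PERSON> on <PHONE_NUMBER> about invoice 42."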

The latter approach involves using an open source LLM and hosting it yourself.  (This is not necessarily as intimidating as it might sound; tools like SkyPilot will handle a lot of the complexity for you.)  At the time of writing, there are some incredibly powerful open source LLMs (e.g. Mixtral 8x7B) that will excel at all but the most complex of reasoning tasks.
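
If you go down the self-hosting route, the launch itself can be a few lines of SkyPilot’s Python API – something like the sketch below.  The model name, GPU type and the vLLM serving command are my own illustrative assumptions; SkyPilot’s YAML interface works just as well.

import sky

# Provision a GPU VM in your own cloud account and stand up an
# OpenAI-compatible server for an open-weights model.
task = sky.Task(
    setup="pip install vllm",
    run=("python -m vllm.entrypoints.openai.api_server "
         "--model mistralai/Mixtral-8x7B-Instruct-v0.1 --port 8000"),
)
task.set_resources(sky.Resources(accelerators="A100:2"))

sky.launch(task, cluster_name="my-llm-server")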

Problem 2.  I have a very domain-specific use case for my LLM.  For example, a code assistant that needs to work with custom libraries and schemas.

OK, so you probably need to use an open source LLM here.  You’ll also need to curate a decent dataset of examples spanning your specialist domain.  The technique you’ll want to understand is called either “continued pre-training” or “domain-adaptive pre-training” (Nvidia used this technique to train their ChipNeMo model).  This involves using a large volume of sample data and the standard next-token prediction task to teach your LLM the required new knowledge, for example how to code using your company’s bespoke libraries.
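
As a minimal sketch of what that looks like with the HuggingFace stack: the base model, corpus file name and hyperparameters below are placeholder assumptions, and a real run would typically add LoRA/PEFT and multi-GPU settings on top.

from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base_model = "mistralai/Mistral-7B-v0.1"     # placeholder open-weights base model
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # some tokenizers ship without a pad token

# Assumed input: a plain-text dump of your in-house code and documentation.
dataset = load_dataset("text", data_files={"train": "internal_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="domain_adapted", num_train_epochs=1,
                           per_device_train_batch_size=1, gradient_accumulation_steps=16,
                           learning_rate=1e-5, bf16=True),
    train_dataset=tokenized,
    # mlm=False => plain next-token prediction, i.e. continued pre-training.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()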

Problem 3. The LLM is not reliable enough.  It often makes mistakes and/or fails to answer questions correctly.

Much like teenagers, although LLMs are highly capable of following sophisticated reasoning patterns, they often completely fail to do so.  Prompt engineering is the art of guiding an LLM to structure its reasoning and responses in a manner appropriate to the problem at hand.  Prompt engineering techniques (and there are a lot of them out there!) are designed to enforce correct reasoning, ensure that answers match the stated requirements and reduce hallucinations.

You can find an excellent overview of prompt engineering techniques here, but this is one area where it’s worth staying close to the research from sources like arXiv or the HuggingFace daily papers.  On a monthly basis, new prompt-engineering techniques will emerge – often for specialist use cases – each seeking to squeeze an extra few percentage points of correct answers from your LLM application.  Build, iterate and improve.
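
To give one concrete example, here’s a simple “reason step by step, then check your working” prompt sent via the OpenAI client.  The model name and the exact prompt wording are my own assumptions – treat it as a starting template rather than the definitive technique.

from openai import OpenAI

client = OpenAI()

question = "A train leaves at 09:40 and the journey takes 2h 35m. When does it arrive?"

prompt = (
    "Answer the question below.\n"
    "First, reason step by step and show your working.\n"
    "Then check your working for mistakes.\n"
    "Finally, give the answer on a single line starting with 'ANSWER:'.\n\n"
    f"Question: {question}"
)

response = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{"role": "user", "content": prompt}],
    temperature=0,
)
print(response.choices[0].message.content)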

Problem 4. The LLM is not generating responses in the format that I want: e.g. I want it to be more formal, more succinct, more chatty, or to match a house style.

In contrast to Problem 2, the issue here is not that the LLM lacks any particular knowledge, but that its behaviour needs to be modified.  We can modify the behaviour of LLMs in two ways: with in-context learning (ICL) and with fine-tuning.

In-context learning simply involves offering one or more curated examples to the model at query-time, demonstrating how you’d like it to respond.  Every model has a context window that limits the amount of text you can give it, but these windows are getting extremely large (for GPT-4 Turbo it’s ~100k words; for Claude Opus it’s ~150k words; for Gemini Pro it’s ~750k words), so you can use a lot of examples.  ICL is quick to implement and very often will solve the problem.
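
In practice, ICL is just a prompt that includes worked examples.  A sketch of a house-style use case – the example replies and the model name are invented for illustration:

from openai import OpenAI

client = OpenAI()

# A couple of curated examples demonstrating the house style we want.
messages = [
    {"role": "system", "content": "Rewrite customer replies in our house style: formal, concise, no exclamation marks."},
    {"role": "user", "content": "Hey! Your parcel's on the way, should be with you soon :)"},
    {"role": "assistant", "content": "Your parcel has been dispatched and should arrive shortly."},
    {"role": "user", "content": "Sorry!! We messed up your order, we'll fix it asap!"},
    {"role": "assistant", "content": "We apologise for the error with your order; a correction is already underway."},
    # The real query goes last - the model imitates the pattern above.
    {"role": "user", "content": "Thanks so much for waiting, refund's done, money should land in a few days!"},
]

response = client.chat.completions.create(model="gpt-4-turbo", messages=messages)
print(response.choices[0].message.content)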

The downside of ICL is that you’re now paying to process many more tokens and the response times may slow appreciably.  The alternative is to use supervised fine-tuning (SFT) to teach the model what good responses look like.  You’ll need a minimum of a few hundred high-quality query/response examples (perhaps more; YMMV).  Since you’re actually adjusting the model weights during SFT, you’ll either need to be working with an open source LLM or using a commercial model that supports fine-tuning (e.g. GPT-3.5 Turbo).  However, once you’ve fine-tuned, you’ll be able to get away with few or no ICL examples.

Each SFT training example pairs a query (plus any additional context) with the ideal response, wrapped in your model’s chat template – for instance, in the Mistral/Llama instruct format:

<s>[INST]{your_instruction}\n{additional_context}\n[/INST]{best_response}</s>
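
A tiny helper for turning your curated pairs into that format might look like this (the function and field names are my own placeholders – match whatever template your chosen model was actually trained with):

def format_sft_example(instruction: str, context: str, response: str) -> str:
    # Wraps one curated query/response pair in the instruct template shown above.
    return f"<s>[INST]{instruction}\n{context}\n[/INST]{response}</s>"

print(format_sft_example(
    "Summarise the ticket below in one sentence.",
    "Customer reports the export button does nothing on Safari 17.",
    "The export button is unresponsive for a customer using Safari 17.",
))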

Problem 5.  Supervised fine-tuning is not working.  It’s too difficult to generate a large enough and wide enough variety of examples.

You’re not alone.  It’s very difficult to know that you’ve covered a wide enough range of model query/response pairs to cover your application.  It’s also expensive and time-consuming to write them.

For instance: in actual usage, you may often see requests with no analogues in your SFT dataset, and the model might not respond well in those situations.

OpenAI faced the same problem when developing the precursor to ChatGPT.  Their solution was a process called Reinforcement Learning from Human Feedback (RLHF).  The idea is that rather than have humans curate “ideal” responses, we allow the model to generate several responses to our queries (yes, we still need a large bank of example queries to make this work) and we tell it which one we prefer.  After doing this many times, we can train a “reward model” that learns which answer we are likely to prefer.  Now we can set the reinforcement learning algorithm off.  It will keep reading queries, generating responses and marking its own homework.  Over time, it should begin to reliably generate responses matching the required style and format.
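
To make the “reward model” part concrete: at its core it is usually trained with a simple pairwise (Bradley–Terry style) loss, pushing the score of the human-preferred response above the rejected one.  A minimal PyTorch sketch – the numbers are toy values standing in for the output of a reward head sitting on top of an LLM:

import torch
import torch.nn.functional as F

def reward_model_loss(chosen_scores: torch.Tensor, rejected_scores: torch.Tensor) -> torch.Tensor:
    # chosen_scores / rejected_scores: shape (batch,), one scalar reward per response.
    # Loss is small when the preferred response already scores higher than the rejected one.
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()

# Toy usage: three preference pairs from our bank of example queries.
chosen = torch.tensor([1.2, 0.7, 2.0])
rejected = torch.tensor([0.3, 0.9, -0.5])
print(reward_model_loss(chosen, rejected))   # lower when chosen > rejected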

RLHF is a fiddly technique and will be the subject of a forthcoming “lessons learned” blog (where I’ll explain everything I learned actually making it work in practice) but it is an extremely powerful tool for getting the “last mile” of behavioural adaptation from your LLM.  For now, a good overview of the technique can be found here.

Problem 6.  The LLM is supposed to be searching or referencing my content but often fails to locate the most relevant material.

You’re building a solution which uses Retrieval-Augmented Generation (RAG).  There’s likely one of two things going wrong here: either your application is not retrieving the right content, or the LLM is not picking out the relevant information from it.

A lot of ink has been spilled about resolving the former.  The essentials are as follows: RAG relies upon an embedding model (how queries and documents are vectorised) and, optionally, a reranker (which sorts matching documents by relevancy).  You can test different combinations of these components on your dataset using the handy Retrieval Evaluation module in the LlamaIndex library to find what works best for your data.  And if nothing does… you can always ask that friendly data scientist to fine-tune an embedding model for you.
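
If you want to see what such an evaluation boils down to, here’s a plain-Python sketch of the two most common retrieval metrics, hit rate and MRR.  (LlamaIndex’s Retrieval Evaluation module computes these – and more – for you; the toy data below is invented.)

def hit_rate_and_mrr(results):
    # results: list of (retrieved_doc_ids, expected_doc_id) pairs, one per test query.
    hits, reciprocal_ranks = 0, []
    for retrieved_ids, expected_id in results:
        if expected_id in retrieved_ids:
            hits += 1
            reciprocal_ranks.append(1.0 / (retrieved_ids.index(expected_id) + 1))
        else:
            reciprocal_ranks.append(0.0)
    return hits / len(results), sum(reciprocal_ranks) / len(results)

# Toy evaluation set: which document each test query *should* have retrieved.
results = [
    (["doc_7", "doc_2", "doc_9"], "doc_2"),   # found at rank 2
    (["doc_1", "doc_4", "doc_8"], "doc_8"),   # found at rank 3
    (["doc_3", "doc_5", "doc_6"], "doc_0"),   # missed entirely
]
print(hit_rate_and_mrr(results))   # (0.667, 0.278)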

If retrieval and reranking is not the problem – or if you’re not using RAG but are simply trying to get the LLM to extract information from a long document in the prompt – then you may be suffering from the “lost-in-the-middle” syndrome.  Even with long context windows, LLMs can often fail to extract key information when it appears in the middle of a document.  See this tweet for a good illustration of the problem.  The best solution here is to use some prompt-engineering: split the long context up into shorter blocks and feed them to the LLM one at a time, then combine the individual analyses in a final prompt.
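
A rough sketch of that split-and-combine pattern is below.  ask_llm is a hypothetical helper standing in for whatever client you use, and the chunk size is an assumption you’d tune for your model’s context window.

def split_into_blocks(document: str, block_size: int = 3000) -> list[str]:
    # Naive character-based chunking; in practice, split on paragraphs or sections.
    return [document[i:i + block_size] for i in range(0, len(document), block_size)]

def analyse_long_document(document: str, question: str, ask_llm) -> str:
    # 1. Ask the same question of each block individually ("map" step) ...
    partial_answers = [
        ask_llm(f"Using only the text below, answer: {question}\n\n{block}")
        for block in split_into_blocks(document)
    ]
    # 2. ... then combine the partial answers in one final prompt ("reduce" step).
    combined = "\n\n".join(partial_answers)
    return ask_llm(
        f"Here are partial answers to the question '{question}', "
        f"each based on a different section of a document:\n\n{combined}\n\n"
        "Combine them into a single, complete answer."
    )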

Problem 7.  It’s proving too expensive.  My application needs the power of a frontier model but the scale of requests I make pushes the price too high.

There’s a fierce price-war going on between the vendors of frontier models, with prices dropping periodically.  Still, for some use cases, you could end up making an awful lot of requests and the bills can rack up.  Some suggestions for tackling this:

The “Master / Apprentice” set-up.  (Don’t Google that term; I think I just made it up.  It does work, however!)  This works when you have a complex, multi-step reasoning task that involves making multiple queries to the LLM.  Essentially, you pass your initial, high-level query to a powerful LLM and have it write a “recipe” for resolving the query.  You then pass each step in the recipe to a smaller, cheaper LLM (it could be one you host yourself) and ask it to perform only that step.  You accumulate (and check) your interim results until the entire process has been completed.
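
In code, the pattern is roughly this.  call_llm is a hypothetical wrapper around your two model endpoints, and the model names are placeholders.

def solve_with_master_and_apprentice(query: str, call_llm) -> str:
    # 1. The expensive "master" model writes a numbered recipe for the task.
    recipe = call_llm(
        model="frontier-model",          # placeholder: your powerful, pricey LLM
        prompt=f"Break this task into short, numbered steps:\n{query}",
    )

    # 2. The cheap "apprentice" model executes one step at a time,
    #    seeing only that step plus the results accumulated so far.
    results = []
    for step in [s for s in recipe.splitlines() if s.strip()]:
        context = "\n".join(results)
        step_result = call_llm(
            model="small-cheap-model",   # placeholder: a smaller or self-hosted LLM
            prompt=f"Previous results:\n{context}\n\nPerform this step only:\n{step}",
        )
        results.append(step_result)      # (check/validate each interim result here)

    return results[-1]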

Caching.  RAG isn’t just for document search.  You can maintain a vector datastore of previous LLM responses to user queries and check it each time a new query is submitted.  If you find a close match, you could return the previous result.  (Obviously this only works if responses remain relevant over a period of time and if the LLM has not had to produce specific, query-related details as part of its response.)
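
A bare-bones version of that cache looks something like this.  embed is a hypothetical embedding function returning a numpy vector, call_llm is your existing LLM client, and the 0.95 similarity threshold is an assumption you’d tune.

import numpy as np

cached_queries: list[np.ndarray] = []   # embedding of each previously seen query
cached_responses: list[str] = []        # LLM response paired with each embedding

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def answer(query: str, embed, call_llm, threshold: float = 0.95) -> str:
    query_vec = embed(query)
    # Return a cached response if we've answered a near-identical query before ...
    for vec, response in zip(cached_queries, cached_responses):
        if cosine_similarity(query_vec, vec) >= threshold:
            return response
    # ... otherwise pay for a fresh LLM call and cache the result.
    response = call_llm(query)
    cached_queries.append(query_vec)
    cached_responses.append(response)
    return response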

Efficient Compute.  If you’re using an open source LLM, try hosting it on a service like Beam, where you can set your LLM to be idle when not being used.  If your application is used only periodically, you can get substantial cost-savings from a set-up like this.

Problem 8. My LLM is too slow.

Ah, this one’s caused me a few headaches in the past.  If you’re using a frontier LLM service then your options are limited: either reduce the size of your queries (fewer ICL examples, fewer RAG documents), reduce the number of queries (try to collapse a multi-step reasoning process into fewer steps) or instruct the model to produce more succinct answers.  If you’re using an open source LLM, you have more options.  Try any or all of the following:

  • Run an SFT or RLHF process to teach the model to produce more succinct responses.
  • Make sure you’re using vLLM (or similar) to serve your model (see the sketch after this list).
  • Quantize your model using the scheme best suited to your inference hardware.  (See here for a good overview).
  • Use a mixture-of-experts model (e.g. Mixtral-8x7B).  This increasingly popular architecture gives you the raw power of a larger model but only requires ~25% of the weights to be activated for any one inference task, offering a considerable increase in token throughput.
  • Reduce the top_p and top_k parameters that have been set when sampling from your model.  Reducing either of these directs the LLM to choose between a smaller set of potential next tokens.  This will increase token throughput but at the expense of reducing response diversity.
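
To tie a few of those bullets together, here’s a minimal vLLM serving sketch that also shows where top_p and top_k get set.  The model name and parameter values are illustrative assumptions; a quantized checkpoint plugs in the same way.

from vllm import LLM, SamplingParams

# Load an open-weights mixture-of-experts model with vLLM's optimised serving engine.
llm = LLM(model="mistralai/Mixtral-8x7B-Instruct-v0.1")

# Smaller top_p / top_k restrict the candidate next tokens (less diverse responses),
# and max_tokens keeps answers succinct.
sampling_params = SamplingParams(top_p=0.8, top_k=40, max_tokens=256)

outputs = llm.generate(["Summarise the benefits of quantization in two sentences."],
                       sampling_params)
print(outputs[0].outputs[0].text)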

So there it is: what I think are the eight most common roadblocks to developing an LLM application, and some pointers for resolving them.  I hope this helps somebody.