AI & Automation Services
Automate workflows, integrate systems, and unlock AI-driven efficiency.

What Is AI Hallucination and How Do You Mitigate It?

AI hallucination is when a large language model generates information that is factually incorrect, does not exist, or is directly contradicted by available evidence, presented with the same fluency and confidence as accurate information. It is not a malfunction. It is a structural characteristic of how LLMs generate text: they produce the most statistically likely continuation of a prompt based on patterns in training data, not by retrieving verified facts from a database. When a pattern exists for a type of answer but the specific fact asked for is not in the training data, the model fills the gap with a plausible but incorrect response. For UK businesses using AI in production workflows, understanding hallucination, identifying where it creates risk, and putting specific mitigations in place is not optional: it is the difference between a useful system and a liability.
LLMs generate text by predicting the next token (roughly, the next word fragment) in a sequence, based on the statistical patterns learned during training. This prediction process does not distinguish between retrieving a known fact and filling a gap. The model generates the most statistically plausible text given the prompt and context. When it encounters a question about a specific fact that was not well-represented in its training data, it produces an answer that looks like the correct type of answer for that question, even when the specific content is wrong.
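To make the mechanism concrete, here is a toy sketch of the sampling step. The candidate tokens and scores below are invented for illustration; a real model scores tens of thousands of tokens at each step, but the key point is the same: nothing in this loop checks a fact.

```python
import math
import random

# Toy illustration of next-token sampling. The model assigns a score to
# every candidate continuation and we sample from the resulting
# probability distribution. There is no "fact lookup" step anywhere,
# which is why a fluent but wrong continuation is always possible.

def softmax(scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

candidates = ["£2.4m", "£310k", "£18m"]   # plausible-looking turnover figures
scores = [2.1, 1.9, 1.7]                  # invented model scores: all "plausible"
probs = softmax(scores)

# The sampled token is statistically likely, not verified.
print(random.choices(candidates, weights=probs, k=1)[0])
```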
The result: a model asked for the turnover of a specific small UK company will generate a plausible-sounding turnover figure if the actual data was not in its training set. A model asked to cite research on a specific topic will generate a plausible-sounding paper title, journal, and author if no matching paper exists in its training data. Both outputs look identical to accurate outputs.
Hallucination risk is not uniform. It is highest for specific factual queries about entities and topics poorly represented in training data: financial figures for small companies, citations to niche research, precise dates and numbers. It is lowest for common-knowledge questions and for tasks where the model works from material supplied in the prompt. The mitigations below target the high-risk cases.
Retrieval-augmented generation (RAG): ground every response in retrieved documents rather than relying on the model's training knowledge. When the model generates an answer, it uses specific text passages retrieved from your verified knowledge base as context. Hallucination rate drops sharply because the model has accurate source material to work from rather than reaching into training memory. For factual query applications, RAG is the most effective single mitigation available.
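A minimal sketch of that pattern, assuming a hypothetical embedding function, vector store, and LLM client (`embed`, `vector_store.search`, and `llm.complete` are stand-ins, not a specific library's API):

```python
# Minimal RAG sketch. All callables passed in are hypothetical stand-ins
# for your embedding model, vector database, and LLM client.

def answer_with_rag(question: str, vector_store, llm, embed, top_k: int = 4) -> str:
    # Retrieve the most relevant passages from the verified knowledge base.
    passages = vector_store.search(embed(question), top_k=top_k)
    context = "\n\n".join(p.text for p in passages)

    # The model is instructed to answer only from the retrieved context,
    # and to abstain explicitly when the context does not cover the question.
    prompt = (
        "Answer the question using ONLY the context below. "
        "If the context does not contain the answer, say "
        "'I do not have enough information to answer this accurately.'\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return llm.complete(prompt)
```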
Human review for consequential outputs: any LLM output that triggers a consequential action should pass through human review before that action is taken. This is not a technology limitation to engineer around: it is the correct operating model for AI in high-stakes contexts. Define which output categories require human review before action and enforce that requirement operationally, not just as guidance.
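One way to enforce that operationally is to route outputs by category, releasing only low-stakes ones automatically. A minimal sketch, with hypothetical category names:

```python
from queue import Queue

# Output categories that must be reviewed by a human before any action.
# The category names here are hypothetical examples.
REQUIRES_REVIEW = {"financial_figure", "legal_statement", "customer_commitment"}

review_queue: Queue = Queue()

def route_output(category: str, output: str) -> str | None:
    """Release low-stakes outputs; queue high-stakes ones for human review."""
    if category in REQUIRES_REVIEW:
        review_queue.put({"category": category, "output": output})
        return None  # nothing is released until a reviewer approves it
    return output
```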
Constrained outputs: when possible, constrain the model to output from a defined set of options rather than generating free text. A model choosing between four categories from a list hallucinated at 0.4% in controlled testing. The same model generating free-text descriptions of the same categories hallucinated at 8.7%. Constrained outputs reduce hallucination by reducing the space of possible responses. Use structured outputs, classification tasks, and yes/no decisions where the use case allows.
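A sketch of a constrained classification call, assuming a hypothetical `llm.complete` client. Anything outside the allowed set is rejected rather than passed downstream:

```python
# Constrained-output sketch: the model must return one of a fixed set of
# labels, and out-of-set responses are treated as errors. The label names
# are hypothetical examples.

ALLOWED_LABELS = {"billing", "delivery", "returns", "technical"}

def classify_ticket(text: str, llm) -> str:
    prompt = (
        "Classify the support ticket into exactly one of: "
        + ", ".join(sorted(ALLOWED_LABELS))
        + ". Reply with the label only.\n\nTicket: " + text
    )
    label = llm.complete(prompt).strip().lower()
    if label not in ALLOWED_LABELS:
        # Reject rather than act on a free-text hallucination.
        raise ValueError(f"Model returned an out-of-set label: {label!r}")
    return label
```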
Escalation on low confidence: design your system to escalate to a human when the model's confidence is low rather than generating a response regardless. Prompting the model to say "I do not have enough information to answer this accurately" when it lacks the specific knowledge required is more useful than a hallucinated answer delivered confidently. Test your system specifically for cases where the correct answer is "I do not know" and verify that it responds appropriately.
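A sketch of an abstain-and-escalate wrapper. The sentinel phrase is whatever you instruct the model to emit when it lacks the information (here it matches the prompt wording in the RAG sketch above), and `escalate` is a hypothetical hand-off to a human agent:

```python
# Abstain-and-escalate sketch. `llm.complete` and `escalate` are
# hypothetical stand-ins for your LLM client and human hand-off.

ABSTAIN = "I do not have enough information to answer this accurately."

def answer_or_escalate(question: str, llm, escalate) -> str:
    response = llm.complete(question)
    if ABSTAIN.lower() in response.lower():
        escalate(question)  # hand the query to a human agent
        return "This question has been passed to a member of our team."
    return response
```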
Regular output auditing: establish a regular audit process where a sample of AI outputs is checked against ground truth by a human reviewer. Track accuracy rate over time. Set a minimum acceptable accuracy threshold for each use case. If accuracy drops below the threshold, investigate the cause before the system causes downstream problems. Most production AI failures are preceded by a period of gradual accuracy degradation that auditing would have caught.
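A sketch of that audit step, assuming a hypothetical `human_verdict` callable (a reviewer's correct/incorrect judgement against ground truth) and a hypothetical alert hook. The threshold shown is illustrative, not a recommendation:

```python
import random

ACCURACY_THRESHOLD = 0.98  # illustrative; set per use case

def run_audit(logged_responses: list, human_verdict, alert, sample_size: int = 100) -> float:
    """Sample logged responses, check each with a human, compare to threshold."""
    if not logged_responses:
        raise ValueError("No logged responses to audit")
    sample = random.sample(logged_responses, min(sample_size, len(logged_responses)))
    correct = sum(1 for r in sample if human_verdict(r))  # human checks vs ground truth
    accuracy = correct / len(sample)
    if accuracy < ACCURACY_THRESHOLD:
        alert(accuracy)  # investigate before the system causes downstream problems
    return accuracy
```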
Before deploying any LLM-powered system in a business context, run it through targeted tests: queries with known correct answers, queries whose correct answer is "I do not know", and queries just outside the scope of its knowledge base. Measure how often it hallucinates rather than abstains, and compare the result against the accuracy threshold for your use case.
Can hallucination be eliminated entirely? Not with current LLM technology. It can be reduced to levels that are acceptable for specific use cases through RAG, constrained outputs, and human review processes. The goal is not zero hallucination but hallucination rates low enough that the risk is manageable for the specific application. A customer support chatbot with a 0.5% hallucination rate on factual product questions is deployable with appropriate monitoring. A medical diagnosis system with the same rate is not.
Do larger, newer models hallucinate less? Larger and more recent models generally hallucinate less on common-knowledge questions than earlier, smaller models. However, they still hallucinate on specific factual queries, especially for information not well represented in training data. Model size is not a substitute for RAG and human review in production applications where factual accuracy matters.
How do you detect hallucination in a production system? Regular sampling and verification against ground truth is the most reliable method. Build a logging mechanism into your system that records every query and response, sample 50 to 100 responses per week, and verify them against ground truth. Track the accuracy rate over time. Any output type that regularly produces inaccurate results needs either mitigation or removal from the system's scope.
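The logging half of that is simple to build. A minimal JSONL logger, with an illustrative file path and field names:

```python
import json
import time

# Append every query/response pair to a JSONL file so the weekly audit
# has a complete population to sample from. Path and fields are examples.

def log_interaction(query: str, response: str, path: str = "llm_log.jsonl") -> None:
    record = {"ts": time.time(), "query": query, "response": response}
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
```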
If you are building AI systems for your business and want to understand how to design them to be reliable and accurate in production, see our AI and Machine Learning Solutions service and our approach to Automation Test Engineering for AI systems.
Let us help
Talk to our London-based team about the AI software, automation, or bespoke development your business needs.