AI & Automation Services
Automate workflows, integrate systems, and unlock AI-driven efficiency.



AI hallucination in enterprise applications is not a rare edge case. In a production environment handling 10,000 queries per day, even a 2% hallucination rate produces 200 incorrect outputs daily. In customer-facing applications, a fraction of those reach customers before detection. In internal applications, they influence decisions. In regulated applications, they create compliance risk. This guide covers the specific testing methodology and production mitigations we use across enterprise AI deployments to keep hallucination at acceptable levels for each application type.
Before testing, define what acceptable accuracy looks like for your specific application; the tolerable hallucination rate varies significantly by context.
Build a test set of 200 to 500 questions for which you have verified correct answers from your knowledge base or domain documentation. Run the AI system against the full test set and calculate the accuracy rate. This establishes the baseline performance on questions the system should be able to answer correctly. Target: at least 5 percentage points above your defined acceptable threshold before proceeding.
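A minimal sketch of this harness, assuming a hypothetical ask_model wrapper around your deployed system and a naive exact-match grader (most teams substitute an LLM-as-judge or human grading for grade); the threshold and margin values are illustrative:

```python
import json

ACCEPTABLE_THRESHOLD = 0.90  # illustrative; set from your own definition
REQUIRED_MARGIN = 0.05       # the 5-percentage-point headroom described above

def ask_model(question: str) -> str:
    """Hypothetical wrapper around your deployed AI system."""
    raise NotImplementedError

def grade(answer: str, expected: str) -> bool:
    """Naive exact-match grading; replace with an LLM judge or human review."""
    return answer.strip().lower() == expected.strip().lower()

def run_baseline(test_set_path: str) -> float:
    # Test set: one JSON object per line with "question" and "expected" keys.
    with open(test_set_path) as f:
        cases = [json.loads(line) for line in f]
    correct = sum(grade(ask_model(c["question"]), c["expected"]) for c in cases)
    accuracy = correct / len(cases)
    print(f"Baseline accuracy: {accuracy:.1%} over {len(cases)} questions")
    assert accuracy >= ACCEPTABLE_THRESHOLD + REQUIRED_MARGIN, \
        "Accuracy lacks the 5-point margin above threshold; do not proceed."
    return accuracy
```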
Build a test set of 100 questions that the AI system should not be able to answer correctly because they fall outside its knowledge base. Include questions about topics not in the training data, questions about events after the model's training cutoff, questions about specific proprietary information the system was not given, and questions built on deliberately false premises. The system should acknowledge that it cannot answer accurately rather than generate a plausible but fabricated response. The failure rate on this test is often higher than expected, revealing a tendency to fabricate rather than abstain.
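A sketch of the abstention check, reusing the hypothetical ask_model wrapper from the baseline sketch; the keyword-based refusal detector is a crude stand-in for what is usually a classifier or LLM judge in practice:

```python
REFUSAL_MARKERS = (
    "i don't know", "i do not know", "cannot answer",
    "no information", "outside my knowledge", "not able to answer",
)

def abstained(answer: str) -> bool:
    """Crude heuristic: did the model acknowledge it cannot answer?"""
    lowered = answer.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)

def run_abstention_test(unanswerable_questions: list[str]) -> float:
    # Every question here is unanswerable by design, so any response
    # that does not abstain is counted as a fabrication.
    fabrications = sum(not abstained(ask_model(q)) for q in unanswerable_questions)
    rate = fabrications / len(unanswerable_questions)
    print(f"Fabrication rate on unanswerable questions: {rate:.1%}")
    return rate
```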
Deliberately attempt to elicit hallucinations through specific prompt strategies: asking for citations of sources that do not exist, asking for statistics about topics where no data was provided, asking about specific individuals or companies where only general information is available, and asking leading questions that contain false premises. Record the percentage of adversarial prompts that produce hallucinated responses versus appropriate abstentions or corrections.
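One way to record those results is to tag each adversarial prompt with the strategy that produced it and break failure rates out per category; the strategy names below are illustrative, and the sketch reuses ask_model and abstained from the earlier sketches:

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class AdversarialCase:
    prompt: str
    strategy: str  # e.g. "fake_citation", "unsupported_stats", "false_premise"

def run_adversarial(cases: list[AdversarialCase]) -> dict[str, float]:
    failures: Counter = Counter()
    totals: Counter = Counter()
    for case in cases:
        totals[case.strategy] += 1
        # This sketch counts every non-abstaining response as a failure;
        # responses that explicitly correct a false premise should pass,
        # which requires a human or LLM judge rather than this heuristic.
        if not abstained(ask_model(case.prompt)):
            failures[case.strategy] += 1
    return {s: failures[s] / totals[s] for s in totals}
```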
Collect the 50 most ambiguous or edge-case queries from your support history or predicted user behaviour. Test how the system handles queries where the correct answer is nuanced, where multiple answers could be partially correct, or where the right response is to ask a clarifying question rather than answer immediately. Evaluate whether the system's handling of ambiguity is appropriate for your context.
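Because these queries rarely have a single correct answer, it can help to record the expected behaviour rather than an expected string; a hypothetical structure for the reviewer-graded test cases:

```python
from dataclasses import dataclass
from enum import Enum

class ExpectedBehaviour(Enum):
    ANSWER_WITH_NUANCE = "answer_with_nuance"            # caveats required
    ASK_CLARIFYING_QUESTION = "ask_clarifying_question"  # don't answer yet
    PRESENT_ALTERNATIVES = "present_alternatives"        # several partial answers

@dataclass
class AmbiguousCase:
    query: str
    expected: ExpectedBehaviour
    reviewer_notes: str  # guidance for the human grading the response
```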
Run the full test suite against every significant update to the system: new knowledge base content, model version changes, system prompt changes, integration updates. A system that passes accuracy testing at launch can regress after updates. Automated regression testing that runs the standard test set on every deployment catches accuracy regression before it reaches production users.
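One way to wire this into a CI pipeline is a pytest gate that reruns the baseline harness on every deployment; run_baseline is the sketch from earlier and raises if accuracy falls below the required margin, which fails the build:

```python
# test_accuracy_regression.py -- illustrative CI gate, run on every deploy
from baseline_harness import run_baseline  # hypothetical module holding the earlier sketch

TEST_SET = "tests/data/baseline_questions.jsonl"  # hypothetical path

def test_accuracy_does_not_regress():
    # run_baseline raises AssertionError when accuracy drops below the
    # threshold-plus-margin, so a regression fails the pipeline here.
    run_baseline(TEST_SET)
```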
Retrieval-Augmented Generation is the most effective single mitigation for factual hallucination. Grounding every response in documents retrieved from a verified knowledge base significantly reduces the model's tendency to generate plausible-but-incorrect information from training memory. In our experience across production deployments, RAG-grounded systems hallucinate at one-third to one-fifth the rate of ungrounded systems on factual queries within the knowledge base scope.
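A minimal sketch of the grounding step, with hypothetical retrieve and generate functions standing in for your vector store and LLM provider; the key move is instructing the model to answer only from the retrieved passages and to abstain otherwise:

```python
def retrieve(question: str, k: int) -> list[str]:
    """Hypothetical: top-k passages from your verified knowledge base."""
    raise NotImplementedError

def generate(prompt: str) -> str:
    """Hypothetical: a call to your LLM provider."""
    raise NotImplementedError

def answer_with_rag(question: str, k: int = 4) -> str:
    passages = retrieve(question, k=k)
    context = "\n\n".join(f"[{i}] {p}" for i, p in enumerate(passages, 1))
    prompt = (
        "Answer the question using ONLY the numbered passages below, and "
        "cite the passage numbers you used. If the passages do not contain "
        "the answer, say you cannot answer from the available documentation.\n\n"
        f"Passages:\n{context}\n\nQuestion: {question}"
    )
    return generate(prompt)
```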
For applications where the cost of a wrong answer is high, implement confidence gating: responses below a defined confidence threshold are routed to human review rather than delivered directly. This requires a confidence estimation layer, either from the model itself (using chain-of-thought reasoning to evaluate its own confidence before responding) or from a separate classifier trained to predict when the primary model is likely to be wrong.
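A sketch of the gate itself; estimate_confidence and enqueue_for_review are hypothetical stand-ins for whichever confidence layer and review queue you use, and the threshold is illustrative:

```python
CONFIDENCE_THRESHOLD = 0.8  # illustrative; raise it for high-stakes applications

def deliver_or_escalate(question: str) -> dict:
    answer = answer_with_rag(question)  # grounded answer from the RAG sketch
    confidence = estimate_confidence(question, answer)  # hypothetical estimator
    if confidence < CONFIDENCE_THRESHOLD:
        # Low-confidence answers go to a human review queue, not the user.
        enqueue_for_review(question, answer, confidence)  # hypothetical queue
        return {"status": "pending_review", "confidence": confidence}
    return {"status": "delivered", "answer": answer, "confidence": confidence}
```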
Requiring the model to cite the specific source passage it used to generate each response creates two benefits: it forces the model to ground its response in retrieved content, and it gives human reviewers a fast way to verify the response against the source. Any response that cannot be grounded in a specific source passage should either not be delivered or be clearly marked as the model's general knowledge rather than verified information.
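A sketch of the verification side, assuming the "[n]" citation markers requested in the RAG prompt above; any answer with no valid citation is treated as ungrounded:

```python
import re

CITATION_PATTERN = re.compile(r"\[(\d+)\]")  # assumes "[n]" markers in answers

def is_grounded(answer: str, passages: list[str]) -> bool:
    """True only if the answer cites at least one retrieved passage and
    every citation refers to a passage that was actually retrieved."""
    cited = {int(n) for n in CITATION_PATTERN.findall(answer)}
    if not cited:
        return False  # uncited answers are withheld or flagged, per above
    return all(1 <= n <= len(passages) for n in cited)
```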
Sample 50 to 100 production interactions weekly for human review. Track accuracy over time. Set an alert threshold: if accuracy in the weekly sample drops below the acceptable threshold for two consecutive weeks, halt new deployments and investigate the cause before the accuracy decline affects a larger proportion of users. Most accuracy declines in production are caused by knowledge base staleness (the world changed and the documents were not updated) rather than model degradation.
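A sketch of the sampling and alerting logic, assuming production interactions are logged in a queryable form; the sample size sits inside the 50-to-100 band and the two-week rule checks the last two weekly accuracy figures:

```python
import random

SAMPLE_SIZE = 75             # within the 50-100 weekly review band
ACCEPTABLE_THRESHOLD = 0.90  # illustrative; use your own defined threshold

def weekly_review_sample(this_weeks_interactions: list[dict]) -> list[dict]:
    """Random sample of the week's interactions for human review."""
    n = min(SAMPLE_SIZE, len(this_weeks_interactions))
    return random.sample(this_weeks_interactions, n)

def should_halt_deployments(weekly_accuracy: list[float]) -> bool:
    """True if the last two weekly review samples both fell below threshold."""
    return (
        len(weekly_accuracy) >= 2
        and all(a < ACCEPTABLE_THRESHOLD for a in weekly_accuracy[-2:])
    )
```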
Active sampling with human review is the most reliable method. Passive detection through user feedback catches only the hallucinations that users notice and bother to report, which is a small fraction of the total. Implement a sampling protocol where a random selection of AI interactions is reviewed by a human evaluator weekly, regardless of whether users flagged issues. This systematic review catches the quiet failures that never surface through feedback channels.
A better model does not solve the problem on its own. Better models hallucinate less on common knowledge questions, but they still hallucinate on specific factual queries, particularly for information not well-represented in training data or for proprietary information they were not given. Model quality reduces the baseline hallucination rate but does not eliminate it. RAG, confidence gating, and human review remain necessary for production applications where accuracy matters.
To discuss how we design AI systems with hallucination mitigation built in from the start, see our AI and Machine Learning Solutions service and our Automation Test Engineering service.
Let us help
Talk to our London-based team about how we can build the AI software, automation, or bespoke development your project needs.