
AI hallucination in enterprise applications is when a model produces confident output that is factually wrong, ungrounded in the source data, or invented outright, and it is the single biggest reason UK AI projects stall after the pilot. Base rates are real: GPT-4o hallucinates roughly 0.7% on general knowledge, around 6.4% on legal questions, and 10 to 20% on medical reasoning, with some retrieval-augmented legal tools measured as high as 33%. Hallucinations cost businesses an estimated $67.4bn globally in 2024, and knowledge workers now spend about 4.3 hours a week verifying AI output. At Softomate we treat hallucination as an engineering defect with a measurable rate, not a quirk. We hold production systems below an agreed threshold (typically under 2% on the golden set for medium-risk use), gate every release through automated evals, and tie every control to UK ICO and Data (Use and Access) Act 2025 obligations. This guide shows exactly how.
Last updated: June 2026
An AI hallucination is any output a model presents as true that is not supported by the underlying facts or the provided source material. In an enterprise setting that definition needs sharpening, because "wrong answer" is too blunt to test against. We break enterprise hallucinations into four distinct failure types, and each one needs a different test and a different fix.
The most familiar is the factual hallucination: the model states something about the world that is simply false, such as quoting a VAT rate that does not exist or naming a regulation that was never passed. Then there is the faithfulness or grounding failure, which is the dangerous one for retrieval systems. Here the model has the correct document in front of it but still contradicts it or adds detail the source never contained. The third type is the instruction hallucination, where the system ignores an explicit constraint, for example returning advice after being told to refuse and escalate. The fourth is the citation hallucination, where the model fabricates a reference, a case number, a clause, or a URL that looks authoritative but does not exist.
Our honest view: the industry obsesses over factual hallucinations because they are easy to demonstrate, but in real enterprise deployments grounding and citation failures cause far more damage. A chatbot inventing a refund policy that contradicts your own knowledge base will lose a customer and create a compliance liability faster than a stray trivia error ever could. That is why our testing weights grounding failures more heavily.
| Failure type | What goes wrong | Typical business impact |
|---|---|---|
| Factual | States a real-world fact that is false | Misinformed customer, brand erosion |
| Faithfulness / grounding | Contradicts or embellishes the source document | Compliance breach, wrong policy quoted |
| Instruction | Ignores guardrails or refusal rules | Unauthorised advice, scope creep |
| Citation | Invents references, clauses, or links | Legal exposure, lost trust, audit failure |
Classifying every failure into one of these four buckets is the foundation of the whole programme. You cannot reduce a number you refuse to define, and a single blended "accuracy" figure hides exactly the failures that hurt most. When we build an AI chatbot for a London business, the first artefact we produce is a labelled taxonomy of what counts as a hallucination for that specific use case.
Hallucinations happen because large language models are probabilistic next-token predictors, not databases, so they always generate a plausible-sounding answer even when they have no grounded knowledge to draw on. The UK ICO's January 2026 report on agentic AI made exactly this point, flagging that probabilistic models are inherently prone to hallucination and that organisations cannot treat their outputs as deterministic. Understanding the mechanism matters, because most production hallucinations are not random: they cluster around predictable conditions you can engineer against.
The first cause is weak retrieval. In a retrieval-augmented system, if the chunk fed to the model does not contain the answer, the model fills the gap from its training memory and confidently invents an answer. We have measured this directly: improving retrieval relevance from mediocre to strong on a client knowledge base cut grounding failures by more than half before we touched the prompt. The second cause is data drift. Your knowledge base ages, prices change, policies update, and a system that was accurate at launch quietly degrades as the world moves on but the index does not.
The third cause is ambiguous prompts. When a user asks a vague or compound question, the model picks an interpretation and runs with it, and that interpretation is often wrong. The fourth, and most underestimated, is silent model updates. When a provider ships a new model version behind the same API, your carefully validated behaviour can change overnight without any code change on your side.
The honest rule we give every client: be sceptical of any vendor who claims their model "does not hallucinate". None of them are immune. The question is never whether a system can hallucinate, only how often, under what conditions, and whether you would catch it before it reached a customer. That reframing, from elimination to measured control, is what separates a serious deployment from a demo.
You test for hallucinations before release with a three-tier strategy: automated evaluation against a labelled golden set, adversarial red-teaming, and human expert review of high-risk outputs. No single tier is sufficient on its own, and we run all three on every system before it touches a real user. The aim is a measured hallucination rate with a confidence interval, not a vague sign-off that "it seemed to work in the demo".
The golden set is the centrepiece. It is a curated collection of representative questions, each paired with the correct, source-grounded answer and the document that justifies it. We build it from real historical queries where possible, then expand it with synthetic and adversarial variants. A useful golden set for a medium-complexity enterprise assistant runs to 150 to 400 labelled items, covering common queries, known edge cases, out-of-scope questions that should be refused, and deliberately misleading prompts designed to bait the model into inventing.
| Golden-set field | Example value |
|---|---|
| id | gs-0142 |
| question | "What is your refund window for damaged goods?" |
| expected_answer | "14 days from delivery for damaged items." |
| source_doc | returns-policy-v4.pdf, clause 3.2 |
| category | policy / grounded |
| risk_tier | medium |
| should_refuse | false |
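A minimal sketch of how one golden-set record could be represented in code, using the fields from the table above. The class name `GoldenItem`, the set of allowed tier names, and the validation rules are illustrative assumptions, not a fixed schema.

```python
from dataclasses import dataclass

# Assumed tier names, matching the risk tiers used later in this guide.
RISK_TIERS = {"low", "medium", "high", "critical"}


@dataclass(frozen=True)
class GoldenItem:
    """One labelled golden-set record (hypothetical schema)."""
    id: str
    question: str
    expected_answer: str
    source_doc: str
    category: str
    risk_tier: str
    should_refuse: bool = False

    def __post_init__(self):
        if self.risk_tier not in RISK_TIERS:
            raise ValueError(f"unknown risk tier: {self.risk_tier!r}")
        # Assumption: an answerable item must name the document that justifies it.
        if not self.should_refuse and not self.source_doc:
            raise ValueError(f"{self.id}: answerable items need a source_doc")


example = GoldenItem(
    id="gs-0142",
    question="What is your refund window for damaged goods?",
    expected_answer="14 days from delivery for damaged items.",
    source_doc="returns-policy-v4.pdf, clause 3.2",
    category="policy / grounded",
    risk_tier="medium",
)
```

Validating at construction time means a mislabelled item fails when the set is loaded rather than skewing a metric later.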
Tier one runs every golden-set item through the system automatically and scores each answer with a rubric. We score on four axes: factual correctness, grounding (is every claim traceable to the source), instruction compliance (did it refuse when it should), and citation validity. Each axis is scored by a combination of exact-match checks where possible and an LLM-as-judge model where the answer is free text, with a sample of judge outputs verified by a human to keep the judge itself honest.
Tier two is adversarial red-teaming. A human or an automated prompt generator deliberately tries to break the system: leading questions, false premises, prompt injection, requests for information outside the knowledge base, and emotionally manipulative framing. This is where you find the failures your polite golden set never surfaces. Tier three is expert human review, reserved for high-risk domains, where a qualified person reviews a sample of outputs against professional standards.
Our scoring rubric in practice:
We treat the hard-fail rate as the headline hallucination metric. Allocate 30 to 40% of total AI project time to this testing and mitigation work. Teams that skimp here ship fast and then spend the next quarter firefighting trust incidents, which is slower and far more expensive.
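As a rough illustration of turning per-item rubric scores into a hard-fail rate with a confidence interval, the sketch below assumes that an item hard-fails when it fails any one of the four axes. That definition, the `AxisScores` name, and the choice of a Wilson score interval are assumptions made for the example.

```python
import math
from dataclasses import dataclass


@dataclass
class AxisScores:
    """Pass/fail result on each of the four rubric axes for one answer."""
    factual: bool
    grounded: bool
    instruction: bool
    citation: bool

    def hard_fail(self) -> bool:
        # Assumption: failing any single axis counts as a hard fail.
        return not (self.factual and self.grounded
                    and self.instruction and self.citation)


def wilson_interval(failures: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    if n == 0:
        raise ValueError("need at least one scored item")
    p = failures / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def hard_fail_rate(results: list[AxisScores]) -> tuple[float, float, float]:
    """Return (observed rate, interval low, interval high)."""
    failures = sum(r.hard_fail() for r in results)
    low, high = wilson_interval(failures, len(results))
    return failures / len(results), low, high
```

With 4 hard fails in 200 items the observed rate is 2%, but the 95% interval runs from roughly 0.8% to 5.0%, which is why the size of the golden set matters when a threshold sits close to the measured figure.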
You monitor hallucination in production by continuously sampling live traffic, scoring a representative slice of it against the same rubric used pre-release, and alerting when any metric drifts beyond its threshold. Pre-release testing tells you the system was safe at launch; only live monitoring tells you it still is. Because scoring every request is expensive, we sample intelligently rather than exhaustively.
Our sampling policy scores 100% of high-risk requests (anything touching money, legal, health, or account changes), a fixed percentage of medium-risk traffic, and a smaller random sample of low-risk chat. On top of that we score any request the user flagged, any answer the model itself expressed low confidence about, and any conversation that ended in escalation or abandonment. This concentrates evaluation budget where the damage would be greatest.
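The sampling policy above can be sketched as a single decision function. Only the 100% rate for high-risk traffic comes from the text; the medium and low rates, the `Request` fields, and the function name are hypothetical placeholders.

```python
import random
from dataclasses import dataclass

# Only the high-risk rate is stated in the text; the other two are placeholders.
SAMPLE_RATES = {"high": 1.0, "medium": 0.20, "low": 0.02}


@dataclass
class Request:
    risk_tier: str
    user_flagged: bool = False
    low_confidence: bool = False
    escalated_or_abandoned: bool = False


def should_score(req: Request, rng: random.Random) -> bool:
    """Decide whether a live request is sent to the scoring service."""
    # Signals that force scoring regardless of tier.
    if req.user_flagged or req.low_confidence or req.escalated_or_abandoned:
        return True
    # rng.random() is in [0, 1), so a rate of 1.0 always samples.
    return rng.random() < SAMPLE_RATES[req.risk_tier]
```

Passing the random generator in explicitly keeps the function deterministic under test, for example with `random.Random(0)`.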
| Metric | What it measures | Alert threshold (medium-risk) |
|---|---|---|
| Hard-fail hallucination rate | Confident wrong or ungrounded answers | Above 2% |
| Grounding / citation coverage | Share of claims traceable to source | Below 95% |
| Refusal correctness | Correct "I do not know" behaviour | Below 90% |
| Retrieval relevance | Quality of the retrieved context | Below 0.8 mean score |
| User flag rate | Human-reported bad answers | Above 1% |
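A minimal sketch of checking a metrics snapshot against the medium-risk thresholds in the table. The metric key names and the `breached` function are assumptions; the threshold values are taken from the table.

```python
import operator

# (metric key, comparison that triggers an alert, threshold) per the table above.
ALERT_RULES = [
    ("hard_fail_rate", operator.gt, 0.02),
    ("grounding_coverage", operator.lt, 0.95),
    ("refusal_correctness", operator.lt, 0.90),
    ("retrieval_relevance", operator.lt, 0.80),
    ("user_flag_rate", operator.gt, 0.01),
]


def breached(metrics: dict[str, float]) -> list[str]:
    """Return one message per metric that has drifted past its threshold."""
    alerts = []
    for name, trips, threshold in ALERT_RULES:
        value = metrics[name]
        if trips(value, threshold):
            alerts.append(f"{name}={value:.3f} breaches {threshold}")
    return alerts
```

An empty list means no alert; in practice the messages would be routed to whatever paging or chat channel the team already uses.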
We instrument all of this with observability tooling built for LLM systems. Langfuse, Helicone, Arize Phoenix and Galileo all let you trace a request end to end, log the retrieved context alongside the final answer, and run continuous evaluation scores against live traffic. Our default stack is Langfuse for tracing plus a custom scoring service, because it self-hosts cleanly inside a UK data boundary, which matters for clients with data residency requirements.
The most important and most neglected control is the CI/CD regression gate. Every change to the prompt, the retrieval config, the chunking, or the model version reruns the full golden set automatically, and the release is blocked if the hard-fail rate rises above the previous baseline. This is what stops a "small prompt tweak" silently reintroducing a class of hallucination you fixed three months ago. A simplified version of the gate logic we deploy:
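One way such a gate might look is sketched below. It is illustrative rather than the production code: the JSON report format, the `hard_fail_rate` key, and the function names are assumptions. The rule it implements is the one described above, namely that a release is blocked when the hard-fail rate rises above the previous baseline.

```python
import json
from pathlib import Path


def gate(current_rate: float, baseline_rate: float, tolerance: float = 0.0) -> bool:
    """Return True when the release may proceed."""
    return current_rate <= baseline_rate + tolerance


def main(current_path: str, baseline_path: str) -> int:
    """Compare two eval reports and return a process exit code (0 = pass)."""
    # Assumed report shape: {"hard_fail_rate": 0.014, ...}
    current = json.loads(Path(current_path).read_text())["hard_fail_rate"]
    baseline = json.loads(Path(baseline_path).read_text())["hard_fail_rate"]
    if gate(current, baseline):
        print(f"PASS: hard-fail rate {current:.2%} <= baseline {baseline:.2%}")
        return 0
    print(f"BLOCK: hard-fail rate {current:.2%} > baseline {baseline:.2%}")
    return 1
```

A CI job would call `main` with the paths of the fresh report and the stored baseline and pass the return value to `sys.exit`, so that a non-zero code fails the pipeline.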
We also watch for model drift explicitly. When a provider announces a new default model, we pin the old version, run the new one through the full eval in shadow, compare the two, and only switch once the new model passes. Treating the model as an unversioned dependency is one of the most common mistakes we see in enterprise AI automation projects, and it is entirely avoidable.
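A shadow comparison between the pinned model and a candidate can be sketched as a per-item diff over the golden set. The input shape (item id mapped to whether that item hard-failed) and the switching rule are assumptions for illustration; the text above only says the new model must pass before the switch.

```python
def shadow_compare(pinned: dict[str, bool], candidate: dict[str, bool]) -> dict:
    """Compare two eval runs; each maps golden-set id -> True if it hard-failed."""
    if pinned.keys() != candidate.keys():
        raise ValueError("both runs must cover the same golden-set items")
    n = len(pinned)
    if n == 0:
        raise ValueError("empty eval run")
    # Items that passed on the pinned model but fail on the candidate.
    regressions = sorted(i for i in pinned if candidate[i] and not pinned[i])
    # Items the candidate fixes.
    fixes = sorted(i for i in pinned if pinned[i] and not candidate[i])
    return {
        "pinned_rate": sum(pinned.values()) / n,
        "candidate_rate": sum(candidate.values()) / n,
        "regressions": regressions,
        "fixes": fixes,
        # Assumed rule: no worse overall and no newly failing items.
        "switch": sum(candidate.values()) <= sum(pinned.values()) and not regressions,
    }
```

Listing the regressed item ids, rather than only the two rates, shows which behaviour changed even when the headline number looks flat.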
The mitigation architecture that actually works is a layered one: strong retrieval-augmented generation as the foundation, disciplined prompt and context engineering on top, hard agent constraints around the edges, and structured input design at the user interface. No single technique is a silver bullet, and any vendor selling one is overselling. The compounding effect of several modest controls is what gets a system from "impressive demo" to "safe in production".
Retrieval-augmented generation, done properly, is the highest-leverage control. The key word is properly. Naive RAG that bolts a vector search onto a model often makes grounding worse by feeding irrelevant chunks. What works is hybrid retrieval that combines keyword and semantic search, clean and well-chunked source data, metadata filtering so the model only sees in-scope documents, and a validation layer that checks whether the retrieved context actually contains an answer before the model is allowed to respond. If it does not, the system refuses rather than guesses.
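The validation layer described above can be sketched as a thin wrapper around generation. The support check and the generator are injected as callables because the text does not say how either is implemented; the names `context_supports`, `generate`, and the refusal wording are all hypothetical.

```python
from typing import Callable, Sequence

# Hypothetical refusal wording.
REFUSAL = "I do not have enough information to answer that. Let me pass you to a colleague."


def answer_with_gate(
    question: str,
    chunks: Sequence[str],
    context_supports: Callable[[str, Sequence[str]], bool],
    generate: Callable[[str, Sequence[str]], str],
) -> str:
    """Only call the model when the retrieved context is judged to contain an answer."""
    # No retrieved chunks, or a failed support check, means refuse rather than guess.
    if not chunks or not context_supports(question, chunks):
        return REFUSAL
    return generate(question, chunks)
```

The property that matters is that `generate` is never invoked on the refusal path, so the model gets no opportunity to fill the gap from training memory.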
| Control layer | Technique | Primary failure type addressed |
|---|---|---|
| Retrieval | Hybrid search, clean chunking, metadata filters | Grounding, factual |
| Validation | "Answer present in context?" gate before responding | Grounding, citation |
| Prompt / context | Strict instructions, source quoting, refusal templates | Instruction, factual |
| Agent constraints | Tool whitelists, scope limits, output schemas | Instruction, scope creep |
| UX design | Structured inputs, suggested questions, confidence display | Ambiguity-driven failures |
| Human-in-loop | Escalation on low confidence or high risk | All - safety net |
On the prompt side, we instruct the model to answer only from the supplied context, to quote the source clause where it makes a factual claim, and to use an explicit refusal template when the context is insufficient. We require citations as structured output and then validate every citation against the actual source set, discarding any the model invented. This single check kills the citation-hallucination class almost entirely.
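A minimal sketch of the citation check, assuming each citation arrives as structured output with a document id and a quoted passage. The `Citation` shape and the whitespace-insensitive substring match are assumptions; a stricter or fuzzier matcher may suit some source sets better.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Citation:
    doc_id: str
    quote: str


def _normalise(text: str) -> str:
    # Lower-case and collapse whitespace so line breaks do not defeat the match.
    return " ".join(text.lower().split())


def validate_citations(citations, sources: dict[str, str]):
    """Split citations into (valid, rejected) against the real source texts."""
    valid, rejected = [], []
    for c in citations:
        body = sources.get(c.doc_id)
        # Reject unknown documents, empty quotes, and quotes not found in the source.
        if body is not None and c.quote.strip() and _normalise(c.quote) in _normalise(body):
            valid.append(c)
        else:
            rejected.append(c)
    return valid, rejected
```

Rejected citations are dropped before the answer is shown, and a non-empty rejected list is itself a useful signal to log against the citation axis of the rubric.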
For agentic systems, hard constraints matter more than clever prompting. We whitelist the tools an agent can call, cap the number of reasoning steps, enforce a JSON output schema, and run a final validation pass before anything reaches the user. When we build AI voice agents or automated business process flows, the agent is never trusted to self-police: a deterministic layer checks its output against business rules before any action is taken.
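The deterministic layer around an agent can be sketched with three checks: a tool whitelist, a step cap, and a schema check on the final output. The tool names, the cap of 6 steps, and the required fields are hypothetical values chosen for the example.

```python
import json

ALLOWED_TOOLS = {"lookup_order", "search_kb"}  # hypothetical tool names
MAX_STEPS = 6                                  # hypothetical step cap
REQUIRED_FIELDS = {"answer": str, "citations": list, "escalate": bool}


class ConstraintViolation(Exception):
    """Raised when the agent steps outside its hard constraints."""


def check_tool_call(tool_name: str, step: int) -> None:
    """Reject a tool call that exceeds the step cap or is not whitelisted."""
    if step > MAX_STEPS:
        raise ConstraintViolation(f"step {step} exceeds cap of {MAX_STEPS}")
    if tool_name not in ALLOWED_TOOLS:
        raise ConstraintViolation(f"tool {tool_name!r} is not whitelisted")


def parse_final_output(raw: str) -> dict:
    """Parse the agent's final message and enforce the output schema."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ConstraintViolation("output is not valid JSON") from exc
    if not isinstance(data, dict):
        raise ConstraintViolation("output must be a JSON object")
    for field, expected in REQUIRED_FIELDS.items():
        if not isinstance(data.get(field), expected):
            raise ConstraintViolation(
                f"field {field!r} missing or not {expected.__name__}")
    return data
```

Because these checks are ordinary code rather than prompt instructions, they behave the same way regardless of which model version sits behind the agent.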
Finally, the user interface is a hallucination control in its own right. Replacing a blank free-text box with structured inputs, suggested questions, and dropdowns removes ambiguity at source, and showing a confidence indicator plus a clear "talk to a human" path means the rare bad answer does not become a bad outcome. Our before-and-after on a recent client deployment:
| Metric | Before mitigation | After layered controls |
|---|---|---|
| Hard-fail hallucination rate | 9.1% | 1.4% |
| Grounding coverage | 71% | 97% |
| Fabricated citations | 1 in 14 answers | Under 1 in 500 |
| Correct refusals | 52% | 94% |
Those are real measured figures from a single project, not a brochure claim, and the heavy lifting came from retrieval quality and the answer-present-in-context gate, not from swapping to a bigger model.
You set acceptable hallucination thresholds by domain risk: the higher the consequence of a wrong answer, the lower the tolerated rate and the more human oversight required. A 2% hard-fail rate that is perfectly acceptable for an internal IT helpdesk bot is wholly unacceptable for a tool giving medication or legal guidance. There is no universal "safe" number; there is only a number agreed against the specific harm a failure would cause.
We classify every use case into a risk tier at the start of a project and let that tier drive the threshold, the sampling rate, the mitigation budget, and whether a human must sit in the loop. The base hallucination rates by domain make the case for differentiation on their own: general-knowledge tasks sit under 1%, but legal questions climb past 6%, medical reasoning runs 10 to 20%, and some retrieval-augmented legal tools have been measured failing up to a third of the time. You cannot apply one standard across that spread.
| Risk tier | Example use case | Max hard-fail target | Human in loop? |
|---|---|---|---|
| Low | Internal FAQ, content drafting | Under 3% | Optional |
| Medium | Customer support, lead qualification | Under 2% | On low confidence |
| High | Financial figures, account changes | Under 0.5% | Mandatory review |
| Critical | Legal, medical, safety advice | Near zero, assisted only | Always - AI assists, human decides |
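The tier table can be expressed as configuration that the serving layer consults before releasing an answer. The sketch below is one possible encoding: the policy labels and the function name are assumptions, and "near zero, assisted only" is modelled as no autonomous answers at all.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TierPolicy:
    max_hard_fail: Optional[float]  # None = AI never gives the final answer
    human_in_loop: str              # assumed labels for the table's last column


TIER_POLICIES = {
    "low": TierPolicy(0.03, "optional"),
    "medium": TierPolicy(0.02, "on_low_confidence"),
    "high": TierPolicy(0.005, "mandatory_review"),
    "critical": TierPolicy(None, "always"),
}


def may_answer_autonomously(tier: str, measured_hard_fail: float, confident: bool) -> bool:
    """True only if the answer can reach the user without a human step."""
    policy = TIER_POLICIES[tier]
    if policy.max_hard_fail is None or policy.human_in_loop in {"mandatory_review", "always"}:
        return False
    # Targets are "under X%", so a rate equal to the target does not pass.
    if measured_hard_fail >= policy.max_hard_fail:
        return False
    if policy.human_in_loop == "on_low_confidence" and not confident:
        return False
    return True
```

Keeping the thresholds in one table-shaped structure makes the documented tier decision and the running system hard to drift apart.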
Our stance here is deliberately conservative, and we will say no to clients who want it otherwise. For critical-tier use cases we do not deploy autonomous AI that gives a final answer to an end user. We deploy AI as an assistant that drafts, retrieves, and summarises while a qualified human makes and owns the decision. The technology is not yet good enough to carry unsupervised liability in law or medicine, and pretending otherwise is how organisations end up in front of a regulator.
Setting the tier is also a governance act, not just an engineering one. It determines your audit-logging depth, your incident-response obligations, and, under UK law, whether the use case counts as high-risk automated decision-making with the extra duties that brings. We document the tier, the agreed threshold, and the justification in writing at the start, so there is never an argument later about what "good enough" meant.
UK law requires that you can explain automated decisions, that individuals subject to significant automated decisions have rights to human review, and that high-risk uses carry specific safeguards, all of which the Data (Use and Access) Act 2025 reshaped from the position under UK GDPR Article 22. For any UK enterprise deploying AI that affects customers, hallucination control is not just good engineering: it is part of demonstrating the lawfulness, fairness and transparency the ICO expects.
The DUAA 2025, with provisions expected to take effect through 2026, relaxes some of the previous near-blanket restriction on solely automated decision-making, allowing it more broadly while requiring guardrails for high-risk uses, particularly where special-category data is involved. The practical effect is that you have more freedom to automate, but more responsibility to evidence that the automation is reliable, explainable, and subject to human challenge. A system that hallucinates and cannot show its working fails that test on multiple fronts.
The ICO's own guidance is the backbone of how we map controls to obligations. The "Explaining decisions made with AI" guidance, produced with the Alan Turing Institute, and the ICO's updated automated decision-making material set out what regulators expect. Here is how we translate the duties into engineering artefacts:
| UK obligation | What it requires | How our controls satisfy it |
|---|---|---|
| Explainability (ICO / Turing) | Show how an output was reached | Logged retrieval context, citations, trace per request |
| Right to human review (DUAA / Art 22) | Human can review significant decisions | Escalation paths, human-in-loop on high risk |
| Accuracy principle (UK GDPR) | Personal data and outputs kept accurate | Golden-set evals, drift monitoring, refusal on uncertainty |
| Accountability | Demonstrate compliance with records | Audit logs, versioned eval results, incident SOP |
Two trends underline why this matters commercially as well as legally. Responsible-AI governance roles grew about 17% in 2025, and the share of firms operating with no responsible-AI policy at all fell from 24% to 11%. Buyers and regulators increasingly expect a documented position. Our incident-response standard operating procedure is simple and worth stating plainly: every confirmed material hallucination is logged with the input, the retrieved context, the output, and the user impact; it is triaged by severity; a fix or rollback is applied; a regression test is added to the golden set so the same failure cannot recur; and high-severity cases are reported to the client's data protection lead. That paper trail is exactly what an ICO enquiry would ask to see.
Our honest view: most UK businesses do not need to fear the regulator if they engineer responsibly. The organisations that get into trouble are the ones who deployed an unmonitored black box, kept no logs, and could not explain a single decision. Do the testing, keep the records, and compliance largely follows from good engineering rather than fighting against it.
The Softomate implementation process for hallucination-safe AI is a five-stage programme that takes a typical medium-complexity enterprise system from scoping to monitored production in eight to fourteen weeks, with a fixed quote agreed before any build work begins. We do not bill open-ended day rates for this kind of work, because the testing and mitigation effort is exactly the part clients most need certainty on. You get a defined scope, a defined threshold, and a defined price.
We are a London-based AI automation and software development agency in Stanmore (HA7), and every engagement starts by agreeing what "acceptable" means for your specific use case. The five stages:
| Stage | Typical duration | Key deliverable |
|---|---|---|
| Discovery and risk tiering | 1 to 2 weeks | Risk tier, threshold, failure taxonomy |
| Golden set and eval harness | 1 to 2 weeks | Labelled test set, baseline metrics |
| Build with mitigation | 3 to 6 weeks | Working system below threshold |
| Red-team and review | 1 to 2 weeks | Adversarial report, CI gate |
| Monitored launch | 1 to 2 weeks | Live monitoring, compliance pack |
On pricing, a focused hallucination-testing and mitigation engagement on an existing AI system starts at around £6,500. A full build of a production-grade, monitored, compliant AI assistant or custom system integration typically starts at around £14,000 and scales with risk tier and integration complexity. Ongoing monitoring, drift tracking, and golden-set maintenance are available from around £900 a month. Every figure is confirmed as a fixed quote after discovery, so there are no surprises mid-project. If you already have an AI feature that "mostly works but sometimes embarrasses us", the testing-only engagement is usually where we start.
What you will not get from us is a promise that hallucinations will reach zero. What you will get is a measured rate, held below an agreed threshold, with the monitoring, audit trail, and incident process to keep it there and to prove it to a regulator or a board.
No. Large language models are probabilistic, so any system can hallucinate under some condition. The realistic goal is a measured rate held below an agreed threshold, with monitoring to catch the rare failure before it reaches a customer. Anyone promising zero hallucination is overselling.
For a medium-risk support bot we target a hard-fail rate under 2% on the golden set, with grounding coverage above 95% and correct refusals above 90%. Higher-risk uses touching money or health demand far stricter thresholds and mandatory human review.
RAG reduces them substantially when done well, but naive RAG can make grounding worse by feeding irrelevant context. It needs hybrid retrieval, clean chunking, metadata filtering, and a validation gate that refuses to answer when the retrieved context does not actually contain the answer.
Allocate 30 to 40% of total project time to testing and mitigation. Teams that skip this ship faster but spend the following quarter firefighting trust incidents, which costs more in money, reputation, and regulatory exposure than doing it properly the first time.
We use LLM observability platforms such as Langfuse, Helicone, Arize Phoenix, and Galileo to trace requests, log retrieved context, and run continuous scoring. Our default is self-hosted Langfuse plus a custom scoring service, which keeps data inside a UK boundary for residency-sensitive clients.
Yes. Inaccurate automated outputs can breach the accuracy principle and the rules on automated decision-making. The DUAA 2025 allows more automation but requires guardrails for high-risk uses, plus explainability and human review. Good hallucination control is part of demonstrating lawful, fair processing.
A golden set is a labelled collection of representative questions, each with the correct source-grounded answer and the document that justifies it. It is the benchmark every release is scored against, turning hallucination from a vague worry into a measurable rate you can gate releases on.
When a provider ships a new model behind the same API, validated behaviour can change overnight with no code change on your side. We pin model versions, shadow-test new ones against the full golden set, and only switch once they pass, so updates never silently reintroduce failures.
Not always; it depends on the risk tier. Low-risk content drafting can run autonomously; medium-risk uses escalate to a human on low confidence; high and critical uses such as legal, medical, or financial advice always keep a qualified human making the final decision while the AI assists.
A focused testing and mitigation engagement typically delivers measurable improvement within two to four weeks. On a recent project we cut the hard-fail rate from 9.1% to 1.4% mainly through better retrieval and an answer-present-in-context gate, without swapping to a larger or more expensive model.
AI hallucination is an engineering defect with a measurable rate, not an unavoidable mystery. The systems that stay trustworthy in production share the same discipline: a labelled golden set; a three-tier testing strategy of automated evals, red-teaming and expert review; continuous live monitoring with CI regression gates; and a layered mitigation architecture led by strong retrieval and an answer-present-in-context check. Set thresholds by domain risk, hold medium-risk systems under a 2% hard-fail rate, keep critical-tier decisions in human hands, and tie every control to UK ICO explainability duties and the DUAA 2025 safeguards. On the client deployment described earlier, this approach took the hard-fail rate from 9.1% to 1.4%, and monitoring plus a documented incident process are what keep it there. The organisations that win with AI are not the ones chasing zero hallucination; they are the ones who measure, mitigate, monitor, and can prove it. That evidence is what turns a risky pilot into a system your board and your regulator will both trust.
If you have an AI feature that mostly works but occasionally invents answers, or you are scoping a new deployment and want it hallucination-tested from day one, talk to us about our AI automation and testing services in London or get in touch for a fixed-quote assessment.
Written by Deen Dayal Yadav, Founder of Softomate Solutions, a London-based AI automation and software development agency in Stanmore (HA7). With over 12 years building software and automation systems for UK businesses, Deen leads the team that designs, tests, and ships production AI assistants, voice agents, and process automation with measured reliability and full audit trails. Softomate Solutions is registered at Companies House and works with clients across London and the UK to deploy AI that is accurate, explainable, and compliant. Learn more about our team and approach.
We protect the real names of all clients featured in examples and case studies. Every testimonial is from a real client.