AI & Automation Services
Automate workflows, integrate systems, and unlock AI-driven efficiency.




RAG (Retrieval-Augmented Generation) retrieves live information from a connected knowledge base each time a user asks a question - it costs less, stays current automatically, and requires no retraining when your documents change. Fine-tuning bakes knowledge into model weights at training time - it costs more upfront (typically £3,000-£8,000), requires a minimum of 500 high-quality training examples, and needs a full retraining cycle to update. For most UK businesses, RAG is the right starting point: implementation costs £1,500-£4,000, integrates with existing documents and CRM data within days, and updates take hours rather than weeks. Fine-tuning suits businesses with highly specialised language, regulated terminology, or proprietary methodologies that appear consistently across thousands of customer interactions - think FCA compliance scripts or NHS clinical triage language.
Last updated: 18 May 2026
Published 18 May 2026RAG, or Retrieval-Augmented Generation, connects a large language model to an external knowledge store. Instead of relying solely on what was baked into the model during pre-training, the chatbot retrieves the most relevant document chunks at query time and feeds them into the prompt as context - so the model generates its answer from your actual, current data.
When we build RAG systems for clients here at Softomate, the pipeline has three distinct stages: document ingestion, retrieval, and generation. Understanding each stage helps you make an informed decision about whether RAG fits your business requirements.
During ingestion, your source documents - PDFs, Word files, database exports, web pages, CRM records - are split into chunks, typically 256 to 512 tokens each, and passed through an embedding model. The embedding model converts each chunk into a high-dimensional numerical vector that captures its semantic meaning. These vectors are stored in a vector database such as Pinecone or Weaviate.
We use different embedding models depending on client requirements. For most commercial UK business chatbots, OpenAI's text-embedding-3-large model or Cohere's embed-english-v3.0 deliver excellent results at a manageable cost of roughly £0.0001 per 1,000 tokens. For clients with data residency requirements, we run embedding models on-premises using open-source alternatives such as BGE-large or E5-large, deployed on Azure UK South or AWS eu-west-2.
When a user submits a query, the same embedding model converts that query into a vector. The vector database then performs an approximate nearest-neighbour search - finding the document chunks whose vectors sit closest to the query vector in semantic space. We typically retrieve the top five to ten chunks, re-rank them using a cross-encoder model for precision, and inject them into the prompt as context.
The choice of vector database matters for production deployments. Pinecone is fully managed, scales to hundreds of millions of vectors, and costs roughly £70-£200 per month for most UK business workloads. Weaviate offers a self-hosted option that suits clients with strict data sovereignty requirements - we have deployed Weaviate on Azure UK South for a regulated financial services client where no data could leave UK infrastructure.
With retrieved context injected, the language model - whether GPT-5.4, Claude 4, or a self-hosted open-source model - generates a grounded, specific answer. Crucially, the model is instructed to answer only from the provided context and to signal when it cannot find a relevant answer rather than hallucinating one. This citation constraint is one of the most important safety features in a production RAG system.
Frameworks such as LangChain and LlamaIndex handle orchestration between these stages, managing prompt construction, chunk injection, memory, and tool calls. LlamaIndex is particularly strong for document-heavy RAG pipelines - its connector ecosystem handles everything from SharePoint to Notion to SQL databases. LangChain is our default for conversational agents that need tool use alongside retrieval.
RAG's defining strength is that you can update your knowledge base - re-embed new documents, push them to the vector store - without touching the underlying language model. A property management company we work with adds new tenancy legislation documents as they are published; the chatbot reflects the update within hours. That would require a full retraining cycle with fine-tuning, taking days and costing thousands of pounds each time.
Fine-tuning continues the training of a pre-trained language model on a curated dataset of examples specific to your domain. The model's weights are updated so it learns your company's tone, terminology, and response patterns - not just at prompt time, but permanently baked into the model itself.
The distinction matters practically. A RAG chatbot consults your documents at runtime - it is still the base model underneath. A fine-tuned model has been reshaped by your data at the weight level. It will naturally write in your style, use your exact terminology, and handle your edge cases without needing those patterns injected via retrieval every single time.
Supervised fine-tuning requires a training dataset of input-output pairs: a question and the ideal answer, a support ticket and the ideal resolution, a legal query and the FCA-compliant response. The model is trained on these pairs using gradient descent - the same mechanism as original pre-training, but starting from a much more capable base and using far fewer steps.
Full fine-tuning updates all model parameters, which is computationally expensive - for a 7B parameter open-source model, you are looking at £500-£2,000 in GPU compute alone. LoRA (Low-Rank Adaptation) and its quantised variant QLoRA dramatically reduce this cost by training only a small set of adapter matrices while keeping the base model frozen. A LoRA fine-tune on a 13B model can run on a single A100 GPU in four to eight hours, costing roughly £50-£150 in compute - though the overall project cost includes data preparation, evaluation, deployment, and ongoing maintenance.
For clients using the OpenAI API, GPT-5.4 fine-tuning via the fine-tuning endpoint removes infrastructure concerns entirely. You upload your JSONL training file, OpenAI handles the compute, and you get a custom model ID back - typically within two to six hours for datasets under 10,000 examples. Pricing is based on tokens processed during training plus a per-token inference premium on your custom model.
This is where fine-tuning projects most commonly stall. Clients often underestimate the volume and quality of labelled data required. Our minimum thresholds from project experience:
Data quality matters more than quantity. A dataset of 300 carefully reviewed, diverse, expert-written examples will outperform 3,000 noisy, repetitive ones. We spend 30-40% of a fine-tuning project budget on data curation, cleaning, and review - clients who skip this step end up with a model that overfits to their most common query patterns and fails on anything unusual.
| Scenario | RAG better | Fine-tuning better |
|---|---|---|
| Knowledge base updated frequently (weekly or more) | Yes - update vector store, no retraining | No - retraining cost prohibitive |
| Highly specific regulatory language (FCA, CQC, SRA) | Partial - can inject via retrieval | Yes - internalises compliant phrasing permanently |
| Unique brand voice across thousands of interactions | Partial - system prompt helps but inconsistent | Yes - model adopts tone naturally |
| Factual Q&A from large document corpus | Yes - citations, accuracy, currency | No - cannot retrieve facts it was not trained on |
| Proprietary methodology or scoring framework | Partial - if documented fully | Yes - if applied implicitly across many contexts |
| Low-latency requirements (under 500ms) | No - retrieval adds 200-400ms | Yes - no retrieval step required |
| Limited training data available (under 200 examples) | Yes - no training data required | No - insufficient for reliable results |
A financial services client of ours needed a chatbot that consistently used FCA-compliant phrasing for investment risk disclosures - phrasing that had to be worded precisely, not approximately. That was a fine-tuning case. The required language appeared in hundreds of interactions per day, the phrasing was highly specific, and even small deviations created compliance risk. RAG could have injected the right phrases via retrieval, but the inconsistency risk was unacceptable to their compliance team. Fine-tuning gave them a model that applied compliant language naturally, at speed, without needing perfect retrieval every time.
For most UK businesses comparing the two approaches, RAG costs £1,500-£4,000 to implement and £70-£300 per month to run, while a fine-tuning project runs £3,000-£8,000 upfront with lower ongoing inference costs but significant retraining expense each time your knowledge changes.
The cost comparison is rarely straightforward because the two architectures have different cost structures across their lifecycles. RAG front-loads infrastructure setup but stays relatively cheap to update. Fine-tuning has lower ongoing inference costs but turns knowledge updates into expensive events. Below is the full breakdown based on our project cost data.
| Cost factor | RAG implementation | Fine-tuning project |
|---|---|---|
| Initial build cost | £1,500-£4,000 (ingestion pipeline, vector DB, retrieval logic, API integration) | £3,000-£8,000 (data curation, training run, evaluation, deployment) |
| Monthly running cost | £70-£300 (vector DB hosting, embedding API calls, LLM inference) | £30-£150 (LLM inference only, no retrieval overhead - but premium per-token rate on custom model) |
| Update cost when knowledge changes | £0-£500 (re-embed and push updated documents; can be fully automated) | £1,500-£5,000 per retraining run (data refresh, GPU compute, evaluation, redeploy) |
| Training data required | None - works from raw documents | 200-5,000+ labelled examples (data preparation: £500-£2,000 of total cost) |
| Time to update knowledge | Hours (automated pipeline) to 2 days (manual re-ingestion) | 2-6 hours for API fine-tuning; 1-4 weeks for full custom model retraining and evaluation |
| Latency per response | 800ms-2s (retrieval adds 200-400ms to generation time) | 400ms-1.2s (no retrieval step; faster response times) |
| Typical 12-month total cost | £3,340-£7,600 (build + 12 months running + two knowledge updates) | £6,360-£21,800 (build + 12 months running + two retraining cycles) |
Several costs do not appear in the initial quote that materially affect the total cost of ownership for both approaches.
For RAG, the biggest hidden cost is data quality preparation. If your documents are in poor shape - inconsistent formatting, duplicate content, outdated pages that contradict current policy - retrieval quality suffers and users get confusing answers. We typically spend £500-£1,500 on document cleanup and chunking strategy before a RAG system performs reliably. A property management client came to us with 2,000 tenancy FAQ entries, but 40% were outdated, contradictory, or duplicates. The data cleanup before ingestion was 20% of the total project cost.
For fine-tuning, the hidden cost is evaluation and red-teaming. A fine-tuned model that performs well on your training distribution can fail unpredictably on queries outside that distribution. Rigorous evaluation - building a diverse test set, running adversarial prompts, checking for bias amplification - typically adds £500-£1,500 to a fine-tuning project. Skip this and you risk deploying a model that fails in production in ways that damage customer trust.
Both approaches share the ongoing cost of prompt engineering and monitoring. Production AI chatbots need regular prompt review, failure analysis, and output auditing. Budget £200-£500 per month for this if you are managing it internally, or include it in a managed service contract with us.
The answer depends on four variables: how frequently your knowledge changes, how specialised your required language is, how much labelled training data you have, and what latency your users will accept. For the majority of UK SMEs and mid-market businesses, RAG is the right starting point - fine-tuning becomes justified when language specificity requirements cannot be met reliably via retrieval alone.
Rather than abstract principles, here is a practical use-case decision guide drawn from projects we have delivered across professional services, property, healthcare, and financial services sectors.
| Use case | Best architecture | Why | Estimated cost |
|---|---|---|---|
| Customer FAQ chatbot (e-commerce, SaaS, hospitality) | RAG | FAQ content changes regularly; retrieval from product/policy docs gives accurate, citable answers; no specialised language required | £2,000-£4,000 build |
| Legal advice pre-screening bot (SRA-regulated firm) | Fine-tuning or hybrid | Precise legal terminology required consistently; SRA conduct rules must not be paraphrased; latency matters for UX | £6,000-£15,000 |
| Product recommendation assistant (retail, B2B) | RAG | Product catalogue changes frequently; personalisation driven by retrieval from user history and product data; no specialised language needed | £3,000-£6,000 build |
| HR policy and onboarding bot | RAG | Policy documents are the ground truth; updates must propagate immediately; employees need cited answers, not paraphrases | £1,500-£3,000 build |
| Medical triage assistant (CQC-registered provider) | Hybrid (RAG + fine-tuning) | Requires both current clinical guidance (RAG from NHS/NICE docs) and consistent clinical language and safety phrasing (fine-tuning); highest stakes of all use cases | £12,000-£25,000+ |
| Financial product explainer (FCA-authorised firm) | Fine-tuning | Risk warning language, suitability phrasing, and fair treatment language must be precise and consistent across every interaction; retrieval inconsistency creates compliance exposure | £5,000-£10,000 |
| Internal knowledge base assistant (professional services) | RAG | Accesses internal documents, case files, precedents; knowledge base is large and evolves; source citations essential for professional trust | £3,000-£7,000 build |
| Multilingual customer support (international UK businesses) | RAG with multilingual embedding | Cross-lingual retrieval using multilingual-e5 or Cohere multilingual embeddings; base model handles generation; easier to maintain than language-specific fine-tunes | £4,000-£8,000 build |
When a client comes to us uncertain about which architecture to choose, we work through three diagnostic questions before recommending anything.
How often does your knowledge change? If your product catalogue, pricing, policies, or regulatory guidance changes more than once per quarter, RAG is almost always the right choice. The cost of retraining a fine-tuned model quarterly (£1,500-£5,000 per cycle) typically exceeds the cost of running a RAG system for an entire year.
Do you have a language problem or a knowledge problem? A language problem means the chatbot needs to phrase things a very specific way - FCA-compliant risk warnings, clinical triage scripts, legal disclaimers. A knowledge problem means the chatbot needs access to information - your product specs, your FAQ library, your policy documents. RAG solves knowledge problems directly. It can partially solve language problems via carefully engineered system prompts, but fine-tuning solves language problems more reliably and at scale.
Can you assemble 500+ high-quality labelled examples? If the answer is no, fine-tuning is off the table regardless of how appealing it sounds in theory. We have seen clients spend £2,000-£3,000 on a fine-tuning project only to abandon it when they discovered their historical chat transcripts were too low-quality, too repetitive, or too small to produce a reliable model. Honest data assessment before project start saves significant money.
Yes - and for high-stakes, high-volume deployments this hybrid approach is often the right answer. A fine-tuned model handles consistent language, tone, and domain terminology, while RAG provides grounded, current knowledge. The two techniques address different weaknesses of a base model and complement each other rather than compete.
The hybrid architecture works by fine-tuning the language model first - training it on your specific language patterns, compliance requirements, or branded tone - and then deploying that fine-tuned model as the generator inside a RAG pipeline. The model still retrieves current document context at query time, but it generates responses in the precise style and terminology your use case demands.
The additional build cost for a hybrid system is real - typically £8,000-£20,000 for a production deployment depending on model size, data requirements, and infrastructure complexity. That cost is justified in specific scenarios:
We built a hybrid system for a specialist healthcare staffing agency that places clinical staff across NHS and private hospital settings. Their chatbot needed to handle two distinct requirements simultaneously: it had to retrieve current shift availability, pay rates, and compliance documentation in real time (a knowledge problem that RAG solves well), and it had to communicate in the precise language NHS procurement teams expect, with correct banding references, IR35 status language, and CQC registration terminology (a language problem that fine-tuning solves well).
We fine-tuned a GPT-5.4 model on 1,400 examples drawn from their account manager email transcripts and compliance documentation, then deployed it as the generator inside a LlamaIndex RAG pipeline that connected to their shift management database via API. The result was a chatbot that retrieved live availability data with citation accuracy while communicating with the register-appropriate language their NHS clients expected. The project cost £14,500 and reduced their account management team's inquiry handling time by 60%.
That result would not have been achievable with RAG alone (the language consistency was not reliable enough) or fine-tuning alone (the model would have had no access to real-time shift data). The hybrid approach was the correct engineering choice - and the business case supported the higher build cost.
For factual and time-sensitive information, RAG is more accurate because it retrieves answers directly from your source documents rather than relying on memorised training data. Fine-tuning is more consistent for style, terminology, and compliance language. A hybrid system - a fine-tuned model inside a RAG pipeline - produces the most accurate results overall but costs £8,000-£20,000 to build and is only justified for high-stakes, high-volume deployments.
GPT-5.4 fine-tuning via the OpenAI API typically takes 2-6 hours for datasets under 10,000 examples, with no infrastructure management required on your side. Fine-tuning a self-hosted open-source model (such as Llama 3.1 or Mistral) on your own GPU infrastructure takes 4-24 hours depending on dataset size and model parameter count. Building and evaluating a fully custom model from a base checkpoint with extensive human feedback can take 2-8 weeks and is rarely necessary for UK business chatbot use cases.
When your documents change, you re-embed the updated files and push the new vectors to your vector database - the underlying language model is untouched. This process can be fully automated: your document pipeline detects file changes, runs the embedding job, and updates Pinecone or Weaviate automatically. For most clients, knowledge updates propagate within 2-4 hours of the source document changing. Contrast this with fine-tuning, where a knowledge update requires a full retraining cycle costing £1,500-£5,000 and taking days to complete.
Yes, with appropriate data residency and access controls in place. We deploy RAG systems for FCA-authorised and CQC-registered clients using Azure UK South or AWS eu-west-2 regions, ensuring all data - source documents, vectors, and conversation logs - remains within UK infrastructure in compliance with UK GDPR. Vector databases such as Weaviate support self-hosted deployment within your own Azure or AWS tenant. The key compliance considerations are data residency, access logging, retention periods, and right-to-erasure workflows - all of which are solvable with the right architecture design from the outset.
A minimum of 200-500 high-quality, diverse, expert-reviewed input-output examples is required to see meaningful improvement over a base model. For consistent style and terminology across a wide range of query types, budget for 1,000+ examples. The most common sources are historical support ticket resolutions, email transcripts with clients, annotated policy documents, and manually written Q&A pairs reviewed by a subject matter expert. Data quality matters more than quantity - 300 carefully curated examples will outperform 3,000 noisy or repetitive ones from a poorly filtered export.
RAG costs £1,500-£4,000 to implement and handles knowledge updates in hours - for the vast majority of UK business chatbot projects, that combination of low cost and high adaptability makes it the right default choice. Fine-tuning becomes justified when language precision requirements cannot be met reliably through prompt engineering and retrieval alone: regulated terminology, FCA-compliant risk language, or NHS clinical phrasing that must be consistent across tens of thousands of interactions. Hybrid systems deliver the best of both approaches at £8,000-£20,000, but that investment requires a clear business case. Start with RAG, measure what breaks, and add fine-tuning only where retrieval-based language consistency genuinely falls short.
Not sure whether your chatbot project needs RAG, fine-tuning or a hybrid approach? Talk to the Softomate AI team - we will review your use case, existing data assets and budget and recommend the right architecture before you commit to a build.
Written by the Softomate Solutions AI Development Team, Barking, East London. We build custom AI chatbots using RAG and fine-tuning architectures for UK businesses across professional services, property, healthcare and financial services.AI chatbot development costs in the UK range from £3,000 for a simple FAQ chatbot to £25,000+ for a fully integrated conversational AI with CRM and booking system integration. Monthly running costs are typically £100-£400. Softomate Solutions builds AI chatbots from £3,500 with a 3-4 week delivery timeline and full UK GDPR configuration included.
For customer-facing use, a custom AI chatbot trained on your specific business knowledge, pricing and services significantly outperforms a generic ChatGPT integration. A custom chatbot knows your products, your pricing, your service area and your compliance requirements. It also integrates with your CRM, booking system and WhatsApp - capabilities ChatGPT plugins cannot replicate without custom development.
Let us help
Talk to our London-based team about how we can build the AI software, automation, or bespoke development tailored to your needs.
Deen Dayal Yadav
Online