What is RAG (Retrieval-Augmented Generation)?

RAG — Retrieval-Augmented Generation — is the AI architecture that allows a language model to answer questions about your specific documents. Instead of relying only on its training data, the AI retrieves relevant content from your knowledge base (employee handbooks, contracts, product specs) before generating a response. This is how enterprise AI assistants answer accurately about your proprietary data without retraining.

What is the difference between RAG and fine-tuning?

RAG retrieves relevant documents at query time and passes them to the model as context — no retraining required. Fine-tuning bakes knowledge into the model weights by retraining on your data. RAG is faster to deploy (two to four weeks vs. months), cheaper to update (just update the document store), and better for factual Q&A. Fine-tuning is better for style, tone, or domain-specific language patterns.

What types of documents can a RAG system use?

A RAG system can use any text-based content: PDFs, Word documents, web pages, Notion pages, Confluence wikis, CSV files, support tickets, and email threads. The documents are chunked, embedded as vectors, and stored in a vector database. At query time, the most relevant chunks are retrieved and passed to the language model as context for its answer.

How long does it take to build a RAG system?

A basic RAG system — document ingestion, vector store, and a chat interface — can be deployed in two to four weeks. A production-grade RAG system with access controls, source citations, feedback loops, and integration with business tools typically takes six to twelve weeks. Ongoing optimisation of chunking strategy and retrieval accuracy continues after launch.

When should a business invest in RAG instead of a standard chatbot?

Invest in RAG when your AI assistant needs to answer questions accurately about your specific data — internal policies, product documentation, legal contracts, or historical support cases. A standard chatbot uses scripted or pattern-matched responses and cannot reason over novel queries in your documents. RAG is the right architecture when accuracy on your proprietary content matters more than scripted predictability.

What Is RAG? How AI Companies Build Smarter Search | Codalyst Tech Blog

RAG (Retrieval-Augmented Generation) is the architecture behind AI assistants that answer questions about your specific documents. When an AI system reads your employee handbook, legal contracts, or product knowledge base and returns accurate answers, it uses RAG to retrieve relevant content before generating a response. RAG is the most practical way to build an AI that knows your proprietary data without retraining a foundation model from scratch, and a basic version can be deployed in two to four weeks.

This article explains how RAG works step by step, when it makes sense to build one, and how it compares to the alternatives.

The problem RAG solves

Large language models like GPT-4 or Claude are trained on vast amounts of internet text. They are remarkably good at reasoning, writing, and answering general questions. But they have two hard limitations:

Knowledge cutoff. They only know what was in their training data, which was frozen at a specific date.
No private knowledge. They have never read your internal documents, your customer records, your proprietary research, or your product documentation.

The naive solution is to paste your documents into the model's context window — and this works, for small document sets. But context windows have limits, and more importantly, searching 10,000 documents by stuffing them all into a prompt is slow, expensive, and impractical.

RAG solves this by retrieving the relevant documents first, then having the model generate an answer based only on what was retrieved. The model itself is not retrained; the retrieval step happens around it.

How RAG works — step by step

Step 1 — Indexing (done once, updated as documents change)

Your documents (PDFs, Word files, database records, web pages, Notion pages) are broken into chunks — typically 500–1,000 words each.

Each chunk is passed through an embedding model, which converts it into a list of numbers called a vector. Semantically similar text produces similar vectors, even if the exact words differ. "Dog bite" and "canine attack" will have similar vectors; "accounting policy" and "football results" will be far apart.
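To make "similar vectors" concrete, here is a minimal sketch of cosine similarity, the measure most often used to compare embeddings. The three-dimensional vectors below are hand-made stand-ins chosen for illustration; a real embedding model returns vectors with hundreds or thousands of dimensions, and its actual numbers will differ.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: near 1.0 means similar, near 0.0 unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors, NOT output from a real embedding model.
toy_vectors = {
    "dog bite": [0.9, 0.1, 0.0],
    "canine attack": [0.8, 0.2, 0.1],
    "accounting policy": [0.0, 0.1, 0.9],
}

close = cosine_similarity(toy_vectors["dog bite"], toy_vectors["canine attack"])
far = cosine_similarity(toy_vectors["dog bite"], toy_vectors["accounting policy"])

print(f"dog bite vs canine attack:     {close:.2f}")
print(f"dog bite vs accounting policy: {far:.2f}")
```

The first pair scores close to 1 and the second close to 0, which is the property the retrieval step relies on.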

These vectors are stored in a vector database — a specialised database designed for fast similarity search. Popular options include Pinecone, Weaviate, Qdrant, and pgvector (a PostgreSQL extension).

Step 2 — Retrieval (happens at query time)

When a user asks a question ("What is our refund policy for digital products?"), the question is also converted into a vector using the same embedding model.

The vector database performs a nearest-neighbour search — it finds the chunks whose vectors are closest to the question vector. This is semantic search: it finds the most relevant content, not just content containing the exact keywords.

The top three to seven chunks are returned as context.
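The retrieval step can be sketched as a brute-force nearest-neighbour search over an in-memory index. This is an illustration only: the chunk IDs and vectors are invented, and a real vector database would typically use an approximate index rather than scoring every chunk.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def top_k(query_vec: list[float], index: dict[str, list[float]], k: int = 3):
    """Return the k (chunk_id, score) pairs most similar to the query, best first."""
    scored = [(chunk_id, cosine(query_vec, vec)) for chunk_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical index: chunk ID -> toy embedding (a real one comes from the embedding model).
index = {
    "refund-policy#2": [0.9, 0.1, 0.1],
    "refund-policy#3": [0.7, 0.3, 0.0],
    "holiday-calendar#1": [0.1, 0.9, 0.2],
    "security-policy#5": [0.0, 0.2, 0.9],
}

# Stand-in for the embedded question "What is our refund policy for digital products?"
query_vec = [0.8, 0.2, 0.1]

results = top_k(query_vec, index, k=2)
for chunk_id, score in results:
    print(f"{chunk_id}: {score:.3f}")
```

The important detail is that the question and the chunks must be embedded with the same model, otherwise the distances are meaningless.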

Step 3 — Generation

The original question and the retrieved chunks are assembled into a prompt:

"Using only the following context, answer this question. If the answer is not in the context, say so.

Context: [retrieved chunks]

Question: [user's question]"

The LLM generates a response grounded in the retrieved context. Crucially, the model is instructed not to make things up — if the answer is not in the retrieved documents, it should say so.
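Assembling that prompt is plain string formatting. The sketch below uses the prompt wording shown above; the function name and the numbering of chunks are assumptions added for illustration, and the call to the LLM itself is left out because it depends on the provider.

```python
def build_prompt(question: str, chunks: list[str]) -> str:
    """Combine the retrieved chunks and the user's question into one grounded prompt."""
    # Number each chunk so the model (and later the citation step) can refer to it.
    context = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        "Using only the following context, answer this question. "
        "If the answer is not in the context, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Hypothetical retrieved chunks, for illustration only.
prompt = build_prompt(
    "What is our refund policy for digital products?",
    [
        "Digital products can be refunded within 14 days of purchase.",
        "Refund requests are submitted through the customer portal.",
    ],
)
print(prompt)
```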

Step 4 — Citation

Production RAG systems include the source document name and chunk location in the response, so users can verify the answer. This is what separates a hallucination-prone chatbot from a trustworthy knowledge assistant.
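One simple way to attach citations is to carry the source metadata alongside each chunk and append a de-duplicated source list to the answer. The class and field names here are hypothetical, and the document names are made up for the example.

```python
from dataclasses import dataclass


@dataclass
class RetrievedChunk:
    document: str  # source document name
    location: str  # where in the document the chunk came from, e.g. page or section
    text: str


def format_answer(answer: str, chunks: list[RetrievedChunk]) -> str:
    """Append a numbered list of unique sources to the generated answer."""
    sources: list[str] = []
    for chunk in chunks:
        source = f"{chunk.document}, {chunk.location}"
        if source not in sources:  # keep first occurrence, preserve retrieval order
            sources.append(source)
    lines = [answer, "", "Sources:"]
    lines += [f"  [{i}] {source}" for i, source in enumerate(sources, start=1)]
    return "\n".join(lines)


reply = format_answer(
    "Digital products can be refunded within 14 days.",
    [
        RetrievedChunk("Refund Policy.pdf", "section 4.2", "..."),
        RetrievedChunk("Refund Policy.pdf", "section 4.2", "..."),
        RetrievedChunk("Support FAQ.docx", "page 3", "..."),
    ],
)
print(reply)
```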

RAG vs fine-tuning — what is the difference?

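Both RAG and fine-tuning are ways to make an LLM more knowledgeable about a specific domain, but they work differently and serve different purposes. RAG retrieves relevant documents at query time and passes them to the model as context, so no retraining is required and updating the system's knowledge means updating the document store. Fine-tuning bakes knowledge into the model weights by retraining on your data, which takes longer to deploy and to update.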

For most business use cases — internal knowledge search, customer support, document Q&A — RAG is the right starting point. Fine-tuning is best reserved for cases where you need the model to adopt a specific writing style, reason in a domain-specific way, or respond with structured formats the base model does not handle well.

What RAG systems are used for

Internal knowledge search. Employees ask questions and get answers from internal wikis, HR policies, technical docs, and past project reports — with source citations.

Customer support automation. A support bot that answers customer questions from product documentation and FAQs, escalating to a human only when confidence is low.

Legal and compliance review. Lawyers and compliance teams query case law, regulatory documents, or contracts without manually reading thousands of pages.

Sales enablement. Sales reps ask about competitor comparisons, product specs, or case studies and get instant, accurate answers from the company knowledge base.

Financial analysis. Analysts query earnings reports, regulatory filings, or internal financial data with natural language.

What makes a good RAG system

The difference between a demo that impresses and a system that works reliably in production comes down to the following:

Chunking strategy

How you split documents matters enormously. Chunk too small and you lose context. Chunk too large and retrieval becomes noisy. A simple approach is fixed-size chunks with overlap (e.g., 800 tokens with 100-token overlap). More sophisticated systems use semantic chunking — splitting at natural paragraph or section breaks.
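Fixed-size chunking with overlap can be sketched in a few lines. Note the assumption: the example treats whitespace-separated words as "tokens", whereas a real system would count tokens with the tokenizer of its embedding model, so the chunk boundaries would fall in different places.

```python
def chunk_tokens(tokens: list[str], size: int = 800, overlap: int = 100) -> list[list[str]]:
    """Split tokens into windows of `size`, each sharing `overlap` tokens with the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):  # this window already reached the end
            break
    return chunks


# Stand-in document: 2,000 numbered "tokens" so the boundaries are easy to inspect.
tokens = [str(i) for i in range(2000)]
chunks = chunk_tokens(tokens, size=800, overlap=100)

for i, chunk in enumerate(chunks):
    print(f"chunk {i}: tokens {chunk[0]}..{chunk[-1]} ({len(chunk)} tokens)")
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk, at the cost of storing some text twice.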

Embedding model choice

The embedding model determines how well the semantic search works. OpenAI's text-embedding-3-large, Cohere's embed-v3, and open-source models like bge-large-en-v1.5 all perform well. The right choice depends on your language requirements and latency tolerance.

Retrieval quality

Simple nearest-neighbour search is often not enough. Production systems typically combine:

Semantic search (vector similarity)
Keyword search (BM25, a traditional text search algorithm)
Re-ranking (a second model that scores retrieved chunks for relevance)

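This combination, usually called hybrid search with re-ranking, typically improves retrieval accuracy over vector search alone. It should not be confused with HyDE or CRAG, which are separate techniques: HyDE embeds a model-generated hypothetical answer in place of the raw question, and CRAG evaluates the retrieved documents and corrects the retrieval when they look irrelevant.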
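The article does not say how the semantic and keyword results should be merged. One widely used option is reciprocal rank fusion (RRF), sketched below as an assumption rather than as the method any particular system uses. Each chunk earns 1 / (k + rank) from every list it appears in, so chunks ranked well by both searches rise to the top. The chunk IDs are invented, and k = 60 is the constant conventionally used with RRF.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of chunk IDs into one list, best first."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda chunk_id: scores[chunk_id], reverse=True)


# Hypothetical results from the two searches, each already ordered best first.
semantic_results = ["c3", "c1", "c7"]
keyword_results = ["c1", "c9", "c3"]

fused = reciprocal_rank_fusion([semantic_results, keyword_results])
print(fused)
```

A re-ranking model would then score only the top of this fused list, which keeps the more expensive second pass small.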

Evaluation harness

You cannot improve what you cannot measure. Build a test set of 50–100 representative questions with ground-truth answers. Measure retrieval accuracy (did the right chunk come back?) and answer quality (is the generated answer correct?). Run this before every release.
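The retrieval half of such a harness can be very small. The sketch below computes a hit rate: the share of test questions for which the expected chunk appears in the top k results. The test cases and the stub retriever are invented for the example; in practice `retrieve` would call the real retrieval pipeline, and answer quality would need a separate check.

```python
from typing import Callable


def retrieval_hit_rate(test_set: list[dict], retrieve: Callable[[str], list[str]], k: int = 5) -> float:
    """Fraction of questions whose expected chunk is among the top k retrieved chunk IDs."""
    hits = 0
    for case in test_set:
        retrieved = retrieve(case["question"])[:k]
        if case["expected_chunk"] in retrieved:
            hits += 1
    return hits / len(test_set)


# Hypothetical test set: each question is paired with the chunk that should come back.
test_set = [
    {"question": "What is the refund window?", "expected_chunk": "refund-policy#2"},
    {"question": "How many holiday days do I get?", "expected_chunk": "hr-handbook#7"},
    {"question": "Who approves travel expenses?", "expected_chunk": "expenses#1"},
    {"question": "How do I reset my VPN token?", "expected_chunk": "it-guide#4"},
]

# Stub retriever with canned results, standing in for the real pipeline.
canned = {
    "What is the refund window?": ["refund-policy#2", "refund-policy#3"],
    "How many holiday days do I get?": ["hr-handbook#1", "hr-handbook#7"],
    "Who approves travel expenses?": ["expenses#1"],
    "How do I reset my VPN token?": ["it-guide#9", "security-policy#5"],
}


def stub_retrieve(question: str) -> list[str]:
    return canned.get(question, [])


hit_rate = retrieval_hit_rate(test_set, stub_retrieve, k=5)
print(f"retrieval hit rate: {hit_rate:.0%}")
```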

Guardrails

A production RAG system needs:

A confidence threshold below which it refuses to answer and escalates to a human
Topic filtering (the customer support bot should not answer questions about competitor pricing)
Logging of every query and response for audit and improvement
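The three guardrails can be sketched as a thin wrapper around the answer step. Everything specific here is an assumption for illustration: the 0.75 threshold, the use of the top retrieval score as a confidence signal, the substring-based topic filter (a real system would more likely use a classifier), and the reply wording.

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("rag.guardrails")

BLOCKED_TOPICS = ("competitor pricing",)  # hypothetical topic list
MIN_CONFIDENCE = 0.75                     # assumed threshold; tune against the test set

ESCALATION_REPLY = "I'm not certain about this one, so I'm passing it to a human colleague."
OFF_TOPIC_REPLY = "Sorry, I can't help with that topic."


def guarded_answer(question: str, draft_answer: str, top_score: float) -> str:
    """Apply topic filtering and a confidence threshold, and log every query and outcome."""
    if any(topic in question.lower() for topic in BLOCKED_TOPICS):
        outcome, reply = "blocked", OFF_TOPIC_REPLY
    elif top_score < MIN_CONFIDENCE:
        outcome, reply = "escalated", ESCALATION_REPLY
    else:
        outcome, reply = "answered", draft_answer
    logger.info("query=%r outcome=%s top_score=%.2f", question, outcome, top_score)
    return reply


print(guarded_answer("What is your refund window?", "14 days from purchase.", top_score=0.91))
print(guarded_answer("What is your refund window?", "14 days from purchase.", top_score=0.40))
print(guarded_answer("How does your competitor pricing compare?", "n/a", top_score=0.95))
```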

Is RAG right for your business?

RAG is worth considering if:

You have more than a few hundred internal documents that staff currently search manually
Your customer support team spends significant time answering the same questions
You have tried a general chatbot (like ChatGPT) and found it makes things up or cannot answer your specific questions
Your team loses productive time searching across multiple systems for information

RAG is not the right solution if:

Your document set is small enough to fit in a model's context window
You need the model to learn a new reasoning style (fine-tuning is better)
Your data changes so frequently that indexing cannot keep up

What it costs

A basic RAG system for internal use can be built in two to four weeks by an experienced AI engineer. The ongoing operational cost is typically $100–400/month for a small business, covering embedding API calls, vector database storage, and LLM inference.

A production-grade customer-facing RAG system with evaluation, re-ranking, and guardrails is a more involved build — typically four to eight weeks — but the return in support team time saved and customer experience improvement justifies it for businesses with significant support volume.

If you are curious whether RAG is the right tool for your use case, our AI engineering team can assess your document landscape, define the right architecture, and deliver a working prototype in under a month. Book a free scoping call.
