Large language models have a fundamental limitation: their knowledge is frozen at training time. Retrieval-Augmented Generation (RAG) addresses this by combining LLMs with external knowledge retrieval, enabling responses grounded in current information, private documents, or specialized data.
What Is RAG?
RAG is an architecture that retrieves relevant information from a knowledge base and includes it in the prompt before generating a response.
flowchart LR
A[User Query] --> B[Retrieval System]
B --> C[(Knowledge Base)]
C --> D[Relevant Documents]
D --> E[Augmented Prompt]
A --> E
E --> F[LLM]
F --> G[Response]
style B fill:#8b5cf6,color:#fff
style C fill:#3b82f6,color:#fff
style F fill:#10b981,color:#fff
Without RAG
User: What's our company's refund policy?
LLM: I don't have access to your company's specific policies...
With RAG
[Retrieved context: "Our refund policy allows returns within 30 days
of purchase. Items must be unused and in original packaging. Digital
products are non-refundable after download."]
User: What's our company's refund policy?
LLM: Based on your company policy, refunds are available within 30 days
of purchase. Items need to be unused and in original packaging. Note
that digital products cannot be refunded once downloaded.
The RAG Pipeline
1. Document Ingestion
Documents are processed and converted into searchable chunks:
flowchart TD
A[Raw Documents] --> B[Text Extraction]
B --> C[Chunking]
C --> D[Embedding]
D --> E[(Vector Database)]
style A fill:#f59e0b,color:#fff
style E fill:#3b82f6,color:#fff
Chunking Strategies:
| Strategy | Description | Best For |
|---|---|---|
| Fixed size | Split by character/token count | Simple documents |
| Sentence | Split at sentence boundaries | Narrative text |
| Paragraph | Split at paragraph breaks | Structured documents |
| Semantic | Split by topic/meaning | Complex content |
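These steps can be wired together in a short ingestion script. The sketch below is illustrative only: `extractText` and `chunkText` are hypothetical helpers, while `embedText` and `vectorDB` mirror the placeholder interfaces used in the query-processing example below, not a specific library's API.

```javascript
// Illustrative ingestion sketch. extractText and chunkText are hypothetical
// helpers; embedText and vectorDB mirror the placeholders used elsewhere in
// this article rather than any particular vector database client.
async function ingestDocument(filePath) {
  // 1. Extract plain text from the source document (PDF, HTML, etc.)
  const text = await extractText(filePath);

  // 2. Split the text into overlapping chunks
  const chunks = chunkText(text, { maxTokens: 400, overlapTokens: 50 });

  // 3. Embed each chunk and store it alongside its source metadata
  for (const [index, chunk] of chunks.entries()) {
    const embedding = await embedText(chunk);
    await vectorDB.upsert({
      id: `${filePath}#${index}`,
      vector: embedding,
      text: chunk,
      metadata: { source: filePath, chunkIndex: index }
    });
  }
}
```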
2. Query Processing
When a user asks a question:
async function processQuery(userQuery) {
// 1. Convert query to embedding
const queryEmbedding = await embedText(userQuery);
// 2. Search vector database for similar chunks
const relevantChunks = await vectorDB.search(queryEmbedding, {
topK: 5,
threshold: 0.7
});
// 3. Build augmented prompt
const context = relevantChunks.map(c => c.text).join('\n\n');
return buildPrompt(context, userQuery);
}
3. Prompt Construction
Combine retrieved context with the user's question:
Use the following context to answer the question. If the answer is not
in the context, say "I don't have enough information to answer that."
Context:
---
{retrieved_documents}
---
Question: {user_question}
Answer:
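The `buildPrompt` placeholder from the query-processing example above can simply fill this template with strings; the exact wording below follows the template but is otherwise an assumption, not a canonical prompt.

```javascript
// Fills the template above with the retrieved context and the user's question.
function buildPrompt(context, userQuestion) {
  return `Use the following context to answer the question. If the answer is not
in the context, say "I don't have enough information to answer that."

Context:
---
${context}
---

Question: ${userQuestion}
Answer:`;
}
```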
Embedding and Vector Search
Embeddings convert text into numerical vectors that capture semantic meaning:
flowchart LR
A["'What is machine learning?'"] --> B[Embedding Model]
B --> C["[0.23, -0.45, 0.12, ...]"]
D["'Explain ML algorithms'"] --> B
B --> E["[0.21, -0.43, 0.14, ...]"]
C --> F{Similar vectors}
E --> F
style B fill:#8b5cf6,color:#fff
style F fill:#10b981,color:#fff
Similarity Measures
| Measure | Description | Range |
|---|---|---|
| Cosine similarity | Angle between vectors | -1 to 1 |
| Euclidean distance | Straight-line distance | 0 to ∞ |
| Dot product | Magnitude-aware similarity | -∞ to ∞ |
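For intuition, here is cosine similarity computed by hand on two embedding vectors. In practice the vector database computes this for you, so the function is purely illustrative.

```javascript
// Cosine similarity: dot product divided by the product of the vector
// magnitudes. Values near 1 mean the vectors point in almost the same direction.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// The two example embeddings from the diagram above score close to 1
cosineSimilarity([0.23, -0.45, 0.12], [0.21, -0.43, 0.14]); // ≈ 1.0
```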
RAG Best Practices
1. Chunk Size Optimization
Chunks that are too small:
- Lose context
- More retrieval noise
- Incomplete information
Chunks that are too large:
- Diluted relevance
- Token limit issues
- Slower processing
Sweet spot: roughly 200-500 tokens per chunk, with some overlap between adjacent chunks (see the sketch below).
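A minimal version of the fixed-size strategy with overlap might look like this. It splits on whitespace as a rough stand-in for tokens; a real implementation would count tokens with the tokenizer of your embedding model. This is also the hypothetical `chunkText` helper assumed in the ingestion sketch earlier.

```javascript
// Fixed-size chunking with overlap, using words as a rough proxy for tokens.
function chunkText(text, { maxTokens = 400, overlapTokens = 50 } = {}) {
  const words = text.split(/\s+/).filter(Boolean);
  const chunks = [];
  const step = maxTokens - overlapTokens;
  for (let start = 0; start < words.length; start += step) {
    chunks.push(words.slice(start, start + maxTokens).join(' '));
    // Stop once the final words are included to avoid a tiny trailing chunk
    if (start + maxTokens >= words.length) break;
  }
  return chunks;
}
```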
2. Add Metadata for Filtering
const document = {
text: "Our Q3 2024 revenue increased by 15%...",
metadata: {
source: "quarterly_report",
date: "2024-10-01",
department: "finance",
confidentiality: "internal"
}
};
// Filter retrieval by metadata
const results = await vectorDB.search(query, {
filter: {
department: "finance",
date: { $gte: "2024-01-01" }
}
});
3. Hybrid Search
Combine vector search with keyword search for better results:
flowchart TD
A[Query] --> B[Vector Search]
A --> C[Keyword Search]
B --> D[Semantic Results]
C --> E[Exact Match Results]
D --> F[Fusion/Reranking]
E --> F
F --> G[Final Results]
style F fill:#10b981,color:#fff
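A common choice for the fusion step is reciprocal rank fusion (RRF), which merges ranked lists without requiring their scores to be comparable. The sketch below assumes each result object carries a stable `id`; it is one reasonable implementation, not the only one.

```javascript
// Reciprocal rank fusion: each result scores 1 / (k + rank) in every list it
// appears in, and the scores are summed. k (commonly 60) damps the influence
// of top ranks so no single list dominates the fused ordering.
function reciprocalRankFusion(resultLists, k = 60) {
  const scores = new Map();
  for (const results of resultLists) {
    results.forEach((result, rank) => {
      const entry = scores.get(result.id) || { result, score: 0 };
      entry.score += 1 / (k + rank + 1);
      scores.set(result.id, entry);
    });
  }
  return [...scores.values()]
    .sort((a, b) => b.score - a.score)
    .map(entry => entry.result);
}

// Usage: fuse semantic and keyword results into one ranked list
// const finalResults = reciprocalRankFusion([semanticResults, keywordResults]);
```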
4. Context Window Management
function buildContext(chunks, maxTokens = 3000) {
let context = [];
let tokenCount = 0;
for (const chunk of chunks) {
const chunkTokens = countTokens(chunk.text);
if (tokenCount + chunkTokens > maxTokens) break;
context.push(chunk.text);
tokenCount += chunkTokens;
}
return context.join('\n\n---\n\n');
}
Advanced RAG Patterns
Query Transformation
Improve retrieval by reformulating queries:
Original query: "Why isn't it working?"
Transformed queries:
1. "Common errors and troubleshooting steps"
2. "Error messages and their solutions"
3. "Debugging guide for [product]"
Multi-Query RAG
Generate multiple query variations and merge results:
async function multiQueryRAG(originalQuery) {
  // Generate query variations (the LLM returns one variation per line)
  const response = await llm.generate(`
    Generate 3 different ways to ask this question, one per line:
    "${originalQuery}"
  `);
  const variations = response
    .split('\n')
    .map(line => line.trim())
    .filter(Boolean);

  // Retrieve for the original query and each variation
  const allResults = await Promise.all(
    [originalQuery, ...variations].map(q => retrieve(q))
  );

  // Deduplicate overlapping chunks and rerank the merged results
  return rerank(deduplicate(allResults.flat()));
}
Self-RAG (Critique and Refine)
Have the LLM evaluate retrieval quality:
Given this context and question, first evaluate:
1. Is the context relevant to the question? (yes/no)
2. Does the context contain enough information? (yes/no)
3. Is the information current rather than outdated? (yes/no)
If all answers are "yes", provide your answer.
If not, explain what additional information is needed.
Context: {context}
Question: {question}
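Wired into code, Self-RAG becomes a retrieve-critique-answer loop: if the model reports that the context is insufficient, retrieval is widened and retried. The sketch below layers some assumptions on top of the template above: a hypothetical `critiquePrompt` helper that fills it in and adds a "NEED MORE CONTEXT" sentinel for the insufficient case, and a `retrieve` placeholder that accepts a `topK` option.

```javascript
// Self-RAG sketch: retrieve, let the model critique the context, and retry
// with more chunks if it flags the context as insufficient. critiquePrompt
// and the "NEED MORE CONTEXT" sentinel are assumptions, not part of the
// template above; retrieve and llm.generate are the same placeholders used
// earlier in this article.
async function selfRAG(question, maxAttempts = 2) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    // Retrieve more chunks on each retry
    const chunks = await retrieve(question, { topK: 5 * (attempt + 1) });
    const context = chunks.map(c => c.text).join('\n\n');

    const response = await llm.generate(critiquePrompt(context, question));
    if (!response.includes('NEED MORE CONTEXT')) {
      return response; // Critique passed; the response contains the answer
    }
  }
  return "I don't have enough information to answer that.";
}
```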
Common RAG Challenges
1. Retrieval Quality
| Problem | Solution |
|---|---|
| Irrelevant results | Improve chunking, add metadata filters |
| Missing information | Increase top-K, use hybrid search |
| Outdated content | Implement document versioning |
2. Hallucination Prevention
Answer based ONLY on the provided context.
If the information is not in the context, respond with:
"I cannot find this information in the available documents."
DO NOT use any external knowledge or make assumptions.
3. Citation and Attribution
When answering, cite your sources in this format:
- Use [1], [2], etc. for inline citations
- List sources at the end with document names
Context:
[1] company_policy.pdf: "Employees receive 20 vacation days..."
[2] hr_handbook.pdf: "Unused vacation days can be carried over..."
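On the retrieval side, this works best when the context block itself is numbered so the model has stable labels to cite. A small sketch that builds such a block from retrieved chunks, assuming each chunk carries the `metadata.source` field shown in the metadata example earlier:

```javascript
// Builds a numbered context block so the model can cite [1], [2], ... and the
// application can map citations back to source documents.
function buildCitedContext(chunks) {
  const context = chunks
    .map((chunk, i) => `[${i + 1}] ${chunk.metadata.source}: "${chunk.text}"`)
    .join('\n');
  const sources = chunks.map((chunk, i) => `[${i + 1}] ${chunk.metadata.source}`);
  return { context, sources };
}
```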
Summary
| Component | Purpose | Key Consideration |
|---|---|---|
| Chunking | Break documents into searchable units | Size and overlap |
| Embedding | Convert text to vectors | Model selection |
| Vector DB | Store and search embeddings | Scalability |
| Retrieval | Find relevant context | Precision vs recall |
| Prompt | Combine context with query | Clear instructions |
RAG transforms LLMs from static knowledge systems into dynamic tools that can work with your specific data. By combining the reasoning capabilities of LLMs with accurate information retrieval, you can build reliable AI applications for documentation, customer support, research, and more.