Large language models have a fundamental limitation: their knowledge is frozen at training time. Retrieval-Augmented Generation (RAG) addresses this by combining LLMs with external knowledge retrieval, enabling responses grounded in current information, private documents, or specialized data.
What Is RAG?
RAG is an architecture that retrieves relevant information from a knowledge base and includes it in the prompt before generating a response.
flowchart LR
A[User Query] --> B[Retrieval System]
B --> C[(Knowledge Base)]
C --> D[Relevant Documents]
D --> E[Augmented Prompt]
A --> E
E --> F[LLM]
F --> G[Response]
style B fill:#8b5cf6,color:#fff
style C fill:#3b82f6,color:#fff
style F fill:#10b981,color:#fff
Without RAG
User: What's our company's refund policy?
LLM: I don't have access to your company's specific policies...
With RAG
[Retrieved context: "Our refund policy allows returns within 30 days
of purchase. Items must be unused and in original packaging. Digital
products are non-refundable after download."]
User: What's our company's refund policy?
LLM: Based on your company policy, refunds are available within 30 days
of purchase. Items need to be unused and in original packaging. Note
that digital products cannot be refunded once downloaded.
The RAG Pipeline
1. Document Ingestion
Documents are processed and converted into searchable chunks:
flowchart TD
A[Raw Documents] --> B[Text Extraction]
B --> C[Chunking]
C --> D[Embedding]
D --> E[(Vector Database)]
style A fill:#f59e0b,color:#fff
style E fill:#3b82f6,color:#fff
Chunking Strategies:
| Strategy | Description | Best For |
|---|---|---|
| Fixed size | Split by character/token count | Simple documents |
| Sentence | Split at sentence boundaries | Narrative text |
| Paragraph | Split at paragraph breaks | Structured documents |
| Semantic | Split by topic/meaning | Complex content |
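These steps can be wired together in a short ingestion script. The sketch below is illustrative only: `extractText` and `chunkText` are hypothetical helpers, while `embedText` and `vectorDB` mirror the placeholder interfaces used in the query-processing example below, not a specific library's API.

```javascript
// Illustrative ingestion sketch. extractText and chunkText are hypothetical
// helpers; embedText and vectorDB mirror the placeholders used elsewhere in
// this article rather than any particular vector database client.
async function ingestDocument(filePath) {
  // 1. Extract plain text from the source document (PDF, HTML, etc.)
  const text = await extractText(filePath);

  // 2. Split the text into overlapping chunks
  const chunks = chunkText(text, { maxTokens: 400, overlapTokens: 50 });

  // 3. Embed each chunk and store it alongside its source metadata
  for (const [index, chunk] of chunks.entries()) {
    const embedding = await embedText(chunk);
    await vectorDB.upsert({
      id: `${filePath}#${index}`,
      vector: embedding,
      text: chunk,
      metadata: { source: filePath, chunkIndex: index }
    });
  }
}
```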
2. Query Processing
When a user asks a question:
async function processQuery(userQuery) {
// 1. Convert query to embedding
const queryEmbedding = await embedText(userQuery);
// 2. Search vector database for similar chunks
const relevantChunks = await vectorDB.search(queryEmbedding, {
topK: 5,
threshold: 0.7
});
// 3. Build augmented prompt
const context = relevantChunks.map(c => c.text).join('\n\n');
return buildPrompt(context, userQuery);
}
3. Prompt Construction
Combine retrieved context with the user's question:
Use the following context to answer the question. If the answer is not
in the context, say "I don't have enough information to answer that."
Context:
---
{retrieved_documents}
---
Question: {user_question}
Answer:
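The `buildPrompt` placeholder from the query-processing example above can simply fill this template with strings; the exact wording below follows the template but is otherwise an assumption, not a canonical prompt.

```javascript
// Fills the template above with the retrieved context and the user's question.
function buildPrompt(context, userQuestion) {
  return `Use the following context to answer the question. If the answer is not
in the context, say "I don't have enough information to answer that."

Context:
---
${context}
---

Question: ${userQuestion}
Answer:`;
}
```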
Embedding and Vector Search
Embeddings convert text into numerical vectors that capture semantic meaning:
flowchart LR
A["'What is machine learning?'"] --> B[Embedding Model]
B --> C["[0.23, -0.45, 0.12, ...]"]
D["'Explain ML algorithms'"] --> B
B --> E["[0.21, -0.43, 0.14, ...]"]
C --> F{Similar vectors}
E --> F
style B fill:#8b5cf6,color:#fff
style F fill:#10b981,color:#fff
Similarity Measures
| Measure | Description | Range |
|---|---|---|
| Cosine similarity | Angle between vectors | -1 to 1 |
| Euclidean distance | Straight-line distance | 0 to ∞ |
| Dot product | Magnitude-aware similarity | -∞ to ∞ |
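For intuition, here is cosine similarity computed by hand on two embedding vectors. In practice the vector database computes this for you, so the function is purely illustrative.

```javascript
// Cosine similarity: dot product divided by the product of the vector
// magnitudes. Values near 1 mean the vectors point in almost the same direction.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// The two example embeddings from the diagram above score close to 1
cosineSimilarity([0.23, -0.45, 0.12], [0.21, -0.43, 0.14]); // ≈ 1.0
```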
RAG Best Practices
1. Chunk Size Optimization
Chunks that are too small:
- Lose context
- More retrieval noise
- Incomplete information
Chunks that are too large:
- Diluted relevance
- Token limit issues
- Slower processing
Sweet spot: roughly 200-500 tokens per chunk, with some overlap between adjacent chunks (see the sketch below).
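A minimal version of the fixed-size strategy with overlap might look like this. It splits on whitespace as a rough stand-in for tokens; a real implementation would count tokens with the tokenizer of your embedding model. This is also the hypothetical `chunkText` helper assumed in the ingestion sketch earlier.

```javascript
// Fixed-size chunking with overlap, using words as a rough proxy for tokens.
function chunkText(text, { maxTokens = 400, overlapTokens = 50 } = {}) {
  const words = text.split(/\s+/).filter(Boolean);
  const chunks = [];
  const step = maxTokens - overlapTokens;
  for (let start = 0; start < words.length; start += step) {
    chunks.push(words.slice(start, start + maxTokens).join(' '));
    // Stop once the final words are included to avoid a tiny trailing chunk
    if (start + maxTokens >= words.length) break;
  }
  return chunks;
}
```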
2. Add Metadata for Filtering
const document = {
text: "Our Q3 2024 revenue increased by 15%...",
metadata: {
source: "quarterly_report",
date: "2024-10-01",
department: "finance",
confidentiality: "internal"
}
};
// Filter retrieval by metadata
const results = await vectorDB.search(query, {
filter: {
department: "finance",
date: { $gte: "2024-01-01" }
}
});
3. Hybrid Search
Combine vector search with keyword search for better results:
flowchart TD
A[Query] --> B[Vector Search]
A --> C[Keyword Search]
B --> D[Semantic Results]
C --> E[Exact Match Results]
D --> F[Fusion/Reranking]
E --> F
F --> G[Final Results]
style F fill:#10b981,color:#fff
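A common choice for the fusion step is reciprocal rank fusion (RRF), which merges ranked lists without requiring their scores to be comparable. The sketch below assumes each result object carries a stable `id`; it is one reasonable implementation, not the only one.

```javascript
// Reciprocal rank fusion: each result scores 1 / (k + rank) in every list it
// appears in, and the scores are summed. k (commonly 60) damps the influence
// of top ranks so no single list dominates the fused ordering.
function reciprocalRankFusion(resultLists, k = 60) {
  const scores = new Map();
  for (const results of resultLists) {
    results.forEach((result, rank) => {
      const entry = scores.get(result.id) || { result, score: 0 };
      entry.score += 1 / (k + rank + 1);
      scores.set(result.id, entry);
    });
  }
  return [...scores.values()]
    .sort((a, b) => b.score - a.score)
    .map(entry => entry.result);
}

// Usage: fuse semantic and keyword results into one ranked list
// const finalResults = reciprocalRankFusion([semanticResults, keywordResults]);
```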
4. Context Window Management
function buildContext(chunks, maxTokens = 3000) {
let context = [];
let tokenCount = 0;
for (const chunk of chunks) {
const chunkTokens = countTokens(chunk.text);
if (tokenCount + chunkTokens > maxTokens) break;
context.push(chunk.text);
tokenCount += chunkTokens;
}
return context.join('\n\n---\n\n');
}
Advanced RAG Patterns
Query Transformation
Improve retrieval by reformulating queries:
Original query: "Why isn't it working?"
Transformed queries:
1. "Common errors and troubleshooting steps"
2. "Error messages and their solutions"
3. "Debugging guide for [product]"
Multi-Query RAG
Generate multiple query variations and merge results:
async function multiQueryRAG(originalQuery) {
  // Generate query variations (the LLM returns one variation per line)
  const response = await llm.generate(`
    Generate 3 different ways to ask this question, one per line:
    "${originalQuery}"
  `);
  const variations = response
    .split('\n')
    .map(line => line.trim())
    .filter(Boolean);

  // Retrieve for the original query and each variation
  const allResults = await Promise.all(
    [originalQuery, ...variations].map(q => retrieve(q))
  );

  // Deduplicate overlapping chunks and rerank the merged results
  return rerank(deduplicate(allResults.flat()));
}
Self-RAG (Critique and Refine)
Have the LLM evaluate retrieval quality:
Given this context and question, first evaluate:
1. Is the context relevant to the question? (yes/no)
2. Does the context contain enough information? (yes/no)
3. Is the information current rather than outdated? (yes/no)
If all answers are "yes", provide your answer.
If not, explain what additional information is needed.
Context: {context}
Question: {question}
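Wired into code, Self-RAG becomes a retrieve-critique-answer loop: if the model reports that the context is insufficient, retrieval is widened and retried. The sketch below layers some assumptions on top of the template above: a hypothetical `critiquePrompt` helper that fills it in and adds a "NEED MORE CONTEXT" sentinel for the insufficient case, and a `retrieve` placeholder that accepts a `topK` option.

```javascript
// Self-RAG sketch: retrieve, let the model critique the context, and retry
// with more chunks if it flags the context as insufficient. critiquePrompt
// and the "NEED MORE CONTEXT" sentinel are assumptions, not part of the
// template above; retrieve and llm.generate are the same placeholders used
// earlier in this article.
async function selfRAG(question, maxAttempts = 2) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    // Retrieve more chunks on each retry
    const chunks = await retrieve(question, { topK: 5 * (attempt + 1) });
    const context = chunks.map(c => c.text).join('\n\n');

    const response = await llm.generate(critiquePrompt(context, question));
    if (!response.includes('NEED MORE CONTEXT')) {
      return response; // Critique passed; the response contains the answer
    }
  }
  return "I don't have enough information to answer that.";
}
```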
Common RAG Challenges
1. Retrieval Quality
| Problem | Solution |
|---|---|
| Irrelevant results | Improve chunking, add metadata filters |
| Missing information | Increase top-K, use hybrid search |
| Outdated content | Implement document versioning |
2. Hallucination Prevention
Answer based ONLY on the provided context.
If the information is not in the context, respond with:
"I cannot find this information in the available documents."
DO NOT use any external knowledge or make assumptions.
3. Citation and Attribution
When answering, cite your sources in this format:
- Use [1], [2], etc. for inline citations
- List sources at the end with document names
Context:
[1] company_policy.pdf: "Employees receive 20 vacation days..."
[2] hr_handbook.pdf: "Unused vacation days can be carried over..."
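On the retrieval side, this works best when the context block itself is numbered so the model has stable labels to cite. A small sketch that builds such a block from retrieved chunks, assuming each chunk carries the `metadata.source` field shown in the metadata example earlier:

```javascript
// Builds a numbered context block so the model can cite [1], [2], ... and the
// application can map citations back to source documents.
function buildCitedContext(chunks) {
  const context = chunks
    .map((chunk, i) => `[${i + 1}] ${chunk.metadata.source}: "${chunk.text}"`)
    .join('\n');
  const sources = chunks.map((chunk, i) => `[${i + 1}] ${chunk.metadata.source}`);
  return { context, sources };
}
```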
Summary
| Component | Purpose | Key Consideration |
|---|---|---|
| Chunking | Break documents into searchable units | Size and overlap |
| Embedding | Convert text to vectors | Model selection |
| Vector DB | Store and search embeddings | Scalability |
| Retrieval | Find relevant context | Precision vs recall |
| Prompt | Combine context with query | Clear instructions |
RAG transforms LLMs from static knowledge systems into dynamic tools that can work with your specific data. By combining the reasoning capabilities of LLMs with accurate information retrieval, you can build reliable AI applications for documentation, customer support, research, and more.