RAG (Retrieval-Augmented Generation)


In early 2024, a Fortune 500 company deployed an AI assistant to handle employee HR questions. Within the first week, it confidently told dozens of employees they had 60 days to file expense reports. The actual policy? 30 days—updated six months after the model's training cutoff. The cleanup cost more than the entire AI project budget.

This wasn't a hallucination in the traditional sense. The model was correctly recalling its training data. The problem was simpler and more insidious: the world had moved on, but the model's knowledge hadn't.

This is the fundamental limitation that RAG (Retrieval-Augmented Generation) solves. Instead of relying solely on frozen training data, RAG gives your agent a live connection to authoritative, up-to-date information—your company's policies, your product documentation, your customer data—and grounds its responses in that reality.

The Good News

RAG has matured rapidly. In 2023, building a RAG pipeline meant stitching together embeddings, vector databases, chunking strategies, and retrieval logic yourself. By 2025, every major provider offers turnkey RAG solutions. This chapter teaches the concepts first, then shows you how to use these managed services effectively.


1. Why RAG? The Knowledge Gap Problem

LLMs have three fundamental knowledge limitations:

| Limitation | Example | RAG Solution |
|---|---|---|
| Knowledge cutoff | "What's our Q3 2025 revenue?" → Model doesn't know | Retrieve from your financial database |
| Private data | "What's John's remaining PTO?" → Never in training data | Retrieve from your HR system |
| Stale information | "What's our refund policy?" → Policy changed last month | Retrieve current policy document |

Can't We Just Fill the Context Window?

Modern models like Gemini 2.0 offer 1M+ token context windows. If the model's knowledge is outdated, why not just paste all your documents into the prompt?

This intuitive approach has three fundamental problems:

1. Data Exceeds Context Limits

Even 1M tokens (~750k words) sounds enormous until you realize a typical enterprise knowledge base contains millions of documents. Your company's Confluence, Notion, internal wikis, Slack history, and documentation easily exceed any context window.

2. Context is a Precious Resource

The context window isn't just storage—it's the model's working memory. Every token you spend on background documents is a token unavailable for:

  • Detailed instructions and constraints
  • Conversation history
  • Intermediate reasoning steps
  • The actual user query

Filling the context with "just in case" data degrades performance on everything else.

3. Models Have Attention Limits

Even if your data fits, retrieval from within a massive context is unreliable. Research on the "Lost in the Middle" phenomenon shows that LLMs struggle to use information buried in the middle of long contexts. They attend well to the beginning and end, but accuracy drops sharply for content in between.

Attention Distribution in Long Contexts:
 
[Beginning] ████████████  High attention
[Middle]    ███            Poor attention  ← Information gets "lost"
[End]       █████████      Moderate attention

RAG solves this by retrieving only the relevant chunks and placing them prominently in the context—where the model can actually use them.


2. RAG's Limitations (Know Before You Build)

Before diving into implementation, understand what RAG cannot do. These limitations will save you from painful debugging later.

RAG Does Not Eliminate Hallucination

A common misconception: "If I ground the model in retrieved documents, it won't hallucinate."

Wrong. RAG reduces hallucination but doesn't eliminate it. The model can still:

  • Misinterpret retrieved content
  • Blend retrieved facts incorrectly
  • Confidently extrapolate beyond what the documents say
  • Ignore retrieved context entirely when it conflicts with training data

RAG ≠ Accuracy Guarantee

RAG improves factual grounding, but it's not a substitute for output validation. Critical applications still need verification layers.

Precision Data Doesn't Belong in RAG

RAG excels at semantic similarity—finding conceptually related content. It struggles with:

| Data Type | Why RAG Fails | Better Approach |
|---|---|---|
| User IDs, order numbers | USR_12345 and USR_12346 are semantically identical | Direct database lookup |
| Exact figures, prices | "$99.99" vs "$89.99" have similar embeddings | Structured query |
| Code snippets | Syntax matters more than semantics | Exact text search |
| Legal/compliance text | Every word matters | Full document retrieval |

If you need exact matches, RAG's fuzzy semantic matching can make things worse—returning "close enough" results that look right but are subtly wrong.

Sometimes Primitive Search Wins

Here's a counterintuitive truth: for reliability-critical memory and retrieval, some production agents abandon RAG entirely in favor of simpler approaches:

  • Exact text search: Find the literal string, retrieve surrounding context
  • Keyword matching: BM25 or TF-IDF without embeddings
  • Structured storage: Key-value stores, SQL databases
  • Hybrid approaches: RAG for discovery, exact lookup for precision

The lesson: RAG is powerful for discovery ("find documents about X"), but for precision ("find the exact value of Y"), traditional methods often work better.
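A hybrid router along these lines fits in a few lines of Python: identifier-like queries take the exact-match path, and everything else falls through to semantic search. The `semantic_search` below is a naive keyword-overlap stand-in purely for illustration; a real system would call an embedding model and vector store instead.

```python
import re

def semantic_search(query: str, documents: list[str]) -> list[str]:
    # Stand-in for a real embedding + vector search call
    q = set(query.lower().split())
    return sorted(documents, key=lambda d: -len(q & set(d.lower().split())))[:3]

def route_query(query: str, documents: list[str]) -> list[str]:
    """Precision path for IDs and codes, discovery path for everything else."""
    # Identifier-like tokens: order numbers, user IDs, error codes, etc.
    ids = re.findall(r"\b[A-Z]{2,}_?\w*\d+\w*\b", query)
    if ids:
        # Exact substring match: no embeddings, no "close enough" results
        return [doc for doc in documents if any(i in doc for i in ids)]
    return semantic_search(query, documents)

docs = [
    "Order ORD_4412 was refunded on 2025-03-02.",
    "Refund policy: customers have 30 days to request a refund.",
]
print(route_query("status of ORD_4412", docs))  # exact path: only the ORD_4412 doc
```

The regex is deliberately crude; the point is the routing decision, not the pattern.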


3. Core Concepts

With limitations understood, let's learn the building blocks. Once you know these, the system diagram will make immediate sense.

Embeddings: The Language of Similarity

An embedding is a vector (list of numbers) that represents the meaning of text. Similar concepts produce similar vectors.

# Conceptually, embeddings capture semantic meaning
embed("Dog")      # [0.23, 0.67, 0.12, ...]
embed("Puppy")    # [0.25, 0.65, 0.14, ...]  ← Very close!
embed("Server")   # [0.89, 0.02, 0.45, ...]  ← Very different
 
# Similarity is measured by cosine distance
similarity("Dog", "Puppy")   # 0.97 (almost identical)
similarity("Dog", "Server")  # 0.12 (unrelated)

This is why RAG can find "lunar landing 1969" when you search for "moon mission"—they occupy the same neighborhood in embedding space, even with zero keyword overlap.

Popular embedding models:

  • OpenAI: text-embedding-3-small, text-embedding-3-large
  • Google: text-embedding-004
  • Open source: nomic-embed-text, bge-large
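To make similarity concrete, here is a minimal, self-contained sketch of cosine similarity over toy three-dimensional vectors. The numbers are invented for illustration; real embedding models output hundreds to thousands of dimensions.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot product divided by the product of magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" (real models use 256-3072 dimensions)
dog = [0.23, 0.67, 0.12]
puppy = [0.25, 0.65, 0.14]
server = [0.89, 0.02, 0.45]

print(cosine_similarity(dog, puppy))   # high: near-synonyms sit close together
print(cosine_similarity(dog, server))  # much lower: unrelated concepts
```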

Vector Databases: The Retrieval Engine

A vector database stores embeddings and enables fast similarity search across millions of documents. When you query, it finds the vectors closest to your query vector.

| Type | Options | When to Use |
|---|---|---|
| Managed (recommended) | Vertex AI Vector Search, OpenAI Vector Stores, Pinecone | Production workloads |
| Self-hosted | pgvector, Chroma, Milvus | Prototyping, cost control, data sovereignty |

Enterprise Recommendation

For production workloads, use your cloud provider's managed solution. Vertex AI RAG Engine or OpenAI Vector Stores handle embedding, chunking, indexing, and retrieval—letting you focus on your application logic rather than infrastructure.

Chunking: Splitting Documents Intelligently

Documents must be split into chunks before embedding. Chunk size affects retrieval quality:

| Chunk Size | Pros | Cons |
|---|---|---|
| Small (100-200 tokens) | Precise retrieval | May lose surrounding context |
| Medium (400-800 tokens) | Good balance; the standard choice | Few for typical content |
| Large (1000+ tokens) | Full context | May include irrelevant content |

Overlap ensures continuity. A 400-token chunk with 100-token overlap prevents sentences from being cut mid-thought:

Document: [1000 tokens total]
 
Chunk 1: [tokens 0-400]
Chunk 2: [tokens 300-700]    ← 100 token overlap with Chunk 1
Chunk 3: [tokens 600-1000]   ← 100 token overlap with Chunk 2

Semantic chunking (splitting at paragraph/section boundaries) often outperforms fixed-size chunking. Most managed services handle this automatically.
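The fixed-size-with-overlap scheme above is a short loop. This sketch works on a pre-tokenized list; production systems would use a real tokenizer, and often semantic boundaries instead of fixed sizes.

```python
def chunk_tokens(tokens: list[str], chunk_size: int = 400, overlap: int = 100) -> list[list[str]]:
    """Split a token list into overlapping fixed-size chunks."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last chunk reached the end of the document
    return chunks

tokens = [f"tok{i}" for i in range(1000)]
chunks = chunk_tokens(tokens)  # three chunks: tokens 0-400, 300-700, 600-1000
```

With the defaults this reproduces the diagram above: each chunk shares its last 100 tokens with the next chunk's first 100.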


4. How RAG Works

Now that you understand embeddings, vector databases, and chunking, here's how they fit together:

The Four Steps

  1. Query: User asks a question ("How do I reset the server?")

  2. Retrieve: The system converts the query into an embedding and searches the vector database for semantically similar document chunks. This finds relevant content even without exact keyword matches.

  3. Augment: The retrieved chunks are injected into the prompt as context:

    Context:
    [Server Administration Guide, Section 4.2]
    To reset the production server, SSH into admin@prod-01...
     
    [Incident Response Playbook]
    Server resets require approval from on-call SRE...
     
    Question: How do I reset the server?
  4. Generate: The LLM answers using the provided context, grounding its response in your actual documentation.

The key insight: the LLM never searches directly. It only sees what the retrieval system puts in front of it. This is why retrieval quality determines RAG quality.
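The augmentation step is ultimately just string formatting. A minimal sketch, using a made-up chunk dictionary shape for illustration:

```python
def build_augmented_prompt(question: str, chunks: list[dict]) -> str:
    """Format retrieved chunks into a grounded prompt (step 3: Augment)."""
    context = "\n\n".join(
        f"[{chunk['source']}]\n{chunk['text']}" for chunk in chunks
    )
    return f"Context:\n{context}\n\nQuestion: {question}"

prompt = build_augmented_prompt(
    "How do I reset the server?",
    [
        {"source": "Server Administration Guide, Section 4.2",
         "text": "To reset the production server, SSH into admin@prod-01..."},
        {"source": "Incident Response Playbook",
         "text": "Server resets require approval from on-call SRE..."},
    ],
)
print(prompt)
```

Whatever this function produces is all the LLM ever sees of your knowledge base.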


5. Using Managed RAG Services

Modern providers offer end-to-end RAG infrastructure. You upload documents, they handle chunking, embedding, indexing, and retrieval.

OpenAI Vector Stores API

OpenAI's Retrieval API provides a clean, self-contained RAG solution:

from openai import OpenAI
 
client = OpenAI()
 
# 1. Create a vector store
vector_store = client.vector_stores.create(name="product-docs")
 
# 2. Upload and index documents (chunking + embedding handled automatically)
client.vector_stores.files.upload_and_poll(
    vector_store_id=vector_store.id,
    file=open("server_admin_guide.pdf", "rb"),
)
 
# 3. Search for relevant content
results = client.vector_stores.search(
    vector_store_id=vector_store.id,
    query="How do I reset the server?",
)
 
# 4. Use results with Chat Completions
context = "\n".join([chunk.content[0].text for chunk in results.data])
 
response = client.chat.completions.create(
    model="gpt-4.1",
    messages=[
        {"role": "system", "content": f"Answer based on this context:\n{context}"},
        {"role": "user", "content": "How do I reset the server?"},
    ],
)

Google Vertex AI RAG Engine

Vertex AI RAG Engine provides a fully managed pipeline that integrates natively with Gemini:

from vertexai import rag
from vertexai.generative_models import GenerativeModel, Tool
import vertexai
 
# Initialize Vertex AI
vertexai.init(project="your-project", location="us-central1")
 
# 1. Create a RAG Corpus (your knowledge base)
corpus = rag.create_corpus(display_name="product-docs")
 
# 2. Import your documents
rag.import_files(
    corpus.name,
    paths=["gs://your-bucket/docs/"],  # GCS path
    chunk_size=512,
    chunk_overlap=100,
)
 
# 3. Create a retrieval tool for Gemini
rag_tool = Tool.from_retrieval(
    retrieval=rag.Retrieval(
        source=rag.VertexRagStore(
            rag_corpora=[corpus.name],
            similarity_top_k=5,
        ),
    )
)
 
# 4. Use with Gemini (retrieval happens automatically)
model = GenerativeModel("gemini-2.0-flash", tools=[rag_tool])
response = model.generate_content("How do I reset the server?")
print(response.text)

Which Provider to Choose?

| Provider | Best For | Key Features |
|---|---|---|
| OpenAI Vector Stores | OpenAI ecosystem users | Simple API, built-in chunking, query rewriting |
| Vertex AI RAG Engine | Google Cloud / Gemini users | Native Gemini integration, GCS support |
| Pinecone / Weaviate | Multi-model, vendor-agnostic | Flexibility, hybrid search, metadata filtering |
| pgvector | Existing PostgreSQL users | No new infrastructure, SQL familiarity |

6. 🔨 Project: Document Q&A Bot

Let's build a complete RAG-powered Q&A bot. We'll use OpenAI's Vector Stores API since it handles the most complexity for us.

Setup

pip install openai
export OPENAI_API_KEY="your-key"

Implementation

from openai import OpenAI
 
client = OpenAI()
 
def create_knowledge_base(name: str, file_paths: list[str]) -> str:
    """Create a vector store and upload documents."""
    # Create vector store
    vector_store = client.vector_stores.create(name=name)
    
    # Upload each file
    for path in file_paths:
        print(f"Uploading {path}...")
        with open(path, "rb") as f:
            client.vector_stores.files.upload_and_poll(
                vector_store_id=vector_store.id,
                file=f,
            )
    
    print(f"✓ Knowledge base '{name}' created with {len(file_paths)} documents")
    return vector_store.id
 
 
def ask(vector_store_id: str, question: str) -> str:
    """Ask a question against the knowledge base."""
    # 1. Retrieve relevant chunks
    results = client.vector_stores.search(
        vector_store_id=vector_store_id,
        query=question,
        max_num_results=5,
        rewrite_query=True,  # Clean up messy queries
    )
    
    # 2. Format context from results
    context_parts = []
    for i, result in enumerate(results.data, 1):
        text = "\n".join(c.text for c in result.content)
        context_parts.append(f"[Source {i}: {result.filename}]\n{text}")
    
    context = "\n\n".join(context_parts)
    
    # 3. Generate grounded response
    response = client.chat.completions.create(
        model="gpt-4.1",
        messages=[
            {
                "role": "system",
                "content": """You are a helpful assistant that answers questions 
based only on the provided context. If the context doesn't contain 
the answer, say "I don't have information about that in my knowledge base."
Always cite which source document you're using.""",
            },
            {
                "role": "user",
                "content": f"Context:\n{context}\n\nQuestion: {question}",
            },
        ],
    )
    
    return response.choices[0].message.content
 
 
# Usage
if __name__ == "__main__":
    # Create knowledge base (run once)
    vs_id = create_knowledge_base(
        name="product-docs",
        file_paths=[
            "docs/user_guide.pdf",
            "docs/api_reference.md",
            "docs/faq.txt",
        ],
    )
    
    # Interactive Q&A loop
    print("\nAsk questions about your documents (type 'quit' to exit)\n")
    while True:
        question = input("You: ")
        if question.lower() in ["quit", "exit"]:
            break
        
        answer = ask(vs_id, question)
        print(f"\nAssistant: {answer}\n")

What You Built

This bot demonstrates the complete RAG pattern:

| Step | What Happens |
|---|---|
| Indexing | Documents uploaded, chunked, embedded, and stored |
| Retrieval | Query embedded, similar chunks found via vector search |
| Augmentation | Retrieved chunks formatted as context in the prompt |
| Generation | LLM answers using only the retrieved context |
| Grounding | Response cites source documents |

For production, you'd add error handling, caching, and evaluation metrics—but the core pattern is exactly this.
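For illustration, here is what the first two of those production niceties might look like: a retry wrapper for transient API failures and a response cache. `ask` is the function defined in the project above, and the cache assumes its arguments are hashable strings.

```python
import functools
import time

def with_retries(fn, attempts: int = 3, backoff: float = 1.0):
    """Retry a flaky call with exponential backoff between attempts."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        for attempt in range(attempts):
            try:
                return fn(*args, **kwargs)
            except Exception:
                if attempt == attempts - 1:
                    raise  # out of retries: surface the error
                time.sleep(backoff * 2 ** attempt)
    return wrapper

@functools.lru_cache(maxsize=256)
def cached_ask(vector_store_id: str, question: str) -> str:
    # Wraps the ask() defined above; repeated questions skip the API entirely
    return ask(vector_store_id, question)
```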


7. Advanced Techniques

Once basic RAG is working, these techniques improve retrieval quality when you hit edge cases.

Hybrid Search: Best of Both Worlds

Pure vector search excels at semantic similarity but can miss exact matches. Hybrid search combines:

  • Semantic search (vectors): Finds conceptually related content
  • Keyword search (BM25): Finds exact term matches

Query: "Error code ERR_429_RATE_LIMIT"
 
Vector search alone: Might return general rate limiting docs
Keyword search alone: Finds exact error code mentions
Hybrid search: Combines both, weighted by relevance

OpenAI's Vector Stores support hybrid ranking:

results = client.vector_stores.search(
    vector_store_id=vector_store.id,
    query="ERR_429_RATE_LIMIT troubleshooting",
    ranking_options={
        "ranker": "auto",
        "score_threshold": 0.5,
    },
)
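If your vector store doesn't offer hybrid ranking, one common way to merge a semantic ranking and a keyword (BM25) ranking yourself is reciprocal rank fusion. A minimal sketch with invented document IDs; `k=60` is the conventional smoothing constant:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked result lists into one.
    Each document scores sum(1 / (k + rank)) across the lists it appears in."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

semantic = ["doc_rate_limits", "doc_quotas", "doc_errors"]      # vector search
keyword = ["doc_errors", "doc_rate_limits", "doc_changelog"]    # BM25
print(reciprocal_rank_fusion([semantic, keyword]))
```

Documents that rank well in both lists rise to the top, without needing to normalize the two systems' incomparable raw scores.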

Query Rewriting

User queries are often messy. Query rewriting transforms them into optimal search queries:

| User Query | Rewritten Query |
|---|---|
| "how do i do the thing with the server again" | "server restart procedure" |
| "whats the refund thing" | "refund policy terms conditions" |
| "ERR_429 help plz" | "ERR_429 error troubleshooting resolution" |

results = client.vector_stores.search(
    vector_store_id=vector_store.id,
    query="whats the refund thing",
    rewrite_query=True,  # Automatically optimizes the query
)
# The rewritten query appears in results.search_query

Re-ranking: Precision Over Speed

Vector databases optimize for speed, returning approximate nearest neighbors. For higher precision, use two-stage retrieval: first over-fetch a broad candidate set (say, the top 50) with fast vector search, then re-rank those candidates with a slower, more accurate model and keep only the best few.

The reranker (a specialized model like bge-reranker or Cohere Rerank) scores each candidate against the query with high accuracy.
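The two-stage pattern itself is a few lines of glue. `toy_search` and `toy_score` below are stand-ins for a real vector store query and reranker model:

```python
def two_stage_retrieve(query, vector_search, rerank_score, fetch_k=50, top_k=5):
    """Stage 1: fast approximate vector search over-fetches candidates.
    Stage 2: a slower, more accurate reranker keeps only the best few."""
    candidates = vector_search(query, k=fetch_k)
    scored = [(rerank_score(query, c), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:top_k]]

# Toy stand-ins: a real system calls a vector DB and a reranker model here
docs = ["reset the staging server", "reset the production server",
        "rotate TLS certificates", "production deploy checklist"]

def toy_search(query, k):
    return docs[:k]

def toy_score(query, doc):
    return len(set(query.split()) & set(doc.split()))

print(two_stage_retrieve("reset production server", toy_search, toy_score,
                         fetch_k=4, top_k=2))
# → ['reset the production server', 'reset the staging server']
```

The over-fetch factor (`fetch_k` vs `top_k`) is the knob: larger candidate sets recover more of what approximate search missed, at the cost of more reranker calls.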

Hypothetical Document Embeddings (HyDE)

When user queries are vague or short, they don't embed well. HyDE flips the script:

  1. Ask the LLM to generate a hypothetical answer to the query
  2. Embed that hypothetical answer
  3. Search for documents similar to the answer (not the question)

# User query: "server reset"  — too short to embed meaningfully
 
# Step 1: Generate hypothetical answer
hypothetical = llm.generate(
    "Write a paragraph that would answer: 'server reset'"
)
# "To reset the production server, first ensure all active 
#  connections are drained. SSH into the admin console..."
 
# Step 2: Search using the hypothetical answer's embedding
results = vector_db.search(embed(hypothetical), top_k=5)

HyDE is powerful for vague queries but adds latency (one extra LLM call). Use it selectively.


8. RAG in Your Agent's Toolbox

In previous chapters, we built agents with tools. RAG is just another tool—one that retrieves knowledge from your private data.

Exposing RAG as a Tool

With Google ADK and Vertex AI RAG Engine:

from google.adk.agents import Agent
from google.adk.tools.retrieval.vertex_ai_rag_retrieval import VertexAiRagRetrieval
from vertexai import rag

# Create the RAG retrieval tool (class and parameter names may vary by ADK version)
rag_tool = VertexAiRagRetrieval(
    name="retrieve_docs",
    description="Search internal product documentation and policies",
    rag_resources=[
        rag.RagResource(
            rag_corpus="projects/your-project/locations/us-central1/ragCorpora/docs",
        )
    ],
    similarity_top_k=5,
)

# Give it to your agent alongside other tools
# (web_search and calculator are placeholder tools defined elsewhere)
agent = Agent(
    name="assistant",
    model="gemini-2.0-flash",
    tools=[rag_tool, web_search, calculator],
    instruction="""You are a helpful assistant. Choose the right tool:
    - Internal policies/docs → RAG tool
    - Current events/public info → Web search
    - Math calculations → Calculator

    Always cite your sources.""",
)

RAG vs. Web Search: When to Use Which

| Scenario | Use RAG | Use Web Search |
|---|---|---|
| Internal company policies | ✓ | |
| Product documentation | ✓ | |
| Customer-specific data | ✓ | |
| Current events / news | | ✓ |
| General public knowledge | | ✓ |

A well-designed agent has both tools and chooses based on the query type. The agent reasons: "This question is about our refund policy—that's internal, so I'll use the RAG tool" vs. "They're asking about today's weather—that's public info, I'll search the web."


Key Takeaways

  1. RAG bridges the knowledge gap between static training data and dynamic, private information your agent needs.

  2. Three building blocks: Embeddings (meaning as vectors), Vector Databases (fast similarity search), Chunking (splitting docs intelligently).

  3. The pattern is simple: Query → Retrieve → Augment → Generate. When something goes wrong, debug each step.

  4. Use managed services for production. OpenAI Vector Stores, Vertex AI RAG Engine, and similar offerings handle the infrastructure so you can focus on your application.

  5. Advanced techniques (hybrid search, query rewriting, re-ranking, HyDE) matter when basic RAG hits edge cases.

  6. RAG is just another tool in your agent's toolbox. Combine it with web search and other tools for a capable assistant.


Next: Memory & Persistence—giving your agent memory that persists across conversations.