RAG (Retrieval-Augmented Generation)
In early 2024, a Fortune 500 company deployed an AI assistant to handle employee HR questions. Within the first week, it confidently told dozens of employees they had 60 days to file expense reports. The actual policy? 30 days—updated six months after the model's training cutoff. The cleanup cost more than the entire AI project budget.
This wasn't a hallucination in the traditional sense. The model was correctly recalling its training data. The problem was simpler and more insidious: the world had moved on, but the model's knowledge hadn't.
This is the fundamental limitation that RAG (Retrieval-Augmented Generation) solves. Instead of relying solely on frozen training data, RAG gives your agent a live connection to authoritative, up-to-date information—your company's policies, your product documentation, your customer data—and grounds its responses in that reality.
RAG has matured rapidly. In 2023, building a RAG pipeline meant stitching together embeddings, vector databases, chunking strategies, and retrieval logic yourself. By 2025, every major provider offers turnkey RAG solutions. This chapter teaches the concepts first, then shows you how to use these managed services effectively.
1. Why RAG? The Knowledge Gap Problem
LLMs have three fundamental knowledge limitations:
| Limitation | Example | RAG Solution |
|---|---|---|
| Knowledge Cutoff | "What's our Q3 2025 revenue?" → Model doesn't know | Retrieve from your financial database |
| Private Data | "What's John's remaining PTO?" → Never in training data | Retrieve from your HR system |
| Stale Information | "What's our refund policy?" → Policy changed last month | Retrieve current policy document |
Can't We Just Fill the Context Window?
Modern models like Gemini 2.0 offer 1M+ token context windows. If the model's knowledge is outdated, why not just paste all your documents into the prompt?
This intuitive approach has three fundamental problems:
1. Data Exceeds Context Limits
Even 1M tokens (~750k words) sounds enormous until you realize a typical enterprise knowledge base contains millions of documents. Your company's Confluence, Notion, internal wikis, Slack history, and documentation easily exceed any context window.
2. Context is a Precious Resource
The context window isn't just storage—it's the model's working memory. Every token you spend on background documents is a token unavailable for:
- Detailed instructions and constraints
- Conversation history
- Intermediate reasoning steps
- The actual user query
Filling the context with "just in case" data degrades performance on everything else.
3. Models Have Attention Limits
Even if your data fits, retrieval from within a massive context is unreliable. Research on the "Lost in the Middle" phenomenon shows that LLMs struggle to use information buried in the middle of long contexts. They attend well to the beginning and end, but accuracy drops sharply for content in between.
Attention Distribution in Long Contexts:
[Beginning] ████████████ High attention
[Middle] ███ Poor attention ← Information gets "lost"
[End] █████████ Moderate attention

RAG solves this by retrieving only the relevant chunks and placing them prominently in the context—where the model can actually use them.
2. RAG's Limitations (Know Before You Build)
Before diving into implementation, understand what RAG cannot do. These limitations will save you from painful debugging later.
RAG Does Not Eliminate Hallucination
A common misconception: "If I ground the model in retrieved documents, it won't hallucinate."
Wrong. RAG reduces hallucination but doesn't eliminate it. The model can still:
- Misinterpret retrieved content
- Blend retrieved facts incorrectly
- Confidently extrapolate beyond what the documents say
- Ignore retrieved context entirely when it conflicts with training data
RAG improves factual grounding, but it's not a substitute for output validation. Critical applications still need verification layers.
Precision Data Doesn't Belong in RAG
RAG excels at semantic similarity—finding conceptually related content. It struggles with:
| Data Type | Why RAG Fails | Better Approach |
|---|---|---|
| User IDs, Order Numbers | USR_12345 and USR_12346 are semantically identical | Direct database lookup |
| Exact figures, prices | "$99.99" vs "$89.99" have similar embeddings | Structured query |
| Code snippets | Syntax matters more than semantics | Exact text search |
| Legal/compliance text | Every word matters | Full document retrieval |
If you need exact matches, RAG's fuzzy semantic matching can make things worse—returning "close enough" results that look right but are subtly wrong.
Sometimes Primitive Search Wins
Here's a counterintuitive truth: for reliability-critical memory and retrieval, some production agents abandon RAG entirely in favor of simpler approaches:
- Exact text search: Find the literal string, retrieve surrounding context
- Keyword matching: BM25 or TF-IDF without embeddings
- Structured storage: Key-value stores, SQL databases
- Hybrid approaches: RAG for discovery, exact lookup for precision
The lesson: RAG is powerful for discovery ("find documents about X"), but for precision ("find the exact value of Y"), traditional methods often work better.
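The exact-text approach above is only a few lines of code. A minimal sketch of literal-string search with surrounding context (the `docs` dict and the `window` parameter are illustrative assumptions, not a real API):

```python
def exact_search(docs: dict[str, str], needle: str, window: int = 80) -> list[dict]:
    """Find the literal string in each document and return it with surrounding context."""
    hits = []
    for doc_id, text in docs.items():
        start = text.find(needle)
        if start != -1:
            lo = max(0, start - window)
            hi = min(len(text), start + len(needle) + window)
            hits.append({"doc": doc_id, "snippet": text[lo:hi]})
    return hits

docs = {
    "expense_policy": "Expense reports must be filed within 30 days of purchase.",
    "pto_policy": "PTO accrues at 1.5 days per calendar month.",
}
print(exact_search(docs, "30 days"))
```

Unlike embedding search, this either finds the exact string or returns nothing—there is no "close enough" result to mislead the model.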
3. Core Concepts
With limitations understood, let's learn the building blocks. Once you know these, the system diagram will make immediate sense.
Embeddings: The Language of Similarity
An embedding is a vector (list of numbers) that represents the meaning of text. Similar concepts produce similar vectors.
# Conceptually, embeddings capture semantic meaning
embed("Dog") # [0.23, 0.67, 0.12, ...]
embed("Puppy") # [0.25, 0.65, 0.14, ...] ← Very close!
embed("Server") # [0.89, 0.02, 0.45, ...] ← Very different
# Similarity is measured by cosine distance
similarity("Dog", "Puppy") # 0.97 (almost identical)
similarity("Dog", "Server") # 0.12 (unrelated)

This is why RAG can find "lunar landing 1969" when you search for "moon mission"—they occupy the same neighborhood in embedding space, even with zero keyword overlap.
Popular embedding models:
- OpenAI: text-embedding-3-small, text-embedding-3-large
- Google: text-embedding-004
- Open source: nomic-embed-text, bge-large
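The similarity scores above come from a simple vector operation. A minimal cosine-similarity implementation over toy 3-dimensional "embeddings" (real embedding models return hundreds or thousands of dimensions; the vectors here are made up for illustration):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

dog = [0.23, 0.67, 0.12]
puppy = [0.25, 0.65, 0.14]
server = [0.89, 0.02, 0.45]

print(round(cosine_similarity(dog, puppy), 3))   # close to 1.0
print(round(cosine_similarity(dog, server), 3))  # much lower
```

Vector databases compute exactly this (or an equivalent distance) across millions of stored vectors, using approximate-nearest-neighbor indexes to avoid comparing against every one.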
Vector Databases: The Retrieval Engine
A vector database stores embeddings and enables fast similarity search across millions of documents. When you query, it finds the vectors closest to your query vector.
| Type | Options | When to Use |
|---|---|---|
| Managed (recommended) | Vertex AI Vector Search, OpenAI Vector Stores, Pinecone | Production workloads |
| Self-hosted | pgvector, Chroma, Milvus | Prototyping, cost control, data sovereignty |
For production workloads, use your cloud provider's managed solution. Vertex AI RAG Engine or OpenAI Vector Stores handle embedding, chunking, indexing, and retrieval—letting you focus on your application logic rather than infrastructure.
Chunking: Splitting Documents Intelligently
Documents must be split into chunks before embedding. Chunk size affects retrieval quality:
| Chunk Size | Pros | Cons |
|---|---|---|
| Small (100-200 tokens) | Precise retrieval | May lose context |
| Medium (400-800 tokens) | Good balance | Standard choice |
| Large (1000+ tokens) | Full context | May include irrelevant content |
Overlap ensures continuity. A 400-token chunk with 100-token overlap prevents sentences from being cut mid-thought:
Document: [1000 tokens total]
Chunk 1: [tokens 0-400]
Chunk 2: [tokens 300-700] ← 100 token overlap with Chunk 1
Chunk 3: [tokens 600-1000] ← 100 token overlap with Chunk 2

Semantic chunking (splitting at paragraph/section boundaries) often outperforms fixed-size chunking. Most managed services handle this automatically.
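Fixed-size chunking with overlap takes only a few lines. A minimal sketch using list slices as stand-in "tokens" (a production pipeline would use a real tokenizer):

```python
def chunk_with_overlap(tokens: list[str], size: int = 400, overlap: int = 100) -> list[list[str]]:
    """Split tokens into chunks of `size`, each sharing `overlap` tokens with the previous chunk."""
    step = size - overlap  # advance 300 tokens per chunk for size=400, overlap=100
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):  # last chunk reaches the end of the document
            break
    return chunks

tokens = [f"tok{i}" for i in range(1000)]
chunks = chunk_with_overlap(tokens, size=400, overlap=100)
print(len(chunks))    # 3 chunks, matching the diagram above
print(chunks[1][0])   # second chunk starts at token 300
```

This reproduces the diagram above: chunks covering tokens 0-400, 300-700, and 600-1000.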
4. How RAG Works
Now that you understand embeddings, vector databases, and chunking, here's how they fit together:
The Four Steps
1. Query: User asks a question ("How do I reset the server?")

2. Retrieve: The system converts the query into an embedding and searches the vector database for semantically similar document chunks. This finds relevant content even without exact keyword matches.

3. Augment: The retrieved chunks are injected into the prompt as context:

Context:
[Server Administration Guide, Section 4.2]
To reset the production server, SSH into admin@prod-01...

[Incident Response Playbook]
Server resets require approval from on-call SRE...

Question: How do I reset the server?

4. Generate: The LLM answers using the provided context, grounding its response in your actual documentation.
The key insight: the LLM never searches directly. It only sees what the retrieval system puts in front of it. This is why retrieval quality determines RAG quality.
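The four steps compose into a short loop. A minimal sketch with hypothetical `embed`, `vector_search`, and `llm` callables standing in for a real embedding model, vector database, and model API (the toy stand-ins below exist only to show the data flow):

```python
def answer_with_rag(question: str, embed, vector_search, llm, top_k: int = 5) -> str:
    # Steps 1-2: Query + Retrieve — embed the question, find similar chunks
    query_vector = embed(question)
    chunks = vector_search(query_vector, top_k=top_k)

    # Step 3: Augment — inject retrieved chunks into the prompt
    context = "\n\n".join(chunks)
    prompt = f"Context:\n{context}\n\nQuestion: {question}"

    # Step 4: Generate — the LLM answers grounded in the retrieved context
    return llm(prompt)

# Toy stand-ins for the three external systems
fake_embed = lambda text: [0.0]
fake_search = lambda vec, top_k: ["To reset the server, SSH into admin@prod-01."]
fake_llm = lambda prompt: f"Answer based on: {prompt[:40]}..."

print(answer_with_rag("How do I reset the server?", fake_embed, fake_search, fake_llm))
```

Every RAG system, managed or hand-rolled, is a variation on this loop; the managed services below simply own the `embed` and `vector_search` pieces for you.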
5. Using Managed RAG Services
Modern providers offer end-to-end RAG infrastructure. You upload documents, they handle chunking, embedding, indexing, and retrieval.
OpenAI Vector Stores API
OpenAI's Retrieval API provides a clean, self-contained RAG solution:
from openai import OpenAI
client = OpenAI()
# 1. Create a vector store
vector_store = client.vector_stores.create(name="product-docs")
# 2. Upload and index documents (chunking + embedding handled automatically)
client.vector_stores.files.upload_and_poll(
vector_store_id=vector_store.id,
file=open("server_admin_guide.pdf", "rb"),
)
# 3. Search for relevant content
results = client.vector_stores.search(
vector_store_id=vector_store.id,
query="How do I reset the server?",
)
# 4. Use results with Chat Completions
context = "\n".join([chunk.content[0].text for chunk in results.data])
response = client.chat.completions.create(
model="gpt-4.1",
messages=[
{"role": "system", "content": f"Answer based on this context:\n{context}"},
{"role": "user", "content": "How do I reset the server?"},
],
)

Google Vertex AI RAG Engine
Vertex AI RAG Engine provides a fully managed pipeline that integrates natively with Gemini:
from vertexai import rag
from vertexai.generative_models import GenerativeModel, Tool
import vertexai
# Initialize Vertex AI
vertexai.init(project="your-project", location="us-central1")
# 1. Create a RAG Corpus (your knowledge base)
corpus = rag.create_corpus(display_name="product-docs")
# 2. Import your documents
rag.import_files(
corpus.name,
paths=["gs://your-bucket/docs/"], # GCS path
chunk_size=512,
chunk_overlap=100,
)
# 3. Create a retrieval tool for Gemini
rag_tool = Tool.from_retrieval(
retrieval=rag.Retrieval(
source=rag.VertexRagStore(
rag_corpora=[corpus.name],
similarity_top_k=5,
),
)
)
# 4. Use with Gemini (retrieval happens automatically)
model = GenerativeModel("gemini-2.0-flash", tools=[rag_tool])
response = model.generate_content("How do I reset the server?")
print(response.text)

Which Provider to Choose?
| Provider | Best For | Key Features |
|---|---|---|
| OpenAI Vector Stores | OpenAI ecosystem users | Simple API, built-in chunking, query rewriting |
| Vertex AI RAG Engine | Google Cloud / Gemini users | Native Gemini integration, GCS support |
| Pinecone / Weaviate | Multi-model, vendor-agnostic | Flexibility, hybrid search, metadata filtering |
| pgvector | Existing PostgreSQL users | No new infrastructure, SQL familiarity |
6. 🔨 Project: Document Q&A Bot
Let's build a complete RAG-powered Q&A bot. We'll use OpenAI's Vector Stores API since it handles the most complexity for us.
Setup
pip install openai
export OPENAI_API_KEY="your-key"

Implementation
from openai import OpenAI
client = OpenAI()
def create_knowledge_base(name: str, file_paths: list[str]) -> str:
"""Create a vector store and upload documents."""
# Create vector store
vector_store = client.vector_stores.create(name=name)
# Upload each file
for path in file_paths:
print(f"Uploading {path}...")
with open(path, "rb") as f:
client.vector_stores.files.upload_and_poll(
vector_store_id=vector_store.id,
file=f,
)
print(f"✓ Knowledge base '{name}' created with {len(file_paths)} documents")
return vector_store.id
def ask(vector_store_id: str, question: str) -> str:
"""Ask a question against the knowledge base."""
# 1. Retrieve relevant chunks
results = client.vector_stores.search(
vector_store_id=vector_store_id,
query=question,
max_num_results=5,
rewrite_query=True, # Clean up messy queries
)
# 2. Format context from results
context_parts = []
for i, result in enumerate(results.data, 1):
text = "\n".join(c.text for c in result.content)
context_parts.append(f"[Source {i}: {result.filename}]\n{text}")
context = "\n\n".join(context_parts)
# 3. Generate grounded response
response = client.chat.completions.create(
model="gpt-4.1",
messages=[
{
"role": "system",
"content": """You are a helpful assistant that answers questions
based only on the provided context. If the context doesn't contain
the answer, say "I don't have information about that in my knowledge base."
Always cite which source document you're using.""",
},
{
"role": "user",
"content": f"Context:\n{context}\n\nQuestion: {question}",
},
],
)
return response.choices[0].message.content
# Usage
if __name__ == "__main__":
# Create knowledge base (run once)
vs_id = create_knowledge_base(
name="product-docs",
file_paths=[
"docs/user_guide.pdf",
"docs/api_reference.md",
"docs/faq.txt",
],
)
# Interactive Q&A loop
print("\nAsk questions about your documents (type 'quit' to exit)\n")
while True:
question = input("You: ")
if question.lower() in ["quit", "exit"]:
break
answer = ask(vs_id, question)
print(f"\nAssistant: {answer}\n")

What You Built
This bot demonstrates the complete RAG pattern:
| Step | What Happens |
|---|---|
| Indexing | Documents uploaded, chunked, embedded, and stored |
| Retrieval | Query embedded, similar chunks found via vector search |
| Augmentation | Retrieved chunks formatted as context in the prompt |
| Generation | LLM answers using only the retrieved context |
| Grounding | Response cites source documents |
For production, you'd add error handling, caching, and evaluation metrics—but the core pattern is exactly this.
7. Advanced Techniques
Once basic RAG is working, these techniques improve retrieval quality when you hit edge cases.
Hybrid Search: Best of Both Worlds
Pure vector search excels at semantic similarity but can miss exact matches. Hybrid search combines:
- Semantic search (vectors): Finds conceptually related content
- Keyword search (BM25): Finds exact term matches
Query: "Error code ERR_429_RATE_LIMIT"
Vector search alone: Might return general rate limiting docs
Keyword search alone: Finds exact error code mentions
Hybrid search: Combines both, weighted by relevance

OpenAI's Vector Stores support hybrid ranking:
results = client.vector_stores.search(
vector_store_id=vector_store.id,
query="ERR_429_RATE_LIMIT troubleshooting",
ranking_options={
"ranker": "auto",
"score_threshold": 0.5,
},
)

Query Rewriting
User queries are often messy. Query rewriting transforms them into optimal search queries:
| User Query | Rewritten Query |
|---|---|
| "how do i do the thing with the server again" | "server restart procedure" |
| "whats the refund thing" | "refund policy terms conditions" |
| "ERR_429 help plz" | "ERR_429 error troubleshooting resolution" |
results = client.vector_stores.search(
vector_store_id=vector_store.id,
query="whats the refund thing",
rewrite_query=True, # Automatically optimizes the query
)
# The rewritten query appears in results.search_query

Re-ranking: Precision Over Speed
Vector databases optimize for speed, returning approximate nearest neighbors. For higher precision, use two-stage retrieval: a fast vector search first returns a broad candidate set (say, the top 50 chunks), then a second stage re-scores those candidates and keeps only the best few.
The reranker (a specialized model like bge-reranker or Cohere Rerank) scores each candidate against the query with high accuracy.
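A sketch of the two-stage pattern, with hypothetical `vector_search` and `rerank_score` callables standing in for a real vector database and a real cross-encoder such as bge-reranker or Cohere Rerank:

```python
def two_stage_retrieve(query: str, vector_search, rerank_score,
                       fetch_k: int = 50, top_k: int = 5) -> list[str]:
    """Stage 1: fast approximate vector search casts a wide net.
    Stage 2: a precise reranker re-scores candidates and keeps the best."""
    candidates = vector_search(query, top_k=fetch_k)
    scored = sorted(candidates, key=lambda doc: rerank_score(query, doc), reverse=True)
    return scored[:top_k]

# Toy stand-ins: "rerank" by word overlap with the query
fake_search = lambda q, top_k: ["reset the server", "rate limits", "server logs"]
overlap = lambda q, doc: len(set(q.lower().split()) & set(doc.lower().split()))

print(two_stage_retrieve("how to reset the server", fake_search, overlap, top_k=1))
```

The design trade-off: the reranker is far more accurate than embedding distance but too slow to run over millions of documents, so the fast first stage narrows the field before the precise second stage decides.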
Hypothetical Document Embeddings (HyDE)
When user queries are vague or short, they don't embed well. HyDE flips the script:
- Ask the LLM to generate a hypothetical answer to the query
- Embed that hypothetical answer
- Search for documents similar to the answer (not the question)
# User query: "server reset" — too short to embed meaningfully
# Step 1: Generate hypothetical answer
hypothetical = llm.generate(
"Write a paragraph that would answer: 'server reset'"
)
# "To reset the production server, first ensure all active
# connections are drained. SSH into the admin console..."
# Step 2: Search using the hypothetical answer's embedding
results = vector_db.search(embed(hypothetical), top_k=5)

HyDE is powerful for vague queries but adds latency (one extra LLM call). Use it selectively.
8. RAG in Your Agent's Toolbox
In previous chapters, we built agents with tools. RAG is just another tool—one that retrieves knowledge from your private data.
Exposing RAG as a Tool
With Google ADK and Vertex AI RAG Engine:
from google.adk import Agent
from google.adk.tools import VertexRagTool
# Create the RAG tool
rag_tool = VertexRagTool(
corpus_name="projects/your-project/locations/us-central1/ragCorpora/docs",
description="Search internal product documentation and policies",
)
# Give it to your agent alongside other tools
agent = Agent(
model="gemini-2.0-flash",
tools=[rag_tool, web_search, calculator],
system_prompt="""You are a helpful assistant. Choose the right tool:
- Internal policies/docs → RAG tool
- Current events/public info → Web search
- Math calculations → Calculator
Always cite your sources.""",
)

RAG vs. Web Search: When to Use Which
| Scenario | Use RAG | Use Web Search |
|---|---|---|
| Internal company policies | ✅ | ❌ |
| Product documentation | ✅ | ❌ |
| Customer-specific data | ✅ | ❌ |
| Current events / news | ❌ | ✅ |
| General public knowledge | ❌ | ✅ |
A well-designed agent has both tools and chooses based on the query type. The agent reasons: "This question is about our refund policy—that's internal, so I'll use the RAG tool" vs. "They're asking about today's weather—that's public info, I'll search the web."
Key Takeaways
- RAG bridges the knowledge gap between static training data and dynamic, private information your agent needs.
- Three building blocks: Embeddings (meaning as vectors), Vector Databases (fast similarity search), Chunking (splitting docs intelligently).
- The pattern is simple: Query → Retrieve → Augment → Generate. When something goes wrong, debug each step.
- Use managed services for production. OpenAI Vector Stores, Vertex AI RAG Engine, and similar offerings handle the infrastructure so you can focus on your application.
- Advanced techniques (hybrid search, query rewriting, re-ranking, HyDE) matter when basic RAG hits edge cases.
- RAG is just another tool in your agent's toolbox. Combine it with web search and other tools for a capable assistant.
References
- OpenAI Retrieval API Guide — Official documentation for Vector Stores and semantic search
- AWS: What is RAG? — Comprehensive overview of RAG architecture and benefits
- Google Cloud: Retrieval-Augmented Generation — Google's perspective on RAG and Vertex AI integration
- Vertex AI RAG Engine Documentation — Technical guide for Google's managed RAG solution
- Lost in the Middle: How Language Models Use Long Contexts — Research paper on attention degradation in long contexts
Next: Memory & Persistence—giving your agent memory that persists across conversations.