Context Engineering
When More Information Makes Things Worse
In early 2023, a developer shared a cautionary tale on Twitter. He'd configured AutoGPT to research competitors and write a market analysis, then left it running overnight. By morning, the agent had made 847 API calls, accumulated $427 in charges, and produced a report that was 90% duplicated paragraphs with zero usable insights.
The failure mode was instructive: the agent kept re-reading the same websites, forgetting it had already analyzed them. With each iteration, it stuffed more text into its context window until the signal—the actual insights—drowned in redundant HTML. The agent was working. It just wasn't remembering.
This wasn't a bug in AutoGPT. It was a missing discipline that we now call context engineering.
By late 2024, the landscape had shifted. Google's Deep Research agent could process hundreds of sources and produce coherent 30-page reports. Anthropic's Claude handled million-token contexts without losing critical information. The difference wasn't just larger context windows—it was learning to treat context as a resource to be managed, not just a bucket to be filled.
This chapter covers that discipline. We'll start with what context actually is and why it degrades, then work through techniques to manage it—from simple trimming to sophisticated patterns used in production agents. By the end, you'll have a toolkit for building agents that stay sharp even in long-horizon tasks.
Understanding Context
Before we can engineer context, we need to be precise about what it is.
Context is everything the model can see when generating a response. The system prompt. The user's message. The outputs from previous tool calls. Retrieved documents. Conversation history. Memory entries. All of it, concatenated together, forms the input that the model reasons over.
A useful mental model: think of context as a detective's case file. Every piece of evidence, every witness statement, every photograph—that's what the detective has to work with. Leave out a crucial detail, and they'll miss the connection. Include too much irrelevant material, and the important clues get buried.
We've touched on context throughout previous chapters, but two facts make dedicated context engineering essential.
Context Windows Are Finite
Models advertise impressive context windows—128k tokens, 1 million, even 2 million. It sounds like more than you'd ever need. Then you start building.
An autonomous coding agent loads a 50,000-line codebase. It maintains conversation history across 20 turns. It accumulates outputs from 30 file reads. It includes test results and error logs. Suddenly that "unlimited" context is filling up fast.
For long-horizon tasks—research projects spanning days, coding agents working through complex refactors—you will hit the limit. The question is what you do about it.
Context Degrades Before It Fills
Here's a failure mode that's more insidious than running out of space: your agent has all the information it needs—you can see the answer right there in the context—but the model ignores it and produces something wrong.
This isn't hallucination. It's attention failure.
Stanford researchers documented this in a 2023 paper called "Lost in the Middle." They found that language models exhibit a U-shaped attention pattern: strong attention to the beginning of the context, reasonable attention to the end, and significantly weaker attention to everything in between. Bury critical information on page 15 of a 30-page context, and the model often behaves as if it isn't there.
The implication is counterintuitive: more context can make performance worse. A focused 5,000-token context often outperforms a bloated 50,000-token context containing the same critical information buried in noise.
Filling your context window isn't the goal. Filling it with the right information is.
This is why context engineering matters. It's not just about fitting within limits—it's about keeping the signal-to-noise ratio high enough that the model can actually use what you give it.
Providing Good Context
Before we discuss managing context, a foundational point: none of the techniques ahead matter if the context you're providing isn't good in the first place.
We've been building this understanding throughout the series. In Chapter 4, we learned to write system prompts that are clear and specific rather than vague and verbose. In Chapter 6, we crafted tool descriptions that tell the model exactly when and how to use each capability. In Chapter 8, we built RAG systems that retrieve relevant documents rather than dumping everything tangentially related. In Chapter 9, we designed memory systems that surface the right facts at the right time.
The common thread across all of it: quality over quantity.
A concise, relevant 100-token instruction outperforms a rambling 1,000-token essay. Three carefully retrieved documents beat thirty tangentially related ones. A focused system prompt beats a comprehensive one that buries the important instructions.
Context engineering extends this principle beyond initial setup. It's not just about what you put in—it's about continuously curating what stays in, what gets compressed, and what gets fetched on demand as the conversation evolves.
Techniques for Managing Context
With the principles established, let's get concrete. The techniques that follow fall into three broad categories:
- Removing unnecessary context—deleting what you no longer need
- Condensing necessary context—compressing information without losing what matters
- Dynamically including relevant context—fetching on demand rather than loading upfront
Most production agents combine several of these. We'll work through them from simplest to most sophisticated.
Removing Unnecessary Context
The simplest optimization is also the most overlooked: just delete what you don't need.
Trimming
Imagine a customer support agent that's been running for two hours. The conversation has accumulated 200 messages, and you're approaching your context limit. What do you do?
The brute-force solution: keep only the last N messages and throw away the rest. After every turn, check the message count. If it exceeds your threshold—say, 20 messages—remove the oldest ones while preserving the system prompt.
This is a blunt instrument. You will lose information. Your agent will sometimes fail to recall things discussed earlier—"I don't remember you mentioning that." But for many use cases, that's fine. A customer asking about their order doesn't need the agent to remember a question from an hour ago. What matters is recent context.
Trimming works well for stateless, transactional interactions. It works poorly when continuity matters.
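A minimal sketch of the keep-last-N approach, assuming a simple list of role/content message dicts with the system prompt first (the exact message shape will depend on your framework):

```python
def trim_messages(messages, max_messages=20):
    """Keep the system prompt plus only the most recent messages.

    `messages` is a list of {"role": ..., "content": ...} dicts.
    The system prompt is preserved no matter how old it is.
    """
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-max_messages:]
```

Run this after every turn; the cost is negligible and the context size is bounded by construction.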
Pruning
Trimming is indiscriminate—it removes old messages regardless of their value. Pruning is smarter. It removes specific pieces of context that have served their purpose.
Consider a coding agent that reads a 2,000-token config file to extract an API key. Once extracted, the full JSON just sits in context, taking up space. The agent will never reference it again. So why keep it?
After the agent extracts what it needs, replace the full output with a compact summary: "✓ Found API key in config.json." Two thousand tokens become ten.
Claude Code does this. After a successful file read, it often replaces the full content with a note about what was found. The information that mattered—the API key—is preserved in the agent's subsequent reasoning. The information that didn't—2,000 tokens of JSON structure—is gone.
The tricky part is knowing when to prune. Too early, and the agent might need that data again. Too late, and you've wasted tokens for several turns. A reasonable heuristic: if the agent has already acted on the information—made a decision, written code, answered a question—it's probably safe to compress.
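A sketch of pruning a consumed tool result, assuming messages carry a `tool_call_id` field (the field name is an assumption; adjust to your framework's message schema):

```python
def prune_tool_output(messages, tool_call_id, summary):
    """Replace a bulky tool result with a compact note once the
    agent has already acted on it."""
    for m in messages:
        if m.get("role") == "tool" and m.get("tool_call_id") == tool_call_id:
            m["content"] = summary  # e.g. "✓ Found API key in config.json"
    return messages
```

The summary string should record the conclusion, not the raw data, so later turns still know the step happened.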
Condensing Necessary Context
Sometimes deletion isn't an option. You need the information, just in a smaller form.
Summarization
The idea is simple: periodically ask a model to compress the conversation so far. Ten messages become two sentences. Five thousand tokens become fifty.
The danger is what gets lost. I once watched a debugging session get summarized as "User had an authentication issue; agent helped resolve it." Technically accurate. Completely useless when the user came back a week later asking "what was that fix we tried?" The specific error code, the config change that worked, the three approaches that didn't—all gone.
Summarization works when the arc matters more than the details. A user exploring options, asking general questions, gradually narrowing toward a decision—that compresses well. A user debugging a specific issue, where every error message and attempted fix matters—that doesn't.
One practical note: you don't need your most powerful model for summarization. A smaller model like GPT-4o-mini or Claude Haiku handles compression adequately at a fraction of the cost. Save the expensive model for the actual reasoning.
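One way to structure this, with the summarizer passed in as a callable so any cheap model can fill the role (the prompt wording and the keep-last-6 split are illustrative choices, not a fixed recipe):

```python
def compact_history(messages, summarize, keep_last=6):
    """Summarize older turns with a cheap model; keep recent turns
    verbatim. `summarize` is any callable mapping text to a short
    summary (e.g. a wrapper around GPT-4o-mini or Claude Haiku).
    """
    if len(messages) <= keep_last:
        return messages
    old, recent = messages[:-keep_last], messages[-keep_last:]
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in old)
    summary = summarize(
        "Summarize this conversation, preserving specific error codes, "
        "config changes, and decisions:\n" + transcript
    )
    header = {"role": "system", "content": f"Summary of earlier turns: {summary}"}
    return [header] + recent
```

Note the prompt explicitly asks for error codes and config changes to be preserved—a direct response to the lossy-summary failure described above.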
Memory Systems
Summarization compresses conversations. Memory systems take a different approach: instead of compressing everything, extract only the facts that matter and store them separately.
The difference is subtle but important. A summary might say "User discussed dietary preferences." A memory entry says {dietary_restrictions: ["peanuts", "shellfish"]}. The summary gives narrative; the memory gives retrievable facts.
As the conversation progresses, the agent identifies important information—preferences, decisions, constraints—and writes them to a persistent store. In future turns, instead of carrying the full conversation history, it retrieves only the relevant memory entries.
This scales to sessions spanning days or weeks. The agent doesn't need to remember that on Tuesday you discussed allergies. It just needs to retrieve the allergy list when suggesting a restaurant.
The infrastructure requirements are heavier—you need a database, probably a vector store for semantic retrieval. And extraction quality depends on the model's ability to identify what's worth remembering. But for long-running agents, memory systems are often the only practical option.
We covered memory architecture in detail in Chapter 9.
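In miniature, the contrast with summarization looks like this—structured facts in, structured facts out. This is a deliberately tiny sketch; a production store would add persistence and a vector index for semantic retrieval, as discussed in Chapter 9:

```python
class MemoryStore:
    """Minimal fact store: keep retrievable entries instead of
    carrying the full transcript."""

    def __init__(self):
        self.facts = {}

    def remember(self, key, value):
        self.facts[key] = value

    def recall(self, *keys):
        """Fetch only the entries relevant to the current turn."""
        return {k: self.facts[k] for k in keys if k in self.facts}
```

When suggesting a restaurant, the agent calls `recall("dietary_restrictions")` rather than replaying Tuesday's conversation.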
Dynamically Including Relevant Context
The techniques above assume content is already in context and needs to be reduced. But there's another approach: don't load everything upfront. Fetch information when you need it.
File Pointers
Suppose you're building a coding agent for a repository with 50 files. The naive approach: dump all 50 files into the initial prompt. That's easily 100,000 tokens before the user even asks a question.
The smarter approach: give the agent a list of file paths and a read_file tool. "Here are the 50 files in this repo. Use read_file(path) when you need to see one."
Now the agent reads three files instead of fifty. Peak context stays around 10k tokens. The savings compound over a long session—every turn that would have carried 100k tokens now carries 10k.
The risk is that the agent might not know to look at a file that turns out to be relevant. You can mitigate this with good file descriptions, or with a semantic search layer that suggests relevant files based on the current query.
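The two halves of the pattern—a cheap listing the agent always sees, and a tool it calls on demand—can be sketched as:

```python
import os

def list_files(root):
    """Return relative paths only: the lightweight 'table of
    contents' loaded into the initial prompt."""
    paths = []
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            paths.append(os.path.relpath(os.path.join(dirpath, name), root))
    return sorted(paths)

def read_file(root, path):
    """The tool the agent invokes only when it actually needs
    a file's contents."""
    with open(os.path.join(root, path), encoding="utf-8") as f:
        return f.read()
```

Fifty paths cost a few hundred tokens; fifty file bodies cost a hundred thousand. The agent pays the big cost only for files it chooses to open.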
Semantic Search and RAG
File pointers work when you have a bounded set of resources. But what if your knowledge base has millions of documents?
You can't list them all. You can't even summarize them all. The only practical approach is retrieval: embed the user's query, search a vector database, and pull back the top few matches.
This is RAG—Retrieval-Augmented Generation—which we covered in Chapter 8. In the context of context engineering, the key insight is that RAG lets you have an effectively infinite knowledge base while keeping per-turn context small. The user asks "How do I reset the server?" The system retrieves three relevant docs, injects them into context, and the model answers based on those specific documents.
The quality depends on your retrieval. Bad embeddings or poorly chunked documents mean the right information might not surface. But when it works, RAG is the only way to scale knowledge bases beyond what any context window could hold.
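The retrieval core is just similarity ranking. A dependency-free sketch, assuming you already have embedding vectors from whatever model you use (real systems would call an embedding API and a vector database instead of in-memory lists):

```python
import math

def top_k(query_vec, doc_vecs, docs, k=3):
    """Rank documents by cosine similarity to the query embedding
    and return the k best matches."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb) if na and nb else 0.0

    scored = sorted(
        zip(doc_vecs, docs),
        key=lambda pair: cosine(query_vec, pair[0]),
        reverse=True,
    )
    return [doc for _, doc in scored[:k]]
```

Only the top-k documents enter the context; the other millions stay in the database, costing nothing per turn.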
Dynamic Tool Selection
Context isn't just documents and conversation history. Tool definitions take up space too. If your agent has 100 tools, each with a detailed description and parameter schema, you might be spending 20,000 tokens just listing what the agent could do.
One solution: don't register all 100 tools. Use a lightweight classifier—or even just keyword matching—to predict which tools the current query might need, then only expose those. A query about AWS deployment probably needs deploy_aws, check_status, maybe read_file. It probably doesn't need send_email or schedule_meeting.
This reduces token count and, surprisingly, often improves tool selection accuracy. Given 100 options, models sometimes pick the wrong one. Given 5 relevant options, they rarely miss.
The risk is hiding a tool the agent actually needs. If your classifier is too aggressive, the agent might not have access to the right capability. Start conservative—only filter when you're confident—and tune based on observed failures.
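Even the keyword-matching version is a few lines. The keyword lists below are illustrative placeholders—in practice you would tune them from observed tool usage—and the conservative fallback exposes everything when nothing matches:

```python
TOOL_KEYWORDS = {
    "deploy_aws": ["deploy", "aws", "release"],
    "check_status": ["status", "health", "deploy"],
    "read_file": ["file", "read", "code"],
    "send_email": ["email", "mail"],
    "schedule_meeting": ["meeting", "calendar", "schedule"],
}

def select_tools(query):
    """Expose only tools whose keywords appear in the query.
    Falls back to the full set when nothing matches."""
    words = query.lower()
    selected = [
        tool for tool, kws in TOOL_KEYWORDS.items()
        if any(kw in words for kw in kws)
    ]
    if not selected:
        return list(TOOL_KEYWORDS)  # conservative: when unsure, show all
    return selected
```

A real router might swap keyword matching for embedding similarity over tool descriptions, but the shape is the same: filter first, register second.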
Delegation
Sometimes the cleanest solution is to not put information in your context at all. Hand it to a separate agent, let them process it, and take back only the result.
The main agent needs to analyze a 50-page PDF. Loading it directly would consume 40,000 tokens. Instead, spin up a "reader agent" with a fresh context containing only the PDF. The reader processes it, extracts what matters, and returns a 200-token summary. The main agent's context stays clean.
This pattern is especially useful for tasks with distinct phases—research, then analysis, then writing—where each phase needs different context. Rather than accumulating everything in one ballooning context, each phase gets its own focused workspace.
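The delegation call itself is small—the point is that the reader's context is created, used, and discarded. Here `call_model` stands in for any chat-completion callable (an assumption; wire in your provider's client):

```python
def delegate_read(document_text, call_model):
    """Process a large document in a throwaway context and return
    only the distilled result to the caller."""
    reader_context = [
        {"role": "system",
         "content": "Extract the key findings from this document "
                    "in under 200 tokens."},
        {"role": "user", "content": document_text},
    ]
    # Only this short summary ever enters the main agent's context;
    # the 40,000-token document is garbage-collected with the reader.
    return call_model(reader_context)
```

The main agent's token budget is charged 200 tokens for the result, not 40,000 for the source.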
We covered multi-agent delegation in Chapter 12.
These techniques aren't mutually exclusive. Most production agents layer several: trimming for conversation history, pruning for tool outputs, RAG for knowledge retrieval, delegation for heavy subtasks. The right combination depends on where your tokens are going. Measure first, then optimize.
Advanced Patterns
The techniques above are surgical—they address specific problems. The patterns in this section are more architectural. They change how an agent fundamentally relates to its context, combining multiple techniques into coherent strategies.
These patterns emerged from production agents—Claude Code, Devin, Cursor—that needed to maintain coherence across hundreds of turns. If you're building something that runs for more than a few minutes, these are worth understanding.
Note-Taking
Here's a problem: your coding agent is 50 turns into a complex refactor. It's made dozens of decisions—which files to modify, what approach to take, what to defer until later. But you've been trimming aggressively to stay under context limits. The actual conversation history only goes back 10 turns. How does the agent remember what it decided on turn 15?
The answer is to externalize state. Instead of relying on conversation history as memory, the agent maintains a notes.md file—a scratchpad it reads at the start of each turn and updates at the end.
The notes aren't a transcript. They're structured state: what's done, what's in progress, what decisions were made and why, what's next.
```markdown
# Task: Refactor authentication module

## Completed
- Extracted User model to separate file
- Added password hashing (bcrypt, not MD5)
- Updated all imports in affected files

## In Progress
- JWT token generation

## Key Decisions
- Token expiry: 24 hours (client requested)
- Refresh tokens: deferred to phase 2

## Next
- Implement login endpoint
- Add rate limiting
```

With notes as the source of truth, conversation history becomes disposable. You can trim down to the last 3 messages because the state lives in the file, not the chat.
Claude Code does this with CLAUDE.md. Devin maintains persistent memory of decisions. Cursor's agent mode has an internal scratchpad. The pattern is everywhere because it solves a fundamental problem: conversations are ephemeral, but tasks require continuity.
Planning
Watch an agent without a plan work on a complex task. Turn 1: "I should probably start by researching..." Turn 5: "So as I mentioned, the approach is to research first, then..." Turn 12: "To recap, my plan is to research, then outline, then..." The same reasoning, repeated endlessly, burning tokens.
The fix is obvious once you see it: write the plan down once.
At the start of a complex task, the agent reasons carefully about the approach and produces an explicit plan. In subsequent turns, instead of re-deriving the strategy, it just references the plan: "Proceeding with step 3 of 5."
Two hundred tokens of plan reference instead of two thousand tokens of repeated reasoning.
But planning does something deeper than saving tokens. The plan becomes a form of memory. Once it's written, you can aggressively trim the reasoning that produced it. The decision to use Docker for deployment doesn't need to live in the conversation history—it's in the plan. The plan is the source of truth for what was decided.
This works for any task with distinct phases: research projects, document generation, code refactors. If you'd make a checklist doing it manually, your agent should make one too.
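Representing the plan as structured state makes the cheap per-turn reference trivial to generate. The plan contents below mirror the refactor example from the note-taking section; the exact schema is an illustrative choice:

```python
plan = {
    "goal": "Refactor authentication module",
    "steps": [
        "Extract User model",
        "Add password hashing",
        "Generate JWT tokens",
        "Implement login endpoint",
        "Add rate limiting",
    ],
    "current": 2,  # zero-indexed: currently on JWT token generation
}

def plan_reference(plan):
    """The short status line injected each turn instead of
    re-deriving the whole strategy."""
    step = plan["steps"][plan["current"]]
    return (f"Proceeding with step {plan['current'] + 1} "
            f"of {len(plan['steps'])}: {step}")
```

A few dozen tokens per turn, and the reasoning that produced the plan can be trimmed away.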
Reflection
Trimming follows fixed rules: keep the last N messages. Pruning follows heuristics: compress tool outputs after they're used. But what if the optimal strategy depends on how the task is evolving?
Reflection makes the agent its own context manager. Every N turns, it pauses to explicitly consider: What in my context is still relevant? What can be safely removed? What will I need in the next few turns?
Based on this self-assessment, the agent prunes, summarizes, or reorganizes its context.
The cost is an extra API call for the reflection itself. And weaker models struggle with this kind of meta-reasoning—they'll either keep everything or delete things they shouldn't. But for capable models on open-ended tasks, reflection lets the context management adapt to the actual task instead of following predetermined rules.
Context Economics
Everything we've discussed so far has a practical motivation beyond just "fitting in the context window." Context costs money.
Let's make it concrete. Claude Sonnet charges $3 per million input tokens. An agent processing 50,000 tokens per turn, running 100 turns per task, handling 1,000 tasks per day: that's 5 billion tokens a day, or $15,000 per day. And that's just input tokens—output tokens cost more.
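The arithmetic behind those figures, spelled out (prices as of the pricing quoted above; check your provider's current rates):

```python
tokens_per_turn = 50_000
turns_per_task = 100
tasks_per_day = 1_000
input_price_per_million = 3.00  # Claude Sonnet input pricing, USD

daily_tokens = tokens_per_turn * turns_per_task * tasks_per_day
daily_cost = daily_tokens / 1_000_000 * input_price_per_million
# 5 billion input tokens per day, $15,000 per day before output tokens
```

Halving tokens per turn halves the bill; cutting context from 50k to 10k per turn—well within reach of the techniques above—cuts it by 80%.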
Context engineering isn't just about making agents work better. It's about making them economically viable.
Context Caching
Notice something wasteful in a typical agent session? The system prompt is the same every turn. The codebase context is the same every turn. Yet you're paying to process those 10,000 tokens again and again.
Context caching fixes this. Both Anthropic and Google now let you "save" a prefix after processing it once. Subsequent requests reference the cached computation instead of reprocessing from scratch. The discount is substantial—roughly 90% off for cached tokens.
An agent making 50 requests per session with a 10,000-token system prompt saves 90% on 490,000 tokens through caching alone. Over thousands of sessions, that's real money.
The catch: caches expire. Anthropic's caches last about 5 minutes; Google's last longer. And the prefix must be exactly identical—change one character and you rebuild the cache. This means caching works best for truly static content: system prompts, reference documentation, few-shot examples. It doesn't work for conversation history or tool outputs that change every turn.
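With Anthropic's API, opting in is a matter of marking the static prefix. The sketch below shows the shape of the request payload; the model id is a placeholder, and the field layout follows Anthropic's prompt-caching documentation—verify against the current docs before relying on it:

```python
# Shape of a Messages API request using prompt caching. The
# cache_control marker asks the API to cache the prefix up to and
# including this block; later identical requests hit the cache.
request = {
    "model": "claude-sonnet-example",  # placeholder model id
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": "You are a coding assistant...",  # large static prompt
            "cache_control": {"type": "ephemeral"},
        }
    ],
    "messages": [
        {"role": "user", "content": "Explain the auth module."}
    ],
}
```

Everything before the marker must be byte-for-byte identical across requests—which is exactly why cached content should be static material like system prompts and reference docs, not conversation history.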
Model Routing
Not every question needs your most expensive model.
"What's the weather in Tokyo?" doesn't require deep reasoning. A fast, cheap model handles it fine. "Architect a distributed system for handling 10 million concurrent users" is a different story—you want your best model on that.
The price differences are dramatic. GPT-4o-mini costs $0.15 per million tokens. GPT-4o costs $2.50. Claude Opus costs $15. For queries that don't need the expensive model, routing to a cheaper one cuts costs by 10-100x.
The challenge is building a router that classifies correctly. Too aggressive, and you send hard queries to weak models that fumble them. Too conservative, and you're not saving much. Most production systems start conservative—default to the expensive model—and gradually identify query patterns that can be safely routed cheaper.
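Structurally, a router is just a classifier in front of a model choice. The classifier is injected here so it can be anything from keyword rules to a cheap LLM call; model names are examples from the pricing discussion above:

```python
def route_model(query, classify):
    """Send easy queries to a cheap model, everything else to the
    expensive default. `classify` returns 'simple' or 'complex';
    defaulting to 'complex' keeps the router conservative."""
    tier = classify(query)
    return "gpt-4o-mini" if tier == "simple" else "gpt-4o"
```

Starting conservative means misclassifications cost you money, not quality—then you widen the "simple" class as observed failures allow.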
There's also context window sizing. Some providers charge more for larger context windows even within the same model. If your task genuinely only needs 8k tokens, using the 8k-context variant instead of the 128k variant can save substantially.
Putting It Into Practice
Context engineering becomes intuitive with practice. The workflow is always the same:
- Measure — Log token usage per turn. Where is context going?
- Identify — What's signal (used for decisions)? What's noise (never referenced)? What's redundant?
- Optimize — Apply the appropriate technique from this chapter
- Verify — Re-measure. Did task success improve? Did costs drop?
In most agents, 20% of the context provides 80% of the value. Your job is to find that 20% and optimize the rest.
Exercise: Research Assistant
To consolidate what you've learned, try building a research assistant that processes 10 articles and writes a report. This exercise combines most techniques from this chapter.
The Challenge
Build an agent that:
- Searches the web for 10 relevant articles on a topic
- Reads and analyzes each article
- Writes a comprehensive report
Naively loading all 10 articles would exceed 100k tokens. Your task: engineer the context flow so peak usage stays under 25k tokens.
Suggested Architecture
- Main Orchestrator: Coordinates the research process
- Reader Agent: Analyzes individual articles in isolated context
- Notes System: research_notes.md for persistent state
- Semantic Retrieval: Finds relevant notes when writing
The Flow
Phase 1: Research (Delegation + Note-Taking)
Phase 2: Writing (Dynamic Retrieval + Context Caching)
The naive approach—loading all 10 articles into context—would hit 150,000 tokens and fail around article 8. The optimized approach peaks at 25,000 tokens and completes the full task at roughly 85% lower cost.
Start simple when you build this. Implement just summarization first and measure. Then add note-taking. Then delegation. Watch the token counts at each stage. The cumulative impact is often surprising.
Key Takeaways
The agents that succeed at scale share a common trait: they're disciplined about context. They don't just stuff information in and hope the model finds what it needs. They actively manage what's in the window—removing what's no longer useful, compressing what's verbose, retrieving what's needed on demand.
This isn't glamorous work. It's maintenance. Cleaning up context, organizing information, measuring token usage, tuning thresholds. But it's the difference between an agent that works in demos and an agent that works in production.
Start by measuring. Run your agent on representative tasks and log token counts per turn. Find out where the bloat is—conversation history? Tool outputs? Static resources? Then pick one technique and implement it. Measure again. The improvements are often larger than expected.
Context is a finite, expensive resource. Treat it that way, and your agents will be sharper, cheaper, and more reliable.
Next: The Agent Ecosystem