The modern AI agent is built on a fundamentally new computing primitive: the Large Language Model. Unlike traditional CPUs that execute deterministic instructions, the LLM operates probabilistically: it predicts a probability distribution over the next token and samples from it.
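To make that concrete, here is a minimal sketch of what next-token prediction looks like, using a made-up four-word vocabulary and hand-picked logits rather than real model outputs: the scores are turned into a probability distribution and one token is sampled from it.

```python
import math
import random

# Toy example: hypothetical logits over a tiny, made-up vocabulary.
# A real model scores every token in a vocabulary of 100K+ entries.
vocab = ["Paris", "London", "Rome", "banana"]
logits = [4.2, 2.1, 1.9, -3.0]

# Softmax turns raw scores into a probability distribution.
exps = [math.exp(x) for x in logits]
probs = [e / sum(exps) for e in exps]

# Sampling is what makes the output stochastic: the top token usually wins,
# but not always, which is why the same prompt can yield different answers.
next_token = random.choices(vocab, weights=probs, k=1)[0]
print({w: round(p, 3) for w, p in zip(vocab, probs)}, "->", next_token)
```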
Thinking in Tokens
Before you can build agents, you must understand how the model sees the world: as tokens.
```python
from tiktoken import encoding_for_model

enc = encoding_for_model("gpt-4o")

text = "Hello, world!"
tokens = enc.encode(text)

print(f"Text: {text}")
print(f"Tokens: {tokens}")  # [15339, 11, 1917, 0]
print(f"Token count: {len(tokens)}")  # 4
```
Key Insight
One word ≠ one token. Common words may be single tokens, while rare words split into multiple subword units. This matters for:
- Cost: You pay per token, not per word (see the cost sketch after this list)
- Context limits: Your 128K context window is measured in tokens
- Latency: Generation time scales with output tokens
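Putting the cost bullet into numbers: a rough input-cost estimate is just the token count times the per-token price. A minimal sketch, using tiktoken for counting and the illustrative per-million-token input prices from the trade-off table later in this chapter:

```python
from tiktoken import encoding_for_model

# Illustrative input prices in USD per 1M tokens (see the trade-off table below).
PRICE_PER_MILLION_INPUT = {"gpt-4o": 2.50, "gpt-4o-mini": 0.15}

def estimate_input_cost(prompt: str, model: str = "gpt-4o") -> float:
    """Rough input-side cost: token count times the per-token price."""
    enc = encoding_for_model(model)
    n_tokens = len(enc.encode(prompt))
    return n_tokens * PRICE_PER_MILLION_INPUT[model] / 1_000_000

long_prompt = "Summarize the following report. " * 200  # stand-in for a long prompt
print(f"gpt-4o:      ${estimate_input_cost(long_prompt, 'gpt-4o'):.4f}")
print(f"gpt-4o-mini: ${estimate_input_cost(long_prompt, 'gpt-4o-mini'):.4f}")
```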
The Context Window
The context window is the LLM's working memory—everything it can "see" at once.
| Model | Context Window | ~Pages of Text |
|---|---|---|
| GPT-4o | 128K tokens | ~300 pages |
| Claude 3.5 | 200K tokens | ~500 pages |
| Gemini 1.5 Pro | 1M tokens | ~2,500 pages |
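Because these limits are counted in tokens, not words or characters, it is worth checking whether a prompt actually fits before sending it. A minimal sketch, assuming the 128K figure from the table above and an arbitrary amount of headroom reserved for the model's reply:

```python
from tiktoken import encoding_for_model

CONTEXT_WINDOW = 128_000      # GPT-4o, from the table above
RESERVED_FOR_OUTPUT = 4_000   # arbitrary headroom left for the response

def fits_in_context(prompt: str, model: str = "gpt-4o") -> bool:
    """True if the prompt leaves enough room in the window for a reply."""
    enc = encoding_for_model(model)
    return len(enc.encode(prompt)) <= CONTEXT_WINDOW - RESERVED_FOR_OUTPUT

print(fits_in_context("Hello, world!"))  # True
```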
Context Engineering
Effective agents manage context carefully:
- Pack densely — Include only what's needed
- Structure clearly — Use headers, sections, XML tags
- Recency matters — Content near the end of the prompt typically has more influence on the output (see the sketch after this list)
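A minimal sketch of all three habits together: each piece of context sits in its own clearly labeled XML-style section, only the material that is needed gets included, and the most recent content goes last. The tag names and the helper function are illustrative choices, not a fixed convention.

```python
def build_context(task: str, documents: list[str], recent_messages: list[str]) -> str:
    """Assemble a prompt: labeled sections, only what is needed, newest last."""
    sections = [
        f"<task>\n{task}\n</task>",
        # Pack densely: include only the documents that are actually needed.
        "<documents>\n" + "\n---\n".join(documents) + "\n</documents>",
        # Recency matters: the latest conversation turns go at the end.
        "<recent_messages>\n" + "\n".join(recent_messages) + "\n</recent_messages>",
    ]
    return "\n\n".join(sections)

print(build_context(
    task="Answer the user's billing question.",
    documents=["Refund policy: ..."],
    recent_messages=["User: Why was I charged twice?"],
))
```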
KV Cache: The Performance Lever
When you send a prompt, the model computes key and value vectors for every token at each attention layer. The KV cache stores these results so that subsequent tokens don't have to recompute them from scratch.
```
First request:  [System prompt] [User message]
                ↓
        Compute KV for all tokens
                ↓
        Generate response

Second request: [System prompt] [User message] [Response] [User message 2]
                ↓
        Reuse cached KV for the prefix
                ↓
        Compute only the new tokens
```
Implications for Agents
- Prefix caching reduces latency by 50%+ on repeated prompts
- System prompts should be stable to maximize cache hits (see the sketch after this list)
- Streaming feels faster because tokens arrive as they are generated, instead of after the full response has been computed
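A minimal sketch of a cache-friendly request layout, shown with an OpenAI-style message list purely for illustration: the system prompt stays byte-for-byte identical across turns and new content is only ever appended, so providers that support prompt caching can reuse the shared prefix (the exact caching behavior and thresholds are provider-specific).

```python
# Keep the system prompt constant across requests so the cached prefix matches.
SYSTEM_PROMPT = "You are a support agent. Follow the refund policy strictly."

def build_messages(history: list[dict], new_user_message: str) -> list[dict]:
    """Stable prefix first, new content appended last, never reordered."""
    return (
        [{"role": "system", "content": SYSTEM_PROMPT}]  # identical every turn
        + history                                       # grows append-only
        + [{"role": "user", "content": new_user_message}]
    )

history: list[dict] = []
first = build_messages(history, "Why was I charged twice?")
# ...after the model replies, append both turns and reuse the same prefix:
history += [first[-1], {"role": "assistant", "content": "Let me check that."}]
second = build_messages(history, "Can I get a refund?")
print(len(second), "messages; the first", len(first), "form a reusable prefix")
```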
Trade-offs in System Design
| Factor | Small Model (e.g., GPT-4o-mini) | Frontier Model (e.g., GPT-4o) |
|---|---|---|
| Latency | ~100 ms to first token | ~500 ms to first token |
| Cost | ~$0.15 per 1M input tokens | ~$2.50 per 1M input tokens |
| Capability | Good for simple tasks | Required for complex reasoning |
The art of agent architecture is choosing the right model for each sub-task.
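One way to put this into practice is a small router that defaults to the cheaper model and escalates only when a task looks complex. The keyword-based rule below is deliberately naive and the model names simply echo the table above; a production router would use better signals.

```python
def pick_model(task: str, needs_reasoning: bool = False) -> str:
    """Naive router: cheap model by default, frontier model for complex work."""
    COMPLEX_HINTS = ("plan", "prove", "multi-step", "analyze")
    if needs_reasoning or any(hint in task.lower() for hint in COMPLEX_HINTS):
        return "gpt-4o"       # frontier: complex reasoning
    return "gpt-4o-mini"      # small: classification, extraction, routine calls

print(pick_model("Extract the invoice number from this email"))  # gpt-4o-mini
print(pick_model("Plan a multi-step refund investigation"))      # gpt-4o
```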
Next Chapter: Deterministic Constraints on Stochastic Outputs — How to tame the randomness with JSON Schema and structured outputs.