The modern AI agent is built on a fundamentally new computing primitive: the Large Language Model. Unlike traditional CPUs that execute deterministic instructions, the LLM operates probabilistically—it predicts the next token.
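
To make that concrete, here is a toy sketch of next-token sampling: the model assigns a score (logit) to every token in its vocabulary, softmax turns those scores into probabilities, and one token is drawn. The four-word vocabulary and logits below are invented purely for illustration.

import math
import random

# Toy "model output": scores for four candidate next tokens.
# The vocabulary and logits are invented for illustration.
vocab = ["world", "there", "agent", "token"]
logits = [2.0, 1.0, 0.5, -1.0]

# Softmax converts raw scores into a probability distribution.
exps = [math.exp(x) for x in logits]
probs = [e / sum(exps) for e in exps]

# Sampling from that distribution is what makes output non-deterministic.
next_token = random.choices(vocab, weights=probs, k=1)[0]
print(dict(zip(vocab, [round(p, 3) for p in probs])), "->", next_token)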

Thinking in Tokens

Before you can build agents, you must understand how the model sees the world: as tokens.

from tiktoken import encoding_for_model
 
enc = encoding_for_model("gpt-4o")
 
text = "Hello, world!"
tokens = enc.encode(text)
 
print(f"Text: {text}")
print(f"Tokens: {tokens}")  # [15339, 11, 1917, 0]
print(f"Token count: {len(tokens)}")  # 4

Key Insight

One word ≠ one token. Common words may be single tokens, while rare words split into multiple subword units. This matters for:

  • Cost: You pay per token, not per word (a cost sketch follows this list)
  • Context limits: Your 128K context window is measured in tokens
  • Latency: Generation time scales with output tokens
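
As a sketch of the cost point: here is how per-token pricing turns into a request cost estimate. The prices below are placeholders, not current list prices.

from tiktoken import encoding_for_model

# Placeholder per-million-token prices; check your provider's pricing page.
PRICE_PER_M_INPUT = 2.50    # USD per 1M input tokens (assumed)
PRICE_PER_M_OUTPUT = 10.00  # USD per 1M output tokens (assumed)

enc = encoding_for_model("gpt-4o")

def estimate_cost(prompt: str, expected_output_tokens: int) -> float:
    # Count input tokens exactly; output tokens must be estimated up front.
    input_tokens = len(enc.encode(prompt))
    return (input_tokens * PRICE_PER_M_INPUT
            + expected_output_tokens * PRICE_PER_M_OUTPUT) / 1_000_000

print(f"${estimate_cost('Summarize this report...', 500):.6f}")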

The Context Window

The context window is the LLM's working memory—everything it can "see" at once.

Model            Context Window   ~Pages of Text
GPT-4o           128K tokens      ~300 pages
Claude 3.5       200K tokens      ~500 pages
Gemini 1.5 Pro   1M tokens        ~2,500 pages
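
A practical corollary: check token counts against the window before you send anything. A minimal sketch, assuming a 128K-token window and a reserved output budget:

from tiktoken import encoding_for_model

enc = encoding_for_model("gpt-4o")
CONTEXT_LIMIT = 128_000      # tokens; GPT-4o-class window (assumed)
RESERVED_FOR_OUTPUT = 4_000  # leave headroom for the generated response

def fits_in_context(prompt: str) -> bool:
    # The window must hold the prompt *and* the tokens the model generates.
    return len(enc.encode(prompt)) + RESERVED_FOR_OUTPUT <= CONTEXT_LIMIT

print(fits_in_context("Hello, world!"))  # True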

Context Engineering

Effective agents manage context carefully:

  1. Pack densely — Include only what's needed
  2. Structure clearly — Use headers, sections, XML tags
  3. Recency matters — Recent context has more influence (a sketch combining all three follows)
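
Here is a sketch that applies all three rules in one prompt builder; the tag names and trimming policy are illustrative, not a standard:

def build_prompt(task: str, documents: list[str], history: list[str],
                 max_history: int = 5) -> str:
    # Recency: keep only the most recent conversation turns.
    recent = history[-max_history:]
    # Structure: XML-style tags separate sections so the model can tell them apart.
    docs = "\n".join(f"<doc>{d}</doc>" for d in documents)
    turns = "\n".join(recent)
    # Density: pack only the task, the relevant documents, and recent turns.
    return (f"<task>{task}</task>\n"
            f"<documents>\n{docs}\n</documents>\n"
            f"<history>\n{turns}\n</history>")

print(build_prompt("Summarize Q3 results",
                   ["Revenue grew 12%."],
                   ["User: Focus on revenue."]))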

KV Cache: The Performance Lever

When you send a prompt, the model computes key-value pairs for every token. The KV cache stores these results so that later tokens, and later requests sharing the same prefix, don't trigger recomputation from scratch.

First request:  [System prompt] [User message]
                        ↓
                Compute KV for all tokens
                        ↓
                Generate response

Second request: [System prompt] [User message] [Response] [User message 2]
                        ↓
                Reuse cached KV for the prefix; compute only the new tokens
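
A toy illustration of the mechanics (real KV caches hold attention keys and values inside the inference engine; this sketch only counts reused tokens):

# Toy model of prefix reuse: count how many leading tokens of a request
# were already processed in an earlier request.
cache: dict[tuple, str] = {}

def process(tokens: list[str]) -> None:
    # Find the longest already-cached prefix of this request.
    reused = 0
    for i in range(len(tokens), 0, -1):
        if tuple(tokens[:i]) in cache:
            reused = i
            break
    # Only the suffix beyond the cached prefix needs fresh computation.
    cache[tuple(tokens)] = "kv"
    print(f"reused {reused} tokens, computed {len(tokens) - reused}")

process(["sys", "user1"])                   # reused 0, computed 2
process(["sys", "user1", "resp", "user2"])  # reused 2, computed 2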

Implications for Agents

  • Prefix caching reduces latency by 50%+ on repeated prompts
  • System prompts should be stable to maximize cache hits (see the sketch after this list)
  • Streaming feels faster because the first token arrives before the full response has been generated
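
In practice, "stable" means keeping the system prompt byte-identical across requests and pushing anything volatile to the end. A minimal sketch using an OpenAI-style message list (the payload shapes are illustrative):

SYSTEM_PROMPT = "You are a helpful research agent."  # stable -> cacheable prefix

def build_messages(history: list[dict], user_input: str, timestamp: str) -> list[dict]:
    # Keep the system prompt byte-identical across requests so the provider
    # can reuse the cached prefix; append volatile data (like a timestamp)
    # at the end instead of injecting it into the system prompt.
    return ([{"role": "system", "content": SYSTEM_PROMPT}]
            + history
            + [{"role": "user",
                "content": f"{user_input}\n(current time: {timestamp})"}])

messages = build_messages([], "Summarize the latest results.", "2025-01-01T12:00Z")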

Trade-offs in System Design

Factor       Small Model (e.g., GPT-4o-mini)   Frontier Model (e.g., GPT-4o)
Latency      ~100ms first token                ~500ms first token
Cost         ~$0.15/M input                    ~$2.50/M input
Capability   Good for simple tasks             Required for complex reasoning

The art of agent architecture is choosing the right model for each sub-task.
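
A minimal sketch of that routing decision; the keyword heuristic and model names are placeholders for a real policy (a classifier, task metadata, or confidence thresholds):

def pick_model(task: str) -> str:
    # Crude routing heuristic, purely illustrative.
    complex_markers = ("plan", "analyze", "multi-step", "reason")
    if any(marker in task.lower() for marker in complex_markers):
        return "gpt-4o"       # frontier model for complex reasoning
    return "gpt-4o-mini"      # small model for simple tasks

print(pick_model("extract the date from this email"))     # gpt-4o-mini
print(pick_model("plan a multi-step research workflow"))  # gpt-4o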


Next Chapter: Deterministic Constraints on Stochastic Outputs — How to tame the randomness with JSON Schema and structured outputs.