The modern AI agent is built on a fundamentally new computing primitive: the Large Language Model. Unlike traditional CPUs that execute deterministic instructions, the LLM operates probabilistically: it predicts a probability distribution over the next token and samples from it.
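To make that concrete, here is a minimal sketch of what next-token prediction looks like, using a made-up four-word vocabulary and hand-picked logits rather than real model outputs: the scores are turned into a probability distribution and one token is sampled from it.

```python
import math
import random

# Toy example: hypothetical logits over a tiny, made-up vocabulary.
# A real model scores every token in a vocabulary of 100K+ entries.
vocab = ["Paris", "London", "Rome", "banana"]
logits = [4.2, 2.1, 1.9, -3.0]

# Softmax turns raw scores into a probability distribution.
exps = [math.exp(x) for x in logits]
probs = [e / sum(exps) for e in exps]

# Sampling is what makes the output stochastic: the top token usually wins,
# but not always, which is why the same prompt can yield different answers.
next_token = random.choices(vocab, weights=probs, k=1)[0]
print({w: round(p, 3) for w, p in zip(vocab, probs)}, "->", next_token)
```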
Thinking in Tokens
Before you can build agents, you must understand how the model sees the world: as tokens.
```python
from tiktoken import encoding_for_model

enc = encoding_for_model("gpt-4o")

text = "Hello, world!"
tokens = enc.encode(text)

print(f"Text: {text}")
print(f"Tokens: {tokens}")  # [15339, 11, 1917, 0]
print(f"Token count: {len(tokens)}")  # 4
```
Key Insight
One word ≠ one token. Common words may be single tokens, while rare words split into multiple subword units. This matters for:
- Cost: You pay per token, not per word (see the cost sketch after this list)
- Context limits: Your 128K context window is measured in tokens
- Latency: Generation time scales with output tokens
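Putting the cost bullet into numbers: a rough input-cost estimate is just the token count times the per-token price. A minimal sketch, using tiktoken for counting and the illustrative per-million-token input prices from the trade-off table later in this chapter:

```python
from tiktoken import encoding_for_model

# Illustrative input prices in USD per 1M tokens (see the trade-off table below).
PRICE_PER_MILLION_INPUT = {"gpt-4o": 2.50, "gpt-4o-mini": 0.15}

def estimate_input_cost(prompt: str, model: str = "gpt-4o") -> float:
    """Rough input-side cost: token count times the per-token price."""
    enc = encoding_for_model(model)
    n_tokens = len(enc.encode(prompt))
    return n_tokens * PRICE_PER_MILLION_INPUT[model] / 1_000_000

long_prompt = "Summarize the following report. " * 200  # stand-in for a long prompt
print(f"gpt-4o:      ${estimate_input_cost(long_prompt, 'gpt-4o'):.4f}")
print(f"gpt-4o-mini: ${estimate_input_cost(long_prompt, 'gpt-4o-mini'):.4f}")
```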
The Context Window
The context window is the LLM's working memory—everything it can "see" at once.
| Model | Context Window | ~Pages of Text |
|---|---|---|
| GPT-4o | 128K tokens | ~300 pages |
| Claude 3.5 | 200K tokens | ~500 pages |
| Gemini 1.5 Pro | 1M tokens | ~2,500 pages |
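Because these limits are counted in tokens, not words or characters, it is worth checking whether a prompt actually fits before sending it. A minimal sketch, assuming the 128K figure from the table above and an arbitrary amount of headroom reserved for the model's reply:

```python
from tiktoken import encoding_for_model

CONTEXT_WINDOW = 128_000      # GPT-4o, from the table above
RESERVED_FOR_OUTPUT = 4_000   # arbitrary headroom left for the response

def fits_in_context(prompt: str, model: str = "gpt-4o") -> bool:
    """True if the prompt leaves enough room in the window for a reply."""
    enc = encoding_for_model(model)
    return len(enc.encode(prompt)) <= CONTEXT_WINDOW - RESERVED_FOR_OUTPUT

print(fits_in_context("Hello, world!"))  # True
```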
Context Engineering
Effective agents manage context carefully:
- Pack densely — Include only what's needed
- Structure clearly — Use headers, sections, XML tags
- Recency matters — Content near the end of the prompt typically has more influence on the output (see the sketch after this list)
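A minimal sketch of all three habits together: each piece of context sits in its own clearly labeled XML-style section, only the material that is needed gets included, and the most recent content goes last. The tag names and the helper function are illustrative choices, not a fixed convention.

```python
def build_context(task: str, documents: list[str], recent_messages: list[str]) -> str:
    """Assemble a prompt: labeled sections, only what is needed, newest last."""
    sections = [
        f"<task>\n{task}\n</task>",
        # Pack densely: include only the documents that are actually needed.
        "<documents>\n" + "\n---\n".join(documents) + "\n</documents>",
        # Recency matters: the latest conversation turns go at the end.
        "<recent_messages>\n" + "\n".join(recent_messages) + "\n</recent_messages>",
    ]
    return "\n\n".join(sections)

print(build_context(
    task="Answer the user's billing question.",
    documents=["Refund policy: ..."],
    recent_messages=["User: Why was I charged twice?"],
))
```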
KV Cache: The Performance Lever
When you send a prompt, the model computes key and value vectors for every token at each attention layer. The KV cache stores these results so that subsequent tokens don't have to recompute them from scratch.
```
First request:  [System prompt] [User message]
                ↓
        Compute KV for all tokens
                ↓
        Generate response

Second request: [System prompt] [User message] [Response] [User message 2]
                ↓
        Reuse cached KV for the prefix
                ↓
        Compute only the new tokens
```
Implications for Agents
- Prefix caching reduces latency by 50%+ on repeated prompts
- System prompts should be stable to maximize cache hits (see the sketch after this list)
- Streaming feels faster because tokens arrive as they are generated, instead of after the full response has been computed
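A minimal sketch of a cache-friendly request layout, shown with an OpenAI-style message list purely for illustration: the system prompt stays byte-for-byte identical across turns and new content is only ever appended, so providers that support prompt caching can reuse the shared prefix (the exact caching behavior and thresholds are provider-specific).

```python
# Keep the system prompt constant across requests so the cached prefix matches.
SYSTEM_PROMPT = "You are a support agent. Follow the refund policy strictly."

def build_messages(history: list[dict], new_user_message: str) -> list[dict]:
    """Stable prefix first, new content appended last, never reordered."""
    return (
        [{"role": "system", "content": SYSTEM_PROMPT}]  # identical every turn
        + history                                       # grows append-only
        + [{"role": "user", "content": new_user_message}]
    )

history: list[dict] = []
first = build_messages(history, "Why was I charged twice?")
# ...after the model replies, append both turns and reuse the same prefix:
history += [first[-1], {"role": "assistant", "content": "Let me check that."}]
second = build_messages(history, "Can I get a refund?")
print(len(second), "messages; the first", len(first), "form a reusable prefix")
```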
Trade-offs in System Design
| Factor | Small Model (e.g., GPT-4o-mini) | Frontier Model (e.g., GPT-4o) |
|---|---|---|
| Latency | ~100 ms to first token | ~500 ms to first token |
| Cost | ~$0.15 per 1M input tokens | ~$2.50 per 1M input tokens |
| Capability | Good for simple tasks | Required for complex reasoning |
The art of agent architecture is choosing the right model for each sub-task.
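One way to put this into practice is a small router that defaults to the cheaper model and escalates only when a task looks complex. The keyword-based rule below is deliberately naive and the model names simply echo the table above; a production router would use better signals.

```python
def pick_model(task: str, needs_reasoning: bool = False) -> str:
    """Naive router: cheap model by default, frontier model for complex work."""
    COMPLEX_HINTS = ("plan", "prove", "multi-step", "analyze")
    if needs_reasoning or any(hint in task.lower() for hint in COMPLEX_HINTS):
        return "gpt-4o"       # frontier: complex reasoning
    return "gpt-4o-mini"      # small: classification, extraction, routine calls

print(pick_model("Extract the invoice number from this email"))  # gpt-4o-mini
print(pick_model("Plan a multi-step refund investigation"))      # gpt-4o
```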
Next Chapter: Deterministic Constraints on Stochastic Outputs — How to tame the randomness with JSON Schema and structured outputs.