Observability & Evaluation

"It worked yesterday."

This is the defining frustration of agent development. You didn't change the prompt. You didn't update the model. The tool code is identical. But the agent that flawlessly handled customer refunds last week now apologizes excessively, calls the wrong tool, and occasionally hallucinates a policy you've never had.

Traditional software has a comforting property: given the same input, you get the same output. Bugs are reproducible. Stack traces point to line numbers. When something fails, you know where to look.

Agents break these assumptions. The same prompt produces different outputs across runs. Failures appear randomly—20% of the time, or only when the context window is nearly full, or only when the user phrases their request a certain way. When something breaks, there's no stack trace. There's only a conversation where the agent seemed confident about every wrong decision it made.

This chapter is about building systems to answer two questions: What happened? and Was it good? The first is observability. The second is evaluation. They're deeply connected—you can't evaluate what you can't see, and visibility without judgment just creates expensive log storage.


What You Can't See Will Hurt You

Here's a failure pattern that happens more often than anyone admits.

A customer support agent goes live. It handles refund requests beautifully in testing. But in production, users start reporting strange behavior: the agent approves refunds for orders that don't qualify, or worse, tells users they're eligible for refunds the company doesn't offer.

The team investigates. The prompt looks fine. The model hasn't changed. What's happening?

After hours of digging, someone finally looks at the actual conversation logs. The agent was calling search_knowledge_base("refund policy"), but the knowledge base had been updated the previous week with a new draft document that was never finalized. The agent was reading and correctly following the wrong policy.

Without the ability to see which tool returned which data at which moment in the conversation, this bug would have taken weeks to find. The model wasn't broken. The prompt wasn't broken. The data the model received was broken—and no amount of staring at the prompt would reveal that.

This is why observability comes first. You need the complete record.

Traces: The Complete Picture

Every interaction with your agent should leave an audit trail: not just "user asked X, agent said Y," but every intermediate step.

A trace captures the full timeline of a single task or conversation. Each step in the trace is called a span: an LLM call, a tool invocation, a response generation. Spans are connected—one leads to the next, branches fork, and you can follow the path from input to output.

For each span, capture:

  • Inputs: The exact request (prompt, tool arguments)
  • Outputs: The response or return value
  • Latency: Time to first token, total duration
  • Tokens: Input, output, cached
  • Metadata: Model version, temperature, timestamp, session ID

When the refund flow breaks, you can open the trace and see exactly where: Did get_order return the wrong data? Did search_policy fail silently? Did the agent misinterpret a valid result?

Several mature platforms handle this:

  • Langfuse — Open-source, self-hostable. Good if you want to own your data.
  • LangSmith — Deep LangChain integration, powerful debugging UI.
  • Arize Phoenix — Built on OpenTelemetry, good for teams already using it.

The specific tool matters less than having something.

Instrument Early

Adding observability after a production incident is firefighting. Adding it before launch is engineering. Start with the basics: every LLM call, every tool invocation, every final response. Expand coverage when you encounter bugs you can't diagnose.
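The "basics" can be as simple as a decorator on every LLM call and tool function. A sketch, with an in-memory list standing in for a real trace backend such as Langfuse or LangSmith:

```python
import functools
import time

TRACE_LOG: list = []  # stand-in for a real trace backend

def traced(fn):
    """Record inputs, output, and latency for every call to fn."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.time()
        result = fn(*args, **kwargs)
        TRACE_LOG.append({
            "span": fn.__name__,
            "inputs": {"args": args, "kwargs": kwargs},
            "output": result,
            "latency_s": round(time.time() - start, 4),
        })
        return result
    return wrapper

@traced
def get_order(order_id: str) -> dict:
    # Hypothetical tool; in production this would hit your order service.
    return {"id": order_id, "status": "shipped"}
```

One decorator per tool is cheap to add before launch and painful to retrofit after an incident.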


Evaluation: From Vibes to Evidence

Observability tells you what happened. Evaluation tells you whether it was any good.

In the early days of building an agent, evaluation is informal. You test a few cases, eyeball the outputs, and ship when it "feels right." This works surprisingly well—until it doesn't. The breaking point usually comes when the team makes a change and users report that the agent "feels worse," but no one can articulate what's different or verify anything objectively.

The Anthropic team hit this wall with Claude Code. Early development relied on dogfooding—employees using the tool and reporting issues. But as the agent matured, the team couldn't tell whether a new prompt was better or worse without running structured tests. They built eval suites for specific behaviors: conciseness, file edit quality, and eventually complex behaviors like "over-engineering." Those evals became the language for discussing improvements.

The lesson: evaluation is how you turn intuition into something you can measure, communicate, and improve.

The Shape of an Eval

An evaluation has four parts.

Test cases define specific scenarios. Each case has an input (what you give the agent), expected behavior (what should happen), and grading criteria (how you decide if it worked). A test case might be: "User asks about order status for order #12345. Agent should retrieve the order and report status accurately."

Trials run each test case multiple times. This sounds wasteful if you come from traditional testing, where you run once and check the result. But agents are non-deterministic. A test might pass 80% of the time. Running five trials per case gives you a pass rate instead of a pass/fail binary.

Graders judge the outputs. These can be code-based (did the JSON parse correctly?), model-based (use another LLM to assess quality), or human (have a person review). More on graders shortly.

Results aggregate scores across trials and cases. "The agent passes 87% of refund scenarios across 200 trials" is actionable. "The agent seems okay" is not.
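The four parts fit together in a small harness. This is a sketch with hypothetical `agent` and `grader` callables, not any framework's API:

```python
def run_eval(cases: list, agent, grader, trials: int = 5) -> dict:
    """Run every test case `trials` times and report per-case pass rates.

    cases:  list of dicts with at least "name" and "input" keys
    agent:  callable that takes an input and returns an output
    grader: callable that takes (output, case) and returns pass/fail
    """
    results = {}
    for case in cases:
        passes = sum(
            bool(grader(agent(case["input"]), case)) for _ in range(trials)
        )
        results[case["name"]] = passes / trials  # pass rate, not pass/fail
    return results
```

The key design choice is the return value: a pass *rate* per case, which is what the trials exist to measure.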

Handling Non-Determinism

A test case that passes 80% of the time is neither passing nor failing—it's probabilistic. Two metrics help reason about this:

pass@k measures the chance of at least one success in k attempts. If your agent has 60% per-trial success, pass@3 is about 94%—three tries gives you good odds of at least one good result. This metric matters when retrying is acceptable: research tools, internal assistants, draft generation.

pass^k measures the chance of succeeding every time across k trials. With 60% per-trial success, pass^3 is about 22%—demanding consistency is much harder. This metric matters for customer-facing systems where every interaction should work.

A 90% pass@1 rate sounds great until you realize it means 1 in 10 users gets a broken experience.
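Both metrics follow directly from the per-trial success rate, assuming independent trials:

```python
def pass_at_k(p: float, k: int) -> float:
    """Chance of at least one success in k independent trials."""
    return 1 - (1 - p) ** k

def pass_hat_k(p: float, k: int) -> float:
    """Chance of succeeding in every one of k independent trials."""
    return p ** k
```

With a 60% per-trial rate, `pass_at_k(0.6, 3)` is about 0.94 while `pass_hat_k(0.6, 3)` is about 0.22, matching the numbers above.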


The Art of Grading

Not all graders are equal. Choosing the right grading strategy takes judgment.

When Code Is Enough

Start with code-based graders wherever possible. They're deterministic, fast, and cheap.

import re

def grade_refund_response(output: str, expected_amount: float) -> bool:
    """Did the agent state the correct refund amount?"""
    # Extract every dollar amount mentioned in the response.
    amounts = re.findall(r'\$[\d,]+\.?\d*', output)
    return any(
        float(a.replace('$', '').replace(',', '')) == expected_amount
        for a in amounts
    )

This works for format validation (is it valid JSON?), presence checks (did it include the required disclaimer?), exact matches (is the answer "42"?), and tool call validation (did it call the right function with the right arguments?).

But code graders break on anything requiring judgment. They can't assess whether a response was helpful, whether the tone was appropriate, or whether the answer was mostly right with a minor error. For those, you need something more flexible.

LLM-as-Judge

Using another LLM to evaluate your agent's outputs is powerful—and treacherous.

The idea is simple: construct a prompt that presents the agent's output, the expected behavior, and grading criteria. The judge LLM returns a structured assessment.

Here's a rubric-based approach that works well for customer service scenarios:

JUDGE_PROMPT = """
You are evaluating an AI customer service agent.
 
## Conversation
{transcript}
 
## Grading Criteria
 
For each dimension, score 1-5:
 
**CORRECTNESS**: Is the information accurate?
- 5: Fully accurate
- 3: Minor inaccuracies that don't affect the outcome
- 1: Major errors or fabrications
 
**HELPFULNESS**: Did the agent solve the user's problem?
- 5: Problem completely resolved
- 3: Partially addressed, user would need follow-up
- 1: Did not address the actual need
 
**SAFETY**: Did the agent stay within policy?
- 5: Fully compliant
- 3: Borderline or unclear case
- 1: Promised something outside policy
 
If you cannot assess a dimension, respond "unknown" rather than guessing.
 
Respond as JSON:
{{"correctness": N, "helpfulness": N, "safety": N, "pass": true/false}}
"""
# JSON braces are doubled so JUDGE_PROMPT.format(transcript=...) leaves them literal.

The explicit rubric is important. Without it, LLM judges make up their own criteria, and those criteria drift between runs. The "unknown" escape hatch is also critical—LLMs will confidently guess rather than admit uncertainty unless you give them permission to abstain.

Calibrate against humans. Before trusting your judge, have humans grade 50-100 examples. Run the LLM judge on the same set. Compare. Where do they agree? Where do they diverge? If the LLM is too lenient (common), add explicit failure examples to the prompt: "A response that does X should fail." Iterate until human and LLM grades align on at least 80% of binary decisions.
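The calibration check itself is a few lines. A sketch, assuming grades have already been collected as parallel lists of pass/fail decisions:

```python
def agreement_rate(human_grades: list, judge_grades: list) -> float:
    """Fraction of binary pass/fail decisions where human and LLM judge agree."""
    assert len(human_grades) == len(judge_grades), "grade the same examples"
    matches = sum(h == j for h, j in zip(human_grades, judge_grades))
    return matches / len(human_grades)

def judge_is_calibrated(human_grades, judge_grades, threshold: float = 0.80) -> bool:
    """Apply the 80%-agreement bar described above."""
    return agreement_rate(human_grades, judge_grades) >= threshold
```

Disagreements are more informative than the rate itself: each one is either a judge-prompt fix or a rubric ambiguity worth resolving.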

Watch for biases. LLM judges exhibit predictable quirks. Length bias: longer responses score higher, even when brevity is better. Position bias: in pairwise comparisons, some models favor the first option. Self-preference: models prefer outputs from their own family. If you're using Gemini to judge Gemini, be aware. Consider a different model for judging, or test for bias by swapping positions and checking for consistency.
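The position-swap test can be automated. A sketch, assuming a pairwise judge that returns `"first"` or `"second"`:

```python
def position_consistent(pairwise_judge, response_a: str, response_b: str) -> bool:
    """A judge free of position bias picks the same winner in both orders."""
    original = pairwise_judge(response_a, response_b)   # "first" or "second"
    swapped = pairwise_judge(response_b, response_a)
    # Consistent iff response_a wins in both orders, or loses in both.
    return (original == "first") == (swapped == "second")

# A deliberately position-biased judge: always prefers whatever it saw first.
biased_judge = lambda x, y: "first"
```

Run this over a sample of pairs; a high inconsistency rate means the judge's verdicts reflect ordering, not quality.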

Use a stronger judge. Your judge should be at least as capable as the agent being evaluated. Gemini Pro evaluating Flash is fine. Flash evaluating Pro will miss subtle issues.

When Humans Must Decide

Human graders are the gold standard—and prohibitively expensive for routine use.

Reserve humans for calibrating LLM judges (the initial 50-100 examples), for subjective quality assessments where you're defining policy rather than testing it, and for high-stakes domains (legal, medical, financial) where errors carry real consequences.

Keep human batches small—10-20 at a time. Reviewer fatigue is real. By the 80th response, people are rubber-stamping. Rotate reviewers. Use human time strategically, not routinely.

Layering Graders

The best systems combine approaches. Use code graders as a first filter—they're fast and catch obvious failures. Escalate to LLM judges for nuanced cases. Reserve human review for ambiguous edge cases or periodic calibration.


What Matters to Measure

The Path vs. The Outcome

Here's a common eval mistake: testing whether the agent followed specific steps.

"First it should call get_order, then check_policy, then process_refund—in that order."

This breaks immediately. What if a smarter model realizes it already has the policy in context and skips the redundant lookup? What if it finds a more efficient path you didn't anticipate? Frontier models are creative. Overly prescriptive tests penalize intelligence.

The better approach: grade the outcome. After the conversation, is there a refund in the database for the correct amount? Did the user receive confirmation? Did the agent stay within policy? The path matters less than the destination—unless specific steps are legally or compliance-required.
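An outcome grader inspects end state rather than the sequence of tool calls. A sketch with an in-memory dict standing in for the refunds table; the names are illustrative:

```python
def grade_refund_outcome(refunds: dict, order_id: str,
                         expected_amount: float, transcript: str) -> bool:
    """Grade the destination, not the path: end state plus user confirmation."""
    issued = refunds.get(order_id)                # is a refund actually recorded?
    confirmed = "refund" in transcript.lower()    # was the user told about it?
    return (issued is not None
            and abs(issued - expected_amount) < 0.01  # tolerate float cents
            and confirmed)
```

Nothing here cares whether the agent called `get_order` before `check_policy`, so a smarter path still passes.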

The Anthropic team learned this building τ2-bench for agent evaluation. Claude Opus 4.5 initially "failed" a flight booking task—it didn't follow the expected steps. But on manual review, it had found a loophole in the policy that resulted in a better outcome for the user. The test was broken, not the agent.

Capability vs. Regression

You need two kinds of eval suites, and they serve different purposes.

Capability evals push the frontier. These should start at low pass rates—30%, 50%. You add hard cases, edge cases, adversarial cases. The goal is finding where your agent fails so you can improve it. When capability evals approach 95% pass rates, they're saturated. Time to add harder tests.

Regression evals protect the baseline. These should be near 100%. They encode known-working behavior and alert you when something breaks. Every production bug you fix should become a regression test: encode the failure, verify the fix, prevent it from recurring.

As capability evals saturate—your agent finally passes most cases—graduate them to the regression suite. What was once aspirational becomes a new baseline to protect.

Balance Positive and Negative

Test both directions. Does the agent act when it should? Does it not act when it shouldn't?

One-sided evals produce one-sided optimization. If you only test "approve valid refunds," you might build an agent that approves everything. The Anthropic team hit this building Claude.ai's web search. They needed tests for queries where Claude should search (current events) and queries where it shouldn't (basic knowledge). Testing only "should search" produced an agent that searched compulsively.


Building an Eval System from Nothing

If you have no evals today, don't try to build the perfect system. Start ugly. Iterate fast.

Start with Failures

Your first test cases should come from things that have already broken. Dig through support tickets, bug reports, and your own notes from manual testing. Each failure becomes a regression test: "This specific thing should never happen again."

Twenty cases is enough to start. You're not trying to measure overall quality yet—you're trying to prevent known failures from recurring. Every test case encodes a lesson learned.

Add Core Use Cases

Next, add the happy path. What are your top five user scenarios? The flows you demo to stakeholders? The cases that must work for the product to be viable?

If these break, you want to know immediately. These become your baseline.

Make Criteria Unambiguous

Here's the test: if two engineers reviewed the same output, would they independently agree on whether it passed?

Vague criteria like "the response should be helpful" produce noisy metrics. You'll waste time investigating "failures" that are actually fine. Rewrite as observable outcomes: "The response must include the order status. The refund amount must match the original purchase. The agent must not promise expedited shipping."

Automate Ruthlessly

Manual evals don't scale. Run your suite automatically on every significant prompt change, before deploying new model versions, and on a weekly schedule to catch drift.

Set alerts for significant drops. A 5% decline in a noisy metric might be statistical variance. A 20% decline in your core suite means something is wrong.
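A relative-drop rule captures this distinction. A sketch, with the 20% threshold from above as the default:

```python
def should_alert(baseline: float, current: float, threshold: float = 0.20) -> bool:
    """Flag drops larger than `threshold` relative to the baseline pass rate."""
    if baseline <= 0:
        return False  # no baseline to compare against
    drop = (baseline - current) / baseline
    return drop >= threshold
```

Using a relative rather than absolute drop means a decline from 50% to 40% triggers just as a decline from 95% to 76% does.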

Treat It Like Code

Eval suites require maintenance. Add cases when you encounter new failure modes. Retire cases when features change or policies evolve. Upgrade your LLM judges when more capable models become available. Neglected suites drift from reality—they become cargo cult metrics that go up and down without correlating with actual quality.


Debugging in Production

Observability and evaluation converge when something breaks in production. Here's how to handle it.

A Debugging Story

Let's make this concrete. Your support agent is live. A user reports that it "forgot" a detail they mentioned earlier in the conversation.

You pull the trace. It shows a long conversation—15 turns—where the user initially mentioned being a premium member and asked about expedited shipping. By turn 12, the agent suggested standard shipping "since there's no membership information available."

First thought: context window overflow? You check the token count: 18K tokens, well under the 128K limit.

Second thought: you look at turn 10. The agent had called get_shipping_options, which returned a large payload of all shipping options—6,000 tokens of irrelevant JSON. This pushed earlier context—including the "I'm a premium member" message—further from the model's immediate attention.

The tool wasn't broken. The prompt wasn't broken. The context engineering was broken—a verbose tool response diluted the important information. The fix: modify get_shipping_options to filter by membership status before returning, reducing the payload from 6,000 tokens to 400.

Without the trace, you'd be guessing.

The Debugging Checklist

When an agent fails in production, work through this systematically:

  1. Reproduce first. Replay the exact inputs. Does it fail again? If not, you're dealing with low-probability behavior—maybe a pass^k issue—rather than a consistent bug.

  2. Find the wrong turn. Read the transcript from the beginning, looking for the first moment the agent went astray. Everything after that is downstream from one bad decision.

  3. Distinguish model vs. system failures. If the agent called the right tool but the tool returned an error, that's infrastructure. If the agent decided to delete an order when the user asked for status, that's reasoning. Different causes, different fixes.

  4. Test the capability limit. Swap in a more capable model and replay the trace. If it passes, your prompt is fine—you just need a smarter model for this task. If it still fails, the instructions are unclear.

  5. Check context pressure. Long traces often fail near the end when early context gets compressed. Signs: the agent contradicts itself, repeats questions already answered, or ignores information it previously acknowledged.

Common Failure Patterns

After debugging enough failures, signatures emerge:

  Symptom                        Likely Cause
  Wrong tool called              Tool descriptions overlap or aren't distinct
  Right tool, wrong arguments    Ambiguous parameter names, missing validation
  Hallucinated facts             Missing retrieval, outdated knowledge, no grounding
  Infinite loops                 Missing exit conditions, confused by repeated failures
  Confident wrong answers        Capability limit—try chain-of-thought or stronger model
  Ignores earlier context        Context too long, important info pushed out of attention

The Economics of All This

Evals cost money. Every trial burns tokens. Every human review costs time. This is real, and ignoring it leads to either under-investment (no evals, chaos) or over-investment (99% test coverage for trivial features, slow iteration).

Token Math

A typical eval run might involve running the agent multiple times per test case (say, 5 trials), plus running LLM judges on each output. For 100 test cases at 2K tokens per trial plus 500 tokens for judging, that's 5 × (2,000 + 500) = 12,500 tokens per case, or 1.25 million tokens per full run.

At $0.01 per 1K tokens, that's $12.50 per run. Sounds cheap. But run it 10 times during development: $125. Weekly for a year: $650. On every CI push (20/day): over $91K annually. Costs compound.
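The arithmetic is worth keeping as a function so you can plug in your own numbers. The defaults mirror the assumptions above:

```python
def eval_run_cost(cases: int = 100, trials: int = 5,
                  agent_tokens: int = 2_000, judge_tokens: int = 500,
                  price_per_1k: float = 0.01) -> tuple[int, float]:
    """Total tokens and dollar cost for one full eval run."""
    tokens = cases * trials * (agent_tokens + judge_tokens)
    return tokens, tokens / 1_000 * price_per_1k
```

Multiply the per-run cost by your run frequency before committing to a CI schedule; that one multiplication is where budgets quietly explode.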

Be Strategic

Tier your eval runs by frequency and depth:

  • Every commit: Smoke tests (10 critical cases that must pass)
  • PR merge: Full regression (100 cases covering core functionality)
  • Weekly: Capability evals (500 cases, broader coverage, harder tests)
  • On-demand: Human review (calibration, ambiguous cases, high-stakes domains)

Optimize further by caching agent outputs when testing different judge prompts, sampling from large test suites rather than running everything, and using cheaper models for obvious cases while escalating to expensive models for nuance.

The goal isn't maximum coverage. It's maximum signal per dollar spent.


Connecting Evals to Production

Eval suites test what you predicted would matter. Production reveals what actually matters.

Monitor What Counts

Track production metrics alongside eval scores:

  • Success rate: What percentage of conversations end with the user's problem solved? (Requires defining success—user confirms resolution, no follow-up ticket, positive feedback)
  • Fallback rate: How often does the agent escalate to humans or admit "I don't know"?
  • Error rate: Tool failures, filtered responses, abrupt conversation endings
  • Latency: Time to response, end-to-end task duration
  • Cost per session: Total tokens consumed

These production signals should correlate with your eval scores. If your eval suite improves but production success rate drops, your tests aren't measuring what matters.

The Feedback Loop

The most valuable pattern: every production failure becomes a test case.

When a user reports a problem, pull the trace. Diagnose the failure. Ship the fix. Add a regression test that would have caught this bug. Run the full suite to verify you didn't break anything else.

Your eval suite becomes a living artifact encoding every lesson learned. The team's knowledge accumulates in code, not in tribal memory.
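The trace-to-test step can be mechanical. A sketch, assuming the simple trace shape used earlier in the chapter; the dict keys are illustrative:

```python
def trace_to_regression_case(trace: dict, name: str, criteria: str) -> dict:
    """Turn a diagnosed production failure into a regression test case."""
    first_span = trace["spans"][0]
    return {
        "name": name,
        "input": first_span["inputs"],  # replay the original user request
        "criteria": criteria,           # the observable outcome that must now hold
    }
```

The important discipline is the `criteria` field: it should be the observable, unambiguous pass condition the fix was verified against, not a restatement of the bug.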

A/B Test Significant Changes

When you make a major change—new model, substantial prompt rewrite—run an A/B test. Route some users to the new version. Compare success rates, task completion, user feedback. Evals test what you thought to test. A/B tests catch everything else.


Reference Architecture

The core loop: evals gate deployment, production surfaces failures, failures become tests, tests prevent recurrence.


In Practice

Observability and evaluation answer the two hardest questions in agent development: what happened, and was it good?

Without tracing, you're guessing why things broke. Without evaluation, you're guessing whether things are getting better. Without the feedback loop connecting production to tests, you're repeating the same mistakes.

The tools exist. The patterns are established. What's often missing is the discipline to implement them before a crisis forces your hand.

Start small. Instrument your agent. Encode your failures as tests. Build the habit of looking at transcripts, not just outputs. The goal isn't an agent that never fails—probabilistic systems will always surprise you. The goal is an agent where failures are visible, diagnosable, and prevented from recurring.

