Evals & Testing

Testing traditional software is binary: assert(2 + 2 == 4). Testing AI output is fuzzy: assert("The capital is Paris") ≈ "Paris is the capital of France".

How do we build a CI/CD pipeline for agents? The industry answer is Evals.

The Hierarchy of Testing

1. Unit Tests (Deterministic)

Test the code around the LLM; a minimal example follows the list below.

  • Does the prompt builder throw errors?
  • Does the tool parser handle empty JSON?
  • Does the markdown renderer crash on malicious input?
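For example, the "empty JSON" case above can be pinned down with an ordinary deterministic test. A minimal sketch using Node's built-in test runner; parseToolCall is a hypothetical stand-in for your real parser:

// Sketch: deterministic unit test around the LLM (parseToolCall is a hypothetical parser).
import { test } from "node:test";
import assert from "node:assert/strict";

// Stand-in for the project's real parser: returns [] instead of throwing on bad input.
function parseToolCall(raw: string): unknown[] {
    try {
        const parsed = JSON.parse(raw);
        return Array.isArray(parsed.tool_calls) ? parsed.tool_calls : [];
    } catch {
        return [];
    }
}

test("tool parser handles empty JSON without throwing", () => {
    assert.deepEqual(parseToolCall(""), []);
    assert.deepEqual(parseToolCall("{}"), []);
});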

2. Model-Based Evals (LLM-as-a-Judge)

Use a stronger model (e.g., GPT-4o) to grade the output of your agent.

The Criteria:

  • Factuality: Is the answer supported by the provided context?
  • Correctness: Did it solve the user's problem?
  • Tone/Style: Is it helpful and concise?

// Pseudo-code for LLM-as-a-Judge
async function evaluate(agentOutput, groundTruth) {
    const prompt = `
    Compare the AGENT OUTPUT with the GROUND TRUTH.
    Grade it on a scale of 0 to 1 (Fail/Pass).
    Reply with only the number.

    GROUND TRUTH: ${groundTruth}
    AGENT OUTPUT: ${agentOutput}
    `;

    // Parse the judge's reply into a numeric score so callers can threshold on it.
    const reply = await strongModel.generate(prompt);
    return parseFloat(reply);
}
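
In a test harness, you call the judge for each case and threshold the score; a minimal sketch (the 0.5 cutoff is an assumption to tune against your rubric):

// Sketch: pass/fail decision for one case (0.5 cutoff is an assumed convention).
const score = await evaluate(agentOutput, groundTruth);
const passed = score >= 0.5;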

Building a Golden Dataset

You need a dataset of (Input, ExpectedOutput) pairs; a sketch of one possible shape follows the steps below.

  1. Cold Start: Manually write 20 high-quality examples covering happy paths and edge cases.
  2. Production Logging: Log real user queries.
  3. Curate: Periodically review failed logs and add them to the Golden Dataset as regression tests.
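
There is no required format; the dataset can start as a typed array checked into the repo. A minimal sketch (the GoldenCase shape and tags field are assumptions, not a standard):

// Sketch: a golden dataset as a checked-in list of cases (shape is an assumption).
interface GoldenCase {
    input: string;          // the user query sent to the agent
    expectedOutput: string; // the ground truth the judge compares against
    tags?: string[];        // e.g. "happy-path", "edge-case", "regression"
}

const goldenDataset: GoldenCase[] = [
    {
        input: "What is the capital of France?",
        expectedOutput: "Paris is the capital of France.",
        tags: ["happy-path"],
    },
    {
        input: "capital france??",
        expectedOutput: "Paris is the capital of France.",
        tags: ["edge-case", "regression"],
    },
];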

Metrics that Matter

  • Pass Rate: % of test cases that pass the Judge.
  • Latency: P95 and P99 response times.
  • Cost: Average token cost per task.
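
These roll up from per-case results. A minimal aggregation sketch, assuming each eval run records the judge score, latency, and token cost (field names and the nearest-rank percentile are illustrative choices):

// Sketch: aggregating per-case eval results (field names are assumptions).
interface EvalResult {
    score: number;     // judge score in [0, 1]
    latencyMs: number; // wall-clock time for the agent's response
    costUsd: number;   // token cost of the run
}

function summarize(results: EvalResult[], passThreshold = 0.5) {
    const latencies = results.map(r => r.latencyMs).sort((a, b) => a - b);
    const percentile = (p: number) =>
        latencies[Math.min(latencies.length - 1, Math.floor((p / 100) * latencies.length))];

    return {
        passRate: results.filter(r => r.score >= passThreshold).length / results.length,
        p95LatencyMs: percentile(95),
        p99LatencyMs: percentile(99),
        avgCostUsd: results.reduce((sum, r) => sum + r.costUsd, 0) / results.length,
    };
}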

Summary

You cannot improve what you cannot measure. "Vibe checks" (manually trying a few prompts) scale poorly. Build a Golden Dataset and an Eval Pipeline (using tools like Braintrust or LangSmith) to deploy with confidence.