Evals & Testing

Testing traditional software is binary: assert(2 + 2 == 4). Testing AI output is fuzzy: assert("The capital is Paris") ≈ "Paris is the capital of France".

How do we build a CI/CD pipeline for agents? The industry answer is Evals.

The Hierarchy of Testing

1. Unit Tests (Deterministic)

Test the code around the LLM; a minimal example follows the list below.

  • Does the prompt builder throw errors?
  • Does the tool parser handle empty JSON?
  • Does the markdown renderer crash on malicious input?
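For example, the "empty JSON" case above can be pinned down with an ordinary deterministic test. A minimal sketch using Node's built-in test runner; parseToolCall is a hypothetical stand-in for your real parser:

// Sketch: deterministic unit test around the LLM (parseToolCall is a hypothetical parser).
import { test } from "node:test";
import assert from "node:assert/strict";

// Stand-in for the project's real parser: returns [] instead of throwing on bad input.
function parseToolCall(raw: string): unknown[] {
    try {
        const parsed = JSON.parse(raw);
        return Array.isArray(parsed.tool_calls) ? parsed.tool_calls : [];
    } catch {
        return [];
    }
}

test("tool parser handles empty JSON without throwing", () => {
    assert.deepEqual(parseToolCall(""), []);
    assert.deepEqual(parseToolCall("{}"), []);
});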

2. Model-Based Evals (LLM-as-a-Judge)

Use a stronger model (e.g., GPT-4o) to grade the output of your agent.

The Criteria:

  • Factuality: Is the answer supported by the provided context?
  • Correctness: Did it solve the user's problem?
  • Tone/Style: Is it helpful and concise?

// Pseudo-code for LLM-as-a-Judge
async function evaluate(agentOutput, groundTruth) {
    const prompt = `
    Compare the AGENT OUTPUT with the GROUND TRUTH.
    Grade it on a scale of 0 to 1 (Fail/Pass).
    Reply with only the number.

    GROUND TRUTH: ${groundTruth}
    AGENT OUTPUT: ${agentOutput}
    `;

    // Parse the judge's reply into a numeric score so callers can threshold on it.
    const reply = await strongModel.generate(prompt);
    return parseFloat(reply);
}
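
In a test harness, you call the judge for each case and threshold the score; a minimal sketch (the 0.5 cutoff is an assumption to tune against your rubric):

// Sketch: pass/fail decision for one case (0.5 cutoff is an assumed convention).
const score = await evaluate(agentOutput, groundTruth);
const passed = score >= 0.5;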

Building a Golden Dataset

You need a dataset of (Input, ExpectedOutput) pairs; a sketch of one possible shape follows the steps below.

  1. Cold Start: Manually write 20 high-quality examples covering happy paths and edge cases.
  2. Production Logging: Log real user queries.
  3. Curate: Periodically review failed logs and add them to the Golden Dataset as regression tests.
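
There is no required format; the dataset can start as a typed array checked into the repo. A minimal sketch (the GoldenCase shape and tags field are assumptions, not a standard):

// Sketch: a golden dataset as a checked-in list of cases (shape is an assumption).
interface GoldenCase {
    input: string;          // the user query sent to the agent
    expectedOutput: string; // the ground truth the judge compares against
    tags?: string[];        // e.g. "happy-path", "edge-case", "regression"
}

const goldenDataset: GoldenCase[] = [
    {
        input: "What is the capital of France?",
        expectedOutput: "Paris is the capital of France.",
        tags: ["happy-path"],
    },
    {
        input: "capital france??",
        expectedOutput: "Paris is the capital of France.",
        tags: ["edge-case", "regression"],
    },
];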

Metrics that Matter

  • Pass Rate: % of test cases that pass the Judge.
  • Latency: P95 and P99 response times.
  • Cost: Average token cost per task.
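
These roll up from per-case results. A minimal aggregation sketch, assuming each eval run records the judge score, latency, and token cost (field names and the nearest-rank percentile are illustrative choices):

// Sketch: aggregating per-case eval results (field names are assumptions).
interface EvalResult {
    score: number;     // judge score in [0, 1]
    latencyMs: number; // wall-clock time for the agent's response
    costUsd: number;   // token cost of the run
}

function summarize(results: EvalResult[], passThreshold = 0.5) {
    const latencies = results.map(r => r.latencyMs).sort((a, b) => a - b);
    const percentile = (p: number) =>
        latencies[Math.min(latencies.length - 1, Math.floor((p / 100) * latencies.length))];

    return {
        passRate: results.filter(r => r.score >= passThreshold).length / results.length,
        p95LatencyMs: percentile(95),
        p99LatencyMs: percentile(99),
        avgCostUsd: results.reduce((sum, r) => sum + r.costUsd, 0) / results.length,
    };
}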

Summary

You cannot improve what you cannot measure. "Vibe checks" (manually trying a few prompts) scale poorly. Build a Golden Dataset and an Eval Pipeline (using tools like Braintrust or LangSmith) to deploy with confidence.