Evals & Testing
Testing traditional software is binary: assert(2 + 2 == 4).
Testing AI output is fuzzy: assert("The capital is Paris") ≈ "Paris is the capital of France".
How do we build a CI/CD pipeline for agents? The industry answer is Evals.
The Hierarchy of Testing
1. Unit Tests (Deterministic)
Test the code around the LLM.
- Does the prompt builder throw errors?
- Does the tool parser handle empty JSON? (See the sketch after this list.)
- Does the markdown renderer crash on malicious input?
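For example, the tool-parser check above can be an ordinary unit test. Below is a minimal sketch using Node's built-in test runner; parseToolCall is a hypothetical helper written for illustration, not part of any framework:

import { test } from "node:test";
import assert from "node:assert/strict";

// Hypothetical tool-call parser: returns null instead of throwing on bad input.
function parseToolCall(raw: string): { name: string; args: unknown } | null {
  try {
    const parsed = JSON.parse(raw);
    if (typeof parsed?.name !== "string") return null;
    return { name: parsed.name, args: parsed.arguments ?? {} };
  } catch {
    return null; // empty or malformed JSON is an expected case, not a crash
  }
}

test("tool parser handles empty and malformed JSON", () => {
  assert.equal(parseToolCall(""), null);
  assert.equal(parseToolCall("{not json"), null);
  assert.deepEqual(
    parseToolCall('{"name": "search", "arguments": {"q": "Paris"}}'),
    { name: "search", args: { q: "Paris" } }
  );
});

These tests involve no model calls, so they stay fast, cheap, and deterministic enough to run on every commit.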
2. Model-Based Evals (LLM-as-a-Judge)
Use a stronger model (e.g., GPT-4o) to grade the output of your agent.
The Criteria:
- Factuality: Is the answer supported by the provided context?
- Correctness: Did it solve the user's problem?
- Tone/Style: Is it helpful and concise?
// Pseudo-code for LLM-as-a-Judge
async function evaluate(agentOutput: string, groundTruth: string): Promise<number> {
  const prompt = `
    Compare the AGENT OUTPUT with the GROUND TRUTH.
    Grade it on a scale of 0 to 1 (Fail/Pass). Respond with the number only.

    GROUND TRUTH: ${groundTruth}
    AGENT OUTPUT: ${agentOutput}
  `;
  // strongModel is your judge client (e.g., a GPT-4o wrapper).
  const response = await strongModel.generate(prompt);
  return parseFloat(response); // assumes the judge replies with just the number
}

Building a Golden Dataset
You need a dataset of (Input, ExpectedOutput) pairs.
- Cold Start: manually write 20 high-quality examples covering happy paths and edge cases.
- Production Logging: log real user queries.
- Curate: periodically review failed logs and add them to the Golden Dataset as regression tests.
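A minimal sketch of the resulting regression loop, reusing the evaluate judge from above; runAgent is a hypothetical function that calls your agent:

// Golden dataset: curated (input, expected output) pairs used as regression tests.
interface GoldenCase {
  input: string;
  expected: string;
}

const goldenDataset: GoldenCase[] = [
  { input: "What is the capital of France?", expected: "Paris is the capital of France." },
  // ...add curated failures from production logs here
];

// Run every golden case through the agent, then grade it with the judge above.
async function runEvalSuite(runAgent: (input: string) => Promise<string>) {
  const results: { input: string; output: string; score: number }[] = [];
  for (const testCase of goldenDataset) {
    const output = await runAgent(testCase.input);
    const score = await evaluate(output, testCase.expected);
    results.push({ input: testCase.input, output, score });
  }
  return results;
}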
Metrics that Matter
- Pass Rate: % of test cases that pass the Judge.
- Latency: P95 and P99 response times.
- Cost: Average token cost per task.
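A sketch of how these could be rolled up from a finished eval run; the EvalResult fields are illustrative, not a fixed schema:

interface EvalResult {
  score: number;      // 0–1 from the judge
  latencyMs: number;  // wall-clock time for the task
  costUsd: number;    // token cost for the task
}

function summarize(results: EvalResult[]) {
  const passRate = results.filter(r => r.score >= 0.5).length / results.length;

  // Percentile latency via nearest-rank: sort ascending, index into the sorted array.
  const latencies = results.map(r => r.latencyMs).sort((a, b) => a - b);
  const percentile = (p: number) =>
    latencies[Math.min(latencies.length - 1, Math.ceil((p / 100) * latencies.length) - 1)];

  const avgCost = results.reduce((sum, r) => sum + r.costUsd, 0) / results.length;

  return { passRate, p95: percentile(95), p99: percentile(99), avgCost };
}

Tracked per commit, these numbers let the eval pipeline gate deploys the same way failing unit tests gate them in traditional CI.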
Summary
You cannot improve what you cannot measure. "Vibe checks" (manually trying a few prompts) scale poorly. Build a Golden Dataset and an Eval Pipeline (using tools like Braintrust or LangSmith) to deploy with confidence.