Why are evals important?

Why traditional testing breaks down for AI systems, and how evals help you measure quality, catch regressions, and ship with confidence.

The problems traditional testing can't catch

If you've built or shipped an AI-powered product, you've likely run into at least one of these problems:

  1. Your system works well in testing, but once deployed it starts to hallucinate or produce inconsistent results.
  2. A model your system depends on gets upgraded, and your application starts behaving differently.
  3. You push changes that improve one part of your application but cause regressions in another.
  4. You're trying to choose a model that gives accurate results without breaking the bank on cost.
  5. You've tweaked your prompts to improve output, but you have no way to measure whether the improvement is real.
  6. Your team asks "is this ready to ship?" and all you have is intuition, not data.

These problems share a common root: AI systems are non-deterministic. Unlike traditional software, where the same input reliably produces the same output, LLMs can behave differently each time they run. AI can hallucinate facts, produce inconsistent results, degrade when models get upgraded, and behave differently depending on the wording of a prompt. When something breaks, it's not always obvious why. When you want to improve the system, it's not always obvious how.

Without evaluation infrastructure, it's almost impossible to address these problems systematically.

A real-world example: the GPT-4o rollback

In April 2025, OpenAI had to roll back an update to GPT-4o. The update was meant to make the model more helpful, but it also made the model too agreeable, and therefore less truthful. OpenAI described the resulting behavior as "overly flattering or agreeable, often described as sycophantic." By weighting short-term thumbs-up/thumbs-down feedback too heavily, the team had pushed the model toward responses that were supportive but disingenuous.

The rollback was public and entirely preventable. With the right evals in place, the team could have caught the regression before shipping by measuring whether the model's responses maintained honesty and accuracy alongside satisfaction scores, rather than optimizing for one metric at the expense of another.

What evals give you

Evals (short for evaluations) are the practice of systematically testing, scoring, and comparing LLM outputs. They help you:

  • Measure system quality. Understand accuracy, cost, and latency across a representative dataset instead of guessing.
  • Track improvements. See exactly how changes to prompts, models, or system architecture affect output quality.
  • Catch regressions. Detect failures before every deploy, while they're still cheap to fix.
  • Ship with confidence. Back decisions with data ("accuracy went from 78% to 91% with no increase in latency") instead of intuition. You'll have benchmarks for the important parts of your application, and you'll be able to monitor them in production.
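At its core, the practice above boils down to a small loop: run a task over a dataset of examples and score each output. Here is a minimal sketch of that loop; the dataset, the `task` function (which stands in for a real model call), and the exact-match `scorer` are all hypothetical stand-ins you would replace with your own:

```python
# Minimal eval loop: run a task over a dataset and score each output.
# Everything here is a stand-in: swap `task` for a real model call and
# `scorer` for whatever metric fits your application.

dataset = [
    {"input": "2 + 2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
]

def task(input_text: str) -> str:
    """Stand-in for an LLM call; returns a canned answer."""
    answers = {"2 + 2": "4", "capital of France": "Paris"}
    return answers.get(input_text, "")

def scorer(output: str, expected: str) -> float:
    """Exact-match scorer: 1.0 if the output matches, else 0.0."""
    return 1.0 if output.strip() == expected.strip() else 0.0

# Score every row, then aggregate into a single accuracy number.
scores = [scorer(task(row["input"]), row["expected"]) for row in dataset]
accuracy = sum(scores) / len(scores)
print(f"accuracy: {accuracy:.0%}")
```

Because the output is a number rather than a vibe, rerunning this loop before each deploy is what turns "I think the prompt change helped" into "accuracy went from 78% to 91%."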

What's next

In the next lesson, you'll learn what an eval actually is: the three components every eval needs (a dataset, a task, and a scorer) and the three main approaches to scoring.

Further reading

Trace everything