Learn everything you need to know about evals by building and monitoring a customer support chatbot from scratch.
Why traditional testing breaks down for AI systems, and how evals help you measure quality, catch regressions, and ship with confidence.
If you've built or shipped an AI-powered product, you've likely run into at least one of these problems: hallucinated facts, inconsistent outputs for identical inputs, quality regressions after a model upgrade, or answers that change when a prompt is reworded.

These problems share a common root: AI systems are non-deterministic. Unlike traditional software, where the same input reliably produces the same output, LLMs can behave differently each time they run. When something breaks, it's not always obvious why. When you want to improve the system, it's not always obvious how.
Without evaluation infrastructure, it's almost impossible to answer those questions systematically.
In April 2025, OpenAI had to roll back one of its models. They had updated GPT-4o to make it more helpful, but in the process they made it too agreeable, and therefore less truthful. Developers described the behavior as "overly flattering or agreeable, often described as sycophantic." By weighting short-term thumbs-up/thumbs-down feedback too heavily, OpenAI pushed the model toward responses that were supportive but disingenuous.
The rollback was public and entirely preventable. With the right evals in place, the team could have caught the regression before shipping by measuring whether the model's responses maintained honesty and accuracy alongside satisfaction scores, rather than optimizing for one metric at the expense of another.
Evals (short for evaluations) are the practice of systematically testing, scoring, and comparing LLM outputs. They help you measure quality, catch regressions before they ship, and make changes with confidence.
In the next lesson, you'll learn what an eval actually is: the three components every eval needs (a dataset, a task, and a scorer) and the three main approaches to scoring.
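To make those three components concrete, here is a minimal sketch of an eval in Python. Everything in it is hypothetical: the dataset rows, the `task` function (a canned stand-in for a real chatbot call), and the keyword-match `scorer` are illustrative placeholders, not a real framework's API.

```python
# A minimal eval: dataset + task + scorer (all names are illustrative).

# 1. Dataset: inputs paired with what a good answer should contain.
dataset = [
    {"input": "How do I reset my password?", "expected_keyword": "reset link"},
    {"input": "Do you offer refunds?", "expected_keyword": "refund"},
]

def task(user_input: str) -> str:
    """Stand-in for the system under test; a real eval would call your chatbot/LLM here."""
    canned = {
        "How do I reset my password?": "Click the reset link we emailed you.",
        "Do you offer refunds?": "Yes, refunds are available within 30 days.",
    }
    return canned.get(user_input, "")

def scorer(output: str, expected_keyword: str) -> float:
    """Simplest possible scorer: 1.0 if the expected keyword appears, else 0.0."""
    return 1.0 if expected_keyword in output.lower() else 0.0

# 2-3. Run the task over the dataset and score each output.
scores = [scorer(task(row["input"]), row["expected_keyword"]) for row in dataset]
print(f"accuracy: {sum(scores) / len(scores):.2f}")
```

Real scorers range from exact-match checks like this one to statistical metrics and LLM-as-judge approaches, which the next lesson covers.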