Learn everything you need to know about evals by building and monitoring a customer support chatbot from scratch.
Create a dataset, test two chatbot personas in the playground, build a custom LLM-as-judge scorer, and save the results as experiments. No code required.
All the assets for this module are available at braintrustdata/eval-101-course/module-03.
In this lesson, you'll run your first eval entirely in the Braintrust UI. No code required. By the end, you'll have a dataset, two tested personas, a custom LLM-as-judge scorer, and two saved experiments.
We'll be using the customer support chatbot example from module two and building the entire eval in the Braintrust UI.
Sign in at braintrust.dev and create a new project called "Customer Support Chatbot."
Before running evals, configure your AI providers under Settings > AI providers. Add your OpenAI API key (and any other providers you want to test). Braintrust supports OpenAI, Anthropic, Google Gemini, Together AI, and others.
Navigate to Datasets in the left sidebar and select Upload dataset. Upload a CSV or JSON file containing customer support messages.
Your dataset should have at minimum an input column with the customer messages. You can also include expected outputs, metadata, and tags.
Example rows:
| input |
|---|
| "Why did my package disappear after tracking showed it was delivered?" |
| "Your product smells like burnt rubber. What's going on?" |
| "I've been waiting 3 weeks for a response from your team." |
| "Can I get a refund if I already opened the product?" |
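If you'd rather assemble the upload file programmatically than by hand, a minimal sketch like the following produces a CSV in the expected shape (the filename `support_dataset.csv` is just an example; only the `input` column is required):

```python
import csv

# Example rows from this lesson; add "expected", metadata, or tag
# columns alongside "input" if you want them in the dataset.
rows = [
    {"input": "Why did my package disappear after tracking showed it was delivered?"},
    {"input": "Your product smells like burnt rubber. What's going on?"},
    {"input": "I've been waiting 3 weeks for a response from your team."},
    {"input": "Can I get a refund if I already opened the product?"},
]

with open("support_dataset.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["input"])
    writer.writeheader()
    writer.writerows(rows)
```

Upload the resulting file through the Datasets page as described above.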
Open Playgrounds and set up your base task. Select a model (for example, GPT-4o mini) and define your system prompt.
You'll create two personas to compare:
Persona 1: Empathetic agent
You are a warm, empathetic customer support agent. Always acknowledge the
customer's feelings before addressing their issue. Use phrases like "I completely
understand how frustrating that must be" and "I'm so sorry you're dealing with
this." Be thorough in your response and make the customer feel heard.
Persona 2: Efficient agent
You are an efficient, no-nonsense customer support agent. Get straight to the
point. Provide the necessary information and next steps without filler. Be polite
but brief.
Run each persona against your dataset using the {{input}} template variable to inject each test case.
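Conceptually, the `{{input}}` variable is simple string substitution: for each dataset row, the playground replaces the placeholder with that row's input before calling the model. A rough sketch of the idea (this mirrors the behavior, not Braintrust's actual implementation):

```python
def render(template: str, variables: dict) -> str:
    """Fill {{name}} placeholders with per-test-case values."""
    for name, value in variables.items():
        template = template.replace("{{" + name + "}}", value)
    return template

prompt = render(
    "Customer message: {{input}}",
    {"input": "Can I get a refund if I already opened the product?"},
)
# prompt == "Customer message: Can I get a refund if I already opened the product?"
```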
Select Create scorer and configure an LLM-as-judge scorer. Name it "Brand Alignment" and select GPT-4o mini as the judge model.
Define your scoring prompt:
You are evaluating a customer support response.
Customer message: {{input}}
Assistant response: {{output}}
Rate the response on:
- Helpfulness and clarity
- Brand voice consistency
- Policy compliance
- Tone and appropriateness
Set the output type to a score from 0 to 1, with choice scores mapping each option the judge can select to a numeric value.
Enable chain of thought (CoT) so the judge explains its reasoning.
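To make the choice-score idea concrete, here's an illustrative sketch of the mapping the scorer applies. The labels and values below are assumptions for the example, not the UI's defaults; you define your own choices when configuring the scorer:

```python
# Hypothetical choice labels mapped to numeric scores in [0, 1].
CHOICE_SCORES = {
    "poor": 0.0,
    "acceptable": 0.5,
    "excellent": 1.0,
}

def score_choice(choice: str) -> float:
    """Convert the judge's selected label to its numeric score."""
    return CHOICE_SCORES[choice.lower()]
```

With chain of thought enabled, the judge also records the reasoning behind whichever label it picked, which you'll inspect in the comparison view later.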
Run each persona against the dataset with the Brand Alignment scorer applied. Once the results come in, save each run as an experiment. This creates a permanent snapshot of the results, including all inputs, outputs, scores, and metadata.
Saving experiments matters because you can revisit and compare the results at any time without re-running anything.
In the next lesson, you'll open the comparison view to analyze your two experiments side by side, looking at scores, token costs, and chain-of-thought reasoning to decide which persona to ship.