Simple eval using the UI

Create a dataset, test two chatbot personas in the playground, build a custom LLM-as-judge scorer, and save the results as experiments. No code required.

All the assets for this module are available at braintrustdata/eval-101-course/module-03.

What you'll build

In this lesson, you'll run your first eval entirely in the Braintrust UI, without writing a line of code. By the end, you'll have:

  • A dataset of customer support queries loaded into Braintrust
  • Two chatbot personas with different system prompts
  • A custom LLM-as-judge scorer for brand alignment
  • Two saved experiments comparing the personas

We'll be using the customer support chatbot example from module two and building the entire eval in the Braintrust UI.

Step 1: Set up Braintrust

Sign in at braintrust.dev and create a new project called "Customer Support Chatbot."

Before running evals, configure your AI providers under Settings > AI providers. Add your OpenAI API key (and any other providers you want to test). Braintrust supports OpenAI, Anthropic, Google Gemini, Together AI, and others.

Step 2: Create a dataset

Navigate to Datasets in the left sidebar and select Upload dataset. Upload a CSV or JSON file containing customer support messages.

Your dataset should have at minimum an input column with the customer messages. You can also include expected outputs, metadata, and tags.

Example rows for the input column:

  • "Why did my package disappear after tracking showed it was delivered?"
  • "Your product smells like burnt rubber. What's going on?"
  • "I've been waiting 3 weeks for a response from your team."
  • "Can I get a refund if I already opened the product?"
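If you'd rather generate the upload file programmatically than assemble it by hand, a minimal sketch using only the Python standard library (the filename `support_queries.csv` is arbitrary; the only required column is `input`, as noted above):

```python
import csv

# The example customer messages from the rows above.
rows = [
    "Why did my package disappear after tracking showed it was delivered?",
    "Your product smells like burnt rubber. What's going on?",
    "I've been waiting 3 weeks for a response from your team.",
    "Can I get a refund if I already opened the product?",
]

# Write a CSV with the single required "input" column. Optional columns
# such as expected outputs, metadata, or tags could be added as extra
# fieldnames in the same way.
with open("support_queries.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["input"])
    writer.writeheader()
    for message in rows:
        writer.writerow({"input": message})
```

The resulting file uploads directly via the same Upload dataset flow.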

Step 3: Test two personas in the playground

Open Playgrounds and set up your base task. Select a model (for example, GPT-4o mini) and define your system prompt.

You'll create two personas to compare:

Persona 1: Empathetic agent

You are a warm, empathetic customer support agent. Always acknowledge the
customer's feelings before addressing their issue. Use phrases like "I completely
understand how frustrating that must be" and "I'm so sorry you're dealing with
this." Be thorough in your response and make the customer feel heard.

Persona 2: Efficient agent

You are an efficient, no-nonsense customer support agent. Get straight to the
point. Provide the necessary information and next steps without filler. Be polite
but brief.

Run each persona against your dataset using the {{input}} template variable to inject each test case.
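Conceptually, the playground substitutes each dataset row into the {{input}} placeholder before the prompt is sent to the model. A rough sketch of that substitution (the persona texts are abbreviated, and this `render` helper is illustrative, not Braintrust's actual templating engine):

```python
# The two system prompts from above, abbreviated for illustration.
personas = {
    "empathetic": "You are a warm, empathetic customer support agent...",
    "efficient": "You are an efficient, no-nonsense customer support agent...",
}

# A user-message template using the same {{input}} placeholder syntax.
user_template = "Customer message: {{input}}"

def render(template: str, variables: dict) -> str:
    """Replace each {{name}} placeholder with its value."""
    for name, value in variables.items():
        template = template.replace("{{" + name + "}}", value)
    return template

# For each persona, every dataset row is rendered into the template like this:
rendered = render(
    user_template,
    {"input": "Can I get a refund if I already opened the product?"},
)
```

Each persona therefore sees the identical rendered user message; only the system prompt differs, which is what makes the comparison fair.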

Step 4: Create a scorer

Select Create scorer and configure an LLM-as-judge scorer. Name it "Brand Alignment" and select GPT-4o mini as the judge model.

Define your scoring prompt:

You are evaluating a customer support response.

Customer message: {{input}}
Assistant response: {{output}}

Rate the response on:
- Helpfulness and clarity
- Brand voice consistency
- Policy compliance
- Tone and appropriateness

Set the output type to a score from 0 to 1, with choice scores:

  • A (score: 1) for responses that meet all criteria
  • B (score: 0.5) for partial alignment
  • C (score: 0) for responses that miss the mark

Enable chain of thought (CoT) so the judge explains its reasoning.

Step 5: Run and save experiments

Run each persona against the dataset with the Brand Alignment scorer applied. Once the results come in, save each run as an experiment. This creates a permanent snapshot of the results, including all inputs, outputs, scores, and metadata.

Saving experiments matters because you never have to re-run an eval to revisit its results: the snapshot is preserved, and you can return to it and compare it against later runs at any time.

What's next

In the next lesson, you'll open the comparison view to analyze your two experiments side by side, looking at scores, token costs, and chain-of-thought reasoning to decide which persona to ship.

Further reading

Trace everything