Learn everything you need to know about evals by building and monitoring a customer support chatbot from scratch.
Rebuild the same eval in Python using the Braintrust SDK. Define task functions, create an LLMClassifier scorer, and run experiments with Eval().
All the assets for this module are available at braintrustdata/eval-101-course/module-06.
The UI is a great way to get started with evals, but most teams eventually want to run their evals in code. Code gives you version control for your prompts and scorers, lets you run experiments programmatically, and lets you integrate evals into your CI/CD pipeline.
In this lesson, you'll rebuild the same customer support eval from module 3 using the Braintrust Python SDK.
Install the Braintrust SDK and the autoevals library, then set your API key:
pip install braintrust autoevals openai
export BRAINTRUST_API_KEY=your-api-key
Braintrust provides wrap_openai, a drop-in wrapper around the OpenAI client that automatically logs all LLM calls to Braintrust. Every prompt, response, token count, and latency measurement gets captured without changing your application code.
import openai
from braintrust import wrap_openai
client = wrap_openai(openai.OpenAI())
From this point on, use client everywhere you'd normally use the OpenAI client. Braintrust handles the logging.
The task function takes a test case from your dataset and returns the model's output. Braintrust calls this function for every row in your dataset.
def task(input):
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": "You are an efficient, no-nonsense customer support agent. "
                "Get straight to the point. Provide the necessary information "
                "and next steps without filler. Be polite but brief.",
            },
            {"role": "user", "content": input},
        ],
    )
    return response.choices[0].message.content
To test a different persona, swap the system prompt. To compare providers, swap the model.
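One way to make those swaps systematic is a small factory that binds a model and persona into its own task function. This is a sketch, not part of the course assets: the persona names and the second system prompt are invented for illustration.

```python
# Hypothetical personas to compare; only "efficient" comes from this lesson.
PERSONAS = {
    "efficient": (
        "You are an efficient, no-nonsense customer support agent. "
        "Get straight to the point. Be polite but brief."
    ),
    "friendly": (
        "You are a warm, upbeat customer support agent. "
        "Empathize first, then walk the customer through next steps."
    ),
}

def make_task(client, model, persona):
    """Bind one (model, persona) combination into a task function for Eval()."""
    system_prompt = PERSONAS[persona]

    def task(input):
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": input},
            ],
        )
        return response.choices[0].message.content

    return task
```

You'd pass in the wrapped client, e.g. task=make_task(client, "gpt-4o-mini", "friendly"), and run one experiment per combination you want to compare.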
The autoevals library includes LLMClassifier, which lets you define LLM-as-judge scorers in code. This is the code equivalent of the Brand Alignment scorer you built in the UI.
from autoevals import LLMClassifier
brand_alignment = LLMClassifier(
    name="Brand Alignment",
    prompt_template="""You are evaluating a customer support response.
Customer message: {{input}}
Assistant response: {{output}}
Rate the response on helpfulness, brand voice, policy compliance, and tone.
Answer A if it fully meets the bar, B if it partially meets it, and C if it misses it.""",
    choice_scores={"A": 1, "B": 0.5, "C": 0},
    use_cot=True,
)
Setting use_cot=True enables chain-of-thought reasoning, so the judge explains why it gave a particular score.
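Because autoevals scorers are plain callables, you can spot-check the judge on a single example before wiring it into an experiment. A minimal helper, written here as a sketch: the scorer is passed in as a parameter, and the exact shape of the metadata (where the chain-of-thought rationale lands) may vary across autoevals versions.

```python
def spot_check(scorer, question, answer):
    """Run one (input, output) pair through an autoevals scorer.

    The keyword arguments fill the {{input}} and {{output}} slots in the
    scorer's prompt template. Score objects carry a numeric .score; with
    use_cot=True the judge's reasoning typically appears in .metadata.
    """
    result = scorer(input=question, output=answer)
    return result.score, result.metadata
```

For example, spot_check(brand_alignment, "How do I reset my password?", "Go to Settings > Security and click Reset password.") lets you eyeball one score and rationale before committing to a full run.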
The Eval() function ties everything together: dataset, task, and scorer.
from braintrust import Eval
Eval(
    "Customer Support Chatbot",
    data=lambda: [
        {"input": "How do I reset my password?"},
        {"input": "My order never arrived."},
        {"input": "Can I get a refund?"},
        {"input": "Your app keeps crashing on iOS."},
    ],
    task=task,
    scores=[brand_alignment],
)
When you run this script, Braintrust creates an experiment, runs every test case through your task function, scores each output, and uploads the results. You can then view them in the Braintrust UI alongside your UI-based experiments from earlier modules.
After the eval finishes, Braintrust prints a link to the experiment in your terminal. Open it to see the full results table with inputs, outputs, scores, and chain-of-thought reasoning.
You can run the same script multiple times with different system prompts or models. Each run creates a new experiment that you can compare against previous ones, just like you did in the UI.
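One way to script that comparison is to loop over configurations and give each run its own experiment_name (a real Eval() parameter). The config values below are illustrative, and run_sweep is a hypothetical helper, not part of the course assets.

```python
# Example configurations for a model comparison; the names are made up.
CONFIGS = [
    {"model": "gpt-4o-mini", "experiment_name": "support-gpt-4o-mini"},
    {"model": "gpt-4o", "experiment_name": "support-gpt-4o"},
]

def run_sweep(data, scorers):
    """Run one Braintrust experiment per config so runs are easy to compare."""
    import openai
    from braintrust import Eval, wrap_openai  # imported here to keep the sketch self-contained

    client = wrap_openai(openai.OpenAI())
    for cfg in CONFIGS:
        def task(input, model=cfg["model"]):  # default arg pins the model per run
            response = client.chat.completions.create(
                model=model,
                messages=[
                    {"role": "system", "content": "You are an efficient, no-nonsense customer support agent."},
                    {"role": "user", "content": input},
                ],
            )
            return response.choices[0].message.content

        Eval(
            "Customer Support Chatbot",
            data=data,
            task=task,
            scores=scorers,
            experiment_name=cfg["experiment_name"],
        )
```

Each named experiment then shows up separately in the Braintrust UI, so the diff view can line them up side by side.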
In the next lesson, you'll learn about non-determinism in AI evals and how to handle it. Running the same eval twice doesn't always produce the same scores, and understanding why is important before you start making decisions based on your results.