Simple eval using the SDK

Rebuild the same eval in Python using the Braintrust SDK. Define task functions, create an LLMClassifier scorer, and run experiments with Eval().

All the assets for this module are available at braintrustdata/eval-101-course/module-06.

Why move to code

The UI is a great way to get started with evals, but most teams eventually want to run their evals in code. Code gives you version control for your prompts and scorers, lets you run experiments programmatically, and makes it possible to wire evals into your CI/CD pipeline.

In this lesson, you'll rebuild the same customer support eval from module 3 using the Braintrust Python SDK.

Setting up

Install the Braintrust SDK and the autoevals library, then set your API key:

bash
pip install braintrust autoevals openai
export BRAINTRUST_API_KEY=your-api-key

Wrapping your LLM client

Braintrust provides wrap_openai, a drop-in wrapper around the OpenAI client that automatically logs all LLM calls to Braintrust. Every prompt, response, token count, and latency measurement gets captured without changing your application code.

python
import openai
from braintrust import wrap_openai

client = wrap_openai(openai.OpenAI())

From this point on, use client everywhere you'd normally use the OpenAI client. Braintrust handles the logging.

Defining the task

The task function takes a test case from your dataset and returns the model's output. Braintrust calls this function for every row in your dataset.

python
def task(input):
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": "You are an efficient, no-nonsense customer support agent. "
                "Get straight to the point. Provide the necessary information "
                "and next steps without filler. Be polite but brief.",
            },
            {"role": "user", "content": input},
        ],
    )
    return response.choices[0].message.content

To test a different persona, swap the system prompt. To compare providers, swap the model.
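One way to make those swaps systematic is a small factory that builds a task function for a given model and persona. This is an illustrative sketch, not part of the SDK; the factory takes the client as a parameter so each variant shares the wrapped client from earlier.

```python
def make_task(client, model, system_prompt):
    """Build a task function for one persona/model combination."""
    def task(input):
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": input},
            ],
        )
        return response.choices[0].message.content
    return task
```

Pass the wrapped client from earlier, e.g. `brief_task = make_task(client, "gpt-4o-mini", "Be polite but brief.")`, and hand each variant to `Eval()` as its `task` to compare personas or providers side by side.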

Scoring with LLMClassifier

The autoevals library includes LLMClassifier, which lets you define LLM-as-judge scorers in code. This is the code equivalent of the Brand Alignment scorer you built in the UI.

python
from autoevals import LLMClassifier

brand_alignment = LLMClassifier(
    name="Brand Alignment",
    prompt_template="""You are evaluating a customer support response.

Customer message: {{input}}
Assistant response: {{output}}

Grade the response on helpfulness, brand voice, policy compliance, and tone:
A: Meets the bar on all four
B: Acceptable overall, but weak on one or two
C: Unhelpful, off-brand, or non-compliant

Select one choice.""",
    choice_scores={"A": 1, "B": 0.5, "C": 0},
    use_cot=True,
)

Setting use_cot=True enables chain-of-thought reasoning, so the judge explains why it gave a particular score.
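The `{{input}}` and `{{output}}` placeholders are mustache-style template variables that autoevals fills in from each row before the judge sees the prompt. A minimal sketch of that substitution (illustrative only, not the autoevals internals):

```python
def render(template, **fields):
    # Naive placeholder substitution; autoevals uses a real template
    # engine, but the per-row effect is the same.
    for key, value in fields.items():
        template = template.replace("{{" + key + "}}", value)
    return template

prompt = render(
    "Customer message: {{input}}\nAssistant response: {{output}}",
    input="Can I get a refund?",
    output="Yes. Refunds are available within 30 days of purchase.",
)
```

So the judge never sees the raw template — each test case produces its own fully rendered prompt.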

Running the eval

The Eval() function ties everything together: dataset, task, and scorer.

python
from braintrust import Eval

Eval(
    "Customer Support Chatbot",
    data=lambda: [
        {"input": "How do I reset my password?"},
        {"input": "My order never arrived."},
        {"input": "Can I get a refund?"},
        {"input": "Your app keeps crashing on iOS."},
    ],
    task=task,
    scores=[brand_alignment],
)

When you run this script, Braintrust creates an experiment, runs every test case through your task function, scores each output, and uploads the results. You can then view them in the Braintrust UI alongside your UI-based experiments from earlier modules.

Viewing results

After the eval finishes, Braintrust prints a link to the experiment in your terminal. Open it to see the full results table with inputs, outputs, scores, and chain-of-thought reasoning.

You can run the same script multiple times with different system prompts or models. Each run creates a new experiment that you can compare against previous ones, just like you did in the UI.
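To keep those runs easy to tell apart later, you can tag each one. This sketch assumes the `experiment_name` and `metadata` keyword arguments of `Eval()`; the model list and naming scheme are illustrative.

```python
models = ["gpt-4o-mini", "gpt-4o"]  # illustrative candidates to compare

# Build one configuration per model; each dict is meant to be splatted
# into an Eval() call, e.g. Eval("Customer Support Chatbot", ..., **cfg).
run_configs = [
    {
        "experiment_name": f"support-bot-{model}",
        "metadata": {"model": model, "persona": "no-nonsense"},
    }
    for model in models
]
```

Naming experiments and attaching metadata this way makes the comparison view in the UI much easier to read than a list of auto-generated run names.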

What's next

In the next lesson, you'll learn about non-determinism in AI evals and how to handle it. Running the same eval twice doesn't always produce the same scores, and understanding why is important before you start making decisions based on your results.

Further reading

Trace everything