This quickstart shows you how to set up and run evals in a Braintrust experiment to measure your AI application’s effectiveness and iterate continuously using production data. You can create evals with the Braintrust SDK or directly in the Braintrust UI.
Set up your environment and create an eval with the Braintrust SDK. Wrappers are available for TypeScript, Python, and other languages.

1. Install Braintrust libraries

Install the Braintrust SDK and autoevals library for your language:
# npm
npm install braintrust autoevals
# pnpm
pnpm add braintrust autoevals
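If you're working in Python instead, the same two libraries can be installed with pip (the package names are assumed to match the npm packages):
# pip (Python)
pip install braintrust autoevals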

2. Configure an API key

You need a Braintrust API key to authenticate your evaluation. Create an API key in the Braintrust UI and then add the key to your environment:
export BRAINTRUST_API_KEY="YOUR_API_KEY"

3. Run an evaluation

A Braintrust evaluation is a simple function composed of a dataset of user inputs, a task, and a set of scorers.
In addition to adding each data point inline when you call the Eval() function, you can also pass an existing or new dataset directly.
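For example, here is a minimal sketch of passing a dataset instead of inline rows. It assumes a dataset named "my-dataset" already exists in your project and uses initDataset to load it (check the SDK reference for the exact signature):
import { Eval, initDataset } from "braintrust";
import { Levenshtein } from "autoevals";

Eval("Say Hi Bot", {
  // Load rows from an existing dataset instead of defining them inline.
  data: initDataset("Say Hi Bot", { dataset: "my-dataset" }),
  task: async (input) => {
    return "Hi " + input; // Replace with your LLM call
  },
  scores: [Levenshtein],
});
The rest of this quickstart uses inline data so the script stays self-contained.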
Create an evaluation script:
tutorial.eval.ts
import { Eval } from "braintrust";
import { Levenshtein } from "autoevals";

Eval(
  "Say Hi Bot", // Replace with your project name
  {
    data: () => {
      return [
        {
          input: "Foo",
          expected: "Hi Foo",
        },
        {
          input: "Bar",
          expected: "Hello Bar",
        },
      ]; // Replace with your eval dataset
    },
    task: async (input) => {
      return "Hi " + input; // Replace with your LLM call
    },
    scores: [Levenshtein],
  },
);
Run your evaluation:
npx braintrust eval tutorial.eval.ts
This will create an experiment in Braintrust. Once the command runs, you’ll see a link to your experiment.
To test your evaluation locally without sending results to Braintrust, add the --no-send-logs flag.
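For example, appending the flag to the command from the previous step:
npx braintrust eval tutorial.eval.ts --no-send-logs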

4. View your results

Congrats, you just ran an eval! You should see a dashboard like this when you load your experiment. This view is called the experiment view, and as you use Braintrust, we hope it becomes your trusty companion each time you change your code and want to run an eval. The experiment view lets you look at high-level performance metrics, dig into individual examples, and compare your LLM app’s performance over time.

[Screenshot: first eval in the experiment view]

5. Run another experiment

After running your first evaluation, you’ll see that we achieved a 77.8% score. Can you adjust the evaluation to improve this score? Make your changes and re-run the evaluation to track your progress.

[Screenshot: second eval in the experiment view]
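One common change is to swap the hard-coded task for a real LLM call. Here is a minimal sketch using the OpenAI client wrapped with Braintrust's wrapOpenAI so each call is traced in the experiment; it assumes the openai package is installed and OPENAI_API_KEY is set, and the model and prompt shown are placeholders rather than part of this quickstart:
import { Eval, wrapOpenAI } from "braintrust";
import { Levenshtein } from "autoevals";
import OpenAI from "openai";

// Wrapping the client records each LLM call as part of the experiment trace.
const client = wrapOpenAI(new OpenAI());

Eval("Say Hi Bot", {
  data: () => [
    { input: "Foo", expected: "Hi Foo" },
    { input: "Bar", expected: "Hello Bar" },
  ],
  task: async (input) => {
    const completion = await client.chat.completions.create({
      model: "gpt-4o-mini", // placeholder; use any model you have access to
      messages: [
        {
          role: "user",
          content: `Greet the user named ${input} in two words.`,
        },
      ],
    });
    return completion.choices[0].message.content ?? "";
  },
  scores: [Levenshtein],
});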

Next steps