Evaluations let you systematically measure AI quality. Compare approaches, catch regressions before deployment, and validate improvements with data instead of intuition. Each evaluation consists of three components:
  • Data - A dataset of test cases with inputs and expected outputs
  • Task - An AI function you want to test
  • Scores - Scoring functions that measure output quality
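These three components map directly onto a single Eval call in the SDK. A minimal TypeScript skeleton (the full, runnable example comes in step 4 of the quickstart):
import { Eval } from "braintrust";
import { ExactMatch } from "autoevals";

Eval("My project", {
  data: [{ input: "a test input", expected: "the expected output" }], // Data
  task: async (input) => "the AI output to evaluate",                 // Task
  scores: [ExactMatch],                                               // Scores
});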
Set up your environment and run evals with the Braintrust SDK.

1. Sign up

If you’re new to Braintrust, sign up free at braintrust.dev.

2. Get API keys

Create API keys for Braintrust and your model provider, then set them as environment variables:
export BRAINTRUST_API_KEY="<your-braintrust-api-key>"
export OPENAI_API_KEY="<your-openai-api-key>" # or ANTHROPIC_API_KEY, GEMINI_API_KEY, etc.
This quickstart uses OpenAI. For other providers, see Integrations.

3. Install SDKs

Install the Braintrust SDK and required libraries:
# pnpm
pnpm add braintrust openai autoevals ts-node
# npm
npm install braintrust openai autoevals ts-node

4. Run an eval

Build an evaluation that identifies movies from plot descriptions. You’ll define a dataset with movie plot descriptions as inputs and expected titles as outputs, write a task function with a prompt to identify movies, and use a scorer to measure accuracy.

Step 1: Set your project

Set your project name as an environment variable:
export BRAINTRUST_DEFAULT_PROJECT_NAME="Evaluation quickstart"

Step 2: Write your evaluation

Create an evaluation that defines your dataset, task, and scorer. Python and TypeScript can use the built-in ExactMatch scorer from autoevals; other languages can use an equivalent code-based scorer:
movie-matcher.eval.ts
import { Eval } from "braintrust";
import { ExactMatch } from "autoevals";
import OpenAI from "openai";

const client = new OpenAI();

Eval("Movie matcher", {
  // Data: Test cases with inputs and expected outputs
  data: [
    {
      input: "A detective investigates a series of murders based on the seven deadly sins.",
      expected: "Se7en",
    },
    {
      input: "A thief who steals corporate secrets through the use of dream-sharing technology is given the inverse task of planting an idea into the mind of a C.E.O.",
      expected: "Inception",
    },
    {
      input: "A computer hacker learns from mysterious rebels about the true nature of his reality and his role in the war against its controllers.",
      expected: "The Matrix",
    },
    {
      input: "A cowboy doll is profoundly threatened and jealous when a new spaceman figure supplants him as top toy in a boy's room.",
      expected: "Toy Story",
    },
    {
      input: "An orphaned boy discovers he's a wizard on his 11th birthday when Hagrid escorts him to magic-teaching Hogwarts School.",
      expected: "Harry Potter and the Sorcerer's Stone",
    },
  ],

  // Task: The function being evaluated
  task: async (input) => {
    const response = await client.responses.create({
      model: "gpt-5-mini",
      input: [
        {
          role: "system",
          content: "Based on the following description, identify the movie."
        },
        { role: "user", content: input }
      ],
    });
    return response.output_text;
  },

  // Scores: Metrics to measure quality
  scores: [ExactMatch],
});

Step 3: Run the evaluation

Run your evaluation:
npx braintrust eval movie-matcher.eval.ts
This creates an experiment, a permanent record of how your task performed on the dataset. Each experiment captures inputs, outputs, scores, and metadata, making it easy to compare different versions of your prompts or models.
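If you run several experiments, you can make them easier to tell apart. A sketch, assuming the TypeScript SDK accepts experimentName and metadata options on Eval (check the SDK reference for your version):
Eval("Movie matcher", {
  data,  // the dataset defined above
  task,  // the task function defined above
  scores: [ExactMatch],
  // Assumed options: a human-readable run label and free-form metadata
  // recorded on the experiment for later comparison.
  experimentName: "baseline-prompt",
  metadata: { model: "gpt-5-mini", promptVersion: 1 },
});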

Step 4: View results

You’ll see a link to your experiment in the terminal output. Click the link to view your evaluation results, or go to Experiments in the “Evaluation quickstart” project in the Braintrust UI.

5. Iterate

You might notice that some scores are 0%. This is because the ExactMatch scorer requires outputs to exactly match the expected value. For example, if the AI returns “The movie is Se7en” instead of “Se7en”, or uses the UK title “Harry Potter and the Philosopher’s Stone” instead of the expected US title “Harry Potter and the Sorcerer’s Stone”, the score for that case is 0%.

Let’s improve the prompt so the model returns only the US-based movie title, then create a second experiment.

Step 1: Update your evaluation

In your eval code, change the prompt to:
Identify the movie from the description. Return only the movie title, with no additional text or explanation. Always use the US-based title.
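In movie-matcher.eval.ts, that means updating the system message in the task function; everything else stays the same:
task: async (input) => {
  const response = await client.responses.create({
    model: "gpt-5-mini",
    input: [
      {
        role: "system",
        content:
          "Identify the movie from the description. Return only the movie title, " +
          "with no additional text or explanation. Always use the US-based title.",
      },
      { role: "user", content: input },
    ],
  });
  return response.output_text;
},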

Step 2: Run the evaluation

Run the improved evaluation:
npx braintrust eval movie-matcher.eval.ts

Step 3: View results

Click the link to your new experiment in the terminal output. The improved prompt should score higher because the model now returns just the movie title. In the Braintrust UI, you can compare this experiment with your first one to see the improvement.
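Prompt changes fix most mismatches here, but you can also score more leniently with a custom code-based scorer. A minimal sketch (fuzzyTitleMatch and its normalization rules are illustrative, not a Braintrust built-in):
// Hypothetical lenient scorer: case- and punctuation-insensitive, and
// accepts outputs that merely contain the expected title.
const fuzzyTitleMatch = ({ output, expected }: { output: string; expected?: string }) => {
  const normalize = (s: string) => s.toLowerCase().replace(/[^a-z0-9 ]/g, "").trim();
  return {
    name: "FuzzyTitleMatch",
    score: expected && normalize(output).includes(normalize(expected)) ? 1 : 0,
  };
};

// Pass it alongside (or instead of) ExactMatch:
// scores: [ExactMatch, fuzzyTitleMatch],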

Troubleshoot

Dataset not found

If your eval loads a stored dataset with initDataset, verify the dataset name matches exactly what you see in the Braintrust UI:
// Requires: import { initDataset } from "braintrust";
// Make sure this matches your dataset name
data: initDataset("Movie matcher", {
  dataset: "Movie matcher dataset", // Check this name in the UI
}),
Go to Datasets in your Braintrust project and confirm the dataset name.
Missing packages

Install all required packages:
pnpm add braintrust openai autoevals
Missing API keys

Check your environment variables:
echo $BRAINTRUST_API_KEY
echo $OPENAI_API_KEY
Both should return values. If empty, set them:
export BRAINTRUST_API_KEY="your-braintrust-key"
export OPENAI_API_KEY="your-openai-key"
Get your Braintrust API key from Settings > API Keys.
Can’t find your experiment

Check your terminal output for the experiment link after running braintrust eval. Click it to navigate directly to the experiment. If you don’t see a link:
  • Check for error messages in terminal output
  • Verify network connectivity
  • Ensure you’re viewing the correct project (“Evaluation quickstart”)

Next steps