Playgrounds use the same eval structure as experiments, with support for running thousands of dataset rows directly in the browser. Collaborating with teammates is also simple with a shared URL.
Playgrounds are designed for quick prototyping of ideas. When a playground is run, its previous generations are overwritten. You can create experiments from playgrounds when you need to capture an immutable snapshot of your evaluations for long-term reference or point-in-time comparison.
You can try the playground without signing up. Any work you do in a demo playground will be saved if you make an account.
Create a playground
A playground includes one or more evaluation tasks, one or more scorers, and optionally, a dataset. You can create a playground by navigating to Evaluations > Playgrounds, or by selecting Create playground with prompt at the bottom of a prompt dialog.
Tasks
Tasks define LLM instructions. There are four types of tasks:
- Prompts: AI model, prompt messages, parameters, and tools.
- Agents: A chain of prompts.
- Remote evals: Prompts and scorers from external sources.
- Scorers: Prompts or heuristics used to evaluate the output of LLMs. Running scorers as tasks is useful to validate and iterate on them.
Note the difference between scorers-as-tasks and scorers used to evaluate tasks. You can even score your scorers-as-tasks in the playground.

AI providers must be configured before playgrounds can be run.
Scorers
Scorers quantify the quality of evaluation outputs using an LLM judge or code. You can use built-in autoevals for common evaluation scenarios to help you get started quickly, or write custom scorers tailored to your use case. To add a scorer, select + Scorer and choose from the list or create a custom scorer.
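If you prefer a code-based scorer, the sketch below shows the general idea using the autoevals Python package. The function name and the choice of Levenshtein string similarity are illustrative, and the exact handler signature expected by the playground may differ:

```python
# Minimal sketch of a code-based scorer built on the autoevals package
# (pip install autoevals). Function name and metric choice are illustrative.
from autoevals.string import Levenshtein

def string_similarity(output: str, expected: str) -> float:
    """Return a 0-1 score for how closely the output matches the expected value."""
    result = Levenshtein()(output=output, expected=expected)
    return result.score

# Example: a near-exact match scores close to 1.0.
print(string_similarity("The answer is 42", "The answer is 42."))
```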
Datasets
Datasets provide structured inputs, expected values, and metadata for evaluations. A playground can be run without a dataset to view a single set of task outputs, or with a dataset to view a matrix of outputs across many inputs. To link a dataset to a playground, select an existing library dataset, or create or import a new one. Once you link a dataset, the grid shows a new row for each record in the dataset. You can reference the data from each record in your prompt using the input, expected, and metadata variables. The playground uses mustache syntax for templating, for example {{input.formula}}. If you want to preserve double curly brackets {{ and }} as plain text in your prompts, you can change the delimiter tags to any custom string of your choosing. For example, to change the tags to <% and %>, insert {{=<% %>=}} into the message; all strings below it in the message block will respect these delimiters.
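A sketch of a prompt message that switches delimiters partway through (the formula field and the wording are illustrative):

```
Compute the molar mass of {{input.formula}}.
{{=<% %>=}}
Return the result as JSON wrapped in {{ and }}, echoing the formula
<% input.formula %> back verbatim.
```

Everything before the {{=<% %>=}} tag uses the default delimiters; everything after it treats {{ and }} as plain text and <% %> as the template delimiters.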
Dataset edits in playgrounds edit the original dataset.
For scorers-as-tasks
When evaluating scorers in the playground, make sure your dataset's input schema follows the scorer convention. As when a scorer is used on a prompt or agent, the input to the scorer should have the shape { input, expected, metadata, output }.
Unlike with other task types, these reserved dataset keywords are hoisted into the global scope, so you can use your saved scorers in the playground and reference their variables without any changes.
For example, you can tune a scorer with a prompt like the one below.
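A minimal sketch of an LLM-judge scorer prompt; the wording is illustrative, and the variables come from the hoisted fields described above:

```
You are grading a model's answer.

Question: {{input}}
Reference answer: {{expected}}
Candidate answer: {{output}}

Score how well the candidate answer matches the reference answer on a
scale from 0 to 1, and give a one-sentence rationale.
```

A matching dataset row supplies those fields inside its input (shown here as a Python dict; the values are made up):

```python
# Illustrative dataset row for a scorer-as-task: the row's `input`
# carries the conventional input / expected / metadata / output fields.
row = {
    "input": {
        "input": "What is the boiling point of water at sea level?",
        "expected": "100 degrees Celsius",
        "metadata": {"topic": "chemistry"},
        "output": "Water boils at 100 °C at sea level.",
    }
}
```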
Run a playground
Select the Run button at the top of the playground to run all tasks against all dataset rows. You can also run a single task, or a single dataset row, individually.
View traces
Select a row in the results table to compare evaluation traces side-by-side. This allows you to identify differences in outputs, scores, metrics, and input data.
Diffing
Diffing allows you to visually compare variations across models, prompts, or agents to quickly understand differences in outputs. To turn on diff mode, select the diff toggle.
Create experiment snapshots
Experiments formalize evaluation results for comparison and historical reference. While playgrounds are better for fast, iterative exploration, experiments are immutable, point-in-time evaluation snapshots ideal for detailed analysis and reporting. To create an experiment from a playground, select + Experiment. Each playground task will map to its own experiment.
Advanced options
Append dataset messages
You may sometimes have additional messages in a dataset that you want to append to a prompt. This option lets you specify a path to a messages array in the dataset. For example, if input is specified as the appended messages path and a dataset row has the following input, all prompts in the playground will run with additional messages.
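An illustrative dataset row (shown here as a Python dict; the conversation is made up) whose input holds a messages array:

```python
# Illustrative dataset row: with "input" set as the appended-messages path,
# these messages are appended to every prompt in the playground run.
row = {
    "input": [
        {"role": "user", "content": "What is the chemical formula for water?"},
        {"role": "assistant", "content": "H2O"},
        {"role": "user", "content": "And for table salt?"},
    ],
    "expected": "NaCl",
}
```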

Max concurrency
The maximum number of tasks/scorers that will be run concurrently in the playground. This is useful for avoiding rate limits (429 - Too many requests) from AI providers.
Strict variables
When this option is enabled, evaluations will fail if the dataset row does not include all of the variables referenced in prompts.
Collaboration
Playgrounds are designed for collaboration and automatically synchronize in real-time. To share a playground, copy the URL and send it to your collaborators. Your collaborators must be members of your organization to view the playground. You can invite users from the settings page.
Reasoning
If you are on a hybrid deployment, reasoning support is available starting with v0.0.74.
- Select a reasoning-capable model (like claude-3-7-sonnet-latest, o4-mini, or publishers/google/models/gemini-2.5-flash-preview-04-17 (Gemini provided by Vertex AI))
- In the model parameters section, configure your reasoning settings:
  - Set reasoning_effort to low, medium, or high
  - Or enable reasoning_enabled and specify a reasoning_budget
- Run your prompt to see reasoning in action
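Expressed as key-value pairs (shown here as Python dicts; the names mirror the options above, and the budget value is made up), the two configurations look roughly like this:

```python
# Two alternative ways to configure reasoning, mirroring the parameters above.
# The exact payload shape in your deployment may differ; values are illustrative.
effort_params = {"reasoning_effort": "medium"}  # one of "low", "medium", "high"
budget_params = {"reasoning_enabled": True, "reasoning_budget": 2048}  # token budget (illustrative)
```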
