Eval playgrounds

Playgrounds are a powerful workspace for rapidly iterating on AI engineering primitives. Tune prompts, models, scorers, and datasets in an editor-like interface, and run full evaluations in real time, side by side.

Use playgrounds to build and test hypotheses and evaluation configurations in a flexible environment. Playgrounds leverage the same underlying Eval structure as experiments, with support for running thousands of dataset rows directly in the browser. Collaborating with teammates is also simple with a shared URL.

Playgrounds are designed for quick prototyping of ideas. When a playground is run, its previous generations are overwritten. You can create experiments from playgrounds when you need to capture an immutable snapshot of your evaluations for long-term reference or point-in-time comparison.

Creating a playground

A playground includes one or more evaluation tasks, one or more scorers, and optionally, a dataset.

You can create a playground by navigating to Evaluations > Playgrounds, or by selecting Create playground with prompt at the bottom of a prompt dialog.

Empty Playground

Tasks

Tasks define LLM instructions. There are three types of tasks:

  • Prompts: AI model, prompt messages, parameters, and tools (see the sketch after this list).

  • Agents: A chain of prompts.

  • Remote evals: Prompts and scorers from external sources.
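
For example, a prompt task conceptually bundles a model with a list of role/content messages, plus optional parameters and tools. The sketch below is a hypothetical illustration (the playground editor presents these as form fields rather than raw JSON, and the field names here are illustrative); {{input}} is filled in from the dataset at run time.

{
  "model": "gpt-4o",
  "messages": [
    { "role": "system", "content": "You are a concise math tutor." },
    { "role": "user", "content": "{{input}}" }
  ],
  "params": { "temperature": 0 }
}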

AI providers must be configured before playgrounds can be run.

An empty playground will prompt you to create a base task and optional comparison tasks. The base task is used as the source when diffing output traces.

Base task empty playground

When you select Run (or press the keyboard shortcut Cmd/Ctrl+Enter), each task runs in parallel and the results stream into the grid below. You can also switch to the list or summary layout.

For multimodal workflows, supported attachments show a preview in the inline embedded view.

Scorers

Scorers quantify the quality of evaluation outputs using an LLM judge or code. You can use built-in autoevals for common evaluation scenarios to help you get started quickly, or write custom scorers tailored to your use case.

To add a scorer, select + Scorer and choose from the list or create a custom scorer.

Add scorer
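
As a rough sketch of what a code-based custom scorer can look like, the following TypeScript compares the task output to the dataset's expected value and returns a score between 0 and 1. The function name and argument shape here are illustrative assumptions, not the exact playground contract.

// Minimal custom scorer sketch (names and argument shape are assumptions):
// it receives the task output and the dataset's expected value and returns
// a named score between 0 and 1.
function exactMatch({ output, expected }: { output: string; expected?: string }) {
  return {
    name: "exact_match",
    score: output === expected ? 1 : 0,
  };
}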

Datasets

Datasets provide structured inputs, expected values, and metadata for evaluations.

A playground can be run without a dataset to view a single set of task outputs, or with a dataset to view a matrix of outputs for many inputs.

Datasets can be linked to a playground by selecting an existing library dataset, or by creating or importing a new one.
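
For example, a dataset record might look like the following (a hypothetical record; the input, expected, and metadata fields can each hold arbitrary JSON):

{
  "input": { "formula": "2 + 2" },
  "expected": "4",
  "metadata": { "difficulty": "easy" }
}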

Once you link a dataset, you will see a new row in the grid for each record in the dataset. You can reference the data from each record in your prompt using the input, expected, and metadata variables. The playground uses mustache syntax for templating:

Prompt with dataset

Each value can be arbitrarily complex JSON, for example, {{input.formula}}. If you want to preserve double curly brackets {{ and }} as plain text in your prompts, you can change the delimiter tags to any custom strings of your choosing. For example, to change the tags to <% and %>, insert {{=<% %>=}} into the message; everything after that line in the message block will respect the new delimiters:

{{=<% %>=}}
Return the number in the following format: {{ number }}

<% input.formula %>
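
With a dataset row whose input.formula is "2 + 2" (as in the hypothetical record above), the message would render roughly as:

Return the number in the following format: {{ number }}

2 + 2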

Dataset edits made in a playground modify the original dataset.

Running a playground

To run all tasks against all dataset rows, select the Run button at the top of the playground. You can also run a single task or a single dataset row individually.

Viewing traces

Select a row in the results table to compare evaluation traces side-by-side. This allows you to identify differences in outputs, scores, metrics, and input data.

Trace viewer

From this view, you can also run a single row by selecting Run row.

Diffing

Diffing allows you to visually compare variations across models, prompts, or agents to quickly understand differences in outputs.

To turn on diff mode, select the diff toggle.

Creating experiment snapshots

Experiments formalize evaluation results for comparison and historical reference. While playgrounds are better for fast, iterative exploration, experiments are immutable, point-in-time evaluation snapshots ideal for detailed analysis and reporting.

To create an experiment from a playground, select + Experiment. Each playground task will map to its own experiment.

Advanced options

Appended dataset messages

You may sometimes have additional messages in a dataset that you want to append to a prompt. This option lets you specify a path to a messages array in the dataset. For example, if input is specified as the appended messages path and a dataset row has the following input, every prompt in the playground will run with these additional messages appended.

[
  {
    "role": "assistant",
    "content": "Is there anything else I can help you with?"
  },
  {
    "role": "user",
    "content": "Yes, I have another question."
  }
]

Max concurrency

The maximum number of tasks and scorers that run concurrently in the playground. This is useful for avoiding rate-limit errors (429 Too Many Requests) from AI providers.

Strict variables

When this option is enabled, evaluations will fail if the dataset row does not include all of the variables referenced in prompts.

Sharing playgrounds

Playgrounds are designed for collaboration and automatically synchronize in real-time.

To share a playground, copy the URL and send it to your collaborators. Your collaborators must be members of your organization to see the session. You can invite users from the settings page.
