- Agentic workflows: Multi-step agent flows or complex task logic that goes beyond a single prompt.
- Custom infrastructure: Access to internal APIs, databases, or services that can’t run in the cloud.
- Specific runtime environments: Custom dependencies, system libraries, or environment configurations.
- Security or compliance requirements: Data that must remain on your infrastructure.
- Long-running evaluations: Complex processing that exceeds typical execution timeouts.
If your evaluation can run in the Braintrust playground, you don’t need remote evals.
How it works
- Write an `Eval()` with parameters that define runtime configuration options.
- Run your eval locally with the `--dev` flag to expose an HTTP endpoint.
- Configure the endpoint URL in your Braintrust project settings.
- Use the remote eval in the playground. Parameters appear as UI controls.
- When you run the eval, Braintrust sends parameters to your endpoint and displays results.
Set up a remote eval
A remote eval looks like a standard `Eval()` call with a `parameters` field that defines configurable options. These parameters become UI controls in the playground. See Remote eval parameters for details on parameter types and syntax.
remote.eval.ts
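The sketch below illustrates this structure in TypeScript. Treat it as a minimal example rather than the definitive API: it assumes the `braintrust`, `zod`, and `autoevals` packages, an illustrative project name, and a stubbed-out model call.

```typescript
// remote.eval.ts -- minimal sketch of a remote eval with parameters.
// Project name, parameter names, and the inline dataset are illustrative.
import { Eval } from "braintrust";
import { Levenshtein } from "autoevals";
import { z } from "zod";

Eval("My project", {
  // Ignored when run remotely: the playground supplies the dataset.
  data: () => [{ input: "hello", expected: "hello world" }],
  parameters: {
    // A prompt parameter renders as a prompt editor in the playground UI.
    main: {
      type: "prompt",
      description: "The prompt used to answer each input",
      default: {
        messages: [{ role: "user", content: "Answer concisely: {{input}}" }],
      },
    },
    // Plain parameters use Zod schemas; .describe() supplies the UI help text.
    prefix: z.string().describe("Prefix added to every output").default(""),
  },
  task: async (input, { parameters }) => {
    // See the parameter-access sketch in the next section for using parameters.main.
    return `${parameters.prefix}${input}`;
  },
  scores: [Levenshtein],
});
```

When this file is run with the `--dev` flag (described below), the `parameters` block is what the playground turns into UI controls.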
Remote eval parameters
Parameters define runtime configuration that users can modify in the playground without changing code. They appear as form controls in the UI. The parameter system works the same way in both languages but uses different syntax:

| Feature | TypeScript | Python |
|---|---|---|
| Parameter types | `type: "prompt"` for LLM prompts; `z.string()`, `z.boolean()`, `z.number()`, `z.array()`, `z.object()` with `.describe()` | `type: "prompt"` for LLM prompts; dictionary with `type: "string"`, `"boolean"`, `"number"`, `"array"`, `"object"` |
| Type definition | Zod schemas with chained methods | Dictionary with `type`, `description`, `default` fields |
| Parameter access | Direct property access: `parameters.prefix` | Dictionary access: `parameters["prefix"]` or `parameters.get("prefix")` |
| Prompt parameters | `type: "prompt"` with `messages` array directly in `default` | `type: "prompt"` with nested `prompt.messages` and `options` objects |
| Prompt usage | `parameters.main.build({ input: value })` | `parameters["main"].build(input=value)` |
| Async handling | `async`/`await` with promises | `async`/`await` with coroutines |
In both languages, you access parameter values through the `parameters` object in your task function.
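As a TypeScript-flavored sketch of those access patterns, a task body might use both kinds of parameters as shown below. The OpenAI client, the loose `any` typing, and the assumption that `build()` returns a payload you can pass directly to the client are illustrative, not guaranteed by the API.

```typescript
// Sketch of a task function that reads both plain and prompt parameters.
import { OpenAI } from "openai";

const client = new OpenAI();

const task = async (input: string, { parameters }: { parameters: any }) => {
  // Plain parameters: direct property access in TypeScript.
  const prefix: string = parameters.prefix;

  // Prompt parameters: build() substitutes the runtime input into the template;
  // here we assume the result can be passed straight to the chat completions call.
  const completion = await client.chat.completions.create(
    parameters.main.build({ input })
  );

  return `${prefix}${completion.choices[0].message.content ?? ""}`;
};
```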
Expose a remote eval
To make your eval accessible to Braintrust, run it with the `--dev` flag to start a local server:
Run `npx braintrust eval path/to/eval.ts --dev` to start the dev server at `http://localhost:8300`. Two flags control where the server listens:

- `--dev-host DEV_HOST`: The host to bind the dev server to. Defaults to `localhost`. Set to `0.0.0.0` to bind to all interfaces (be cautious about security when exposing beyond localhost).
- `--dev-port DEV_PORT`: The port to bind the dev server to. Defaults to `8300`.
Configure remote eval sources
To add remote eval endpoints beyond localhost, configure them at the project level:

- In your project, go to Configuration > Remote evals.
- Select Remote eval source.
- Enter the name and URL of your remote eval server.
- Select Create remote eval source.
Run a remote eval from a playground
After exposing your eval and configuring it in your project, you can use it in any playground:

- In a playground, select Task.
- Select Remote eval from the task type list.
- Choose your eval from the available sources (localhost or configured remote URLs).
- Configure parameters using the UI controls that were defined in your `parameters` object.
- Run the evaluation.
Demo
This video walks through exposing a remote eval to Braintrust and using it in a playground.

Limitations
- The dataset defined in your remote eval is ignored. Datasets are managed through the playground.
- Scorers defined in remote evals are combined with any scorers configured in the playground; both sets run.
Next steps
- Use playgrounds to compare and analyze results.
- Write scorers to evaluate outputs.
- Run evaluations programmatically.