Remote evals

If you have existing infrastructure for running evaluations that isn't easily adaptable to the Braintrust Playground, you can use remote evals to expose that infrastructure through a remote endpoint. This lets you run the evaluations directly in the playground, iterate quickly across datasets, run scorers, and compare results with other tasks. You can also run multiple instances of your remote eval side by side with different parameters and compare the results. Parameters defined in the remote eval are exposed in the playground UI.

Remote evals are in beta. If you are on a hybrid deployment, remote evals are available starting with v0.0.66.

Expose a remote eval

To expose an Eval running on your local machine or at a remote URL, pass the --dev flag to the braintrust eval command. For example, given the file shown below, run npx braintrust eval parameters.eval.ts --dev to start the dev server and expose it at http://localhost:8300. The dev host and port can also be configured:

  • --dev-host DEV_HOST: The host to bind the dev server to. Defaults to localhost. Set to 0.0.0.0 to bind to all interfaces.
  • --dev-port DEV_PORT: The port to bind the dev server to. Defaults to 8300.
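To bind the dev server to all interfaces on a non-default port, you can combine these flags (the port value here is only illustrative):

npx braintrust eval parameters.eval.ts --dev --dev-host 0.0.0.0 --dev-port 8400

The example file looks like this: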
import { Levenshtein } from "autoevals";
import { Eval, initDataset, wrapOpenAI } from "braintrust";
import OpenAI from "openai";
import { z } from "zod";
 
const client = wrapOpenAI(new OpenAI());
 
Eval("Simple eval", {
  data: initDataset("local dev", { dataset: "sanity" }), // Datasets are currently ignored
  task: async (input, { parameters }) => {
    const completion = await client.chat.completions.create(
      parameters.main.build({
        input: `${parameters.prefix}:${input}`,
      }),
    );
    return completion.choices[0].message.content ?? "";
  },
  // These scores will be used along with any that you configure in the UI
  scores: [Levenshtein],
  parameters: {
    main: {
      type: "prompt",
      name: "Main prompt",
      description: "This is the main prompt",
      default: {
        messages: [
          {
            role: "user",
            content: "{{input}}",
          },
        ],
        model: "gpt-4o",
      },
    },
    another: {
      type: "prompt",
      name: "Another prompt",
      description: "This is another prompt",
      default: {
        messages: [
          {
            role: "user",
            content: "{{input}}",
          },
        ],
        model: "gpt-4o",
      },
    },
    include_prefix: z
      .boolean()
      .default(false)
      .describe("Include a contextual prefix"),
    prefix: z
      .string()
      .describe("The prefix to include")
      .default("this is a math problem"),
    array_of_objects: z
      .array(
        z.object({
          name: z.string(),
          age: z.number(),
        }),
      )
      .default([
        { name: "John", age: 30 },
        { name: "Jane", age: 25 },
      ]),
  },
});
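
Note that include_prefix above is surfaced in the playground UI but never consumed by the task. As a minimal sketch (not part of the example itself), a boolean parameter like this could gate the prefix inside the task:

task: async (input, { parameters }) => {
  // Only prepend the prefix when the boolean parameter is enabled in the UI.
  const question = parameters.include_prefix
    ? `${parameters.prefix}: ${input}`
    : `${input}`;
  const completion = await client.chat.completions.create(
    parameters.main.build({ input: question }),
  );
  return completion.choices[0].message.content ?? "";
},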

Running a remote eval from a playground

To run a remote eval from a playground, select + Remote from the Task pane and choose from the evals exposed on localhost or from your configured remote sources.

Remote eval in playground

Configure remote eval sources

To configure remote eval source URLs for a project, navigate to Configuration > Remote evals, then select + Remote eval source to add a new source for your project.

Configure remote eval

Limitations

  • The dataset defined in your remote eval is ignored.
  • Scorers defined in your remote eval run alongside any scorers you configure in the playground.
  • Remote evals are limited to TypeScript only. Python support is coming soon.
