Reasoning models like OpenAI's o4-mini, Anthropic's Claude Sonnet 4.5, and Google's Gemini 2.5 Flash generate intermediate thinking steps before producing a final response. Braintrust provides unified support for these models across providers.
Hybrid deployments require v0.0.74 or later for reasoning support.
Three parameters control reasoning behavior (a sketch of the reasoning_budget form follows this list):

- reasoning_effort: Intensity of reasoning (low, medium, or high). Compatible with OpenAI's parameter of the same name.
- reasoning_enabled: Boolean that explicitly enables or disables reasoning output. Has no effect on OpenAI models, which always reason at a default "medium" effort.
- reasoning_budget: Token budget for reasoning. Use either reasoning_effort or reasoning_budget, not both.
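The examples in this guide all use reasoning_effort. As a point of comparison, here is a minimal sketch of the reasoning_budget form, assuming the same proxy setup shown in Basic usage below; the model name and budget value are illustrative:

```typescript
import { OpenAI } from "openai";
import "@braintrust/proxy/types"; // adds the reasoning parameters to the SDK types

const openai = new OpenAI({
  baseURL: `${process.env.BRAINTRUST_API_URL || "https://api.braintrust.dev"}/v1/proxy`,
  apiKey: process.env.BRAINTRUST_API_KEY,
});

// reasoning_budget replaces reasoning_effort; don't pass both.
const response = await openai.chat.completions.create({
  model: "claude-sonnet-4-5-20250929",
  reasoning_budget: 2048, // illustrative token budget for the thinking phase
  messages: [{ role: "user", content: "What's 15% of 240?" }],
});
```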
Use in code
Braintrust provides type augmentation for reasoning parameters:
- TypeScript: @braintrust/proxy/types extends the OpenAI SDK types
- Python: braintrust-proxy provides casting utilities and type-safe helpers
Basic usage
```typescript
import { OpenAI } from "openai";
import "@braintrust/proxy/types";

const openai = new OpenAI({
  baseURL: `${process.env.BRAINTRUST_API_URL || "https://api.braintrust.dev"}/v1/proxy`,
  apiKey: process.env.BRAINTRUST_API_KEY,
});

const response = await openai.chat.completions.create({
  model: "claude-sonnet-4-5-20250929",
  reasoning_effort: "medium",
  messages: [
    {
      role: "user",
      content: "What's 15% of 240?",
    },
  ],
});

// Access final response
console.log(response.choices[0].message.content);
// Output: "15% of 240 is 36."

// Access reasoning steps
console.log(response.choices[0].reasoning);
// Output: array of reasoning objects with the step-by-step calculation
```
Reasoning structure
Reasoning steps include unique IDs and content:
```json
[
  {
    "id": "reasoning_step_1",
    "content": "I need to calculate 15% of 240..."
  },
  {
    "id": "reasoning_step_2",
    "content": "240 × 0.15 = 36..."
  }
]
```
The id field contains provider-specific signatures that must be preserved in multi-turn conversations. Always use exact IDs returned by the provider.
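As a rough TypeScript sketch of that shape (the field names are inferred from the example above; the authoritative definitions ship with @braintrust/proxy/types):

```typescript
// Sketch of a single reasoning step, based on the example above.
// The real types come from "@braintrust/proxy/types".
interface ReasoningStep {
  id: string; // provider-specific signature; echo it back verbatim in later turns
  content: string; // the model's intermediate thinking for this step
}
```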
Stream reasoning
Reasoning streams through delta.reasoning in streaming responses:
```typescript
import { OpenAI } from "openai";
import "@braintrust/proxy/types";

const openai = new OpenAI({
  baseURL: `${process.env.BRAINTRUST_API_URL || "https://api.braintrust.dev"}/v1/proxy`,
  apiKey: process.env.BRAINTRUST_API_KEY,
});

const stream = await openai.chat.completions.create({
  model: "claude-sonnet-4-5-20250929",
  reasoning_effort: "high",
  stream: true,
  messages: [
    {
      role: "user",
      content: "Explain quantum entanglement in simple terms.",
    },
  ],
});

for await (const chunk of stream) {
  const delta = chunk.choices[0]?.delta;

  // Handle regular content
  if (delta?.content) {
    process.stdout.write(delta.content);
  }

  // Handle reasoning deltas
  if (delta?.reasoning) {
    console.log("\nReasoning step:", delta.reasoning);
  }
}
```
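If you want to reuse streamed reasoning in a later turn (see Multi-turn conversations below), one approach is to collect the reasoning deltas as they arrive. This is an alternative version of the loop above, sketched under the assumption that delta.reasoning chunks can simply be accumulated in arrival order:

```typescript
// Accumulate the streamed reasoning alongside the visible content.
// Assumes chunks arrive in order; the collected array is a sketch of what
// you might pass back as the assistant's reasoning in a follow-up request.
const reasoningChunks: unknown[] = [];
let finalContent = "";

for await (const chunk of stream) {
  const delta = chunk.choices[0]?.delta;
  if (delta?.content) {
    finalContent += delta.content;
  }
  if (delta?.reasoning) {
    reasoningChunks.push(delta.reasoning);
  }
}
```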
Multi-turn conversations
Include reasoning from previous turns to let models build on earlier thinking:
```typescript
import { OpenAI } from "openai";
import "@braintrust/proxy/types";

const openai = new OpenAI({
  baseURL: `${process.env.BRAINTRUST_API_URL || "https://api.braintrust.dev"}/v1/proxy`,
  apiKey: process.env.BRAINTRUST_API_KEY,
});

const firstResponse = await openai.chat.completions.create({
  model: "claude-sonnet-4-5-20250929",
  reasoning_effort: "medium",
  messages: [
    {
      role: "user",
      content: "What's the best approach to solve a complex math problem?",
    },
  ],
});

// Include previous reasoning in the next turn
const secondResponse = await openai.chat.completions.create({
  model: "claude-sonnet-4-5-20250929",
  reasoning_effort: "medium",
  messages: [
    {
      role: "user",
      content: "What's the best approach to solve a complex math problem?",
    },
    {
      role: "assistant",
      content: firstResponse.choices[0].message.content,
      reasoning: firstResponse.choices[0].reasoning,
    },
    {
      role: "user",
      content: "Now apply that approach to solve: 2x² + 5x - 3 = 0",
    },
  ],
});
```
Test in playgrounds
Use playgrounds to test reasoning models interactively:
- Select a reasoning-capable model
- Set reasoning_effort in parameters
- Run evaluations
- View reasoning steps in trace view
Reasoning steps appear as separate spans in the trace, making it easy to understand the model’s thinking process.
Evaluate reasoning quality
Create scorers that evaluate both final outputs and reasoning steps:
```typescript
import * as braintrust from "braintrust";

// A project handle is needed before calling scorers.create;
// the project name here is illustrative.
const project = braintrust.projects.create({ name: "reasoning-demo" });

project.scorers.create({
  name: "Reasoning quality",
  slug: "reasoning-quality",
  messages: [
    {
      role: "user",
      content:
        'Evaluate the reasoning steps: {{reasoning}}\n\nAre they logical and complete? Return "A" for excellent, "B" for adequate, "C" for poor.',
    },
  ],
  model: "gpt-4o",
  choiceScores: {
    A: 1,
    B: 0.5,
    C: 0,
  },
});
```
This helps you understand whether models are using sound reasoning paths to reach conclusions.