
Testing different models with different prompts: A hands-on guide with Braintrust

21 August 2025 · Braintrust Team

When you're building AI features, the right combination of model and prompt can make all the difference. In this post, I'll walk you through a workflow that lets you confidently test those combinations at scale, and explain why Braintrust is the most developer‑friendly way to do it.

Why testing models with prompts matters

Every AI developer I've worked with eventually asks the same thing: "Which model should I use—and with what prompt?"

When you're dealing with real user queries or production traffic, your instincts only take you so far. You need evidence:

  • To understand which prompts produce the most accurate results
  • To measure cost and latency
  • To ensure changes are reproducible
  • To avoid guesswork

Braintrust is built for that. It turns model and prompt testing from guesswork into rigorous, measurable experiments.

How traditional testing falls short

  • Manual evaluations are slow, brittle, and hard to reproduce
  • Small sample sizes mean you end up trusting gut over data
  • Model outputs aren't tied to prompt versions, runtimes, or provider details, so when something breaks it's hard to debug

You shouldn't have to trade off accuracy, cost, and speed blindly. Braintrust lets you compare all three across model/prompt combinations.

A systematic model × prompt testing workflow

The developer-friendly way I approach this is to treat the problem as a matrix of tests:

Model            Prompt A    Prompt B    Prompt C
GPT‑4o           Score       Score       Score
GPT‑3.5‑turbo    Score       Score       Score
Claude 3 Haiku   Score       Score       Score

Each cell represents running a dataset through one model with one specific prompt, scoring the outputs, and measuring cost and latency. Braintrust covers all of these steps in one platform.
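
To make the matrix concrete, here's a minimal sketch of the grid as plain data (the model identifiers follow common provider naming; the prompt slugs are hypothetical placeholders):

# Each (model, prompt) pair is one cell in the matrix: one experiment to run and score.
MODELS = ["gpt-4o", "gpt-3.5-turbo", "claude-3-haiku-20240307"]
PROMPTS = ["support-prompt-a", "support-prompt-b", "support-prompt-c"]  # hypothetical prompt slugs

GRID = [(model, prompt) for model in MODELS for prompt in PROMPTS]  # 9 cells to evaluate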

Step 1: Define your dataset

Grab a representative set of inputs—like actual customer queries, support tickets, or document summaries. In Braintrust, you store these as datasets, keeping them versioned and shareable across projects.
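
As a minimal sketch, creating and populating such a dataset with the Python SDK looks roughly like this (the project name, dataset name, and records are placeholders):

import braintrust

# Create (or reopen) a versioned dataset inside a Braintrust project.
dataset = braintrust.init_dataset(project="Support Bot", name="Customer Questions")

# Each record pairs an input with the expected answer used for scoring later.
dataset.insert(
    input="How do I reset my password?",
    expected="Walk the user through the password-reset flow in account settings.",
)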

Step 2: Manage prompt versions in Braintrust

Braintrust treats prompts as versioned, first-class objects. You can author, update, and track them alongside code.

"You are a support agent. Respond clearly and accurately using product documentation.
Question: {{input}}"

You can manage changes, pin by version ID, and understand how revisions affect results.
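
As a sketch, pulling a pinned prompt version into code might look like this (the project name, slug, and version ID are hypothetical):

import braintrust

# Load a specific prompt version so experiments stay reproducible.
prompt = braintrust.load_prompt(project="Support Bot", slug="support-agent", version="<version-id>")

# Fill in the template variables; the rendered arguments can be passed to a completion call.
rendered = prompt.build(input="How do I reset my password?")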

Step 3: Choose models via the LLM proxy

Braintrust's unified LLM proxy sits between your code and providers such as OpenAI, Anthropic, Mistral, AWS Bedrock, and Vertex AI, so you can swap models without rewriting your code. Because the proxy is OpenAI‑compatible, the standard OpenAI client works for every provider:

import os
from openai import OpenAI

# Point the OpenAI client at Braintrust's proxy; change `model` to switch providers.
client = OpenAI(base_url="https://api.braintrust.dev/v1/proxy", api_key=os.environ["BRAINTRUST_API_KEY"])
result = client.chat.completions.create(model="gpt-4o", messages=[{"role": "user", "content": "Summarize: Text here…"}])

Step 4: Set up scorers for automated evaluation

Use autoevals or custom scoring functions to measure performance:

# A custom scorer: expects `expected` to be a list of keywords and returns 1.0
# only when every keyword appears in the model's output.
def contains_keywords(output, expected, **kwargs):
    return 1.0 if all(k in output for k in expected) else 0.0

# Pass the function to an eval's `scores` list (see Step 5) rather than registering it globally.

This gives you pass/fail metrics instead of eyeballing outputs.
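
If you'd rather not write every scorer yourself, the autoevals library ships ready‑made heuristic and LLM‑as‑a‑judge scorers. A minimal sketch:

from autoevals import Factuality

# LLM-as-a-judge scorer that checks the output against the expected answer.
score = Factuality()(
    input="How do I reset my password?",
    output="Go to account settings and click 'Reset password'.",
    expected="Walk the user through the password-reset flow in account settings.",
)
print(score.score)  # a value between 0 and 1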

Step 5: Run experiments in the UI

Braintrust's experiments UI allows you to:

  • Select models, prompt versions, datasets, and scorers
  • Run batches of test cases
  • Inspect side‑by‑side comparisons of accuracy, latency, and cost

You can drill into individual examples, understand regressions, and fine‑tune your configurations.
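
The same comparison can also be driven from code. Here's a minimal sketch that sweeps the model × prompt grid from earlier, reusing the dataset and scorer defined above (ask_model is a hypothetical helper that renders the prompt and calls the proxy):

from braintrust import Eval, init_dataset

# One experiment per (model, prompt) cell; Braintrust records scores, latency, and cost for each.
for model, prompt_slug in GRID:
    Eval(
        "Support Bot",
        experiment_name=f"{model} x {prompt_slug}",
        data=init_dataset(project="Support Bot", name="Customer Questions"),
        task=lambda input, m=model, p=prompt_slug: ask_model(m, p, input),  # hypothetical helper
        scores=[contains_keywords],
    )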

Why Braintrust shines for model vs. prompt testing

  • Unified versioning — prompts, datasets, models, scorers all live together and are reproducible
  • Cross‑provider flexibility — switch between LLM providers without code churn
  • Observable and fast — Brainstore's optimized backend powers fast search and analytics
  • Real‑time iteration — use Playgrounds to prototype combinations before formal experiments

Example: Picking the best model for a support bot

  1. Gather 120 real questions
  2. Author three prompt variants
  3. Compare:
    • GPT‑4o
    • GPT‑3.5-turbo
    • Claude 3 Haiku
  4. Measure accuracy, latency, and estimated cost

Findings:

  • GPT‑4o was the most accurate, but came in at roughly 3× our cost target
  • Claude 3 Haiku hit ~92% of GPT‑4o's accuracy at ~30% of the cost and with lower latency

The outcome: ship Claude 3 Haiku in production, with GPT‑4o as a fallback, all orchestrated and benchmarked in Braintrust.

Key takeaways

  • Prompt quality matters as much as model choice
  • Automated, reproducible scoring beats visual inspection
  • Dataset coverage keeps testing aligned with real usage
  • Latency and cost are first‑class metrics—not afterthoughts

Wrap up

Model and prompt testing doesn't have to be fragmented. Braintrust brings together versioning, evaluation, observability, and provider flexibility in a developer-first platform.

If you're ready to test with confidence and choose better model/prompt combinations based on data, start experimenting with Braintrust today. Your AI workflows will become clearer, faster, and more resilient.