
Testing different models with different prompts: A hands-on guide with Braintrust

21 August 2025 · Braintrust Team

When you're building AI features, the right combination of model and prompt can make all the difference. In this post, I'll walk you through a workflow that lets you confidently test those combinations at scale, and explain why Braintrust is the most developer‑friendly way to do it.

Why testing models with prompts matters

Every AI developer I've worked with eventually asks the same thing: "Which model should I use—and with what prompt?"

When you're dealing with real user queries or production traffic, your instincts only take you so far. You need evidence:

  • To understand which prompts produce the most accurate results
  • To measure cost and latency
  • To ensure changes are reproducible
  • To avoid guesswork

Braintrust is built for that. It turns model and prompt testing from guesswork into rigorous, measurable experiments.

How traditional testing falls short

  • Manual evaluations are slow, brittle, and hard to reproduce
  • Small sample sizes mean you end up trusting gut over data
  • Model outputs aren't tied to prompt versions, runtimes, or provider details, so when something breaks it's hard to debug

You shouldn't have to trade off accuracy, cost, and speed blindly. Braintrust lets you compare all three across model/prompt combinations.

A systematic model × prompt testing workflow

The developer-friendly way I approach this is to treat the problem as a matrix of tests:

Model            Prompt A    Prompt B    Prompt C
GPT‑4o           Score       Score       Score
GPT‑3.5‑turbo    Score       Score       Score
Claude 3 Haiku   Score       Score       Score

Each cell represents running a dataset through one model with one specific prompt, scoring the outputs, and measuring cost and latency. Braintrust covers all of these steps in one platform.
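
To make the matrix concrete, here's a minimal sketch of the grid as plain data (the model identifiers follow common provider naming; the prompt slugs are hypothetical placeholders):

# Each (model, prompt) pair is one cell in the matrix: one experiment to run and score.
MODELS = ["gpt-4o", "gpt-3.5-turbo", "claude-3-haiku-20240307"]
PROMPTS = ["support-prompt-a", "support-prompt-b", "support-prompt-c"]  # hypothetical prompt slugs

GRID = [(model, prompt) for model in MODELS for prompt in PROMPTS]  # 9 cells to evaluate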

Step 1: Define your dataset

Grab a representative set of inputs—like actual customer queries, support tickets, or document summaries. In Braintrust, you store these as datasets, keeping them versioned and shareable across projects.
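
As a minimal sketch, creating and populating such a dataset with the Python SDK looks roughly like this (the project name, dataset name, and records are placeholders):

import braintrust

# Create (or reopen) a versioned dataset inside a Braintrust project.
dataset = braintrust.init_dataset(project="Support Bot", name="Customer Questions")

# Each record pairs an input with the expected answer used for scoring later.
dataset.insert(
    input="How do I reset my password?",
    expected="Walk the user through the password-reset flow in account settings.",
)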

Step 2: Manage prompt versions in Braintrust

Braintrust treats prompts as versioned, first-class objects. You can author, update, and track them alongside code.

"You are a support agent. Respond clearly and accurately using product documentation.
Question: {{input}}"

You can manage changes, pin by version ID, and understand how revisions affect results.
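
As a sketch, pulling a pinned prompt version into code might look like this (the project name, slug, and version ID are hypothetical):

import braintrust

# Load a specific prompt version so experiments stay reproducible.
prompt = braintrust.load_prompt(project="Support Bot", slug="support-agent", version="<version-id>")

# Fill in the template variables; the rendered arguments can be passed to a completion call.
rendered = prompt.build(input="How do I reset my password?")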

Step 3: Choose models via the LLM proxy

Braintrust's unified LLM proxy sits between your code and providers such as OpenAI, Anthropic, Mistral, AWS Bedrock, and Vertex AI, so you can swap models without rewriting your code. Because the proxy is OpenAI‑compatible, the standard OpenAI client works for every provider:

import os
from openai import OpenAI

# Point the OpenAI client at Braintrust's proxy; change `model` to switch providers.
client = OpenAI(base_url="https://api.braintrust.dev/v1/proxy", api_key=os.environ["BRAINTRUST_API_KEY"])
result = client.chat.completions.create(model="gpt-4o", messages=[{"role": "user", "content": "Summarize: Text here…"}])

Step 4: Set up scorers for automated evaluation

Use autoevals or custom scoring functions to measure performance:

# A custom scorer: expects `expected` to be a list of keywords and returns 1.0
# only when every keyword appears in the model's output.
def contains_keywords(output, expected, **kwargs):
    return 1.0 if all(k in output for k in expected) else 0.0

# Pass the function to an eval's `scores` list (see Step 5) rather than registering it globally.

This gives you pass/fail metrics instead of eyeballing outputs.
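
If you'd rather not write every scorer yourself, the autoevals library ships ready‑made heuristic and LLM‑as‑a‑judge scorers. A minimal sketch:

from autoevals import Factuality

# LLM-as-a-judge scorer that checks the output against the expected answer.
score = Factuality()(
    input="How do I reset my password?",
    output="Go to account settings and click 'Reset password'.",
    expected="Walk the user through the password-reset flow in account settings.",
)
print(score.score)  # a value between 0 and 1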

Step 5: Run experiments in the UI

Braintrust's experiments UI allows you to:

  • Select models, prompt versions, datasets, and scorers
  • Run batches of test cases
  • Inspect side‑by‑side comparisons of accuracy, latency, and cost

You can drill into individual examples, understand regressions, and fine‑tune your configurations.
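
The same comparison can also be driven from code. Here's a minimal sketch that sweeps the model × prompt grid from earlier, reusing the dataset and scorer defined above (ask_model is a hypothetical helper that renders the prompt and calls the proxy):

from braintrust import Eval, init_dataset

# One experiment per (model, prompt) cell; Braintrust records scores, latency, and cost for each.
for model, prompt_slug in GRID:
    Eval(
        "Support Bot",
        experiment_name=f"{model} x {prompt_slug}",
        data=init_dataset(project="Support Bot", name="Customer Questions"),
        task=lambda input, m=model, p=prompt_slug: ask_model(m, p, input),  # hypothetical helper
        scores=[contains_keywords],
    )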

Why Braintrust shines for model vs. prompt testing

  • Unified versioning — prompts, datasets, models, scorers all live together and are reproducible
  • Cross‑provider flexibility — switch between LLM providers without code churn
  • Observable and fast — Brainstore's optimized backend powers fast search and analytics
  • Real‑time iteration — use Playgrounds to prototype combinations before formal experiments

Example: Picking the best model for a support bot

  1. Gather 120 real questions
  2. Author three prompt variants
  3. Compare:
    • GPT‑4o
    • GPT‑3.5-turbo
    • Claude 3 Haiku
  4. Measure accuracy, latency, and estimated cost

Findings:

  • GPT‑4o was the most accurate, but came in at roughly 3× our cost target
  • Claude 3 Haiku hit ~92% of GPT‑4o's accuracy at ~30% of the cost and with lower latency

The outcome: ship Claude 3 Haiku in production, with GPT‑4o as a fallback, all orchestrated and benchmarked in Braintrust.

Key takeaways

  • Prompt quality matters as much as model choice
  • Automated, reproducible scoring beats visual inspection
  • Dataset coverage keeps testing aligned with real usage
  • Latency and cost are first‑class metrics—not afterthoughts

Wrap up

Model and prompt testing doesn't have to be fragmented. Braintrust brings together versioning, evaluation, observability, and provider flexibility in a developer-first platform.

If you're ready to test with confidence and choose better model/prompt combinations based on data, start experimenting with Braintrust today. Your AI workflows will become clearer, faster, and more resilient.