You can't stare at a prompt and know what's going to happen. A word change that fixes one input can break ten others. An instruction that looks clear to you gets interpreted differently by the model depending on context. The only way to know whether a prompt is good is to measure it.
Prompt optimization is an iterative workflow. You write a prompt, score it against real data, find what's failing, fix it, and score again. Each cycle produces a measurably better version. At Braintrust, we call this going from vibes to verified.
This post walks through the prompt optimization loop step by step using a concrete example. If you're looking for prompt engineering fundamentals like few-shot examples, chain-of-thought, and modular prompt design, our guide to systematic prompt engineering covers those techniques in depth. If you want a broader overview of the eval workflow, How to eval: The Braintrust way is a good starting point. Here, we're focused on the process that turns a decent prompt into a reliable one.
Most teams spend their prompt optimization time on the first draft. They choose their words carefully, add a few examples, review the instructions, and test against a handful of inputs. When those inputs look good, they ship.
The problem is that careful writing only gets you so far. LLMs are sensitive to small wording changes. Swapping one word in an instruction can shift accuracy by several percentage points, and you can't predict which direction. A prompt that handles your five test cases perfectly might fail on 30% of real-world inputs in ways you never expected.
The teams that build reliable prompts don't write better first drafts. They run more cycles through the prompt optimization loop. Each cycle reveals a new category of failure, and each fix makes the prompt stronger across the full range of inputs it will see in production.
The prompt optimization loop has five steps. We'll walk through each one with a running example: a prompt that classifies customer support tickets into categories like billing, refunds, technical support, and account access.
Start with a clean first draft. Apply whatever prompt engineering techniques you know. Clear instructions, a defined output format, a few examples if you have them.
Here's a starting point for our classifier:
You are a customer support ticket classifier.
Classify the following support ticket into exactly one category:
- billing
- refunds
- technical_support
- account_access
Respond with only the category name, nothing else.
Ticket: {{input}}
This prompt is clear, focused on one task, and specifies the output format. By most prompt engineering standards, it's fine. The question is whether "fine" holds up across hundreds of real tickets.
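Outside of Braintrust, running this prompt is just a matter of substituting the ticket text into the {{input}} placeholder before calling your model. A minimal sketch of that step (the render_prompt helper is illustrative, not part of any SDK):

```python
PROMPT = """You are a customer support ticket classifier.

Classify the following support ticket into exactly one category:
- billing
- refunds
- technical_support
- account_access

Respond with only the category name, nothing else.

Ticket: {{input}}"""

def render_prompt(template: str, ticket: str) -> str:
    # Substitute the ticket text into the {{input}} placeholder
    return template.replace("{{input}}", ticket)

rendered = render_prompt(PROMPT, "I can't log into my account")
```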
A scorer defines what "good" means for your use case. It takes the model's output, along with the expected answer when you have one, and returns a numeric grade. For our classifier, the scorer is straightforward: does the predicted category match the correct label?

from typing import Optional

# Returns 1.0 if the predicted category matches the expected label, else 0.0
def handler(
    output: str,
    expected: Optional[str],
) -> float:
    if expected is None:
        return 0.0
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0
This is an exact match scorer. It returns 1 if the model got the category right and 0 if it didn't. Across a full dataset, the average of these scores gives you an accuracy percentage.
For classification, exact match works well. For open-ended generation tasks like summarization or writing, scoring gets harder. You might use a second LLM call to rate quality on a rubric, but LLM judges have their own biases. They tend to prefer longer answers and can grade inconsistently. Building a good scorer for subjective tasks is itself an iterative problem, and that's one more reason the prompt optimization loop matters.
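If you do reach for an LLM judge, keep its contract machine-checkable: ask the judge for a rubric score in a fixed format and parse it defensively, treating unparseable output as a failure. A sketch of the parsing side (the "Score: N/5" format is an assumption for illustration, not a Braintrust convention):

```python
import re

def parse_judge_score(judge_output: str) -> float:
    """Extract a 'Score: N/M' rating from judge text and normalize to 0..1."""
    match = re.search(r"Score:\s*(\d+)\s*/\s*(\d+)", judge_output)
    if not match:
        return 0.0  # unparseable judge output counts as a failure
    score, denom = int(match.group(1)), int(match.group(2))
    if denom == 0:
        return 0.0
    return min(score / denom, 1.0)
```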
At Braintrust, scorers are first-class objects. You can use our built-in scorers for common checks like factuality, relevance, and similarity. You can write custom scorers like the one above. And for agent workflows, Braintrust supports trace-level scorers that evaluate entire multi-step execution paths rather than just the final output. The key point: whatever scorer you build runs on every experiment, so every prompt revision gets graded the same way.
Five hand-picked examples won't tell you much. You need dozens or hundreds of inputs that represent what your prompt will actually see in production.
For our classifier, that means a dataset of real support tickets with correct labels. The dataset should cover common cases for each category, edge cases where categories overlap, unusual phrasing or formatting, and tickets that could reasonably belong to more than one category.
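In code, such a dataset is just a list of input/expected pairs. A sketch of the shape (the tickets and labels here are invented examples):

```python
# Each row pairs a real ticket with its correct label
dataset = [
    {"input": "I was charged twice for my subscription last month",
     "expected": "refunds"},
    {"input": "How do I switch from monthly to annual billing?",
     "expected": "billing"},
    {"input": "The app crashes every time I open the reports page",
     "expected": "technical_support"},
    {"input": "I'm locked out after too many password attempts",
     "expected": "account_access"},
]

categories = {row["expected"] for row in dataset}
```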
In Braintrust, you run this as an experiment. The experiment sends every input through your prompt, records the output, runs your scorer, and saves the results as a versioned snapshot. You now have a baseline.
Say our first experiment scores 74% accuracy across 200 tickets. That's the number we need to beat.
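Concretely, that accuracy number is just the mean of the per-example scores. A sketch of the aggregation, with a small set of hypothetical (output, expected) results:

```python
from typing import Optional

def exact_match(output: str, expected: Optional[str]) -> float:
    # Same exact-match logic as the scorer above
    if expected is None:
        return 0.0
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0

# Hypothetical results: (model output, expected label) per ticket
results = [
    ("billing", "billing"),
    ("refunds", "refunds"),
    ("billing", "refunds"),  # misclassified: refund request read as billing
    ("account_access", "account_access"),
]

accuracy = sum(exact_match(o, e) for o, e in results) / len(results)
print(f"accuracy: {accuracy:.0%}")  # accuracy: 75%
```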
The overall accuracy tells you how the prompt performs. The individual failures tell you why.
Look at the tickets the model got wrong. Patterns will emerge. Maybe the model confuses refund requests with billing inquiries 40% of the time. Maybe tickets that mention "charge" get classified as billing even when the customer is asking for a refund. Maybe account lockout tickets get split between technical support and account access depending on whether the customer says "can't log in" versus "locked out."
In Braintrust, you can filter experiment results by score, sort by failures, and inspect individual examples. The side-by-side view shows you exactly what the model predicted versus what the correct label was. This is where you stop guessing and start diagnosing.
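You can also surface these patterns programmatically by counting (expected, predicted) pairs among the failures; the most common pair points at the category boundary to fix first. A sketch with made-up failure data:

```python
from collections import Counter

# Hypothetical (expected, predicted) labels pulled from failed examples
failures = [
    ("refunds", "billing"),
    ("refunds", "billing"),
    ("account_access", "technical_support"),
    ("refunds", "billing"),
]

confusion = Counter(failures)
for (expected, predicted), count in confusion.most_common():
    print(f"{expected} -> {predicted}: {count}")
```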
The pattern here is clear: the model doesn't have enough context to distinguish refunds from billing. The categories are too close together, and the prompt doesn't explain the difference.
Now you make a targeted fix based on what the scores revealed. You're not rewriting the whole prompt. You're addressing the specific failure pattern you found.
For our classifier, the fix is adding category definitions and a few examples that show the boundary between billing and refunds:
You are a customer support ticket classifier.
Classify the following support ticket into exactly one category:
- billing: Questions about charges, payment methods, invoices, or subscription plans. The customer wants to understand or change how they pay.
- refunds: The customer wants money back for a charge already made. Look for words like "refund," "money back," "cancel and get refund," or "charged incorrectly."
- technical_support: The product isn't working as expected. Bugs, errors, crashes, slow performance.
- account_access: The customer can't get into their account. Login issues, password resets, locked accounts, MFA problems.
Examples:
Ticket: "I was charged twice for my subscription last month"
Category: refunds
Ticket: "How do I switch from monthly to annual billing?"
Category: billing
Ticket: "I got charged but I want my money back, I cancelled last week"
Category: refunds
Respond with only the category name, nothing else.
Ticket: {{input}}
Rerun the same experiment with the same dataset. Braintrust shows the comparison side by side: which tickets improved, which regressed, which stayed the same.
Say accuracy jumps from 74% to 88%. Refund/billing confusion dropped significantly. But you also notice a few new failures. Two tickets that were correct before are now wrong. One edge case about a "billing refund" now goes to billing instead of refunds.
That's normal. Almost every prompt change improves some inputs and regresses others. The prompt optimization loop lets you see both sides. Without it, you'd only notice the improvements and miss the regressions until they hit production.
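You can replicate that improved/regressed/unchanged breakdown yourself by diffing per-example scores between two runs, keyed by example. A sketch with hypothetical scores:

```python
# Hypothetical per-ticket scores from two experiment runs, keyed by ticket id
baseline = {"t1": 1.0, "t2": 0.0, "t3": 1.0, "t4": 0.0}
revised  = {"t1": 1.0, "t2": 1.0, "t3": 0.0, "t4": 1.0}

improved  = [t for t in baseline if revised[t] > baseline[t]]
regressed = [t for t in baseline if revised[t] < baseline[t]]
unchanged = [t for t in baseline if revised[t] == baseline[t]]

print(f"improved: {improved}, regressed: {regressed}, unchanged: {unchanged}")
```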
The first cycle got you from 74% to 88%. Each additional cycle works the same way. Review the remaining failures, identify the next pattern, make a targeted fix, rerun. For more on how to iterate on your evals themselves, including expanding datasets and refining scorers, we've written a separate guide.
Maybe cycle two adds handling for tickets that mention both billing and refund language. Accuracy goes to 92%. Cycle three addresses the account access versus technical support confusion with better examples. Accuracy hits 95%.
Each cycle takes minutes when the tooling is fast. The Playground lets you edit prompts in a browser and see scores in real time, so you don't need to redeploy anything to test a change. You can compare two prompt approaches side by side and see which one scores higher. Product managers can run these cycles without engineering support.
When you want to speed things up further, Braintrust Loop can analyze your failure patterns and suggest specific prompt edits. Loop is an AI assistant built into the platform. It generates prompt revisions, creates scorers, builds eval datasets from production logs, and turns production failures into permanent eval cases. After an experiment run, Loop identifies what went wrong and proposes fixes. You apply the suggestion, rerun, and check the scores.
Optimization is only valuable if you keep the gains. A prompt that scores 95% today can regress next week when someone adds a new instruction, changes the model version, or updates the system message.
Braintrust's GitHub Action runs your evals on every pull request and posts results as comments. You set a quality threshold. If a change drops accuracy below that bar, the merge gets blocked. That gives you the same kind of quality gate for AI that you already use for traditional code.
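Under the hood, the gate boils down to a threshold check that fails the build. A minimal sketch of that logic (the 0.90 bar and exit-code convention are illustrative; Braintrust's GitHub Action handles this for you):

```python
def quality_gate(accuracy: float, threshold: float = 0.90) -> int:
    """Return a CI exit code: 0 passes the gate, 1 blocks the merge."""
    if accuracy < threshold:
        print(f"FAIL: accuracy {accuracy:.1%} is below the {threshold:.0%} bar")
        return 1
    print(f"PASS: accuracy {accuracy:.1%} meets the {threshold:.0%} bar")
    return 0

exit_code = quality_gate(0.88)  # with a 0.90 bar, this change would be blocked
```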
This is how the prompt optimization loop connects to production. Your eval dataset grows over time as you add new failure cases. Your scorers catch regressions automatically. Every prompt change gets measured before it ships. For a deeper look at how exploration, evaluation, and data collection feed into each other, see our post on AI development loops.
Notion's AI team went from fixing 3 issues per day to 30 after building this workflow. That 10x jump didn't come from writing better prompts on the first try. It came from running the prompt optimization loop faster.
They built hundreds of datasets testing specific criteria. They replaced manual review with automated scoring. They fed production traces directly into eval cases, so real-world failures became permanent regression checks. Every prompt change got measured against real data before it shipped.
At Braintrust, we built experiments, scorers, the Playground, Loop, and CI/CD gates to make the prompt optimization loop fast enough to run in a single sitting. Sign up on our free tier to get 1 GB of processed data, 10k scores, and unlimited users.
Run your first experiment in under an hour. See the difference between "this feels right" and "this scores 95% on 200 eval cases."