Start with a small dataset
Begin with 5-10 representative examples that cover your core use cases. A small, well-chosen dataset is more useful than a large dataset of easy cases. Expand your dataset incrementally, guided by actual failures from production or previous experiments — not by synthetic construction of edge cases you imagine might exist. Synthetic data is fine for bootstrapping, but real-world failures reveal what actually goes wrong.

Validate your scoring system
Your evaluation is only as reliable as your scorers. Before trusting score trends, confirm that your scorers actually measure what you intend:
- Test obvious cases: Run the scorer on inputs where the correct score is clear — a clearly correct output should score near 1, a clearly wrong one near 0. If it doesn’t, the scorer needs work.
- Read the reasoning: LLM-as-a-judge scorers include a chain-of-thought rationale in the score span. Open a few traces and read it. Is the judge reasoning about the right things? Is it consistent?
- Watch for bias: LLM judges tend to favor longer, more formal, or more confident-sounding outputs regardless of accuracy. Test for this explicitly by comparing a correct short answer against a wrong but verbose one.
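A quick way to run all three checks is a small script over cases where the right score is obvious. The judge below is a deliberately flawed stand-in that rewards length, so the checks have something to catch — in practice you would pass in your real scorer function instead:

```python
# Sanity-check a scorer on inputs where the correct score is obvious.
# `length_biased_judge` is a toy stand-in for an LLM-as-a-judge scorer;
# its verbosity bonus models the bias we want the probe to detect.

def length_biased_judge(question: str, output: str, expected: str) -> float:
    correct = expected.lower() in output.lower()
    verbosity_bonus = min(len(output) / 200, 0.6)  # the bias to catch
    return min(1.0, (0.5 if correct else 0.0) + verbosity_bonus)

def check_scorer(judge) -> list[str]:
    problems = []
    # Obvious cases: clearly right should score high, clearly wrong low.
    if judge("What is 2+2?", "4", "4") < 0.7:
        problems.append("correct answer scored too low")
    if judge("What is 2+2?", "5", "4") > 0.3:
        problems.append("wrong answer scored too high")
    # Bias probe: short correct answer vs. long, confident, wrong one.
    verbose_wrong = "After careful analysis of the arithmetic, " * 10 + "the answer is 5."
    if judge("What is 2+2?", "4", "4") <= judge("What is 2+2?", verbose_wrong, "4"):
        problems.append("judge favors verbose wrong answers")
    return problems

print(check_scorer(length_biased_judge))
```

An empty list from `check_scorer` is a necessary, not sufficient, condition — it should be paired with reading the judge's rationales on real traces.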
Run evaluations in a loop
Running an evaluation once tells you where you are. Running evaluations in a continuous loop is how you improve. Each iteration surfaces new failures, expands your dataset with real examples, and lets you measure whether a change actually helped — without breaking what already works.

Identify failures
Start with your production logs or the last experiment. Sort by score to find the lowest-performing cases and look for patterns — do failures cluster around a particular topic, input type, or user intent?

Use Loop to analyze patterns across many cases at once:
- “What do the low-scoring cases have in common?”
- “Categorize the failures in this experiment”
- “Which input types perform worst?”
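Loop can answer these questions directly. If you want to sanity-check the same breakdown yourself, a manual version over exported results is only a few lines — the rows and field names below are illustrative, so adapt them to your export format:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical exported experiment results: each row carries a score
# plus whatever metadata was attached to the test case.
results = [
    {"input_type": "refund_request", "score": 0.42},
    {"input_type": "refund_request", "score": 0.38},
    {"input_type": "order_status",   "score": 0.91},
    {"input_type": "order_status",   "score": 0.88},
    {"input_type": "product_faq",    "score": 0.79},
]

by_type = defaultdict(list)
for row in results:
    by_type[row["input_type"]].append(row["score"])

# Worst-performing input types first.
ranking = sorted(by_type.items(), key=lambda kv: mean(kv[1]))
for input_type, scores in ranking:
    print(f"{input_type}: {mean(scores):.2f} (n={len(scores)})")
```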
Expand your dataset
Add the failing cases to your dataset. Real failures are more valuable than synthetic examples — they reflect actual user behavior and surface edge cases you wouldn’t have thought to construct.

Use topics to find clusters of similar production logs and pull them into a dataset in bulk. This is especially useful when failures share a common pattern (e.g., refund requests, multi-step instructions, ambiguous inputs).
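As a sketch of the bulk workflow, assuming logs exported as dicts with hypothetical `score` and `topic` fields, this pulls a low-scoring cluster into a JSONL dataset file (adjust names and the target format to your own tooling):

```python
import json

# Hypothetical shape for exported production logs.
logs = [
    {"input": "I want my money back",   "score": 0.2, "topic": "refunds"},
    {"input": "Where is my order?",     "score": 0.9, "topic": "shipping"},
    {"input": "Refund my last charge",  "score": 0.3, "topic": "refunds"},
]

# Pull every low-scoring case from one failure cluster into the dataset.
new_rows = [
    # `expected` is left empty: fill in the reference output by hand later.
    {"input": log["input"], "expected": None, "metadata": {"topic": log["topic"]}}
    for log in logs
    if log["topic"] == "refunds" and log["score"] < 0.5
]

with open("dataset.jsonl", "a") as f:
    for row in new_rows:
        f.write(json.dumps(row) + "\n")
```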
Establish a baseline
Before making any changes, run an experiment against your updated dataset. This is your baseline — the number you’re trying to beat.

Record the baseline experiment name or ID so you can reference it when comparing later. Don’t rely on memory or approximate comparisons.
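One lightweight way to make the baseline explicit is to write it to a small file your comparison scripts can read later. The identifiers below are hypothetical — use whatever names and IDs your eval tooling assigns:

```python
import json
import datetime

# Minimal baseline record; field values are illustrative.
baseline = {
    "experiment_id": "exp-baseline-001",  # hypothetical experiment ID
    "dataset_version": "v3",
    "mean_score": 0.81,
    "recorded_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
}

with open("baseline.json", "w") as f:
    json.dump(baseline, f, indent=2)
```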
Make a targeted change
Change one thing: a prompt instruction, a system message, a model, a parameter. Changing multiple things at once makes it impossible to know what caused any improvement or regression you observe.

If your dataset reveals a pattern — for example, that the model handles refund requests poorly — write a focused fix (a new instruction or a few-shot example) rather than rewriting the whole prompt.
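For example, a focused fix can be appended to the existing prompt rather than rewriting it, leaving the instructions that already-passing cases depend on untouched. The prompt text here is illustrative:

```python
BASE_SYSTEM_PROMPT = """You are a support assistant.
Answer user questions accurately and concisely."""

# Focused fix for the failure pattern the dataset revealed (refund
# requests): one instruction plus one few-shot example. The rest of
# the prompt is unchanged, so passing cases stay stable.
REFUND_FIX = """
When the user asks for a refund, confirm the order ID before
explaining the refund policy.

Example:
User: I want a refund for my order.
Assistant: I can help with that. Could you share your order ID so I can look it up?"""

system_prompt_v2 = BASE_SYSTEM_PROMPT + "\n" + REFUND_FIX
print(system_prompt_v2)
```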
Verify without regression
Run a new experiment and compare it against your baseline. Look for:
- Improvement on the cases you targeted
- No regressions on cases that were already passing
- Score changes that reflect real output quality changes, not scorer noise
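A sketch of that comparison, keyed by stable test-case IDs, with a threshold to separate real score changes from scorer noise (the scores and the threshold value are illustrative):

```python
# Per-case comparison of a new experiment against the baseline.
baseline  = {"case-1": 0.90, "case-2": 0.30, "case-3": 0.80, "case-4": 0.70}
candidate = {"case-1": 0.90, "case-2": 0.80, "case-3": 0.50, "case-4": 0.72}

NOISE = 0.05  # differences below this are treated as scorer noise

improved  = [c for c in baseline if candidate[c] - baseline[c] > NOISE]
regressed = [c for c in baseline if baseline[c] - candidate[c] > NOISE]
unchanged = [c for c in baseline if c not in improved and c not in regressed]

print("improved:", improved)    # the cases you targeted should appear here
print("regressed:", regressed)  # should be empty before you ship
print("unchanged:", unchanged)
```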
Repeat
Merge the new dataset rows into your main dataset and update your baseline to the latest experiment. The next cycle starts with broader coverage than the last.

Over time, your dataset grows to reflect the full distribution of real-world inputs. Your evals become more reliable. Your baselines become harder to beat — which means improvements are real.

At steady state, the loop connects online and offline evaluation:
- Online scoring rules run continuously on production traffic and surface low-quality interactions.
- Interesting traces get added to datasets via the UI or SDK.
- Offline experiments run against those datasets, testing fixes before they ship.
- Deployed changes are monitored by online scoring again.
Change one variable at a time
When you change multiple things between experiments — prompt, model, parameters, scorer — you can’t tell which change caused the result. If scores improve, you don’t know what to keep. If they regress, you don’t know what to revert. Make one change per experiment. This takes more runs but produces interpretable results. The only exception is when you’re doing a full system overhaul and want a rough directional signal — but even then, plan to isolate variables before shipping.

Account for nondeterminism
LLM outputs are nondeterministic. A single experiment run can make a bad change look good, or mask a real improvement. Rather than running the same experiment multiple times, use a larger dataset or increase the trial count within a single experiment — both take full advantage of concurrency and give you more signal without the overhead of repeated runs. Compare averages rather than individual results. This matters most when score differences are small (under 5 percentage points). If results vary significantly, your scorer or dataset may need more signal.

Keep your baseline current
Always compare against the version of your system that is actually in production, not an old experiment from months ago. A baseline that doesn’t reflect current behavior makes comparisons meaningless. Update your baseline whenever you make a change to the prompt, model, or scorer. If you’re unsure which experiment represents the current state, check your deployment history or run a fresh baseline before making changes. Stale baselines are one of the most common sources of misleading eval results.

Segment results by metadata
Aggregate scores hide problems. An overall score of 0.85 can mask a score of 0.40 on a specific input type that matters to your users. Add metadata to your test cases (topic, input category, user intent, etc.) and use group by in the experiments table to break down results by category. Sort by regressions to find which segments got worse.

Next steps
- Interpret results from experiments
- Compare experiments systematically
- Write scorers that measure what matters
- Score production traces to surface failures automatically