How Dropbox automates evals for conversational AI

15 October 2025 · Ornella Altunyan

This post is adapted from A practical blueprint for evaluating conversational AI at scale by Ranjitha Gurunath Kulkarni, Ameya Bhatawdekar, and Gonzalo Garcia from the Dropbox team.

Dropbox is a leading cloud storage and collaboration platform serving millions of users worldwide. With Dropbox Dash, they launched an AI-powered universal search and organization tool that helps users find, organize, and protect their work across all their connected apps.

Building Dash taught the Dropbox team a critical lesson: in the foundation-model era, AI evaluation matters just as much as model training. Behind Dash's simple text interface runs a complex chain of probabilistic stages: intent classification, document retrieval, ranking, prompt construction, model inference, and safety filtering. A tweak to any link in this chain can ripple unpredictably through the pipeline, turning yesterday's perfect answer into today's hallucination.

In this case study, we'll explore how Dropbox built a structured evaluation framework that treats every experiment like production code.

The challenge: From ad-hoc testing to systematic evaluation

In the beginning, Dropbox's evaluations were ad hoc rather than systematic. The team would experiment with different models, retrieval methods, and prompts, but without a standardized process for measuring impact. They noticed that real progress came not just from model selection, but from how they shaped the surrounding process: refining retrieval, tweaking prompts, and balancing consistency with variety in answers.

The breakthrough came when they decided to make evaluation more rigorous. Their rule was straightforward: handle every change with the same care as shipping new code. Every update had to pass testing before it could be merged, with evaluation baked into every step of their process rather than tacked on at the end.

Building the evaluation framework

Step 1: Curate the right datasets

The team started with publicly available datasets to establish baselines. For question answering, they drew on Google's Natural Questions, Microsoft Machine Reading Comprehension (MS MARCO), and MuSiQue. Each brought different strengths: Natural Questions tested retrieval from very large documents, MS MARCO emphasized handling multiple document hits for a single query, and MuSiQue challenged multi-hop question answering.

But public datasets alone weren't enough. To capture the long tail of real-world usage, Dropbox built internal datasets from production logs of Dropbox employees dogfooding Dash. They created two types:

  • Representative query datasets mirrored actual user behavior by anonymizing and ranking top internal queries, with annotations from proxy labels or internal annotators.
  • Representative content datasets focused on the material users rely on most. From widely shared files, documentation, and connected data sources, the team used LLMs to generate synthetic questions and answers spanning diverse cases like tables, images, tutorials, and factual lookups.

Together, these public and internal datasets gave Dropbox a comprehensive test suite that mirrored real-world complexity.
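
To make the synthetic-content approach concrete, here is a minimal sketch of how LLM-generated question-and-answer pairs could be produced from a document corpus. The call_llm helper, prompt wording, and output format are assumptions for illustration; the post doesn't describe Dropbox's actual generation pipeline.

import json

def call_llm(prompt: str) -> str:
    """Hypothetical LLM client; swap in whatever provider SDK your stack uses."""
    raise NotImplementedError

def generate_synthetic_qa(document_text: str, n_questions: int = 3) -> list[dict]:
    """Ask an LLM for question/answer pairs that are answerable only from one document."""
    prompt = (
        "You are building an evaluation dataset.\n"
        f"Write {n_questions} question/answer pairs that can be answered ONLY from the "
        "document below. Cover varied cases (tables, factual lookups, how-to steps). "
        "Return a JSON list of objects with keys 'question' and 'answer'.\n\n"
        f"Document:\n{document_text}"
    )
    return json.loads(call_llm(prompt))

# Usage: iterate over widely shared files and docs, accumulating a dataset.
# dataset = [qa for doc in corpus for qa in generate_synthetic_qa(doc)]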

Step 2: Define actionable metrics with LLM judges

Traditional metrics like BLEU, ROUGE, and BERTScore were useful for quick checks, but they couldn't enforce deployment-ready correctness. The team saw high ROUGE scores even when answers skipped citing sources, strong BERTScore results alongside hallucinated file names, and fluent outputs that buried factual errors.

The solution: use LLMs themselves as judges. A judge model can check factual correctness against ground truth, assess whether every claim is properly cited, enforce formatting and tone requirements, and scale across dimensions that traditional metrics ignore.

Dropbox structured their LLM judges like software modules: designed, calibrated, tested, and versioned. Each evaluation run takes the query, the model's answer, the source context, and occasionally a reference answer. The judge prompt guides the process through structured questions:

  • Does the answer directly address the query?
  • Are all factual claims supported by the provided context?
  • Is the answer clear, well-formatted, and consistent in voice?

The judge responds with both justification and a score:

{
  "factual_accuracy": 4,
  "citation_correctness": 1,
  "clarity": 5,
  "formatting": 4,
  "explanation": "The answer was mostly accurate but referenced
a source not present in context."
}
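
As a rough illustration of how a judge like this could be wired up, the sketch below builds the prompt, calls a model, and parses the scores. The prompt text, call_llm helper, and field names follow the example output above but are otherwise assumptions, not Dropbox's actual implementation.

import json

JUDGE_PROMPT = """You are grading an AI assistant's answer.

Query: {query}
Context: {context}
Answer: {answer}

Does the answer directly address the query? Are all factual claims supported by
the provided context? Is the answer clear, well-formatted, and consistent in voice?
Score factual_accuracy, citation_correctness, clarity, and formatting from 1-5,
add a short explanation, and respond with JSON only."""

def call_llm(prompt: str) -> str:
    """Hypothetical LLM client; replace with your provider's SDK."""
    raise NotImplementedError

def judge(query: str, context: str, answer: str) -> dict:
    raw = call_llm(JUDGE_PROMPT.format(query=query, context=context, answer=answer))
    # Keep the numeric scores and the written critique together so traces stay debuggable.
    return json.loads(raw)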

Every few weeks, the team ran spot-checks on sampled outputs and labeled them manually. These calibration sets let them tune judge prompts, benchmark agreement rates between humans and models, and track drift over time.
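
One simple way to quantify human-judge agreement during these spot-checks is an exact agreement rate plus Cohen's kappa over the sampled labels. The sketch below assumes scikit-learn and integer rubric scores, since the post doesn't name the statistic Dropbox tracked.

from sklearn.metrics import cohen_kappa_score

def agreement_report(human_labels: list[int], judge_labels: list[int]) -> dict:
    """Compare human spot-check labels with LLM-judge scores on the same outputs."""
    exact = sum(h == j for h, j in zip(human_labels, judge_labels)) / len(human_labels)
    kappa = cohen_kappa_score(human_labels, judge_labels)
    return {"exact_agreement": round(exact, 3), "cohens_kappa": round(kappa, 3)}

# Tracking these per calibration run makes drift in the judge itself visible.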

To make the system enforceable, they defined three types of metrics:

Metric type | Examples | Enforcement logic
Boolean gates | "Citations present?", "Source present?" | Hard fail: changes can't move forward
Scalar budgets | Source F1 ≥ 0.85, p95 latency ≤ 5s | Blocks deployment of any change that violates the budget
Rubric scores | Tone, formatting, narrative quality | Logged in dashboards, monitored over time
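
A minimal sketch of how these three tiers might translate into gating logic; the result keys and thresholds mirror the table above and are otherwise illustrative.

def evaluate_gates(results: dict) -> tuple[bool, list[str]]:
    """Return (passed, reasons) for a single evaluation run."""
    failures = []

    # Boolean gates: any miss is a hard fail.
    if not results["citations_present"]:
        failures.append("Boolean gate failed: citations missing")

    # Scalar budgets: block the change if a budget is exceeded.
    if results["source_f1"] < 0.85:
        failures.append(f"Source F1 {results['source_f1']:.2f} below budget of 0.85")
    if results["p95_latency_s"] > 5.0:
        failures.append(f"p95 latency {results['p95_latency_s']:.1f}s above budget of 5s")

    # Rubric scores (tone, formatting, narrative quality) are logged to dashboards
    # rather than blocking, so they are intentionally absent from the failure list.

    return (len(failures) == 0, failures)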

Step 3: Adopt an evaluation platform

Once datasets and metrics were in place, Dropbox needed more structure. Managing scattered artifacts and experiments wasn't sustainable. That's when they adopted Braintrust.

The platform gave them four key capabilities:

  • A central store: a unified, versioned repository for datasets and experiment outputs.
  • An experiment API: each run was defined by its dataset, endpoint, parameters, and scorers, producing an immutable run ID.
  • Dashboards: side-by-side comparisons that highlighted regressions instantly and quantified trade-offs across latency, quality, and cost.
  • Trace-level debugging: one click revealed retrieval hits, prompt payloads, generated answers, and judge critiques.

Spreadsheets broke down fast once real experimentation began. Results were scattered, hard to reproduce, and nearly impossible to compare side by side. Braintrust gave the team a shared place where every run was versioned, every result could be reproduced, and regressions surfaced automatically.
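
Conceptually, each experiment run reduces to a small, immutable record: the dataset version, the endpoint and parameters under test, the scorers, and a derived run ID that later comparisons reference. The sketch below is a generic illustration of that shape, not the Braintrust SDK.

import hashlib
import json
from dataclasses import dataclass, field

@dataclass(frozen=True)
class ExperimentRun:
    dataset_version: str
    endpoint: str
    parameters: dict
    scorer_names: list[str]
    run_id: str = field(init=False, default="")

    def __post_init__(self):
        # Derive a stable run ID from the run's full definition so identical
        # configurations always map to the same identifier.
        payload = json.dumps(
            {"dataset": self.dataset_version, "endpoint": self.endpoint,
             "params": self.parameters, "scorers": self.scorer_names},
            sort_keys=True,
        )
        object.__setattr__(self, "run_id", hashlib.sha256(payload.encode()).hexdigest()[:12])

# run = ExperimentRun("internal-queries-v3", "/api/answer", {"model": "model-a", "top_k": 8}, ["factual_accuracy"])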

Step 4: Automate evaluation in the dev-to-prod pipeline

Dropbox treated prompts, context selection settings, and model choices just like any other application code. They had to pass the same automated checks.

Every pull request kicked off a run of roughly 150 canonical queries, judged automatically, with results returned in under 10 minutes. Once merged, the system reran the full suite along with quick smoke checks for latency and cost. If anything crossed a red line, the change was blocked from moving forward.

Dev event | Trigger | What runs | SLA
Pull request opened | GitHub Action | ~150 canonical queries, judged by scorers | Results return in under ten minutes
Pull request merged | GitHub Action | Canonical suite plus smoke checks | Merge blocked on any red-line miss

These canonical queries were small in number but carefully chosen to cover critical scenarios: multiple document connectors, "no-answer" cases, and non-English queries. Each test recorded the exact retriever version, prompt hash, and model choice to guarantee reproducibility.
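
As a simplified sketch, the CI gate can be a small script that loads the scored results for the canonical suite and exits non-zero on any red-line miss; the file name, result fields, and thresholds here are placeholders.

import json
import sys

def load_results(path: str = "canonical_results.jsonl") -> list[dict]:
    """Scored results for the ~150 canonical queries, one JSON object per line.
    Each record also carries the retriever version, prompt hash, and model choice."""
    with open(path) as f:
        return [json.loads(line) for line in f]

def red_line_failures(results: list[dict]) -> list[str]:
    failures = []
    for r in results:
        f1 = r.get("source_f1", 1.0)
        if not r.get("citations_present", False):
            failures.append(f"{r['query_id']}: citations missing")
        if f1 < 0.85:
            failures.append(f"{r['query_id']}: source F1 {f1:.2f} below 0.85")
    return failures

if __name__ == "__main__":
    failures = red_line_failures(load_results())
    for failure in failures:
        print(failure)
    # A non-zero exit code is what actually blocks the merge in CI.
    sys.exit(1 if failures else 0)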

For larger refactors, on-demand synthetic sweeps handled the complexity. These sweeps began with a golden dataset and ran hundreds of requests in parallel as a Kubeflow DAG. Each run was logged under a unique run_id, making it easy to compare against the last accepted baseline.

Live-traffic scoring continuously sampled production traffic and scored it with the same metrics as offline suites. Dashboards tracked rolling quality and performance medians over one-hour, six-hour, and 24-hour intervals. If metrics drifted beyond thresholds (like a sudden drop in source F1 or a spike in latency), alerts triggered immediately. Because scoring ran asynchronously in parallel with user requests, production traffic saw no added latency.
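
A rough sketch of that rolling-window drift check, assuming sampled scores stream in asynchronously; the window sizes and thresholds below are placeholders rather than Dropbox's actual values.

from collections import deque
from statistics import median

class DriftMonitor:
    """Track the rolling median of a sampled quality metric and flag breaches."""

    def __init__(self, window: int, min_value: float):
        self.values = deque(maxlen=window)
        self.min_value = min_value

    def record(self, value: float) -> bool:
        """Add one sampled score; return True once the rolling median drifts below the floor."""
        self.values.append(value)
        return len(self.values) == self.values.maxlen and median(self.values) < self.min_value

# One monitor per metric and window, e.g. source F1 over roughly an hour of sampled traffic.
source_f1_hourly = DriftMonitor(window=500, min_value=0.85)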

The team controlled risk through layered gates as changes moved through the pipeline. The merge gate ran curated regression tests on every change. The stage gate expanded coverage to larger, more diverse datasets with stricter thresholds. The production gate continuously sampled real traffic. By progressively scaling dataset size and realism at each gate, Dropbox blocked regressions early while ensuring staging and production evaluations stayed aligned with real-world behavior.

Step 5: Close the loop with continuous improvement

Every poorly scored query carried a lesson. By mining low-rated traces from live traffic, the team uncovered failure patterns that synthetic datasets missed: retrieval gaps on rare file formats, prompts cut off by context windows, inconsistent tone in multilingual inputs, or hallucinations triggered by underspecified queries.

These hard negatives flowed directly into the next dataset iteration. Some became labeled examples in the regression suite, while others spawned new variants in synthetic sweeps. This built a virtuous cycle where the system was stress-tested on exactly the edge cases where it once failed.
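
A small sketch of how low-rated traces might be promoted into the next dataset iteration; the trace fields and the score threshold are illustrative.

def mine_hard_negatives(traces: list[dict], threshold: int = 2) -> list[dict]:
    """Turn low-scoring live traces into candidate regression cases."""
    hard_negatives = []
    for trace in traces:
        if trace["judge_scores"]["factual_accuracy"] <= threshold:
            hard_negatives.append({
                "query": trace["query"],
                "context": trace["retrieved_context"],
                "failure_mode": trace["judge_scores"]["explanation"],
                # A reviewed reference answer is filled in before the case
                # joins the regression suite or seeds a synthetic variant.
                "reference_answer": None,
            })
    return hard_negatives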

For riskier experiments like a new chunking policy or reranking model, the team built a structured A/B playground where they could run controlled experiments against consistent baselines without consuming production bandwidth.

The results

By treating evaluation as a first-class discipline, Dropbox transformed their AI development process. Engineers could iterate with speed and confidence, knowing that automated gates caught regressions before they reached users. Dashboards and metrics created a common language for discussing AI quality across teams. Production failures automatically became test cases, creating a continuous learning loop that drove systematic improvement. And the same evaluation logic gated every prompt tweak and retriever update, keeping quality consistent and traceable.

One of the biggest surprises was how many regressions came not from swapping models but from editing prompts. A single word change in an instruction could tank citation accuracy or formatting quality. Formal gates, not human eyeballs, became the only reliable safety net.

Key takeaways

Based on Dropbox's experience building evaluation infrastructure for Dash, here are the critical lessons:

  • Treat evaluation with the same rigor as production code. Version datasets, automate checks, and make evaluation part of your standard development workflow.
  • Judge models and rubrics need their own evaluation. Prompts, instructions, and judge model choices can change outcomes. Regular calibration against human labels is essential.
  • Layer your gates progressively. Start with fast regression tests on PRs, expand to comprehensive datasets in staging, and continuously sample production traffic.
  • Mine production failures for dataset improvement. Every low-scoring output is a chance to improve your test suite and catch similar issues in the future.
  • Prompt changes are code changes. Small edits to instructions can cause major regressions. Automated evaluation is the only way to catch them consistently.

Conclusion

Dropbox Dash demonstrates that systematic evaluation turns probabilistic LLMs into dependable products. By anchoring their development process in rigorous datasets, actionable metrics, and automated gates, the Dropbox team built an AI product that users trust.

If you're building conversational AI at scale and want to replicate Dropbox's evaluation-first approach, get in touch.

Learn more about Dropbox Dash and Braintrust.

Thank you to Ranjitha Gurunath Kulkarni, Ameya Bhatawdekar, and Gonzalo Garcia from the Dropbox team for sharing these insights!