Dataset

/'day.tuh.seht/A versioned collection of test cases used to run evals and track improvements over time. Versioning matters because it keeps comparisons reproducible. (noun)

“We added 50 new examples to the dataset and re-ran the eval.”

Related Datasets terms

Adversarial examples

•

Coverage

•

Dataset record

•

Edge case

•

Expected output (ground truth)

•

flush()

•

Golden dataset

•

Input

•

Metadata

•

Multimodal dataset

From the docs

Build datasets

•

Create experiments

•

Evaluate systematically

•

Add to dataset captures root span only

•

Configure dataset schemas via API

Get started with Evals

Braintrust is the AI observability and eval platform for production AI. By connecting evals and observability in one workflow, teams at Notion, Stripe, Zapier, Vercel, and Ramp ship quality AI products at scale.