Glossary
This glossary defines key terms and concepts used in our product and documentation.
Agent
A type of task that can be used in playgrounds. An agent consists of a chained sequence of prompts that automates a complex workflow, where one LLM call’s output feeds into the next.
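For illustration only, here is a minimal sketch of such a chain written against the OpenAI Node SDK; the model name and prompts are placeholders, and in Braintrust itself agents are configured in playgrounds rather than in code.

```typescript
import OpenAI from "openai";

const client = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Two chained prompts: the summary produced by the first call becomes
// part of the prompt for the second call.
async function summarizeThenTranslate(article: string): Promise<string | null> {
  const summary = await client.chat.completions.create({
    model: "gpt-4o-mini", // placeholder model name
    messages: [{ role: "user", content: `Summarize this article:\n\n${article}` }],
  });

  const translation = await client.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [
      {
        role: "user",
        content: `Translate this summary into French:\n\n${summary.choices[0].message.content}`,
      },
    ],
  });

  return translation.choices[0].message.content;
}
```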
Automation
A configured workflow that lets you trigger actions, such as sending an alert, based on specific events in Braintrust.
Benchmark
An evaluation designed to assess model performance across specific capabilities or against industry standards.
Brainstore
The high-performance data engine backing logs, search, and tables.
BTQL
Braintrust Query Language: a SQL-like syntax for querying eval results, logs, and metrics.
Configuration
Project-level settings that define behavior for evals, experiments, and integrations.
Dataset
A versioned collection of pairs of inputs and (optional) expected outputs.
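For example, a few rows as they might look in code (values are illustrative):

```typescript
// Illustrative dataset rows: each pairs an input with an optional expected output.
const rows = [
  { input: "What is the capital of France?", expected: "Paris" },
  { input: "Summarize this support ticket in one sentence." }, // expected output omitted
];
```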
Evaluation / Eval
An eval consists of a task, dataset, and scorer(s). Evaluations can be:
- Offline: run a task on a static dataset with scoring functions (see the sketch after this list).
- Online: real-time scoring on production or test requests.
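As a sketch, an offline eval wired together with the Braintrust TypeScript SDK’s Eval() entry point and an autoevals scorer; the project name, data, and task are illustrative:

```typescript
import { Eval } from "braintrust";
import { Levenshtein } from "autoevals";

Eval("Greeting bot", {
  // Dataset: inputs plus expected outputs.
  data: () => [{ input: "Alice", expected: "Hi Alice" }],
  // Task: the code (or model call) being evaluated.
  task: async (input: string) => `Hi ${input}`,
  // Scorer: compares output to expected.
  scores: [Levenshtein],
});
```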
Experiment
A single run of an offline eval, scoring a specific task against a given dataset.
Human review
An option to route evaluations or tasks to human reviewers instead of, or in addition to, automated scorers.
Log
An instance of a live production or test interaction. Logs can include inputs, outputs, expected values, metadata, errors, scores, and tags. Scorers can also be applied to live logs to conduct online evaluations.
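A minimal logging sketch with the Braintrust TypeScript SDK; the project name and values are illustrative, and the fields mirror the list above:

```typescript
import { initLogger } from "braintrust";

const logger = initLogger({ projectName: "my-project" });

// One log entry for a single production or test interaction.
logger.log({
  input: { question: "What is the capital of France?" },
  output: "Paris",
  expected: "Paris",
  scores: { exact_match: 1 },
  metadata: { model: "gpt-4o-mini" },
  tags: ["smoke-test"],
});
```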
Loop
An AI assistant in the Braintrust UI that can help you with evaluation-related tasks, like optimizing prompts and generating dataset rows.
Metric
A quantitative measure of model performance (for example, accuracy, latency, or cost) tracked over time and across experiments.
Model
An AI system (typically an LLM) that can be evaluated or monitored with Braintrust. Models can be first-party, third-party, or open-source.
Organization
Your company or team “home” in Braintrust. It holds all your projects, members, and settings.
OTEL
OpenTelemetry: the instrumentation standard Braintrust uses to collect and export trace and span data from integrations.
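A sketch of pointing a Node OpenTelemetry exporter at Braintrust; the endpoint URL and header shown here are assumptions, so check the OTel integration docs for the exact values:

```typescript
import { NodeSDK } from "@opentelemetry/sdk-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: "https://api.braintrust.dev/otel/v1/traces", // assumed endpoint
    headers: {
      Authorization: `Bearer ${process.env.BRAINTRUST_API_KEY}`, // assumed auth header
    },
  }),
});

sdk.start(); // spans emitted by instrumented code now export to Braintrust
```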
Playground
An interactive space where you can prototype, iterate on, and compare multiple prompts and models against a dataset in real time. A playground can be saved as an experiment.
Project
A container for related experiments, datasets, and logs. Use projects to segment work by feature, environment (dev/prod), or team.
Prompt
The instruction given to an AI model. Prompts are editable objects you can version and reuse across experiments and playgrounds.
Prompt engineering
The practice of designing, optimizing, and refining prompts to improve AI model outputs and performance.
Regression testing
Evaluations that ensure new model or prompt configurations maintain or improve upon previous performance benchmarks.
Remote eval
An evaluation executed on an external or third-party system or service, allowing you to evaluate tasks in environments outside Braintrust.
Scorer
The component responsible for judging the quality of AI outputs. Scorers may be:
- Rule-based code (see the sketch after this list)
- LLM-based prompts as judges
- Human reviewers
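A rule-based scorer can be as small as a function that compares the output to the expected value and returns a score between 0 and 1; the exact argument shape below is an assumption, not a fixed contract:

```typescript
// Minimal rule-based scorer sketch: exact string match.
function exactMatch({ output, expected }: { output: string; expected?: string }) {
  return {
    name: "exact_match",
    score: output === expected ? 1 : 0,
  };
}
```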
Setting
An organization-level preference or control, including user management, billing, and global integrations.
Span
A single segment within a trace, representing one operation (for example, a model call or tool execution) with its timing and metadata.
Structured output
A defined format (for example, JSON or XML) that models must follow, enabling consistent parsing and scoring of responses.
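For instance, a model might be instructed to reply with JSON matching a shape like the following (illustrative, not a Braintrust-specific schema), so scorers can parse fields instead of free text:

```typescript
// The expected shape of a structured response.
interface TicketTriage {
  category: "billing" | "bug" | "feature_request";
  urgency: number; // 1 (low) to 5 (high)
  summary: string;
}

// A response that conforms to the shape and can be parsed and scored field by field.
const example: TicketTriage = {
  category: "bug",
  urgency: 4,
  summary: "Checkout fails when the cart contains more than 20 items.",
};
```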
Task
A single unit of work, typically composed of an input, output, expected result, and evaluation. Tasks often appear within dataset or eval detail screens.
Trace
An individual recorded session detailing each step of an interaction: model calls, tool invocations, and intermediate outputs. Traces aid debugging and root-cause analysis.
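A sketch of producing a trace with the Braintrust TypeScript SDK’s traced helper; callModel is a hypothetical stand-in for a real model call, and the span name and fields are illustrative:

```typescript
import { initLogger, traced } from "braintrust";

initLogger({ projectName: "my-project" });

// Placeholder for a real model call.
async function callModel(question: string): Promise<string> {
  return `Answer to: ${question}`;
}

// Each traced callback becomes a span; nested traced calls build up the full trace.
async function answerQuestion(question: string) {
  return traced(
    async (span) => {
      const output = await callModel(question);
      span.log({ input: question, output });
      return output;
    },
    { name: "answerQuestion" },
  );
}
```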
User feedback
End-user inputs and ratings collected from production that inform model performance tracking and future evals.