Loop
Loop is an AI assistant in Braintrust playgrounds, experiments, datasets, and logs.
In playgrounds, it helps you generate and optimize prompts, datasets, and evals. On the experiments page, it helps you read and interpret a project's experiments. In datasets, it helps you generate and edit rows at scale. In logs, it helps you find analytical insights about your project.
Loop is in public beta and is off by default. To turn it on, enable the feature flag in your settings. If you are on a hybrid deployment, Loop is available starting with v0.0.74.
Selecting a model
Loop uses the AI models available in your Braintrust account via the Braintrust AI Proxy. We currently support the following models:
- claude-4-sonnet
- claude-4.1-opus
- gpt-5
- gpt-4.1
- o3
- o4-mini
- claude-3-5-sonnet
To choose a model, navigate to the gear icon in the Loop chat window and select from the list of available models.
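Loop talks to these models through the Braintrust AI Proxy, which exposes an OpenAI-compatible API. As a point of reference, here is a minimal sketch of calling the proxy directly with one of the models above, assuming a `BRAINTRUST_API_KEY` environment variable is set (the prompt content is illustrative):

```python
import os
from openai import OpenAI

# The Braintrust AI Proxy is OpenAI-compatible, so the standard OpenAI
# client works against it; authenticate with your Braintrust API key.
client = OpenAI(
    base_url="https://api.braintrust.dev/v1/proxy",
    api_key=os.environ["BRAINTRUST_API_KEY"],
)

# The model name must be one of the models listed above.
response = client.chat.completions.create(
    model="gpt-4.1",
    messages=[{"role": "user", "content": "Say hello."}],
)
print(response.choices[0].message.content)
```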
Available tools
Loop currently has the following tools. Tool availability changes based on the page you are viewing:
- Get summarized results: Fetch summarized data for the current page's contents
- Get detailed results: Retrieve detailed data for the current page's contents (evaluation results, dataset rows, etc.)
- Edit prompt: Generate and modify prompts in the playground
- Run eval: Execute evaluations in the playground
- Edit data: Generate and modify datasets
- Get scorers: Get all available scorers in the project
- Edit scorers: Edit the scorer selection in the playground
- Create code scorer: Create or edit a code-based scorer
- Create LLM judge scorer: Create or edit an LLM judge scorer
- BTQL query: Generate and run BTQL queries against project logs
- Infer schema: Inspect project logs to build an understanding of the shape of your data
- Continue execution: Resume tasks after Loop has run out of iterations
You can remove any of these tools from your Loop workflow by selecting the gear icon and deselecting a tool from the available list.
Generating and optimizing prompts
Loop can help you generate a prompt from scratch. To do so, make sure you have an empty task open, then ask Loop to generate one in the chat window.
If you have existing prompts, you can optimize them using Loop.
To optimize a prompt, ask Loop in the chat window, or select the Loop icon in the top bar of any existing task. From there, you can add the prompt to your chat or select Quick optimize.
After Loop provides a suggested optimization, you can review and accept the suggestion or keep iterating.
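For instance, asking Loop to "write a prompt that summarizes customer support tickets" might produce a template like the one below. The wording is hypothetical; the `{{input}}` placeholder is Braintrust's mustache syntax for substituting dataset fields at run time:

```
You are a support analyst. Summarize the following customer ticket in
two sentences, noting the customer's issue and requested resolution.

Ticket: {{input}}
```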
Generating and optimizing datasets
If no dataset exists, Loop can create one automatically. You must have a task defined so that Loop can generate a dataset tailored to the evaluation.
You can review the dataset and further refine it as needed.
After you run your playground, you can also ask Loop to optimize your dataset. The agent will suggest areas for improvement based on an analysis of your current dataset.
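Dataset rows in Braintrust have an `input`, an optional `expected` output, and optional `metadata`. A hypothetical example of the shape of a row Loop might generate (all field values here are illustrative):

```json
{
  "input": "My order arrived damaged. Can I get a refund?",
  "expected": "Apologize, confirm the order number, and offer a refund or replacement.",
  "metadata": { "category": "refunds", "source": "loop-generated" }
}
```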
Analyze project logs
Loop can understand the shape of your project's log data and run arbitrary queries to answer questions about it. You can use this ability on its own to find analytical insights, or in conjunction with Loop's other abilities.
For analytical insights, you can ask things like "What are the most common errors?", "What are the most common inputs from users?", or "What user retention trends do you see?" Loop will gather the necessary data from your logs to answer your question.
To combine this with Loop's other abilities, you might navigate to the dataset page and ask Loop, "Can you find the most common errors users face and generate dataset rows based on the findings? Follow the formatting of the existing rows in this dataset." Loop will gather the necessary context from your logs and generate the dataset rows based on its findings.
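Under the hood, the BTQL query tool generates and runs queries like the following against your project logs. This is a rough sketch only: the aggregation clauses reflect typical BTQL, but field names such as `error` depend entirely on your log schema, which Loop discovers via the Infer schema tool:

```
dimensions: error
measures: count(1) as occurrences
from: project_logs('<your project id>')
filter: error is not null
sort: occurrences desc
limit: 10
```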
Generating and editing scorers
If no scorers exist, Loop can create one for you. You must have a dataset and a task in order for Loop to generate a scorer specific to your use case. The agent begins by checking what data you have and which scorers already exist, then fetches some sample results to understand the data structure.
If you select Accept, the new scorer will be added to the playground.
Loop can also help you improve and edit existing scorers.
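As a concrete illustration, a code-based scorer that Loop scaffolds might look something like the sketch below. This assumes a Python scorer with a `handler` entry point returning a score between 0 and 1; the exact signature and logic Loop generates will depend on your task and data:

```python
def handler(input, output, expected):
    # Hypothetical exact-match scorer: full credit if the model's output
    # equals the expected value, no credit otherwise.
    return 1.0 if output == expected else 0.0
```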
Tune scorers based on target classification
Loop can take manually labeled target classifications from evaluations in the playground and adjust a scorer's classification behavior to match.
Select the rows where the scorer did not perform as expected, then select Tune scorer.
Select the desired classification, optionally provide additional instructions, and submit. Loop will adjust the scorer based on the provided context.
Run and assess evals
After your tasks, dataset, and scorers are set up, Loop can run an evaluation for you, analyze it, and suggest further improvements.
Analyze and interpret your experiments
Loop can read the results of your experiments, summarize them, and help you discover new insights.
Mode
By default, Loop will ask you for confirmation before executing certain tool calls, like running an evaluation. If you'd like Loop to run evaluations without confirmation, you can turn off this setting in the agent mode menu.
Continuous agent
In continuous agent mode, Loop will execute tools and make edit suggestions one after the other.