Braintrust offers a self-hosted deployment option that separates data storage from platform management. You deploy and control the infrastructure that stores your sensitive AI data, while Braintrust provides the managed UI, authentication, and platform updates. This gives you full control over your data without the operational overhead of running the entire platform.

Use cases

Self-hosting is designed for organizations with specific requirements:
  • Data residency and compliance: Meet regulatory or contractual obligations by keeping all customer data (experiment logs, traces, datasets, and prompts) within your own cloud account and region.
  • Security posture and isolation: Deploy the data plane behind your firewall or VPN, using your own IAM policies, KMS encryption keys, and audit trails. This ensures sensitive data never traverses external networks.
  • Access to private resources: Connect to internal LLM models, proprietary tools, or private APIs that are not accessible from the public internet. The data plane runs within your network and can access resources in your VPC or private network.

How it works

Braintrust’s architecture has two main components:
  • The data plane stores all sensitive data, including experiment records, logs, traces, spans, datasets, and prompt completions. It consists of the Braintrust API, a PostgreSQL database, Redis cache, object storage, and Brainstore (a high-performance query engine for real-time trace ingestion).
  • The control plane provides the web UI, authentication, user management, and metadata storage (project names, experiment names, organization settings). The control plane does not store or process your sensitive data.
| Data | Location |
| --- | --- |
| Experiment records (input, output, expected, scores, metadata, traces, spans) | Data plane |
| Log records (input, output, expected, scores, metadata, traces, spans) | Data plane |
| Dataset records (input, output, metadata) | Data plane |
| Prompt playground prompts | Data plane |
| Prompt playground completions | Data plane |
| Human review scores | Data plane |
| Project-level LLM provider secrets (encrypted) | Data plane |
| Org-level LLM provider secrets (encrypted) | Control plane |
| API keys (hashed) | Control plane |
| Experiment and dataset names | Control plane |
| Project names | Control plane |
| Project settings | Control plane |
| Git metadata about experiments | Control plane |
| Organization info (name, settings) | Control plane |
| Login info (name, email, avatar URL) | Control plane |
| Auth credentials | Clerk |
When you self-host Braintrust, you deploy the data plane in your own infrastructure using Terraform. On AWS, this uses Lambda functions and EC2 instances; on GCP and Azure, it uses Kubernetes containers. Braintrust continues to host the control plane.

When you use Braintrust’s SDKs, they send data directly to your data plane. When you use the web UI, your browser communicates directly with your data plane via CORS. The control plane and data plane communicate only for authentication and metadata synchronization. Braintrust’s servers and employees do not require access to your data plane for it to operate.

Deployment options

Braintrust provides official Terraform modules for self-hosting on AWS, Google Cloud Platform (GCP), and Azure:
  • AWS: Terraform with Lambda and EC2
  • GCP: Terraform with Kubernetes and Helm
  • Azure: Terraform with Kubernetes and Helm
Braintrust strongly recommends using these Terraform modules because they are kept up-to-date with best practices, mirror the fully hosted offering (proven at scale), minimize configuration issues, and ensure Braintrust can efficiently troubleshoot performance and operational issues. If the module conflicts with your organization’s infrastructure standards, you can deploy Braintrust in a dedicated cloud account or project to address these concerns. If this approach does not work for your situation, contact Braintrust to discuss possible modifications to the modules.
Legacy customers: If you previously deployed using AWS CloudFormation, the CloudFormation guide remains available. This deployment method is not supported for new customers.

Shared responsibility

When you self-host, uptime becomes a shared responsibility between your team and Braintrust:
  • Braintrust is responsible for responding quickly when you have issues, collaboratively resolving them with you, and fixing bugs to improve quality.
  • Your team is responsible for following the documentation, dedicating infrastructure resources to the deployment, and ensuring that in the event of an incident you have staff who are familiar with Braintrust and can work with the Braintrust team to share context and resolve issues.

Monitoring

By default, your self-hosted data plane automatically sends the following telemetry back to the Braintrust-managed control plane:
  • Health check information
  • System metrics (CPU/memory) and Braintrust-specific metrics like indexing lag
  • Billing usage telemetry for aggregate usage metrics
This allows Braintrust to monitor key health indicators and quickly identify issues before they cause downtime. In some cases, Braintrust may ask you to enable additional telemetry to help with troubleshooting, including logs and traces. For more details, see Enable or disable telemetry.

Upgrades

Braintrust releases new versions of the data plane around once per week, often with incremental changes that improve Brainstore performance, add support for new features, and improve logging. You can find the details of each release in the Data plane changelog. Braintrust recommends that you update monthly, and you must update at least once per quarter. If you require support, either to diagnose an issue or improve a feature, Braintrust may ask you to upgrade to the latest version as a first step. Braintrust does not currently offer long-term support releases, and the team is best equipped to support the latest version. For platform-specific upgrade instructions, see the upgrade guides for AWS, GCP, and Azure.

Remote access

There are occasionally issues that require ad-hoc debugging or running manual commands against containers, the Postgres database, or storage buckets to repair the state of the system. Customers who provide Braintrust with remote access (as needed) have experienced much faster resolutions when such issues occur, because the Braintrust team can connect directly and resolve issues. If this is not possible, factor this into your uptime calculations. If uptime of Braintrust is a key metric for you, strongly consider making remote access available to the Braintrust team as needed. If you cannot set up remote access, ensure that you can swiftly access:
  • Containers directly (to update them, view logs, restart them, and view host metrics like CPU, network, memory, and disk utilization)
  • Postgres to run SQL queries
  • Redis to run commands
  • Storage buckets to run read, write, and list commands
Your on-call staff should have basic familiarity with Braintrust and the ability to perform all of these operations.

Hardware requirements

When deploying Braintrust in production, consider these hardware requirements for reliable performance and uptime. These requirements assume typical production usage patterns. For high-utilization deployments, you may need to scale these resources up significantly. Monitor your resource utilization and adjust accordingly.

API service

This section applies to GCP and Azure with Kubernetes. AWS deployments use Lambda functions, which are managed automatically and do not require manual resource configuration.
| Resource | Testing/Staging | Production |
| --- | --- | --- |
| CPU | 1 vCPU | 2+ vCPUs per instance |
| Memory | 2GB RAM | 8GB+ RAM |
| Instance count | 1 | 4+ |
Environment variables:
  • NODE_MEMORY_PERCENT: Set to 80-90 if the API is running on a dedicated instance or container orchestrator with cgroup memory limits (e.g. Kubernetes, ECS).

Database (PostgreSQL)

| Resource | Testing/Staging | Production |
| --- | --- | --- |
| CPU | 2 vCPUs | 8+ vCPUs |
| Memory | 8GB RAM | 64GB+ RAM |
| Storage size | 100GB | 1000GB+ (monitor for growth) |
| Storage IOPS | 3,000 | 15,000+ |
| Version | 15+ | 17+ |

Redis cache

| Resource | Testing/Staging | Production |
| --- | --- | --- |
| CPU | 1 vCPU | 2 vCPUs |
| Memory | 1GB RAM | 4GB+ RAM |
| Version | 7+ | 7+ |
Important for AWS: Avoid using burstable Redis instances (t-family instances like cache.t4g.micro) in production. These instances use CPU credits that can be exhausted during high-load periods, leading to performance throttling. Instead, use non-burstable instances like cache.r7g.large, cache.r6g.medium, or cache.r5.large for predictable performance. Even if these instances seem oversized initially, they provide consistent performance without the risk of CPU credit exhaustion.

Brainstore

| Resource | Testing/Staging | Production |
| --- | --- | --- |
| CPU | 4 vCPUs | 16+ vCPUs (ARM recommended) |
| Memory | 8GB RAM | 32GB+ RAM |
| Storage size | 128GB | 1024GB+ |
| Storage type | SSD | NVMe (ephemeral) |
| Storage IOPS | 150,000+ read/write | 150,000+ read/write |
| Node types | Combined reader/writer | Separate readers and writers |
| Instance count | 1 | 2+ readers, 1+ writers |
Important
  • Brainstore requires separate reader and writer nodes for reliability and performance. Plan for a minimum of 2 reader nodes to ensure high availability. A single writer node is sufficient since writers can tolerate brief downtimes and do not service interactive user requests.
  • Brainstore requires high-performance storage with at least 150,000 IOPS for both reads and writes. Use NVMe-based ephemeral storage (the storage does not need to be persistent). Do not use EBS volumes or other slower storage options like Azure’s standard local disks, as these will significantly degrade performance.
  • For Kubernetes deployments (GCP and Azure), each Brainstore pod must run on its own dedicated node to ensure optimal performance and resource isolation.