Customer Reliability Manager

San Francisco, New York City

About the company

Braintrust is the AI observability platform. By connecting evals and observability in one workflow, Braintrust gives builders the visibility to understand how AI behaves in production and the tools to improve it.

Teams at Notion, Stripe, Zapier, Vercel, and Ramp use Braintrust to compare models, test prompts, and catch regressions — turning production data into better AI with every release.

About the Role

At Braintrust, exceptional support is one of our most important strategic advantages. Support is part of Engineering at Braintrust and exists to help reduce friction in the deployment and operation of our product. Our customers are developers building LLM-powered applications, and they move fast. We win by helping them move faster.

We’re looking for a manager to build and lead a team of highly senior, knowledgeable Customer Reliability Engineers who provide exceptionally high-quality support focused on customer infrastructure. This team is responsible for reducing friction associated with Braintrust's deployment models (hybrid, BYOC, and SaaS Enterprise). Engineers on this team directly scope and attempt fixes for infrastructure issues, manage high-stakes customer environments, and ensure product reliability across all customer deployment types.

This role blends engineering leadership, deployment expertise, and customer experience. If you love upleveling Senior+ talent, scaling complex, cutting-edge support motions, and reducing pain for developers, we’d love to talk with you.

What You’ll Do

  • Lead and grow a team of Customer Reliability Engineers, delivering reliable, high-touch support across all Braintrust deployment models: hybrid, Bring Your Own Cloud (BYOC), and enterprise SaaS.

  • Own the primary after-hours on-call rotation for customer-reported SEV1s, with backup coverage from Customer Solution Architects (CSAs) and Developer Support Engineers.

  • Run incident response and escalation, including enabling customer infrastructure teams while jumping in hands-on for the highest-severity issues.

  • Own day-to-day tickets tied to deployments, upgrades, and performance troubleshooting.

  • Triage and scope deployment-related feature requests and bug reports, attempt fixes when feasible, and route custom work to Professional Services when needed.

  • Lead new BYOC deployments and upgrades.

  • Respond to high-severity alerts for BYOC customers.

  • Validate each new data plane release against the standard hybrid deployment, and partner with Docs to ship upgrade guidance alongside the changelog.

  • Coach and mentor the team on infrastructure debugging, deployment best practices, and strong customer ownership.

  • Synthesize customer feedback and operational trends for Product and Engineering to improve reliability and reduce recurring pain points.

You Might Be a Fit If You

  • Have 5–10+ years of experience leading support for developer-facing products.

  • Are deeply familiar with deploying Terraform-, Helm-, and Kubernetes-based infrastructure across major cloud providers.

  • Are comfortable reviewing, debugging, and reasoning about backend services, infrastructure, and deployment configurations.

  • Take ownership of customer-impacting issues end-to-end, ensuring accountability, follow-through, and continuous improvement.

  • Communicate clearly and empathetically, especially when navigating ambiguity or high-stakes customer situations.

  • Are deeply curious about LLM use cases and excited to lead teams building cutting-edge support systems for AI products that are measurable, reliable, and trustworthy.

Bonus Points For

  • Familiarity with OpenAI, Anthropic, or similar LLM providers at a systems or integration level.

  • Experience guiding teams working with datasets, evaluation metrics, or prompt engineering.

  • A track record of building or scaling support tooling, documentation programs, or product-led growth initiatives.

  • Experience as a senior technical leader or tech lead in a high-growth startup environment.

  • A history of partnering hands-on with Engineering on production fixes for backend services, SDKs, or infrastructure.

  • Experience leading support for products with self-hosted offerings (e.g., Terraform, Kubernetes) and comfort leading incident response involving customer-owned containerized environments.

Benefits include

  • Medical, dental, and vision insurance

  • Daily lunch, snacks, and beverages

  • Flexible time off

  • Competitive salary and equity

  • AI stipend

Equal opportunity

Braintrust is an equal opportunity employer. All applicants will be considered for employment without attention to race, color, religion, sex, sexual orientation, gender identity, national origin, veteran or disability status.
