
How Browserbase helps customers build reliable browser agents

With Paul Klein IV, Founder & CEO

100,000s
Public eval dashboard views

Browserbase provides the infrastructure that lets AI browse the web. When agents need to read website data, fill in forms, or interact with pages, companies use Browserbase to give their AI browsing capabilities. Paul Klein IV, Browserbase's founder and CEO, and his engineering team use evals and observability to tackle the unique challenges of building reliable browser agents at scale.

The challenge: the web is vast and constantly changing

Building browser agents that work reliably is exceptionally difficult. Websites change constantly, popups appear unexpectedly, cookie banners alter page layouts, and sales events rearrange entire workflows. Each of these variations can derail an agent mid-task.

There are so many things that can go wrong, so many things that can change. Measuring how agents can perform on the internet despite all these changes is even more complex.

For browser agents specifically, developers care about speed, reliability, and cost. Inference is often the biggest bottleneck, both in latency and in token cost, so teams need to choose a model that can process entire HTML pages quickly. And because an agent must sift through large amounts of page data to identify the right elements and decide where and how to act, efficient processing matters at every step.

Defining reliability for browser agents

Robustness and reliability are critical for browser agents. Browserbase constantly evaluates whether its agents perform quickly and cost-efficiently, but above all whether they are reliable.

At Browserbase, reliability for a browser agent means getting three things right in sequence:

  1. Element identification: Given a prompt, can the agent find the correct button, link, or input on the page?
  2. Interaction: Can it actually interact with that element correctly? Some buttons require a simple click, others need a click followed by a dropdown, a scroll, or another sequence. The web has countless interaction patterns.
  3. Trajectory: After interacting, does the agent take the correct next step? Clicking the right button is not enough if the agent then proceeds down the wrong path.
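The sequential nature of these layers can be made concrete with a small scorer: a failure at one layer makes the later layers meaningless, so scoring stops at the first failure. This is a hypothetical Python sketch, not Browserbase's actual implementation; the `StepResult` fields and scoring scheme are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class StepResult:
    element_found: bool   # did the agent locate the right element? (hypothetical field)
    interaction_ok: bool  # did the click/scroll/select succeed?
    on_trajectory: bool   # is the resulting page state the expected next step?

def reliability_score(step: StepResult) -> float:
    """Score the three layers in sequence, stopping at the first failure."""
    checks = [step.element_found, step.interaction_ok, step.on_trajectory]
    passed = 0
    for ok in checks:
        if not ok:
            break
        passed += 1
    return passed / len(checks)

# A step that found and clicked the right button but then went down
# the wrong path scores 2/3 rather than full credit.
print(reliability_score(StepResult(True, True, False)))
```

Scoring in sequence, rather than averaging three independent checks, reflects the point above: clicking the right button is not enough if the agent then proceeds down the wrong path.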

This is what really matters to Browserbase's customers, and this is what Browserbase uses Braintrust to observe and understand.

For us, observability over the browser is extremely important because it helps our customers, but also ourselves build better products.

Combining browser and model observability

Browserbase thinks of observability as a form of session replay that shows exactly what a browser agent was doing, including the websites it visited, network requests, console logs, and every action taken. But identifying what the model was thinking behind those actions requires a different kind of observability.

What we're often missing is, what was the model thinking? What were the model's answers? That's where observability platforms like Braintrust come in, to help us see the input and output to a model.

When Browserbase customers combine internal model observability with external browser observability, they can build significantly more reliable browser agents. And once their agents are performing reliably, they can go deeper on questions like whether they are building the best browser agent possible for their task.
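One way to picture combining the two kinds of observability is to join browser-side events and model-side traces on a shared session identifier, producing a single timeline where each browser action sits next to the model decision that caused it. The sketch below is a simplified illustration under that assumption; the field names (`session_id`, `t`) and record shapes are hypothetical, not the Braintrust or Browserbase data model.

```python
# Illustrative data: what a browser recorded vs. what the model decided.
browser_events = [
    {"session_id": "s1", "t": 10.2, "action": "click", "selector": "#buy"},
    {"session_id": "s1", "t": 11.0, "action": "navigate", "url": "/cart"},
]
model_spans = [
    {"session_id": "s1", "t": 10.1, "input": "find the buy button",
     "output": 'click("#buy")'},
]

def merge_timeline(events, spans):
    """Interleave browser events and model spans into one timeline,
    ordered by timestamp, so each action can be read next to the
    model reasoning that preceded it."""
    tagged = [("browser", e) for e in events] + [("model", s) for s in spans]
    return sorted(tagged, key=lambda pair: pair[1]["t"])

for kind, record in merge_timeline(browser_events, model_spans):
    print(kind, record["t"])
```

With a merged timeline like this, a failed step can be attributed to either side: if the model's output was wrong, the model is at fault; if the output was right but the browser action failed, the failure is in execution.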

Without model observability:

  • Browser session replay only
  • No visibility into model reasoning
  • Can't diagnose why actions were taken
  • Benchmarks without custom evals
  • Can't pinpoint model vs. browser failures

Stagehand: open source evals for the community

Browserbase maintains Stagehand, an open source SDK that provides a unified tool interface for agents to control browsers in TypeScript, Python, Go, and other languages. Alongside Stagehand, the team publishes benchmarks and evals powered by Braintrust that show which models perform best at browser tasks.

This is an important value for our customers because they don't know which model to choose. When we run all of these benchmarks and evals against Stagehand, it makes it really easy for our customers to know which model's going to work best.

The public eval dashboard has received hundreds of thousands of views. When new computer-use models launch, they appear on the Stagehand eval page immediately.

But Browserbase is clear that benchmarks are a starting point, not the destination. Customers are encouraged to build their own evals tailored to their specific use cases.
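A custom eval for a browser agent can be as simple as comparing the agent's recorded action trajectory on your own site against the expected one, rather than relying on a generic public benchmark. The scorer below is a hypothetical sketch of that idea; the step names and scoring rule are illustrative assumptions, not a real Browserbase or Braintrust eval.

```python
def trajectory_score(expected: list[str], actual: list[str]) -> float:
    """Fraction of the expected trajectory completed, in order,
    before the agent's first deviation."""
    matched = 0
    for want, got in zip(expected, actual):
        if want != got:
            break
        matched += 1
    return matched / len(expected)

# Hypothetical checkout flow for an e-commerce use case.
checkout = ["open_product", "add_to_cart", "open_cart", "checkout"]

# An agent that added to cart but then opened search instead of the
# cart completed half of the expected trajectory.
print(trajectory_score(checkout, ["open_product", "add_to_cart", "open_search"]))
```

A scorer like this is tailored to one team's workflow, which is exactly why off-the-shelf benchmarks can't replace it: the expected trajectory only exists for your specific use case.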

What's next: long-context browsing

One trend Browserbase has observed is that browsing session durations have grown exponentially. AI is browsing longer than ever, a trend directly correlated with improvements in model quality.

As models improve, the average white-collar worker might control 20 different browsing agents operating at the same time, gaining the leverage to do far more work. This means the next challenge for browser agents is long-context reasoning, and Browserbase is working on the infrastructure to make that possible.

How do we let agents work for hours and hours on the internet? That's going to require much better models, different types of context engineering techniques, and even better infrastructure.

Key takeaways

  • Connect evals to observability. Building reliable browser agents requires measuring both what the model decided and what the agent actually did.
  • Publish public benchmarks. Transparent evals help the entire community choose the right models.
  • Encourage custom evals. Off-the-shelf benchmarks are a starting point. Real quality comes from evals tailored to specific use cases.
  • Define reliability in layers. For browser agents, reliability means getting element identification, interaction, and trajectory all correct in sequence.

Thank you to Paul for sharing Browserbase's story.

Build reliable browser agents with observability

Learn how Braintrust helps teams like Browserbase combine model observability with browser observability to measure element identification, interaction, and trajectory at scale.

Book a demo