With Paul Klein IV, Founder & CEO
Browserbase provides the infrastructure that lets AI browse the web. When agents need to read website data, fill in forms, or interact with pages, companies use Browserbase to give their AI browsing capabilities. Paul Klein IV, Browserbase's founder and CEO, and his engineering team use evals and observability to tackle the unique challenges of building reliable browser agents at scale.
Building browser agents that work reliably is exceptionally difficult. Websites change constantly, popups appear unexpectedly, cookie banners alter page layouts, and sales events rearrange entire workflows. Each of these variations can derail an agent mid-task.
There are so many things that can go wrong, so many things that can change. Measuring how agents can perform on the internet despite all these changes is even more complex.
For browser agents specifically, developers care about speed, reliability, and cost. Inference is often the biggest cost driver because of token volume. Teams also need to choose a model that can process entire HTML pages quickly. Finally, an agent must sift through large amounts of page data to identify the right elements to interact with, and to decide where and how to take the correct action.
Robustness and reliability are critically important when building browser agents. Browserbase is constantly evaluating whether its agents are performing quickly and cost-efficiently, but above all whether they are reliable.
At Browserbase, reliability for a browser agent means getting three things right in sequence:

1. Identifying the right element on the page
2. Interacting with that element correctly
3. Following the right trajectory through the whole task

This is what really matters to Browserbase's customers, and this is what Browserbase uses Braintrust to observe and understand.
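To make the idea concrete, here is a minimal sketch of how the three reliability stages (element identification, interaction, and trajectory) could be combined into a single session score. All type and function names are illustrative assumptions, not part of any real Braintrust or Stagehand API:

```typescript
// Hypothetical reliability scorer for a browser-agent eval.
// Stage names follow the three stages above; everything else is invented
// for illustration.

type StepResult = {
  elementFound: boolean;    // did the agent locate the intended element?
  actionSucceeded: boolean; // did the click/type/submit actually work?
};

type SessionResult = {
  steps: StepResult[];
  reachedGoal: boolean;     // did the full trajectory end at the goal state?
};

function reliabilityScore(session: SessionResult): number {
  const n = session.steps.length;
  if (n === 0) return 0;
  // Stages are sequential: a failed element lookup makes the
  // interaction stage moot for that step.
  const found = session.steps.filter(s => s.elementFound).length / n;
  const acted =
    session.steps.filter(s => s.elementFound && s.actionSucceeded).length / n;
  const goal = session.reachedGoal ? 1 : 0;
  // Weight later stages more heavily: the end-to-end trajectory is
  // what customers ultimately see.
  return 0.2 * found + 0.3 * acted + 0.5 * goal;
}
```

A real eval would compute these stage metrics from traces rather than booleans, but the shape of the aggregation is the same.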
For us, observability over the browser is extremely important because it helps our customers, but also ourselves build better products.
Browserbase thinks of observability as a form of session replay that shows exactly what a browser agent was doing, including the websites it visited, network requests, console logs, and every action taken. But identifying what the model was thinking behind those actions requires a different kind of observability.
What we're often missing is, what was the model thinking? What were the model's answers? That's where observability platforms like Braintrust come in, to help us see the input and output to a model.
When Browserbase customers combine internal model observability with external browser observability, they can build significantly more reliable browser agents. And once their agents are performing reliably, they can go deeper on questions like whether they are building the best browser agent possible for their task.
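One simple way to combine the two kinds of observability is to join model traces and browser events on a shared session id and interleave them by timestamp, so a reviewer can see what the model "thought" right before each browser action. The record shapes below are assumptions for illustration, not the schema of any real product:

```typescript
// Hypothetical merged timeline of model-level and browser-level telemetry.
type ModelTrace = { sessionId: string; input: string; output: string; ts: number };
type BrowserEvent = { sessionId: string; action: string; url: string; ts: number };

type TimelineEntry = { ts: number; kind: "model" | "browser"; detail: string };

function mergeTimeline(traces: ModelTrace[], events: BrowserEvent[]): TimelineEntry[] {
  // Interleave both streams by timestamp into one reviewable timeline.
  const merged: TimelineEntry[] = [
    ...traces.map(t => ({ ts: t.ts, kind: "model" as const, detail: `${t.input} -> ${t.output}` })),
    ...events.map(e => ({ ts: e.ts, kind: "browser" as const, detail: `${e.action} @ ${e.url}` })),
  ];
  return merged.sort((a, b) => a.ts - b.ts);
}
```

With a timeline like this, a model output that names the wrong element can be lined up directly against the browser action that failed, which is the pinpointing Browserbase describes.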
Browserbase maintains Stagehand, an open source SDK that provides a unified tool interface for agents to control browsers in TypeScript, Python, Go, and other languages. Alongside Stagehand, the team publishes benchmarks and evals powered by Braintrust that show which models perform best at browser tasks.
This is an important value for our customers because they don't know which model to choose. When we run all of these benchmarks and evals against Stagehand, it makes it really easy for our customers to know which model's going to work best.
The public eval dashboard has received hundreds of thousands of views. When new computer-use models launch, they appear on the Stagehand eval page immediately.
But Browserbase is clear that benchmarks are a starting point, not the destination. Customers are encouraged to build their own evals tailored to their specific use cases.
One trend Browserbase has observed is that browsing session durations have grown exponentially. AI is browsing for longer than ever, a trend directly correlated with improvements in model quality.
As models improve, the average white-collar worker might control 20 different browsing agents operating at the same time, giving them the leverage to do more work. This means the next challenge for browser agents is long-context reasoning, and Browserbase is building the infrastructure to make that possible.
How do we let agents work for hours and hours on the internet? That's going to require much better models, different types of context engineering techniques, and even better infrastructure.
Thank you to Paul for sharing Browserbase's story.
Learn how Braintrust helps teams like Browserbase combine model observability with browser observability to measure element identification, interaction, and trajectory at scale.