With Paul Klein IV, Founder & CEO
Browserbase provides the infrastructure that lets AI browse the web. When agents need to read website data, fill in forms, or interact with pages, companies use Browserbase to give their AI browsing capabilities. Paul Klein IV, Browserbase's founder and CEO, and his engineering team use evals and observability to tackle the unique challenges of building reliable browser agents at scale.
Building browser agents that work reliably is exceptionally difficult. Websites change constantly, popups appear unexpectedly, cookie banners alter page layouts, and sales events rearrange entire workflows. Each of these variations can derail an agent mid-task.
There are so many things that can go wrong, so many things that can change. Measuring how agents can perform on the internet despite all these changes is even more complex.
For browser agents specifically, developers care about speed, reliability, and cost. Inference is often the biggest cost driver because of token volume. Teams also need to choose a model that can process entire HTML pages quickly. Finally, an agent must sift through large amounts of page data to identify the right elements to interact with, and to decide where and how to take the correct action.
Robustness and reliability are critically important when building browser agents. Browserbase is constantly evaluating whether its agents are performing quickly and cost-efficiently, but above all whether they are reliable.
At Browserbase, reliability for a browser agent means getting three things right in sequence:

1. Identifying the right element on the page
2. Interacting with that element correctly
3. Following the right trajectory through the whole task

This is what really matters to Browserbase's customers, and this is what Browserbase uses Braintrust to observe and understand.
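To make the idea concrete, here is a minimal sketch of how the three reliability stages (element identification, interaction, and trajectory) could be combined into a single session score. All type and function names are illustrative assumptions, not part of any real Braintrust or Stagehand API:

```typescript
// Hypothetical reliability scorer for a browser-agent eval.
// Stage names follow the three stages above; everything else is invented
// for illustration.

type StepResult = {
  elementFound: boolean;    // did the agent locate the intended element?
  actionSucceeded: boolean; // did the click/type/submit actually work?
};

type SessionResult = {
  steps: StepResult[];
  reachedGoal: boolean;     // did the full trajectory end at the goal state?
};

function reliabilityScore(session: SessionResult): number {
  const n = session.steps.length;
  if (n === 0) return 0;
  // Stages are sequential: a failed element lookup makes the
  // interaction stage moot for that step.
  const found = session.steps.filter(s => s.elementFound).length / n;
  const acted =
    session.steps.filter(s => s.elementFound && s.actionSucceeded).length / n;
  const goal = session.reachedGoal ? 1 : 0;
  // Weight later stages more heavily: the end-to-end trajectory is
  // what customers ultimately see.
  return 0.2 * found + 0.3 * acted + 0.5 * goal;
}
```

A real eval would compute these stage metrics from traces rather than booleans, but the shape of the aggregation is the same.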
For us, observability over the browser is extremely important because it helps our customers, but also ourselves build better products.
Browserbase thinks of observability as a form of session replay that shows exactly what a browser agent was doing, including the websites it visited, network requests, console logs, and every action taken. But identifying what the model was thinking behind those actions requires a different kind of observability.
What we're often missing is, what was the model thinking? What were the model's answers? That's where observability platforms like Braintrust come in, to help us see the input and output to a model.
When Browserbase customers combine internal model observability with external browser observability, they can build significantly more reliable browser agents. And once their agents are performing reliably, they can go deeper on questions like whether they are building the best browser agent possible for their task.
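One simple way to combine the two kinds of observability is to join model traces and browser events on a shared session id and interleave them by timestamp, so a reviewer can see what the model "thought" right before each browser action. The record shapes below are assumptions for illustration, not the schema of any real product:

```typescript
// Hypothetical merged timeline of model-level and browser-level telemetry.
type ModelTrace = { sessionId: string; input: string; output: string; ts: number };
type BrowserEvent = { sessionId: string; action: string; url: string; ts: number };

type TimelineEntry = { ts: number; kind: "model" | "browser"; detail: string };

function mergeTimeline(traces: ModelTrace[], events: BrowserEvent[]): TimelineEntry[] {
  // Interleave both streams by timestamp into one reviewable timeline.
  const merged: TimelineEntry[] = [
    ...traces.map(t => ({ ts: t.ts, kind: "model" as const, detail: `${t.input} -> ${t.output}` })),
    ...events.map(e => ({ ts: e.ts, kind: "browser" as const, detail: `${e.action} @ ${e.url}` })),
  ];
  return merged.sort((a, b) => a.ts - b.ts);
}
```

With a timeline like this, a model output that names the wrong element can be lined up directly against the browser action that failed, which is the pinpointing Browserbase describes.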
Browserbase maintains Stagehand, an open source SDK that provides a unified tool interface for agents to control browsers in TypeScript, Python, Go, and other languages. Alongside Stagehand, the team publishes benchmarks and evals powered by Braintrust that show which models perform best at browser tasks.
This is an important value for our customers because they don't know which model to choose. When we run all of these benchmarks and evals against Stagehand, it makes it really easy for our customers to know which model's going to work best.
The public eval dashboard has received hundreds of thousands of views. When new computer-use models launch, they appear on the Stagehand eval page immediately.
But Browserbase is clear that benchmarks are a starting point, not the destination. Customers are encouraged to build their own evals tailored to their specific use cases.
One trend Browserbase has observed is that browsing session durations have grown exponentially. AI is browsing for longer than ever, a trend directly correlated with improvements in model quality.
As models improve, the average white-collar worker might control 20 different browsing agents operating at the same time, giving them the leverage to do more work. This means the next challenge for browser agents is long-context reasoning, and Browserbase is building the infrastructure to make that possible.
How do we let agents work for hours and hours on the internet? That's going to require much better models, different types of context engineering techniques, and even better infrastructure.
Thank you to Paul for sharing Browserbase's story.
Learn how Braintrust helps teams like Browserbase combine model observability with browser observability to measure element identification, interaction, and trajectory at scale.