The infrastructure behind AI development: Why testing and observability matter

21 August 2025 · Sarah Zeng

A new layer of infrastructure for AI

We're watching a new layer of infrastructure emerge in real time—one that mirrors the rise of CI/CD, observability, and DevOps in the early days of modern software development. Except this time, the workloads aren't deterministic programs; they're probabilistic systems driven by LLMs that can produce different results every time. Just as CI/CD pipelines became standard in software engineering, systematic evaluation is becoming a baseline requirement for AI development.

As AI products become more deeply integrated into business-critical workflows, the question is no longer, "Can you build an AI product?" but instead, "Can you trust what you've built, measure how it is performing, and iterate with confidence?"

That's where testing (evals) and observability come in.

Why traditional software practices break down

Software testing traditionally relies on deterministic behavior. You write unit tests, integration tests, and end-to-end tests with the assumption that given the same inputs, your program will produce the same outputs.
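To make that assumption concrete, here is a minimal, purely illustrative example of the kind of deterministic check a traditional test suite relies on (the function and test are hypothetical):

```python
def slugify(title: str) -> str:
    # Deterministic: the same input always produces the same output.
    return title.strip().lower().replace(" ", "-")


def test_slugify():
    # An exact-match assertion is meaningful here precisely because
    # the function's behavior is fully determined by its input.
    assert slugify("Hello World") == "hello-world"
```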

LLMs break that assumption for a few reasons:

Non-determinism: LLMs can produce many plausible outputs for the same input. For instance, type "Hello, world" into ChatGPT twice and you will most likely get two different responses. This alone makes exact-match test assertions inadequate.

Subjectivity: "Correctness" often depends on human judgment. You can't always write an assert statement for relevance, helpfulness, or tone (see the sketch after this list).

Multi-step workflows: AI apps frequently chain multiple steps—retrieval, reasoning, summarization, tool use. Consider an AI research assistant that must understand ambiguous queries, search multiple data sources, synthesize conflicting information, and cite sources appropriately. Each component can fail independently, and failures can cascade across the pipeline, so evaluation frameworks need to assess both final outputs and intermediate reasoning steps.

Scale of inputs: Real-world usage involves millions of varied prompts. Teams need infrastructure to track, test, and understand that volume.
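As a rough illustration of why this changes testing, here is a sketch of the score-based checks teams reach for instead of exact-match asserts. The scorer below is a stand-in: in practice it might be an embedding-similarity metric or an LLM-as-judge call, and all names here are hypothetical.

```python
from typing import Callable

# Hypothetical stand-in for a real scorer (embedding similarity, an
# LLM-as-judge call, etc.). It returns a score in [0, 1] rather than
# a pass/fail boolean, because "correct" is a matter of degree.
def relevance_score(question: str, answer: str) -> float:
    keywords = set(question.lower().split())
    hits = sum(1 for word in answer.lower().split() if word in keywords)
    return min(1.0, hits / max(1, len(keywords)))


def evaluate(cases: list[dict], generate: Callable[[str], str], threshold: float = 0.7) -> float:
    # Instead of asserting exact outputs, score each case and look at the
    # aggregate; this is what makes evaluation tractable at scale.
    scores = [relevance_score(c["input"], generate(c["input"])) for c in cases]
    avg = sum(scores) / len(scores)
    print(f"avg relevance: {avg:.2f} ({'pass' if avg >= threshold else 'fail'})")
    return avg
```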

The result is that engineers either over-index on manual QA (slow and expensive) or deploy blindly and hope nothing breaks.

What AI teams are doing today (and why it doesn't scale)

  • Manual review in spreadsheets
  • Gut-feel shipping based on a few hand-checked examples
  • Custom in-house tools that don't generalize

None of these approaches scale. (You could, perhaps, build a very robust in-house solution, but why do that instead of focusing on what, as Jeff Bezos puts it, makes your beer taste better?) They slow down iteration, fail to catch regressions, and erode trust in the system. Worse, they create infrastructure debt that compounds as AI systems become more complex.

The first generation of prompt-management, evals, and observability solutions took some of the burden off AI teams, but they struggled with reliability and scalability: AI workloads are orders of magnitude larger than their traditional counterparts, and these v0 and v1 solutions were constantly breaking under them.

The emergence of evaluation infrastructure

The lack of reliable, structured evaluation has become one of the major bottlenecks to shipping trustworthy AI products. That presents a category-defining opportunity: we're at the start of a wave similar to the one companies like Datadog enabled for traditional engineering. I expect the most successful companies to be platform-agnostic, enabling teams to compare different models and providers rather than being locked into specific vendors. The top solutions will also strengthen organically over time: more data flowing through the system improves automated scoring, creating sustainable competitive moats similar to the vertical search companies I've written about previously.

In the AI era, this new layer is being built by companies like Braintrust.

What is Braintrust? Braintrust is a systematic framework for testing, monitoring, and improving AI agents. The platform lets technical users (via SDKs) and non-technical users (via the UI) alike run test cases against a dataset ("offline evals") and track the real-time performance of their AI applications in production ("online evals"). The observability piece lets teams see how models behave in the wild, and it becomes particularly important when multiple LLM calls are chained together and you need to identify which one is causing a problem.
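For a flavor of the offline-eval workflow, here is a minimal sketch in the spirit of Braintrust's Python quickstart; treat the exact API as illustrative and check the current docs before relying on it.

```python
from braintrust import Eval
from autoevals import Levenshtein  # off-the-shelf scorer from the autoevals library

Eval(
    "Say Hi Bot",  # project name
    data=lambda: [{"input": "Foo", "expected": "Hi Foo"}],  # the eval dataset
    task=lambda input: "Hi " + input,  # the code under test: your prompt, chain, or agent
    scores=[Levenshtein],  # how outputs are scored against expectations
)
```

In a real application the task would wrap an LLM call, and the scorers would include model-graded or custom metrics rather than simple string distance.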

A key element that differentiates Braintrust from the v0s and v1s of this infrastructure layer is that it is purpose-built to handle heavy AI development. I won't dive into too much detail in this blog post, but features like Brainstore (an eval-native database designed to support enormous AI workloads) and Loop (an AI agent that builds prompts, datasets, and evals) make it easy for teams to scale their evals and observability with their own applications.

Why this matters now

Perhaps you think you're still too early in your AI adoption to need evals, or you're perfectly happy with your existing tooling. Even so, don't underestimate the value of adopting an end-to-end platform like Braintrust. Teams with systematic evaluation processes iterate faster and deploy more confidently than those relying on manual testing, and that advantage compounds as AI systems become more complex.

Moreover, three forces are converging:

Production AI is here. From customer support agents to internal copilots, LLMs are powering real products.

Complexity is increasing. Chains, RAG, tools, fine-tuning, feedback loops—modern AI stacks are deep.

Reliability is non-negotiable. Enterprises and consumers won't tolerate systems that hallucinate, fail, or degrade without explanation.

This demands a new breed of infrastructure: testable, observable, explainable.

The bigger market opportunity

Zoom out, and this is about more than AI testing — the real question is what the next phase of infrastructure software will look like.

CI/CD for AI: The next generation of DevOps could include evaluation pipelines that run on every model deployment (a sketch follows this list).

Monitoring and observability: Logging prompt behavior and detecting performance changes in real time.

Collaborative debugging: Shared platforms for technical and non-technical folks to inspect model behavior together.

Model-agnostic development: Infrastructure that works across providers and foundation models.
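To make the first and last of these concrete, here is a hypothetical sketch of an evaluation gate that a CI pipeline could run on every deployment, comparing providers behind a common interface. Everything here (the provider registry, cases, and threshold) is invented for illustration.

```python
import sys
from typing import Callable

# Hypothetical provider registry: each entry maps a model name to a callable
# that takes a prompt and returns a completion. In a real pipeline these
# would wrap actual provider SDKs; here they are stubs.
PROVIDERS: dict[str, Callable[[str], str]] = {
    "provider-a/model-x": lambda prompt: "Our refund policy allows returns within 30 days.",
    "provider-b/model-y": lambda prompt: "Refunds are processed within 5 business days.",
}

CASES = [{"input": "Summarize our refund policy", "expected_keyword": "refund"}]
THRESHOLD = 0.9  # minimum pass rate required before a deploy is allowed


def pass_rate(generate: Callable[[str], str]) -> float:
    # Deliberately simple keyword check; a real gate would use richer scorers.
    passed = sum(1 for c in CASES if c["expected_keyword"] in generate(c["input"]).lower())
    return passed / len(CASES)


if __name__ == "__main__":
    results = {name: pass_rate(fn) for name, fn in PROVIDERS.items()}
    for name, rate in results.items():
        print(f"{name}: pass rate {rate:.0%}")
    # Fail the CI job (non-zero exit) if the model being deployed regresses.
    if results["provider-a/model-x"] < THRESHOLD:
        sys.exit(1)
```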

Ultimately, reliable AI isn't just about picking the right model or tuning the right prompt. It's about building the scaffolding around your system so that every change is testable, observable, and improvable.

This is why evaluation infrastructure matters so much and why Braintrust is positioned to play a central role in the next generation of AI development. Just as CI/CD became the default for software, systematic evaluation and observability will become the default for AI development.