
Braintrust Java SDK: AI observability and evals for the JVM

23 October 2025 · Andrew Kent

Java developers are building LLM applications across banking, healthcare, and enterprise software, but most AI observability and evaluation tools only target Python or TypeScript. The few JVM options either lack AI-specific features or require rebuilding your monitoring stack from scratch.

We built the Braintrust Java SDK to fix this. It's an open-source SDK for AI observability and evaluation that runs on Java 17+, built on OpenTelemetry so it fits into existing infrastructure.

Star us on GitHub

The problem

If you're building LLM features in Java, you've likely run into these issues:

  • Tracking LLM calls in production requires custom instrumentation to capture inputs, outputs, latency, token usage, and costs per request
  • Testing prompt changes or model swaps means either manual QA or writing custom test harnesses that don't integrate with existing eval tools
  • A/B testing prompts requires building feature flags, routing logic, and result tracking from scratch
  • Most AI observability tools target Python/TypeScript and don't provide Java clients or JVM-compatible instrumentation

What we built

The SDK provides AI observability and evaluation for Java applications. It requires Java 17 or higher and uses modern language features (records, pattern matching, and more) where appropriate.

What's included:

  • OpenTelemetry-based tracing: spans and traces built on OpenTelemetry, not proprietary instrumentation. Export traces to Braintrust, Datadog, Honeycomb, or any OTLP-compatible backend. Because the SDK follows standard OpenTelemetry semantic conventions, it fits alongside existing OpenTelemetry setups without conflicts (see the sketch after this list).
  • Wrappers for OpenAI and Anthropic clients that automatically instrument LLM calls. Instrumentation is opt-in per client, so your existing Java services don't change unless you explicitly wrap AI clients.
  • An evaluation framework that runs in CI/CD with support for custom scorers
  • Support for fetching prompts from Braintrust, managing datasets, and viewing traces in the UI
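
To make the OTLP point concrete, here's a minimal, hand-rolled pipeline using the standard opentelemetry-java SDK that exports spans to any OTLP endpoint. The SDK's openTelemetryCreate() (used in the examples below) builds an equivalent pipeline preconfigured for Braintrust; the endpoint here is just a placeholder:

import io.opentelemetry.api.OpenTelemetry;
import io.opentelemetry.exporter.otlp.trace.OtlpGrpcSpanExporter;
import io.opentelemetry.sdk.OpenTelemetrySdk;
import io.opentelemetry.sdk.trace.SdkTracerProvider;
import io.opentelemetry.sdk.trace.export.BatchSpanProcessor;

// Plain OpenTelemetry setup: point the exporter at any OTLP-compatible backend
var exporter = OtlpGrpcSpanExporter.builder()
    .setEndpoint("http://localhost:4317") // placeholder OTLP endpoint
    .build();

var tracerProvider = SdkTracerProvider.builder()
    .addSpanProcessor(BatchSpanProcessor.builder(exporter).build())
    .build();

OpenTelemetry openTelemetry = OpenTelemetrySdk.builder()
    .setTracerProvider(tracerProvider)
    .build();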

What you can do with it

Track LLM calls in production: Every instrumented call captures input/output, latency, token counts, and costs. When debugging production issues, you can filter traces by metadata, search through prompts and responses, and see exactly what the model received and returned.
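
Because the tracing is plain OpenTelemetry, you can tag the active span with your own attributes through the standard API and filter on them later. A short sketch (the attribute names are illustrative, not a fixed schema):

import io.opentelemetry.api.trace.Span;

// Attach searchable metadata to whatever span is currently active
Span.current().setAttribute("app.user_tier", "enterprise");
Span.current().setAttribute("app.feature", "summarization");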

Run evals in CI/CD: Write test cases with expected outputs, define custom scoring functions, and run them on every commit. When you change a prompt or switch models, the eval framework shows which test cases passed, which failed, and aggregate scores across your test suite.

Fetch prompts from Braintrust: Instead of hardcoding prompts in your application, store them in Braintrust and fetch them at runtime. This lets you iterate on prompts without redeploying code and makes A/B testing different versions straightforward.
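
The exact prompt-fetching API is documented in the README. Purely to show the shape of the pattern, here's a sketch in which loadPrompt() is a hypothetical stand-in for that call, not the SDK's real signature:

// loadPrompt() is a hypothetical stand-in for the SDK's prompt-fetching call;
// the point is that prompt text arrives at runtime rather than from source code.
String systemPrompt = loadPrompt("food-classifier"); // fetched from Braintrust by slug

var request = ChatCompletionCreateParams.builder()
    .model(ChatModel.GPT_4O_MINI)
    .addSystemMessage(systemPrompt) // swap prompt versions without a redeploy
    .addUserMessage("What kind of food is asparagus?")
    .build();

// Stub so the sketch compiles; the real implementation calls Braintrust.
static String loadPrompt(String slug) {
    return "Return a one word answer";
}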

Getting started

Here's how to instrument an OpenAI client:

// Initialize Braintrust and create an OpenTelemetry instance wired to it
Braintrust braintrust = Braintrust.get();
OpenTelemetry openTelemetry = braintrust.openTelemetryCreate();

// Wrap the OpenAI client so every call through it is traced
OpenAIClient oaiClient = BraintrustOpenAI.wrapOpenAI(openTelemetry, OpenAIOkHttpClient.fromEnv());

// Use the client as normal
var response = oaiClient.chat().completions().create(
    ChatCompletionCreateParams.builder()
        .model(ChatModel.GPT_4O_MINI)
        .addUserMessage("Explain quantum computing")
        .build()
);

Every OpenAI call now flows through OpenTelemetry instrumentation, capturing inputs, outputs, latency, token usage, and costs.
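
And since the instrumentation is ordinary OpenTelemetry, you can group several LLM calls under one parent span with the standard tracer API so they show up as a single trace, assuming the wrapper picks up the active context (standard OpenTelemetry behavior). A minimal sketch:

import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

Tracer tracer = openTelemetry.getTracer("my-app"); // instrumentation scope name is up to you
Span parent = tracer.spanBuilder("answer-user-question").startSpan();
try (Scope scope = parent.makeCurrent()) {
    // Wrapped OpenAI calls made while this span is current nest under it
    var response = oaiClient.chat().completions().create(
        ChatCompletionCreateParams.builder()
            .model(ChatModel.GPT_4O_MINI)
            .addUserMessage("Explain quantum computing")
            .build());
} finally {
    parent.end();
}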

To run an evaluation, define your task, test cases, and scoring functions:

var braintrust = Braintrust.get();
var openTelemetry = braintrust.openTelemetryCreate();
var openAIClient = BraintrustOpenAI.wrapOpenAI(openTelemetry, OpenAIOkHttpClient.fromEnv());

// Define your task
Function<String, String> getFoodType = (String food) -> {
    var request = ChatCompletionCreateParams.builder()
        .model(ChatModel.GPT_4O_MINI)
        .addSystemMessage("Return a one word answer")
        .addUserMessage("What kind of food is " + food + "?")
        .maxTokens(50L)
        .temperature(0.0)
        .build();
    var response = openAIClient.chat().completions().create(request);
    return response.choices().get(0).message().content().orElse("").toLowerCase();
};

// Define your eval
var eval = braintrust.<String, String>evalBuilder()
    .name("food-classification-eval")
    .cases(
        EvalCase.of("asparagus", "vegetable"),
        EvalCase.of("banana", "fruit"),
        EvalCase.of("chicken", "protein"))
    .task(getFoodType)
    .scorers(
        Scorer.of("fruit_scorer",
            result -> "fruit".equals(result) ? 1.0 : 0.0),
        Scorer.of("vegetable_scorer",
            result -> "vegetable".equals(result) ? 1.0 : 0.0))
    .build();

// Run it
var result = eval.run();
System.out.println(result.createReportString());

This produces a detailed report showing per-case scores, aggregate metrics, and links to the Braintrust UI where you can drill into individual traces. Run this in CI/CD to catch regressions.
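
One way to wire this into CI is as a plain JUnit test. Here, buildFoodEval() stands for the builder code above factored into a helper, and the commented-out assertion shows where a hard quality gate would go; the exact aggregate-score accessor depends on the result API, so it's left as a sketch:

import org.junit.jupiter.api.Test;

class FoodClassificationEvalTest {
    @Test
    void promptStillClassifiesFood() {
        var eval = buildFoodEval(); // the evalBuilder() code above, extracted to a helper
        var result = eval.run();
        System.out.println(result.createReportString());
        // Hypothetical gate, depending on what the result object exposes:
        // assertTrue(result.averageScore() >= 0.8, "food classification regressed");
    }
}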

What's next

We're excited to support AI developers building with Java.

For more examples, check out the README. The artifact is available on Maven Central. If you run into issues or have questions, please let us know on Discord.
