How to trace LLM applications in TypeScript (2026)

2 July 2026 · Braintrust Team · 15 min

TL;DR

LLM tracing is harder to implement cleanly in TypeScript when the application runs across Node, serverless, and edge runtimes. Without structured traces, teams can miss where an agent slowed down, which tool call failed, or which prompt change created a regression.

A useful LLM trace captures each model call, retrieval step, tool invocation, input, output, latency, token count, cost, and error as part of one typed span tree. That structure makes production behavior searchable, debuggable, and reusable as evaluation data.

This guide walks through tracing LLM and agent applications in TypeScript, including auto-instrumentation, Vercel AI SDK telemetry, nested tool spans, streaming, serverless flushing, and the path from production traces to eval datasets. Start free with Braintrust to trace TypeScript LLM apps and turn real usage into release checks.

What a good LLM trace captures

LLM observability matured first in Python, so TypeScript teams often need to check more than whether a JavaScript package exists. A useful tracing setup should provide a first-class TypeScript SDK, real types for metadata and span payloads, support for Node, and reliable behavior in edge or serverless runtimes such as Vercel Edge and Cloudflare Workers.

A production trace should answer specific questions about a single request. Without that structure, teams can see that a request failed, but they cannot isolate whether the failure came from retrieval, the model call, a tool invocation, or the runtime.

Step-level visibility: Each model call, retrieval step, and tool invocation should appear as its own span so the team can see the path the request followed.

Inputs and outputs: Prompts, messages, tool arguments, returned text, and structured outputs should be captured for every step that affects the final response.

Latency, tokens, and cost: Each span should show how long the step took, how many prompt and completion tokens it used, and the estimated cost of the model call.

Failure context: Errors should be attached to the span where they occurred, with the error type and message preserved so a broken tool call does not disappear inside a generic request failure.

Replay data: The trace should include enough request context to re-run the case later, turning each debugged request into reusable evaluation data.
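
The fields listed above can be pictured as one typed span tree. The sketch below is illustrative only: the `Span` shape, the `exampleTrace` data, and the helper names are assumptions made for this example, not the Braintrust SDK's own types.

```typescript
// Hypothetical span shape for illustration; not the Braintrust SDK's types.
type SpanKind = "request" | "llm" | "retrieval" | "tool";

interface Span {
  name: string;
  kind: SpanKind;
  input: unknown;
  output?: unknown;
  durationMs: number;
  promptTokens?: number;
  completionTokens?: number;
  estimatedCostUsd?: number;
  error?: { type: string; message: string };
  children: Span[];
}

// Sum prompt and completion tokens across the whole tree.
function totalTokens(span: Span): number {
  const own = (span.promptTokens ?? 0) + (span.completionTokens ?? 0);
  return span.children.reduce((sum, child) => sum + totalTokens(child), own);
}

// Depth-first search (parents before children) for the first span that
// recorded an error. Returns the span names leading to it, or null.
function findFailurePath(span: Span, path: string[] = []): string[] | null {
  const here = [...path, span.name];
  if (span.error) return here;
  for (const child of span.children) {
    const found = findFailurePath(child, here);
    if (found) return found;
  }
  return null;
}

// Made-up request: retrieval, one model call, and a tool call that failed.
const exampleTrace: Span = {
  name: "handleRequest",
  kind: "request",
  input: { question: "Where is my order?" },
  durationMs: 900,
  children: [
    { name: "searchDocs", kind: "retrieval", input: "order status", durationMs: 40, children: [] },
    {
      name: "chooseTool",
      kind: "llm",
      input: "Where is my order?",
      durationMs: 600,
      promptTokens: 120,
      completionTokens: 30,
      children: [],
    },
    {
      name: "lookupOrder",
      kind: "tool",
      input: { orderId: "A-1" },
      durationMs: 200,
      error: { type: "TimeoutError", message: "order service did not respond" },
      children: [],
    },
  ],
};
```

With a structure like this, `totalTokens(exampleTrace)` adds up usage for the whole request, and `findFailurePath(exampleTrace)` points at the tool span rather than a generic request failure.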

Instrument LLM tracing with a TypeScript SDK

Auto-instrumentation is the fastest way to add tracing to a TypeScript LLM app. The SDK patches supported AI libraries at startup, so model calls are captured without wrapping every client call in your application code.

Install the SDK and your provider library, set your environment variables, and initialize the logger when the app starts.

typescript

import { initLogger } from "braintrust";
import OpenAI from "openai";

// Call once at startup; all LLM calls are traced automatically
initLogger({
  apiKey: process.env.BRAINTRUST_API_KEY,
  projectName: "My Project (TypeScript)",
});

const client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
const response = await client.responses.create({
  model: "gpt-5-mini",
  input: "What is the capital of France?",
});

For auto-instrumentation to take effect in plain Node, run your app with the startup hook flag. The hook patches supported libraries before your application code loads.

bash

node --import braintrust/hook.mjs app.js

For bundled applications, use the matching bundler plugin. In Next.js, wrap the config once; the same wrapper covers Webpack (Next.js 15 and earlier) and Turbopack (Next.js 16 and later).

typescript


const nextConfig = {};

export default wrapNextjsConfigWithBraintrust(nextConfig);

Manual instrumentation is useful when you want explicit control over which clients are traced or when a library is not covered by auto-instrumentation. In either case, wrap the client directly.

typescript

import { initLogger, wrapOpenAI } from "braintrust";
import OpenAI from "openai";

initLogger({
  apiKey: process.env.BRAINTRUST_API_KEY,
  projectName: "My Project (TypeScript)",
});

// Wrap the OpenAI client to trace all calls
const client = wrapOpenAI(new OpenAI({ apiKey: process.env.OPENAI_API_KEY }));
const response = await client.responses.create({
  model: "gpt-5-mini",
  input: "What is the capital of France?",
});

The most useful early habit is adding typed metadata at the request entry point. Wrapping the handler in logger.traced lets you record request-scoped values such as user ID or org ID when the span is created, which makes production traces easier to filter later. Because the metadata object is plain typed TypeScript, a wrong field name can be caught before it becomes a silent gap in your logs.

typescript

import { initLogger } from "braintrust";
import OpenAI from "openai";

const logger = initLogger({ projectName: "My Project" });
const openai = new OpenAI();

async function handleRequest(userId: string, orgId: string, prompt: string) {
  return logger.traced(
    async (span) => {
      const response = await openai.responses.create({
        model: "gpt-5-mini",
        input: prompt,
      });
      return response.output_text;
    },
    {
      event: {
        metadata: { userId, orgId },
        tags: ["handle-request"],
      },
    },
  );
}

await handleRequest("user-123", "org-456", "What is the capital of France?");
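
One way to get the compile-time check described above is to declare the metadata shape once and build the event through a small helper. This is a minimal sketch: `RequestMetadata` and `requestEvent` are hypothetical names for this example, and the returned object only mirrors the `event` shape used in the handler above.

```typescript
// Hypothetical metadata type for this example; adjust the fields to your app.
interface RequestMetadata {
  userId: string;
  orgId: string;
  plan?: "free" | "pro";
}

// Build the `event` payload that is passed when the root span is created.
function requestEvent(metadata: RequestMetadata, tags: string[] = []) {
  return { metadata, tags };
}

const event = requestEvent(
  { userId: "user-123", orgId: "org-456" },
  ["handle-request"],
);

// A misspelled field fails type checking instead of leaving a gap in the logs.
// The compiler rejects this object literal because `userID` is not a known
// property of RequestMetadata:
// requestEvent({ userID: "user-123", orgId: "org-456" });
```

The resulting `event` object can be passed as the second argument's `event` field in a `logger.traced` call like the one above.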

Trace agents, tools, and Next.js applications

A single model call produces a simple trace, but agent requests usually move through model calls, tool decisions, retrieval steps, and data transformations. To debug those requests, the trace needs to preserve the same hierarchy as the code. When one traced function calls another, Braintrust nests the child span under the parent span, so the final trace shows where each step ran and where a failure or delay occurred.

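typescript

import { initLogger, wrapTraced } from "braintrust";

// `db` and `transform` stand in for your own data-access and mapping helpers.
const logger = initLogger({ projectName: "My Project" });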

const fetchData = wrapTraced(async function fetchData(query: string) {
  // Database query logic
  return await db.query(query);
});

const transformData = wrapTraced(async function transformData(data: any[]) {
  // Data transformation logic
  return data.map((item) => transform(item));
});

// Parent span containing child spans
const pipeline = wrapTraced(async function pipeline(input: string) {
  const data = await fetchData(input); // Child span 1
  const transformed = await transformData(data); // Child span 2
  return transformed;
});

// Creates a trace with nested spans:
// pipeline
// ├─ fetchData
// └─ transformData
await pipeline("user query");
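
To see why a traced function called inside another traced function ends up as a child span, it can help to look at the general technique. The sketch below uses Node's `AsyncLocalStorage` to carry a "current span" across `await` boundaries. It illustrates the idea only and is not Braintrust's implementation; `traced`, `SpanNode`, and the sample functions are hypothetical.

```typescript
import { AsyncLocalStorage } from "node:async_hooks";

interface SpanNode {
  name: string;
  children: SpanNode[];
}

// Holds the span that is active for the current async call chain.
const currentSpan = new AsyncLocalStorage<SpanNode>();
// Spans started with no active parent become trace roots.
const roots: SpanNode[] = [];

// Wrap an async function so every call opens a span under the active one.
function traced<A extends unknown[], R>(
  name: string,
  fn: (...args: A) => Promise<R>,
): (...args: A) => Promise<R> {
  return (...args: A) => {
    const span: SpanNode = { name, children: [] };
    const parent = currentSpan.getStore();
    (parent ? parent.children : roots).push(span);
    // Everything awaited inside fn sees `span` as the active span.
    return currentSpan.run(span, () => fn(...args));
  };
}

const fetchRows = traced("fetchRows", async (query: string) => [query.length]);
const doubleRows = traced("doubleRows", async (rows: number[]) =>
  rows.map((n) => n * 2),
);
const runPipeline = traced("runPipeline", async (input: string) => {
  const rows = await fetchRows(input); // recorded as a child of runPipeline
  return doubleRows(rows); // recorded as a second child
});
```

Calling `runPipeline("user query")` leaves one root in `roots` named `runPipeline` with `fetchRows` and `doubleRows` as its children, the same shape as the trace in the comment above.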

With the Vercel AI SDK, tool-using calls should capture both the model's decision to call a tool and the result returned by the tool. Start with a standard AI SDK call and Braintrust initialization.

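typescript

import { initLogger } from "braintrust";
import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";

initLogger({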
  projectName: "My AI Project",
  apiKey: process.env.BRAINTRUST_API_KEY,
});

async function main() {
  const { text } = await generateText({
    model: openai("gpt-5-mini"),
    prompt: "What is the capital of France?",
  });
  console.log(text);
}

main().catch(console.error);

To capture tool behavior, add tools to the instrumented generateText call. Braintrust records tool definitions in span metadata and logs each tool execution as a child span with its inputs and outputs. The example below defines tool arguments with a zod (z) schema.

typescript

const { text } = await generateText({
  model: openai("gpt-5-mini"),
  prompt: "What is 127 multiplied by 49?",
  tools: {
    multiply: {
      description: "Multiply two numbers",
      inputSchema: z.object({
        a: z.number(),
        b: z.number(),
      }),
      execute: async ({ a, b }: { a: number; b: number }) => a * b,
    },
  },
});

A tool-using request often requires more than one model round trip. In the resulting trace, the generateText operation acts as the parent span, each doGenerate call appears as a model-call child span, and each tool execution appears as a separate child span. For example, a request can include one doGenerate call that decides to use multiply, the multiply execution span, and a second doGenerate call that uses the tool result to write the final answer. This hierarchy shows how many model calls the task required and which step introduced latency or failure.

For Next.js and React applications, instrument the server-side work. A browser click does not need client-side tracing when the model and tool calls run in a route handler or server action. Wrap the server handler with logger.traced, and Braintrust records the request as the root span while nested model and tool calls attach underneath it.

Trace with the Vercel AI SDK and OpenTelemetry

Teams already using the Vercel AI SDK can use its native OpenTelemetry support as the tracing on-ramp. The AI SDK can emit spans from model calls and tool execution, and Braintrust can receive those spans through its OpenTelemetry exporter. The setup has two parts. Register an exporter for the app, then enable telemetry on the AI SDK call you want to trace.

For a Next.js app, register an exporter that forwards spans to Braintrust. The parent field determines which Braintrust project receives the traced calls.

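typescript

// In your instrumentation.ts file
import { registerOTel } from "@vercel/otel";
import { BraintrustExporter } from "@braintrust/otel";

export function register() {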
  registerOTel({
    serviceName: "my-braintrust-app",
    traceExporter: new BraintrustExporter({
      parent: "project_name:your-project-name",
      filterAISpans: true, // Only send AI-related spans
    }),
  });
}

After the exporter is registered, enable telemetry on each AI SDK call by setting experimental_telemetry. Metadata attached here carries through to Braintrust on the resulting span, so request-specific context remains searchable later.

typescript

const result = await generateText({
  model: openai("gpt-5-mini"),
  prompt: "What is 2 + 2?",
  experimental_telemetry: {
    isEnabled: true,
    metadata: {
      query: "weather",
      location: "San Francisco",
    },
  },
});
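
Because telemetry is enabled per call, it is easy for call sites to drift apart in what metadata they attach. A small helper can keep them consistent. This is a sketch under stated assumptions: `telemetryFor`, `sharedMetadata`, and the local `TelemetrySettings` interface are hypothetical, and the interface covers only the `isEnabled`, `functionId`, and `metadata` fields rather than the AI SDK's full option type.

```typescript
type TelemetryMetadata = Record<string, string | number | boolean>;

// Local stand-in covering three telemetry fields; the AI SDK's own type
// has more options.
interface TelemetrySettings {
  isEnabled: boolean;
  functionId?: string;
  metadata?: TelemetryMetadata;
}

// Hypothetical values that every traced call in this service should carry.
const sharedMetadata: TelemetryMetadata = { service: "support-bot" };

// Build the telemetry settings for one call site.
// Call-specific metadata overrides shared metadata on key conflicts.
function telemetryFor(
  functionId: string,
  metadata: TelemetryMetadata = {},
): TelemetrySettings {
  return {
    isEnabled: true,
    functionId,
    metadata: { ...sharedMetadata, ...metadata },
  };
}

const settings = telemetryFor("answer-question", { userId: "user-123" });
```

The `settings` object could then be passed as `experimental_telemetry` in a call like the one above.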

Node applications without a framework can configure the OpenTelemetry NodeSDK directly and add the Braintrust span processor. This is the more direct path when the app does not use Next.js instrumentation.

First install the dependencies.

bash

npm install ai @ai-sdk/openai braintrust @braintrust/otel @opentelemetry/sdk-node @opentelemetry/sdk-trace-base zod

Then set up the OpenTelemetry SDK and call the AI SDK as normal.

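typescript

import { NodeSDK } from "@opentelemetry/sdk-node";
import { BraintrustSpanProcessor } from "@braintrust/otel";
import { generateText, tool } from "ai";
import { openai } from "@ai-sdk/openai";
import { z } from "zod";

const sdk = new NodeSDK({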
  spanProcessors: [
    new BraintrustSpanProcessor({
      parent: "project_name:your-project-name",
      filterAISpans: true,
    }),
  ],
});

sdk.start();

async function main() {
  const result = await generateText({
    model: openai("gpt-5-mini"),
    messages: [
      {
        role: "user",
        content: "What are my orders and where are they? My user ID is 123",
      },
    ],
    tools: {
      listOrders: tool({
        description: "list all orders",
        parameters: z.object({ userId: z.string() }),
        execute: async ({ userId }) =>
          `User ${userId} has the following orders: 1`,
      }),
      viewTrackingInformation: tool({
        description: "view tracking information for a specific order",
        parameters: z.object({ orderId: z.string() }),
        execute: async ({ orderId }) =>
          `Here is the tracking information for ${orderId}`,
      }),
    },
    experimental_telemetry: {
      isEnabled: true,
      functionId: "my-awesome-function",
      metadata: {
        something: "custom",
        someOtherThing: "other-value",
      },
    },
    maxSteps: 10,
  });

  await sdk.shutdown();
}

main().catch(console.error);

Why telemetry alone is not enough: The Vercel AI SDK emits and forwards spans, but telemetry by itself does not score outputs, build datasets, or compare prompt and model changes. Telemetry captures what happened during a request, while evals determine whether the response met the quality bar. Connecting the two is what turns production traces into a repeatable testing workflow.

Trace edge, serverless, and streaming workloads

Edge and serverless runtimes make flushing part of the tracing setup. A long-lived Node process can keep sending spans after request handling finishes, but an edge or serverless function may stop as soon as the response returns. If spans remain buffered when the runtime shuts down, the trace can be incomplete.

Streaming responses do not need separate span handling. When a model streams its response, the chunks are collected and logged as one complete span, so the trace shows the full model call as a single entry. Enable telemetry on a streaming call the same way you enable it on a regular AI SDK call.

typescript

import { streamText } from "ai";
import { openai } from "@ai-sdk/openai";

export async function POST(req: Request) {
  const { prompt } = await req.json();

  const result = await streamText({
    model: openai("gpt-5-mini"),
    prompt,
    experimental_telemetry: { isEnabled: true },
  });

  return result.toDataStreamResponse();
}
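
The idea that streamed chunks are collected and logged as one span can be sketched with a pass-through async generator. This illustrates the general pattern, not how Braintrust or the AI SDK implement it; `collectStream`, `StreamSpan`, and `fakeModelStream` are hypothetical names.

```typescript
interface StreamSpan {
  output: string;
  chunkCount: number;
  durationMs: number;
}

// Pass chunks through to the caller unchanged and report a single span
// once the stream ends.
async function* collectStream(
  source: AsyncIterable<string>,
  onComplete: (span: StreamSpan) => void,
): AsyncGenerator<string> {
  const startedAt = Date.now();
  const chunks: string[] = [];
  try {
    for await (const chunk of source) {
      chunks.push(chunk);
      yield chunk;
    }
  } finally {
    // Runs on normal completion, on error, and when the consumer stops early.
    onComplete({
      output: chunks.join(""),
      chunkCount: chunks.length,
      durationMs: Date.now() - startedAt,
    });
  }
}

// Stand-in for a model's token stream.
async function* fakeModelStream(): AsyncGenerator<string> {
  yield "Par";
  yield "is";
}
```

The consumer still receives each chunk as it arrives, while the span callback fires once with the joined output, which is why a streamed call shows up as a single entry in the trace.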

Flushing depends on the runtime. Braintrust logs in the background by default through asyncFlush: true, which prevents trace delivery from blocking user responses. On Vercel, the SDK detects the environment and uses waitUntil to finish flushing after the response is sent. On a runtime without a background-work primitive, such as AWS Lambda, set asyncFlush to false or call flush explicitly so spans are written before the function exits.

When async flushing remains enabled and you need to force delivery at a known point, call .flush() during cleanup or before a worker shuts down.

typescript

import { initLogger } from "braintrust";

const logger = initLogger({
  projectName: "My Project",
  apiKey: process.env.BRAINTRUST_API_KEY,
});

// ... Your application logic ...

// Some function that is called while cleaning up resources
async function cleanup() {
  await logger.flush();
}

For serverless deployments, the main decision is whether the runtime can finish background work after the response is sent. When it cannot, flush before the function exits so production traces do not lose the spans needed for debugging and evaluation.
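
That decision can be written down as a small wrapper. The sketch below is an assumption-laden illustration rather than a Braintrust or Vercel API: `RuntimeContext`, `Flushable`, and `respondAndFlush` are hypothetical, and `waitUntil` is modeled as an optional function supplied by the platform.

```typescript
// Hypothetical shapes for illustration only.
interface Flushable {
  flush(): Promise<void>;
}

interface RuntimeContext {
  // Present on runtimes that can keep working after the response is sent.
  waitUntil?: (work: Promise<unknown>) => void;
}

// Run the handler, then make sure buffered spans are delivered.
async function respondAndFlush<T>(
  ctx: RuntimeContext,
  logger: Flushable,
  handler: () => Promise<T>,
): Promise<T> {
  try {
    return await handler();
  } finally {
    if (ctx.waitUntil) {
      // Background flush: the response is not delayed.
      ctx.waitUntil(logger.flush());
    } else {
      // No background primitive: wait until spans are written before exiting.
      await logger.flush();
    }
  }
}
```

The `finally` block is a deliberate choice: spans from a failed request are usually the ones most needed for debugging, so they are flushed even when the handler throws.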

Turn traces into eval datasets

Tracing becomes more valuable when production failures become part of the evaluation process. Instead of treating a bad response, broken tool call, or missed retrieval result as a one-time debugging case, teams can save the relevant trace into a dataset and run future prompt, model, or application changes against it.

The workflow has three parts. Select the traces worth preserving, add them to a dataset, and run an eval over the dataset. In Braintrust, you can filter logs in the UI by user feedback, topic classification, metadata, errors, or other trace attributes, then add the selected examples to a dataset. You can also create the dataset programmatically with the SDK by inserting records that include an input, an optional expected output, and metadata.

Selecting the right traces is where active observability helps. Instead of scrolling logs and hoping a bad request is still visible, Braintrust classifies every trace with Topics by intent, sentiment, and issue, plus any custom facets you define, so recurring failures surface across all production traffic rather than only the requests you happen to open. Those grouped patterns become the candidates you promote into a dataset.

typescript

import { initDataset } from "braintrust";

async function main() {
  // Initialize dataset (creates it if it doesn't exist)
  const dataset = initDataset("My App", { dataset: "Customer Support" });

  // Insert records with input, expected output, and metadata
  dataset.insert({
    input: { question: "How do I reset my password?" },
    expected: { answer: "Click 'Forgot Password' on the login page." },
    metadata: { category: "authentication", difficulty: "easy" },
  });

  dataset.insert({
    input: { question: "What's your refund policy?" },
    expected: { answer: "Full refunds within 30 days of purchase." },
    metadata: { category: "billing", difficulty: "easy" },
  });

  dataset.insert({
    input: { question: "How do I integrate your API with NextJS?" },
    expected: { answer: "Install the SDK and use our React hooks." },
    metadata: { category: "technical", difficulty: "medium" },
  });

  // Flush to ensure all records are saved
  await dataset.flush();
  console.log("Dataset created with 3 records");
}

main();
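
When the records come from production traces rather than hand-written examples, the selection step can be expressed as a filter and a mapping. The sketch below assumes a hypothetical `LoggedTrace` export shape and made-up sample data; it shows only the conversion into records with the same `input`, `expected`, and `metadata` fields used above.

```typescript
// Hypothetical shape of a trace pulled from production logs.
interface LoggedTrace {
  id: string;
  input: { question: string };
  output: { answer: string };
  error?: string;
  feedbackScore?: number; // assumed convention: 0 = thumbs down, 1 = thumbs up
  metadata: Record<string, string>;
}

interface DatasetRecord {
  input: { question: string };
  expected?: { answer: string };
  metadata: Record<string, string>;
}

// Keep traces that errored or received negative feedback.
function isWorthPreserving(trace: LoggedTrace): boolean {
  return trace.error !== undefined || trace.feedbackScore === 0;
}

// The logged output of a failing trace is not a trustworthy expected value,
// so `expected` is left unset for a reviewer to fill in later.
function toDatasetRecord(trace: LoggedTrace): DatasetRecord {
  return {
    input: trace.input,
    metadata: { ...trace.metadata, sourceTraceId: trace.id },
  };
}

// Made-up sample traces.
const traces: LoggedTrace[] = [
  {
    id: "t1",
    input: { question: "How do I reset my password?" },
    output: { answer: "Click 'Forgot Password'." },
    feedbackScore: 1,
    metadata: { category: "authentication" },
  },
  {
    id: "t2",
    input: { question: "What's your refund policy?" },
    output: { answer: "We do not offer refunds." },
    feedbackScore: 0,
    metadata: { category: "billing" },
  },
  {
    id: "t3",
    input: { question: "Where is my order?" },
    output: { answer: "" },
    error: "tool timeout",
    metadata: { category: "orders" },
  },
];

const records = traces.filter(isWorthPreserving).map(toDatasetRecord);
```

Each entry in `records` has the same fields as the objects passed to `dataset.insert` above, and the `sourceTraceId` metadata keeps a link back to the request that motivated the test case.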

With a dataset in place, an eval needs three components. The dataset supplies the examples, the task runs your application on each input, and the scorers grade the output. To run an eval against a dataset created from traces, point data at the saved dataset with initDataset, define the task, and pass the scorer you want to use.

typescript

import { Eval, initDataset } from "braintrust";
import { Levenshtein } from "autoevals";

Eval(
  "Say Hi Bot", // Replace with your project name
  {
    data: initDataset("My App", { dataset: "My Dataset" }),
    task: async (input) => {
      return "Hi " + input; // Replace with your LLM call
    },
    scores: [Levenshtein],
  },
);
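
To make the three components concrete, here is a dependency-free sketch of the loop that an eval performs: iterate the data, run the task, score each output, and average. It is a simplified illustration, not how `Eval` is implemented. `runEval` is hypothetical, and `levenshteinScore` is a local stand-in that normalizes edit distance by the longer string's length, which is an assumption about how such a scorer can be defined.

```typescript
interface EvalCase<I> {
  input: I;
  expected: string;
}

// Classic dynamic-programming edit distance, keeping only two rows.
function editDistance(a: string, b: string): number {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

// Similarity in [0, 1]; 1 means the strings are identical.
function levenshteinScore(output: string, expected: string): number {
  const longest = Math.max(output.length, expected.length);
  return longest === 0 ? 1 : 1 - editDistance(output, expected) / longest;
}

// Run the task on every case and return the mean score (0 for empty data).
async function runEval<I>(
  data: EvalCase<I>[],
  task: (input: I) => Promise<string>,
  scorer: (output: string, expected: string) => number,
): Promise<number> {
  let total = 0;
  for (const { input, expected } of data) {
    total += scorer(await task(input), expected);
  }
  return data.length === 0 ? 0 : total / data.length;
}

// Made-up cases for the "Say Hi Bot" task above.
const cases: EvalCase<string>[] = [
  { input: "Ada", expected: "Hi Ada" },
  { input: "Bob", expected: "Hi Rob" },
];
const sayHi = async (input: string) => "Hi " + input;
```

Running `runEval(cases, sayHi, levenshteinScore)` scores the first case as an exact match and the second as a near miss, so a later change to the task that lowers the mean is visible as a regression.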

Once production traces become dataset records, any failure caught in production can become a release check. A regression found in production is captured once, added to the evaluation set, and scored against future changes before similar behavior reaches users again.

Start free with Braintrust to trace your TypeScript LLM app and turn production failures into release checks.

FAQs: how to trace LLM apps in TypeScript

Do Python-first observability tools have usable JS SDKs?

Some do, though a published package name says little about production fit. A thin JavaScript wrapper often covers basic Node services but breaks down on Next.js, Vercel Edge, or Cloudflare Workers, where runtime shutdown and bundler behavior decide whether spans actually arrive. Verify the SDK has been tested in the runtime you deploy to before committing.

How is tracing in TypeScript different from tracing in Python?

The trace concepts are the same, but TypeScript changes the implementation details. Teams usually care more about typed request metadata, framework integration, bundler compatibility, and runtime shutdown behavior because TypeScript LLM apps often run in Next.js, serverless functions, or edge environments. Python tracing more often starts from long-lived backend services, while TypeScript tracing has to account for where the request actually runs.

Does the Vercel AI SDK's telemetry replace a tracing backend?

The SDK produces OpenTelemetry spans, but a backend is what stores, filters, and queries them, connects them to datasets, and scores them against future releases. Without a tracing backend, the spans exist but no one can investigate a failure or reuse it.

Can I trace edge and serverless functions?

Yes, as long as spans are delivered before the runtime stops. Vercel finishes flushing through waitUntil after the response is sent, while runtimes without that primitive, such as AWS Lambda, need a synchronous flush before the function exits.

Which LLM tracing tool should I use?

Choose a tracing tool that fits the runtime, framework, and quality workflow behind your LLM app. For a TypeScript stack, prioritize a first-class TypeScript SDK, OpenTelemetry compatibility, support for agent and tool traces, and a path from production traces to evals. Braintrust is the strongest fit when tracing needs to connect directly to datasets, scoring, and release checks. For a broader comparison, see best LLM tracing tools.
