Contributed by Ornella Altunyan on 2024-12-14
The OpenAI Realtime API, designed for building advanced multimodal conversational experiences, unlocks a new set of use cases for AI applications. However, evaluating the outputs of this and other audio models in practice remains an open problem. In this cookbook, we’ll build a robust application with the Realtime API, incorporating tool calling and user input. Then, we’ll evaluate the results. Let’s get started!
Getting started
In this cookbook, we’re going to build a speech-to-speech RAG agent that answers questions about the Braintrust documentation. To get started, you’ll need Braintrust, OpenAI, and Pinecone accounts, as well as node, npm, and typescript installed locally. If you’d like to follow along in code,
the realtime-rag
project contains a working example with all of the documents and code snippets we’ll use.
Clone the repo
To start, clone the repo and install the dependencies. Then, create a .env.local file with your API keys.
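As a minimal sketch, assuming the local scripts read their credentials from the two environment variables referenced below, .env.local would look something like this:

```
OPENAI_API_KEY=sk-...
PINECONE_API_KEY=your-pinecone-api-key
```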
In Braintrust, set the OPENAI_API_KEY environment variable in the AI providers section
of your account, and set the PINECONE_API_KEY environment variable in the Environment variables section.
We’ll use the local environment variables to embed and upload the vectors, and
the Braintrust variables to run the RAG tool and LLM calls remotely.
Upload the vectors
To upload the vectors, run the upload-vectors.ts script. The script reads the documents from the docs-sample directory, breaks them into sections based on headings, and creates vector embeddings for each section using OpenAI’s API. It then stores those embeddings, along with each section’s title and content, in Pinecone.
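The core of that script looks roughly like the sketch below. The index name, embedding model, and heading-splitting logic are assumptions for illustration, not the repo’s exact code:

```typescript
// Sketch of upload-vectors.ts: read docs, split by headings, embed, upsert.
// The index name "braintrust-docs" and model choice are assumptions.
import fs from "node:fs";
import path from "node:path";
import OpenAI from "openai";
import { Pinecone } from "@pinecone-database/pinecone";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
const pinecone = new Pinecone({ apiKey: process.env.PINECONE_API_KEY! });
const index = pinecone.index("braintrust-docs");

// Split a markdown document into one section per heading.
function splitByHeadings(markdown: string): { title: string; content: string }[] {
  const sections: { title: string; content: string }[] = [];
  let current = { title: "Introduction", content: "" };
  for (const line of markdown.split("\n")) {
    if (line.startsWith("#")) {
      if (current.content.trim()) sections.push(current);
      current = { title: line.replace(/^#+\s*/, ""), content: "" };
    } else {
      current.content += line + "\n";
    }
  }
  if (current.content.trim()) sections.push(current);
  return sections;
}

async function main() {
  const docsDir = "docs-sample";
  for (const file of fs.readdirSync(docsDir)) {
    const markdown = fs.readFileSync(path.join(docsDir, file), "utf8");
    for (const [i, section] of splitByHeadings(markdown).entries()) {
      // Embed each section with OpenAI's embeddings API.
      const embedding = await openai.embeddings.create({
        model: "text-embedding-3-small",
        input: `${section.title}\n\n${section.content}`,
      });
      // Store the vector along with the section's title and content in Pinecone.
      await index.upsert([
        {
          id: `${file}-${i}`,
          values: embedding.data[0].embedding,
          metadata: { title: section.title, content: section.content },
        },
      ]);
    }
  }
}

main();
```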
That’s it for setup! Now let’s dig into the code.
Accessing the Realtime API
Building with the OpenAI Realtime API is complex because it is built on WebSockets and lacks built-in client-side authentication. However, the Braintrust AI Proxy makes it easy to connect to the API in a secure and scalable way. The proxy manages your OpenAI API key, issuing temporary credentials to your backend and frontend. The frontend sends any voice data from your app to the proxy, which handles secure communication with OpenAI’s Realtime API. To access the Realtime API through the Braintrust proxy, we changed the proxy URL when instantiating the RealtimeClient to https://braintrustproxy.com/v1/realtime. In our app, the RealtimeClient is initialized when the ConsolePage component is rendered.
We set up this logic in page.tsx:
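Conceptually, the initialization looks like the sketch below. The environment variable name and the way credentials reach the browser are assumptions; your app may instead fetch temporary credentials from a backend route:

```typescript
// Minimal sketch, assuming @openai/realtime-api-beta and a key exposed to the page.
import { RealtimeClient } from "@openai/realtime-api-beta";

const client = new RealtimeClient({
  // Point the client at the Braintrust AI Proxy instead of api.openai.com.
  url: "https://braintrustproxy.com/v1/realtime",
  apiKey: process.env.NEXT_PUBLIC_BRAINTRUST_API_KEY, // assumption: key provided via env
  dangerouslyAllowAPIKeyInBrowser: true,
});
```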
You can also use our proxy with an AI provider’s API key, but you will not
have access to other Braintrust features, like logging.
Creating a RAG tool
The retrieval logic also happens on the server side. We set up the helper function and route handler that queries Pinecone in route.ts so that we can call the retrieval tool on the client side like this:
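The sketch below shows what registering the retrieval tool on the client might look like; the tool name, parameter shape, and the "/api/search" route path are assumptions for illustration:

```typescript
// Illustrative sketch: register a retrieval tool that calls the route handler.
client.addTool(
  {
    name: "search_braintrust_docs",
    description: "Search the Braintrust documentation for relevant sections.",
    parameters: {
      type: "object",
      properties: {
        query: {
          type: "string",
          description: "The user's question, rephrased as a search query.",
        },
      },
      required: ["query"],
    },
  },
  async ({ query }: { query: string }) => {
    // The route handler in route.ts embeds the query and searches Pinecone.
    const response = await fetch("/api/search", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ query }),
    });
    return await response.json();
  }
);
```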
Currently, because of the way the Realtime API works, we have to use OpenAI
tool calling here instead of Braintrust tool functions.
Setting up the system prompt
When we call the Realtime API, we pass it a set of instructions that are configured in conversation_config.js:
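As a rough sketch (the actual instructions in the repo will differ), conversation_config.js exports an instructions string along the lines of:

```javascript
// conversation_config.js (illustrative wording, not the repo's exact prompt)
export const instructions = `You are a helpful voice assistant that answers questions
about the Braintrust documentation. When the user asks a question, call the
documentation search tool and base your answer on the retrieved sections.
Keep spoken responses short and conversational.`;
```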
Running the app
To run the app, navigate to /web and run npm run dev. The app should load at localhost:3000.
Start a new conversation and ask a few questions about Braintrust. Feel free to interrupt the bot or ask unrelated questions, and see what happens. When you’re finished, end the conversation. Have a couple of conversations to get a feel for some of the limitations and nuances of the bot; each conversation will come in handy in the next step.
Logging in Braintrust
In addition to client-side authentication, you’ll also get the other benefits of building with Braintrust, like logging, built in. When you ran the app and connected to the Realtime API, logs were generated for each conversation. When you closed the session, the log was complete and ready to view in Braintrust. Each LLM and tool call is contained in its own span inside of the trace. In addition, the audio files were uploaded as attachments in your trace. This means that you don’t have to exit the UI to listen to each of the inputs and outputs for the LLM calls.
Online evaluations
In Braintrust, you can run server-side online evaluations that execute automatically and asynchronously as you upload logs. This makes it easier to evaluate your app in situations like this one, where the prompt and tool might not be synced to Braintrust. Audio evals are complex because there are multiple aspects of your application you can focus on. In this cookbook, we’ll use the vector search query as a proxy for the quality of the Realtime API’s interpretation of the user’s input.
Setting up your scorer
We’ll need to create a scorer that captures the criteria we want to evaluate. Since we’re dealing with complex RAG outputs, we’ll use a custom LLM-as-a-judge scorer. For an LLM-as-a-judge scorer, you define a prompt that evaluates the output and maps its choices to specific scores. Navigate to Scorers and create a new scorer. Call your scorer BraintrustRAG and add the following prompt:
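As an illustration only (this is not the repo’s exact prompt, and the {{input}} and {{output}} variables are assumptions about how the span fields are mapped), an LLM-as-a-judge prompt for this scorer might look like:

```
You are evaluating whether a vector search query faithfully captures a user's
question about the Braintrust documentation.

Question: {{input}}
Search query: {{output}}

Select one:
a) The query captures the intent and key terms of the question.
b) The query is related to the question but misses important details.
c) The query is unrelated to the question.
```

You would then map each choice to a score, for example a = 1, b = 0.5, and c = 0.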
Configuring your online eval
Navigate to Configuration and scroll down to Online scoring. Select Add rule to configure your online scoring rule. Select the scorer we just created from the menu, and deselect Apply to root span. We’ll filter to the function span since that’s where our tool is called.
Viewing your evaluations
Now that you’ve set up your online evaluations, you can view the scores from within your logs. Underneath each function span that was sampled by the scoring rule, you’ll see an additional span containing the score.
Improving your evals
There are three main ways to improve your evals:
- Refine the scoring function to ensure it accurately reflects the success criteria.
- Add new scoring functions to capture different performance aspects (for example, correctness or efficiency).
- Expand your dataset with more diverse or challenging test cases.
Improving our existing scorer
Let’s change the prompt for our scoring function to: