Contributed by Tara Nagar on 2024-07-16
Evaluating a multi-turn chat assistant
This tutorial will walk through using Braintrust to evaluate a conversational, multi-turn chat assistant. These types of chat bots have become important parts of applications, acting as customer service agents, sales representatives, or travel agents, to name a few. As the owner of such an application, you need to be sure the bot provides value to the user. We will expand on this below, but the history and context of a conversation are crucial to producing a good response. If you received a request to “Make a dinner reservation at 7pm” and you knew where, on what date, and for how many people, you could provide some assistance; otherwise, you’d need to ask for more information. Before starting, please make sure you have a Braintrust account. If you do not have one, you can sign up here.

Installing dependencies
Begin by installing the necessary dependencies if you have not done so already.
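If you're starting from scratch, something like the following should pull in the libraries used in this cookbook (assuming an npm-based project; adjust for your package manager):

```bash
npm install braintrust autoevals openai
```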
Inspecting the data

Let’s take a look at the small dataset prepared for this cookbook. You can find the full dataset in the accompanying dataset.ts file. The assistant turns were generated using claude-3-5-sonnet-20240620.
Below is an example of a data point.
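The snippet below is a hypothetical row shown only to illustrate the shape of the data; the real entries live in dataset.ts.

```typescript
// A hypothetical data point, illustrating the shape of each row in dataset.ts.
const exampleRow = {
  chat_history: [
    { role: "user", content: "Who won the World Series in 1919?" },
    {
      role: "assistant",
      content:
        "The Cincinnati Reds won the 1919 World Series, beating the Chicago White Sox.",
    },
  ],
  input: "Why was it controversial?",
  expected:
    "The 1919 World Series was controversial because several Chicago White Sox players were accused of intentionally losing games in exchange for money from gamblers, an incident known as the Black Sox scandal.",
};
```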
- chat_history contains the history of the conversation between the user and the assistant
- input is the last user turn that will be sent in the messages argument to the chat completion
- expected is the output expected from the chat completion given the input
The input on its own is ambiguous, but given the chat_history, you would be able to answer the question (maybe after some quick research).
Running experiments
The key to running evals on a multi-turn conversation is to include the history of the chat in the chat completion request.

Assistant with no chat history
To start, let’s see how the prompt performs when no chat history is provided. We’ll create a simple task function that returns the output from a chat completion.
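A minimal sketch of such a task function is below, assuming an OpenAI client wrapped with Braintrust's wrapOpenAI helper; the system prompt and model are illustrative.

```typescript
import { OpenAI } from "openai";
import { wrapOpenAI } from "braintrust";

// Wrap the OpenAI client so completion calls are traced by Braintrust.
const client = wrapOpenAI(new OpenAI({ apiKey: process.env.OPENAI_API_KEY }));

// Task function: send only the latest user turn, with no chat history.
async function task(input: string): Promise<string> {
  const response = await client.chat.completions.create({
    model: "gpt-4o",
    messages: [
      { role: "system", content: "You are a helpful and friendly assistant." },
      { role: "user", content: input },
    ],
  });
  return response.choices[0].message.content ?? "";
}
```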
Scoring and running the eval

We’ll use the Factuality scoring function from the autoevals library to check how the output of the chat completion compares factually to the expected value.
We will also utilize trials by including the trialCount parameter in the Eval call. We expect the output of the chat completion to be non-deterministic, so running each input multiple times will give us a better sense of the “average” output.
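Putting it together, the eval might be kicked off like this. This is a sketch: the project name, experiment name, trial count, and the dataset export name are assumptions, and experimentData simply maps the rows in dataset.ts to the { input, expected } shape Braintrust expects.

```typescript
import { Eval } from "braintrust";
import { Factuality } from "autoevals";
import { dataset } from "./dataset"; // hypothetical export name from dataset.ts

// Map each row to Braintrust's { input, expected } shape.
const experimentData = dataset.map((d) => ({ input: d.input, expected: d.expected }));

Eval("Multi-turn chat assistant", {
  experimentName: "gpt-4o assistant - no history",
  data: () => experimentData,
  task,
  scores: [Factuality],
  trialCount: 3, // run each input multiple times to average out non-determinism
});
```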
Interpreting the results

Looking at one of the outputs, we can see that without the chat history, the model has to ask for clarification instead of answering:

“Could you please specify which athlete or player you're referring to? There are many professional athletes, and I'll need a bit more information to provide an accurate answer.”
This aligns with our expectation, so let’s now look at how the score was determined.

The Factuality scorer selected option (E), “The answers differ, but these differences don't matter from the perspective of factuality.” This is technically correct, but we want to penalize the chat completion for not being able to produce a good response.
Improve scoring with a custom scorer
While Factuality is a good general-purpose scorer, for our use case option (E) is not well aligned with our expectations. The best way to work around this is to customize the scoring function so that it produces a lower score for asking for more context or specificity. We’ll use the LLMClassifierFromSpec function to create our custom scorer to use in the eval function.
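A sketch of such a custom scorer is below. The prompt and choice scores are illustrative rather than the exact spec used in this cookbook; the important part is adding a choice that catches “asking for more context” and assigning it a score of zero.

```typescript
import { LLMClassifierFromSpec } from "autoevals";

const Factual = LLMClassifierFromSpec("Factual", {
  prompt: `You are comparing a submitted answer to an expert answer on a given question.
[BEGIN DATA]
[Question]: {{input}}
[Expert]: {{expected}}
[Submission]: {{output}}
[END DATA]

Compare the factual content of the submitted answer with the expert answer. Select the one option that best describes the submission:
(A) The submission is a subset of the expert answer and is fully consistent with it.
(B) The submission is a superset of the expert answer and is fully consistent with it.
(C) The submission contains all the same details as the expert answer.
(D) There is a disagreement between the submission and the expert answer.
(E) The answers differ, but the differences don't matter from the perspective of factuality.
(F) The submission asks for more context, clarification, or specificity instead of answering the question.`,
  // Asking for more context (F) now scores 0 instead of being treated as harmless.
  choice_scores: { A: 0.4, B: 0.6, C: 1, D: 0, E: 1, F: 0 },
});
```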
Read more about defining your own scorers in the documentation.
Re-running the eval
Let’s now use this updated scorer and run the experiment again.
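The only change from the previous run is the scorer (again a sketch, with an assumed experiment name):

```typescript
Eval("Multi-turn chat assistant", {
  experimentName: "gpt-4o assistant - no history - custom scorer",
  data: () => experimentData,
  task,
  scores: [Factual], // swap Factuality for the custom scorer
  trialCount: 3,
});
```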

Interpreting the results

With the custom scorer, we now see zero scores for the output fields in which the chat completion responses are requesting more context. In one of the experiments that had a non-zero score, we can see that the model asked for some clarification, but was able to understand from the question that the user was inquiring about a controversial World Series. Nice!

Assistant with chat history
Now let’s shift and see how providing the chat history improves the experiment.

Update the data, task function, and scorer function
We need to edit the inputs to the Eval function so we can pass the chat history to the chat completion request.
The task function now takes both the input string and the chat_history array, and adds the chat_history into the messages array in the chat completion request, done here using the spread ... syntax.
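A sketch of the history-aware task function, reusing the wrapped client from earlier; the system prompt and type annotations are illustrative.

```typescript
import type { ChatCompletionMessageParam } from "openai/resources/chat/completions";

async function taskWithHistory(args: {
  input: string;
  chat_history: ChatCompletionMessageParam[];
}): Promise<string> {
  const response = await client.chat.completions.create({
    model: "gpt-4o",
    messages: [
      { role: "system", content: "You are a helpful and friendly assistant." },
      // Spread the prior turns into the request so the model sees the full conversation.
      ...args.chat_history,
      { role: "user", content: args.input },
    ],
  });
  return response.choices[0].message.content ?? "";
}
```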
We also need to update the experimentData and Factual function parameters to align with these changes.
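For example, the data mapping might now bundle the history into the input (again a sketch):

```typescript
// input is now an object carrying both the latest user turn and the prior conversation.
const experimentData = dataset.map((d) => ({
  input: { input: d.input, chat_history: d.chat_history },
  expected: d.expected,
}));
```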
Running the eval
Use the updated variables and functions to run a new eval.
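Here's a sketch of that run, assuming the updated experimentData, taskWithHistory, and Factual scorer from above; the experiment name is illustrative.

```typescript
Eval("Multi-turn chat assistant", {
  experimentName: "gpt-4o assistant - with history",
  data: () => experimentData,
  task: taskWithHistory,
  scores: [Factual],
  trialCount: 3,
});
```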
Update experiment comparison for diff mode
Let’s go back to the dashboard. Notice that this experiment is not being compared to the previous gpt-4o assistant - no history-934e5ca2 experiment we ran. This is because, by default, Braintrust uses the input field to match rows across experiments, and our input field has changed shape. From the dashboard, we can customize the comparison key (see docs) by going to the project configuration page. For this cookbook, we can use the expected field as the comparison key because this field is unique in our small dataset.
In the Configuration tab, go to the bottom of the page to update the comparison key:

Interpreting the results
Turn on diff mode using the toggle on the upper right of the table.
In the diff view, you can see that the input changed from a string to an object with input and chat_history fields.
All of our rows scored 60% in this experiment. Looking into each trace, we can see this means the submitted answer includes all the details of the expert answer along with some additional information.
60% is an improvement from the previous run, but we can do better. Since it seems like the chat completion is always returning more than necessary, let’s see if we can tweak our prompt to have the output be more concise.
Improving the result
Let’s update the system prompt used in the chat completion request. We’ll adjust the system message to specify that the output should be precise, and then run the eval again.
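For example, the change might look something like this (the exact wording of the prompt is illustrative):

```typescript
// Tweaked system prompt asking for precise, concise answers.
const systemMessage =
  "You are a helpful and friendly assistant. Answer the question precisely, " +
  "including only the information asked for, and keep the response concise.";
```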
