xAI recently released their latest family of Grok models: Grok 4 and the premium Grok 4 Heavy. If you missed it, you can catch a replay of the livestream.
According to the xAI team, these reasoning-only models (reasoning can't be turned off) provide substantial improvements over Grok 3, primarily because they were trained to use tools rather than just generalize on their own.
While presenting the latest models as a means of further "maximizing truth seeking", Musk boldly claimed that these new models are "smarter than almost all graduate students in all disciplines simultaneously" and "better than PhD students" when it comes to academic questions.
But the real question is, "Can it create a good 'pelican riding a bicycle' SVG?"
Simon Willison has a rather unique test that he runs on new models, where he asks the model to create an image of a pelican riding a bicycle, followed by a request to describe the image it created.
Quirky tests like these can help us understand the proclivities of various LLMs. You can read Simon's full writeup on his experience with Grok 4 on his blog.
With Braintrust, you can evaluate tests like this in a systematic way. In this post, we'll share how you might set up tasks and scorers to understand how well each model does on these kinds of tasks, starting with Grok 4. To make things interesting, we'll define a custom 'LLM-as-Jury' scorer that combines several LLM-as-a-judge scorers from OpenAI, Anthropic, and xAI.
The first thing you need to do is create a Braintrust project.
Next, we'll import some libraries and set up our OpenAI client to call out to xAI. Make sure you have the appropriate API keys configured in your own .env file.
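If you're following along, the setup looks roughly like the sketch below. Treat it as a minimal, non-authoritative starting point: the client variable names (`wrapped_grok_client`, `openai_client`, `anthropic_client`), the environment variable names, and the use of python-dotenv are assumptions that the later snippets build on; the xAI endpoint follows xAI's OpenAI-compatible API.

```python
# Minimal setup sketch: variable names and env vars are placeholders.
import base64
import json
import os
from datetime import datetime
from functools import partial
from textwrap import dedent

import braintrust as bt
import cairosvg
from dotenv import load_dotenv
from IPython.display import SVG, display
from openai import OpenAI
from pydantic import BaseModel, Field

# `Score` is assumed to come from autoevals here; depending on your SDK
# version it may be exposed from a different module.
from autoevals import Score

load_dotenv()  # expects XAI_API_KEY, OPENAI_API_KEY, ANTHROPIC_API_KEY in .env

# Wrap clients with Braintrust so every LLM call is traced.
# xAI exposes an OpenAI-compatible API, so we reuse the OpenAI SDK.
wrapped_grok_client = bt.wrap_openai(
    OpenAI(base_url="https://api.x.ai/v1", api_key=os.environ["XAI_API_KEY"])
)

# Judge clients used by the LLM-as-Jury scorer later in the post.
openai_client = bt.wrap_openai(OpenAI(api_key=os.environ["OPENAI_API_KEY"]))
anthropic_client = bt.wrap_openai(
    # Anthropic's OpenAI SDK compatibility endpoint (an assumption here;
    # you could also call the native Anthropic SDK in the Claude judge).
    OpenAI(base_url="https://api.anthropic.com/v1/", api_key=os.environ["ANTHROPIC_API_KEY"])
)
```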
Every Braintrust eval needs three components, sketched together just after this list:

- Some data (a list of inputs we'll use to evaluate the task)
- A task (a function, like an LLM call, that takes a single example from our data and performs some work)
- A scorer (a means of knowing how well our task performed)
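Putting those three pieces together, an eval is just a single call. Here's a bare-bones sketch: the trivial task and scorer are placeholders, the real ones for this post are defined below, and the notebook-friendly `EvalAsync` variant appears at the end.

```python
# Bare-bones shape of a Braintrust eval; everything here is a placeholder.
def exact_match(input, output, expected=None, **kwargs):
    # Trivial scorer: 1.0 if the output matches the expected value, else 0.0
    return 1.0 if output == expected else 0.0

bt.Eval(
    "YOUR_PROJECT_NAME",
    data=lambda: [bt.EvalCase(input="hello", expected="hello")],
    task=lambda input: input,  # stand-in task; ours will create and describe an SVG
    scores=[exact_match],
)
```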
Since our data will come by way of queries to create and describe an SVG image, we can move on to defining the task we want to evaluate.
First, we need a method to generate an SVG.
```python
@bt.traced()
def create_svg_image(image_description: str, client, model_name: str, generation_kwargs: dict = {}):
    rsp = client.chat.completions.create(
        model=model_name,
        messages=[{"role": "user", "content": image_description}],
        **generation_kwargs,
    )

    # Extract svg content - handle both markdown wrapped and plain SVG
    content = rsp.choices[0].message.content  # type: ignore

    # Remove markdown code blocks if present
    # ...

    # Find SVG content if it's embedded in text
    if "<svg" in content:
        start = content.find("<svg")
        end = content.find("</svg>") + 6
        if start != -1 and end != 5:  # end != 5 means </svg> was found
            content = content[start:end]

    svg_string = content.strip()
    return svg_string
```
When you run this method with some code like this:
```python
svg_string = create_svg_image(
    "Generate an SVG of a pelican riding a bicycle",
    client=wrapped_grok_client,
    model_name="grok-4-0709",
    generation_kwargs={"max_tokens": 10000},
)
display(SVG(data=svg_string))
```
... you'll get something like this:
Second, we'll need a task that takes an image and uses the same model to generate a description of the image.
```python
@bt.traced()
def describe_image(image_path: str, client, model_name: str, generation_kwargs: dict = {}):
    with open(image_path, "rb") as image_file:
        image_data = base64.b64encode(image_file.read()).decode()

    image_url = f"data:image/png;base64,{image_data}"

    rsp = client.chat.completions.create(
        model=model_name,
        messages=[
            {
                "role": "system",
                "content": "Describe this image in markdown format. Include the following sections: Simple Description, Main Subject, Background and Setting, Style and Tone\nUse bullet points for all sections after the Simple Description section.",
            },
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Describe this image"},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            },
        ],
        **generation_kwargs,
    )

    content = rsp.choices[0].message.content  # type: ignore
    return image_url, content
```
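Assuming the SVG from the previous step has already been rendered to a PNG on disk (the top-level task below handles that with cairosvg, and the path here matches the one it writes), an invocation might look like this:

```python
# Hypothetical call; the path matches what the top-level task writes out.
image_url, description = describe_image(
    image_path="_temp/created_image.png",
    client=wrapped_grok_client,
    model_name="grok-4-0709",
    generation_kwargs={"max_tokens": 10000},
)
print(description)
```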
The generated description looks something like this:
```markdown
## Simple Description

The image depicts a minimalist cartoon illustration of a white bird-like figure with a yellow beak, small wings, and an orange leg extended downward, appearing to interact with a small gray object via directional arrows, all set against a solid black background.

## Main Subject

- A central white, oval-shaped figure resembling a cartoon bird or penguin
- Features a small yellow beak pointing to the right
- Small, outstretched white wings on either side of the body
- An orange leg extending downward from the body, with an arrow along it pointing down
- A small gray oval or blob-like object at the end of the leg
- A larger downward arrow below the gray object, suggesting motion or direction

## Background and Setting

- Entirely solid black, creating a void-like environment
- No additional scenery, objects, or details present
- The setting emphasizes isolation and focus on the central subject

## Style and Tone

- Highly simplistic and minimalist, using basic geometric shapes like ovals and lines
- Cartoonish and illustrative, with flat colors and no shading or depth
- Neutral to slightly whimsical tone, possibly educational or diagrammatic due to the arrows indicating direction or force
```
And lastly, we'll need a top-level task that puts these all together:
```python
@bt.traced()
def create_and_describe_image(image_description: str, client, model_name: str, generation_kwargs: dict = {}):
    # Create SVG Image
    svg_string = create_svg_image(
        image_description, client=client, model_name=model_name, generation_kwargs=generation_kwargs
    )

    # Convert SVG to PNG and save
    os.makedirs("_temp", exist_ok=True)
    png_data = cairosvg.svg2png(bytestring=svg_string.encode("utf-8"))
    with open("_temp/created_image.png", "wb") as f:
        f.write(png_data)

    # Ask model to describe the image it created
    image_url, description = describe_image(
        image_path="_temp/created_image.png", client=client, model_name=model_name, generation_kwargs=generation_kwargs
    )

    return {"image_url": image_url, "description": description}
```
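The scorer examples later in this post reference an `rsp` variable; it's simply the dictionary this task returns. For instance (using the two-cats prompt that shows up in the scorer demo below):

```python
# Produce the output dict that the scorer will receive as `output`.
rsp = create_and_describe_image(
    "Create an SVG of two cats riding a bicycle",
    client=wrapped_grok_client,
    model_name="grok-4-0709",
    generation_kwargs={"max_tokens": 10000},
)
# rsp looks like: {"image_url": "data:image/png;base64,...", "description": "## Simple Description ..."}
```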
The last component required to run an eval is one or more scorers. To demonstrate how to build your own custom scorers, we'll define an LLM-as-Jury scorer that uses multiple LLM-as-a-judge classifiers to derive a final judgment on how well the model did at describing the image it created.
In this example, we define OpenAI, Anthropic, and Grok judges, normalize each judge's 1-5 Likert score to a 0-1 range, and average them to arrive at a final verdict.
```python
class LikertScale(BaseModel):
    score: int = Field(
        ...,
        description="A score between 1 and 5 (1 is the worst score and 5 is the best score).",
        min_value=1,
        max_value=5,
    )  # type: ignore
    rationale: str = Field(..., description="A rationale for the score.")


def ask_llm_judge_about_image_description(client, model_name, input, output):
    gen_kwargs = {"response_format": LikertScale}
    if model_name.startswith("claude"):
        gen_kwargs = {}

    rsp = client.chat.completions.parse(
        model=model_name,
        messages=[
            {
                "role": "system",
                "content": dedent("""\
                    You are a critical expert in determining if a generated image matches what the user asked for
                    and whether or not an AI model did a good job in describing that image.

                    The score must be an integer between 1 and 5.

                    You should respond ONLY with a JSON object with this format: {score:int, rationale:str}.
                    Make sure you escape any characters that are not valid JSON.
                    Only respond with a string that can be parsed as JSON using `json.loads()`.
                    Double check your work!
                    """),
            },
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": f"Here is the image generated from the description: {input}"},
                    {"type": "image_url", "image_url": {"url": output["image_url"]}},
                    {
                        "type": "text",
                        "text": f"Here is the description of the generated image: {output['description']}",
                    },
                    {
                        "type": "text",
                        "text": "Return a score between 1 and 5 based on how well the image matches the description and how well the description matches the image. 1 is the worst score and 5 is the best score.",
                    },
                ],
            },
        ],
        **gen_kwargs,
    )

    if model_name.startswith("claude"):
        parsed = json.loads(rsp.choices[0].message.content)
        return (parsed["score"] - 1) / 4
    else:
        parsed: LikertScale = rsp.choices[0].message.parsed
        return (parsed.score - 1) / 4


def is_good_description(input, output, expected=None, metadata=None):
    oai_judge_score = partial(
        ask_llm_judge_about_image_description, client=openai_client, model_name="gpt-4o", input=input, output=output
    )()
    anthropic_judge_score = partial(
        ask_llm_judge_about_image_description,
        client=anthropic_client,
        model_name="claude-3-5-sonnet-20240620",
        input=input,
        output=output,
    )()
    grok_judge_score = partial(
        ask_llm_judge_about_image_description,
        client=wrapped_grok_client,
        model_name="grok-4-0709",
        input=input,
        output=output,
    )()

    return [
        Score(name="is_good_description_judge_oai", score=oai_judge_score),
        Score(name="is_good_description_judge_anthropic", score=anthropic_judge_score),
        Score(name="is_good_description_judge_grok", score=grok_judge_score),
        Score(name="is_good_description_jury", score=(oai_judge_score + anthropic_judge_score + grok_judge_score) / 3),
    ]
```
When we run that against our outputs from `create_and_describe_image()`, we'll get something like this to add to our traces:
```python
score = is_good_description(
    input="Create an SVG of two cats riding a bicycle",
    output=rsp,
)
score

# [Score(name='is_good_description_judge_oai', score=1.0, metadata={}, error=None),
#  Score(name='is_good_description_judge_anthropic', score=0.75, metadata={}, error=None),
#  Score(name='is_good_description_judge_grok', score=1.0, metadata={}, error=None),
#  Score(name='is_good_description_jury', score=0.9166666666666666, metadata={}, error=None)]
```
Here, we'll run a single eval with Grok 4, but this can also be extended to add more image descriptions and tests with different models.
```python
current_date_str = datetime.now().strftime("%Y%m%d%H")
print(current_date_str)

# This code was written to run in a Jupyter notebook
await bt.EvalAsync(
    name="YOUR_PROJECT_NAME",
    experiment_name=f"reasoning-xai-grok4-0709-{current_date_str}",
    data=lambda: [bt.EvalCase(input="Generate an SVG of a pelican riding a bicycle")],  # type: ignore
    task=partial(
        create_and_describe_image,
        client=wrapped_grok_client,
        model_name="grok-4-0709",
        generation_kwargs={"max_tokens": 10000},
    ),
    scores=[is_good_description],
    metadata={"vendor": "xai", "model": "grok-4-0709"},
)
```
In addition to improving the scorers, you can add more image descriptions and test additional models. Braintrust makes it easy to group and aggregate results by vendor or model family so that you can systematically measure the progress of these models over time.
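For example, one way you might extend the single run above into a small sweep (a sketch only: the prompt list, the vendor/model pairs, and the reuse of the clients from the setup section are all assumptions):

```python
# Sweep a few prompts across a couple of vendors; assumes the clients defined earlier.
prompts = [
    "Generate an SVG of a pelican riding a bicycle",
    "Generate an SVG of two cats riding a bicycle",
]
models = {
    "xai": ("grok-4-0709", wrapped_grok_client),
    "openai": ("gpt-4o", openai_client),
}

for vendor, (model_name, client) in models.items():
    await bt.EvalAsync(
        name="YOUR_PROJECT_NAME",
        experiment_name=f"reasoning-{vendor}-{model_name}-{current_date_str}",
        data=lambda: [bt.EvalCase(input=p) for p in prompts],
        task=partial(
            create_and_describe_image,
            client=client,
            model_name=model_name,
            generation_kwargs={"max_tokens": 10000},
        ),
        scores=[is_good_description],
        metadata={"vendor": vendor, "model": model_name},
    )
```

The `metadata` field is what lets you slice and aggregate the resulting experiments by vendor or model in the Braintrust UI.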
If you have any interesting tests you run when a new model comes out, let us know!