Comparing Large Language Models Side by Side

June 18, 2024

I wanted to get a better feel for how different LLMs respond to the same set of prompts, so I tested a bunch of them and put up the results. Benchmarks are great for gauging the relative performance of models, but they don't give you much intuition about model behavior. I tried to build that intuition, and here are some interesting things I found:
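If you want to run this kind of side-by-side test yourself, here's a minimal sketch of the idea: send the same prompt to each provider's API and print the replies next to each other. It assumes the official openai and anthropic Python SDKs with API keys set in your environment; the model IDs and the prompt are illustrative placeholders, not my actual harness.

```python
# Minimal sketch: send one prompt to several models and compare replies.
from openai import OpenAI
import anthropic

openai_client = OpenAI()                  # reads OPENAI_API_KEY from the environment
anthropic_client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY

def ask_openai(model: str, prompt: str) -> str:
    resp = openai_client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def ask_anthropic(model: str, prompt: str) -> str:
    resp = anthropic_client.messages.create(
        model=model,
        max_tokens=512,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.content[0].text

# One of the test prompts, paraphrased; swap in your own.
prompt = "I'm an astronaut in China. Can I see the Great Wall?"

# Illustrative model IDs; check each provider's docs for current ones.
for name, ask in [("gpt-4o", ask_openai),
                  ("claude-3-5-sonnet-20240620", ask_anthropic)]:
    print(f"--- {name} ---")
    print(ask(name, prompt))
```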

If you are an astronaut in China, forget about seeing the Great Wall:

None of the top models listed here (not even Claude 3.5 Sonnet) get that you are an astronaut IN China; they all go off about the common myth of seeing the Great Wall from space.

If you are a human, chocolate, grapes and raisins will poison you:

When asked "What is poisonous for humans but not for dogs?", all models except GPT-4o will let you know not to eat chocolate, grapes, or raisins.

Just because you climbed a mountain doesn't make you a mountain climber:

Here I wanted to see some discussion from the models about whether you need to be a professional to be called a mountain climber, or whether having climbed a mountain is enough. I think only Claude 3.5 and Llama 3 70B kinda get there.

You can trick a Llama, but GPT-4o and Claude 3.5 are onto you:

I posed a modified Monty Hall problem, and Gemini Flash 1.5, Command R+, and Llama 3 all fell for it, giving answers for the standard Monty Hall problem. GPT-4o and Claude 3.5 knew what I was up to.

But the other Anthropic models do get tripped up:

This is a classic way to trip up large language models: because so much of their training data contains text explaining the standard Monty Hall problem, they are not good at handling the variations we introduce.

Claude 3.5 is a little less uptight than older Claude models:

In this scenario it’s more willing than the Claude 3 models to go along with the fiction.

In this creativity test, I wanted to see how the models would use the unique situation and the octopus's abilities to make changing a tire interesting.

Pretty much all the models just describe a human changing a tire, but Claude 3.5 keeps in mind that you are an octopus. I’d recommend Claude 3.5 to any octopuses in distress.

Dumber models are funnier:

Here Qwen 2 72B gives me a lecture about drunk driving instead of a joke, but Qwen 1.5 32B nails it!

You can compare all 75+ models across 200+ prompts here: Compare Models