June 18, 2024
I wanted to get a better feeling of how different LLMs respond to the same set of prompts so I tested a bunch of them out and put up the results from my testing. Benchmarks are great for getting an idea of the relative performance of models, but they don’t give enough of an intuition about the model behavior. I tried to get that and here’s some interesting things I found:
If you are an astronaut in China, forget about seeing the Great Wall:None of the top models listed here (even Claude 3.5 Sonnet) get that you are an astronaut IN China, they all go off about the common myth about seeing the Great Wall from space.
If you are a human, chocolate, grapes and raisins will poison you:When asked “What is poisonous for humans but not for dogs?” all models except GPT-4o will let you know not to eat Chocolate, Grapes or raisins.
Just because you climbed a mountain, it doesn’t make you a mountain climber:Here I wanted to see some discussion from the models about whether it’s necessary to be a professional mountain climber to be called a mountain climber or if it was enough to have just climbed a mountain, I think only Claude 3.5 and Llama 3 70B kinda get there.
You can trick a Llama but GPT-4o and Claude 3.5 are onto you:I asked a modified Monty Hall problem and Gemini Flash 1.5, Command R+ and Llama 3 fell for it by giving responses for the standard Monty Hall problem. GPT-4o and Claude 3.5 knew what I was up to.
But the other Anthropic models do get tripped up:This is a classic way to trip up large language models, since so much of their training data contains text explaining the Monty Hall problem, they are not good at handling the variations that we introduce.
Claude 3.5 is a little less uptight than older Claude models:In this scenario it’s more willing than the Claude 3 models to go along with the fiction.
In this creativity test, I wanted to see how the models use the unique situation and the octopus’ abilities to make changing a tire interesting.Pretty much all the models just describe a human changing a tire, but Claude 3.5 keeps in mind that you are an octopus. I’d recommend Claude 3.5 to any octopuses in distress.
Dumber models are funnier:Here Qwen 2 72B gives me a lecture about drunk driving instead of a joke but Qwen 1.5 32B nails it!
You can compare all 75+ models accross 200+ prompts here: Compare Models