Is Claude 3 really better than GPT-4? Promptbase's benchmarking says otherwise

Head-to-head tests show GPT-4 Turbo edges out Claude 3 in every directly comparable category.


Key notes

  • Anthropic recently launched Claude 3, touted to outperform GPT-4 and Google Gemini 1.0 Ultra.
  • Anthropic's posted benchmark scores show Claude 3 Opus leading GPT-4 and Gemini 1.0 Ultra across a range of benchmarks.
  • However, further analysis shows GPT-4 Turbo beating Claude 3 in direct comparisons, suggesting the published figures rely on outdated GPT-4 scores.

Anthropic recently launched Claude 3, an AI model it says can beat OpenAI's GPT-4 and Google's Gemini 1.0 Ultra. It comes in three variants, Claude 3 Haiku, Sonnet, and Opus, each aimed at different uses.

In its initial announcement, the AI company claimed that Claude 3 is slightly superior to these two rival models.

According to the posted benchmark scores, Claude 3 Opus beats GPT-4 and Gemini 1.0 Ultra and Pro in undergraduate-level knowledge (MMLU), graduate-level reasoning (GPQA), grade-school math, math problem-solving, multilingual math, coding, reasoning over text, and more.

However, those numbers don't paint the full picture. The benchmark scores in the announcement (especially for GPT-4) were apparently taken from GPT-4's original release version from March 2023 (credit to AI enthusiast @TolgaBilge_ on X).

Promptbase, Microsoft's collection of benchmark results and prompting techniques, shows that GPT-4 Turbo actually beats Claude 3 in all the tests where the two can be directly compared. These cover basic math skills (GSM8K and MATH), writing code (HumanEval), reasoning over text (DROP), and a mix of other challenges.

While announcing its results, Anthropic also mentions in a footnote that engineers have since pushed GPT-4 Turbo's performance even higher by optimizing prompts specifically for these tests. This suggests the reported comparison might not reflect the model's true capabilities.

Ouch.