Meta's upcoming Llama-3 400B model could beat GPT-4 Turbo and Claude 3 Opus

It doesn't beat them yet, but it has the potential to




Key notes

  • Meta unveils Llama-3, its most powerful model yet with 400B parameters
  • Llama-3 shows potential for further improvement even though it is still in training
  • Recent numbers suggest that it’s close to Claude 3 Opus and GPT-4 Turbo in benchmarks

Meta is set to launch its most powerful AI model yet, Llama-3 with 400B parameters. According to Thursday's announcement, the open-source model will soon power the Meta AI assistant that's coming to WhatsApp and Instagram.

But the truth is, there are plenty of powerful AI models on the market at the moment. OpenAI's GPT-4 Turbo with a 128k context window has been around for quite some time, and Anthropic's Claude 3 Opus is now available on Amazon Bedrock.

So, how do these models compare to one another? Here's how they stack up across several benchmarks, based on publicly available figures and Meta's announcement.

| Benchmark | Llama 3 400B | Claude 3 Opus | GPT-4 Turbo | Gemini Ultra 1.0 | Gemini Pro 1.5 |
|-----------|--------------|---------------|-------------|------------------|----------------|
| MMLU      | 86.1         | 86.8          | 86.5        | 83.7             | 81.9           |
| GPQA      | 48           | 50.4          | 49.1        | –                | –              |
| HumanEval | 84.1         | 84.9          | 87.6        | 74.4             | 71.9           |
| MATH      | 57.8         | 60.1          | 72.2        | 53.2             | 58.5           |

As you can see, Llama-3 400B actually does fall slightly short in these benchmarks, scoring 86.1 in MMLU, 48 in GPQA, 84.1 in HumanEval, and 57.8 in MATH. 

But, given that it's still in the training phase, there's a good chance of sizeable improvements by the time it's fully released. And for an open-source model, those numbers are already impressive.

MMLU tests how well models understand a wide range of subjects without being directly trained on them. GPQA, on the other hand, measures how models handle graduate-level questions in biology, physics, and chemistry, while HumanEval focuses on how well they write code.