OpenAI's new benchmark SimpleQA assesses AI models' factual accuracy

AI models often hallucinate.




Key notes

  • OpenAI’s SimpleQA benchmark tests AI models’ accuracy on short, fact-based questions.
  • The dataset includes 4,326 questions, with multiple AI trainers verifying answers.
  • Results show larger models do better, but more improvement is needed for reliable accuracy.

OpenAI has just announced a new benchmark called SimpleQA, designed to assess AI models’ factual accuracy.

The Microsoft-backed company says SimpleQA measures a model’s ability to answer short, fact-seeking questions. It focuses on concise queries with clear, verifiable answers, which makes factuality much easier to evaluate.

“Factuality is a complicated topic because it is hard to measure—evaluating the factuality of any given arbitrary claim can be challenging, and language models often generate long completions that contain dozens of factual claims,” OpenAI writes in the benchmark’s 14-page paper.

The dataset has 4,326 questions on various topics, with answers checked by multiple AI trainers for accuracy. Early results show larger models perform better, but there’s still plenty of room to improve their ability to give clear and correct answers.
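To illustrate the idea, here is a minimal sketch of a SimpleQA-style evaluation loop. The dataset format, the example questions, the model name, and the naive substring grading are all assumptions for illustration; OpenAI’s actual benchmark uses its own dataset and grading pipeline.

```python
# Minimal sketch of a SimpleQA-style evaluation loop (illustrative only).
# Assumptions not taken from the article: the dataset is a list of
# {"question": ..., "answer": ...} pairs, and grading is a case-insensitive
# substring match. OpenAI's real benchmark grades answers differently.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

dataset = [
    {"question": "In what year was the Eiffel Tower completed?", "answer": "1889"},
    {"question": "Who wrote the novel 'Dune'?", "answer": "Frank Herbert"},
]

def ask(question: str, model: str = "gpt-4o-mini") -> str:
    """Send a short, fact-seeking question to the model and return its reply."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content or ""

correct = 0
for item in dataset:
    reply = ask(item["question"])
    # Naive grading: count the reply as correct if it contains the expected answer.
    if item["answer"].lower() in reply.lower():
        correct += 1

print(f"Accuracy: {correct}/{len(dataset)}")
```

In practice, a benchmark at this scale would run all 4,326 questions and use a more robust grader than a substring check, but the loop above captures the basic shape: ask a concise question, compare the answer to a verified reference, and tally accuracy.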

When an AI “hallucinates,” it generates false or inaccurate information that isn’t grounded in real data or factual evidence. This happens because the model doesn’t truly understand the facts and sometimes fills gaps with guesses or incorrect information, especially when it lacks reliable data to support its answer or runs up against its knowledge cutoff date.

That’s behind a lot of the ridiculous answers AI tools produce, whether in Google’s AI Overviews, ChatGPT, or even Copilot at times. SimpleQA is meant to measure how often such hallucinations occur, a step toward making AI answers more reliably factual.
