OpenAI's new AI benchmark, MLE-bench, pits models against human data scientists

MLE-bench is an open-source benchmark.


Key notes

  • OpenAI’s MLE-bench evaluates AI against 75 real-world Kaggle competitions.
  • It tests AI capabilities in planning, troubleshooting, and innovation.
  • The o1-preview model excelled in 16.9% of competitions but struggled with adaptability.

OpenAI has introduced MLE-bench, a new benchmark that tests AI systems against 75 real-world data science competitions from Kaggle to measure their capabilities in machine learning engineering.

By pitting AI against real-life human data scientists, the Microsoft-backed company says, the benchmark takes AI performance testing beyond computational tasks. The open-source benchmark evaluates a model’s ability to plan, troubleshoot, and innovate in whatever field it is presented with.

“Since Kaggle does not provide the held-out test set for each competition, we provide preparation scripts that split the publicly available training set into a new training and test set,” OpenAI describes. You can check the benchmark’s repository on GitHub, which provides all the code and datasets.
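OpenAI's actual preparation scripts live in the repository; as a rough illustration of the idea, here is a minimal Python sketch of splitting a public training set into new train and test sets with scikit-learn. The file paths and column layout are hypothetical, not MLE-bench's actual scheme:

```python
import os

import pandas as pd
from sklearn.model_selection import train_test_split

# Load the competition's publicly available training data
# (file name is hypothetical; each competition ships its own files).
full_train = pd.read_csv("train.csv")

# Carve a held-out test set out of the public training set,
# mirroring the idea behind MLE-bench's preparation scripts.
new_train, new_test = train_test_split(
    full_train, test_size=0.2, random_state=42
)

os.makedirs("prepared", exist_ok=True)
new_train.to_csv("prepared/train.csv", index=False)
new_test.to_csv("prepared/test.csv", index=False)
```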

You can set up the environment by installing Git-LFS, installing the package via pip, and preparing the dataset, which involves downloading the Kaggle data and splitting it for training and testing. The repository includes grading scripts for evaluating competition submissions in CSV format, as well as a Docker image that gives agents a consistent execution environment.
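To make the grading step concrete, a submission is just a CSV file the grading scripts can score. The sketch below builds a trivial baseline submission; the column names "id" and "prediction" are placeholders, since each competition defines its own submission schema:

```python
import pandas as pd

# Load the held-out test split produced during dataset preparation
# (path follows the hypothetical layout from the earlier sketch).
test = pd.read_csv("prepared/test.csv")

# Build a submission in CSV format. Real competitions specify their
# own required columns; these names are illustrative only.
submission = pd.DataFrame({
    "id": test["id"],
    "prediction": 0,  # a trivial constant baseline
})
submission.to_csv("submission.csv", index=False)
```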

Kaggle is Google’s data science competition platform and online community, serving as a hub for data scientists and machine learning practitioners.

OpenAI’s latest model, o1-preview, achieved notable results, performing at a medal-worthy level in 16.9% of the competitions. Still, it struggled with tasks requiring adaptability and creative problem-solving.

The o1 model family itself excels at reasoning, perhaps among the best models for such tasks, and also comes in a mini version. It’s currently available to paid ChatGPT users alongside GPT-4o and GPT-4o mini.
