OpenAI's new AI benchmark, MLE-bench, pits models against human data scientists

MLE-bench is an open-source benchmark.


Key notes

  • OpenAI’s MLE-bench evaluates AI against 75 real-world Kaggle competitions.
  • It tests AI capabilities in planning, troubleshooting, and innovation.
  • The o1-preview model excelled in 16.9% of competitions but struggled with adaptability.

OpenAI has introduced MLE-bench, a new benchmark that tests AI systems against 75 real-world data science competitions from Kaggle to measure their capabilities in machine learning engineering.

By pitting AI against real-life human data scientists, the Microsoft-backed company says, the benchmark takes AI performance testing beyond computational tasks. The open-source benchmark evaluates a model’s ability to plan, troubleshoot, and innovate in whatever field it is presented with.

“Since Kaggle does not provide the held-out test set for each competition, we provide preparation scripts that split the publicly available training set into a new training and test set,” OpenAI describes. You can check the benchmark’s repository on GitHub, which provides all the code and datasets.
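OpenAI's actual preparation scripts live in the repository; as a rough illustration of the idea, here is a minimal Python sketch of splitting a public training set into new train and test sets with scikit-learn. The file paths and column layout are hypothetical, not MLE-bench's actual scheme:

```python
import os

import pandas as pd
from sklearn.model_selection import train_test_split

# Load the competition's publicly available training data
# (file name is hypothetical; each competition ships its own files).
full_train = pd.read_csv("train.csv")

# Carve a held-out test set out of the public training set,
# mirroring the idea behind MLE-bench's preparation scripts.
new_train, new_test = train_test_split(
    full_train, test_size=0.2, random_state=42
)

os.makedirs("prepared", exist_ok=True)
new_train.to_csv("prepared/train.csv", index=False)
new_test.to_csv("prepared/test.csv", index=False)
```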

You can set up the environment by installing Git-LFS, installing the package via pip, and preparing the dataset, which involves downloading the Kaggle data and splitting it for training and testing. The repository includes grading scripts for evaluating competition submissions in CSV format, as well as a Docker image that gives agents a consistent execution environment.
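To make the grading step concrete, a submission is just a CSV file the grading scripts can score. The sketch below builds a trivial baseline submission; the column names "id" and "prediction" are placeholders, since each competition defines its own submission schema:

```python
import pandas as pd

# Load the held-out test split produced during dataset preparation
# (path follows the hypothetical layout from the earlier sketch).
test = pd.read_csv("prepared/test.csv")

# Build a submission in CSV format. Real competitions specify their
# own required columns; these names are illustrative only.
submission = pd.DataFrame({
    "id": test["id"],
    "prediction": 0,  # a trivial constant baseline
})
submission.to_csv("submission.csv", index=False)
```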

Kaggle is Google’s data science competition platform and online community, serving as a hub for data scientists and machine learning practitioners.

OpenAI’s latest model, o1-preview, achieved notable results, performing at a medal-worthy level in 16.9% of the competitions. Still, it struggled with tasks requiring adaptability and creative problem-solving.

The o1 model family itself excels at reasoning, perhaps among the best models for such tasks, and also comes in a mini version. It’s currently available to paid ChatGPT users alongside GPT-4o and GPT-4o mini.
