What is MLPerf? Understanding AI’s Top Benchmark
A constantly evolving set of real-world AI tests pushes Intel experts to boost performance, level the playing field and make AI more accessible to all.
MLPerf is the set of AI benchmarks you’ve heard of but might not (yet) fully comprehend.
While I could find no unequivocal definition of the word itself – it was analogized as “SPEC for ML” in its May 2018 public debut – my handy AI assistant has a confident answer. It reads: “The name ‘MLPerf’ is a portmanteau derived from ‘ML’ for machine learning and ‘Perf’ for performance.”
It has more to say: “While there isn’t a detailed public story about the naming process, the name itself is quite descriptive and was likely chosen for its straightforward representation of the benchmarks’ purpose.” (Just the kind of supportive answer you’d expect from an AI built and refined by AI researchers.)
Real results illustrate the point: Just last week, Intel continued its run as the only vendor to submit server CPU results to MLPerf, this time measuring the performance of Intel® Xeon® 6 processors on common AI tasks like image inspection and information analysis.
The People and Processes Behind the AI Playoffs
“MLPerf is the No. 1 benchmark for AI right now,” says Ramesh Chukka, who works in Intel’s Data Center and AI software group.

Ramesh Chukka is an AI software engineering manager in Intel’s Data Center and AI software group and a member of the MLCommons consortium board.
Chukka represents Intel on the board of MLCommons, a consortium that formed late in 2020 to expand the initial MLPerf effort to “advance development of, and access to, the latest AI and machine learning datasets and models, best practices, benchmarks and metrics.”
MLPerf refers to the benchmarks themselves, which “evolve pretty quickly, as the technology does,” Chukka says, fulfilling that mission to advance the field through “rapid prototyping of new AI techniques.” Each benchmark measures how fast a particular AI job can be completed at a set level of quality.
The benchmarks are split into two major categories: training, where AI models are built from data; and inference, where trained models are run as applications. In terms of a large language model, aka LLM: Training is where the LLM learns from a corpus of information, and inference happens every time you ask it to do something for you.
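To make the distinction concrete, here is a minimal PyTorch sketch. The tiny linear model and random tensors are placeholders invented for illustration, not an MLPerf workload: the loop that updates weights is training, and the single forward pass at the end is inference.

```python
import torch
import torch.nn as nn

# A tiny placeholder model -- purely illustrative, not an MLPerf workload.
model = nn.Linear(16, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

# Training: the model learns by adjusting its weights based on labeled data.
for step in range(100):
    inputs = torch.randn(32, 16)          # stand-in for a real dataset
    labels = torch.randint(0, 2, (32,))
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), labels)
    loss.backward()                       # compute gradients
    optimizer.step()                      # update weights

# Inference: the trained model is simply run to answer a new query.
model.eval()
with torch.no_grad():
    prediction = model(torch.randn(1, 16)).argmax(dim=1)
```

MLPerf’s training benchmarks time how long loops like the first one take to reach a target quality, while its inference benchmarks measure how quickly and efficiently calls like the last one can be served.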
MLCommons publishes two sets of benchmark results a year for each of the two categories. For example, Intel most recently shared training results last June and inference results this month.
Intel AI experts have contributed to MLPerf (and thus MLCommons) since day one. Intel’s participation has always been twofold: helping shape and evolve the entire effort, while compiling and contributing benchmark results using Intel processors, accelerators and solutions.
The Problems MLPerf Benchmarks Solve
AI models are complicated programs, and a wide and growing variety of computers can run them. The MLPerf benchmarks are designed to allow for better comparisons of those computers while simultaneously pushing researchers and companies to advance the state of the art.
Each benchmark is intended to be as representative of real-world use as possible, and results land in one of two divisions. The “closed” division controls for the AI model and software stack to enable the cleanest possible hardware-to-hardware comparisons. In other words, every system runs the same application to achieve the same outcome (say, an accuracy target for natural language processing).
The “open” division allows for innovation: each system must achieve the same desired outcome but can push the performance envelope however it can.
What’s admirable about MLPerf is that everything is shared and the benchmarks are open source. Results must be reproducible; no mystery can remain. That openness allows for comparisons beyond raw side-by-side speed, such as performance relative to power draw or cost.
How MLPerf Works and Evolves
As Chukka mentioned, MLPerf retains its prominence in part by continuously evolving and adding new benchmarks. That process happens largely through open debate and discussion among the MLCommons community, which spans large companies, startups and academia.
New benchmarks are proposed and debated, and those approved need an open dataset for training, which may or may not already exist. Contributors volunteer to team up to build the benchmark, identify or gather data, and set a timeline for the benchmark’s release.
Any company that wants to publish results must meet the submission deadline for the next release; if it misses, it waits for the following round.
What the World Gets from Faster, More Efficient AI
While having more people around the world solving more problems with semiconductors carries an obvious big-picture benefit for Intel (not to mention more grist for the sales and marketing mill), there are other benefits to Intel’s participation in MLPerf.
Intel continually contributes to open source AI frameworks such as PyTorch and its extensions. As Intel engineers improve that code through their work to speed up MLPerf results, everybody downstream running those kinds of AI workloads on Intel chips gains the improvements without lifting a finger.
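As one hedged illustration of how that trickle-down works, assuming the openly published intel_extension_for_pytorch package and an off-the-shelf torchvision model as stand-ins (this is a generic usage sketch, not Intel’s MLPerf submission code), a downstream user can pick up those CPU optimizations with just a couple of extra lines:

```python
import torch
import torchvision.models as models
import intel_extension_for_pytorch as ipex  # Intel's open source PyTorch extension

# Any off-the-shelf model works here; resnet50 is just a stand-in example.
model = models.resnet50(weights=None).eval()

# ipex.optimize applies Intel CPU-specific graph and kernel optimizations.
model = ipex.optimize(model, dtype=torch.bfloat16)

# Run inference with bfloat16 autocast on the CPU.
with torch.no_grad(), torch.cpu.amp.autocast(dtype=torch.bfloat16):
    output = model(torch.randn(1, 3, 224, 224))
```

The heavy lifting happens inside the extension and PyTorch itself, which is exactly where MLPerf-driven optimizations land.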
“For new benchmarks, we always look at optimizations we can do,” Chukka says, “and we keep looking for the next couple of submissions.”
Chukka’s team pulls in helpers from across the company to build and improve Intel’s results, sometimes achieving dramatic gains from round to round (like an 80% improvement on a recommender inference result in 2024 and a 22% gain on the GPT-J benchmark this month).
So, every time you hear that Intel has published a new round of MLPerf results, you can delight in knowing that all kinds of AI systems just got faster and more efficient. Maybe even your favorite LLM, giving you quicker and more clever answers with every new prompt.
Jeremy Schultz is a writer and editor with Intel’s Global Communications and Events Team.
Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.
Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates. Visit MLCommons for more details. No product or component can be absolutely secure.