OpenAI Benchmarks Don't Measure Real AI Performance
Writing Team: Jun 11, 2026
An OpenAI researcher has argued that the single-number benchmark scores used to compare AI models are increasingly misleading for the newest, most advanced systems.
The problem runs deeper than one bad metric. The benchmarks we have were designed to test earlier AI capabilities: reading comprehension, basic math, simple reasoning. Frontier models like GPT-4 and Claude can do work that doesn't fit into those tidy test categories.
Key Points
- OpenAI researcher Noam Brown argues that current AI benchmarks increasingly understate frontier model capabilities because performance is becoming heavily dependent on inference compute at test time, rather than on the underlying model itself
- Brown's proposed fix: replace single-number scores with performance-vs-compute plots, with tokens, cost, or wall-clock time on the x-axis
- He pointed to GPT-5.5 as a concrete example — initial reactions were muted because benchmark improvements over GPT-5.4 appeared small, but those scores didn't reflect what the model could do given more inference budget
- Stanford's 2026 AI Index found that frontier models gained 30 percentage points in a single year on Humanity's Last Exam — a benchmark specifically designed to resist AI saturation — and that evaluations intended to be challenging for years are being saturated in months
- One response on X noted that Preparedness Frameworks and Responsible Scaling Policies should explicitly account for inference compute scaling when determining whether a model crosses a safety threshold
The number everyone uses to compare AI models may be measuring the wrong thing. And the person saying so works at the lab with the most to gain from the current system.
What Brown Said
Noam Brown is not a peripheral voice in this conversation. He co-created Libratus and Pluribus, the AI systems that beat professional poker players, and is credited on the launch and system card for o1, the model that introduced test-time compute scaling to mainstream AI deployment. When he says benchmarks are broken, it's worth understanding precisely what he means.
In a post on X, Brown said benchmark results are increasingly shaped by the amount of tokens, money, or wall-clock time spent on a task. As a result, single-number benchmark scores may no longer accurately capture the ceiling of modern AI systems' capabilities.
The underlying mechanics: modern reasoning models don't just generate an answer — they think before answering, running internal chains of reasoning that consume tokens and time before producing output. The more inference budget you give a model, the longer and more carefully it can reason. That means two runs of the same model on the same benchmark can produce meaningfully different scores depending on how much compute was allocated. A single number erases that variable entirely.
Brown's proposed solution is to replace scalar benchmark scores with performance-vs-inference-compute plots — performance on the y-axis, tokens, cost, or wall-clock time on the x-axis — which would show how a model improves as you spend more, rather than collapsing everything to a single point.
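To make the idea concrete, here is a minimal Python sketch of what the data behind such a plot could look like. Everything in it is an assumption for illustration: the token budgets, the pass/fail outcomes, and the `performance_curve` helper are invented, not taken from Brown's post or from any real benchmark.

```python
# Illustrative only: the budgets and outcomes below are made-up placeholder
# data, not results from any real model or benchmark.
from statistics import mean

# Hypothetical results: token budget -> per-task outcomes (1 = solved, 0 = not).
results_by_budget = {
    1_000: [0, 0, 1, 0, 0, 1, 0, 0],
    10_000: [1, 0, 1, 0, 1, 1, 0, 0],
    100_000: [1, 1, 1, 0, 1, 1, 1, 0],
}


def performance_curve(results):
    """Return (budget, accuracy) points sorted by budget.

    Budget is the x-axis value and accuracy is the y-axis value.
    """
    return [(budget, mean(outcomes)) for budget, outcomes in sorted(results.items())]


curve = performance_curve(results_by_budget)
for budget, accuracy in curve:
    print(f"{budget:>9,} tokens  accuracy={accuracy:.2f}")
```

A single-number report would publish only one of these rows. The curve keeps all of them, so a reader can see whether the score is still climbing at the largest budget tested.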
Why This Is a Real Problem, Not a Methodological Footnote
Brown has noted that after 100 million tokens, performance was still going up on at least one evaluation, writing on LinkedIn: "What we're seeing here is not the capability ceiling." That has direct implications for how AI progress is being read.
When GPT-5.5 launched, coverage was initially tepid because benchmark gains over its predecessor looked incremental. But those scores didn't capture what the model could do given more inference budget. The model wasn't underperforming — the measurement was underpowered. The muted reaction was a measurement artifact, not a capability signal.
The Stanford 2026 AI Index captures the broader pattern: evaluations intended to remain challenging for years are being saturated in months, compressing the window in which any given benchmark remains useful for tracking progress. The field is running through benchmarks faster than it's creating new ones. And now Brown is arguing that even the benchmarks still standing may be misreporting what models can actually do.
Research on multi-step cyber tasks has already documented this compute-dependence clearly: different models require vastly different token budgets to reach the same performance milestone, and some models plateau entirely regardless of how much inference compute is added, while others continue improving. A flat scalar score hides all of that.
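The two patterns described above, different budgets to reach the same milestone and curves that flatten out, can be read directly off a performance-vs-compute curve. The sketch below is a rough illustration in Python. The model names, the scores, the `budget_to_reach` and `has_plateaued` helpers, and the `min_gain` cutoff are all hypothetical choices, not figures from the research mentioned.

```python
# Hypothetical curves: (token budget, score) points invented for illustration.
curves = {
    "model_a": [(1_000, 0.20), (10_000, 0.45), (100_000, 0.70), (1_000_000, 0.85)],
    "model_b": [(1_000, 0.40), (10_000, 0.55), (100_000, 0.56), (1_000_000, 0.56)],
}


def budget_to_reach(curve, milestone):
    """Smallest measured budget whose score meets the milestone, else None."""
    for budget, score in sorted(curve):
        if score >= milestone:
            return budget
    return None


def has_plateaued(curve, min_gain=0.02):
    """True if the last budget increase improved the score by less than min_gain.

    min_gain is an arbitrary cutoff chosen for this sketch.
    """
    points = sorted(curve)
    if len(points) < 2:
        return False
    return points[-1][1] - points[-2][1] < min_gain


for name, curve in curves.items():
    print(name, budget_to_reach(curve, 0.5), has_plateaued(curve))
```

In this made-up data, model_b reaches a score of 0.5 with a tenth of the tokens model_a needs, but it stops improving, while model_a keeps climbing. A scalar score taken at any one budget would rank the two models differently depending on where the measurement happened to land.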
The Safety Dimension No One Is Talking About Loudly Enough
One response to Brown's post made an observation that deserves more attention than it received: Preparedness Frameworks and Responsible Scaling Policies should explicitly account for inference compute scaling when determining whether a model crosses a safety threshold.
This is not an abstract concern. Labs use benchmark thresholds to decide when a model requires additional safety review before deployment. If those thresholds are defined against scalar scores that don't reflect what a model can do at higher inference budgets, a model could pass a safety evaluation at standard inference and then exhibit significantly different capabilities when given more compute. The evaluation and the deployment condition would be measuring different things.
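A small Python sketch shows how that mismatch could arise. It does not describe how any lab actually runs its safety evaluations: the threshold, the budgets, the scores, and the `first_crossing` helper are all assumptions made up for this example.

```python
# Hypothetical numbers only: not drawn from any lab's framework or evaluation.
THRESHOLD = 0.50          # assumed score at which extra safety review is triggered
STANDARD_BUDGET = 10_000  # assumed token budget used for the official evaluation

# Assumed scores for one model on one capability evaluation, keyed by token budget.
scores_by_budget = {10_000: 0.30, 100_000: 0.48, 1_000_000: 0.62}


def first_crossing(scores, threshold):
    """Smallest tested budget at which the score meets the threshold, else None."""
    for budget in sorted(scores):
        if scores[budget] >= threshold:
            return budget
    return None


triggers_at_standard = scores_by_budget[STANDARD_BUDGET] >= THRESHOLD
crossing_budget = first_crossing(scores_by_budget, THRESHOLD)

print(f"Triggers review at standard budget: {triggers_at_standard}")
print(f"First tested budget that crosses the threshold: {crossing_budget}")
```

With these invented numbers, the model stays under the threshold at the standard budget and crosses it only at the largest budget tested. An evaluation that checks a single budget would never see the crossing.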
Brown has previously noted that the recipe behind today's frontier reasoning models closely follows the AlphaGo playbook: imitate human data, scale inference compute to reason better, then use reinforcement learning to go beyond imitation. The scaling at inference time is not a feature of specific models — it's structural to how these systems work. Benchmarks that treat inference budget as fixed are benchmarking a different system than the one being deployed.
What This Means if You're Evaluating AI Tools
For marketing leaders and growth teams using AI in their operations, the benchmark problem has a practical translation. When a vendor or a tech publication tells you Model X scored Y on benchmark Z, that number now carries an invisible asterisk: at what inference cost? With how many tokens? Under what time constraints?
This doesn't mean benchmark scores are useless. It means they're a floor, not a ceiling. A model that scores modestly on a standard benchmark may perform considerably better when given more budget to think. That gap matters when you're choosing tools, setting expectations for what a model can and can't do, and evaluating whether the performance you're seeing in production reflects the model's actual capability or just the default inference settings.
The practical upshot is that evaluation loops inside your organization matter more than the launch-day benchmark table. How a model performs on your tasks, with your context, at the compute budget you're actually running — that's the number worth measuring. Our AI strategy and services team helps organizations build evaluation frameworks grounded in production performance rather than press release scores. If you want to understand what AI tools actually do in your workflow, that conversation starts here.
For ongoing coverage of what the benchmark numbers are actually saying — and what they're not — follow A-Eye Spy on the Winsome Marketing blog.

