
CritPt Benchmark: No AI Model Breaks 10% on Graduate-Level Physics Problems

We just got a brutal reality check on how far AI still has to go.

Artificial Analysis has published leaderboard results for CritPt (Complex Research using Integrated Thinking - Physics Test), a new benchmark on which no model breaks 10% accuracy. Not 90%. Not 50%. Not even ten percent. Google's new Gemini 3 Pro Preview—currently the strongest performer—manages just 9.1%. Many models fail to solve a single problem even given five attempts.

This isn't a trick question exam or deliberately adversarial testing. It's graduate-level physics research—the kind of problems a capable junior PhD student could tackle as a standalone project. And the world's most advanced AI systems are essentially failing.

What CritPt Actually Tests

Developed by over 60 researchers from more than 30 institutions, including Argonne National Laboratory and the University of Illinois Urbana-Champaign, CritPt represents a fundamentally different category of AI evaluation.

The benchmark includes 70 end-to-end research challenges covering 11 physics subdomains: condensed matter, quantum physics, AMO (atomic, molecular, and optical) physics, astrophysics, high-energy physics, mathematical physics, statistical physics, nuclear physics, nonlinear dynamics, fluid dynamics, and biophysics.

Each challenge is designed to be feasible for a capable junior PhD student as a standalone project but unseen in publicly available materials. This distinction matters critically—models can't pattern-match against training data or rely on memorized solutions from arXiv papers or physics textbooks.

The problems require deep understanding and genuine reasoning in frontier physics. They test whether AI can actually do research-level work, not just summarize existing knowledge.

The Leaderboard Reality

The results are humbling:

  • Gemini 3 Pro Preview (Google): 9.1%
  • Perplexity R1-91B: 4.9%
  • xAI Grok 3: 2.9%
  • Gemini 2.0 Pro: 2.6%
  • Anthropic Claude 3.7 Opus: 2.6%
  • Grok 4: 2.0%

Everything below that drops to 1.4% or lower. Multiple models—including versions of DeepSeek, Claude, and Alibaba's Qwen—score 0.0%: a complete failure to solve even a single problem correctly.

This stands in stark contrast to the near-perfect scores models achieve on standardized tests like MMLU or even challenging benchmarks like GPQA. On those evaluations, frontier models routinely hit 85-95% accuracy. CritPt reveals that the margin between "very good at answering questions" and "capable of doing actual research" is enormous.

Why This Benchmark Matters

CritPt differentiates itself from other reasoning tests in several critical ways:

True frontier evaluation

Questions and answers are written and tested by experts—postdocs and physics professors—in their specific subfields. This isn't undergraduate material packaged as "hard." It's actual research-grade physics.

Reflective of research assistant capabilities

The explicit design goal was modeling what a competent research assistant could accomplish. If AI is going to accelerate scientific discovery, it needs to perform at this level. The benchmark shows we're nowhere close.

Independently verified and feasible

Every problem has been confirmed solvable by human experts. The low AI scores aren't because the questions are impossible—they're because current models lack the reasoning depth required.

No tool use allowed in initial testing

Models are evaluated on pure reasoning capability without access to calculators, simulators, or external knowledge retrieval. This tests genuine understanding rather than orchestration skills.
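
To make that setup concrete, here is a minimal sketch of what such a tool-free, multi-attempt evaluation loop could look like. The structure and the function names (query_model, grade_answer) are hypothetical illustrations for this article, not CritPt's actual harness or API.

```python
# Hypothetical sketch of a tool-free evaluation loop: the model sees only the
# problem statement, gets no calculators, simulators, or retrieval, and has a
# fixed number of attempts per problem. query_model() and grade_answer() are
# placeholder callables, not a real API.

MAX_ATTEMPTS = 5

def evaluate(problems, query_model, grade_answer):
    """Return the fraction of problems solved within MAX_ATTEMPTS tries each."""
    solved = 0
    for problem in problems:
        for _ in range(MAX_ATTEMPTS):
            answer = query_model(problem["statement"])      # pure reasoning, no tool calls
            if grade_answer(answer, problem["reference"]):  # checked against expert-verified solution
                solved += 1
                break
    return solved / len(problems)
```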

What the Scores Actually Mean

A 9.1% accuracy rate means Gemini 3 Pro Preview solved roughly 6 of the 70 problems. Given five attempts per problem (a generous allowance), many models couldn't crack even one.
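
As a rough back-of-the-envelope check, the sketch below works through what those percentages imply. It assumes accuracy is simply solved problems divided by the 70 challenges and treats the five attempts as independent tries with a fixed per-attempt success rate; both are simplifying assumptions for illustration, not the benchmark's official scoring rules.

```python
# Rough arithmetic behind the reported scores (simplifying assumptions:
# accuracy = solved / total, and the five attempts per problem are
# independent with a fixed per-attempt success rate).

TOTAL_PROBLEMS = 70

def problems_solved(accuracy: float, total: int = TOTAL_PROBLEMS) -> float:
    """Approximate number of problems behind a reported accuracy."""
    return accuracy * total

def pass_at_k(per_attempt_rate: float, k: int = 5) -> float:
    """Probability of solving a problem at least once in k independent attempts."""
    return 1 - (1 - per_attempt_rate) ** k

print(f"9.1% of 70 problems ≈ {problems_solved(0.091):.1f}")  # ~6.4 problems
print(f"2.6% of 70 problems ≈ {problems_solved(0.026):.1f}")  # ~1.8 problems
print(f"pass@5 with a 1% per-attempt rate ≈ {pass_at_k(0.01):.3f}")  # ~0.049
```

Even granting a model a 1% chance of success per attempt, five tries still leave roughly a 95% chance of missing any given problem, which is consistent with the string of 0.0% scores on the leaderboard.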

For context, a human PhD student wouldn't score 100% on CritPt either—these are challenging problems even for domain experts. But a competent graduate researcher would likely score 30-50%, potentially higher in their specialization area.

The gap between 9% and 30% represents the difference between "occasionally gets lucky" and "demonstrates consistent research-level capability." That gap is massive.

The Sobering Implications

This benchmark arrives amid intense hype about AI's reasoning capabilities. We've seen impressive results on math olympiads, coding challenges, and standardized tests. Companies are positioning their models as research assistants, scientific discovery tools, and intellectual collaborators.

CritPt suggests we should pump the brakes on that narrative.

The problems models excel at—even difficult ones like competitive programming or theorem proving—tend to have clear problem structures, well-defined solution spaces, and abundant training examples of similar problems. Graduate-level physics research has none of those properties.

Real research requires:

  • Synthesizing knowledge across multiple subdomains
  • Identifying which theoretical frameworks apply to novel situations
  • Making approximations and simplifying assumptions appropriately
  • Recognizing when standard approaches won't work
  • Generating creative problem-solving strategies

Current models demonstrate flashes of these capabilities on narrow tasks. CritPt shows they can't sustain them across research-grade problems.

The Benchmark's Pedigree

The credibility behind CritPt matters. The development team includes researchers who previously worked on SciCode and SWE-Bench—two benchmarks that have become standard references for evaluating AI coding and scientific reasoning capabilities.

The institutional backing—Argonne National Laboratory, UIUC, and dozens of other research institutions—signals this isn't a vanity project or marketing exercise. It's a serious attempt by the research community to measure whether AI can actually contribute to frontier science.

The fact that they're publishing results showing single-digit accuracy takes courage. It would have been easier to tune the difficulty until models looked impressive. Instead, they calibrated to "feasible for a PhD student" and let the chips fall where they may.

What This Means for AI Development

For AI labs, CritPt identifies a clear capability gap. Frontier models have made remarkable progress on reasoning tasks with clear structure and feedback signals. Physics research—particularly the open-ended, creative aspects—remains largely out of reach.

The path forward likely requires:

  • Better long-context reasoning that maintains coherence across multi-step derivations
  • Improved physical intuition and domain knowledge integration
  • More sophisticated uncertainty quantification (knowing what you don't know)
  • Enhanced ability to verify solutions and catch errors before committing

Some of these are active research areas. Others remain unsolved problems in AI safety and alignment.

The Research Assistant Reality Check

Many organizations are deploying AI tools marketed as "research assistants" or "scientific discovery platforms." CritPt suggests we should be honest about their current limitations.

AI can absolutely accelerate research in narrow ways: literature review, code debugging, routine calculations, hypothesis generation. But replacing a capable PhD student? We're not remotely there yet.

The 9% benchmark makes this quantifiable. When hiring research assistants, you don't settle for someone who solves fewer than one problem in ten. You shouldn't accept that from AI tools either—at least not if you're claiming they can do research-level work.

The Bottom Line

CritPt does what the best benchmarks do: It reveals capability gaps we didn't know existed and couldn't measure before. Frontier AI models are genuinely impressive on many tasks. Graduate-level physics research is not yet one of them.

The single-digit scores aren't embarrassing—they're clarifying. They tell us where to focus development effort. They help calibrate expectations about AI's current role in scientific research. They provide a target for measuring progress.

And when a model finally breaks 50% on CritPt, we'll know something fundamental has changed about AI's reasoning capabilities. Until then, the physics PhDs can rest easy. Their jobs remain secure.


If you're evaluating AI tools for research, scientific applications, or complex reasoning tasks and need guidance on realistic capability assessment versus marketing claims, Winsome Marketing's team can help you separate signal from hype.
