OpenAI Says Its Models Are Getting Scary Good at Hacking. Here's Their Plan.

OpenAI published a transparency update this week acknowledging what security researchers have been quietly discussing for months: their models are getting legitimately dangerous at cybersecurity tasks. Performance on capture-the-flag challenges jumped from 27% on GPT-5 in August 2025 to 76% on GPT-5.1-Codex-Max by November. That's not incremental improvement. That's a capability phase shift in under four months.

The company now expects future models to reach what they call "High" cybersecurity capability—meaning models that could develop working zero-day exploits against defended systems or assist with complex intrusion operations. Their response: a multi-layered defense strategy combining model training, detection systems, red teaming, and ecosystem partnerships.

The question isn't whether OpenAI is taking this seriously. They clearly are. The question is whether any of this will matter when a sufficiently motivated adversary decides to test it.

OpenAI's Layered Safeguards: Training, Detection, and Access Controls

OpenAI's approach relies on what they call "defense-in-depth"—training models to refuse harmful requests while staying useful for defenders, building detection systems to catch malicious activity, implementing access controls and egress monitoring, and partnering with red team organizations to stress-test the whole stack.
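
To make "defense-in-depth" concrete, here is a minimal sketch of how a layered screening pipeline can be wired together, where a request succeeds only if every layer independently allows it. The layer names, thresholds, and fields below are hypothetical assumptions for illustration; OpenAI hasn't published implementation details.

```python
# Illustrative sketch of a layered "defense-in-depth" request pipeline.
# Every layer name, threshold, and field here is a hypothetical assumption
# for explanation purposes, not OpenAI's actual system.

from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    REFUSE = "refuse"
    ESCALATE = "escalate"   # route to human review


@dataclass
class Request:
    prompt: str
    user_tier: str      # e.g. "public" or "trusted_access"
    risk_score: float   # assumed output of an upstream abuse classifier


def refusal_layer(req: Request) -> Decision:
    """Layer 1: model-side policy that refuses clearly harmful asks."""
    return Decision.REFUSE if req.risk_score > 0.9 else Decision.ALLOW


def detection_layer(req: Request) -> Decision:
    """Layer 2: automated monitoring that escalates ambiguous activity."""
    return Decision.ESCALATE if req.risk_score > 0.6 else Decision.ALLOW


def access_layer(req: Request) -> Decision:
    """Layer 3: gate sensitive capability behind vetted access tiers."""
    if req.risk_score > 0.4 and req.user_tier != "trusted_access":
        return Decision.REFUSE
    return Decision.ALLOW


def screen(req: Request) -> Decision:
    """Run layers in order; the first non-ALLOW decision wins."""
    for layer in (refusal_layer, detection_layer, access_layer):
        decision = layer(req)
        if decision is not Decision.ALLOW:
            return decision
    return Decision.ALLOW


if __name__ == "__main__":
    print(screen(Request("analyze this crash dump", "trusted_access", 0.3)))
```

The design point is that no single layer has to be perfect: refusal training, detection, and access controls each catch what the others miss.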

They're also launching Aardvark, an agentic security tool in private beta that scans codebases for vulnerabilities and proposes patches. It has already surfaced novel vulnerabilities in open-source projects that have since been assigned CVEs. OpenAI plans to offer free coverage to select non-commercial repositories, which is genuinely useful for the ecosystem.
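
OpenAI hasn't published Aardvark's internals, but as a rough mental model, an agentic scanner's outer loop might look something like the sketch below: flag candidate issues, try to confirm them, then draft a patch for human review. Every function here is a hypothetical stub, not Aardvark's actual design.

```python
# Toy outer loop for an agentic code scanner: flag issues, confirm them,
# draft patches for human review. Every function is a hypothetical stub;
# this is not Aardvark's actual design, which OpenAI has not published.

from typing import NamedTuple


class Finding(NamedTuple):
    path: str
    issue: str
    proposed_patch: str


def flag_issues(source: str) -> list[str]:
    """Stand-in for a model call that flags suspicious patterns."""
    # Naive placeholder heuristic: flag an obvious command-injection risk.
    return ["possible shell injection"] if "os.system(" in source else []


def confirm(path: str, issue: str) -> bool:
    """Stand-in for sandboxed validation; a real tool would reproduce the bug."""
    return True


def draft_patch(path: str, issue: str) -> str:
    """Stand-in for a model call that proposes a fix for human review."""
    return f"# TODO in {path}: replace os.system with subprocess.run"


def scan(files: dict[str, str]) -> list[Finding]:
    """Scan {path: source} pairs and return confirmed findings with draft patches."""
    findings = []
    for path, source in files.items():
        for issue in flag_issues(source):
            if confirm(path, issue):
                findings.append(Finding(path, issue, draft_patch(path, issue)))
    return findings


if __name__ == "__main__":
    repo = {"deploy.py": "import os\nos.system('rm -rf ' + user_input)\n"}
    for finding in scan(repo):
        print(f"{finding.path}: {finding.issue}")
```

The interesting engineering lives in the stubs: how aggressively the model flags, how findings get validated before they reach a maintainer, and who reviews the proposed patch.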

Then there's the Trusted Access Program—tiered access to enhanced capabilities for qualifying users working on cyberdefense. OpenAI admits they're still figuring out where to draw the line between broad access and restricted features. Translation: they don't yet know which capabilities are safe to release widely and which require vetting.

The Dual-Use Problem: Same Knowledge, Different Intentions

Here's where it gets messy. OpenAI acknowledges that "defensive and offensive cyber workflows often rely on the same underlying knowledge and techniques." You can't build a model that's brilliant at finding vulnerabilities for defenders without also making it brilliant at finding vulnerabilities for attackers. The difference is intention, and intention is notoriously difficult to verify at scale.

Their safeguards assume they can distinguish legitimate security research from malicious activity through behavioral signals, automated monitoring, and human review. Maybe they can. But security history is littered with examples of systems that worked perfectly until someone found the edge case nobody anticipated.

OpenAI is also establishing a Frontier Risk Council—experienced cyber defenders who'll advise on the boundary between useful capability and potential misuse. They're working with other labs through the Frontier Model Forum to develop shared threat models. These are smart moves. They're also moves that assume collaboration and transparency will scale faster than adversarial innovation.

What OpenAI Isn't Saying About AI Cyber Risk

What OpenAI doesn't address: how they'll handle state-sponsored actors with resources to bypass detection systems, exfiltrate model weights, or simply build competing models without safeguards. They acknowledge that "no system can guarantee complete prevention of misuse in cybersecurity without severely impacting defensive uses," which is honest. It's also a tacit admission that some level of offensive capability will leak.

The company emphasizes they're building systems that "assume change" and can "adjust quickly and appropriately." That's good engineering practice. It's not a guarantee. Cybersecurity doesn't reward good intentions—it rewards what actually works under adversarial pressure.

The Real Test: Will Defenders Get Enough Advantage?

OpenAI's stated goal is to give defenders "significant advantages" over attackers who are "often outnumbered and under-resourced." If Aardvark actually scales and helps patch vulnerabilities faster than new ones emerge, that's valuable. If the Trusted Access Program creates a meaningful capability gap favoring defense, that's useful. If their detection systems catch malicious activity before it causes damage, that's progress.

But all of this depends on execution under conditions OpenAI can't fully control. They're betting that layered safeguards, ecosystem collaboration, and continuous iteration will stay ahead of threat actors who only need to find one gap. History suggests that's an optimistic bet.

The honest assessment: OpenAI is doing more than most labs to address dual-use cyber risks, and their transparency about capability growth is commendable. Whether their safeguards will hold against determined adversaries is a different question—one we won't know the answer to until someone actually tries.

If you're evaluating AI security risks for your organization and need help separating theater from substance, Winsome's growth team can walk you through what actually matters.
