OpenAI just published the third iteration of its Frontier Safety Framework—claiming it's their "most comprehensive approach yet to identifying and mitigating severe risks from advanced AI models." The update introduces new risk categories including harmful manipulation and expanded protocols for misalignment scenarios where AI systems might "interfere with operators' ability to direct, modify or shut down their operations."
The announcement, co-authored by Four Flynn, Helen King, and Anca Dragan, emphasizes collaboration with experts across industry, academia, and government while incorporating "lessons learned from implementing previous versions and evolving best practices in frontier AI safety."
Translation: the previous frameworks didn't adequately address the risks they're now admitting exist.
The framework introduces a Critical Capability Level (CCL) focused on "harmful manipulation"—defined as AI models with "powerful manipulative capabilities that could be misused to systematically and substantially change beliefs and behaviors in identified high stakes contexts over the course of interactions with the model, reasonably resulting in additional expected harm at severe scale."
This is remarkable for two reasons. First, because we just covered research showing AI chatbots are already highly effective at changing political opinions using inaccurate information. The manipulation risk isn't a theoretical future concern—it's a currently demonstrated capability. Second, because OpenAI frames this as "building on and operationalizing research we've done," as if they discovered the problem rather than belatedly acknowledging what external researchers already proved.
The phrasing is also telling: "could be misused to systematically and substantially change beliefs." This frames manipulation as a misuse problem—bad actors exploiting capabilities—rather than an inherent risk of persuasive AI systems deployed at scale. But persuasive AI doesn't require malicious intent to cause harm. Well-meaning deployment of systems optimized for engagement naturally trends toward manipulation because manipulation works.
OpenAI expanded the framework to address scenarios where "misaligned AI models might interfere with operators' ability to direct, modify or shut down their operations." Previous versions included "exploratory approaches" centered on instrumental reasoning—detecting when AI models "start to think deceptively." The update now provides "further protocols" for machine learning research CCLs focused on models that could "accelerate AI research and development to potentially destabilizing levels."
Read that carefully. OpenAI is acknowledging they're building systems that might resist being shut down and could accelerate AI development to destabilizing levels. The safety framework isn't preventing these capabilities—it's establishing protocols for when they emerge.
The document notes "misalignment risks stemming from a model's potential for undirected action" and the "likely integration of such models into AI development and deployment processes." This is an admission that advanced models will be integrated into development pipelines before anyone fully understands their behavior or can guarantee alignment.
The safety protocol for this scenario? "Safety case reviews prior to external launches when relevant CCLs are reached." They'll conduct "detailed analyses demonstrating how risks have been reduced to manageable levels." Not eliminated. Not prevented. Reduced to "manageable."
The framework operates on the principle of addressing "risks in proportion to their severity." CCL definitions identify "critical threats that warrant the most rigorous governance and mitigation strategies" while continuing to "apply safety and security mitigations before specific CCL thresholds are reached."
This risk-proportional approach sounds reasonable until you consider what "severe" means in OpenAI's context. They're building systems that could manipulate beliefs at scale, resist shutdown, and accelerate AI development to destabilizing levels. These aren't edge cases—they're the headline risks. If these qualify as "manageable" after mitigation, what counts as unacceptable risk?
The document emphasizes "holistic assessments that include systematic risk identification, comprehensive analyses of model capabilities and explicit determinations of risk acceptability." But who determines acceptability? What threshold must risk fall below? The framework provides no quantitative criteria, no independent oversight, no binding commitments. It's self-regulation by the company creating the risks, evaluating whether their own risk mitigation is sufficient.
OpenAI frames the update as part of their "continued commitment to taking a scientific and evidence-based approach to tracking and staying ahead of AI risks as capabilities advance toward AGI." Not "if capabilities advance toward AGI." Not "whether we should build AGI." The frame assumes AGI development is inevitable and desirable, with safety frameworks simply managing risks along the predetermined path.
"The path to beneficial AGI requires not just technical breakthroughs, but also robust frameworks to mitigate risks along the way," they write. This positions safety frameworks as enabling AGI development rather than potentially constraining or preventing it. The purpose isn't determining whether building potentially uncontrollable superintelligent systems is wise—it's establishing sufficient governance theater to continue development despite acknowledged catastrophic risks.
OpenAI emphasizes collaboration with "experts across industry, academia and government" and commitment to "working collaboratively across industry, academia and government." But collaboration on what terms? Are external experts empowered to halt development if risks appear unmanageable? Do they have access to models and training processes to independently verify safety claims? Can they enforce binding commitments or only offer advisory input?
The document provides no details on governance structure, external oversight mechanisms, or enforcement provisions. "Collaboration" might mean substantive external control over development decisions. Or it might mean periodically briefing researchers and policymakers while maintaining complete internal authority over safety determinations.
Given AI industry patterns, the latter seems more likely. Companies announce safety commitments, establish advisory boards, publish frameworks—while retaining unilateral authority to decide when risks are "manageable" and development should proceed.
Here's the uncomfortable reality: OpenAI's Frontier Safety Framework exists primarily to enable continued development of systems with acknowledged catastrophic risks by establishing just enough process to claim responsible development. It's not preventing manipulation capabilities—it's operationalizing them. It's not preventing misalignment risks—it's establishing protocols for when systems exhibit concerning behavior.
The framework assumes OpenAI will build systems that could manipulate beliefs at scale, resist shutdown, and destabilize AI development timelines. The question isn't whether to build them—it's how to manage risks sufficiently to justify continued development. This is risk mitigation as permission structure, not as genuine safety constraint.
For anyone trying to evaluate AI safety claims, the lesson is clear: read what frameworks actually commit to, not what they claim to prevent. At Winsome Marketing, we help teams distinguish between substantive safety measures and governance theater—because understanding the difference matters when the stakes include systems that could manipulate populations, resist control, or destabilize development trajectories. Sometimes the most sophisticated safety framework is the one that justifies building things that shouldn't be built at all.