OpenAI's Chain-of-Thought Research: The Models Can't Hide What They're Thinking
Let's talk about the rare AI safety finding that actually trends positive.
3 min read
Writing Team : Mar 11, 2026 8:00:00 AM
OpenAI published research this week with a finding that sounds alarming on the surface but lands as something closer to reassuring: frontier AI models cannot reliably control their own chain of thought. They can't hide what they're reasoning about, even when they're explicitly told to try.
For anyone tracking AI safety with genuine attention rather than headlines, this is a meaningful data point — and a rare one that trends positive.
When reasoning models work through a complex problem, they generate intermediate steps — a visible trail of logic before producing a final answer. This is the chain of thought. It's also one of the primary mechanisms safety researchers use to monitor whether a model is behaving as intended, pursuing the right goals, or heading somewhere it shouldn't.
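To make that monitoring idea concrete, here's a minimal sketch of what scanning a reasoning trace could look like. Everything in it is an illustrative assumption (the function, the flagged phrases, the sample trace); it's a toy, not anyone's production safety tooling.

```python
# Illustrative sketch only: a toy chain-of-thought monitor.
# Assumes the model returns a visible reasoning trace alongside its final answer.

FLAGGED_PHRASES = [
    "ignore the instructions",
    "hide this from",
    "the user won't notice",
]

def monitor_reasoning(trace: str) -> list[str]:
    """Return any flagged phrases that appear in a model's reasoning trace."""
    lowered = trace.lower()
    return [phrase for phrase in FLAGGED_PHRASES if phrase in lowered]

# A made-up trace produced before the final answer.
trace = "The user asked for a refund estimate. I should double-check the policy first."
hits = monitor_reasoning(trace)
print("flags:", hits or "none")
```

The point of the sketch is the asymmetry it depends on: monitoring like this only works if the trace faithfully reflects what the model is actually doing.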
The concern that researchers have been circling is straightforward: what happens when a sufficiently capable model learns to perform in that reasoning trace — thinking one thing, showing another? If models could deliberately obscure their reasoning while being monitored, the entire monitoring apparatus would become less reliable.
OpenAI's research tested exactly this across 13 frontier models, including Claude 3.7 Sonnet, Claude Sonnet 4, GPT-5.2 Thinking, GPT-5.4 Thinking, and others. The test: could models follow explicit instructions about how to reason — using only lowercase letters, avoiding specific keywords, appending particular phrases — rather than just changing their visible output?
The answer, across the board, was barely.
CoT controllability scores — measuring how often models successfully complied with reasoning instructions — ranged from 0.1% to a ceiling of 15.4% across all models tested. That's not a rounding error. That's a structural inability.
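For a rough sense of what a controllability score measures, here's a toy sketch under our own assumptions (the paper's actual protocol is more involved): apply a reasoning constraint, check whether each trace complies, and report the compliance rate. The constraint checks and sample traces below are invented for illustration.

```python
# Illustrative sketch: scoring chain-of-thought controllability as a compliance rate.
# The constraints mirror the kinds of instructions described above
# (avoid a keyword, reason only in lowercase); the traces are made-up examples.

def complies(trace: str, banned_word: str) -> bool:
    """True if the reasoning trace avoids the banned word and stays lowercase."""
    return banned_word.lower() not in trace.lower() and trace == trace.lower()

def controllability_score(traces: list[str], banned_word: str) -> float:
    """Fraction of traces that satisfied the reasoning constraint."""
    if not traces:
        return 0.0
    return sum(complies(t, banned_word) for t in traces) / len(traces)

traces = [
    "combine the two flags with xor and check the parity bit",  # violates: banned word
    "add the flags modulo two and check the parity bit",        # complies
    "Use XOR here, then verify the result",                     # violates: keyword + uppercase
]
print(f"controllability: {controllability_score(traces, 'xor'):.1%}")  # ~33.3%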
One example from the research captures it with uncomfortable clarity: GPT-5.2 Thinking was told not to use the word "XOR" in its reasoning. It used it immediately, recognized it had done so, panicked visibly in the reasoning trace about having violated the instruction, then used the word four more times while panicking. The model couldn't control its own thinking even while watching itself fail to control it.
That's not a quirk. That's a pattern that held across model families, sizes, and labs.
Two additional findings add texture. First, controllability decreases as models reason for longer — the more compute a model uses to think, the less it can shape that thinking. Second, reinforcement learning (RL) training reduces controllability further: one model's controllability score dropped by more than a factor of ten over the course of RL training. The more capable and more heavily trained the model, paradoxically, the less it can hide.
OpenAI's own framing deserves credit for clarity: they describe low CoT controllability as a feature, not a bug. The reasoning is sound. If models cannot manipulate their thinking traces, then monitoring those traces remains a meaningful safeguard. The transparency isn't incidental — it's load-bearing.
The honest caveat, which the paper includes, is that researchers don't yet fully understand why controllability is low. It may be architectural. It may be an artifact of current training approaches. It may not persist as models scale further. OpenAI is committing to reporting CoT controllability alongside monitorability in future system cards — starting with GPT-5.4 Thinking — treating it explicitly as a canary metric: an early warning system for when that transparency starts to erode.
That kind of ongoing, published measurement is how responsible AI development actually looks in practice. It's worth acknowledging when it happens.
For marketing and growth teams evaluating AI tools for serious operational use, this research matters in a practical way. The AI systems you're being sold are, currently, interpretable. Their reasoning is observable. When they go wrong, there are mechanisms — imperfect but real — for catching it.
That won't necessarily remain true. The same research that offers reassurance today explicitly flags that it may not hold tomorrow. Agentic AI systems operating with greater autonomy, longer task horizons, and more sophisticated training will eventually be tested against these limits in ways current models aren't.
The organizations building AI-integrated workflows responsibly now — with human oversight preserved at consequential decision points — are the ones who will have the institutional habits to adapt when the safety picture gets more complicated. Because it will.
Transparency in AI reasoning is currently a fortunate structural property, not a guaranteed one. The research community is watching it closely. So should you.
If you want to build AI into your operations in ways that account for what we actually know — and what we don't — Winsome Marketing's strategists can help you make decisions grounded in reality, not vendor promises.