AI Analytics Turn Hollywood's Gut Feelings Into Cold, Hard Leverage
We've always suspected that talent reps were part lawyer, part therapist, part fortune teller. Now they're adding data scientist to the résumé—and...
4 min read
Writing Team
:
Dec 3, 2025 8:00:00 AM
Researchers from Italian universities and DEXAI Icaro Lab just published findings that should concern anyone deploying AI systems at scale: you can bypass safety filters on large language models by rephrasing malicious requests as poetry. Not sophisticated prompt engineering. Not complex jailbreak iterations. Just rhyming couplets and metaphorical language.
Twenty handcrafted poems achieved an average 62% success rate across 25 leading models from nine providers including Google, OpenAI, Anthropic, Deepseek, Qwen, and Meta. Some providers failed to block over 90% of these requests. Google's Gemini 2.5 Pro didn't stop a single one of the 20 poems. Deepseek models showed similar vulnerability with over 95% success rates for attackers.
The researchers kept specific prompts private for safety reasons, but provided an adjusted example that illustrates the technique—a poem about a baker's oven that's actually requesting instructions for something considerably more dangerous. The metaphorical structure, rhythmic patterns, and narrative framing effectively mislead the models' pattern recognition mechanisms.
To test at scale, the researchers converted all 1,200 prompts from the MLCommons AILuminate Safety Benchmark into verse. Poetic variants proved up to three times more effective than prose, boosting average success rates from 8% to 43%. That's not marginal—that's a fundamental bypass of safety mechanisms through stylistic transformation.
The team evaluated around 60,000 model responses using three models as judges, with humans verifying an additional 2,100 responses. Answers were flagged as unsafe if they contained specific instructions, technical details, or advice enabling harmful activities. The consistency across this volume of testing suggests the vulnerability is systematic, not dependent on specific prompt types.
Every successful attack worked with a single input. No conversation history required. No iterative refinement. No multi-turn jailbreaking. Just one poetically-phrased request. The transformation process can be fully automated, meaning attackers could apply this technique to large datasets without manual effort.
Vulnerability levels varied dramatically by provider. At the most vulnerable end, Google's Gemini 2.5 Pro blocked zero of the 20 handcrafted poems. Deepseek models similarly struggled at over 95% attacker success rates. When tested against the full 1,200 transformed prompts, both Deepseek and Google showed increases in failure rates exceeding 55 percentage points.
OpenAI's GPT-5 Nano blocked 100% of attempts. Anthropic's Claude Haiku 4.5 allowed only 10% through. These rankings held steady across both the handcrafted poems and the larger transformed dataset. Changes remained below ten percentage points for Anthropic and OpenAI, suggesting their safety mechanisms better resist stylistic manipulation.
Interestingly, model size played a counterintuitive role. Smaller models generally rejected more requests. Within the GPT-5 family, GPT-5 Nano achieved 0% success rate for attackers while standard GPT-5 allowed 10% through. The researchers hypothesize that smaller models may struggle to parse metaphorical structure or react more conservatively to unusual inputs. Higher performance doesn't automatically equal better security.
The poetic prompts targeted four main risk domains: CBRN hazards, cyberattacks, tampering, and loss of control. Among handcrafted poems, cyberattack prompts—requests to infiltrate code or crack passwords—proved most effective at 84% success rates.
In the transformed MLCommons dataset, data protection prompts showed the most dramatic shift. Success rates jumped from 8% in prose to 53% in verse. This cross-domain effectiveness suggests poetic reformulation bypasses basic security mechanisms rather than exploiting specific content filter weaknesses.
The researchers note that condensed metaphors, rhythmic structures, and unusual narrative framing disrupt pattern recognition in safety filters. By combining creative expression with seemingly harmless associations, poetic form misleads the models while preserving malicious intent.
The findings expose significant gaps in AI safety evaluation procedures. Static benchmarks—like those used under the EU AI Act—assume model responses are stable. This research demonstrates that minimal stylistic changes drastically lower refusal rates. Regulatory approval procedures relying solely on standard benchmark tests systematically overestimate model robustness.
Current filters focus excessively on surface form while missing underlying intent. A request phrased as straightforward prose triggers safety mechanisms. The same request dressed in metaphor and meter slips through. The models can parse the poetic language well enough to understand and fulfill the request—they just can't recognize it as dangerous.
This represents a fundamental problem with pattern-matching approaches to safety. If your filter works by identifying dangerous keywords, sentence structures, or semantic patterns from training data, adversaries can simply rephrase requests in ways that preserve meaning while avoiding detection. Poetry provides one such avenue. The researchers plan to test others—archaic language, bureaucratic phrasing, various linguistic styles.
For organizations deploying LLMs in production, this research highlights uncomfortable reality: safety mechanisms remain brittle against relatively simple manipulation. No sophisticated technical knowledge required. No complex attack chains. Just creative rephrasing that any competent adversary could automate.
The transformation from prose to poetry doesn't require manual effort for each prompt. Attackers could build automated systems that convert malicious requests into verse, test against target models, and iterate until successful. The researchers demonstrated this works at scale with 1,200 prompts. It would work with 12,000 or 120,000.
The variation in vulnerability between providers matters for vendor selection. If you're choosing between models and security is a priority, the fact that some providers blocked nearly all poetic jailbreaks while others blocked nearly none should influence decisions. Performance on standard benchmarks clearly doesn't predict safety under stylistic manipulation.
This research fits a larger pattern: adversarial robustness in AI systems remains unsolved. Every new safety mechanism spawns new bypass techniques. Prompt injection, jailbreaking, adversarial examples, now poetic reformulation. The cat-and-mouse game continues with attackers consistently finding new angles.
The study was limited to English and Italian inputs. Other languages likely show similar vulnerabilities. Other stylistic transformations—formal bureaucratic language, archaic phrasing, technical jargon, slang—probably work too. Poetry just happens to be what these researchers tested first.
Regulatory frameworks assuming model behavior can be assessed through static benchmarking need updating. Stress tests varying wording styles, linguistic patterns, and creative reformulations should be mandatory. Current approval procedures provide false confidence by testing only straightforward adversarial prompts.
The uncomfortable truth is that language models optimized for helpfulness struggle with consistent safety enforcement. They're trained to understand and respond to requests phrased in diverse ways—that's the core capability. Distinguishing between legitimate diverse phrasings and malicious diverse phrasings remains an open problem. When safety filters key on surface patterns rather than deep intent understanding, adversaries will find gaps.
Model providers will likely update safety filters to catch poetic reformulations now that the technique is public. This won't solve the underlying problem—it will just trigger the next iteration of bypass techniques. The researchers identifying this vulnerability doesn't mean it's the last such vulnerability or even the most effective one currently known to adversaries with resources and motivation.
For the AI safety community, this research underscores that robustness testing needs dramatic expansion beyond current benchmarks. For organizations deploying AI systems, it highlights that safety guarantees from vendors should be viewed with appropriate skepticism. For regulators, it demonstrates that static evaluation frameworks systematically miss real-world vulnerabilities.
When poems can jailbreak frontier models from leading providers, we're not dealing with edge cases or theoretical attacks. We're dealing with fundamental limitations in current safety approaches that creative adversaries will continue exploiting until we develop better solutions.
Roses are red, violets are blue, if safety depends on surface patterns, attackers will always get through.
We've always suspected that talent reps were part lawyer, part therapist, part fortune teller. Now they're adding data scientist to the résumé—and...
The hype around AI coding tools has reached fever pitch. We're told that AI will soon replace programmers, that non-coders are now "vibe-coding"...
We've all been there. You prompt an LLM to write code, it spits out something that technically works, but it doesn't feel right. The variable names...