- AI models can be manipulated using poetic prompts to generate controversial content
- Study tested 25 chatbots, with an average 62% jailbreak success using poetic prompts
- Anthropic's chatbots resisted best; 13 of the 25 models had attack success rates above 70%
A new paper published by a team of cybersecurity researchers claims that artificial intelligence (AI) models can be manipulated into generating content about controversial topics simply by phrasing the prompt as a poem. The study, titled "Adversarial Poetry as a Universal Single-Turn Jailbreak in Large Language Models (LLMs)" and published on the preprint server arXiv, found that AI chatbots may even help you build a nuclear bomb if the prompt is poetic, according to Wired.
The researchers tested the poetic prompts on 25 chatbots from companies including OpenAI, Meta and Anthropic, where the technique worked with varying degrees of success on each.
"Poetic framing achieved an average jailbreak success rate of 62 percent for hand-crafted poems and approximately 43 percent for meta-prompt conversions," the study stated.
Anthropic's chatbots were the best at resisting the poetic attacks, but others fared much worse: 13 of the 25 models tested saw an Attack Success Rate (ASR) higher than 70 per cent with poetic prompts, while only five had an ASR below 35 per cent.
"We experimented by reformulating dangerous requests in poetic form, using metaphors, fragmented syntax, oblique references," one of the researchers told the outlet.
"The results were striking: success rates up to 90 per cent on frontier models. Requests immediately refused in direct form were accepted when disguised as verse."
To combat the issue, the researchers urged that safety evaluations focus on mechanisms that keep LLMs from producing harmful information regardless of how users phrase their requests.
"Without such mechanistic insight, alignment systems will remain vulnerable to low-effort transformations that fall well within plausible user behaviour but sit outside existing safety-training distributions."
Jailbreaking LLMs
This is not the first instance of AI models being jailbroken simply by manipulating the prompts. In June, researchers from Intel jailbroke LLMs using information overload: by cramming a prompt with hundreds of words of academic jargon, they forced the AI model to return results that were otherwise blocked by design.
In a nutshell, jailbreaking relies on cleverly designed prompts to bypass a chatbot's built-in restrictions and produce otherwise forbidden results. These techniques exploit the inherent conflict between the model's core directive, which is to be helpful and obey the user, and its secondary safeguards against generating harmful, biased, unethical, or illegal content.
By framing the question in ways that make “helpfulness” appear to override safety rules, the prompt tricks the model into prioritising the user's instructions over its protective restrictions.