Anthropic spent 1,000 hours running an external red-team bounty before launching Claude Fable 5. The claim coming out of that program: no universal jailbreaks found. Within 48 hours of public release, a researcher known as Pliny the Liberator claimed publicly to have bypassed the model's guardrails anyway.
The techniques weren't exotic. They were a layered combination of Unicode/homoglyph substitution, long-context framing, narrative fiction framing, and a decomposition-recomposition strategy: breaking a harmful request into a series of sub-prompts that each look innocuous on their own, then reassembling the answers. The use cases claimed were serious: drug synthesis assistance and attacks on crypto protocols.
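To make the decomposition point concrete, here is a minimal sketch of why a filter that scores each message in isolation can miss a request that has been split across turns. Everything in it is an assumption for illustration: `toy_is_disallowed` is a hypothetical stand-in for a real policy classifier, the `part-a`/`part-b`/`part-c` strings are placeholders, and nothing here reflects how Anthropic's guardrails are actually built.

```python
from typing import Callable, List


def toy_is_disallowed(text: str) -> bool:
    """Hypothetical policy check: only the combination of all three parts is disallowed."""
    lowered = text.lower()
    return all(part in lowered for part in ("part-a", "part-b", "part-c"))


def screen_turn(
    history: List[str],
    new_message: str,
    is_disallowed: Callable[[str], bool],
) -> bool:
    """Return True if the new turn should be blocked."""
    # Stateless check: what a per-message filter would do on its own.
    if is_disallowed(new_message):
        return True
    # Session-level check: score the conversation as one recomposed request.
    combined = "\n".join(history + [new_message])
    return is_disallowed(combined)


# Usage: three sub-prompts that each pass the per-message check.
history: List[str] = []
decisions = []
for message in ["explain part-a", "now part-b", "finally part-c"]:
    decisions.append(screen_turn(history, message, toy_is_disallowed))
    history.append(message)
```

In this toy setup no single message trips the check, and only the session-level pass catches the third turn. A real classifier is far less clean than a substring match, so treat this as an illustration of the shape of the problem, not a working defense.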
This isn't an indictment of Anthropic specifically. It's a structural problem. When model-layer guardrails are the only defense, they are a single point of failure, and a researcher with enough time and creativity will eventually get past them. The question is what you put in front of the model.
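As one example of a layer that can sit in front of the model, here is a minimal sketch of input normalization aimed at the Unicode/homoglyph technique. It relies only on Python's standard `unicodedata` module. The `CONFUSABLES` table is a hypothetical, deliberately tiny subset chosen for illustration; a real deployment would need much fuller confusables data, and normalization alone does nothing against the framing or decomposition techniques.

```python
import unicodedata

# Hypothetical, deliberately small homoglyph table (an assumption for this sketch).
CONFUSABLES = {
    "\u0430": "a",  # Cyrillic small a
    "\u0435": "e",  # Cyrillic small ie
    "\u043e": "o",  # Cyrillic small o
    "\u0440": "p",  # Cyrillic small er
    "\u0441": "c",  # Cyrillic small es
    "\u0445": "x",  # Cyrillic small ha
    "\u0456": "i",  # Cyrillic small Byelorussian-Ukrainian i
    "\u03bf": "o",  # Greek small omicron
}


def normalize_prompt(text: str) -> str:
    """Fold look-alike and invisible characters before any policy check runs."""
    # NFKC maps compatibility forms (fullwidth letters, ligatures) to plain ones.
    folded = unicodedata.normalize("NFKC", text)
    # Drop format characters (category Cf), such as zero-width spaces and joiners.
    visible = "".join(ch for ch in folded if unicodedata.category(ch) != "Cf")
    # Map the known homoglyphs onto their Latin look-alikes.
    return "".join(CONFUSABLES.get(ch, ch) for ch in visible)


def changed_by_normalization(text: str) -> bool:
    """A cheap signal: did normalization alter the input at all?"""
    return normalize_prompt(text) != text
```

The design choice here is to run every downstream check on the normalized text and to treat `changed_by_normalization` as a signal worth logging, not as proof of an attack, since legitimate multilingual input will also trigger it.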
How the Attack Actually Worked
Based on what's been reported, Pliny combined at least four distinct evasion techniques simultaneously:











