New research from HKUST (arXiv:2606.14517, June 12) turns the agent safety layer into the attack surface.

What happened

Reasoning-based guardrails — the LLM safety layers that screen an agent's actions — can be trapped in their own analysis. Crafted inputs mimic the guardrail's internal schema (risk enumerations, assessment matrices), and the model, in the authors' words, "mechanically fills a template it has constructed for itself, trapped by its own instruction-following fidelity."

The measured effect: 13–63× token amplification when the guardrail is tested in isolation, and a 148× increase in end-to-end latency in a LangGraph multi-agent deployment, where a single guardrail call stretched to 730 seconds. Because the payload is fluent natural language, an injection classifier assigned it an injection probability below 0.001 and passed it through.
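
To make the two metrics concrete, here is a minimal sketch of how amplification factors like these could be computed from a benign run and an attacked run of the same guardrail. The `GuardrailRun` record, the `amplification` helper, and the sample numbers are all hypothetical illustrations, not the paper's harness or data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GuardrailRun:
    """One guardrail invocation (hypothetical record, not from the paper)."""
    output_tokens: int  # tokens the guardrail generated while reasoning
    latency_s: float    # wall-clock seconds for the call


def amplification(baseline: GuardrailRun, attacked: GuardrailRun) -> tuple[float, float]:
    """Return (token factor, latency factor) of the attacked run over the baseline."""
    if baseline.output_tokens <= 0 or baseline.latency_s <= 0:
        raise ValueError("baseline run must have positive tokens and latency")
    return (
        attacked.output_tokens / baseline.output_tokens,
        attacked.latency_s / baseline.latency_s,
    )


# Illustrative numbers only; they are NOT measurements from the paper.
baseline = GuardrailRun(output_tokens=200, latency_s=5.0)
attacked = GuardrailRun(output_tokens=5000, latency_s=100.0)

token_x, latency_x = amplification(baseline, attacked)
print(f"{token_x:.0f}x tokens, {latency_x:.0f}x latency")
```

The token factor and the latency factor are reported separately because they need not match: in a multi-agent pipeline, end-to-end latency can grow faster than the guardrail's own token count if other steps wait on the stalled call.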

Why it matters