There is a paper that reframes prompt injection in a way that is hard to unsee: Prompt Injection as Role Confusion. Its claim is that the dozens of named attacks (ignore previous instructions, hidden HTML, markdown injection, tool injection, RAG injection) are not different bugs. They are one bug: a model attributes authority by the style of text, not by the structural role tag wrapped around it. Make untrusted text sound like it comes from a privileged source and the model may obey it, regardless of where it actually came from.
For a gateway sitting between an AI client and its MCP tools, that is the whole game. A tool response is supposed to be data. But it can mimic a higher-authority voice, and the model has no reliable way to tell the difference.
The strongest version: forging the reasoning channel
The paper's most striking result is not about user or system messages. It is about the model's own reasoning. When the authors injected text that imitated a model's chain-of-thought, the forged reasoning read with higher "CoTness" than the model's genuine reasoning. Concretely: forging the reasoning channel raised jailbreak success from roughly 0% to roughly 60%, and it transferred across every model they tested. When they "destyled" the injected reasoning (stripping the characteristic phrasing), success dropped back to about 10%.







