I maintain bulwark-mcp, a small open-source proxy that sits between an MCP client (Claude Desktop, Cursor) and the servers it talks to, and scans tool results for indirect prompt injection before they reach the model.
The reason that's a job worth doing: an MCP-enabled agent reads the output of every tool it calls, and it reads that output as data. A file from disk, an issue body from GitHub, a row from a database, a search snippet from the web — it all flows straight into the model's context. Except sometimes it isn't data. Anyone with write access to one of those surfaces can plant text that looks like data and reads like instructions, and the model does what the text says.
Before telling anyone the detector works, I did the thing you're supposed to do with a security tool: I tried to defeat it. Most of what I threw at it, it caught. One category didn't — and the more I dug, the clearer it got that this isn't a regex I forgot to write. It's a wall the entire field is standing in front of.
Here's the attack, why it works, and what I think it means for anyone building injection defenses.
What the detector actually does






