The moment you connect an MCP server, your coding agent stops being a thing that reads and writes in your repo and becomes a thing that can reach out and act. Read a database, hit an API, touch a service, pull in a web page. That's the entire appeal. It's also the entire problem, and the two are the same feature seen from two sides.

I went through this wiring up tools for my own plugin work, and the thing that saved me from a worse mistake was a scar I already had. I'd shipped an AI chatbot earlier where I rendered the model's output straight to the page and ate an HTML injection bug. That one taught me a rule the hard way: anything an LLM hands back is untrusted input. MCP is that same lesson with the blast radius turned up, because now the untrusted thing isn't just my model's text, it's whatever a connected server decides to return.

Two different fears, and people only have one

When people talk about agent safety, they almost always mean: what if the agent runs something destructive. Deletes files, force-pushes, curls something it shouldn't. That fear is real and it has a real answer, which I'll get to.

But it's only half the threat, and it's the visible half. The other half is quieter: what if the agent believes something it shouldn't. An MCP server returns an API response, a database row, an issue body, an email thread, and somewhere in that returned content is a line that reads like an instruction. The model has no reliable way to tell your instruction from text that arrived inside data. To it, they're the same channel. So the danger isn't only the command the agent runs, it's the command it gets talked into running by content that came back through a tool.