TL;DRAI

LLM-based judges for agent security fail because they share the same prompt injection vulnerabilities and produce non-deterministic decisions. For critical actions (prod databases, payments), security boundaries must use hard rules, not model sampling.

I'm in a group called AARM. It's a bunch of people trying to work out how you actually secure what an AI agent can do once it's running, and the basic idea is that the control has to sit right at the action. You check a tool call before it runs, and the agent can't wriggle around the check. So everyone in there already agrees that telling an agent "please don't" isn't a security model.

What gets me is that even in that room, I keep seeing people reach for an LLM to be the thing that makes the call. The agent goes to do something, you take that action and hand it to a second model, ask it whether it's fine, and whatever it answers is what happens. A model watching the model. I don't really get it, and I want to walk through why, because I think people lean on this without sitting with what it actually buys them.

What you're actually defending against

Go back to why you want a guard on the agent in the first place. It's there because the agent can be talked into things. Some prompt injection sitting in a page it reads, a tool result that quietly hands it a new instruction, a user who words a request just so. The agent is a thing you can reason with, and the worry is that the wrong person reasons with it.

dev.to

Don't use an LLM to decide what your AI agent is allowed to do

I'm in a group called AARM. It's a bunch of people trying to work out how you actually secure what an...

domenica 21 giugno 2026 New tab

TL;DRAI

1,016 words~5 min read

What you're actually defending against

Don't use an LLM to decide what your AI agent is allowed to do

Don't use an LLM to decide what your AI agent is allowed to do

Related reading

Why You Should Never Let an LLM Decide Your AI Agent's Permissions

Let your LLM take real-world actions — without giving it the last word

How to Lock Down an AI Agent Before It Goes Rogue

The LLM Is Not the Final Authority: Building Trust Infrastructure for AI Agents

AprielGuard: A Guardrail for Safety and Adversarial Robustness in Modern LLM…

Stop Using LLMs to Audit Other LLMs: You Are Bricking Your Production Latency

Related reading

Why You Should Never Let an LLM Decide Your AI Agent's Permissions

Let your LLM take real-world actions — without giving it the last word

How to Lock Down an AI Agent Before It Goes Rogue

The LLM Is Not the Final Authority: Building Trust Infrastructure for AI Agents

AprielGuard: A Guardrail for Safety and Adversarial Robustness in Modern LLM…

Stop Using LLMs to Audit Other LLMs: You Are Bricking Your Production Latency