Most agent failures aren't reasoning failures. They're policy failures. The model picked the right tool, then called it with arguments outside its scope. The delegation chain expanded one step beyond what the user actually authorized. The output cleared the LLM's own check but tripped the compliance rule three layers down.

These don't get fixed by a larger frontier model. They get fixed by a faster check, run more often, on a model small enough that running it constantly isn't a budget event.

That's the tier most agent stacks don't have. And it's the tier Gemma 4 finally fills.

The missing tier

When you sketch an agent system on a whiteboard, you draw one box for the reasoning model. In production you discover you need a lot more boxes around it: pre-flight checks before each tool call, scope verifiers in delegation chains, output classifiers feeding audit trails, intent disambiguation when the user's last message could mean two things.
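One of those extra boxes, the pre-flight check before a tool call, might look roughly like the sketch below. Everything in it is hypothetical and only for illustration: the `ToolCall` and `Verdict` types, the `issue_refund` tool, and the per-tool amount limit are not from any particular framework. The rule-based `amount_checker` stands in for whatever small model or policy engine would actually make the call.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass(frozen=True)
class ToolCall:
    tool: str
    args: dict[str, Any]


@dataclass(frozen=True)
class Verdict:
    allowed: bool
    reason: str


# A checker is anything that maps a proposed call to a verdict.
# In a real stack this slot could hold a small model; here it is a plain rule.
Checker = Callable[[ToolCall], Verdict]


def amount_checker(limits: dict[str, float]) -> Checker:
    """Hypothetical policy: each authorized tool has a maximum 'amount'."""
    def check(call: ToolCall) -> Verdict:
        limit = limits.get(call.tool)
        if limit is None:
            return Verdict(False, f"tool {call.tool!r} is not authorized")
        amount = call.args.get("amount")
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(amount, bool) or not isinstance(amount, (int, float)):
            return Verdict(False, "missing or non-numeric 'amount'")
        if amount > limit:
            return Verdict(False, f"amount {amount} exceeds limit {limit}")
        return Verdict(True, "within scope")
    return check


def guarded_call(
    call: ToolCall,
    checker: Checker,
    execute: Callable[[ToolCall], Any],
    audit_log: list,
) -> Optional[Any]:
    """Run the check, record the verdict, and execute only if allowed."""
    verdict = checker(call)
    audit_log.append((call, verdict))  # every decision is logged, pass or fail
    if not verdict.allowed:
        return None
    return execute(call)
```

The point of the shape is that the reasoning model never calls `execute` directly. Every proposed call passes through the checker first, and the verdict lands in the audit trail whether or not the call goes ahead.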