For thirty years the math has favored the attacker. He needs one bug. You have to cover everything, forever, on a smaller budget with a tired SOC. Now both sides get an AI multiplier, and the only question that matters is who gets it first and biggest. OpenAI's answer is a design pattern worth stealing: stop reading intent from the prompt, read it from the user.

The problem: guardrails resolve on shape, not intent

Here's the failure mode anyone doing defensive work against a frontier model already knows. You ask the model to build a proof-of-concept from a published CVE so you can validate that your patch holds. You own the box. You're confirming a fix. The model tells you it can't help you write an exploit.

It's not reading your heart. It's reading your tokens. And the token sequence for "write a PoC for this CVE" is identical whether you're a defender confirming remediation or an attacker building a weapon. The classifier sees the shape of the request and the shape is the same.

So the fix isn't a smarter classifier. A smarter classifier still only has the prompt to go on, and the prompt doesn't carry intent. The fix is to move the trust signal off the prompt and onto the authenticated principal.