I Built a Prompt Injection Detector with 98% Recall on Unseen Attacks. Here's Why Data Beat Architecture.

Monday, May 25, 2026

Six weeks ago I shipped Lunaris Guard v0.1 — a dual-head classifier for prompt injection and content safety. On paper, it looked decent: 0.74 F1 on injection, multilingual coverage, Apache 2.0.

Then I tested it on something that wasn't in the training data.

It failed. 63% of the time.

That works out to 37% recall on novel attacks, which made v0.1 useless in production. Attackers don't send you prompts from your training set. They send you things you've never seen.
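For anyone who wants to check the arithmetic, here is a minimal sketch of how a miss rate maps to recall on a held-out attack set. The counts below (100 attacks, 37 caught) are hypothetical, picked only to match the percentages above, and the `recall` helper is my own illustration, not code from Lunaris Guard.

```python
def recall(labels, predictions):
    """Fraction of true attacks (label 1) that were flagged (prediction 1)."""
    true_positives = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    false_negatives = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)
    total_attacks = true_positives + false_negatives
    if total_attacks == 0:
        raise ValueError("recall is undefined without any positive examples")
    return true_positives / total_attacks


# Hypothetical held-out set: every prompt is an attack the model never saw.
labels = [1] * 100
# Assumed outcome: 37 attacks flagged, 63 missed.
predictions = [1] * 37 + [0] * 63

novel_recall = recall(labels, predictions)
miss_rate = 1 - novel_recall
print(f"recall={novel_recall:.2f} miss_rate={miss_rate:.2f}")
```

The point of scoring on a set that shares nothing with the training data is that in-distribution F1 can look fine while this number stays low.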

So I burned the v0.1 weights and started over.
