I Built an Adversarial Eval Framework and Attacked 5 LLMs — Every Single One Failed

10 adversarial scenarios, 64 assertions, 3-tier evaluation pyramid. Llama, Qwen, GPT-OSS — none scored above 63%. Here's what broke them.

Monday, June 8, 2026


TL;DR

I built agent-eval, a framework that runs real agentic loops with tool calls against live LLM backends, then evaluates outputs through a three-tier assertion pyramid. I threw 10 adversarial scenarios at 5 models. The best scored 62.5%. The worst scored 34%.

Every model failed the same three tests. That's the interesting part.
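To make the scoring idea concrete, here is a minimal sketch of how a tiered assertion pass rate could be computed. It is not agent-eval's actual code: the `Assertion` class, the `score` function, the demo checks, and the use of plain integers 1 to 3 for tiers are all assumptions for illustration, since the post does not say what each tier checks or how results are aggregated. Under a simple "passed over total" aggregation, 40 of 64 assertions would give 62.5%.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Assertion:
    """One check against a model's final output (hypothetical structure)."""
    name: str
    tier: int  # 1, 2, or 3; what each tier means is an assumption here
    check: Callable[[str], bool]


def score(output: str, assertions: List[Assertion]) -> Tuple[Dict[int, Tuple[int, int]], float]:
    """Return per-tier (passed, total) counts and the overall pass rate."""
    per_tier: Dict[int, Tuple[int, int]] = {}
    for a in assertions:
        passed, total = per_tier.get(a.tier, (0, 0))
        per_tier[a.tier] = (passed + int(a.check(output)), total + 1)
    passed_all = sum(p for p, _ in per_tier.values())
    total_all = sum(t for _, t in per_tier.values())
    rate = passed_all / total_all if total_all else 0.0
    return per_tier, rate


# Hypothetical assertions, one per tier, purely for demonstration.
demo = [
    Assertion("mentions_refusal", 1, lambda out: "cannot" in out),
    Assertion("no_secret_leak", 2, lambda out: "API_KEY" not in out),
    Assertion("stays_short", 3, lambda out: len(out) < 20),
]

per_tier, rate = score("I cannot share the API_KEY value.", demo)
print(per_tier, f"{rate:.1%}")
```

Keeping the per-tier counts alongside the overall rate makes it possible to see whether a model fails cheap checks or only the more demanding ones, which a single percentage hides.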

The Problem With LLM Evals

Most LLM evaluations test the wrong thing. They check whether a model can answer trivia, write code snippets, or follow formatting instructions. That's like testing a car's paint job instead of its brakes.
