A major evolution in LLM offensive-security evaluation, built in collaboration with POXEK AI.

Introduction

Eight months ago, we released v1.0.0 of the redteam-ai-benchmark framework, a refactor focused on modular scoring, clean architecture, and an explicit ethical-use policy. The community response exceeded expectations: security researchers, blue team leads, and solo founders building defensive tooling all found the benchmark useful for understanding what local LLMs can actually do under offensive-security pressure.

Today we are releasing v2.0, and it is not an incremental update. It is a fundamental rethinking of how we measure LLM capability in red team contexts.

This release would not have happened without the sustained engineering contribution of POXEK AI, whose team spent months working with us on dataset design, rubric engineering, and the offline LLM-as-Judge audit layer. Their involvement moved the project from a personal tool to a community-standard evaluation framework.