Back to Articles

Large Language Models (LLMs) have rapidly evolved from text-only assistants into complex agentic systems capable of performing multi-step reasoning, calling external tools, retrieving memory, and executing code. With this evolution comes an increasingly sophisticated threat landscape: not only traditional content safety risks, but also multi-turn jailbreaks, prompt injections, memory hijacking, and tool manipulation.

In this work, we introduce AprielGuard, an 8B parameter safety–security safeguard model designed to detect:

16 categories of safety risks, spanning toxicity, hate, sexual content, misinformation, self-harm, illegal activities, and more.

Wide range of adversarial attacks, including prompt injection, jailbreaks, chain-of-thought corruption, context hijacking, memory poisoning, and multi-agent exploit sequences.