Securing LLM Agent Teams: Inside NRT-Defense v0.4.0

Multi-turn autonomous LLM agents are being deployed rapidly in safety-critical systems. However, Lee et al. (2026) expose a major vulnerability in the NRT-Bench paper: adaptive multi-turn attacks can exploit disjoint model vulnerabilities, causing an 8.7% to 12.1% loss of Critical Safety Functions (CSFs).

To address this, I am open-sourcing NRT-Defense, an adaptive multi-turn defense framework designed to monitor agent sessions and reduce the attack success rate to below 1%.

The Threat: Context Drift and Disjoint Exploits

Standard guardrails evaluate each prompt in isolation (single-turn). Attackers take advantage of this by spreading an exploit across multiple conversational turns, each of which can look benign on its own. Turn by turn, the context drifts until the agent team bypasses its safety containment entirely.
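
To make the gap concrete, here is a minimal sketch of why per-turn checks miss a drifting session. It is an illustration only, not NRT-Defense's actual API: the names `single_turn_guard` and `SessionMonitor`, the risk scores, the thresholds, and the decay factor are all assumptions chosen for the example. In practice the per-turn scores would come from some classifier.

```python
from dataclasses import dataclass

# Hypothetical thresholds, chosen only for illustration.
PER_TURN_THRESHOLD = 0.5   # a single-turn guard blocks at or above this score
SESSION_THRESHOLD = 1.0    # a session monitor blocks at or above this total


def single_turn_guard(score: float) -> bool:
    """Return True if a turn would be blocked when judged in isolation."""
    return score >= PER_TURN_THRESHOLD


@dataclass
class SessionMonitor:
    """Tracks decayed cumulative risk across turns (assumed design)."""
    decay: float = 0.9
    accumulated: float = 0.0

    def observe(self, score: float) -> bool:
        """Fold one turn's score into the session total; True means block."""
        self.accumulated = self.accumulated * self.decay + score
        return self.accumulated >= SESSION_THRESHOLD


# Made-up risk scores for four turns that each stay under the per-turn limit.
turn_scores = [0.3, 0.35, 0.4, 0.45]

blocked_single = [single_turn_guard(s) for s in turn_scores]

monitor = SessionMonitor()
blocked_session = [monitor.observe(s) for s in turn_scores]

print(blocked_single)   # no turn is blocked in isolation
print(blocked_session)  # the session monitor trips on the fourth turn
```

With these made-up numbers, no individual turn crosses the per-turn threshold, but the decayed running total crosses the session threshold on the fourth turn. The decay factor is one possible design choice: it lets old, low-risk turns fade so that long benign sessions are not blocked by accumulation alone.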