I gave an AI agent the vended bash tool from Strands and asked it to read my AWS credentials file. At first, it refused. But then I asked again with a slightly more creative prompt and it read the file, found the keys, and then gave me a polite but stern warning that I should rotate them immediately.

Even with the warning, the point is that the agent got the keys. That's the danger of giving your agent access to a local filesystem. It can reach anything on that machine like credentials, environment variables, config files, or whatever's there. And whether the model refuses or complies depends on how you ask. A direct "read my secrets" prompt might get blocked, but a multi-turn conversation that gradually escalates from debugging to credential access might get through.

But I only found that manually. What about the attacks I wouldn't think to try? That's what automated red teaming is for. Red teaming tries to figure out how an attacker can make your agent misbehave. Automated red teaming runs jailbreaks prompts crafted to get a model to do something its instructions forbid.

This post is the walkthrough of how I used it and went from 6/9 detected breaches to 0.

The patterns apply to any agent framework, but I'll use Strands Agents, Amazon Bedrock, and Amazon Bedrock AgentCore throughout since they have a few features that make this all pretty easy to do.