I’ve been building a multi-agent research system. The idea is simple: give it a controversial technical topic like “Should we rewrite our Python backend in Rust?”, and three agents work on it. An Advocate argues for it, a Skeptic argues against, and a Synthesizer reads both briefs blind and produces a balanced analysis. Each agent has its own model, its own tools, its own system prompt.

It worked great in testing. Then I noticed the Synthesizer kept producing analyses that leaned heavily toward one side. Not wrong, but noticeably lopsided. I mean, rewriting the Sentry monorepo in Rust is arguably a bad idea, but it was arguing against on things where I clearly knew it should be for it.

I eventually traced it to the Skeptic’s web_search tool. The Advocate was returning 3-4 solid data points per query. The Skeptic, however, was searching for different terms that didn’t match the data as well, and was getting back a single generic result. So the Advocate’s brief was well-sourced with citations, and the Skeptic’s brief was… vibes. The Synthesizer did what any reasonable reader would do: it weighted the better-sourced argument more heavily.

The bug was in a tool call, inside one agent, that silently degraded the input to a completely different agent two steps later. I only found it by clicking through the trace and reading tool outputs at each step.