I spent six weeks running identical tasks through ten different frameworks so you don't have to argue about this in Slack anymore.
There's a conversation that happens in almost every engineering team building agents right now. Someone says "we should use LangChain." Someone else says "CrewAI is better for multi-agent stuff." A third person asks if anyone has looked at AutoGen. Nobody can agree because everyone is going off demos, blog posts and vibes.
I got tired of that conversation, so I ran actual benchmarks.
Six weeks, ten frameworks, five evaluation tasks repeated consistently across all of them. The tasks were chosen to reflect what production agent systems actually need to do, not what looks impressive in a README.
Here's what I tested and what I found.












