I scored 3/50 on a take-home benchmark for a job application. And I still got the job.

At the time, I had never built a fully agentic system. I had worked with LLM pipelines and small AI tools, but an entirely autonomous architecture was completely new to me. The experience taught me a few important lessons, not just about AI agents but about how to approach unfamiliar engineering problems.

After I showed my results in the job interview, my (now) CTO mentioned that he had noticed I was using a cheap mini LLM (endless testing had racked up quite a bill!) and that he had tried my agent with the frontier Opus model. Funnily enough, the agent actually performed worse, scoring only 2/50!

The problem was not the model. I had architected a clean system with plugin-based tooling, consistent interfaces, and a semi-autonomous pipeline that enforced structure around the agent. My plan was to start constrained: give the agent specific tools for a subset of the questions, then gradually add tools around my well-architected abstractions until it could solve everything.
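
To make that shape concrete, here is a minimal sketch of what a plugin-based tool setup with a constraining pipeline can look like. It is an illustration only, not the code from the take-home: `Tool`, `ToolRegistry`, `run_pipeline`, and `choose_tool` are hypothetical names, and `choose_tool` stands in for whatever LLM call picks a tool.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class Tool:
    """A plugin: every tool exposes the same name/description/run interface."""
    name: str
    description: str
    run: Callable[[str], str]


class ToolRegistry:
    """Holds the tools the agent is currently allowed to use."""

    def __init__(self) -> None:
        self._tools: Dict[str, Tool] = {}

    def register(self, tool: Tool) -> None:
        if tool.name in self._tools:
            raise ValueError(f"tool already registered: {tool.name}")
        self._tools[tool.name] = tool

    def get(self, name: str) -> Optional[Tool]:
        return self._tools.get(name)

    def names(self) -> List[str]:
        return sorted(self._tools)


def run_pipeline(
    question: str,
    choose_tool: Callable[[str, List[str]], str],
    registry: ToolRegistry,
) -> str:
    """Semi-autonomous step: the agent chooses, the pipeline enforces the choice.

    `choose_tool` is a placeholder for the model call (an assumption of this
    sketch). Anything outside the registered tools is rejected.
    """
    choice = choose_tool(question, registry.names())
    tool = registry.get(choice)
    if tool is None:
        return "unsupported"
    return tool.run(question)


# Start constrained: one tool that handles one narrow kind of question.
registry = ToolRegistry()
registry.register(
    Tool(
        name="word_count",
        description="Counts the words in the question text.",
        run=lambda text: str(len(text.split())),
    )
)
```

With a stubbed chooser such as `lambda q, names: "word_count"`, `run_pipeline` dispatches to the registered tool, and any choice outside the registry falls through to `"unsupported"`. That fallback is exactly the constraint: the agent can only ever do what a tool already exists for.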

On paper, the code looked solid. In practice, it couldn't solve the problems.