Back to Articles

Task Description Evaluation Framework Error Analysis Conclusion Try VAKRA — Where Does Your Agent Break?

VAKRA Dataset | LeaderBoard | Release Blog | GitHub | Submit to Leaderboard

We recently introduced VAKRA, a tool-grounded, executable benchmark for evaluating how well AI agents reason and act in enterprise-like environments.

Unlike traditional benchmarks that test isolated skills, VAKRA measures compositional reasoning across APIs and documents, using full execution traces to assess whether agents can reliably complete multi-step workflows.