VAKRA Dataset | LeaderBoard | Release Blog | GitHub | Submit to Leaderboard
We recently introduced VAKRA, a tool-grounded, executable benchmark for evaluating how well AI agents reason and act in enterprise-like environments.
Unlike traditional benchmarks that test isolated skills, VAKRA measures compositional reasoning across APIs and documents, using full execution traces to assess whether agents can reliably complete multi-step workflows.







