Inside VAKRA: Reasoning, Tool Use, and Failure Modes of Agents

A Blog post by IBM Research on Hugging Face

Tuesday, March 31, 2026

VAKRA Dataset | LeaderBoard | Release Blog | GitHub | Submit to Leaderboard

We recently introduced VAKRA, a tool-grounded, executable benchmark for evaluating how well AI agents reason and act in enterprise-like environments.

Unlike traditional benchmarks that test isolated skills, VAKRA measures compositional reasoning across APIs and documents, using full execution traces to assess whether agents can reliably complete multi-step workflows.
