As shown in the matrix above, a Level 3 tool-calling agent can be assigned to any Oversight Level from 1 to 5; the assignment depends on risk, not capability.

| Agent Level | What to Measure | Autonomy Range | Common Failure Modes | Executive Metric |
| --- | --- | --- | --- | --- |
| Level 1 — Basic Responder | Relevance, factual accuracy, latency, token efficiency | Intelligent Automation | Hallucination, repetition loops, prompt injection | Customer satisfaction, cost per interaction |
| Level 2 — Router | Precision/recall/F1, confusion matrix, multi-intent detection, fallback handling | Conditional Autonomy | Intent misclassification, over-confident routing | First-contact resolution rate |
| Level 3 — Tool-Calling Agent | Tool selection accuracy, parameter extraction, error recovery, cost optimization | Conditional → High Autonomy | Wrong tool selection, parameter hallucination, cascading failures | Task completion rate, automation ROI |
| Level 4 — Multi-Agent System | Orchestration efficiency, handoff success, system-level goal completion, credit assignment | High Autonomy | Deadlocks, communication overhead, diffusion of responsibility | End-to-end process efficiency |
| Level 5 — Autonomous Agent | Independent goal achievement, cross-domain generalization, adaptation rate, novel solutions | Full Autonomy (theoretical/high-risk) | Goal misalignment, value drift, overconfidence | Human intervention hours saved |
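
To make the Level 2 row concrete, here is a minimal sketch of how per-intent precision, recall, and F1 could be derived from a router's confusion matrix. The function name `per_intent_metrics`, the intent labels, and the sample routing log are all hypothetical and chosen only for illustration; a production evaluation would more likely rely on an established metrics library.

```python
from collections import Counter


def per_intent_metrics(y_true, y_pred):
    """Return {intent: (precision, recall, f1)} from paired label lists."""
    # Confusion matrix stored sparsely: (actual, predicted) -> count.
    confusion = Counter(zip(y_true, y_pred))
    labels = sorted(set(y_true) | set(y_pred))
    metrics = {}
    for label in labels:
        tp = confusion[(label, label)]
        # Requests routed to this intent that actually belonged elsewhere.
        fp = sum(confusion[(other, label)] for other in labels if other != label)
        # Requests for this intent that were routed elsewhere.
        fn = sum(confusion[(label, other)] for other in labels if other != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[label] = (precision, recall, f1)
    return metrics


# Hypothetical routing log: the true intent versus the route the agent chose.
actual = ["billing", "billing", "support", "support", "sales", "billing"]
predicted = ["billing", "support", "support", "support", "sales", "billing"]

metrics = per_intent_metrics(actual, predicted)
for intent, (p, r, f1) in metrics.items():
    print(f"{intent}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

In this made-up log, one billing request is misrouted to support, which lowers recall for billing and precision for support. That is the kind of intent misclassification the table lists as a Level 2 failure mode.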