Tool-selection accuracy logs 1.0 while the task fails. Here's why one number hides three different failures, and the four-layer eval stack I rebuilt to find the actual root cause.
