ADeLe: Predicting and explaining AI performance across tasks - Microsoft Research

AI benchmarks report how large language models (LLMs) perform on specific tasks but provide little insight into the underlying capabilities that drive that performance. They do not explain failures or reliably predict outcomes on new tasks. To address this, Microsoft researchers, in collaboration with Princeton University and Universitat Politècnica de València, introduce ADeLe (AI […]

Wednesday, April 1, 2026

At a glance

AI benchmarks report performance on specific tasks but provide limited insight into underlying capabilities. ADeLe evaluates models by scoring both tasks and models across 18 core abilities, enabling direct comparison between task demands and model capabilities.

Using these ability scores, the method predicts performance on new tasks with ~88% accuracy, including for models such as GPT-4o and Llama-3.1.

It builds ability profiles and identifies where models are likely to succeed or fail, highlighting strengths and limitations across tasks.

By linking outcomes to task demands, ADeLe explains differences in performance, showing how performance changes as task complexity increases.
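The idea of comparing task demands with model capabilities can be illustrated with a toy sketch. This is not ADeLe's actual predictor, which the summary above does not describe. The ability names, the numeric levels, and the rule "a model is expected to succeed when every ability meets or exceeds the task's demand" are all assumptions made for illustration, and only three of the 18 abilities are shown.

```python
from typing import Dict

# Hypothetical representation: ability name -> level on a shared numeric scale.
Profile = Dict[str, float]


def unmet_demands(model_abilities: Profile, task_demands: Profile) -> Dict[str, float]:
    """Return each dimension where the task demands more than the model offers,
    mapped to the size of the gap. An empty result means no demand is unmet."""
    return {
        name: demand - model_abilities.get(name, 0.0)
        for name, demand in task_demands.items()
        if demand > model_abilities.get(name, 0.0)
    }


def predict_success(model_abilities: Profile, task_demands: Profile) -> bool:
    """Toy rule (an assumption): predict success only if no demand is unmet."""
    return not unmet_demands(model_abilities, task_demands)


# Illustrative values only; these are not measured scores for any real model.
model = {"reasoning": 3.5, "knowledge": 4.0, "attention": 2.0}
easy_task = {"reasoning": 2.0, "knowledge": 3.0, "attention": 1.0}
hard_task = {"reasoning": 4.0, "knowledge": 3.0, "attention": 2.5}

print(predict_success(model, easy_task))  # True
print(predict_success(model, hard_task))  # False
print(unmet_demands(model, hard_task))    # {'reasoning': 0.5, 'attention': 0.5}
```

Returning the unmet dimensions, rather than only a yes/no prediction, mirrors the explanatory goal described above: a predicted failure comes with the abilities where the task asks for more than the model provides.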

Related reading

AI is ready to take over Python programming, but not much else

AstaBench update: New results, plus adoption from industry | Ai2

Small Language Models Outperform Frontier AI On Cost, Speed And Accuracy

33 LLM metrics to watch closely

How to evaluate and benchmark Large Language Models (LLMs)

AI Evaluators Struggle with Models That Know When They’re Being Tested
