Benchmarks have become essential for enterprises, allowing them to choose models whose performance matches their needs. But not all benchmarks are built the same: many evaluate models against static datasets or fixed testing environments.

Researchers from Inclusion AI, which is affiliated with Alibaba’s Ant Group, have proposed a new model leaderboard and benchmark that focuses on a model’s performance in real-life scenarios. They argue that LLMs need a leaderboard that accounts for how people actually use them and how much people prefer their answers, rather than measuring only a model’s static knowledge.

In a paper, the researchers laid out the foundation for Inclusion Arena, a leaderboard that ranks models based on user preferences.