Let's talk about LLM evaluation

Contents:
How do we do LLM evaluation?
Benchmarks
Human as a judge
Model as a judge
Why do we do LLM evaluation?
1) Is my model training well? Is my training method sound? - Non-regression testing
2) Which model is the best? Is my model better than your model? - Leaderboards and rankings
3) Where are we, as a field, in terms of model capabilities? Can my model do X?
Conclusion
Acknowledgements

Since my team works on evaluation and leaderboards at Hugging Face, at ICLR 2024 (2 weeks ago) a lot of people wanted to pick my brain about the topic (which was very unexpected, thanks a lot to all who were interested).

Thanks to all these discussions, I realized that a number of things I take for granted evaluation-wise are 1) not widely known and 2) apparently interesting.

So let's share the conversation more broadly!

How do we do LLM evaluation?
