How We Actually Measure Whether an LLM's Output Is Good - BLEU, COMET and BLEURT

Hello, I'm Shrijith Venkatramana. I'm building git-lrc, an AI code reviewer that runs on every commit. Star us to help devs discover the project. Do give it a try and share your feedback to help improve the product.

An AI model writes a paragraph. It sounds fluent. It looks convincing. But how do you know whether it's actually good?

This deceptively simple question has occupied researchers for more than two decades.

Long before ChatGPT, machine translation researchers faced exactly the same problem. Human evaluation was expensive, inconsistent, and painfully slow. If every new model required thousands of humans to compare translations, research would crawl.

That necessity gave rise to BLEU, one of the most influential evaluation metrics in AI history. Years later, as language models became better at paraphrasing and reasoning, BLEU started to show its age. Researchers responded with learned metrics like BLEURT and COMET, which use neural networks to score text in a way that tracks human judgment more closely.

Related reading

The Scaling Laws That Made LLMs Work

Speculative Decoding: How LLMs Generate Tokens Faster Without Changing the…

Stop Your LLM From Getting Owned

Reinforcement Learning with Verifiable Rewards: Why AI is Learning to Grade Its…

Self-Attention: The Brilliant Idea That Made Large Language Models Possible

Sequence Transduction: The Forgotten Problem That Led to Modern LLMs
