Hello, I'm Shrijith Venkatramana. I'm building git-lrc, an AI code reviewer that runs on every commit. Star us to help devs discover the project. Do give it a try and share your feedback so we can improve the product.
An AI model writes a paragraph. It sounds fluent. It looks convincing. But how do you know whether it's actually good?
This deceptively simple question has occupied researchers for more than two decades.
Long before ChatGPT, machine translation researchers faced exactly the same problem. Human evaluation was expensive, inconsistent, and painfully slow. If every new model required thousands of humans to compare translations, research would crawl.
That necessity gave rise to BLEU, one of the most influential evaluation metrics in AI history. Years later, as language models became better at paraphrasing and reasoning, BLEU started to show its age: it rewards word overlap with a reference, so a correct answer phrased differently can score poorly. Researchers responded with learned metrics like BLEURT and COMET, neural models trained on human ratings so that their scores track human judgment far more closely.






