We’ve all heard the mantra from the quants in the business community: you can’t manage what you can’t measure. And if that’s true for human intelligence, it should be true for the artificial kind too.
How do we measure agents and large language models (LLMs)? We’re just beginning to come up with statistical metrics. Here are several of the most common metrics that designers and users toss about when they’re evaluating a model.
[ See also: 27 questions to ask before choosing an LLM ]
Time to first token
How long does it take to generate the first token? For real-time applications with time constraints, faster responses can be essential. It’s well-known that people hate waiting even a few milliseconds. The teams that develop user interfaces learned decades ago that it’s important for the software to respond quickly when a human is waiting for an answer. Even a few seconds of delay mean that the human will wander off to another window to check some email or place some bet on a prediction market. Time to first token is a good measure for models that will be working directly with the fickle human intelligences and their latent attention deficit disorder.












