Originally published on lavkesh.com
The AI model landscape has changed. The focus is no longer on building the largest model, but on creating smaller models that are still useful. Google's Gemini 1.5 Flash, Meta's Llama 3.1 8B, and Microsoft's Phi-3 Mini are currently winning in real production deployments. The reasons for their success come down to simple engineering economics.
Large models like GPT-4 are capable but expensive to run at scale. For features handling millions of requests daily, small differences in per-token cost add up to serious infrastructure expenses. Companies have found that deploying GPT-4 for every user interaction may not be necessary. A model that is good enough and costs less can be a better choice.
Many everyday workloads fit this pattern. Classifying support tickets, summarizing short documents, generating structured output from a template, and answering FAQ-style questions from a knowledge base can all be handled reliably by a well-tuned 8-billion-parameter model. Only the genuinely hard problems require GPT-4-scale reasoning.
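One way to picture this split is a small router that sends routine tasks to the small model and everything else to the large one. The sketch below is only an illustration: the task labels, the model identifiers, and the `choose_model` function are hypothetical names introduced here, not part of any real provider's API.

```python
# Hypothetical task labels, based on the routine tasks listed in the text.
SMALL_MODEL_TASKS = {
    "classify_ticket",
    "summarize_short_doc",
    "fill_template",
    "answer_faq",
}

# Placeholder identifiers; substitute whatever names your provider uses.
SMALL_MODEL = "small-8b-model"
LARGE_MODEL = "large-frontier-model"


def choose_model(task_type: str) -> str:
    """Pick the small model for routine tasks and the large model otherwise."""
    if task_type in SMALL_MODEL_TASKS:
        return SMALL_MODEL
    # Unknown or hard tasks fall back to the larger, more capable model.
    return LARGE_MODEL


if __name__ == "__main__":
    for task in ("classify_ticket", "multi_step_reasoning"):
        print(task, "->", choose_model(task))
```

Defaulting unknown task types to the large model is a deliberately conservative choice in this sketch: a misrouted easy task costs a little extra, while a misrouted hard task could produce a poor answer.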
In practice, the cost of running a large model can be 5 to 10 times that of a smaller model, depending on the specific use case and the efficiency of the deployment. Consider a company at Amazon's scale that needs to process tens of millions of product reviews daily. Using a smaller model like Meta's Llama 3.1 8B instead of a larger model like GPT-4 could save millions of dollars in infrastructure costs per year. Many companies are willing to accept somewhat lower capability in exchange for savings of that size.
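The arithmetic behind a claim like this is easy to check with a back-of-the-envelope calculation. The sketch below uses made-up inputs throughout: the request volume, the tokens per request, and both per-token prices are assumptions chosen only to illustrate a 10x price gap, not published prices for any real model.

```python
def annual_cost(requests_per_day: int, tokens_per_request: int,
                price_per_million_tokens: float) -> float:
    """Estimate yearly spend in USD from daily volume and a per-token price."""
    tokens_per_year = requests_per_day * tokens_per_request * 365
    return tokens_per_year / 1_000_000 * price_per_million_tokens


# All values below are illustrative assumptions, not real figures.
REQUESTS_PER_DAY = 20_000_000   # "tens of millions" of reviews per day
TOKENS_PER_REQUEST = 500        # assumed average prompt plus output size
SMALL_MODEL_PRICE = 0.20        # assumed USD per million tokens
LARGE_MODEL_PRICE = 2.00        # assumed USD per million tokens (10x)

small_cost = annual_cost(REQUESTS_PER_DAY, TOKENS_PER_REQUEST, SMALL_MODEL_PRICE)
large_cost = annual_cost(REQUESTS_PER_DAY, TOKENS_PER_REQUEST, LARGE_MODEL_PRICE)
savings = large_cost - small_cost

if __name__ == "__main__":
    print(f"Small model: ${small_cost:,.0f} per year")
    print(f"Large model: ${large_cost:,.0f} per year")
    print(f"Savings:     ${savings:,.0f} per year")
```

Under these assumed inputs the gap comes to roughly 6.6 million USD per year. The result scales linearly with volume and with the price difference, so plugging in your own traffic and your provider's actual prices will give a more meaningful number.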






