Apple's newest on-device model carries about 20 billion parameters, and on any given request it fires maybe one to four billion of them. That gap — 20B stored, roughly 3B running — is the whole story of 2026. The model that now ships inside the latest iPhone is no longer a shrunken, lobotomized cousin of the cloud model. It's a different kind of object: large in flash, small in motion, and it never phones home.
For three years the on-device pitch was mostly aspirational. Demos ran, latency was rough, quality trailed the API by a generation, and every serious AI feature still resolved to a per-token bill in someone's datacenter. In mid-2026 that stopped being true. Two releases, Google's Gemma 4 family on April 2 and Apple's third-generation Foundation Models at WWDC on June 8, quietly moved the floor. Genuinely useful agents now run on hardware you already own, offline, with no per-token bill.
The economics nobody priced in
Forget benchmarks for a second; the load-bearing fact here is accounting. When the model lives in the cloud, every inference is a metered event: input tokens, output tokens, a line item that scales linearly with usage. Usage, in turn, multiplies the moment you wrap the model in an agent loop. Agentic workloads are the worst case for the token meter: a single "go do this task" can fan out into dozens of model calls as the agent plans, calls tools, retries, and re-reads its own output. The bill grows with your ambition.
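To make the accounting concrete, here is a minimal sketch of how a metered bill compounds inside an agent loop. Everything in it is an assumption for illustration: the Pricing fields, the function names, and the simple rule that each call re-reads all of the agent's earlier output. The prices are placeholders, not any vendor's real rates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    # Hypothetical prices in USD per million tokens (placeholders, not real rates).
    input_per_mtok: float
    output_per_mtok: float


def call_cost(pricing: Pricing, input_tokens: int, output_tokens: int) -> float:
    """Cost of one metered model call: linear in input and output tokens."""
    return (
        input_tokens * pricing.input_per_mtok
        + output_tokens * pricing.output_per_mtok
    ) / 1_000_000


def agent_task_cost(
    pricing: Pricing,
    calls: int,
    base_input_tokens: int,
    output_tokens_per_call: int,
) -> float:
    """Cost of one agent task that fans out into `calls` model calls.

    Assumes (for illustration only) that every call re-reads the original
    prompt plus all output produced so far, so the input grows each step.
    """
    total = 0.0
    context_tokens = base_input_tokens
    for _ in range(calls):
        total += call_cost(pricing, context_tokens, output_tokens_per_call)
        # The next call sees everything this call just wrote.
        context_tokens += output_tokens_per_call
    return total


if __name__ == "__main__":
    pricing = Pricing(input_per_mtok=1.0, output_per_mtok=2.0)
    single = call_cost(pricing, 1_000, 500)
    task = agent_task_cost(pricing, calls=30, base_input_tokens=1_000,
                           output_tokens_per_call=500)
    print(f"one call: ${single:.4f}, one 30-call task: ${task:.4f}, "
          f"ratio: {task / single:.1f}x")
```

Under these assumptions the per-call price stays linear, but the task price grows faster than the call count because the context being re-read keeps getting longer. On-device, the same loop has no meter to run.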









