Why TPUs Aren't Popular (Even Though They're Cheaper Per Token)

If you only look at the spec sheet, the case for TPUs looks overwhelming: lower cost per token, dramatically lower energy per token, deterministic latency. Trainium makes the same pitch. And yet a large share of the industry, including most of the inference traffic behind consumer chat UIs like ChatGPT, still runs on NVIDIA. The gap between "cheaper on paper" and "what people actually deploy" is not a marketing failure. It's an architectural tax that systolic-array silicon charges you in code, pipelines, and org structure. This post is about where that tax comes from and why only a handful of companies can afford to pay it.

The one architectural fact that explains everything: static shapes

NVIDIA GPUs are SIMT (Single Instruction, Multiple Threads) processors. They schedule threads dynamically at runtime and page memory on demand. TPUs and AWS Trainium are not GPUs. Their compute cores are built around systolic arrays: a grid of multiply-accumulate units wired directly to their neighbors, fed by an ahead-of-time compiler (XLA for TPU, the Neuron compiler for Trainium).

A systolic array hits peak utilization only when the shape of the data flowing through it is fixed at compile time. Weights are loaded once and stay stationary in the processing elements; activations slide through like a bucket brigade. Change the sequence length by even one token, or the batch size by one request, and the tensor shapes no longer match the compiled program: the data routes and memory addresses have to be recomputed, which means the compiler has to generate a new binary.
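
The recompile-on-new-shape behavior is easy to observe from JAX, which compiles through XLA. What follows is a minimal sketch, not a benchmark: the function name projection, the counter trace_count, and the tensor sizes are all made up for illustration. It relies on the fact that a Python side effect inside a jitted function runs only while JAX traces the function for a new input signature, so the counter approximates how many distinct programs XLA was asked to build. It runs on CPU as well as TPU.

```python
import jax
import jax.numpy as jnp

trace_count = 0  # incremented each time JAX has to trace the function again


@jax.jit
def projection(activations, weights):
    global trace_count
    # This Python side effect runs only during tracing, not on cached calls,
    # so it counts how many distinct input shapes needed their own program.
    trace_count += 1
    return jnp.dot(activations, weights)


# Hypothetical sizes: 64-wide hidden dimension, 128 or 129 tokens.
weights = jnp.ones((64, 64))

out_a = projection(jnp.ones((128, 64)), weights)  # first call: trace and compile
out_b = projection(jnp.ones((128, 64)), weights)  # same shape: reuses the compiled program
out_c = projection(jnp.ones((129, 64)), weights)  # one extra token: new shape, traced again

print(trace_count)
```

The counter ends at 2: the second call is served from the cache, while the 129-token call forces a fresh trace and compile even though the weights and the math are unchanged.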
