Power can account for 40% of the operating expenses (OpEx) of running an AI factory. Each watt is spent on overhead, data ingestion, training, or generating tokens for customers, and most sites are capped at a fixed power level set by their regional power provider. Under these conditions, performance per watt becomes a key efficiency metric that translates directly into token cost.

NVIDIA delivers the lowest cost per token for AI inference workloads and the lowest cost to train large models. This is possible through extreme co-design across power, cooling, and system infrastructure, and through deep collaboration with OEM, ODM, CSP, NCP, systems integrator, ISV, and model ecosystem partners.

This post explores the levers that an operator can use to maximize performance per watt and minimize token cost in an AI factory.

Why is inference optimization important for AI factories?

Inference drives revenue, so it is the key workload to optimize. Because the site's power budget is fixed, operators who increase inference throughput per watt directly increase the number of tokens they can sell or insights they can create, which in turn means more revenue per unit of time.
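To make the relationship concrete, here is a minimal back-of-the-envelope sketch of how throughput per watt sets both token output and energy cost per token at a power-capped site. The function names, the `it_fraction` parameter (the share of site power that reaches the compute doing inference), and every number in the example are hypothetical and chosen only for illustration; they are not measurements of any real system.

```python
JOULES_PER_KWH = 3.6e6  # 1 kWh = 3.6 million joules


def tokens_per_second(site_power_w, it_fraction, tokens_per_joule):
    """Tokens/s a power-capped site can generate.

    tokens_per_joule is throughput per watt: (tokens/s) / W.
    it_fraction is the (assumed) share of site power that reaches compute.
    """
    return site_power_w * it_fraction * tokens_per_joule


def energy_cost_per_million_tokens(price_per_kwh, it_fraction, tokens_per_joule):
    """Energy cost of one million tokens at a given electricity price."""
    tokens_per_kwh = JOULES_PER_KWH * it_fraction * tokens_per_joule
    return price_per_kwh / tokens_per_kwh * 1e6


if __name__ == "__main__":
    # Hypothetical site: 10 MW cap, 80% of power reaching compute, $0.08/kWh.
    cap_w, it_fraction, price = 10e6, 0.8, 0.08
    for tpj in (0.5, 1.0):  # hypothetical tokens per joule, before and after tuning
        rate = tokens_per_second(cap_w, it_fraction, tpj)
        cost = energy_cost_per_million_tokens(price, it_fraction, tpj)
        print(f"{tpj} tokens/J: {rate:,.0f} tokens/s, ${cost:.4f} per 1M tokens")
```

Under these assumptions, doubling tokens per joule doubles the tokens the site can produce at the same power cap and halves the energy cost of each token. This covers energy cost only; a fuller model would also account for capital and other operating costs.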