OpenAI cuts inference costs in half with new optimization technique

OpenAI has found a way to reduce its inference costs by roughly 50%, a development that could reshape the economics of running large language models at scale.

Inference is the process of actually running a trained AI model to generate responses. Every ChatGPT query, every API call, every AI-generated response burns compute.

The broader push to make AI cheaper

This cost reduction arrives amid a company-wide offensive against compute expenses. In June 2026, OpenAI unveiled a custom inference chip called Jalapeño, co-developed with Broadcom. The chip is designed to deliver better performance-per-watt for high-demand language model applications like ChatGPT, while simultaneously reducing OpenAI’s dependence on Nvidia’s GPUs.

The industry more broadly has been deploying a toolkit of optimization strategies. Techniques like quantization, which reduces the precision of model weights to shrink computational demands, and caching, which stores frequently requested outputs to avoid redundant computation, have achieved cost savings of 50% or more across multiple providers. Mixture-of-Experts architectures, where only a subset of a model’s parameters activate for any given query, represent another approach that has delivered meaningful efficiency gains.

OpenAI has found a way to reduce its inference costs by roughly 50%, a development that could reshape the economics of running large language models at scale.

Inference is the process of actually running a trained AI model to generate responses. Every ChatGPT query, every API call, every AI-generated response burns compute.

The broader push to make AI cheaper

OpenAI cuts inference costs in half with new optimization technique

OpenAI cuts inference costs in half with new optimization technique

Other newsrooms on this story

Related reading

OpenAI reportedly cut response costs for guest ChatGPT users by more than half

OpenAI Discovers New Way to Cut Inference Costs in Half

Startups and tech giants wage AI price war as inference costs spiral out of…

12 model-level deep cuts to slash AI training costs

Ditching the cloud for local AI — how I use two mini PCs to process millions of…

Leading Inference Providers Achieve Lowest Token Cost With Open Source Models…

Related reading

OpenAI reportedly cut response costs for guest ChatGPT users by more than half

OpenAI Discovers New Way to Cut Inference Costs in Half

Startups and tech giants wage AI price war as inference costs spiral out of…

12 model-level deep cuts to slash AI training costs

Ditching the cloud for local AI — how I use two mini PCs to process millions of…

Leading Inference Providers Achieve Lowest Token Cost With Open Source Models…

Other newsrooms on this story