OpenAI has found a way to reduce its inference costs by roughly 50%, a development that could reshape the economics of running large language models at scale.

Inference is the process of actually running a trained AI model to generate responses. Every ChatGPT query, every API call, every AI-generated response burns compute.

The broader push to make AI cheaper

This cost reduction arrives amid a company-wide offensive against compute expenses. In June 2026, OpenAI unveiled a custom inference chip called Jalapeño, co-developed with Broadcom. The chip is designed to deliver better performance-per-watt for high-demand language model applications like ChatGPT, while simultaneously reducing OpenAI’s dependence on Nvidia’s GPUs.

The industry more broadly has been deploying a toolkit of optimization strategies. Techniques like quantization, which reduces the precision of model weights to shrink computational demands, and caching, which stores frequently requested outputs to avoid redundant computation, have achieved cost savings of 50% or more across multiple providers. Mixture-of-Experts architectures, where only a subset of a model’s parameters activate for any given query, represent another approach that has delivered meaningful efficiency gains.