Faster, secure OSS LLM inference with prompt caching.

by Pei-Lun Liao, Asfandyar Qureshi, Roshan Regula, Bruce Fontaine, James Thomas and Chenyang Yu

Large language model (LLM) inference often involves repeated prompt prefixes: the same system or instruction prompt can appear at the start of thousands of requests. Reprocessing that identical prefix on every call wastes compute, inflates latency, and increases cost.
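To make the waste concrete, here is a minimal sketch of the idea. It is a toy model, not how any particular inference engine works: the `ToyPrefixCache` class is hypothetical, tokens are whitespace-split words, and the cached "state" is a placeholder where a real server would typically hold the attention key/value tensors computed for the prefix.

```python
import hashlib


class ToyPrefixCache:
    """Toy model: counts how many tokens must be processed when a shared prefix is cached."""

    def __init__(self):
        self._cache = {}           # prefix hash -> placeholder for precomputed prefix state
        self.tokens_processed = 0  # running total of tokens that needed processing

    def _key(self, prefix_tokens):
        # Hash the exact token sequence so that only identical prefixes match.
        joined = "\x1f".join(prefix_tokens)
        return hashlib.sha256(joined.encode("utf-8")).hexdigest()

    def run(self, prefix_tokens, suffix_tokens):
        key = self._key(prefix_tokens)
        if key not in self._cache:
            # Cache miss: the prefix has to be processed once and is then stored.
            self.tokens_processed += len(prefix_tokens)
            self._cache[key] = len(prefix_tokens)  # stand-in for real cached state
        # The request-specific suffix is always processed.
        self.tokens_processed += len(suffix_tokens)
        return self.tokens_processed


# Assumed example data: one shared system prompt, three different user questions.
system_prompt = "You are a helpful assistant that answers briefly .".split()
questions = [q.split() for q in ("What is Spark ?", "What is Delta Lake ?", "Define MLflow .")]

cache = ToyPrefixCache()
for q in questions:
    cache.run(system_prompt, q)

# Baseline: every request reprocesses the full prefix.
without_cache = sum(len(system_prompt) + len(q) for q in questions)

print(f"with cache: {cache.tokens_processed} tokens, without: {without_cache} tokens")
```

In this toy setup the shared prefix is processed once instead of three times, and the gap widens as the prefix gets longer or the number of requests grows.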