Introducing Gateway API Inference Extension

Modern generative AI and large language model (LLM) services create unique traffic-routing challenges on Kubernetes. Unlike typical short-lived, stateless web requests, LLM inference sessions are often long-running, resource-intensive, and partially stateful. For example, a single GPU-backed model server may keep multiple inference sessions active and maintain in-memory token caches. Traditional load balancers that route by HTTP path or distribute requests round-robin lack the specialized capabilities these workloads need. They also don’t account for model identity or request criticality (e.g., interactive chat vs. batch jobs). Organizations often patch together ad-hoc solutions, but a standardized approach is missing.

Thursday, June 5, 2025

By Daneyon Hansen (Solo.io), Kaushik Mitra (Google), Jiaxin Shan (Bytedance), Kellen Swain (Google)



