Running AI on mixed hardware for speed and affordability

As enterprises shift from experimenting with LLMs to serving them to customers, many are choosing to deploy their models on premises to ensure full control of their stack. Under the sovereign approach, AI workloads are processed on hardware that an enterprise owns or leases from a cloud provider.

Keeping data and models close has several benefits. It can give enterprises greater control over performance and reduce the risk of data leaks and security breaches. It can also be an effective way to control expenses amid surging demand for AI applications. Generative AI’s two main bottlenecks, high memory requirements and inefficient GPU utilization, can bog down service and drive up costs as deployments grow larger.

This is why the open-source community built llm-d: to orchestrate high-performance inference engines like vLLM and SGLang and manage the constraints of high-volume inference. By distributing inference requests efficiently, llm-d is designed to help enterprises and cloud providers serve more customers while holding down costs.

At its core is a cache-aware router that sends incoming requests to the vLLM instance most likely to hold pre-computed data in its key-value (KV) cache.
It also separates the prompt-processing “prefill” step from the text-generating “decoding” step so that each can be optimized across dedicated hardware pools.

llm-d was designed, at least theoretically, to run multi-vendor GPUs in the same production cluster so enterprises could use older or lower-cost hardware for low-priority tasks while reserving their most expensive hardware for critical workloads.

But in practice, creating a coherent serving layer out of a hodgepodge of GPUs is full of technical challenges, from reconciling divergent driver stacks and container runtimes to figuring out how to drain and reschedule in-flight requests without violating latency service level objectives (SLOs).

IBM Research, Red Hat, and NxtGen Cloud Technologies, one of India’s leading sovereign cloud providers, recently set out to reconfigure llm-d to improve its performance on mixed GPU clusters.

In a series of experiments on the NxtGen sovereign cloud, they found that llm-d could run IBM Granite and Sarvam AI models on diverse hardware three to five times faster, and serve potentially twice as many users, compared with serving the same models without llm-d.

The experiments may be the first to demonstrate llm-d's potential to both improve service for customers and help enterprises save money on heterogeneous hardware. And though India was chosen as a test site, researchers said they expect similar results in any other large-scale open-source deployment.

“llm-d's Kubernetes-native control plane delivers higher throughput, lower latency and better infrastructure utilization across accelerators, allowing enterprises and sovereign cloud providers to use their existing infrastructure,” said Priya Nagpurkar, vice president of hybrid cloud and AI platform at IBM Research. “This open, cloud-native approach can provide the scale and flexibility to keep down costs.”

A smart router to reuse the KV cache

LLM workloads often contain content that repeats across users or sessions.
Under a traditional Kubernetes setup, requests are spread across available pods, round-robin-style, with requests often landing on instances with no cached state for that prefix. The KV cache, as a result, has to be rebuilt on each node, leading to redundant computation.

llm-d, by contrast, is equipped with a hardware-agnostic router that can find previously computed cache in a multi-node cluster. It tracks the KV cache state of each vLLM instance in real time, routing incoming requests to the instance most likely to hold matching prompt prefixes in memory, regardless of which accelerator type processed them.

The researchers found that llm-d consistently outperformed the traditional Kubernetes setup on both throughput and time-to-first-token (TTFT), the time it takes for users to receive their first token after submitting a prompt.

Under a traditional Kubernetes configuration, requests are balanced evenly, causing slower GPUs to drag down overall throughput. llm-d’s prefix-cache-aware router, however, can increase capacity by diverting cache hits to open pods.

[Figure: As traffic increased in a pool of 20 GPU pods spanning three vendors (A+B+C), llm-d response times and throughput (in green) improved over a traditional Kubernetes round-robin setup (in grey).]

In a pool of 20 GPU pods spanning three different vendors, the traditional Kubernetes setup reached peak output of about 9,600 tokens per second under moderate traffic and fell to 7,500 tokens per second as demand increased. With llm-d, however, the same heterogeneous pods were able to reach 14,200 tokens per second under heavy traffic and reduce response times by nearly half a minute.

Squeezing more tokens from each GPU

The improved response time and throughput that llm-d brings can add up to real savings.
The researchers calculated that using vLLM and llm-d to serve a Sarvam-30B model to a thousand users at once could save, on average, up to $5.25 million a year, with GPU costs set conservatively at $3 per hour.

For enterprises, the bottom line is that twice as many customers can potentially be served three to five times faster by deploying workloads on llm-d. The researchers explain the details in a technical blog post for the llm-d community.

The llm-d community plans to further improve llm-d by implementing efficient routing. Compute-heavy prefill tasks could be sent to GPU nodes from one vendor, for example, while memory-intensive decoding could go to nodes from another vendor. To operationalize this feature, the KV cache transfer library has to be compatible across GPU backends.

Data residency rules and other regulatory concerns have made on-premises AI deployments more attractive to some enterprises. But a clear business case can be made for sovereign AI and taking control of the stack, too.

Properly configured, llm-d can boost capacity and reduce needless spending. Rather than having to buy the latest, most expensive hardware, enterprises can spread their AI workloads across a variety of GPUs, including those they already own. AI application users benefit, too, by getting answers and solutions faster.

“This benchmarking exercise shows what’s possible when Indian sovereign infrastructure meets world-class open innovation,” said A. S. Rajgopal, managing director and CEO of NxtGen. “llm-d's decisive advantage over a traditional Kubernetes setup validates the direction we've chosen.”

It’s yet another example of how open-source collaboration can help enterprises create value and avoid vendor lock-in, said Vincent Caldeira, chief technology officer for Red Hat’s Asia-Pacific region. “Through llm-d's community-driven architecture, Indian businesses can now unify diverse, multi-vendor hardware pools,” he said.
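
For readers who want a feel for why prefix-cache-aware routing beats round-robin, here is a minimal Python sketch. It is a toy model and not llm-d's implementation: the names `Pod`, `prefix_keys`, `pick_cache_aware` and `simulate`, the four-token block size, and the tie-break on requests served are all assumptions made for illustration. Each pod remembers which prompt prefixes it has already processed, and the router prefers the pod with the longest matching prefix.

```python
from dataclasses import dataclass, field

BLOCK = 4  # tokens per cache block; toy value chosen for readability (assumption)


def prefix_keys(tokens):
    """One key per full block; each key identifies the entire prefix up to that block."""
    return [tuple(tokens[:end]) for end in range(BLOCK, len(tokens) + 1, BLOCK)]


@dataclass
class Pod:
    """Hypothetical stand-in for one inference instance and its KV cache."""
    name: str
    cached: set = field(default_factory=set)  # prefix keys already computed here
    served: int = 0                           # crude load proxy

    def matched_blocks(self, tokens):
        # Count leading blocks of this prompt that are already cached on the pod.
        count = 0
        for key in prefix_keys(tokens):
            if key not in self.cached:
                break
            count += 1
        return count

    def serve(self, tokens):
        # Return how many blocks were reused, then cache the full prompt.
        reused = self.matched_blocks(tokens)
        self.cached.update(prefix_keys(tokens))
        self.served += 1
        return reused


def pick_round_robin(pods, tokens, index):
    # Ignores cache state entirely.
    return pods[index % len(pods)]


def pick_cache_aware(pods, tokens, index):
    # Longest cached prefix wins; ties go to the pod that has served the least.
    return max(pods, key=lambda pod: (pod.matched_blocks(tokens), -pod.served))


def simulate(pick, requests, pod_count=3):
    """Total number of cache blocks reused across all requests under a policy."""
    pods = [Pod(f"pod-{i}") for i in range(pod_count)]
    return sum(pick(pods, tokens, i).serve(tokens) for i, tokens in enumerate(requests))


def make_requests():
    # Two shared 8-token system prompts, each followed by a unique 4-token question.
    system_a = list(range(0, 8))
    system_b = list(range(50, 58))
    requests = []
    for i in range(6):
        system = system_a if i % 2 == 0 else system_b
        requests.append(system + [1000 + 10 * i + j for j in range(4)])
    return requests


if __name__ == "__main__":
    reqs = make_requests()
    print("round robin reused blocks:", simulate(pick_round_robin, reqs))
    print("cache aware reused blocks:", simulate(pick_cache_aware, reqs))
```

On this six-request trace, round-robin reuses none of the 18 prompt blocks because each repeat of a system prompt lands on a pod that has not seen it, while the cache-aware policy reuses 8. A production router would also have to weigh queue depth and accelerator speed against cache affinity, which this sketch reduces to a simple tie-break.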
