Deploying large language models (LLMs) requires large-scale distributed inference, which spreads model computation and request handling across many GPUs and nodes to scale to more users while reducing latency. Distributed inference frameworks use techniques such as disaggregated serving, KV cache loading, and wide expert parallelism.In disaggregated serving environments, prefill and decode phases are run on separate GPUs, requiring efficient KV cache transfers between them. Low-latency and high-throughput communication to move these KV caches are critical to gain benefits from disaggregated serving.

In KV cache loading, storage is used to help with growing KV caches in multiturn and agentic AI workloads such as coding assistants and reasoning. For the case of long context KV, the previous results can be loaded from local SSDs and remote storage, instead of recomputing them as prefill. This is one example that explains why storage is becoming a core part of inference workloads.

In wide expert parallelism, experts are split across many GPUs, where the intermediate results (activations) have to be dispatched to and combined from these experts. Due to the requirement for ultra-low-latency communication for intermediate activations between stages, these transfers are typically initiated by the GPU through optimized kernels, referred to as device side APIs for networking, or device API in short.