DynoSim: Simulating the Pareto Frontier | NVIDIA Technical Blog

Modern LLM serving is hard to tune because each deployment is a stack of interacting choices: model backend, tensor-parallel shape, prefill/decode split, worker counts, scheduler settings, routing policy, KV cache behavior, autoscaling thresholds, and topology. Those choices interact across layers, and a local improvement can shift the bottleneck somewhere else. For larger models, even one realistic experiment can require many GPUs or nodes before we learn whether the idea was worth testing.

That is the motivation for DynoSim: a Dynamo twin.

DynoSim is a workload-driven discrete-event simulation of the NVIDIA Dynamo serving stack. It combines measured engine forward-pass timing, Mocker scheduler cores, Router and Planner behavior, KV cache effects, and workload traces on one virtual timeline. The goal is neither a purely analytical estimate nor a bit-exact hardware emulator. The goal is a serving simulation that is faithful at the atomic level of forward passes and extends up to the full inference stack, which for us (and for many others) is Dynamo.
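To make "discrete-event simulation on one virtual timeline" concrete, here is a minimal sketch in Python. It is not DynoSim's code or API: the `Event` type, the `simulate` function, the fixed per-request service times, and the round-robin dispatch are all assumptions chosen for illustration. The point it shows is that the clock jumps from event to event instead of waiting in real time.

```python
import heapq
from dataclasses import dataclass, field
from itertools import count


@dataclass(order=True)
class Event:
    # Events are ordered by virtual time, then by insertion order (seq).
    time: float
    seq: int
    kind: str = field(compare=False)        # "arrival" or "finish"
    request_id: int = field(compare=False)


def simulate(arrivals, service_time, num_workers):
    """Hypothetical toy simulator, not DynoSim itself.

    arrivals:     list of (request_id, arrival_time_seconds)
    service_time: dict of request_id -> assumed service seconds
    Returns a dict of request_id -> completion time in virtual seconds.
    """
    seq = count()
    events = [Event(t, next(seq), "arrival", rid) for rid, t in arrivals]
    heapq.heapify(events)

    free_at = [0.0] * num_workers  # virtual time at which each worker is idle
    next_worker = 0                # round-robin cursor
    done = {}

    while events:
        ev = heapq.heappop(events)  # jump the virtual clock to the next event
        if ev.kind == "arrival":
            worker = next_worker
            next_worker = (next_worker + 1) % num_workers
            start = max(ev.time, free_at[worker])  # queue behind earlier work
            finish = start + service_time[ev.request_id]
            free_at[worker] = finish
            heapq.heappush(events, Event(finish, next(seq), "finish", ev.request_id))
        else:
            done[ev.request_id] = ev.time
    return done


# Three requests on two workers; request 2 queues behind request 0.
completions = simulate(
    arrivals=[(0, 0.0), (1, 0.0), (2, 0.5)],
    service_time={0: 1.0, 1: 2.0, 2: 1.0},
    num_workers=2,
)
```

A real serving simulator would replace the fixed service times with measured forward-pass timing and add scheduler, router, and KV cache state, but the event loop keeps the same shape.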

DynoSim is not only faithful but also fast, thanks to a full-stack Rust implementation. On an Apple M4 MacBook Air, the single-threaded Rust offline replay simulated the full 23,608-request Mooncake trace with eight round-robin workers and 512-token trace and engine blocks in 2.41 seconds of wall time. The simulated serving window was 60.1 minutes, so the replay ran about 1,500x faster than real time.
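The speedup figure follows directly from the numbers quoted above. As a quick check, using only those numbers (the variable names are ours):

```python
# Figures quoted in the text above.
simulated_seconds = 60.1 * 60   # 60.1-minute simulated serving window
wall_seconds = 2.41             # wall time of the offline replay
num_requests = 23_608           # requests in the Mooncake trace

# Ratio of simulated time to wall time.
speedup = simulated_seconds / wall_seconds

# Simulated requests processed per second of wall time.
requests_per_wall_second = num_requests / wall_seconds

print(f"speedup: about {speedup:.0f}x")
print(f"throughput: about {requests_per_wall_second:.0f} simulated requests per wall second")
```

The ratio comes out just under 1,500, which matches the rounded figure in the text.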
