Unlock Massive Token Throughput with GPU Fractioning in NVIDIA Run:ai | NVIDIA Technical Blog

As AI workloads scale, achieving high throughput, efficient resource usage, and predictable latency becomes essential. NVIDIA Run:ai addresses these challenges through intelligent scheduling and dynamic GPU fractioning. GPU fractioning is wholly delivered by NVIDIA Run:ai in any environment—cloud, NCP, and on-premises.

This post presents the joint benchmarking effort between NVIDIA and AI cloud provider Nebius to evaluate how NVIDIA Run:ai fractional GPU allocation can improve large language model (LLM) inference performance. Nebius’ AI Cloud provided the infrastructure foundation, dedicated NVIDIA GPUs, NVIDIA Quantum InfiniBand networking, and hyperscaler-grade performance and elasticity needed to deliver these gains at production scale.

All benchmarks were executed using NVIDIA NIM microservices. This approach provides standardized, production-grade model deployment with consistent performance, security, and lifecycle management across environments.

The results show that fractional GPUs dramatically increase effective capacity without compromising latency SLAs:

77% of full GPU throughput and 86% of full-GPU concurrent user capacity using only 0.5 GPU fraction, with time to first token (TTFT) under one second

The results show that fractional GPUs dramatically increase effective capacity without compromising latency SLAs:

77% of full GPU throughput and 86% of full-GPU concurrent user capacity using only 0.5 GPU fraction, with time to first token (TTFT) under one second

Unlock Massive Token Throughput with GPU Fractioning in NVIDIA Run:ai | NVIDIA Technical Blog

Unlock Massive Token Throughput with GPU Fractioning in NVIDIA Run:ai | NVIDIA Technical Blog

Related reading

Accelerate Token Production in AI Factories Using Unified Services and…

Maximizing GPU Utilization with NVIDIA Run:ai and NVIDIA NIM | NVIDIA Technical…

Get Real-Time Visibility into GPU Usage Across Kubernetes Clusters | NVIDIA…

NVIDIA Technical Blog

NVIDIA Unlocks AI Compute at Scale, Inviting Partners to Power the AI…

Scaling Token Factory Revenue and AI Efficiency by Maximizing Performance per…

Related reading

Accelerate Token Production in AI Factories Using Unified Services and…

Maximizing GPU Utilization with NVIDIA Run:ai and NVIDIA NIM | NVIDIA Technical…

Get Real-Time Visibility into GPU Usage Across Kubernetes Clusters | NVIDIA…

NVIDIA Technical Blog

NVIDIA Unlocks AI Compute at Scale, Inviting Partners to Power the AI…

Scaling Token Factory Revenue and AI Efficiency by Maximizing Performance per…