Where Tensor-Parallel Inference Hits the NVLink Wall


TL;DR

On 4× H100, all-reduce peaks at 366 GB/s, 77% of the NVLink ceiling, but LLM decode is bottlenecked by small-message latency, not throughput. Scaling TP past the optimal degree slows per-token decode, so measure small-message latency on the actual fabric before adding GPUs.


2026-05-31 · GPU / distributed systems

Tensor parallelism splits each layer across GPUs, so every forward pass pays for an all-reduce over the fabric that connects them. On a single node that fabric is NVLink/NVSwitch, and how close you get to its theoretical budget decides whether TP helps or hurts. This post...
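To make the latency-versus-throughput distinction concrete, here is a minimal sketch of the standard alpha-beta cost model for a ring all-reduce. It is an illustration under stated assumptions, not the measurement method used for the numbers in this post. The 366 GB/s figure comes from the summary above and is treated here as bus bandwidth, which is an assumption. The per-step latency `ALPHA_S`, the hidden size, and the token counts are hypothetical placeholders, and a real collective library may choose a different algorithm than a ring for small messages.

```python
# Alpha-beta cost model for a ring all-reduce (illustrative sketch only).
# Assumptions: ring algorithm, a fixed per-step latency, and 366 GB/s
# treated as bus bandwidth. ALPHA_S, HIDDEN and the token counts are
# hypothetical values, not measurements.

BUS_BW = 366e9      # bytes/s, peak figure quoted in the summary
ALPHA_S = 5e-6      # seconds per ring step (hypothetical placeholder)
HIDDEN = 8192       # model hidden size (hypothetical)
BYTES_PER_ELEM = 2  # bf16


def ring_allreduce_time(msg_bytes, n_gpus, alpha_s, bus_bw):
    """Return (latency_term, bandwidth_term) in seconds.

    A ring all-reduce takes 2*(n-1) steps and each step moves
    msg_bytes / n bytes, so the bandwidth term is
    2*(n-1)/n * msg_bytes / bus_bw.
    """
    steps = 2 * (n_gpus - 1)
    latency_term = steps * alpha_s
    bandwidth_term = (steps / n_gpus) * msg_bytes / bus_bw
    return latency_term, bandwidth_term


def activation_bytes(tokens):
    """Size of one all-reduced activation tensor for `tokens` tokens."""
    return tokens * HIDDEN * BYTES_PER_ELEM


if __name__ == "__main__":
    for label, tokens in [("decode, 1 token", 1), ("prefill, 4096 tokens", 4096)]:
        for n in (2, 4, 8):
            lat, bw = ring_allreduce_time(activation_bytes(tokens), n, ALPHA_S, BUS_BW)
            print(f"{label:22s} TP={n}: latency {lat * 1e6:8.2f} us, "
                  f"bandwidth {bw * 1e6:8.2f} us")
```

Under these assumptions the single-token message is dominated by the latency term, which grows with the number of GPUs, while the large prefill message is dominated by the bandwidth term. The real per-step latency has to be measured on the actual fabric.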


