ParallelKernelBench: Frontier LLMs can't write fast multi-GPU kernels (yet)

Summary

LLMs have gotten surprisingly good at writing GPU kernels[1][2][3], but almost all current benchmarks measuring that progress are single-GPU. In production, communication is often the bottleneck: communication overhead can account for over 20% of inference latency[4], and that gap keeps widening as compute scales faster than interconnect bandwidth.

ParallelKernelBench (PKB) is a benchmark and evaluation framework for multi-GPU kernel generation. It includes 87 problems from real codebases where the task is to replace PyTorch + NCCL with a CUDA kernel that moves data directly over NVLink. We tested frontier coding models, including GPT-5.5, Gemini 3 Pro, Opus 4.7, and others. The evaluation revealed significant performance gaps across the board: under a third of problems were solved correctly, and fewer than a quarter of those beat the naive baseline.

We'll cover why the models fail, what the patterns look like, and a few cases where models surprisingly produced kernels faster than anything publicly available, including one for NVIDIA NeMo-RL's GRPO training loop, which has no prior optimized public reference.

Why multi-GPU is different from single-GPU kernel generation

LLMs have made progress on GPU kernel generation, but that progress has mostly been measured on a single GPU. Production AI workloads no longer fit that frame: they span multiple GPUs, and performance is increasingly shaped by communication rather than just local compute and memory. That shift makes multi-GPU kernel generation a different problem in three ways:

The design space expands combinatorially. Practitioners compose tensor, expert, data, context, and sequence parallelism to fit the hardware, and each composition creates a different communication pattern.

The performance model changes. A single-GPU roofline is built around compute and memory bandwidth. In multi-GPU code, the bottleneck is often the interconnect.

Multi-GPU kernel generation introduces a critical new design choice: how to move data between GPUs (through the copy engine, TMA, SM load/store, or NVLS) and whether to fuse that movement with compute.

ParallelKernelBench

We built PKB to test whether models can move beyond pure torch.distributed and actually write production multi-GPU kernels. Each problem starts from a standard PyTorch + NCCL implementation and a description of the hardware topology. The model then has to replace that reference with a CUDA kernel that communicates directly across GPUs using symmetric memory.

Figure: PKB evaluation pipeline. Each problem provides a task, hardware topology, and PyTorch + NCCL reference; the model generates a custom CUDA kernel that is evaluated for correctness, wall-clock speedup, and communication roofline.

To make sure the 87 problems cover the real space of production parallelism types, we built them from a taxonomy of distributed workloads. First, we identified the major ways models get sharded (tensor, context, data, expert, sequence, and FSDP/ZeRO) along with the communication patterns each one creates. Then we chose 87 problems to cover that space, drawn from the codebases of systems such as Megatron-LM, DeepSpeed, DeepEP, TensorRT-LLM, and NeMo-RL, as well as a long tail of non-LLM workloads: GNN routing, distributed FFTs, Gaussian splatting, and others. Because PKB references are written in standard PyTorch + NCCL, the benchmark is also not tied to any single hardware generation; it is designed to evolve alongside next-generation hardware architectures.

Figure: Taxonomy for parallelizing standard transformer blocks. Different sharding strategies create distinct communication patterns across normalization, attention, and MLP, illustrated here for a representative Gemma3-27B layer.

Figure: PKB problem coverage across parallelism types (left) and source codebases (right), spanning RL post-training, LLM training, kernel libraries, vision models, GNNs, and more.

Before evaluating models, we first checked whether the PyTorch + NCCL baselines leave real headroom. A communication-aware roofline says yes: most PKB problems are bottlenecked by NVLink, and the baselines run far below the hardware ceiling. So the next question is simple: can models close that gap?

How frontier models do on PKB

Not well. In the zero-shot setting, the best model solves 28 of 87 problems, and only 22 of those solutions are faster than the PyTorch + NCCL baseline. Sampling three attempts improves the best result to 36 correct solutions and 27 faster-than-baseline solutions, but fast1@3 still tops out at 31%.
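To make the arithmetic behind those percentages concrete, here is a minimal Python sketch. It assumes fast1@3 follows the KernelBench-style fast_p convention: the fraction of all problems for which the best of the sampled attempts is both correct and more than p times faster than the baseline. The article does not spell out that definition, so treat it as an assumption. The `Result` class, the `build_results` helper, and the per-problem speedup values (1.5 and 0.8) are hypothetical placeholders chosen only to reproduce the reported counts; they are not PKB data.

```python
from dataclasses import dataclass


@dataclass
class Result:
    """Hypothetical per-problem outcome for the best of k sampled attempts."""
    correct: bool
    speedup: float  # baseline_time / kernel_time; only meaningful if correct


def fast_p(results, p=1.0):
    """Fraction of ALL problems that are correct and more than p times faster."""
    if not results:
        raise ValueError("results must not be empty")
    hits = sum(1 for r in results if r.correct and r.speedup > p)
    return hits / len(results)


def correct_rate(results):
    """Fraction of all problems with a correct kernel, regardless of speed."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(1 for r in results if r.correct) / len(results)


def build_results(faster, slower, failed):
    """Synthesize outcomes matching given counts (speedups are placeholders)."""
    return (
        [Result(True, 1.5)] * faster    # correct and faster than baseline
        + [Result(True, 0.8)] * slower  # correct but slower than baseline
        + [Result(False, 0.0)] * failed # incorrect
    )


# Counts reported above for the best model on the 87 problems.
zero_shot = build_results(faster=22, slower=28 - 22, failed=87 - 28)
best_of_3 = build_results(faster=27, slower=36 - 27, failed=87 - 36)

for name, results in [("zero-shot", zero_shot), ("best-of-3", best_of_3)]:
    print(f"{name}: correct {correct_rate(results):.1%}, "
          f"fast_1 {fast_p(results, 1.0):.1%}")
```

Under this reading, 27 of 87 works out to roughly 31%, which matches the figure quoted above. The denominator is the full problem set, not just the correct solutions.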
