ParallelKernelBench: Frontier LLMs can't write fast multi-GPU kernels (yet)
ParallelKernelBench tests whether LLMs can write fast multi-GPU CUDA kernels across 87 real workloads. The best model solves fewer than a third of them, but a few of the generated kernels beat every public implementation.