Auto-Generated CUDA Kernels Need Kernel-Level Validation

An LLM-written kernel benchmarked 38% faster on a microbench. Here is what kernel-level validation showed it actually did at runtime.

TL;DR

Multi-agent LLM systems are now writing CUDA kernels (RightNow AI’s AutoKernel, Meta’s KernelEvolve, a multi-agent system claiming a 38% speedup on Blackwell). Source-level benchmarks measure clean throughput on a single isolated kernel. They leave four questions unanswered: SM occupancy under co-scheduling, DRAM bandwidth saturation, how long the dispatcher spends off-CPU during a real serving workload, and how NCCL waits correlate with sibling kernels. Kernel-level validation closes that gap: an eBPF trace of the same kernel running under the same workload as production answers all four questions in one capture.
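To make the gap between an isolated microbenchmark and an in-workload measurement concrete, here is a minimal Python sketch. It assumes you already have per-launch latency samples (in microseconds) for a baseline kernel and a candidate kernel, collected once in isolation and once under the production-like workload. The function names `speedup_pct` and `compare_conditions` are hypothetical, and using the median as the summary statistic is an assumption of this sketch, not something the tools named above prescribe.

```python
from statistics import median


def speedup_pct(baseline_us, candidate_us):
    """Percent throughput gain of the candidate over the baseline.

    Both arguments are sequences of per-launch latencies in microseconds.
    The median is used so a few outlier launches do not dominate (assumption).
    """
    if not baseline_us or not candidate_us:
        raise ValueError("need at least one latency sample per kernel")
    return (median(baseline_us) / median(candidate_us) - 1.0) * 100.0


def compare_conditions(isolated, in_workload):
    """Compare the headline speedup in isolation with the one under load.

    Each argument is a (baseline_samples, candidate_samples) pair.
    Returns both speedups and the gap between them, in percentage points.
    """
    iso = speedup_pct(*isolated)
    real = speedup_pct(*in_workload)
    return {"isolated_pct": iso, "in_workload_pct": real, "gap_pct": iso - real}
```

A large positive `gap_pct` would suggest that the microbenchmark number does not carry over to the serving workload, which is the situation where the four questions above become worth tracing.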

The kernel-writing wave

Three pieces of work in April surfaced the same pattern: agents generate CUDA kernels, then quote a single throughput number against a baseline.

Related reading

ParallelKernelBench: Frontier LLMs can't write fast multi-GPU kernels (yet)

Generation-Side Tooling Outpaces Validation-Side Tooling

Speeding up GPU kernels by 38% with a multi-agent system · Cursor

Fused Kernels in LLMs: Reducing Memory Bandwidth Bottlenecks Through GPU Kernel…

Custom Kernels for All from Codex and Claude

Stop Using LLMs to Audit Other LLMs: You Are Bricking Your Production Latency
