An LLM-written kernel benchmarked 38% faster on a microbench. Here is what kernel-level validation showed it actually did at runtime.
TL;DR
Multi-agent LLM systems are now writing CUDA kernels (RightNow AI’s AutoKernel, Meta’s KernelEvolve, a multi-agent system claiming a 38% speedup on Blackwell). Source-level benchmarks measure clean throughput on a single isolated kernel. They do not measure SM occupancy under co-scheduling, DRAM bandwidth saturation, dispatcher off-CPU time during a real serving workload, or how NCCL waits correlate with sibling kernels. Kernel-level validation closes that gap: an eBPF trace of the same kernel, running under the same workload as production, answers all four questions in one capture.
The kernel-writing wave
Three pieces of work in April surfaced the same pattern: agents generate CUDA kernels, then quote a single throughput number against a baseline.
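To make "a single throughput number against a baseline" concrete, here is a minimal sketch of how such a figure is typically produced: time each implementation alone, take the median, and report the ratio. This is not the method of any of the projects above. The function names and the CPU stand-in workloads are hypothetical, and reading "38% faster" as a throughput gain (rather than a 38% cut in runtime) is an assumption. A real GPU harness would time with CUDA events and synchronize the device, which this sketch omits.

```python
import statistics
import time
from typing import Callable


def median_runtime(fn: Callable[[], None], warmup: int = 3, iters: int = 20) -> float:
    """Median wall-clock seconds per call of fn, measured in isolation."""
    for _ in range(warmup):
        fn()  # warm caches before timing
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def speedup_pct(baseline_s: float, candidate_s: float) -> float:
    """Throughput gain of the candidate over the baseline, in percent.

    Assumes 'X% faster' means throughput rose by X%, i.e.
    baseline_time / candidate_time = 1 + X/100.
    """
    return (baseline_s / candidate_s - 1.0) * 100.0


# Hypothetical stand-ins for a baseline kernel and a generated kernel.
def baseline_kernel() -> None:
    sum(i * i for i in range(20_000))


def candidate_kernel() -> None:
    sum(i * i for i in range(10_000))


if __name__ == "__main__":
    base = median_runtime(baseline_kernel)
    cand = median_runtime(candidate_kernel)
    # One number, one kernel, nothing else running: this is all the benchmark sees.
    print(f"speedup: {speedup_pct(base, cand):.1f}%")
```

Nothing in this loop shares the machine with other work, so the resulting percentage says nothing about co-scheduling, memory bandwidth contention, or time the dispatcher spends off-CPU.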






