Speeding up GPU kernels by 38% with a multi-agent system · Cursor

Over the past few months, we've been developing a multi-agent system that can build, maintain, and deploy complex software autonomously. As part of that work, we've been testing the system in a variety of domains, including having it build a browser from scratch and solve a research-level math problem on the First Proof benchmark.

Recently, we began collaborating with NVIDIA on a new challenge: applying the multi-agent harness to optimize CUDA kernels. These are difficult technical problems with important real-world consequences: CUDA kernels are the core software that supports AI model training and inference on NVIDIA GPUs. Faster kernels mean better GPU utilization, reduced energy consumption, lower latency, and reduced cost per token, allowing providers to serve bigger, more capable models to more users at once.

Our multi-agent harness operated autonomously for three weeks across 235 problems. The system achieved a 38% geomean speedup by building and optimizing Blackwell GPU kernels from scratch, all the way down to the assembly level.
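To make the headline metric concrete, here is a minimal sketch of how a geomean speedup can be computed from per-problem timings. It assumes that speedup for each problem is defined as baseline time divided by optimized time; the function name `geomean_speedup` and the timings are hypothetical, not taken from our benchmark data.

```python
import math

def geomean_speedup(baseline_times, optimized_times):
    """Geometric mean of per-problem speedups (baseline / optimized)."""
    if len(baseline_times) != len(optimized_times) or not baseline_times:
        raise ValueError("need two non-empty lists of equal length")
    # Average the log of each ratio, then exponentiate; this avoids
    # overflow or underflow from multiplying hundreds of ratios directly.
    log_sum = sum(math.log(b / o) for b, o in zip(baseline_times, optimized_times))
    return math.exp(log_sum / len(baseline_times))

# Hypothetical timings in milliseconds (illustrative only).
baseline = [2.0, 4.0, 1.0]
optimized = [1.0, 4.0, 2.0]
print(geomean_speedup(baseline, optimized))
```

In this toy example a 2x speedup and a 2x slowdown cancel out, so the geomean reports no net change, whereas an arithmetic mean of the same ratios would suggest an improvement. That property is why a geomean is a common choice for summarizing speedups across many problems.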

Performance improvements of this size typically take months or years of work by highly experienced kernel engineers. The multi-agent system achieved them in weeks, addressing a long tail of kernel problems that had been impractical to tackle with existing approaches.
