The Microsecond Lie: Why your Go timers are lying about the GPU

TL;DR: I thought my CUDA kernel was running in 160 microseconds. I was wrong. Here is how I used CUDA Events in pure Go to find the real hardware time, and why CPU-side timers are the wrong tool for GPU forensics.

I wrapped my kernel launch in a standard Go time.Since(start) block and saw 162 microseconds.

I thought I had built a speed demon. Then I implemented real GPU Events and found the truth.

The Misleading Metric

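A CUDA kernel launch is asynchronous with respect to the host. The CPU doesn't wait for the GPU to finish; it places the work in a queue (a stream) and returns control to your Go program immediately. A timer stopped right after the launch therefore mostly measures how long it took to enqueue the work, not how long the GPU spent running it.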
