TL;DR: I thought my CUDA kernel was running in about 160 microseconds. I was wrong. Here is how I used CUDA Events in pure Go to find the real hardware time, and why CPU-side timers are the wrong tool for GPU forensics.
I wrapped my kernel launch in a standard Go time.Since(start) block and saw 162 microseconds.
I thought I had built a speed demon. Then I implemented real GPU Events and found the truth.
The Misleading Metric
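A CUDA kernel launch is asynchronous with respect to the host. The CPU doesn't wait for the GPU to finish; it places the work in a queue (a stream) and returns control to your Go program immediately. So a time.Since(start) wrapped around the launch mostly measures how long it took to enqueue the kernel, not how long the kernel ran.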










