TL;DR: I built gocudrv so Go services can talk directly to NVIDIA GPUs — no cgo, no CUDA toolkit, no bloated Python dependencies. One static binary.

Last month I was reviewing a production AI service. The core business logic was clean, efficient Go (15MB binary), but GPU access was routed through a Python sidecar.

The results were painful:

8.4GB Docker images — bloated with unused CUDA toolkits and PyTorch dependencies

4-minute cold starts during autoscaling