Winograd convolutions cost us 2 mAP and we didn't notice for a month

TL;DR: We turned on Winograd convolution to shave latency off a pedestrian detector running on a Cortex-A53, got a clean 18% speedup, and silently lost 2.1 mAP because the F(4,3) transform overflowed in fp16. The accuracy drop hid inside our aggregate metric for almost a month before a per-distance breakdown caught it.

Winograd convolution is one of those optimisations that looks free. You replace the direct 3x3 convolution with an input transform, a set of elementwise multiplies, and an output transform, and the multiply count drops. For F(2,3), the smallest tiling, you go from 36 multiplies per 2x2 output tile down to 16, a 2.25x reduction. For F(4,3), the tiling we used, a 4x4 output tile takes 36 multiplies instead of 144, which on paper is a 4x reduction in multiplies for your 3x3 layers, and 3x3 is most of a modern backbone.
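To make the counting concrete, here is a minimal sketch of the 1D F(2,3) case in plain Python, checked against the direct computation. It is an illustration, not our runtime's kernel: the function names (`direct_f23`, `winograd_f23`, `multiplies_per_tile`) are made up for this post, and real implementations nest the 1D transform to get the 2D version and precompute the filter transform offline.

```python
def direct_f23(d, g):
    """Direct 1D correlation: 2 outputs from 4 inputs and a 3-tap filter (6 multiplies)."""
    return [
        d[0] * g[0] + d[1] * g[1] + d[2] * g[2],
        d[1] * g[0] + d[2] * g[1] + d[3] * g[2],
    ]


def winograd_f23(d, g):
    """Winograd F(2,3): the same 2 outputs using 4 data multiplies."""
    # Filter transform. Depends only on the weights, so it can be done once offline.
    u = [g[0], (g[0] + g[1] + g[2]) / 2, (g[0] - g[1] + g[2]) / 2, g[2]]
    # Input transform: additions and subtractions only.
    v = [d[0] - d[2], d[1] + d[2], d[2] - d[1], d[1] - d[3]]
    # Elementwise multiplies: this is where the saving comes from.
    m = [v[i] * u[i] for i in range(4)]
    # Output transform: additions and subtractions only.
    return [m[0] + m[1] + m[2], m[1] - m[2] - m[3]]


def multiplies_per_tile(m, r):
    """Multiplies for one m x m output tile with an r x r filter: (direct, winograd)."""
    return (m * m * r * r, (m + r - 1) ** 2)


if __name__ == "__main__":
    d = [1.0, 2.0, 3.0, 4.0]
    g = [0.5, -1.0, 2.0]
    print(direct_f23(d, g), winograd_f23(d, g))
    print(multiplies_per_tile(2, 3), multiplies_per_tile(4, 3))
```

`multiplies_per_tile(2, 3)` gives `(36, 16)` and `multiplies_per_tile(4, 3)` gives `(144, 36)`. The count ignores the extra additions in the transforms, which is part of why measured speedups land well below the paper ratio.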

We run a small detector on a Cortex-A53 board for an indoor people-counting product: MobileNetV3 backbone, roughly 4.2M params after pruning. The team is three CV engineers and one firmware person. We had a 41 ms inference budget and were sitting at 39 ms, which is the kind of margin that keeps you up at night.

What we turned on

Our runtime exposes Winograd as a per-layer flag. We flipped it on for every 3x3 stride-1 layer, rebuilt, and measured.
