Channels-last memory format cut our conv backbone latency 22%

TL;DR: Switching our convolutional segmentation backbone to PyTorch's channels-last memory format cut inference latency by about 22% on A100s, with no accuracy change and a four-line code edit.

Our background-removal model at Photoroom spent roughly 31 ms per 1024x1024 image on an A100, and profiling attributed most of that time to cuDNN convolution kernels. The model is a fairly standard U-Net style encoder-decoder, all convolutions, running in float16 under torch.autocast. Before touching the architecture, I wanted to rule out the cheap wins, and the cheapest one turned out to be tensor memory layout. The channels-last memory format gave us most of the speedup we were chasing, and the change fit in a handful of lines. To be precise, the network math is identical; only the order in which tensor elements are laid out in memory changes.
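For a sense of what that handful of lines can look like, here is a minimal sketch. It is not our production code: the tiny `nn.Sequential` model, the batch shape, and the CPU fallback are stand-ins I made up for illustration. The only calls that matter are the two `memory_format=torch.channels_last` conversions, one for the weights and one for the input.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the real backbone: a tiny all-conv stack.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(16, 1, kernel_size=3, padding=1),
).eval()

# Fall back to CPU so the sketch runs anywhere (assumption for illustration).
device = "cuda" if torch.cuda.is_available() else "cpu"

# Convert the weights to channels-last once, at load time.
model = model.to(device).to(memory_format=torch.channels_last)

# Convert each input batch too; the logical shape stays (N, C, H, W).
batch = torch.randn(2, 3, 64, 64, device=device)
batch = batch.contiguous(memory_format=torch.channels_last)

with torch.inference_mode():
    if device == "cuda":
        # float16 autocast, as in the setup described above.
        with torch.autocast(device_type="cuda", dtype=torch.float16):
            out = model(batch)
    else:
        out = model(batch)
```

Converting the model once and the input per batch avoids a layout conversion at the first convolution. Whether every downstream op keeps the channels-last layout depends on the operators involved, so it is worth checking with a profiler instead of assuming it.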

What channels-last memory format changes

The channels-last memory format stores a 4D activation tensor in NHWC order, keeping the channel values for one spatial position contiguous in memory. PyTorch keeps the logical NCHW shape, so your indexing and your model code stay the same. What changes is the stride pattern, which lets cuDNN select kernels that read contiguous channels and run more efficiently on tensor-core hardware.
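The stride change is easy to see with plain arithmetic. The sketch below uses no PyTorch at all; the helper names and the small (2, 3, 4, 5) shape are made up for illustration. It computes element strides for the same logical NCHW shape under both layouts, then the memory offsets of the three channel values at a single pixel.

```python
def nchw_strides(n, c, h, w):
    """Element strides for a contiguous NCHW tensor."""
    return (c * h * w, h * w, w, 1)


def channels_last_strides(n, c, h, w):
    """Element strides for the same logical shape stored in NHWC order."""
    return (h * w * c, 1, w * c, c)


def offset(index, strides):
    """Flat memory offset (in elements) of a logical (n, c, h, w) index."""
    return sum(i * s for i, s in zip(index, strides))


shape = (2, 3, 4, 5)  # logical (N, C, H, W)
default = nchw_strides(*shape)
chlast = channels_last_strides(*shape)

# Offsets of the three channel values at pixel (h=1, w=2) of image 0.
pixel_default = [offset((0, ch, 1, 2), default) for ch in range(3)]
pixel_chlast = [offset((0, ch, 1, 2), chlast) for ch in range(3)]

print(default, pixel_default)  # (60, 20, 5, 1) [7, 27, 47]
print(chlast, pixel_chlast)    # (60, 1, 15, 3) [21, 22, 23]
```

In the default layout the channels of one pixel sit a whole H*W plane apart, while in channels-last they are adjacent. These are the same tuples `Tensor.stride()` reports for a tensor of that shape before and after converting it to `torch.channels_last`.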

Related reading

Flash-Decoding for long-context inference

FlashFFTConv: Efficient Convolutions for Long Sequences with Tensor Cores

Unweight: how we compressed an LLM 22% without sacrificing quality

Semantic caching the VLM step in our product-photo pipeline

Winograd convolutions cost us 2 mAP and we didn't notice for a month

Chipmunk: Training-Free Acceleration of Diffusion Transformers with Dynamic…
