FlashFFTConv: Efficient Convolutions for Long Sequences with Tensor Cores

Convolution models with long filters now have state-of-the-art reasoning abilities in many long-sequence tasks, from long-range language modeling to audio analysis and DNA modeling. But they lag behind the most optimized Transformers in wall-clock time. A major bottleneck is the Fast Fourier Transform (FFT), which allows long convolutions to run in $O(N \log N)$ time in sequence length $N$ but has poor hardware utilization. We propose FlashFFTConv, a new algorithm for efficiently computing the FFT convolution on GPUs. FlashFFTConv speeds up convolutions by up to 7.93x over PyTorch and achieves up to 4.4x speedup end-to-end. Starting at sequence length 2K, FlashFFTConv starts to match the performance of FlashAttention-v2 – and outperforms it for longer sequences, achieving up to 62% MFU.‍Paper: FlashFFTConv: Efficient Convolutions for Long Sequences with Tensor Cores. Daniel Y. Fu*, Hermann Kumbong*, Eric Nguyen, Christopher Ré. arXiv linkTry it now: https://github.com/HazyResearch/flash-fft-convFlashFFTConv: Efficient Convolutions for Long Sequences with Tensor CoresOver the past few years, we've seen convolutional sequence models take the world of long-sequence modeling by storm. From S4 and follow-ups like S5 and SS, to gated architectures like Monarch Mixer, BiGS, Hyena, and GSS, to hybrid architectures like Mega, H3, and Liquid-S4 – long convolutional models have had a major impact and are here to stay. These models are exciting because they are sub-quadratic in sequence length – thanks to the FFT convolution algorithm, they can be computed in $O(N \log N)$ time in sequence length $N$:‍def conv(u, k):

domenica 17 maggio 2026 New tab

FlashFFTConv: Efficient Convolutions for Long Sequences with Tensor Cores

FlashFFTConv: Efficient Convolutions for Long Sequences with Tensor Cores

Other newsrooms on this story

Related reading

FlashConv: Speeding up state space models

Flash-Decoding for long-context inference

FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision

FlashAttention-4: Algorithm and Kernel Pipelining Co-Design for Asymmetric…

Chipmunk: Training-Free Acceleration of Diffusion Transformers with Dynamic…

Flash Attention: what it does and why it matters

Other newsrooms on this story

Related reading

FlashConv: Speeding up state space models

Flash-Decoding for long-context inference

FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision

FlashAttention-4: Algorithm and Kernel Pipelining Co-Design for Asymmetric…

Chipmunk: Training-Free Acceleration of Diffusion Transformers with Dynamic…

Flash Attention: what it does and why it matters