How Flash Attention eliminates the HBM bottleneck in attention by tiling Q, K, and V into blocks that fit in on-chip SRAM: IO complexity, the v1→v2→v3 evolution, FP8 support, and when it stops helping.
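
As a rough illustration of the tiling idea, here is a minimal NumPy sketch of blockwise attention with a running (online) softmax. It runs on the CPU and only mimics the access pattern: each query block visits K and V one block at a time and never builds the full N×N score matrix. The function names `reference_attention` and `tiled_attention` and the block sizes are hypothetical choices made for this example, not the API or tile sizes of any real kernel.

```python
import numpy as np


def reference_attention(q, k, v):
    """Standard attention: materializes the full (N, N) score matrix."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    probs = np.exp(scores)
    probs = probs / probs.sum(axis=-1, keepdims=True)
    return probs @ v


def tiled_attention(q, k, v, block_q=32, block_k=32):
    """Blockwise attention with an online softmax (illustrative sketch).

    block_q and block_k are assumed tile sizes; a real kernel would pick
    them so that the tiles fit in on-chip SRAM.
    """
    n, d = q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.empty((n, v.shape[1]))

    for qs in range(0, n, block_q):
        q_blk = q[qs:qs + block_q]
        rows = q_blk.shape[0]
        m = np.full(rows, -np.inf)          # running row-wise max of scores
        l = np.zeros(rows)                  # running softmax denominator
        acc = np.zeros((rows, v.shape[1]))  # running unnormalized output

        for ks in range(0, k.shape[0], block_k):
            k_blk = k[ks:ks + block_k]
            v_blk = v[ks:ks + block_k]

            s = (q_blk @ k_blk.T) * scale   # only a (block_q, block_k) tile
            m_new = np.maximum(m, s.max(axis=1))
            correction = np.exp(m - m_new)  # rescales earlier partial sums
            p = np.exp(s - m_new[:, None])

            l = l * correction + p.sum(axis=1)
            acc = acc * correction[:, None] + p @ v_blk
            m = m_new

        out[qs:qs + block_q] = acc / l[:, None]

    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q = rng.standard_normal((100, 16))
    k = rng.standard_normal((100, 16))
    v = rng.standard_normal((100, 16))
    print(np.allclose(tiled_attention(q, k, v), reference_attention(q, k, v)))
```

The rescaling by `correction` is what lets each block be processed independently of the ones that follow: partial sums computed under an older row maximum are brought onto the new scale before the next block is added, so the final result matches the unblocked softmax up to floating-point error.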