MoonMath AI Open-Sources a HIP Attention Kernel for AMD MI300X That Beats AITER v3 on Every Shape and Rounding Mode

The MoonMath.ai team has released a bf16 forward attention kernel for AMD’s MI300X GPU. It is written in HIP rather than hand-written assembly, and the code is open source under the MIT license. The team reports that it beats AITER v3, AMD’s own optimized kernel, on every tested shape. Bare-metal access for the work came from HotAisle, an AMD cloud provider.

Attention is the fused softmax(QKᵀ/√d)·V operation inside every transformer. The MI300X is AMD’s CDNA3 data-center GPU, whose ISA target is gfx942. This kernel runs on that hardware only.
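For readers who want the formula in concrete terms, here is a minimal, unfused reference for softmax(QKᵀ/√d)·V on a single head, in plain Python. It is a sketch for illustration only and is not taken from the MoonMath.ai code: the function names `softmax` and `attention` and the list-of-rows layout are assumptions, it computes in ordinary floats rather than bf16, and it does none of the tiling or fusion a GPU kernel performs.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)  # subtract the row max so exp() cannot overflow
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Reference softmax(Q K^T / sqrt(d)) V for one head.

    Q, K, V are lists of rows (hypothetical layout chosen for clarity):
    Q is n_q x d, K is n_k x d, V is n_k x d_v.
    """
    d = len(Q[0])
    scale = 1.0 / math.sqrt(d)
    out = []
    for q in Q:
        # One row of Q K^T, scaled by 1/sqrt(d)
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in K]
        weights = softmax(scores)
        # Weighted sum of the rows of V
        out.append([
            sum(w * v[j] for w, v in zip(weights, V))
            for j in range(len(V[0]))
        ])
    return out
```

A fused kernel produces the same result without ever writing the full score matrix to memory, which is where most of the speed comes from.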

TL;DR

MoonMath.ai open-sources a bf16 forward attention kernel for AMD MI300X, written in HIP, not assembly (MIT).

It beats AMD’s AITER v3 on every shape and rounding mode, with geomean speedups of 1.18×/1.15×/1.08× and up to 1.26×.
