Intel XPU Kernel Skill: LLM-driven Triton kernel optimization for the Hugging Face Kernel Hub


By Intel DCG AI Software and OCTO Parallel Computing Lab

Xe-Forge (Spoczynski et al., 2026) is an Intel project that uses an LLM to optimize Triton kernels for Intel Arc Pro GPUs (Xe2). It works through a sequence of optimization stages — fusion, dtype fixes, memory access, block pointers, XPU-specific tuning, autotuning — in a loop called CoVeR (Chain-of-Verification-and-Refinement) that runs each candidate on the GPU and iterates whenever one fails or regresses. A small knowledge base of Xe2-specific patterns (tensor descriptors, GRF mode 256, tile swizzling) is read at the start of each session because these aren't well-represented in LLM training data.
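The staged verify-and-refine loop described above can be sketched roughly as follows. This is a minimal illustration, not Xe-Forge's actual code: only the stage names come from the description, while `Result`, `optimize`, `propose`, `evaluate`, and `max_trials` are hypothetical names, and the LLM call and the GPU run are replaced by stub functions so the sketch runs anywhere.

```python
from dataclasses import dataclass

# Stage order as listed in the text; all other names here are hypothetical.
STAGES = ["fusion", "dtype_fixes", "memory_access",
          "block_pointers", "xpu_tuning", "autotuning"]

@dataclass
class Result:
    correct: bool      # did the candidate match the reference output?
    runtime_ms: float  # measured runtime of the candidate

def optimize(kernel_src, propose, evaluate, max_trials=3):
    """Walk the stages; keep a candidate only if it is correct and faster."""
    best_src, best = kernel_src, evaluate(kernel_src)
    for stage in STAGES:
        feedback = None
        for _ in range(max_trials):
            candidate = propose(best_src, stage, feedback)
            result = evaluate(candidate)
            if result.correct and result.runtime_ms < best.runtime_ms:
                best_src, best = candidate, result
                break
            # Failure or regression: pass the reason back and try again.
            feedback = ("incorrect output" if not result.correct
                        else f"regressed to {result.runtime_ms:.3f} ms")
    return best_src, best

# Stubs standing in for the LLM and the device (assumptions for the demo).
def fake_propose(src, stage, feedback):
    # Pretend the first dtype attempt is buggy until feedback arrives.
    if stage == "dtype_fixes" and feedback is None:
        return src + "+bad"
    return src + "+" + stage

def fake_evaluate(src):
    # Each applied stage shaves 1 ms off a 10 ms baseline in this toy model.
    return Result(correct="+bad" not in src, runtime_ms=10.0 - src.count("+"))

final_src, final = optimize("base", fake_propose, fake_evaluate)
print(final_src, final.runtime_ms)
```

In this sketch a rejected candidate never replaces the current best, so a failed stage leaves the kernel no worse than before; whether the real system behaves exactly this way is an assumption.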

On Arc Pro B70, Xe-Forge delivers a 1.26× geomean speedup over PyTorch eager across the full 100 KernelBench Level-2 kernels (69% win rate), a 2.8× geomean over vLLM's production attention and MoE Triton kernels, and up to 13.3× on Flash Attention forward.
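For readers less familiar with these summary metrics, the sketch below shows how a geomean speedup and a win rate are typically computed from per-kernel speedups. It assumes that "win rate" means the fraction of kernels with a speedup above 1×, which the text does not define explicitly, and the speedup values are made up for illustration rather than taken from the evaluation.

```python
import math

def geomean(speedups):
    """Geometric mean: exp of the average log-speedup."""
    return math.exp(sum(math.log(s) for s in speedups) / len(speedups))

def win_rate(speedups):
    """Fraction of kernels strictly faster than the baseline (assumed definition)."""
    return sum(s > 1.0 for s in speedups) / len(speedups)

# Illustrative per-kernel speedups over a baseline, not measured data.
speedups = [2.0, 0.5, 4.0, 1.0]

print(f"geomean: {geomean(speedups):.3f}x, win rate: {win_rate(speedups):.0%}")
```

The geometric mean is the usual choice for ratios because a 2× speedup and a 0.5× slowdown cancel out, which an arithmetic mean would not reflect.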
