By Intel DCG AI Software and OCTO Parallel Computing Lab
Xe-Forge (Spoczynski et al., 2026) is an Intel project that uses an LLM to optimize Triton kernels for Intel Arc Pro GPUs (Xe2). It works through a sequence of optimization stages (fusion, dtype fixes, memory access, block pointers, XPU-specific tuning, autotuning) in a loop called CoVeR (Chain-of-Verification-and-Refinement) that runs each candidate on the GPU and iterates whenever one fails or regresses. A small knowledge base of Xe2-specific patterns (tensor descriptors, GRF mode 256, tile swizzling) is read at the start of each session because these patterns are not well represented in LLM training data.
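To make the shape of that loop concrete, here is a minimal sketch in plain Python. It is not Xe-Forge's actual code: the stage names come from the description above, but `cover_loop`, `Trial`, `propose`, `run_on_gpu`, and `max_retries` are hypothetical names chosen for illustration, and the accept/retry policy is an assumption about how "iterate whenever a candidate fails or regresses" might be organized.

```python
from dataclasses import dataclass

# Stage names as listed in the article; the rest of this sketch is hypothetical.
STAGES = [
    "fusion",
    "dtype fixes",
    "memory access",
    "block pointers",
    "XPU-specific tuning",
    "autotuning",
]


@dataclass
class Trial:
    correct: bool      # did the candidate match the reference output?
    runtime_ms: float  # measured runtime of the candidate


def cover_loop(kernel, baseline_ms, propose, run_on_gpu, max_retries=3):
    """Walk the stages in order, keeping a candidate only if it verifies and is faster.

    propose(kernel, stage, feedback) -> candidate   (stands in for the LLM call)
    run_on_gpu(candidate) -> Trial                  (stands in for build + benchmark)
    """
    best, best_ms = kernel, baseline_ms
    for stage in STAGES:
        feedback = None
        for _ in range(max_retries):
            candidate = propose(best, stage, feedback)
            trial = run_on_gpu(candidate)
            if not trial.correct:
                # Failed verification: retry this stage with the failure as feedback.
                feedback = "failed verification"
            elif trial.runtime_ms >= best_ms:
                # Correct but not faster: report the regression and retry.
                feedback = f"regressed: {trial.runtime_ms:.3f} ms vs {best_ms:.3f} ms"
            else:
                # Verified and faster: accept and move on to the next stage.
                best, best_ms = candidate, trial.runtime_ms
                break
    return best, best_ms
```

In this sketch a stage that never produces a verified, faster candidate is simply skipped, so the previous best kernel carries forward unchanged. Whether the real system does the same is not stated in the text above.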
On Arc Pro B70, Xe-Forge delivers a 1.26× geomean speedup over PyTorch eager across the full 100 KernelBench Level-2 kernels (69% win rate), a 2.8× geomean over vLLM's production attention and MoE Triton kernels, and up to 13.3× on Flash Attention forward.
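For readers unfamiliar with the two summary metrics, the sketch below shows how a geometric-mean speedup and a win rate are commonly computed from per-kernel timings. The numbers are made-up placeholders, not Xe-Forge measurements, and treating a "win" as a speedup strictly greater than 1.0 is an assumption about the definition used here.

```python
import math


def geomean(speedups):
    """Geometric mean: exp of the average log-speedup."""
    return math.exp(sum(math.log(s) for s in speedups) / len(speedups))


def win_rate(speedups):
    """Fraction of kernels that beat the baseline (assumed: speedup > 1.0)."""
    return sum(s > 1.0 for s in speedups) / len(speedups)


# Hypothetical per-kernel speedups, each defined as baseline_time / optimized_time.
example_speedups = [2.0, 0.5, 4.0, 1.0]

print(f"geomean:  {geomean(example_speedups):.3f}x")
print(f"win rate: {win_rate(example_speedups):.0%}")
```

The geometric mean is the usual choice for averaging ratios because a 2× gain and a 0.5× regression cancel out exactly, which an arithmetic mean would not reflect.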









