AMD's MI300X has long had more single-request inference headroom than the default ROCm stack exposes. A Paris startup just showed how much — by deleting the per-token kernel launch entirely.

How the monokernel eliminates kernel-launch overhead

A monokernel is a single, persistent GPU-resident program that runs the entire LLM generation loop (prefill, decode, LM-head sampling, and the EOS stop check) without returning to the host CPU or launching a new kernel per token. Kog AI reports 3,000+ output tokens/s per request for an FP16 2B model at batch size 1 on one 8× MI300X node. The monokernel is the engine behind the Kog Inference Engine tech preview, launched 28 May 2026. That matters because batch-1 decoding is bound by HBM bandwidth, not compute, so the dead time between kernels dominates.

Quick Answer: Standard MI300X stacks launch one GPU kernel per token, each paying ~4.5 μs launch overhead plus HBM restart latency. Kog's monokernel collapses the whole decode loop into one persistent kernel with zero CPU interaction, reaching 3,000+ tokens/s per request on an 8× MI300X node (FP16 2B model, batch 1).
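To put the two quoted figures side by side, here is a rough back-of-envelope sketch in Python. It only uses the ~4.5 μs launch overhead and the 3,000 tokens/s target quoted above. The number of kernels a conventional stack launches per token is not given here, so the counts in the loop are hypothetical, and the sketch ignores the HBM restart latency and any overlap between launches and execution.

```python
# Back-of-envelope: share of a per-token time budget consumed by kernel launches.
# Only the two constants below come from the article; everything else is illustrative.

LAUNCH_OVERHEAD_US = 4.5     # per-launch overhead quoted by Kog (microseconds)
TARGET_TOKENS_PER_S = 3000   # per-request throughput quoted by Kog


def token_budget_us(tokens_per_s: float) -> float:
    """Wall-clock time available per token, in microseconds."""
    return 1e6 / tokens_per_s


def launch_tax_us(kernels_per_token: int,
                  overhead_us: float = LAUNCH_OVERHEAD_US) -> float:
    """Total launch overhead per token if every kernel pays the full overhead."""
    return kernels_per_token * overhead_us


def launch_tax_fraction(kernels_per_token: int,
                        tokens_per_s: float = TARGET_TOKENS_PER_S) -> float:
    """Launch overhead as a fraction of the per-token budget (can exceed 1.0)."""
    return launch_tax_us(kernels_per_token) / token_budget_us(tokens_per_s)


if __name__ == "__main__":
    budget = token_budget_us(TARGET_TOKENS_PER_S)
    print(f"budget per token at {TARGET_TOKENS_PER_S} tok/s: {budget:.1f} us")
    # Hypothetical kernel counts per token; not taken from the article.
    for kernels in (10, 50, 100):
        tax = launch_tax_us(kernels)
        frac = launch_tax_fraction(kernels)
        print(f"{kernels:>3} kernels/token -> {tax:6.1f} us launch tax "
              f"({frac:.0%} of budget)")
```

Under these assumptions, a per-token budget of roughly 333 μs leaves little room for launches: the launch tax scales linearly with kernel count, while a single persistent kernel pays it once for the whole request.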

Conventional stacks (vLLM, SGLang, ROCm/HIP pipelines) launch a fresh kernel for every stage of every token. Kog quantifies the recurring tax that the monokernel removes: