Kog hits 3K t/s on MI300X, no kernel switches — test it now

AMD's MI300X has long had more single-request inference headroom than the default ROCm stack exposes. A Paris startup just showed how much — by deleting the per-token kernel launch entirely.

How the monokernel eliminates kernel-launch overhead

A monokernel is a single, persistent GPU-resident program that runs the entire LLM inference loop (prefill, decode, LM-head sampling, and the EOS stop check) without returning to the host CPU or launching a new kernel per token. Kog AI reports 3,000+ output tokens/s per request for an FP16 2B model at batch size 1 on one 8× MI300X node, using the engine behind the Kog Inference Engine tech preview launched 28 May 2026. That matters because batch-1 decoding is bound by HBM bandwidth, not compute, so the dead time between kernels becomes a dominant cost.
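To put the bandwidth-bound claim in rough numbers, here is a back-of-envelope sketch. It assumes every weight is read from HBM once per generated token, uses AMD's published peak of 5.3 TB/s per MI300X (a figure not given in this article), and treats sharding across GPUs as perfectly efficient. The article does not say whether a single request spans all eight GPUs, so the eight-GPU figure is an idealized ceiling, not a description of Kog's setup. The function name and constants are illustrative.

```python
def bandwidth_bound_tokens_per_s(params, bytes_per_param, hbm_bytes_per_s, n_gpus=1):
    """Ceiling on batch-1 decode rate if each weight is streamed once per token.

    Ignores KV-cache traffic, activations, and inter-GPU communication,
    so real throughput will land below this number.
    """
    weight_bytes = params * bytes_per_param
    seconds_per_token = weight_bytes / (hbm_bytes_per_s * n_gpus)
    return 1.0 / seconds_per_token


PARAMS = 2e9            # 2B-parameter model, as in the article
FP16_BYTES = 2          # FP16 weights
MI300X_HBM_BW = 5.3e12  # assumption: AMD's published peak HBM bandwidth, bytes/s

single_gpu = bandwidth_bound_tokens_per_s(PARAMS, FP16_BYTES, MI300X_HBM_BW)
full_node = bandwidth_bound_tokens_per_s(PARAMS, FP16_BYTES, MI300X_HBM_BW, n_gpus=8)

print(f"1x MI300X ceiling: {single_gpu:,.0f} tokens/s")
print(f"8x MI300X ceiling (ideal sharding): {full_node:,.0f} tokens/s")
```

Under these assumptions, one GPU tops out in the low thousands of tokens per second from weight streaming alone, which is why any idle gap between kernels eats directly into the achievable rate.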

Quick Answer: Standard MI300X stacks launch a fresh GPU kernel for each stage of every token, each launch paying ~4.5 μs of overhead plus HBM restart latency. Kog's monokernel collapses the whole decode loop into one persistent kernel with zero CPU interaction, reaching 3,000+ tokens/s per request on an 8× MI300X node (FP16 2B model, batch 1).

Conventional stacks (vLLM, SGLang, ROCm/HIP pipelines) launch a fresh kernel for every stage of every token. Kog quantifies the recurring tax the monokernel removes: roughly 4.5 μs of launch overhead per kernel, plus HBM restart latency.
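A quick way to see why that tax matters at 3,000 tokens/s is to compare it with the time budget for one token. The sketch below uses the ~4.5 μs launch figure quoted in this article. The kernels-per-token counts are hypothetical values chosen for illustration, since the article gives no count, and HBM restart latency is left out because no number is given for it.

```python
LAUNCH_OVERHEAD_S = 4.5e-6   # per-kernel launch overhead quoted in the article
TARGET_TOKENS_PER_S = 3000   # Kog's reported per-request rate


def launch_tax(kernels_per_token,
               tokens_per_s=TARGET_TOKENS_PER_S,
               launch_overhead_s=LAUNCH_OVERHEAD_S):
    """Return (overhead seconds per token, fraction of the per-token budget)."""
    budget_s = 1.0 / tokens_per_s
    overhead_s = kernels_per_token * launch_overhead_s
    return overhead_s, overhead_s / budget_s


# Hypothetical launch counts per token; real stacks vary by model and fusion level.
for k in (1, 10, 50):
    overhead_s, fraction = launch_tax(k)
    print(f"{k:>3} launches/token: {overhead_s * 1e6:6.1f} us "
          f"({fraction:.1%} of the {1e6 / TARGET_TOKENS_PER_S:.0f} us budget)")
```

At this target rate the whole budget is about 333 μs per token, so a few dozen launches would consume most of it before any useful work runs. A persistent kernel pays the launch cost once per request instead of once per stage.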

Related reading

Fused Kernels in LLMs: Reducing Memory Bandwidth Bottlenecks Through GPU Kernel…

MoonMath AI Open-Sources a HIP Attention Kernel for AMD MI300X That Beats AITER…

ThunderKittens Now Optimized for NVIDIA Blackwell GPUs

Kimi K2.5 runs on RTX 3060 with 768GB Intel Optane memory at 4 tokens per second

China's Xiaomi MiMo Is Now 15X Faster Than ChatGPT and Claude - Decrypt

Intel XPU Kernel Skill: LLM-driven Triton kernel optimization for the Hugging…
