TL;DR AI

AMD shipped ATOM + ATOMesh, a ROCm serving stack that runs prefill and decode on separate GPU pools because the two phases have opposite bottlenecks. Separating compute-bound prefill from memory-bandwidth-bound decode is meant to reduce stalls and raise utilization; the stack is positioned as a vLLM competitor on AMD Instinct.

What: AMD shipped ATOM + ATOMesh, a ROCm-native LLM serving stack whose headline trick is prefill/decode disaggregation — splitting the two phases of inference onto separate pools of GPUs instead of crowding them onto one.

Why: Prefill and decode have opposite bottlenecks — prefill is compute-bound, decode is memory-bandwidth-bound — so running them on the same worker wastes hardware and lets one long prompt stall everyone else's token stream.
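The gap between the two phases can be made concrete with a rough arithmetic-intensity estimate (FLOPs per byte moved) for a single dense layer. This is a back-of-envelope sketch, not anything taken from ATOM itself: the layer size (4096 x 4096), the fp16 element size, and the token counts are illustrative assumptions, and the function name is hypothetical.

```python
def arithmetic_intensity(n_tokens, d_in, d_out, bytes_per_elem=2):
    """Rough FLOPs-per-byte for one dense layer applied to n_tokens tokens.

    Assumes weights are read once and activations are read/written once;
    caching effects and attention are ignored (illustrative model only).
    """
    flops = 2 * n_tokens * d_in * d_out                # one multiply-add per weight per token
    weight_bytes = d_in * d_out * bytes_per_elem       # weights streamed from memory
    act_bytes = n_tokens * (d_in + d_out) * bytes_per_elem  # input + output activations
    return flops / (weight_bytes + act_bytes)


# Hypothetical sizes: a 4096x4096 layer in fp16.
prefill = arithmetic_intensity(n_tokens=2048, d_in=4096, d_out=4096)  # whole prompt at once
decode = arithmetic_intensity(n_tokens=1, d_in=4096, d_out=4096)      # one new token

print(f"prefill: {prefill:.1f} FLOPs/byte")
print(f"decode:  {decode:.3f} FLOPs/byte")
```

Under these assumptions a 2048-token prefill does on the order of a thousand FLOPs per byte of memory traffic, while single-token decode does about one, which is why decode tends to be limited by memory bandwidth rather than compute.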

vs prior: A co-located server (vanilla single-pool vLLM) interleaves prefill and decode on the same GPUs; disaggregation runs each on its own pool tuned for its bottleneck, paying for it by shipping the KV cache across the interconnect between them.
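The cost of shipping the KV cache can be estimated with simple arithmetic. The sketch below uses the standard per-sequence KV cache size formula; the model shape (32 layers, 8 KV heads, head dimension 128, fp16) and the 50 GB/s link bandwidth are hypothetical values chosen for illustration, not ATOM or ATOMesh figures.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """KV cache size for one sequence: keys and values for every layer and token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


def transfer_seconds(n_bytes, link_bytes_per_sec):
    """Idealized transfer time; ignores latency, protocol overhead, and overlap."""
    return n_bytes / link_bytes_per_sec


# Hypothetical model shape and interconnect bandwidth (assumptions, not ATOM specs).
size = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=8192)
t = transfer_seconds(size, link_bytes_per_sec=50e9)

print(f"KV cache: {size / 2**30:.2f} GiB")
print(f"transfer: {t * 1000:.1f} ms at 50 GB/s")
```

With these assumed numbers, an 8192-token prompt produces about 1 GiB of KV cache, so the handoff from the prefill pool to the decode pool costs tens of milliseconds per request unless it is overlapped with other work.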

Think of it as

A restaurant kitchen that splits the prep station from the plating line.

dev.to

AMD ATOM + ATOMesh: Prefill/decode Disaggregation on ROCm


Sunday, 21 June 2026

1,765 words, ~8 min read

Related reading

Why DDR5 Bandwidth Kills Dual-LLM Inference on APUs (Benchmarks Inside)

Fused Kernels in LLMs: Reducing Memory Bandwidth Bottlenecks Through GPU Kernel…

Running AI on mixed hardware for speed and affordability

Solving the Decode Bottleneck: Why Agentic Inference Needs Hybrid Hardware

Running Mixtral 8x7B at 21+ TPS on Pure CPU via io_uring and Predictive Caching

Cache-aware prefill–decode disaggregation (CPD) for up to 40% faster…
