Orthrus: Parallel Token Generation That Doesn't Change Your Model's Output

Speculative decoding cut LLM inference latency by predicting multiple tokens ahead and validating them with the base model. It works — but you pay for it with a separate draft model, a second KV cache, and acceptance rates that fall off when the drafter misreads the distribution. Orthrus is a research direction that aims for the same speedup without those overheads. It bolts a trainable diffusion attention module onto each layer of a frozen autoregressive Transformer and uses it to emit blocks of tokens in parallel.

The claim that should catch a developer's eye: 32 tokens per forward pass, while the base model's output distribution stays mathematically identical. If the math holds in practice, you get parallel generation without the "is the drafter agreeing with the target" hand-wringing that defines speculative decoding.

This is still early research, not a pip install. The architecture is worth understanding anyway, because it points at a different design space for self-hosted inference — one where the speedup comes from inside the model, not from a separate drafter running next to it.

How Orthrus generates tokens in parallel

The base Transformer stays frozen. Orthrus inserts a diffusion attention module at each layer that operates on a set of placeholder positions — a block of 32 future tokens in the published configuration. During inference, the diffusion module iteratively refines those placeholders into concrete tokens through a small number of denoising steps that share the existing layer activations.

How Orthrus generates tokens in parallel

Orthrus: Parallel Token Generation That Doesn't Change Your Model's Output

Orthrus: Parallel Token Generation That Doesn't Change Your Model's Output

Other newsrooms on this story

Related reading

Speculative decoding for high-throughput long-context inference

Why MoE models get more from speculative decoding

SpecExec: Massively Parallel Speculative Decoding for Interactive LLM Inference…

Boosting DeepSeek-R1’s Speed with Customized Speculative Decoding

Consistency diffusion language models: Up to 14x faster inference without…

NVIDIA AI Releases Nemotron-Labs-Diffusion: A Tri-Mode Language Model with 6×…

Other newsrooms on this story

Related reading

Speculative decoding for high-throughput long-context inference

Why MoE models get more from speculative decoding

SpecExec: Massively Parallel Speculative Decoding for Interactive LLM Inference…

Boosting DeepSeek-R1’s Speed with Customized Speculative Decoding

Consistency diffusion language models: Up to 14x faster inference without…

NVIDIA AI Releases Nemotron-Labs-Diffusion: A Tri-Mode Language Model with 6×…