DeepSeek just gave every AI company in the world a reason to reconsider its next GPU purchase order. The Chinese AI lab launched DSpark on June 27, an open-source speculative decoding module that bolts onto existing model checkpoints and delivers generation speed improvements of 57% to 85% over previous baselines. In some benchmarks, throughput gains hit 400%.

No retraining required. No quantization hacks. Just a software layer that makes the hardware you already own work significantly harder.

What DSpark actually does

Think of DSpark as a turbocharger for AI inference. Instead of generating tokens one at a time, the framework uses semi-autoregressive drafting to propose entire blocks of tokens, which the full model then verifies in parallel. A confidence head estimates which draft tokens are likely to be correct, and a hardware-aware scheduler routes the workload to whatever chip architecture is available.
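
To make the draft-then-verify idea concrete, here is a minimal toy sketch in Python. It is not DSpark's code or API: the functions `target_next`, `draft_next`, `speculative_step`, and `generate`, the `threshold` parameter, and the counting "models" are all hypothetical stand-ins invented for illustration. The sketch assumes greedy decoding, treats the confidence head as a simple score returned by the drafter, and leaves out the hardware-aware scheduler entirely.

```python
from typing import List, Tuple

Token = int


def target_next(prefix: List[Token]) -> Token:
    """Toy stand-in for the large model: counts upward modulo 10."""
    return (prefix[-1] + 1) % 10


def draft_next(prefix: List[Token]) -> Tuple[Token, float]:
    """Toy stand-in for a cheap drafter with a confidence score.

    It agrees with the target except after a 4 or a 9, where it guesses
    wrong and (conveniently for the demo) reports low confidence.
    """
    last = prefix[-1]
    if last % 5 == 4:
        return (last + 2) % 10, 0.3
    return (last + 1) % 10, 0.9


def speculative_step(prefix: List[Token], k: int, threshold: float) -> List[Token]:
    """Draft up to k tokens, then verify them against the target model."""
    # Draft phase: stop early once confidence drops below the threshold.
    ctx = list(prefix)
    drafts: List[Token] = []
    for _ in range(k):
        tok, conf = draft_next(ctx)
        if conf < threshold:
            break
        drafts.append(tok)
        ctx.append(tok)

    # Verify phase: a real system scores every draft position in one batched
    # forward pass of the target model; this loop only imitates the result.
    ctx = list(prefix)
    accepted: List[Token] = []
    for tok in drafts:
        expected = target_next(ctx)
        if tok != expected:
            accepted.append(expected)  # replace the first wrong draft, drop the rest
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # Every draft was accepted, so the verification pass yields one extra token.
    accepted.append(target_next(ctx))
    return accepted


def generate(prompt: List[Token], n_tokens: int, k: int, threshold: float) -> Tuple[List[Token], int]:
    """Generate n_tokens and report how many verification steps were needed."""
    out = list(prompt)
    steps = 0
    while len(out) - len(prompt) < n_tokens:
        out.extend(speculative_step(out, k, threshold))
        steps += 1
    return out[: len(prompt) + n_tokens], steps


def greedy(prompt: List[Token], n_tokens: int) -> List[Token]:
    """Baseline: one target-model call per generated token."""
    out = list(prompt)
    for _ in range(n_tokens):
        out.append(target_next(out))
    return out
```

In this toy setup, `generate([0], 12, k=4, threshold=0.5)` produces the same 12 tokens as `greedy([0], 12)` but needs only 3 verification steps instead of 12 target-model calls. The output matches because every accepted token is checked against the target, so a wrong draft costs time but never changes the result.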

The module ships as an attachable layer for DeepSeek-V4 checkpoints, specifically the V4-Pro-DSpark and V4-Flash-DSpark variants. But compatibility extends beyond DeepSeek’s own models: speedups have also been documented on other architectures, including Qwen and Gemma.