DeepSeek has released DSpark, a new method that the company says boosts per-user response speed for its AI models by 60 to 85 percent.

Most LLMs generate text one token at a time. That leads to low GPU utilization and long wait times for lengthy responses, DeepSeek says. Its new framework, DSpark, uses speculative decoding, in which a small draft model proposes candidate tokens that the larger model then verifies in batches. It also generates short groups of tokens rather than single tokens, boosting overall efficiency. A confidence-based system adjusts verification depth on the fly depending on compute load, cutting the processing wasted on token proposals that end up rejected.
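
The draft-then-verify loop behind speculative decoding can be illustrated with a minimal Python sketch. It is not DSpark's implementation: both "models" are toy deterministic functions, and the names `target_next`, `draft_next`, `speculative_decode`, and the block size `k` are assumptions made purely for illustration. The sketch covers only the generic greedy form of the technique and leaves out the confidence-based verification depth the company describes.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]


def target_next(context: List[Token]) -> Token:
    """Toy stand-in for the large model: a deterministic next-token rule."""
    return (sum(context) * 31 + len(context)) % 50


def draft_next(context: List[Token]) -> Token:
    """Toy stand-in for the small model: usually agrees with the target."""
    guess = target_next(context)
    # Deliberately wrong whenever the context length is a multiple of 4,
    # so the rejection path below is exercised.
    return (guess + 1) % 50 if len(context) % 4 == 0 else guess


def greedy_decode(model: NextToken, prompt: List[Token], n_new: int) -> List[Token]:
    """Baseline: one model step per generated token."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(model(tokens))
    return tokens


def speculative_decode(
    target: NextToken, draft: NextToken, prompt: List[Token], n_new: int, k: int = 4
) -> Tuple[List[Token], int]:
    """Greedy speculative decoding; returns (tokens, verification rounds)."""
    tokens = list(prompt)
    goal = len(prompt) + n_new
    rounds = 0
    while len(tokens) < goal:
        rounds += 1
        # 1. The draft model proposes k tokens, one after another.
        proposal: List[Token] = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))
        # 2. The target checks every proposed position. A real system would do
        #    this in one batched forward pass; here it is a plain loop.
        accepted: List[Token] = []
        for tok in proposal:
            expected = target(tokens + accepted)
            if tok != expected:
                accepted.append(expected)  # replace the first mismatch and stop
                break
            accepted.append(tok)
        else:
            # All k proposals matched, so the target contributes one extra token.
            accepted.append(target(tokens + accepted))
        tokens.extend(accepted)
    return tokens[:goal], rounds


if __name__ == "__main__":
    prompt = [1, 2, 3]
    baseline = greedy_decode(target_next, prompt, 20)
    spec, rounds = speculative_decode(target_next, draft_next, prompt, 20)
    print(spec == baseline, rounds)
```

Because every accepted token is one the target model would have chosen anyway, the output is identical to plain greedy decoding. The potential gain is that the target is consulted in fewer rounds than there are generated tokens, provided the draft model agrees with it often enough.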

Throughput vs. per-user generation speed (TPS) for DeepSeek-V4-Flash and DeepSeek-V4-Pro under live traffic. DSpark (green) pushes the performance frontier for both throughput and interactivity well beyond the MTP baseline (blue). | Image: DeepSeek

DeepSeek also tested DSpark with open models from Google DeepMind (Gemma) and Alibaba (Qwen), suggesting the approach works broadly. The framework and DeepSeek-V4-Pro model, developed jointly with Peking University, are available on Hugging Face and GitHub under the MIT license. Technical details are in the paper.