DeepSeek-V3: The 671B MoE Model You Can Run Locally in 2026
TL;DR Summary
DeepSeek-V3 is a 671B parameter Mixture-of-Experts model with only 37B activated per token — rivaling GPT-4o and Claude 3.5 Sonnet on benchmarks
Trained on 14.8 trillion tokens using innovative FP8 mixed precision — only 2.664M H800 GPU hours for full pre-training, with zero irrecoverable loss spikes
104k GitHub stars, MIT license, commercial use allowed — open weights available on Hugging Face










