DeepSeek-V3: The 671B MoE Model You Can Run Locally in 2026

TL;DR

DeepSeek-V3 is a 671B-parameter Mixture-of-Experts model that activates only 37B parameters per token, with reported benchmark results rivaling GPT-4o and Claude 3.5 Sonnet

Trained on 14.8 trillion tokens with FP8 mixed-precision training: pre-training took only 2.664M H800 GPU hours, with no irrecoverable loss spikes

104k GitHub stars; MIT license, with commercial use allowed; open weights available on Hugging Face
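
To put the headline numbers in perspective, here is a rough back-of-the-envelope sketch in Python. It uses only the figures quoted above (671B total parameters, 37B active per token, 14.8 trillion tokens, 2.664M H800 GPU hours). The function names are hypothetical, and the bits-per-parameter values in the memory estimate are assumptions for illustration: they count weight storage only and ignore KV cache, activations, and runtime overhead, so real requirements will be higher.

```python
# Back-of-the-envelope arithmetic from the TL;DR figures.
# Function names are illustrative, not part of any DeepSeek tooling.

TOTAL_PARAMS = 671e9          # total parameters (from the summary)
ACTIVE_PARAMS = 37e9          # parameters activated per token (from the summary)
TRAIN_TOKENS = 14.8e12        # pre-training tokens (from the summary)
PRETRAIN_GPU_HOURS = 2.664e6  # H800 GPU hours for pre-training (from the summary)


def active_fraction(total_params: float, active_params: float) -> float:
    """Share of the model's parameters used for any single token."""
    return active_params / total_params


def gpu_hours_per_trillion_tokens(gpu_hours: float, tokens: float) -> float:
    """Pre-training cost normalized to one trillion tokens."""
    return gpu_hours / (tokens / 1e12)


def weight_memory_gb(params: float, bits_per_param: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return params * bits_per_param / 8 / 1e9


if __name__ == "__main__":
    frac = active_fraction(TOTAL_PARAMS, ACTIVE_PARAMS)
    print(f"Active per token: {frac:.1%} of all parameters")

    cost = gpu_hours_per_trillion_tokens(PRETRAIN_GPU_HOURS, TRAIN_TOKENS)
    print(f"Pre-training cost: {cost:,.0f} GPU hours per trillion tokens")

    # Assumed precisions for a rough local-deployment estimate.
    for bits in (16, 8, 4):
        gb = weight_memory_gb(TOTAL_PARAMS, bits)
        print(f"Weights at {bits}-bit: about {gb:,.0f} GB")
```

The takeaway is that sparsity reduces compute per token, not storage: roughly 5.5% of the parameters are active for each token, but all 671B still have to be held somewhere, which is why quantization matters so much for running the model locally.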