DeepSeek-V3: The 671B MoE Model You Can Run Locally in 2026

TL;DR Summary

DeepSeek-V3 is a 671B-parameter Mixture-of-Experts (MoE) model that activates only 37B parameters per token, rivaling GPT-4o and Claude 3.5 Sonnet on benchmarks

Trained on 14.8 trillion tokens with FP8 mixed-precision training: full pre-training took only 2.664M H800 GPU hours, with zero irrecoverable loss spikes

104k GitHub stars, MIT license, commercial use allowed; open weights are available on Hugging Face
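
To put the headline figures in perspective, here is a rough back-of-the-envelope sketch in Python. It uses only the numbers quoted above (671B total parameters, 37B active per token, 14.8T tokens, 2.664M H800 GPU hours). The function names are made up for illustration, and the memory estimate is an assumption: it counts weights only at a given bit width and ignores the KV cache, activations, and runtime overhead, so real requirements for local inference will be higher.

```python
# Back-of-the-envelope arithmetic from the figures quoted in the summary.
# Helper names are hypothetical; the memory estimate covers weights only.

TOTAL_PARAMS = 671e9         # total parameters
ACTIVE_PARAMS = 37e9         # parameters activated per token
TRAIN_TOKENS = 14.8e12       # pre-training tokens
PRETRAIN_GPU_HOURS = 2.664e6 # H800 GPU hours for full pre-training


def active_fraction(total: float, active: float) -> float:
    """Share of the model's parameters used for any single token."""
    return active / total


def tokens_per_gpu_hour(tokens: float, gpu_hours: float) -> float:
    """Average pre-training throughput per GPU hour."""
    return tokens / gpu_hours


def weight_memory_gb(params: float, bits_per_param: float) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes).

    Assumption: every parameter is stored at the same bit width.
    """
    return params * bits_per_param / 8 / 1e9


if __name__ == "__main__":
    frac = active_fraction(TOTAL_PARAMS, ACTIVE_PARAMS)
    print(f"Active per token: {frac:.1%} of all parameters")

    tph = tokens_per_gpu_hour(TRAIN_TOKENS, PRETRAIN_GPU_HOURS)
    print(f"Training throughput: {tph / 1e6:.2f}M tokens per GPU hour")

    for bits in (16, 8, 4):
        gb = weight_memory_gb(TOTAL_PARAMS, bits)
        print(f"Weights at {bits}-bit: about {gb:.1f} GB")
```

The sparse activation reduces compute per token, not storage: all 671B parameters still have to be held in memory or on fast storage, which is why the weights-only estimate is the first number to check before attempting a local run.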
