DeepSeek V3
Stronger, Faster, Cheaper. Exploring the architectural ideas behind DeepSeek-V3, a 671B-parameter MoE model, and how it achieves frontier performance at a fraction of the cost.
MoE Architecture
DeepSeek-V3 has 671B total parameters with 37B activated per token. It uses auxiliary-loss-free load balancing to keep experts specialized without the quality penalty of an auxiliary loss.
Multi-Token Prediction (MTP)
DeepSeek-V3 predicts not just the next token, but the one after it as well. This teaches the model to 'plan ahead', providing denser supervision signals during training and enabling faster speculative decoding at inference.
Extreme Cost Efficiency
DeepSeek-V3 uses FP8 mixed-precision training throughout. Trained in just 2.78M GPU hours (~$5.6M), it sets a new standard for cost-effectiveness in frontier model training.
DeepSeek-V3 Auxiliary-Loss-Free Load Balancing
The biggest pain point of MoE is 'expert collapse': the router sends most tokens to a few favored experts. Traditional methods penalize this congestion with an auxiliary loss, but that penalty hurts model performance. DeepSeek-V3 takes a fundamentally different approach: it dynamically adds a 'bias' score to each expert. If an expert is too busy, DeepSeek-V3 deducts points; if it is too idle, DeepSeek-V3 adds points.
Unbalanced: tokens flock to Expert A based on raw preference (congestion). A is overloaded while B and C sit idle.
The DeepSeek-V3 Load Balancing Formula
Score = Affinity + Bias
V3's key innovation:
1. Bias is used only for routing (which experts are selected)
2. Affinity alone is used for computing (the gating weights)
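The routing/computing split can be sketched in a toy simulation. This is not DeepSeek's implementation: the affinities are random numbers, and `gamma` is a hypothetical step size for the bias update, which simply nudges each expert's bias against its measured load.

```python
import numpy as np

def route_tokens(affinity, bias, top_k=2):
    """Pick experts by (affinity + bias), but weight their outputs
    by affinity alone -- the bias never touches the computation."""
    scores = affinity + bias
    top = np.argsort(-scores, axis=-1)[:, :top_k]          # routing
    gate = np.take_along_axis(affinity, top, axis=-1)      # computing
    gate = gate / gate.sum(axis=-1, keepdims=True)
    return top, gate

def update_bias(bias, top, n_experts, gamma=0.01):
    """Nudge each expert's bias against its load: busy experts
    lose points, idle experts gain points."""
    load = np.bincount(top.ravel(), minlength=n_experts)
    return bias - gamma * np.sign(load - load.mean())

rng = np.random.default_rng(0)
n_tokens, n_experts = 512, 8
bias = np.zeros(n_experts)
for _ in range(200):
    affinity = rng.random((n_tokens, n_experts))
    affinity[:, 0] += 0.5        # expert 0 starts out "too popular"
    top, gate = route_tokens(affinity, bias)
    bias = update_bias(bias, top, n_experts)

load = np.bincount(top.ravel(), minlength=n_experts)
print(load)
```

After a few hundred batches the bias on expert 0 has drifted down to offset its head start, and the per-expert loads even out, without any extra loss term touching the gradients.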
Why does this matter?
Previously, balancing required adding an auxiliary-loss penalty to the total loss. While this balanced the experts, it interfered with what the model was learning. DeepSeek-V3 achieves 'balance + high performance' simultaneously, a key reason it outperforms other MoE models.
DeepSeek-V3 Multi-Token Prediction (MTP)
Normal models are like walking through a maze, seeing only one step ahead. DeepSeek-V3 is like playing chess, predicting the 'next step' and the 'step after that' simultaneously. This multi-token prediction is a key innovation of the architecture and central to its efficiency.
Training Benefit: DeepSeek-V3's MTP gives the model denser supervision signals at each step. It learns not just 'what's next' but 'how to plan ahead', producing richer gradient information from each training example.
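The "denser supervision" idea can be written down as a loss: each position is graded both on the next token (main head) and on the token after that (MTP head). The sketch below is hypothetical, with random logits standing in for model outputs; `lam` is a stand-in for the MTP loss weight (DeepSeek-V3 weights the MTP loss by a factor λ), not the value the paper uses.

```python
import numpy as np

def cross_entropy(logits, target):
    """Negative log-likelihood of the target token under softmax(logits)."""
    z = logits - logits.max()
    return -(z[target] - np.log(np.exp(z).sum()))

rng = np.random.default_rng(0)
vocab, seq_len = 50, 6
tokens = rng.integers(0, vocab, size=seq_len)
# Hypothetical logits: at position t the main head predicts token t+1,
# and the MTP head predicts token t+2 -- two signals per position.
main_logits = rng.normal(size=(seq_len, vocab))
mtp_logits = rng.normal(size=(seq_len, vocab))

lam = 0.3   # hypothetical MTP loss weight
main_loss = np.mean([cross_entropy(main_logits[t], tokens[t + 1])
                     for t in range(seq_len - 1)])
mtp_loss = np.mean([cross_entropy(mtp_logits[t], tokens[t + 2])
                    for t in range(seq_len - 2)])
total_loss = main_loss + lam * mtp_loss
print(total_loss)
```

The point of the extra term is that every position contributes two gradients instead of one, without adding any tokens to the batch.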
Inference Speedup: at inference time, the MTP module can be kept for speculative decoding. DeepSeek-V3 drafts two tokens at once; when the speculation is accepted, generation runs nearly twice as fast.
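A toy version of that accept/reject loop is sketched below. Both "models" are deterministic stand-in functions invented for the example (no real network involved): the MTP head drafts two tokens, one verification pass of the main model checks them, and the longest correct prefix is kept along with one corrected token.

```python
def main_model(prefix):
    """Stand-in for the full model: a deterministic next-token rule."""
    return (prefix[-1] * 31 + 7) % 100

def mtp_draft(prefix):
    """Stand-in MTP head: drafts the next two tokens, imperfectly."""
    correct = main_model(prefix)
    t1 = (correct + 1) % 100 if prefix[-1] % 7 == 0 else correct
    t2 = (t1 * 31 + 7) % 100 if t1 % 3 else (t1 + 1) % 100
    return [t1, t2]

def speculative_step(prefix):
    """One verification pass checks the whole draft; accept the longest
    correct prefix, plus one corrected token on the first mismatch."""
    out = []
    for tok in mtp_draft(prefix):
        correct = main_model(prefix + out)
        if tok != correct:
            out.append(correct)   # verification already produced the fix
            break
        out.append(tok)
    return out

seq, passes = [42], 0
while len(seq) < 21:
    seq += speculative_step(seq)
    passes += 1
print(len(seq) - 1, passes)   # tokens generated vs. verification passes
```

Because every accepted token is checked against the main model, the output is identical to plain one-token-at-a-time decoding; the draft only changes how many tokens each verification pass yields.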
DeepSeek-V3 FP8 Mixed Precision Training
How does DeepSeek-V3 make a 671B-parameter model run fast? Answer: don't use such 'precise' numbers. DeepSeek-V3 pioneered full FP8 mixed-precision training at this scale, showing that carefully managed low-precision arithmetic need not sacrifice model quality.
BF16 (16-bit)
Traditional BF16 precision: higher VRAM usage and slower computation at DeepSeek-V3's scale.
FP8 (8-bit)
Half the bits, roughly double the throughput. DeepSeek-V3 solved the instability issues that previously prevented full FP8 training at scale through fine-grained quantization and higher-precision accumulation strategies.
DeepSeek-V3 Cost Miracle
DeepSeek-V3 was trained on just 2,048 H800 GPUs in under two months, at a cost of roughly $5.6M.
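The headline numbers are easy to sanity-check against each other, taking the reported ~2.788M H800 GPU hours and the roughly $2-per-GPU-hour rental rate the figures imply:

```python
gpus = 2048
gpu_hours = 2.788e6                 # reported total H800 GPU hours
rate = 2.0                          # implied rental cost, $ per GPU-hour

days = gpu_hours / gpus / 24        # wall-clock time on 2,048 GPUs
cost = gpu_hours * rate             # total training cost in dollars
print(round(days), cost)            # ≈ 57 days, ≈ $5.6M
```

Both figures line up: ~57 days is "less than 2 months", and 2.788M hours at ~$2/hour gives the ~$5.6M total.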
FAQ
Knowledge Quiz
How much have you learned?
What strategy does DeepSeek-V3 use for MoE load balancing?
Want to try DeepSeek V3 yourself?
Explore DeepSeek's capabilities in our interactive chat interface.
Try DeepSeek Chat