DeepSeek-V3: State-of-the-Art Open Source Model

Stronger, Faster, Cheaper. Exploring the architectural magic behind the DeepSeek-V3 671B parameter MoE model and how DeepSeek-V3 achieves frontier performance at a fraction of the cost.

MoE Architecture

DeepSeek-V3 has 671B total parameters, with 37B active per token. It uses Auxiliary-Loss-Free Load Balancing to achieve expert specialization without hurting model quality.

Multi-Token Prediction (MTP)

DeepSeek-V3 predicts not just the next token, but also the one after it. This teaches the model to 'plan ahead', providing denser supervision signals and improving training efficiency.

Extreme Cost Efficiency

DeepSeek-V3 uses FP8 mixed-precision training throughout. Trained in just 2.78M GPU hours (about $5.6M), it sets a new standard for cost-effectiveness in frontier model training.

DeepSeek-V3 Auxiliary-Loss-Free Load Balancing

The biggest pain point of MoE is 'expert collapse': every token wants to be routed to the strongest experts. Traditional methods penalize this congestion with an auxiliary loss, but that hurts model performance. DeepSeek-V3 takes a fundamentally different approach: it dynamically adds a 'bias' score to each expert. If an expert is too busy, its bias is lowered; if it is too idle, its bias is raised.

DeepSeek-V3 Token Routing Simulator

[Interactive demo: with all biases at zero, tokens flock to Expert A (90% raw affinity) while Expert B (16%) and Expert C (26%) sit idle, so A is overloaded and congested. Adjusting each expert's bias evens out the load.]
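The 'deduct points / add points' rule described above can be sketched as a toy simulation. Everything here is illustrative: the three affinities mirror the simulator's percentages, while the update rate `gamma`, the noise level, and the step counts are made-up values, not DeepSeek-V3's.

```python
import random

random.seed(0)  # reproducible toy run

NUM_EXPERTS = 3
gamma = 0.01                        # bias update speed (made-up value)
bias = [0.0] * NUM_EXPERTS          # per-expert routing bias, starts at zero
base_affinity = [0.90, 0.16, 0.26]  # raw preferences: Expert A dominates

def route(num_tokens):
    """Send each token to the expert with the highest affinity + bias."""
    load = [0] * NUM_EXPERTS
    for _ in range(num_tokens):
        scores = [base_affinity[e] + bias[e] + random.gauss(0, 0.05)
                  for e in range(NUM_EXPERTS)]
        load[scores.index(max(scores))] += 1
    return load

for _ in range(200):  # simulated training steps
    load = route(300)
    avg = sum(load) / NUM_EXPERTS
    for e in range(NUM_EXPERTS):
        # Overloaded experts lose bias ('deduct points'),
        # underloaded experts gain bias ('add points').
        bias[e] += -gamma if load[e] > avg else gamma

final = route(300)
print("final per-expert load:", final)  # roughly balanced across A, B, C
```

Note that the bias only shifts which expert wins the routing; it never multiplies the expert's output, which is what keeps balancing from interfering with learning.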

The DeepSeek-V3 Load Balancing Formula

Score = Affinity + Bias

V3's key innovation:
1. The bias is used only for routing (selecting experts)
2. The raw affinity is used for computing (weighting expert outputs)
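A minimal sketch of this Score = Affinity + Bias split (the affinities, biases, and top-2 routing here are illustrative numbers; the real model routes each token to a top-k subset of far more experts):

```python
# Illustrative affinities and biases for four toy experts (made-up values).
affinity = [0.90, 0.16, 0.26, 0.55]   # how much the token 'likes' each expert
bias     = [-0.60, 0.30, 0.25, 0.00]  # learned balancing bias per expert
TOP_K = 2

# 1. Bias is used ONLY for routing: pick the top-k experts by affinity + bias.
score = [a + b for a, b in zip(affinity, bias)]
chosen = sorted(range(len(score)), key=score.__getitem__, reverse=True)[:TOP_K]

# 2. Affinity alone is used for computing: the gating weights that mix the
#    chosen experts' outputs come from the raw affinities, not the scores.
total = sum(affinity[e] for e in chosen)
gate = {e: affinity[e] / total for e in chosen}

print("routed to experts:", sorted(chosen))  # expert 0 loses its slot to the bias
print("gating weights:", gate)
```

Because the bias never enters the gating weight, a heavily penalized expert still contributes at full strength whenever it is selected.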

Why does this matter?

Previously, balancing required adding an auxiliary loss penalty to the total training loss. While this balanced the experts, it interfered with what the model was learning. DeepSeek-V3 achieves balance and high performance simultaneously, which is a key reason it outperforms other MoE models.

DeepSeek-V3 Multi-Token Prediction (MTP)

Normal models are like walking through a maze, seeing only one step ahead. DeepSeek-V3 is like playing chess: it predicts the next step and the step after that simultaneously. This multi-token prediction capability is a key innovation of the architecture and central to its efficiency.

[Diagram: the input token 'The' flows through the DeepSeek-V3 Transformer. The main head predicts the next token ('capital'), while the MTP module predicts the token after that ('of'). The MTP module is used during training only, or for speculative decoding at inference.]

Training Benefit: DeepSeek-V3's MTP gives the model denser supervision signals at each step. It learns not just 'what's next' but 'how to plan ahead', producing richer gradient information per training example.

Inference Speedup: During inference, the MTP module can be kept for 'Speculative Decoding'. DeepSeek-V3 guesses two tokens at once; when the speculation is verified correct, generation speed can nearly double.
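The 'denser supervision' idea can be shown with a toy loss computation. The vocabulary, logits, and the 0.3 weight on the MTP term are all made-up illustrative values (the real model uses its own MTP module design and loss weighting):

```python
import math

VOCAB = ["the", "capital", "of", "france"]

def cross_entropy(logits, target_idx):
    """Standard softmax cross-entropy for a single prediction."""
    m = max(logits)  # subtract max for numerical stability
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target_idx]

# Pretend model outputs at one position, given the input token "the":
main_head_logits = [0.1, 2.0, 0.3, 0.2]   # should predict "capital" (idx 1)
mtp_head_logits  = [0.2, 0.4, 1.5, 0.1]   # should predict "of"      (idx 2)

loss_next  = cross_entropy(main_head_logits, VOCAB.index("capital"))
loss_next2 = cross_entropy(mtp_head_logits, VOCAB.index("of"))

# MTP adds an extra, weighted supervision signal at every position.
MTP_WEIGHT = 0.3  # made-up weight for illustration
total_loss = loss_next + MTP_WEIGHT * loss_next2
print(f"next-token loss={loss_next:.3f}  mtp loss={loss_next2:.3f}  total={total_loss:.3f}")
```

Each training position thus yields two gradient signals instead of one, which is what "denser supervision" means in practice.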

DeepSeek-V3 FP8 Mixed Precision Training

How does DeepSeek-V3 make a 671B parameter model run fast? Answer: Don't use such 'precise' numbers. DeepSeek-V3 pioneered full FP8 mixed-precision training at this unprecedented scale, proving that reduced numerical precision does not sacrifice model quality when managed carefully.

BF16 (16-bit)

Traditional BF16 precision: higher VRAM usage and slower computation at the scale of a model like DeepSeek-V3.

Storage: 16 bits per param

FP8 (8-bit): the DeepSeek-V3 standard

Half the precision, roughly double the throughput. DeepSeek-V3 solved the instability issues that previously prevented full FP8 training at scale through fine-grained quantization and mixed-precision accumulation strategies.

Storage: 8 bits per param (50% saving)
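The scaled-quantization idea behind this can be sketched as follows. This only simulates FP8 e4m3's limited range with integer rounding; real e4m3 values are non-uniformly spaced floats, and the tile values here are made-up examples:

```python
E4M3_MAX = 448.0   # largest finite value representable in FP8 e4m3

def quantize_tile(tile):
    """Scale a tile so its largest magnitude maps onto the FP8 range."""
    scale = max(abs(x) for x in tile) / E4M3_MAX
    # Stand-in for FP8 rounding: integers in [-448, 448]. (A simplification;
    # real e4m3 has coarse, non-uniform precision, not integer steps.)
    return [round(x / scale) for x in tile], scale

def dequantize_tile(q, scale):
    return [v * scale for v in q]

tile = [0.0012, -0.0034, 0.0005, 0.0021]   # tiny weights, typical magnitudes
q, scale = quantize_tile(tile)
restored = dequantize_tile(q, scale)
max_err = max(abs(a - b) for a, b in zip(tile, restored))
print("quantized:", q, "| max abs error:", max_err)
```

Because each tile gets its own scale, one outlier weight only degrades precision within its own tile instead of across the whole tensor, one of the tricks that keeps low-precision training stable.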

DeepSeek-V3 Cost Miracle

DeepSeek-V3 was trained with only 2048 H800 GPUs in less than 2 months, costing just $5.6M.

$5.6M
Total Cost
2.78M
GPU Hours
~10x
Cheaper than Peers
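The headline figures check out with simple arithmetic, assuming roughly 57 days for 'less than 2 months' and the roughly $2-per-GPU-hour H800 rental price reported by DeepSeek (both are approximations):

```python
gpus = 2048   # H800 GPUs used for training
days = 57     # assumption: "less than 2 months"
rate = 2.0    # assumption: ~$2 rental price per GPU hour

gpu_hours = gpus * days * 24
cost = gpu_hours * rate

print(f"{gpu_hours / 1e6:.2f}M GPU hours, ~${cost / 1e6:.1f}M")  # 2.80M GPU hours, ~$5.6M
```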

FAQ

Q: Can I run DeepSeek-V3 locally?
A: It's hard to run the full model directly. Even with 4-bit quantization, 671B parameters require massive VRAM. However, since DeepSeek-V3 uses an MoE architecture, only 37B parameters are activated per token, so inference speed is actually fast. For individuals, the distilled smaller versions are recommended.

Q: How did full FP8 training stay stable?
A: This was a huge technical challenge. DeepSeek-V3 solved it with fine-grained quantization strategies and careful mixed-precision management. Experiments show that performance with full FP8 training is almost identical to traditional BF16 training, while VRAM usage is roughly halved and compute throughput roughly doubled.

Q: Is MTP only useful during training?
A: While MTP is mainly for training (providing denser supervision signals), preserving the MTP module during inference enables speculative decoding: the model guesses two tokens at once, and if the guess is verified correct, generation speeds up substantially.

Q: How does DeepSeek-V3 compare with other open-source models?
A: DeepSeek-V3 significantly outperforms other open-source models of similar scale. On benchmarks like MMLU, HumanEval, and MATH, it matches or exceeds models that cost roughly 10x more to train. The combination of Auxiliary-Loss-Free Load Balancing, MTP, and FP8 training makes it uniquely efficient among frontier models.

Q: Why is training so cheap?
A: Three synergistic innovations: FP8 mixed-precision training halves memory and roughly doubles compute speed; Auxiliary-Loss-Free Load Balancing ensures all experts contribute meaningfully without wasting gradient signal; and the DualPipe infrastructure overlaps computation with communication. Together, these allow DeepSeek-V3 to train in just 2.78M GPU hours, a fraction of what comparable models require.


Want to try DeepSeek V3 yourself?

Explore DeepSeek's capabilities in our interactive chat interface.
