DeepSeek R1
Don't teach it how to think. Just reward it, and it learns to think.
The "Aha!" Moment
In DeepSeek-R1-Zero training, researchers didn't teach the model how to do Chain-of-Thought. But to get higher rewards, DeepSeek-R1-Zero spontaneously learned to extend thinking time, even learning to use 'Wait...' to interrupt and correct itself — a genuine emergent behavior.
From Zero to One: The Magic of Cold Start
DeepSeek-R1-Zero was brilliant but messy (mixed languages, chaotic formatting). DeepSeek-R1 introduced a small amount of 'Cold Start Data' — just a few thousand high-quality examples — teaching the genius how to communicate elegantly while preserving its reasoning power. This Cold Start stage is what transforms DeepSeek-R1-Zero into the polished DeepSeek-R1.
DeepSeek-R1-Zero
Pure RL: no human guidance.
DeepSeek-R1
Cold Start: RL + Cold Start SFT.
Insight: Just a few thousand high-quality CoT examples are enough to guide RL towards human-readable reasoning without losing performance.
GRPO: Ditching the Critic
Traditional RLHF (PPO) needs a separate Critic model to evaluate every step. DeepSeek-R1 uses GRPO: generate a group of answers for one question, then score them relative to each other. This eliminates the need for a separate Critic (value) model entirely.
Algorithm Demo
Core Formula
GRPO doesn't need an extra Critic network. It simply asks: 'In this group, who performed better than average?' Better ones are encouraged, worse ones suppressed.
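The group-relative comparison above can be sketched in a few lines. This is a minimal illustration of one common GRPO formulation (reward minus group mean, scaled by group standard deviation), not DeepSeek's exact implementation:

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages: each answer's reward minus the group mean,
    divided by the group standard deviation (illustrative GRPO-style scoring)."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

# One question, four sampled answers scored by a rule-based reward (1 = correct):
rewards = [1.0, 0.0, 1.0, 0.0]
print(grpo_advantages(rewards))  # [1.0, -1.0, 1.0, -1.0]
```

Answers with positive advantage are reinforced; those below the group average are suppressed, all without a learned value network.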
How DeepSeek-R1 was forged
DeepSeek-R1 is not a one-step magic trick but a precise four-stage training pipeline that progressively builds reasoning capability. Each stage adds a specific layer of ability, from basic formatting to complex multi-step reasoning.
1. Cold Start
Fine-tune the base model with a small amount of high-quality long Chain-of-Thought (CoT) data. This gives the model a basic notion of reasoning and teaches it to understand <think> tags.
Technical Detail
Data: Small amount of high-quality CoT data.
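The cold-start data wraps each curated reasoning trace in <think> tags before fine-tuning. A minimal sketch of what one such training example might look like (the exact field names and template are assumptions, not DeepSeek's published data schema):

```python
def format_cold_start_example(question, chain_of_thought, answer):
    """Wrap a curated CoT trace in <think> tags for SFT.
    Illustrative template only; not the exact DeepSeek format."""
    return (
        f"User: {question}\n"
        f"Assistant: <think>{chain_of_thought}</think>\n{answer}"
    )

sample = format_cold_start_example(
    "What is 17 * 23?",
    "17 * 23 = 17 * 20 + 17 * 3 = 340 + 51 = 391.",
    "391",
)
print(sample)
```

A few thousand examples in this shape are enough to teach the model the reasoning format before RL takes over.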
Rivaling Proprietary Giants
DeepSeek-R1 matches or exceeds OpenAI o1-1217 on core reasoning benchmarks, demonstrating world-class reasoning ability.
Note: Data from DeepSeek-R1 paper.
Knowledge Distillation
DeepSeek-R1's reasoning power can be 'taught' to smaller models through knowledge distillation. Fine-tuned on 800k samples generated by DeepSeek-R1, the distilled models gain remarkable reasoning skills that approach the teacher's. Ranging from 1.5B to 70B parameters, they make these breakthroughs accessible on consumer hardware.
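The distillation recipe here is plain SFT on teacher outputs: collect completions from the large model and fine-tune the small one on them. A toy sketch of the data-collection step, where `teacher_generate` is a stand-in for a call to the teacher model (the real pipeline used ~800k curated samples, not raw generations):

```python
def build_distillation_set(prompts, teacher_generate):
    """Collect teacher completions as prompt/completion pairs for SFT.
    `teacher_generate` is a hypothetical stand-in for a teacher-model call."""
    return [{"prompt": p, "completion": teacher_generate(p)} for p in prompts]

# Toy stand-in teacher, for illustration only:
toy_teacher = lambda p: f"<think>reasoning about {p}</think> final answer"
dataset = build_distillation_set(["Q1", "Q2"], toy_teacher)
print(len(dataset))  # 2
```

Because the teacher's <think> traces are kept in the completions, the student learns the reasoning style as well as the final answers.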
DeepSeek-R1
671B Parameters (MoE)
Distill-Qwen-32B
AIME 2024: outperforms OpenAI o1-mini.
Distill-Llama-70B
Reasoning transfer across architectures.
