Incentivizing Reasoning via Reinforcement Learning

DeepSeek R1

Don't teach it how to think. Just reward it, and it learns to think.

Pure RL Training

DeepSeek-R1-Zero showed that reasoning capabilities like self-reflection and long Chain-of-Thought emerge naturally from pure reinforcement learning, without any supervised fine-tuning data. This emergence was a groundbreaking discovery.

GRPO Algorithm

DeepSeek-R1 abandons the traditional Critic model for Group Relative Policy Optimization (GRPO), drastically reducing training costs while maintaining strong reasoning performance. DeepSeek-R1's GRPO compares outputs within a group rather than using an absolute value function.

Performance Monster

DeepSeek-R1 rivals OpenAI o1 on benchmarks like Math-500 and AIME 2024, and is fully open source — a milestone for the open AI community.

The "Aha!" Moment

In DeepSeek-R1-Zero training, researchers didn't teach the model how to do Chain-of-Thought. But to get higher rewards, DeepSeek-R1-Zero spontaneously learned to extend thinking time, even learning to use 'Wait...' to interrupt and correct itself — a genuine emergent behavior.

1. Early Exploration (Random Guessing) · Low/Mid Reward
User: Solve x + 2 = 4
R1: The equation is x + 2 = 4. I guess x is 1? No, maybe 3. The answer is 5.
Result: Incorrect ❌
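The training signal behind this loop is a simple rule-based reward rather than a learned reward model. As a minimal sketch (the real reward rules are more elaborate, and the specific point values here are illustrative assumptions), a reward function of this kind might combine a format check and an accuracy check:

```python
import re

def compute_reward(response: str, ground_truth: str) -> float:
    """Rule-based reward sketch: format reward + accuracy reward.

    Illustrative values only: 0.5 for wrapping reasoning in <think> tags,
    1.0 for a final answer matching the ground truth.
    """
    reward = 0.0
    # Format reward: reasoning should appear inside <think>...</think>.
    if re.search(r"<think>.*?</think>", response, re.DOTALL):
        reward += 0.5
    # Accuracy reward: extract the final answer and compare to ground truth.
    match = re.search(r"The answer is\s*(\S+)", response)
    if match and match.group(1).rstrip(".") == ground_truth:
        reward += 1.0
    return reward

print(compute_reward("<think>x = 4 - 2 = 2</think> The answer is 2.", "2"))  # 1.5
```

Because the reward only checks outcomes, the model is free to discover any reasoning strategy that raises it, which is exactly how the "Aha!" behavior above can emerge.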

From Zero to One: The Magic of Cold Start

DeepSeek-R1-Zero was brilliant but messy (mixed languages, chaotic formatting). DeepSeek-R1 introduced a small amount of 'Cold Start Data' — just a few thousand high-quality examples — teaching the genius how to communicate elegantly while preserving its reasoning power. This Cold Start stage is what transforms DeepSeek-R1-Zero into the polished DeepSeek-R1.

DeepSeek-R1-Zero

Pure RL

Pure RL, no human guidance.

Wait... the integral of x^2 is x^3/3... uh no, integrating limits... maybe Newton-Leibniz formula. Let me check... result is 1/3. (Mixed languages, jumping thoughts, hard to read)

DeepSeek-R1

Cold Start

RL + Cold Start SFT.

<think> 1. Identify the function: f(x) = x^2. 2. Apply power rule for integration: ∫x^n dx = x^(n+1)/(n+1). 3. Calculate definite integral from 0 to 1: [x^3/3] from 0 to 1 = 1/3 - 0 = 1/3. </think> The answer is 1/3.

Insight: Just a few thousand high-quality CoT examples are enough to guide RL towards human-readable reasoning without losing performance.

GRPO: Ditching the Critic

Traditional RLHF (PPO) needs a huge Critic model to evaluate every step. DeepSeek-R1 proposes GRPO: generate a group of answers for one question, then compare them relative to each other. This eliminates the need for a separate critic (value) model entirely.

Algorithm Demo

Input Prompt
Solve: 2x + 3 = 7
Output 1: x = 1 (Wrong)
Reward: 0
Output 2: x = 2 (Correct)
Reward: 1
Output 3: x = 2 (Correct)
Reward: 1
Output 4: x = 5 (Wrong)
Reward: 0

Core Formula

A_i = (Reward_i - Mean(Group)) / Std(Group)

GRPO doesn't need an extra Critic network. It simply asks: 'In this group, who performed better than average?' Better ones are encouraged, worse ones suppressed.
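The advantage computation above can be sketched in a few lines, using only the group's own rewards as the baseline (a minimal illustration, not the full GRPO objective with its policy-ratio and KL terms):

```python
def group_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantages: A_i = (r_i - mean) / std.

    No critic network is involved; each output is scored only
    against the other outputs sampled for the same prompt.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5 or 1.0  # avoid division by zero when all rewards are equal
    return [(r - mean) / std for r in rewards]

# The demo above: rewards [0, 1, 1, 0] -> mean 0.5, std 0.5
print(group_advantages([0, 1, 1, 0]))  # [-1.0, 1.0, 1.0, -1.0]
```

The two correct answers get positive advantage and are reinforced; the two wrong ones get negative advantage and are suppressed, all without any learned value function.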

PPO: Critic model needed. High cost 💰
GRPO (R1's choice): No critic model. Efficient 🚀

How DeepSeek-R1 was forged

DeepSeek-R1 is not a one-step magic trick, but a precise 4-stage training pipeline that progressively builds reasoning capability. Each stage of the DeepSeek-R1 pipeline adds a specific layer of ability, from basic formatting to complex multi-step reasoning.
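The four stages, as described in the DeepSeek-R1 paper, can be sketched as a simple pipeline (descriptions paraphrased; the stage functions are placeholders, not real training code):

```python
# The four training stages of DeepSeek-R1, per the paper.
PIPELINE = [
    ("1. Cold Start", "SFT on a few thousand high-quality long-CoT examples"),
    ("2. Reasoning RL", "large-scale GRPO on math and code with rule-based rewards"),
    ("3. Rejection Sampling + SFT", "keep the best RL outputs, retrain with general data"),
    ("4. All-Scenario RL", "a final RL round for helpfulness and harmlessness"),
]

for name, desc in PIPELINE:
    print(f"{name}: {desc}")
```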

1. Cold Start

Fine-tune the Base model with a small amount of high-quality long Chain-of-Thought (CoT) data. This gives the model a basic notion of reasoning and teaches it to understand the <think> tag.

Technical Detail

Data: Small amount of high-quality CoT data.
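A single cold-start training example might look like the following (a hypothetical, illustrative record, not taken from the actual dataset): readable step-by-step reasoning inside <think> tags, followed by a clean final answer.

```python
import json

# Hypothetical cold-start SFT example: long, human-readable CoT
# inside <think> tags, then a concise answer outside them.
example = {
    "prompt": "Compute the definite integral of x^2 from 0 to 1.",
    "response": (
        "<think>1. Identify the function: f(x) = x^2. "
        "2. Apply the power rule: integral of x^n dx = x^(n+1)/(n+1). "
        "3. Evaluate [x^3/3] from 0 to 1 = 1/3 - 0 = 1/3.</think> "
        "The answer is 1/3."
    ),
}

print(json.dumps(example, indent=2))
```

Fine-tuning on a few thousand records in this shape is what fixes the output format before RL takes over.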


Rivaling Proprietary Giants

DeepSeek-R1 matches or exceeds OpenAI o1-1217 in core reasoning benchmarks, demonstrating DeepSeek-R1's world-class reasoning ability.

Chart: DeepSeek-R1 vs OpenAI o1-1217 on AIME 2024, MATH-500, Codeforces, and MMLU.

Note: Data from DeepSeek-R1 paper.

Knowledge Distillation

DeepSeek-R1's reasoning power can be 'taught' to smaller models through knowledge distillation. Fine-tuning on 800k samples generated by DeepSeek-R1 gives smaller models remarkable reasoning skills that approach the teacher's. The distilled models range from 1.5B to 70B parameters, making these breakthroughs accessible on consumer hardware.
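The distillation recipe is plain supervised fine-tuning on teacher outputs, with no RL on the student. A hypothetical sketch (here `teacher_generate` is a stand-in for sampling from DeepSeek-R1, not a real API):

```python
# Reasoning distillation sketch: sampled (prompt, response) pairs from
# the teacher become ordinary SFT data for a smaller student model.

def teacher_generate(prompt: str) -> str:
    # Placeholder for sampling from the 671B teacher; real responses
    # would contain full long-CoT reasoning inside the <think> tags.
    return f"<think>reasoning about: {prompt}</think> x = 2"

def build_distill_dataset(prompts: list[str]) -> list[dict]:
    # The paper curated ~800k such samples; the student (e.g. a Qwen or
    # Llama base model) is then fine-tuned on them with standard SFT.
    return [{"prompt": p, "response": teacher_generate(p)} for p in prompts]

data = build_distill_dataset(["Solve 2x + 3 = 7"])
print(data[0]["response"])
```

The design choice worth noting: the student never sees a reward signal, only the teacher's finished reasoning traces, which is why the transfer works across architectures.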

Teacher Model

DeepSeek-R1

671B Parameters (MoE)

Self-Reflection · Long CoT

Distill-Qwen-32B

AIME 2024: outperforms OpenAI o1-mini.

Distill-Llama-70B

Reasoning transfer across architectures.

FAQ

Why is DeepSeek-R1-Zero's output so chaotic?
Because DeepSeek-R1-Zero is trained with pure RL, without human SFT. It only cares about getting the right answer, not the format. This chaos is actually evidence that its reasoning capabilities emerged spontaneously rather than being taught by humans.
Why does DeepSeek-R1 use GRPO instead of PPO?
Cost efficiency. Traditional PPO requires a critic model as large as the policy model, consuming huge amounts of VRAM. GRPO's 'group competition' mechanism (comparing a group of outputs against their own average) eliminates the critic, making large-scale RL training feasible.
Did humans teach the model to use the <think> tag?
Not entirely. In DeepSeek-R1, a small amount of Cold Start Data guides the model to use the tag. But in DeepSeek-R1-Zero (pure RL), the model spontaneously learned to extend its output for thinking. The essence of the chain of thought is the same.
How capable are the distilled models?
DeepSeek-R1's distilled smaller models achieve impressive reasoning performance, though a gap remains compared to the full model. The 32B and 70B variants retain most of the reasoning capability while being practical to run on consumer hardware, making these innovations accessible to a much wider audience.

Knowledge Quiz

How much have you learned?


What interesting phenomenon occurred during DeepSeek-R1-Zero training?