DeepSeek V4
1 Trillion Parameters. 1 Million Context. The architecture that unifies every DeepSeek innovation into one model.
~1T Params / 32B Active
DeepSeek-V4 has approximately 1 trillion total parameters with only ~32B active per token -- fewer than DeepSeek-V3's 37B, yet far more capable.
1M Token Context
DeepSeek-V4's context window is 8x DeepSeek-V3's (128K → 1M tokens), powered by Native Sparse Attention (NSA), enabling book-length inputs.
1.8x Inference Speed
DeepSeek-V4's Sparse FP8 decoding and Tiered KV Cache deliver 1.8x faster inference with 40% less memory than DeepSeek-V3.
DeepSeek-V3 vs DeepSeek-V4: Architectural Evolution
DeepSeek-V4 integrates component innovations, each validated in a separate DeepSeek research paper, into a unified architecture that improves on DeepSeek-V3 in every dimension.
| Feature | DeepSeek-V3 | DeepSeek-V4 |
|---|---|---|
| Total Parameters | 671B | ~1T |
| Active Parameters | 37B | ~32B |
| Context Window | 128K tokens | 1M tokens |
| Attention | MLA (Multi-Head Latent) | MLA + NSA (Sparse) |
| External Memory | None | Engram (O(1) Lookup) |
| Expert Routing | Top-8 of 256 experts | Top-16 of 256 experts |
| Training Precision | FP8 | Sparse FP8 |
| KV Cache | Standard MLA | Tiered (Hot/Warm/Cold) |
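The "Top-16 of 256 experts" routing in the table can be sketched as softmax gating over the k highest-scoring experts. This is a minimal illustration, not DeepSeek's actual router; the logit shapes and renormalization scheme are assumptions.

```python
import numpy as np

def route_topk(gate_logits: np.ndarray, k: int = 16):
    """Pick the top-k experts per token and renormalize their gate weights."""
    topk_idx = np.argsort(gate_logits, axis=-1)[..., -k:]           # indices of the k largest logits
    topk_logits = np.take_along_axis(gate_logits, topk_idx, axis=-1)
    weights = np.exp(topk_logits - topk_logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)                  # softmax over the selected experts only
    return topk_idx, weights

# One token routed over 256 experts: only 16 receive nonzero weight.
logits = np.random.randn(1, 256)
idx, w = route_topk(logits, k=16)
```

Because the softmax is taken over only the selected experts, the 240 unselected experts contribute zero compute for that token -- the mechanism behind "1T total, ~32B active."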
Component Integration
Each major V4 subsystem was validated independently before integration.
NSA Sparse Attention
arXiv 2502.11089
Engram Memory
arXiv 2601.07372
mHC Hyper-Connections
arXiv 2512.24880
MoE (Top-16 Routing)
Sparse FP8 Decoding
Tiered KV Cache
Native Sparse Attention (NSA)
Standard attention looks at every token for every query -- O(n^2) cost. DeepSeek-V4's NSA uses a learned 'lightning indexer' to select only the most relevant tokens, achieving O(n log n) complexity while preserving quality. This is what enables DeepSeek-V4's million-token context window.
Full Attention
Every token attends to every other token. Accurate but scales quadratically -- prohibitive for 1M context.
How NSA Works
Compress: Pool tokens into block summaries
Select: Lightning indexer scores blocks, picks top-k
Attend: Full attention only on selected tokens + sliding window for local context
At 1M tokens, NSA reduces attention FLOPs by ~90% compared to full attention, making million-token context feasible on standard hardware.
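The compress → select → attend steps above can be sketched as follows. This is a toy illustration of the selection logic only: the block size, top-k, window width, and dot-product block scorer are stand-ins (the real lightning indexer is learned), and the actual NSA kernels are fused GPU code.

```python
import numpy as np

def nsa_select(q: np.ndarray, keys: np.ndarray, block: int = 4, topk: int = 2, window: int = 4):
    """Return the key indices one query attends to: top-k blocks plus a local sliding window."""
    n = len(keys)
    # 1. Compress: mean-pool keys into per-block summaries.
    summaries = np.array([keys[i:i + block].mean(axis=0) for i in range(0, n, block)])
    # 2. Select: score blocks against the query and keep the top-k (stand-in for the indexer).
    scores = summaries @ q
    top_blocks = np.argsort(scores)[-topk:]
    selected = {j for b in top_blocks for j in range(b * block, min((b + 1) * block, n))}
    # 3. Attend: always include a sliding window of recent tokens for local context.
    selected |= set(range(max(0, n - window), n))
    return sorted(selected)

keys = np.random.randn(32, 8)
q = np.random.randn(8)
idx = nsa_select(q, keys)  # far fewer than all 32 positions
```

With these toy settings each query attends to at most topk·block + window = 12 of 32 positions; at 1M tokens the same selection principle is what cuts attention FLOPs by roughly 90%.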
Memory Systems
DeepSeek-V4 introduces two complementary memory innovations: Engram for factual lookup and Tiered KV Cache for efficient context storage. Together, these systems make DeepSeek-V4 both knowledgeable and memory-efficient.
Engram: O(1) Factual Memory
Not every token needs expensive MoE computation. DeepSeek-V4's Engram intercepts factual queries ('The capital of France is...') and answers them via hash lookup, bypassing the transformer entirely.
Reasoning Path: Full MoE computation for logic and inference tasks.
Memory Path: O(1) hash lookup for factual knowledge. Near-zero cost.
The model learns when to 'think' vs when to 'remember'. For common facts, the Engram path is ~1000x cheaper than MoE computation.
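The reason-vs-remember split can be sketched with a hash-keyed fact store. Everything here is illustrative: the store contents, the string-normalization heuristic, and the `run_moe` placeholder are assumptions, not Engram's actual interface.

```python
# Toy fact store standing in for Engram's hash-keyed memory.
FACTS = {
    "capital of france": "Paris",
    "boiling point of water (c)": "100",
}

def run_moe(query: str) -> str:
    """Placeholder for the expensive full-transformer MoE forward pass."""
    return f"<reasoned answer for: {query}>"

def answer(query: str) -> str:
    key = query.lower().rstrip("?").strip()
    if key in FACTS:            # memory path: O(1) dict lookup, near-zero FLOPs
        return FACTS[key]
    return run_moe(query)       # reasoning path: full MoE computation

print(answer("Capital of France?"))   # served from the memory path
print(answer("Why is the sky blue"))  # falls through to the reasoning path
```

The design choice mirrors a cache in front of a slow backend: a deterministic lookup handles the common case, and only misses pay the full compute cost.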
Tiered KV Cache
Not all cached tokens are equally important. DeepSeek-V4 classifies KV entries into three tiers based on access frequency, optimizing memory usage across the storage hierarchy.
Hot Tier: Recent tokens in GPU HBM. Fastest access, highest cost.
Warm Tier: Moderately accessed tokens. Quantized and stored in CPU RAM.
Cold Tier: Rarely accessed tokens. Heavily compressed, stored on SSD.
DeepSeek-V4 achieves 40% total memory reduction compared to keeping all KV pairs in GPU memory, enabling longer context windows.
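The three-tier classification can be sketched as a frequency-threshold policy over KV-cache entries. The thresholds and tier actions below are illustrative assumptions; DeepSeek has not published the actual eviction policy.

```python
from collections import Counter

def assign_tiers(access_counts: Counter, hot_min: int = 10, warm_min: int = 3) -> dict:
    """Map each cached token position to a storage tier by access frequency."""
    tiers = {}
    for token_pos, hits in access_counts.items():
        if hits >= hot_min:
            tiers[token_pos] = "hot"    # keep full-precision KV in GPU HBM
        elif hits >= warm_min:
            tiers[token_pos] = "warm"   # quantize, move to CPU RAM
        else:
            tiers[token_pos] = "cold"   # compress heavily, spill to SSD
    return tiers

counts = Counter({0: 25, 1: 5, 2: 1})
tiers = assign_tiers(counts)  # {0: 'hot', 1: 'warm', 2: 'cold'}
```

Only the hot tier occupies GPU memory, which is where the claimed 40% reduction comes from: warm and cold entries trade access latency for cheaper storage.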
Performance & Inference
V4's architectural innovations translate directly into real-world efficiency gains.
Sparse FP8 Decoding
Only activated experts use FP8 computation. Combined with reduced active params (32B vs 37B), inference throughput improves by 1.8x.
KV Cache Reduction
Tiered KV cache with hot/warm/cold classification reduces GPU memory pressure by 40%, enabling longer contexts per GPU.
Engram Lookup Cost
Hash-based factual retrieval costs near-zero FLOPs. For knowledge-heavy workloads, this dramatically reduces per-token cost.
Expected Cost Efficiency
Combining all optimizations, V4 is projected to serve at 10-40x lower cost per token than comparable closed-source models.
What Developers & Researchers Say
Real feedback and in-depth reviews from the AI community
"DeepSeek V4's sparse attention mechanism is a game-changer. By cutting attention cost to O(n log n), they've essentially solved the quadratic bottleneck that has plagued long-context models. The 1M context window at this speed is unprecedented."
"We benchmarked V4 against our internal models. The Engram memory integration is brilliant — it gives the model near-perfect factual recall without the latency penalty of traditional RAG. The tiered KV cache alone saves us 60% on inference costs."
"The paper's most underappreciated contribution is the FP8 quantization strategy. Most models lose 2-3% accuracy with aggressive quantization. V4 maintains 99.1% quality at 1.8x throughput — this is production-ready efficiency at frontier capability."
"Just deployed V4 for our internal code review pipeline. The ~1T MoE architecture with only ~32B active params means we run it on a single node. Context quality at 1M tokens is remarkably coherent — it tracks dependencies across entire codebases."
"Respect to the DeepSeek team. V4 proves that open-weight models can compete at the frontier. The DualPipe parallelism strategy is elegant — near-zero pipeline bubbles at 16K GPU scale is an engineering marvel."
"From an infrastructure perspective, V4's memory hierarchy is the most well-designed I've seen. The Engram layer acts like an L3 cache for knowledge — deterministic, low-latency, and updateable without retraining. This is how production LLMs should work."
Want to try DeepSeek V4 yourself?
Explore DeepSeek's capabilities in our interactive chat interface.
Try DeepSeek Chat