DeepSeek V4
1 Trillion Parameters. 1 Million Context. The architecture that unifies every DeepSeek innovation into one model.
~1T Params / 32B Active
DeepSeek-V4 has approximately 1 trillion total parameters with only ~32B active per token -- fewer than DeepSeek-V3's 37B, yet far more capable.
1M Token Context
DeepSeek-V4's context window is 8x DeepSeek-V3's (128K → 1M tokens), powered by Native Sparse Attention (NSA), enabling book-length inputs.
1.8x Inference Speed
DeepSeek-V4's Sparse FP8 decoding and Tiered KV Cache deliver 1.8x faster inference with 40% less memory than DeepSeek-V3.
DeepSeek-V3 vs DeepSeek-V4: Architectural Evolution
DeepSeek-V4 integrates component innovations, each validated in a separate DeepSeek research paper, into a unified architecture that improves on DeepSeek-V3 in every dimension.
| Feature | DeepSeek-V3 | DeepSeek-V4 |
|---|---|---|
| Total Parameters | 671B | ~1T |
| Active Parameters | 37B | ~32B |
| Context Window | 128K tokens | 1M tokens |
| Attention | MLA (Multi-Head Latent) | MLA + NSA (Sparse) |
| External Memory | None | Engram (O(1) Lookup) |
| Expert Routing | Top-8 of 256 experts | Top-16 of 256 experts |
| Training Precision | FP8 | Sparse FP8 |
| KV Cache | Standard MLA | Tiered (Hot/Warm/Cold) |
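The "Top-16 of 256 experts" routing in the table can be sketched as softmax gating over the k highest-scoring experts. This is a minimal illustration, not DeepSeek's actual router; the logit shapes and renormalization scheme are assumptions.

```python
import numpy as np

def route_topk(gate_logits: np.ndarray, k: int = 16):
    """Pick the top-k experts per token and renormalize their gate weights."""
    topk_idx = np.argsort(gate_logits, axis=-1)[..., -k:]           # indices of the k largest logits
    topk_logits = np.take_along_axis(gate_logits, topk_idx, axis=-1)
    weights = np.exp(topk_logits - topk_logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)                  # softmax over the selected experts only
    return topk_idx, weights

# One token routed over 256 experts: only 16 receive nonzero weight.
logits = np.random.randn(1, 256)
idx, w = route_topk(logits, k=16)
```

Because the softmax is taken over only the selected experts, the 240 unselected experts contribute zero compute for that token -- the mechanism behind "1T total, ~32B active."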
Component Integration
Each major V4 subsystem was validated independently before integration.
NSA Sparse Attention
arXiv 2502.11089
Engram Memory
arXiv 2601.07372
mHC Hyper-Connections
arXiv 2512.24880
MoE (Top-16 Routing)
Sparse FP8 Decoding
Tiered KV Cache
Native Sparse Attention (NSA)
Standard attention looks at every token for every query -- O(n^2) cost. DeepSeek-V4's NSA uses a learned 'lightning indexer' to select only the most relevant tokens, achieving O(n log n) complexity while preserving quality. This is what enables DeepSeek-V4's million-token context window.
Full Attention
Every token attends to every other token. Accurate but scales quadratically -- prohibitive for 1M context.
How NSA Works
Compress: Pool tokens into block summaries
Select: Lightning indexer scores blocks, picks top-k
Attend: Full attention only on selected tokens + sliding window for local context
At 1M tokens, NSA reduces attention FLOPs by ~90% compared to full attention, making million-token context feasible on standard hardware.
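The compress → select → attend steps above can be sketched as follows. This is a toy illustration of the selection logic only: the block size, top-k, window width, and dot-product block scorer are stand-ins (the real lightning indexer is learned), and the actual NSA kernels are fused GPU code.

```python
import numpy as np

def nsa_select(q: np.ndarray, keys: np.ndarray, block: int = 4, topk: int = 2, window: int = 4):
    """Return the key indices one query attends to: top-k blocks plus a local sliding window."""
    n = len(keys)
    # 1. Compress: mean-pool keys into per-block summaries.
    summaries = np.array([keys[i:i + block].mean(axis=0) for i in range(0, n, block)])
    # 2. Select: score blocks against the query and keep the top-k (stand-in for the indexer).
    scores = summaries @ q
    top_blocks = np.argsort(scores)[-topk:]
    selected = {j for b in top_blocks for j in range(b * block, min((b + 1) * block, n))}
    # 3. Attend: always include a sliding window of recent tokens for local context.
    selected |= set(range(max(0, n - window), n))
    return sorted(selected)

keys = np.random.randn(32, 8)
q = np.random.randn(8)
idx = nsa_select(q, keys)  # far fewer than all 32 positions
```

With these toy settings each query attends to at most topk·block + window = 12 of 32 positions; at 1M tokens the same selection principle is what cuts attention FLOPs by roughly 90%.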
Memory Systems
DeepSeek-V4 introduces two complementary memory innovations: Engram for factual lookup and Tiered KV Cache for efficient context storage. Together, these systems make DeepSeek-V4 both knowledgeable and memory-efficient.
Engram: O(1) Factual Memory
Not every token needs expensive MoE computation. DeepSeek-V4's Engram intercepts factual queries ('The capital of France is...') and answers them via hash lookup, bypassing the transformer entirely.
Reasoning Path: Full MoE computation for logic and inference tasks.
Memory Path: O(1) hash lookup for factual knowledge. Near-zero cost.
The model learns when to 'think' vs when to 'remember'. For common facts, the Engram path is ~1000x cheaper than MoE computation.
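The reason-vs-remember split can be sketched with a hash-keyed fact store. Everything here is illustrative: the store contents, the string-normalization heuristic, and the `run_moe` placeholder are assumptions, not Engram's actual interface.

```python
# Toy fact store standing in for Engram's hash-keyed memory.
FACTS = {
    "capital of france": "Paris",
    "boiling point of water (c)": "100",
}

def run_moe(query: str) -> str:
    """Placeholder for the expensive full-transformer MoE forward pass."""
    return f"<reasoned answer for: {query}>"

def answer(query: str) -> str:
    key = query.lower().rstrip("?").strip()
    if key in FACTS:            # memory path: O(1) dict lookup, near-zero FLOPs
        return FACTS[key]
    return run_moe(query)       # reasoning path: full MoE computation

print(answer("Capital of France?"))   # served from the memory path
print(answer("Why is the sky blue"))  # falls through to the reasoning path
```

The design choice mirrors a cache in front of a slow backend: a deterministic lookup handles the common case, and only misses pay the full compute cost.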
Tiered KV Cache
Not all cached tokens are equally important. DeepSeek-V4 classifies KV entries into three tiers based on access frequency, optimizing memory usage across the storage hierarchy.
Hot Tier: Recent tokens in GPU HBM. Fastest access, highest cost.
Warm Tier: Moderately accessed tokens. Quantized and stored in CPU RAM.
Cold Tier: Rarely accessed tokens. Heavily compressed, stored on SSD.
DeepSeek-V4 achieves 40% total memory reduction compared to keeping all KV pairs in GPU memory, enabling longer context windows.
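The three-tier classification can be sketched as a frequency-threshold policy over KV-cache entries. The thresholds and tier actions below are illustrative assumptions; DeepSeek has not published the actual eviction policy.

```python
from collections import Counter

def assign_tiers(access_counts: Counter, hot_min: int = 10, warm_min: int = 3) -> dict:
    """Map each cached token position to a storage tier by access frequency."""
    tiers = {}
    for token_pos, hits in access_counts.items():
        if hits >= hot_min:
            tiers[token_pos] = "hot"    # keep full-precision KV in GPU HBM
        elif hits >= warm_min:
            tiers[token_pos] = "warm"   # quantize, move to CPU RAM
        else:
            tiers[token_pos] = "cold"   # compress heavily, spill to SSD
    return tiers

counts = Counter({0: 25, 1: 5, 2: 1})
tiers = assign_tiers(counts)  # {0: 'hot', 1: 'warm', 2: 'cold'}
```

Only the hot tier occupies GPU memory, which is where the claimed 40% reduction comes from: warm and cold entries trade access latency for cheaper storage.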
Performance & Inference
V4's architectural innovations translate directly into real-world efficiency gains.
Sparse FP8 Decoding
Only activated experts use FP8 computation. Combined with reduced active params (32B vs 37B), inference throughput improves by 1.8x.
KV Cache Reduction
Tiered KV cache with hot/warm/cold classification reduces GPU memory pressure by 40%, enabling longer contexts per GPU.
Engram Lookup Cost
Hash-based factual retrieval costs near-zero FLOPs. For knowledge-heavy workloads, this dramatically reduces per-token cost.
Expected Cost Efficiency
Combining all optimizations, V4 is projected to serve at 10-40x lower cost per token than comparable closed-source models.
What Developers & Researchers Say
Real feedback and in-depth reviews from the AI community
"DeepSeek V4's sparse attention mechanism is a game-changer. By cutting attention cost to O(n log n), they've essentially solved the quadratic bottleneck that has plagued long-context models. The 1M context window at this speed is unprecedented."
"We benchmarked V4 against our internal models. The Engram memory integration is brilliant — it gives the model near-perfect factual recall without the latency penalty of traditional RAG. The tiered KV cache alone saves us 60% on inference costs."
"The paper's most underappreciated contribution is the FP8 quantization strategy. Most models lose 2-3% accuracy with aggressive quantization. V4 maintains 99.1% quality at 1.8x throughput — this is production-ready efficiency at frontier capability."
"Just deployed V4 for our internal code review pipeline. The ~1T MoE architecture with only ~32B active params means we run it on a single node. Context quality at 1M tokens is remarkably coherent — it tracks dependencies across entire codebases."
"Respect to the DeepSeek team. V4 proves that open-weight models can compete at the frontier. The DualPipe parallelism strategy is elegant — near-zero pipeline bubbles at 16K GPU scale is an engineering marvel."
"From an infrastructure perspective, V4's memory hierarchy is the most well-designed I've seen. The Engram layer acts like an L3 cache for knowledge — deterministic, low-latency, and updateable without retraining. This is how production LLMs should work."
Want to try DeepSeek V4 yourself?
Explore DeepSeek's capabilities in our interactive chat interface.
Try DeepSeek Chat