Engram: Lossless External Memory
Why stuff an encyclopedia into neurons? DeepSeek Engram proposes an N-gram-based lookup mechanism that fundamentally decouples factual memory from logical reasoning.
Data is Memory
In the DeepSeek Engram architecture, knowledge is indexed in a massive N-gram database instead of being 'burned' into weights via backpropagation. This allows Engram to store vastly more factual information without increasing model parameters.
Focus on Reasoning
With DeepSeek Engram handling factual recall, model parameters (Transformer) focus only on grammar, logic, and complex reasoning — the tasks neural networks excel at.
Linear Interpolation
The final result in DeepSeek Engram is a weighted fusion of model prediction and memory lookup, combining the computational reasoning of Transformers with the instant recall of hash-based retrieval.
Core Problem: Calculation vs Lookup
Current AI (Transformers) handles all information the same way: expensive neural computation. DeepSeek Engram challenges this paradigm by recognizing that factual recall and logical reasoning are fundamentally different cognitive tasks.
Traditional AI Model
Task: Who was the first US president?
Engram Enhanced Model
Task: Who was the first US president?
How does DeepSeek Engram work?
DeepSeek Engram doesn't use complex neural networks for everything. Instead, Engram uses N-grams to identify common phrases and map them directly to memory addresses, bypassing expensive matrix multiplications entirely.
1. N-gram Split
DeepSeek Engram looks not just at individual words, but at multi-word combinations (e.g., 'Capital of'). These N-gram patterns contain fixed factual knowledge that can be looked up rather than computed.
2. Fast Hashing
Engram maps phrases to a numeric ID via a hash function. This process is deterministic and extremely fast with O(1) complexity — no neural network computation required.
3. Retrieve & Fuse
Engram fetches the corresponding 'knowledge vector' from the table using the hashed ID. If this retrieved knowledge is relevant to the current context, Engram blends it into the model's prediction.
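The three steps above can be sketched in plain Python. Everything here is illustrative: the table size, hash function, and toy knowledge vectors are assumptions for the sketch, not the paper's actual implementation.

```python
import hashlib

TABLE_SIZE = 2 ** 20  # assumed size of the hashed embedding table

def ngrams(tokens, n=2):
    """Step 1: split the token stream into overlapping N-grams."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_id(ngram):
    """Step 2: deterministically hash an N-gram to a table slot, O(1)."""
    key = "\x1f".join(ngram).encode("utf-8")
    digest = hashlib.sha1(key).digest()
    return int.from_bytes(digest[:8], "big") % TABLE_SIZE

# Step 3: retrieve the knowledge vector for each N-gram from the table.
# A plain dict stands in for a real embedding matrix here.
table = {ngram_id(("capital", "of")): [0.9, 0.1, 0.0]}  # toy vector

tokens = ["the", "capital", "of", "france"]
hits = [(ng, table[ngram_id(ng)]) for ng in ngrams(tokens) if ngram_id(ng) in table]
```

Only the phrase "capital of" hits the table in this toy run; every miss costs a single hash and dictionary probe, with no matrix multiplication anywhere.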
Core Mechanism: Logit Mixing
The core formula of DeepSeek Engram:

P = (1 − λ) · P_model + λ · P_engram

How does Engram decide whether to 'think' or 'look up'? Via a learnable gating coefficient λ (lambda) that adapts based on context.
Transformer Model
P(Lyon) = 0.25
Engram Memory
P(Lyon) = 0.00
Paper insight: for rare (long-tail) knowledge, DeepSeek Engram's λ automatically increases (rely on memory); for reasoning tasks, λ decreases (rely on the Transformer). This adaptive gating is what makes Engram so powerful.
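The mixing step can be sketched in plain Python. A fixed λ is used here for illustration; in the paper λ comes from a learned, context-dependent gate.

```python
def mix_probs(p_model, p_engram, lam):
    """Blend per-token distributions: P = (1 - lam) * P_model + lam * P_engram."""
    return [(1.0 - lam) * pm + lam * pe for pm, pe in zip(p_model, p_engram)]

# Toy next-token distributions over a 3-word vocab: [Lyon, Paris, other].
p_model  = [0.25, 0.40, 0.35]   # the Transformer is unsure
p_engram = [0.00, 0.98, 0.02]   # the memory table is confident

# Factual context: the gate pushes lambda high, so memory dominates.
p = mix_probs(p_model, p_engram, lam=0.8)
```

Because both inputs are probability distributions and the weights sum to 1, the blended output is still a valid distribution; with λ = 0 the model's prediction passes through unchanged.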
Try It: Where Does Engram Focus?
Type in the input box below. The deeper the red, the more the Engram module intervenes (i.e., the model treats the span as 'dead knowledge' that can simply be looked up).
Instant Domain Adaptation
Traditional LLMs need expensive continual pre-training to learn new domains (e.g., new laws or internal company documents). This process requires GPU clusters and can take days or weeks.
DeepSeek Engram changes this entirely. You just update the N-gram index table on disk, and the model 'sees' the new domain knowledge immediately without any neural network weight updates. This makes Engram ideal for enterprise deployments where knowledge changes frequently.
> Found in Engram Table.
> Interpolating... Done.
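That update path can be sketched as follows. The file name, hash scheme, and vector format are all hypothetical; the point is that adapting to a new domain is a disk write, not a training run.

```python
import hashlib
import json
import pathlib

TABLE_SIZE = 2 ** 20                            # assumed table size
INDEX_PATH = pathlib.Path("engram_index.json")  # hypothetical on-disk index

def slot(ngram):
    """Hash an N-gram to a table slot (same scheme at index and query time)."""
    key = " ".join(ngram).encode("utf-8")
    return int.from_bytes(hashlib.sha1(key).digest()[:8], "big") % TABLE_SIZE

def add_domain_knowledge(entries):
    """Write new (N-gram, vector) pairs into the index; no weight updates involved."""
    index = json.loads(INDEX_PATH.read_text()) if INDEX_PATH.exists() else {}
    for ngram, vector in entries:
        index[str(slot(ngram))] = vector
    INDEX_PATH.write_text(json.dumps(index))

# A new internal document arrives: index its key phrases and the model
# 'sees' them on the very next forward pass.
add_domain_knowledge([(("policy", "v2"), [0.1, 0.7, 0.2])])
```

Contrast this with continual pre-training: no GPUs, no gradient steps, and the update is live as soon as the file is written.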
How is DeepSeek Engram different from RAG?
Many mistake DeepSeek Engram for RAG (Retrieval-Augmented Generation). Though both tackle hallucinations, their mechanisms differ fundamentally: Engram operates at the token level with O(1) hash lookups inside the forward pass, while RAG retrieves entire document chunks at query time.
Traditional LLM
RAG
Engram
The Golden Ratio: U-Shaped Curve
The DeepSeek Engram paper discovered a key 'budget allocation' law. If total model parameters are fixed, there is an optimal split between neural computation and Engram memory:
- All MoE Experts (100%): Smart, but wastes brainpower on trivia.
- Hybrid (~80% Experts / 20% Memory): The sweet spot (lowest loss). This is the power of Engram.
Validation Loss (Lower is Better)
Expert Perspectives
How do academia and industry view the Engram architecture?
"Engram's U-shaped curve finding is striking. It not only validates that 'memory' and 'computation' can be decoupled, but, more importantly, gives a concrete golden ratio (80/20). It means we have long been wasting compute training models to memorize encyclopedias."
"For enterprise applications, Engram's 'instant domain adaptation' is a killer feature. With no re-pre-training required, simply updating the N-gram index table lets the model absorb the latest private data, which dramatically reduces the complexity and cost of RAG systems."
"This DeepSeek paper returns to fundamentals: N-grams, a seemingly ancient technique, gain new life under the modern Transformer architecture. An O(1) table-lookup mechanism may be the best answer to long-tail knowledge hallucination."
FAQ
Knowledge Quiz
How much have you learned?
