Parametric Memory (Weights) + Non-Parametric Memory (N-grams)

Engram: Lossless External Memory

Why stuff an encyclopedia into neurons? DeepSeek Engram proposes an N-gram based lookup mechanism, fundamentally decoupling factual memory from logical reasoning.

Data is Memory

In the DeepSeek Engram architecture, knowledge is indexed in a massive N-gram database instead of being 'burned' into weights via backpropagation. This allows Engram to store vastly more factual information without increasing model parameters.

Focus on Reasoning

With DeepSeek Engram handling factual recall, model parameters (Transformer) focus only on grammar, logic, and complex reasoning — the tasks neural networks excel at.

Linear Interpolation

The final result in DeepSeek Engram is a weighted fusion of model prediction and memory lookup, combining the computational reasoning of Transformers with the instant recall of hash-based retrieval.

Core Problem: Calculation vs Lookup

Current AI (Transformers) handles all information the same way: expensive neural computation. DeepSeek Engram challenges this paradigm by recognizing that factual recall and logical reasoning are fundamentally different cognitive tasks.

Traditional AI Model

Task: Who was the first US president?

Step 1... Step 2... Step 3... (a full forward pass through every layer)
Result: George Washington (high compute cost)
Low efficiency: brainpower wasted on rote memorization

Engram Enhanced Model

Task: Who was the first US president?

Identify common phrase → hash lookup
Result: George Washington (zero compute)
High efficiency: as fast as a dictionary lookup

How does DeepSeek Engram work?

DeepSeek Engram doesn't use complex neural networks for everything. Instead, Engram uses N-grams to identify common phrases and map them directly to memory addresses, bypassing expensive matrix multiplications entirely.

DEMO MODE

Input: "The Capital of France is"
2-grams: "The Capital", "Capital of", "of France", "France is"
Each 2-gram is passed through Hash(x) to index the Engram memory table:

0x3F1A → [0.1, 0.5, ...]  (Found!)
0x8B2C → [0.9, 0.2, ...]
0x1D9E → [0.3, 0.8, ...]
0x4A7B → [0.4, 0.1, ...]

1. N-gram Split

DeepSeek Engram looks not just at individual words, but at multi-word combinations (e.g., 'Capital of'). These N-gram patterns contain fixed factual knowledge that can be looked up rather than computed.
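The splitting step is simple sliding-window extraction. A minimal sketch (the `ngrams` helper is illustrative, not the paper's actual tokenizer interface):

```python
def ngrams(tokens, n=2):
    """Slide a window of size n over the token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["The", "Capital", "of", "France", "is"]
# Yields ('The', 'Capital'), ('Capital', 'of'), ('of', 'France'), ('France', 'is')
print(ngrams(tokens))
```

Each of these 2-grams then becomes a candidate key for the memory table in the next step.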


2. Fast Hashing

Engram maps phrases to a numeric ID via a hash function. This process is deterministic and extremely fast with O(1) complexity — no neural network computation required.


3. Retrieve & Fuse

Engram fetches the corresponding 'knowledge vector' from the table using the hashed ID. If this retrieved knowledge is relevant to the current context, Engram blends it into the model's prediction.
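A toy version of retrieve-and-fuse, assuming the table maps hashed IDs to small knowledge vectors and that blending is a simple convex mix; the real Engram fusion is learned, and all names here are hypothetical:

```python
# Hypothetical memory table: hashed ID -> knowledge vector.
memory_table = {
    0x3F1A: [0.1, 0.5],
    0x8B2C: [0.9, 0.2],
}

def retrieve(ngram_id, hidden, gate=0.5):
    """Fetch the knowledge vector for an ID and blend it into the
    hidden state; on a miss, the hidden state passes through unchanged."""
    vec = memory_table.get(ngram_id)
    if vec is None:
        return hidden
    return [(1 - gate) * h + gate * v for h, v in zip(hidden, vec)]

print(retrieve(0x3F1A, [0.3, 0.1]))   # hit: memory blended into the state
print(retrieve(0xDEAD, [0.3, 0.1]))   # miss: state returned unchanged
```

The miss path matters: when a phrase is not in the table, the model simply falls back to its own computation.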

Core Mechanism: Logit Mixing

The core formula of DeepSeek Engram:

P = (1 - λ) * P_model + λ * P_engram

How does Engram decide whether to 'think' or 'look up'? Via a learnable gating coefficient λ (lambda) that adapts based on context.

Transformer Model

Context: "The capital of France is..."
P(Paris) = 0.60
P(Lyon) = 0.25
The model infers Paris from context, but hesitates between Paris and Lyon.

Engram Memory

The lookup shows "Paris" follows "capital of France" 95% of the time.
P(Paris) = 0.95
P(Lyon) = 0.00

Mixed output at λ = 0.50 (the gate slides between λ = 0, trust the model, and λ = 1, trust memory):
Paris 77.5% · Lyon 12.5% · London 10.0%

Paper insight: For rare knowledge (Long-tail), DeepSeek Engram's λ automatically increases (rely on memory); for reasoning tasks, λ decreases (rely on Transformer model). This adaptive gating is what makes Engram so powerful.
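The gating formula can be checked against the demo numbers: with λ = 0.5, mixing the Transformer's distribution with the memory's reproduces the 77.5% Paris / 12.5% Lyon split. A minimal sketch (the `mix` helper is illustrative):

```python
def mix(p_model, p_engram, lam):
    """P = (1 - lam) * P_model + lam * P_engram, per candidate token."""
    tokens = set(p_model) | set(p_engram)
    return {t: (1 - lam) * p_model.get(t, 0.0) + lam * p_engram.get(t, 0.0)
            for t in tokens}

p_model  = {"Paris": 0.60, "Lyon": 0.25}   # Transformer prediction
p_engram = {"Paris": 0.95, "Lyon": 0.00}   # memory lookup
mixed = mix(p_model, p_engram, lam=0.50)   # undecided gate
# Paris comes out near 0.775 and Lyon near 0.125, matching the demo.
```

Note what λ does at the extremes: at λ = 0 the output is pure neural prediction, at λ = 1 it is pure table lookup; in the paper λ is learned per context rather than fixed.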

Hands-On: Where Does Engram Pay Attention?

Type in the input box below. The deeper the red, the more the Engram module intervenes (i.e., the model treats the phrase as 'dead knowledge' that can simply be looked up).

Advantage

Instant Domain Adaptation

Traditional LLMs need expensive continual pre-training to learn new domains (e.g., new laws or internal company documents). This process requires GPU clusters and can take days or weeks.

DeepSeek Engram changes this entirely. You just update the N-gram index table on disk, and the model 'sees' the new domain knowledge immediately without any neural network weight updates. This makes Engram ideal for enterprise deployments where knowledge changes frequently.

Traditional: retrain on a GPU cluster (old data from 2020 is baked into the weights; new 2025 data requires another training run)
Engram: update the index on disk (only the N-gram table changes; the update is instant)

> Query: "Latest 2025 Regulations"
> Found in Engram Table.
> Interpolating... Done.
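The "update the index, not the weights" workflow reduces to editing a key-value table. A toy sketch, with a dict standing in for the on-disk table and all names hypothetical, shows why no GPU is involved:

```python
# Hypothetical toy index; names are illustrative, not the paper's API.
engram_table = {"capital of france": "Paris"}   # index built from old data

def lookup(phrase: str) -> str:
    """O(1) lookup into the N-gram index (a dict stands in here)."""
    return engram_table.get(phrase.lower(), "Unknown")

print(lookup("2025 regulations"))               # "Unknown": not indexed yet

# Domain adaptation = editing the index; no weight update, no retraining.
engram_table["2025 regulations"] = "New compliance rules"
print(lookup("2025 regulations"))               # found immediately
```

The model's weights never change; only the table does, which is why the adaptation is instant.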

How is DeepSeek Engram different from RAG?

Many mistake DeepSeek Engram for RAG (Retrieval Augmented Generation). Though both address 'hallucinations', their implementation methods differ vastly. Engram operates at the token level with O(1) hash lookups, while RAG retrieves entire document chunks.

                         Traditional LLM      RAG                       Engram (WINNER)
                         Pure Transformer     Retriever + Generator     N-gram + Model
Knowledge Update Speed   Slow (retrain)       Fast (update vector DB)   Instant (update hash table)
Inference Speed          Fast (direct gen)    Slow (retrieve + read)    Ultra fast (O(1) lookup)
Compute Cost (VRAM)      High (params)        Very high (long context)  Low (sparse access)

The Golden Ratio: U-Shaped Curve

The DeepSeek Engram paper discovered a key 'budget allocation' law. If total model parameters are fixed, there is an optimal split between neural computation and Engram memory:

  • All MoE Experts (100%):
    Smart, but wastes brainpower on trivia.
  • Hybrid (~80% Experts / 20% Memory):
    The sweet spot (lowest Loss). This is the power of Engram.

Validation Loss (Lower is Better)

[Chart: validation loss vs. the share of parameters allocated to MoE experts, from 0% to 100%. Loss spans roughly 1.705 to 1.75, with the marked sweet spot near 80% experts / 20% memory.]

Expert Perspectives

How do academia and industry view the Engram architecture?

D
Dr. Alex Chen
AI Research Scientist @ Stanford

"Engram's U-shaped-curve finding is striking. It not only confirms that 'memory' and 'computation' can be decoupled, but also pins down a concrete golden ratio (80/20). It means we have long been wasting compute training models to recite encyclopedias."

Via Twitter
S
Sarah Miller
Lead LLM Architect

"For enterprise applications, Engram's 'instant domain adaptation' is a killer feature. Instead of re-pretraining, you just update the N-gram index table to give the model the latest private data, which drastically reduces the complexity and cost of RAG systems."

Via LinkedIn
O
OpenAI Observer
Tech Blogger

"This DeepSeek paper goes back to basics: N-grams, a seemingly ancient technique, find new life inside the modern Transformer architecture. An O(1) table-lookup mechanism may well be the best remedy for long-tail knowledge hallucinations."

Via Paper Review

FAQ

Q: Aren't N-grams outdated technology?
A: N-grams are ancient NLP technology. DeepSeek's innovation with Engram is combining them, as an independent 'plugin', with modern Transformers via learned linear interpolation. This 'retro + modern' combination requires deep mathematical intuition to balance the weights between neural computation and hash-based retrieval.

Q: What happens when a phrase is not in the Engram table?
A: If Engram returns 'Unknown', its prediction probability is very low, so the gating coefficient λ automatically shifts toward the Transformer model. DeepSeek Engram thus uses lookup for rote facts and neural thinking for novel situations, seamlessly.

Q: Does offloading memory make the model dumber?
A: No, it makes the model smarter. By offloading memory tasks to disk (the N-gram table), the neural parameters can focus entirely on learning logic, grammar, and sentiment analysis. DeepSeek Engram embodies the principle of 'using resources where they matter most'.

Q: How large can the Engram table grow?
A: The table can scale to billions of entries on standard disk storage. Because Engram uses hash-based lookups with O(1) complexity, table size does not affect retrieval speed, so an entire encyclopedia of factual knowledge can be stored without any impact on inference latency.

Q: Can Engram be attached to other models?
A: In principle, yes. DeepSeek Engram is designed as a modular component that integrates via linear interpolation at the output logit level. It can be attached to any autoregressive language model, though optimal performance requires fine-tuning the gating coefficient λ to learn when to rely on Engram memory versus neural computation.

Knowledge Quiz

How much have you learned?


What problem does Engram architecture mainly solve?