Engram-Guided Structural Memory for Reliable Long-Context OCR and Vision-Based Memory Compression
GrooveJ aka Danny Lee
Independent Researcher
Email: groovejaylee@gmail.com
Abstract
This paper presents neither a finished system nor experimental results.
We propose a conceptual framework and research agenda for reliable long-context OCR and vision-based memory compression.
Recent vision-language models compress long textual contexts by rendering text into images and encoding them into a small number of vision tokens. While effective at moderate compression ratios, this paradigm exhibits a dangerous failure mode under extreme compression: when visual evidence is weak, decoders increasingly rely on language priors, producing fluent but incorrect hallucinated text.
We introduce an Engram-Guided Dual-Path Memory framework that separates pixel-based visual compression from structure-only memory traces (engrams). Engrams encode morphological and layout constraints without storing semantic content, and decoding is explicitly constrained to satisfy both visual evidence and structural memory.
We further propose an honest uncertainty mechanism that enables the model to emit explicit unknown tokens rather than hallucinate.
This article defines new failure modes, architectural principles, and evaluation protocols for future work on reliable long-context OCR.
1. Introduction
Large Language Models (LLMs) and Vision-Language Models (VLMs) face a fundamental bottleneck in processing long contexts due to the quadratic complexity of attention with respect to sequence length.
To mitigate this, recent work proposes optical context compression: render long text into images, encode them via a vision encoder into a small number of vision tokens, and decode them back into text.
DeepSeek-OCR demonstrates that a small number of vision tokens can reconstruct text approximately 10× longer than the vision-token budget with near-lossless accuracy, and even tolerate 20× compression with degraded but usable performance.
However, recent critical analysis reveals a fundamental failure mode.
When visual evidence is weak, ambiguous, or strongly compressed, the decoder increasingly relies on language priors rather than perceptual evidence, producing outputs that are:
- fluent,
- statistically likely,
- but factually incorrect.
Formally, let:
- $I$ be a rendered document image,
- $Z_v$ be visual tokens extracted from $I$,
- $\hat{T}$ be the reconstructed text.
Standard decoding solves:

$$\hat{T} = \arg\max_{T} P(T \mid Z_v)$$

When visual evidence is insufficient, the conditional distribution collapses toward the language prior:

$$P(T \mid Z_v) \approx P_{\text{LM}}(T)$$

In this regime, reconstruction is no longer OCR but language-model autocompletion.
This motivates our central question:
When evidence is weak, should an AI guess, or should it admit ignorance?
2. Scope and Position of This Work
This is not an experimental paper.
We explicitly state:
- No implementation is presented.
- No benchmark results are reported.
- No performance superiority is claimed.
This article is a position paper and conceptual framework.
Our goals are:
- To precisely characterize the prior-dominated failure mode in optical OCR compression.
- To propose a principled architectural mechanism to structurally suppress hallucination.
- To define a concrete research agenda and evaluation protocol for future work.
3. Related Work
3.1 End-to-End OCR and Vision-Language Models
End-to-end OCR models such as Nougat, GOT-OCR2.0, InternVL, Qwen-VL, and DeepSeek-OCR unify detection and recognition in a single VLM architecture.
DeepSeek-OCR is the first to explicitly frame OCR as a vision-text compression problem, reporting quantitative compression ratios and demonstrating feasibility up to 20×.
3.2 Optical Context Compression
DeepSeek-OCR compresses long contexts by rendering text into 2D images and encoding them into a small number of vision tokens.
However, Liang et al. show that under semantic disruption and random-text conditions, performance collapses, revealing strong reliance on language priors rather than visual evidence.
3.3 Hallucination in Multimodal Models
Hallucination in multimodal LLMs has been widely studied.
Prior-driven hallucination is particularly severe when perceptual signals are weak or ambiguous.
Existing mitigation strategies focus on:
- confidence calibration
- uncertainty heads
- selective prediction
We argue these are insufficient because they do not constrain the feasible output space.
3.4 Human Memory and Engram Theory
In neuroscience, an engram denotes a memory trace that stores structural information rather than semantic content.
Human memory degrades gracefully: content fades before structure.
We borrow this principle as a design inspiration.
4. Problem Formulation
Let:
- $I$ : a rendered document image
- $Z_v = E_v(I)$ : visual tokens produced by a vision encoder $E_v$
- $T^*$ : the ground-truth text
- $\hat{T}$ : the reconstructed text

Standard decoding:

$$\hat{T} = \arg\max_{T} P(T \mid Z_v)$$

When $Z_v$ is insufficient, the conditional distribution collapses toward the language prior:

$$P(T \mid Z_v) \approx P_{\text{LM}}(T)$$

We define prior-dominated decoding risk as the probability of emitting a fluent reconstruction that is nonetheless wrong:

$$H = \Pr\big[\hat{T} \neq T^* \;\wedge\; \hat{T} \text{ is fluent}\big]$$

Our objective is to minimize hallucination risk $H$ under extreme compression.
5. Engram-Guided Dual-Path Architecture
We propose a Dual-Path Compression Architecture.
Let:

$$Z = (Z_v, Z_e)$$

where:
- $Z_v$ are pixel-based visual tokens
- $Z_e$ are engram codes encoding structure only

Decoding solves:

$$\hat{T} = \arg\max_{T} P(T \mid Z_v, Z_e)$$

subject to a structural constraint:

$$\mathcal{C}(\hat{T}, Z_e) = 1$$

where $\mathcal{C}$ checks that $\hat{T}$ is consistent with the line, token-length, and symbol-class structure recorded in $Z_e$.
This architecture explicitly separates:
- perceptual evidence (vision tokens)
- structural memory (engrams)
and constrains decoding by both.
6. Engram Definition (Formalization)
We define an engram as a tuple of structural variables:

$$Z_e = \big(L,\ \{W_i\},\ \{d_{ij}\},\ \{s_{ij}\},\ \{b_k\}\big)$$

where:
- $L$ : number of lines
- $W_i$ : number of tokens in line $i$
- $d_{ij}$ : length (in characters) of token $j$ in line $i$
- $s_{ij} \in \{\text{digit},\ \text{letter},\ \text{punct}\}$ : symbol class of token $j$ in line $i$
- $b_k$ : paragraph and block boundaries
Crucially, $Z_e$ stores no lexical tokens and no semantic content.
This representation captures only the document’s layout and low-level structural regularities, while deliberately discarding all lexical identities. As a result, the engram defines a constrained hypothesis space over valid document structures without encoding any semantic content.
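As a minimal sketch, the structural variables above can be extracted from plain text in a few lines. The function name `extract_engram` and the dictionary layout are illustrative assumptions; only the variables $(L, W_i, d_{ij}, s_{ij}, b_k)$ come from the definition above:

```python
def extract_engram(text):
    """Extract a semantic-free engram (L, W_i, d_ij, s_ij, b_k) from plain text.

    Only counts, lengths, and symbol classes are stored -- never the tokens
    themselves, so no lexical or semantic content survives.
    """
    def symbol_class(tok):
        if tok.isdigit():
            return "digit"
        if tok.isalpha():
            return "letter"
        return "punct"

    lines = text.split("\n")
    return {
        "L": len(lines),                                   # number of lines
        "W": [len(ln.split()) for ln in lines],            # tokens per line
        "d": [[len(t) for t in ln.split()] for ln in lines],          # token lengths
        "s": [[symbol_class(t) for t in ln.split()] for ln in lines], # symbol classes
        "b": [i for i, ln in enumerate(lines) if not ln.strip()],     # block boundaries
    }

engram = extract_engram("Invoice 2024\nTotal : 42")
```

Note that the resulting dictionary constrains the shape of any reconstruction (two lines, a 4-character digit token, and so on) while being uninformative about which invoice or which total was written.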
7. Constrained Decoding with Engram
We propose constraint-aware beam search.
At each decoding step $t$, maintain a beam $B_t$ of partial hypotheses $y_{1:t}$.

Expansion retains only engram-consistent continuations:

$$B'_{t+1} = \big\{\,(y_{1:t}, y_{t+1}) : y_{1:t} \in B_t,\ y_{t+1} \in \mathcal{V},\ \mathcal{C}(y_{1:t+1}, Z_e) = 1 \,\big\}$$

Keep the top-$B$ candidates by model score:

$$B_{t+1} = \operatorname{top-}B_{\,y \in B'_{t+1}}\ \log P(y \mid Z_v, Z_e)$$

If no engram-consistent expansion remains, or the best candidate falls below a confidence threshold $\tau$:

$$B'_{t+1} = \emptyset \quad \text{or} \quad \max_{y \in B'_{t+1}} P(y \mid Z_v, Z_e) < \tau,$$

we output the uncertainty token ⟨UNK⟩.
At each decoding step, we only allow token expansions that are consistent with the engram constraints; if no valid expansion exists, the decoder explicitly abstains by emitting ⟨UNK⟩.
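One expansion step of this procedure can be sketched as follows. The `score` and `constraint` callables stand in for the model's log-probability and the engram check $\mathcal{C}$; both, along with the ASCII placeholder `"<UNK>"`, are illustrative assumptions rather than a fixed API:

```python
import math

UNK = "<UNK>"  # ASCII stand-in for the uncertainty token

def constrained_beam_step(beams, vocab, score, constraint, beam_width, tau):
    """One step of constraint-aware beam search.

    beams      : list of (token_sequence, cumulative_log_prob) pairs
    score      : score(seq, tok) -> log P(tok | seq, Z_v, Z_e)  (model-dependent)
    constraint : constraint(seq, tok) -> True iff tok satisfies the engram
    Returns the top beam_width engram-consistent expansions. If none survive
    the constraint, or the best expansion scores below log(tau), the decoder
    abstains by appending <UNK> to every beam instead of guessing.
    """
    candidates = []
    for seq, logp in beams:
        for tok in vocab:
            if constraint(seq, tok):                 # engram-feasible tokens only
                candidates.append((seq + [tok], logp + score(seq, tok)))
    if not candidates or max(c[1] for c in candidates) < math.log(tau):
        return [(seq + [UNK], logp) for seq, logp in beams]  # honest abstention
    candidates.sort(key=lambda c: c[1], reverse=True)
    return candidates[:beam_width]
```

For example, with an engram slot known to be numeric, `constraint` can simply reject every non-digit token, so a fluent but non-numeric continuation is pruned before the language prior can select it.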
8. Honest Uncertainty Mechanism
We introduce explicit uncertainty tokens:
- ⟨UNK⟩
- ⟨NUM_2DIGIT⟩
- ⟨WORD_LEN_5⟩
At each position $i$, the decoder commits to a concrete token only above a confidence threshold $\tau$:

$$\hat{T}_i = \begin{cases} \arg\max_{v \in \mathcal{V}} P(v \mid Z_v, Z_e, \hat{T}_{<i}) & \text{if } \max_{v} P(v \mid Z_v, Z_e, \hat{T}_{<i}) \geq \tau \\ \langle\text{UNK}\rangle \text{ or a structural placeholder} & \text{otherwise} \end{cases}$$
This enforces a simple principle:
It is better to say “I don’t know” than to hallucinate.
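When the decoder abstains, the engram still licenses a typed placeholder: the slot's symbol class and length are known even when its content is not. A minimal sketch (the function name and the ASCII token spellings are assumptions; the tokens mirror ⟨NUM_2DIGIT⟩ and ⟨WORD_LEN_5⟩ above):

```python
def uncertainty_token(symbol_class, length):
    """Map an engram slot (symbol class + token length) to a typed
    uncertainty token, falling back to plain <UNK> when the engram
    carries no structure beyond the slot's existence."""
    if symbol_class == "digit":
        return f"<NUM_{length}DIGIT>"     # e.g. a 2-digit number of unknown value
    if symbol_class == "letter":
        return f"<WORD_LEN_{length}>"     # e.g. a 5-letter word of unknown identity
    return "<UNK>"
```

A downstream consumer can then distinguish "an unreadable 2-digit number" from "an unreadable word", which is strictly more informative than a bare ⟨UNK⟩ and still hallucination-free.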
9. Training Objective (Proposed)
We propose a composite loss:

$$\mathcal{L} = \mathcal{L}_{\text{OCR}} + \lambda_e \, \mathcal{L}_{\text{engram}} + \lambda_c \, \mathcal{L}_{\text{cons}}$$

OCR Loss: standard token-level negative log-likelihood of the ground-truth text:

$$\mathcal{L}_{\text{OCR}} = -\sum_t \log P(T^*_t \mid T^*_{<t}, Z_v, Z_e)$$

Engram Loss: cross-entropy supervision on the predicted structural variables only:

$$\mathcal{L}_{\text{engram}} = \operatorname{CE}(\hat{Z}_e, Z_e)$$

Consistency Loss: given multiple renderings $I^{(1)}, I^{(2)}$ of the same text, the decoder's output distributions must agree:

$$\mathcal{L}_{\text{cons}} = D_{\mathrm{KL}}\big( P(T \mid Z^{(1)}) \,\big\|\, P(T \mid Z^{(2)}) \big)$$
Additionally, we train on:
- random-text documents
- grammar-broken text
- semantic-free layouts
to explicitly break language priors.
This composite objective is designed to explicitly disentangle recognition accuracy from structural regularization and prior suppression. The OCR loss preserves end-task performance, while the engram loss forces the model to predict only semantic-free structural variables. The consistency term enforces invariance across multiple renderings, preventing the decoder from exploiting spurious visual cues. Together, these terms explicitly discourage reliance on language priors and encourage the model to abstain rather than hallucinate under weak visual evidence.
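The three terms can be sketched in plain Python over toy inputs (a framework implementation would operate on tensors; the mismatch-count engram loss and symmetric KL here are simplifying assumptions, as are all argument names):

```python
import math

def composite_loss(p_true_tokens, engram_pred, engram_true, p1, p2,
                   lam_e=1.0, lam_c=1.0):
    """Composite objective L = L_OCR + lam_e * L_engram + lam_c * L_cons.

    p_true_tokens    : probabilities the decoder assigns to the gold tokens
    engram_pred/true : predicted vs. ground-truth structural variables (flat lists)
    p1, p2           : decoder output distributions for two renderings of the same text
    """
    # OCR loss: mean token-level negative log-likelihood
    l_ocr = -sum(math.log(p) for p in p_true_tokens) / len(p_true_tokens)
    # Engram loss (simplified): fraction of structural variables predicted wrong
    l_eng = sum(a != b for a, b in zip(engram_pred, engram_true)) / len(engram_true)
    # Consistency loss: symmetric KL between the two renderings' distributions
    kl = lambda p, q: sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    l_cons = 0.5 * (kl(p1, p2) + kl(p2, p1))
    return l_ocr + lam_e * l_eng + lam_c * l_cons
```

A perfectly confident, structurally correct, rendering-invariant model drives all three terms to zero; each kind of failure (weak recognition, broken structure, rendering-dependent shortcuts) raises exactly one term.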
10. Expected Results and Hypotheses
We state falsifiable hypotheses:
H1 (Hallucination Reduction):
$$\text{HallucinationRate}_{\text{Engram}} < \text{HallucinationRate}_{\text{Baseline}}$$

H2 (Honest Uncertainty Increase):
$$\text{HonestUncertainty}_{\text{Engram}} > \text{HonestUncertainty}_{\text{Baseline}}$$

H3 (Recall Recovery):
$$\text{FinalRecall}_{\text{Engram}} \approx \text{FinalRecall}_{\text{Baseline}}$$
Together, these hypotheses characterize the central trade-off induced by engram-guided decoding: hallucinations are reduced by explicitly increasing abstentions, while the final recall is preserved up to a small degradation. This tests whether reliability can be improved without sacrificing end-task accuracy.
11. Evaluation Protocol (Proposed)
Conventional OCR metrics measure how often the system is correct, but not how dangerously it fails when it is wrong. We therefore propose evaluation metrics that explicitly quantify hallucination, honest uncertainty, and structural invalidity under extreme compression. This protocol directly reflects our objective of minimizing risky failures rather than maximizing average accuracy.

We propose four metrics:
- Hallucination Rate: the fraction of positions where the model commits to a concrete token that is incorrect
- Honest Uncertainty Rate: the fraction of positions where the model abstains via ⟨UNK⟩ or a structural placeholder
- Structural Violation Rate: the fraction of outputs that violate the engram constraints
- Extreme Compression Robustness: the degradation of the above metrics as the compression ratio increases
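The token-level metrics can be computed over aligned prediction/reference sequences as follows (a minimal sketch; the function name, the ASCII `"<UNK>"` spelling, and the exact-alignment assumption are all illustrative, and the Structural Violation Rate is omitted because it additionally requires the engram):

```python
UNK = "<UNK>"  # ASCII stand-in for the uncertainty token

def reliability_metrics(predicted, reference):
    """Reliability metrics over position-aligned token sequences.

    hallucination_rate      : model committed to a concrete token and got it wrong
    honest_uncertainty_rate : model abstained instead of guessing
    final_recall            : model committed and was correct
    """
    n = len(reference)
    halluc = sum(p != r and p != UNK for p, r in zip(predicted, reference)) / n
    honest = sum(p == UNK for p in predicted) / n
    recall = sum(p == r for p, r in zip(predicted, reference)) / n
    return {"hallucination_rate": halluc,
            "honest_uncertainty_rate": honest,
            "final_recall": recall}
```

Under this accounting an abstention is never counted as a hallucination, which is exactly the asymmetry hypotheses H1 and H2 rely on.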
Stress tests include:
- semantic disruption test
- random-text OCR
- low-contrast rendering

These stress tests are designed to explicitly break language priors and visual shortcuts, and to expose failure modes that are invisible under standard benchmarks.
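For the random-text condition, the test corpus can be generated deterministically; here is a minimal sketch (the function name and the 1-to-8-character token lengths are arbitrary choices, not part of the protocol):

```python
import random
import string

def random_text_document(n_lines, tokens_per_line, seed=0):
    """Generate a semantic-free document for the random-text OCR stress test.

    Tokens are uniformly random letter or digit strings, so any decoder that
    reconstructs them correctly cannot be leaning on language priors.
    A fixed seed makes the corpus reproducible across runs.
    """
    rng = random.Random(seed)
    def token():
        length = rng.randint(1, 8)
        alphabet = rng.choice([string.ascii_lowercase, string.digits])
        return "".join(rng.choice(alphabet) for _ in range(length))
    return "\n".join(" ".join(token() for _ in range(tokens_per_line))
                     for _ in range(n_lines))
```

Rendering the returned string to an image then yields inputs on which any residual accuracy must come from visual evidence alone.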
12. Limitations
- No experiments
- Unknown computational overhead
- Unknown recall–precision tradeoff
- Unknown scalability
This paper makes no performance claim.
13. Conclusion
We ask:
When evidence is weak, should an AI guess, or should it admit ignorance?
We argue that:
- pixel compression alone is insufficient
- structural memory must be explicit
- reliable memory requires designed forgetting
By introducing an Engram-Guided Dual-Path architecture and an honest uncertainty mechanism, we aim to improve the reliability of extreme optical context compression under weak visual evidence.
Beyond efficiency, this perspective also suggests a shift from optimizing solely for average accuracy toward controlling how the system behaves when it fails. In safety-critical applications, a small number of fluent but incorrect predictions may be more harmful than a larger number of explicitly uncertain outputs.
In future work, we envision a selective re-encoding mechanism, where positions emitting ⟨UNK⟩ trigger local high-resolution re-rendering and re-encoding. This enables recovery of uncertain tokens without reprocessing the full document, while preserving the benefits of global compression.
Alternatively, enriching the engram with fine-grained geometric constraints or introducing a vision-only candidate generator may further reduce the frequency of ⟨UNK⟩ emissions.
References
[1] Wei et al., DeepSeek-OCR, arXiv 2025
[2] Liang et al., Visual Merit or Linguistic Crutch?, arXiv 2026
[3] GOT-OCR2.0, 2024
[4] Pix2Struct, ICML 2023
[5] Hallucination of Multimodal LLMs, 2024
[6] Segment Anything, ICCV 2023
[7] CLIP, ICML 2021
[8] InternVL2, 2024
[9] Qwen-VL, 2024
Full Paper
📄 Full PDF + Code & Paper Repository:
https://github.com/groovejaylee-sudo/llmpaper


