Engram-Guided Structural Memory for Reliable Long-Context OCR and Vision-Based Memory Compression
GrooveJ aka Danny Lee
Independent Researcher
Email: groovejaylee@gmail.com
Abstract
This paper presents neither a finished system nor experimental results.
We propose a conceptual framework and research agenda for reliable long-context OCR and vision-based memory compression.
Recent vision-language models compress long textual contexts by rendering text into images and encoding them into a small number of vision tokens. While effective at moderate compression ratios, this paradigm exhibits a dangerous failure mode under extreme compression: when visual evidence is weak, decoders increasingly rely on language priors, producing fluent but incorrect hallucinated text.
We introduce an Engram-Guided Dual-Path Memory framework that separates pixel-based visual compression from structure-only memory traces (engrams). Engrams encode morphological and layout constraints without storing semantic content, and decoding is explicitly constrained to satisfy both visual evidence and structural memory.
We further propose an honest uncertainty mechanism that enables the model to emit explicit unknown tokens rather than hallucinate.
This article defines new failure modes, architectural principles, and evaluation protocols for future work on reliable long-context OCR.
1. Introduction
Large Language Models (LLMs) and Vision-Language Models (VLMs) face a fundamental bottleneck in processing long contexts due to the quadratic complexity of attention with respect to sequence length.
To mitigate this, recent work proposes optical context compression: render long text into images, encode them via a vision encoder into a small number of vision tokens, and decode them back into text.
DeepSeek-OCR demonstrates that a small number of vision tokens can reconstruct text approximately 10× longer than the vision-token budget with near-lossless accuracy, and even tolerate 20× compression with degraded but usable performance.
However, recent critical analysis reveals a fundamental failure mode.
When visual evidence is weak, ambiguous, or strongly compressed, the decoder increasingly relies on language priors rather than perceptual evidence, producing outputs that are:
- fluent,
- statistically likely,
- but factually incorrect.
Formally, let:
- $I$ be a rendered document image,
- $Z_v$ be visual tokens extracted from $I$,
- $\hat{T}$ be the reconstructed text.
Standard decoding solves:

$$\hat{T} = \arg\max_{T} P(T \mid Z_v)$$

When visual evidence is insufficient, the conditional distribution collapses toward the language prior:

$$P(T \mid Z_v) \approx P_{\text{LM}}(T)$$

In this regime, reconstruction is no longer OCR but language-model autocompletion.
This motivates our central question:
When evidence is weak, should an AI guess, or should it admit ignorance?
2. Scope and Position of This Work
This is not an experimental paper.
We explicitly state:
- No implementation is presented.
- No benchmark results are reported.
- No performance superiority is claimed.
This article is a position paper and conceptual framework.
Our goals are:
- To precisely characterize the prior-dominated failure mode in optical OCR compression.
- To propose a principled architectural mechanism to structurally suppress hallucination.
- To define a concrete research agenda and evaluation protocol for future work.
3. Related Work
3.1 End-to-End OCR and Vision-Language Models
End-to-end OCR models such as Nougat, GOT-OCR2.0, InternVL, Qwen-VL, and DeepSeek-OCR unify detection and recognition in a single VLM architecture.
DeepSeek-OCR is the first to explicitly frame OCR as a vision-text compression problem, reporting quantitative compression ratios and demonstrating feasibility up to 20×.
3.2 Optical Context Compression
DeepSeek-OCR compresses long contexts by rendering text into 2D images and encoding them into a small number of vision tokens.
However, Liang et al. show that under semantic disruption and random-text conditions, performance collapses, revealing strong reliance on language priors rather than visual evidence.
3.3 Hallucination in Multimodal Models
Hallucination in multimodal LLMs has been widely studied.
Prior-driven hallucination is particularly severe when perceptual signals are weak or ambiguous.
Existing mitigation strategies focus on:
- confidence calibration
- uncertainty heads
- selective prediction
We argue these are insufficient because they do not constrain the feasible output space.
3.4 Human Memory and Engram Theory
In neuroscience, an engram denotes a memory trace that stores structural information rather than semantic content.
Human memory degrades gracefully: content fades before structure.
We borrow this principle as a design inspiration.
4. Problem Formulation
Let:
- $I$ : a rendered document image
- $Z_v = E_v(I)$ : visual tokens produced by a vision encoder $E_v$
- $T^*$ : the ground-truth text
- $\hat{T}$ : the reconstructed text

Standard decoding:

$$\hat{T} = \arg\max_{T} P(T \mid Z_v)$$

When $Z_v$ is insufficient, the conditional distribution collapses toward the language prior:

$$P(T \mid Z_v) \approx P_{\text{LM}}(T)$$

We define prior-dominated decoding risk as the probability of emitting a fluent reconstruction that is nonetheless wrong:

$$H = \Pr\big[\hat{T} \neq T^* \;\wedge\; \hat{T} \text{ is fluent}\big]$$

Our objective is to minimize hallucination risk $H$ under extreme compression.
5. Engram-Guided Dual-Path Architecture
We propose a Dual-Path Compression Architecture.
Let:

$$Z = (Z_v, Z_e)$$

where:
- $Z_v$ are pixel-based visual tokens
- $Z_e$ are engram codes encoding structure only

Decoding solves:

$$\hat{T} = \arg\max_{T} P(T \mid Z_v, Z_e)$$

subject to a structural constraint:

$$\mathcal{C}(\hat{T}, Z_e) = 1$$

where $\mathcal{C}$ checks that $\hat{T}$ is consistent with the line, token-length, and symbol-class structure recorded in $Z_e$.
This architecture explicitly separates:
- perceptual evidence (vision tokens)
- structural memory (engrams)
and constrains decoding by both.
6. Engram Definition (Formalization)
We define an engram as a tuple of structural variables:

$$Z_e = \big(L,\ \{W_i\},\ \{d_{ij}\},\ \{s_{ij}\},\ \{b_k\}\big)$$

where:
- $L$ : number of lines
- $W_i$ : number of tokens in line $i$
- $d_{ij}$ : length (in characters) of token $j$ in line $i$
- $s_{ij} \in \{\text{digit},\ \text{letter},\ \text{punct}\}$ : symbol class of token $j$ in line $i$
- $b_k$ : paragraph and block boundaries
Crucially, $Z_e$ stores no lexical tokens and no semantic content.
This representation captures only the document’s layout and low-level structural regularities, while deliberately discarding all lexical identities. As a result, the engram defines a constrained hypothesis space over valid document structures without encoding any semantic content.
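As a minimal sketch, the structural variables above can be extracted from plain text in a few lines. The function name `extract_engram` and the dictionary layout are illustrative assumptions; only the variables $(L, W_i, d_{ij}, s_{ij}, b_k)$ come from the definition above:

```python
def extract_engram(text):
    """Extract a semantic-free engram (L, W_i, d_ij, s_ij, b_k) from plain text.

    Only counts, lengths, and symbol classes are stored -- never the tokens
    themselves, so no lexical or semantic content survives.
    """
    def symbol_class(tok):
        if tok.isdigit():
            return "digit"
        if tok.isalpha():
            return "letter"
        return "punct"

    lines = text.split("\n")
    return {
        "L": len(lines),                                   # number of lines
        "W": [len(ln.split()) for ln in lines],            # tokens per line
        "d": [[len(t) for t in ln.split()] for ln in lines],          # token lengths
        "s": [[symbol_class(t) for t in ln.split()] for ln in lines], # symbol classes
        "b": [i for i, ln in enumerate(lines) if not ln.strip()],     # block boundaries
    }

engram = extract_engram("Invoice 2024\nTotal : 42")
```

Note that the resulting dictionary constrains the shape of any reconstruction (two lines, a 4-character digit token, and so on) while being uninformative about which invoice or which total was written.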
7. Constrained Decoding with Engram
We propose constraint-aware beam search.
At each decoding step $t$, maintain a beam $B_t$ of partial hypotheses $y_{1:t}$.

Expansion retains only engram-consistent continuations:

$$B'_{t+1} = \big\{\,(y_{1:t}, y_{t+1}) : y_{1:t} \in B_t,\ y_{t+1} \in \mathcal{V},\ \mathcal{C}(y_{1:t+1}, Z_e) = 1 \,\big\}$$

Keep the top-$B$ candidates by model score:

$$B_{t+1} = \operatorname{top-}B_{\,y \in B'_{t+1}}\ \log P(y \mid Z_v, Z_e)$$

If no engram-consistent expansion remains, or the best candidate falls below a confidence threshold $\tau$:

$$B'_{t+1} = \emptyset \quad \text{or} \quad \max_{y \in B'_{t+1}} P(y \mid Z_v, Z_e) < \tau,$$

we output the uncertainty token ⟨UNK⟩.
At each decoding step, we only allow token expansions that are consistent with the engram constraints; if no valid expansion exists, the decoder explicitly abstains by emitting ⟨UNK⟩.
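One expansion step of this procedure can be sketched as follows. The `score` and `constraint` callables stand in for the model's log-probability and the engram check $\mathcal{C}$; both, along with the ASCII placeholder `"<UNK>"`, are illustrative assumptions rather than a fixed API:

```python
import math

UNK = "<UNK>"  # ASCII stand-in for the uncertainty token

def constrained_beam_step(beams, vocab, score, constraint, beam_width, tau):
    """One step of constraint-aware beam search.

    beams      : list of (token_sequence, cumulative_log_prob) pairs
    score      : score(seq, tok) -> log P(tok | seq, Z_v, Z_e)  (model-dependent)
    constraint : constraint(seq, tok) -> True iff tok satisfies the engram
    Returns the top beam_width engram-consistent expansions. If none survive
    the constraint, or the best expansion scores below log(tau), the decoder
    abstains by appending <UNK> to every beam instead of guessing.
    """
    candidates = []
    for seq, logp in beams:
        for tok in vocab:
            if constraint(seq, tok):                 # engram-feasible tokens only
                candidates.append((seq + [tok], logp + score(seq, tok)))
    if not candidates or max(c[1] for c in candidates) < math.log(tau):
        return [(seq + [UNK], logp) for seq, logp in beams]  # honest abstention
    candidates.sort(key=lambda c: c[1], reverse=True)
    return candidates[:beam_width]
```

For example, with an engram slot known to be numeric, `constraint` can simply reject every non-digit token, so a fluent but non-numeric continuation is pruned before the language prior can select it.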
8. Honest Uncertainty Mechanism
We introduce explicit uncertainty tokens:
- ⟨UNK⟩
- ⟨NUM_2DIGIT⟩
- ⟨WORD_LEN_5⟩
At each position $i$, the decoder commits to a concrete token only above a confidence threshold $\tau$:

$$\hat{T}_i = \begin{cases} \arg\max_{v \in \mathcal{V}} P(v \mid Z_v, Z_e, \hat{T}_{<i}) & \text{if } \max_{v} P(v \mid Z_v, Z_e, \hat{T}_{<i}) \geq \tau \\ \langle\text{UNK}\rangle \text{ or a structural placeholder} & \text{otherwise} \end{cases}$$
This enforces a simple principle:
It is better to say “I don’t know” than to hallucinate.
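When the decoder abstains, the engram still licenses a typed placeholder: the slot's symbol class and length are known even when its content is not. A minimal sketch (the function name and the ASCII token spellings are assumptions; the tokens mirror ⟨NUM_2DIGIT⟩ and ⟨WORD_LEN_5⟩ above):

```python
def uncertainty_token(symbol_class, length):
    """Map an engram slot (symbol class + token length) to a typed
    uncertainty token, falling back to plain <UNK> when the engram
    carries no structure beyond the slot's existence."""
    if symbol_class == "digit":
        return f"<NUM_{length}DIGIT>"     # e.g. a 2-digit number of unknown value
    if symbol_class == "letter":
        return f"<WORD_LEN_{length}>"     # e.g. a 5-letter word of unknown identity
    return "<UNK>"
```

A downstream consumer can then distinguish "an unreadable 2-digit number" from "an unreadable word", which is strictly more informative than a bare ⟨UNK⟩ and still hallucination-free.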
9. Training Objective (Proposed)
We propose a composite loss:

$$\mathcal{L} = \mathcal{L}_{\text{OCR}} + \lambda_e \, \mathcal{L}_{\text{engram}} + \lambda_c \, \mathcal{L}_{\text{cons}}$$

OCR Loss: standard token-level negative log-likelihood of the ground-truth text:

$$\mathcal{L}_{\text{OCR}} = -\sum_t \log P(T^*_t \mid T^*_{<t}, Z_v, Z_e)$$

Engram Loss: cross-entropy supervision on the predicted structural variables only:

$$\mathcal{L}_{\text{engram}} = \operatorname{CE}(\hat{Z}_e, Z_e)$$

Consistency Loss: given multiple renderings $I^{(1)}, I^{(2)}$ of the same text, the decoder's output distributions must agree:

$$\mathcal{L}_{\text{cons}} = D_{\mathrm{KL}}\big( P(T \mid Z^{(1)}) \,\big\|\, P(T \mid Z^{(2)}) \big)$$
Additionally, we train on:
- random-text documents
- grammar-broken text
- semantic-free layouts
to explicitly break language priors.
This composite objective is designed to explicitly disentangle recognition accuracy from structural regularization and prior suppression. The OCR loss preserves end-task performance, while the engram loss forces the model to predict only semantic-free structural variables. The consistency term enforces invariance across multiple renderings, preventing the decoder from exploiting spurious visual cues. Together, these terms explicitly discourage reliance on language priors and encourage the model to abstain rather than hallucinate under weak visual evidence.
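The three terms can be sketched in plain Python over toy inputs (a framework implementation would operate on tensors; the mismatch-count engram loss and symmetric KL here are simplifying assumptions, as are all argument names):

```python
import math

def composite_loss(p_true_tokens, engram_pred, engram_true, p1, p2,
                   lam_e=1.0, lam_c=1.0):
    """Composite objective L = L_OCR + lam_e * L_engram + lam_c * L_cons.

    p_true_tokens    : probabilities the decoder assigns to the gold tokens
    engram_pred/true : predicted vs. ground-truth structural variables (flat lists)
    p1, p2           : decoder output distributions for two renderings of the same text
    """
    # OCR loss: mean token-level negative log-likelihood
    l_ocr = -sum(math.log(p) for p in p_true_tokens) / len(p_true_tokens)
    # Engram loss (simplified): fraction of structural variables predicted wrong
    l_eng = sum(a != b for a, b in zip(engram_pred, engram_true)) / len(engram_true)
    # Consistency loss: symmetric KL between the two renderings' distributions
    kl = lambda p, q: sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    l_cons = 0.5 * (kl(p1, p2) + kl(p2, p1))
    return l_ocr + lam_e * l_eng + lam_c * l_cons
```

A perfectly confident, structurally correct, rendering-invariant model drives all three terms to zero; each kind of failure (weak recognition, broken structure, rendering-dependent shortcuts) raises exactly one term.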
10. Expected Results and Hypotheses
We state falsifiable hypotheses:
H1 (Hallucination Reduction):
$$\text{HallucinationRate}_{\text{Engram}} < \text{HallucinationRate}_{\text{Baseline}}$$

H2 (Honest Uncertainty Increase):
$$\text{HonestUncertainty}_{\text{Engram}} > \text{HonestUncertainty}_{\text{Baseline}}$$

H3 (Recall Recovery):
$$\text{FinalRecall}_{\text{Engram}} \approx \text{FinalRecall}_{\text{Baseline}}$$
Together, these hypotheses characterize the central trade-off induced by engram-guided decoding: hallucinations are reduced by explicitly increasing abstentions, while the final recall is preserved up to a small degradation. This tests whether reliability can be improved without sacrificing end-task accuracy.
11. Evaluation Protocol (Proposed)
Conventional OCR metrics measure how often the system is correct, but not how dangerously it fails when it is wrong. We therefore propose evaluation metrics that explicitly quantify hallucination, honest uncertainty, and structural invalidity under extreme compression. This protocol directly reflects our objective of minimizing risky failures rather than maximizing average accuracy.

We propose four metrics:
- Hallucination Rate: the fraction of positions where the model commits to a concrete token that is incorrect
- Honest Uncertainty Rate: the fraction of positions where the model abstains via ⟨UNK⟩ or a structural placeholder
- Structural Violation Rate: the fraction of outputs that violate the engram constraints
- Extreme Compression Robustness: the degradation of the above metrics as the compression ratio increases
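The token-level metrics can be computed over aligned prediction/reference sequences as follows (a minimal sketch; the function name, the ASCII `"<UNK>"` spelling, and the exact-alignment assumption are all illustrative, and the Structural Violation Rate is omitted because it additionally requires the engram):

```python
UNK = "<UNK>"  # ASCII stand-in for the uncertainty token

def reliability_metrics(predicted, reference):
    """Reliability metrics over position-aligned token sequences.

    hallucination_rate      : model committed to a concrete token and got it wrong
    honest_uncertainty_rate : model abstained instead of guessing
    final_recall            : model committed and was correct
    """
    n = len(reference)
    halluc = sum(p != r and p != UNK for p, r in zip(predicted, reference)) / n
    honest = sum(p == UNK for p in predicted) / n
    recall = sum(p == r for p, r in zip(predicted, reference)) / n
    return {"hallucination_rate": halluc,
            "honest_uncertainty_rate": honest,
            "final_recall": recall}
```

Under this accounting an abstention is never counted as a hallucination, which is exactly the asymmetry hypotheses H1 and H2 rely on.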
Stress tests include:
- semantic disruption test
- random-text OCR
- low-contrast rendering

These stress tests are designed to explicitly break language priors and visual shortcuts, and to expose failure modes that are invisible under standard benchmarks.
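For the random-text condition, the test corpus can be generated deterministically; here is a minimal sketch (the function name and the 1-to-8-character token lengths are arbitrary choices, not part of the protocol):

```python
import random
import string

def random_text_document(n_lines, tokens_per_line, seed=0):
    """Generate a semantic-free document for the random-text OCR stress test.

    Tokens are uniformly random letter or digit strings, so any decoder that
    reconstructs them correctly cannot be leaning on language priors.
    A fixed seed makes the corpus reproducible across runs.
    """
    rng = random.Random(seed)
    def token():
        length = rng.randint(1, 8)
        alphabet = rng.choice([string.ascii_lowercase, string.digits])
        return "".join(rng.choice(alphabet) for _ in range(length))
    return "\n".join(" ".join(token() for _ in range(tokens_per_line))
                     for _ in range(n_lines))
```

Rendering the returned string to an image then yields inputs on which any residual accuracy must come from visual evidence alone.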
12. Limitations
- No experiments
- Unknown computational overhead
- Unknown recall–precision tradeoff
- Unknown scalability
This paper makes no performance claim.
13. Conclusion
We ask:
When evidence is weak, should an AI guess, or should it admit ignorance?
We argue that:
- pixel compression alone is insufficient
- structural memory must be explicit
- reliable memory requires designed forgetting
By introducing an Engram-Guided Dual-Path architecture and an honest uncertainty mechanism, we aim to improve the reliability of extreme optical context compression under weak visual evidence.
Beyond efficiency, this perspective also suggests a shift from optimizing solely for average accuracy toward controlling how the system behaves when it fails. In safety-critical applications, a small number of fluent but incorrect predictions may be more harmful than a larger number of explicitly uncertain outputs.
In future work, we envision a selective re-encoding mechanism, where positions emitting ⟨UNK⟩ trigger local high-resolution re-rendering and re-encoding. This enables recovery of uncertain tokens without reprocessing the full document, while preserving the benefits of global compression.
Alternatively, enriching the engram with fine-grained geometric constraints or introducing a vision-only candidate generator may further reduce the frequency of ⟨UNK⟩ emissions.
References
[1] Wei et al., DeepSeek-OCR, arXiv 2025
[2] Liang et al., Visual Merit or Linguistic Crutch?, arXiv 2026
[3] GOT-OCR2.0, 2024
[4] Pix2Struct, ICML 2023
[5] Hallucination of Multimodal LLMs, 2024
[6] Segment Anything, ICCV 2023
[7] CLIP, ICML 2021
[8] InternVL2, 2024
[9] Qwen-VL, 2024
Full Paper
📄 Full PDF + Code & Paper Repository:
https://github.com/groovejaylee-sudo/llmpaper


