Your Model Doesn't Need to Re-Read the Document: Introducing Stateful Neural Databases

Community Article Published March 14, 2026

TL;DR — We built a demo that processes a document into a tiny, fixed-size neural state (~48–61 MB), saves it to disk, and answers questions about it later — without ever touching the original text again.

The Problem

Large language models have a memory problem. A 70B model serving 128K tokens of context burns 160 GB on KV cache alone. That cache grows linearly with sequence length, making long-context inference expensive and stateless — every new session re-reads the entire document from scratch.

The Idea: Bounded Memory That Persists

CoDA-GQA-L (Constrained Orthogonal Differential Attention with Grouped-Query Value-Routed Landmark Banks) replaces the standard KV cache with a fixed-size, 384-slot buffer per layer that never grows, regardless of input length.

| Context Length | Standard KV (70B) | CoDA-GQA-L | Compression |
|---|---|---|---|
| 2K | 2.56 GB | 120 MB | 21× |
| 32K | 40 GB | 120 MB | 341× |
| 128K | 160 GB | 120 MB | 1,365× |
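Those ratios are easy to sanity-check: the 70B column works out to roughly 1.25 MiB of KV cache per token (a figure inferred from the table, not stated above), while the CoDA-GQA-L state stays pinned at 120 MiB:

```python
# Back-of-envelope check of the compression table.
# The 70B column implies ~1.25 MiB of KV cache per token;
# the CoDA-GQA-L state is fixed at 120 MiB regardless of length.
KV_MIB_PER_TOKEN = 1.25
STATE_MIB = 120

for label, tokens in [("2K", 2048), ("32K", 32768), ("128K", 131072)]:
    standard_mib = KV_MIB_PER_TOKEN * tokens
    print(f"{label}: {standard_mib / 1024:.1f} GiB -> {int(standard_mib / STATE_MIB)}x")
# 2K: 2.5 GiB -> 21x
# 32K: 40.0 GiB -> 341x
# 128K: 160.0 GiB -> 1365x
```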

The buffer has three segments:

  • Recent window — a ring buffer of exact recent tokens (FIFO eviction)
  • Exact landmark bank — an LRU cache of important tokens selected by a learned write gate
  • Summary landmark bank — EMA prototypes that compress older context into semantic centroids
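A minimal sketch of how such a three-segment buffer might be wired up (the slot split, gate threshold, and routing rule here are illustrative assumptions, not the released implementation):

```python
import torch
from collections import OrderedDict

class BoundedState:
    """Illustrative three-segment buffer with a fixed slot budget.
    Slot split, gate threshold, and routing are assumptions for this sketch."""

    def __init__(self, d_model, n_recent=256, n_exact=64, n_summary=64, ema=0.9):
        self.n_recent, self.n_exact = n_recent, n_exact
        self.recent = []                                 # FIFO window of exact tokens
        self.exact = OrderedDict()                       # LRU bank of gated landmarks
        self.summary = torch.zeros(n_summary, d_model)   # EMA prototypes
        self.ema = ema

    def write(self, tok_id, vec, gate):
        self.recent.append((tok_id, vec))
        if len(self.recent) <= self.n_recent:
            return
        old_id, old_vec = self.recent.pop(0)             # FIFO eviction
        if gate > 0.5:                                   # learned write gate (scalar here)
            self.exact[old_id] = old_vec
            self.exact.move_to_end(old_id)
            if len(self.exact) > self.n_exact:
                self.exact.popitem(last=False)           # drop least-recently-used
        else:
            # Blend into the nearest summary centroid (semantic routing).
            sims = torch.nn.functional.cosine_similarity(self.summary, old_vec, dim=-1)
            j = int(sims.argmax())
            self.summary[j] = self.ema * self.summary[j] + (1 - self.ema) * old_vec
```

Total capacity is n_recent + n_exact + n_summary slots, matching the 384-slot budget when the defaults above are used.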

The key insight: this bounded state is serializable. You can torch.save() it, come back a week later, torch.load() it, and query the document without re-reading a single token. We call this pattern stateful neural databases.
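In code, the pattern is just ordinary PyTorch serialization. The state schema below is hypothetical; the demo's actual layout may differ:

```python
import torch

# Hypothetical shape of a serialized neural state: per-layer tensors
# plus a little metadata. The real demo's schema may differ.
state = {
    "meta": {"model": "Qwen3-4B+CoDA", "n_slots": 384, "label": "q3-report"},
    "layers": [
        {"recent": torch.randn(256, 128),
         "exact": torch.randn(64, 128),
         "summary": torch.randn(64, 128)}
        for _ in range(4)  # toy layer count
    ],
}

torch.save(state, "report.pt")        # ingest once, persist to disk
restored = torch.load("report.pt")    # ...days later, on any machine
assert restored["meta"]["n_slots"] == 384
```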

What the Demo Does

The Stateful Neural Database demo on Hugging Face Spaces showcases this workflow in three steps:

1. Ingest

Paste or upload a document. The model (Qwen3-4B with CoDA adapters) processes it through bounded attention layers and produces a fixed-size neural state. The state size is the same whether your document is 500 tokens or 50,000.

2. Save

The state serializes to a .pt file (~61 MB at 4B scale). Save it to the built-in library with a label, or download it for later.

3. Query

Load any saved state — from the library dropdown or by uploading a .pt file — and ask questions. The original document is never re-read. Each query deep-copies the state, so multiple questions don't interfere with each other.
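The isolation trick is a plain deep copy. Here `answer` is a hypothetical stand-in for the demo's generation call, mutating state only to show why the copy matters:

```python
import copy

def answer(state, question):
    """Stand-in for the real generation call: mutates the working
    state so the effect of isolation is visible."""
    state["queries_seen"] = state.get("queries_seen", 0) + 1
    return f"answer to {question!r}"

def query(saved_state, question):
    working = copy.deepcopy(saved_state)    # isolate this query
    return answer(working, question)

saved = {"layers": [[1.0] * 4], "queries_seen": 0}
query(saved, "What is the main finding?")
query(saved, "Who is the author?")
assert saved["queries_seen"] == 0           # original state untouched
```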

The demo ships with three bundled example documents so you can skip the ingest step and jump straight to querying. It runs on ZeroGPU with carefully tuned duration budgets to stay within free-tier limits.

Why This Matters

The stateful neural database pattern decouples document processing from document querying:

  • Process once, query forever. Ingest a 100-page report today, query it next month from a 61 MB file.
  • Constant memory. The state doesn't grow with document length — 384 slots per layer, always.
  • Portable. States are plain PyTorch tensors. Move them between machines, store them in S3, version them in Git LFS.
  • Composable. In principle, multiple document states could be loaded and cross-referenced (future work).

This isn't retrieval-augmented generation. There's no vector database, no chunk-and-embed pipeline, no retrieval step. The model's attention layers are the database — they've compressed the document into a fixed-size representation that supports direct querying.

Under the Hood

Three ideas make this work:

Differential attention via orthogonal rotation. Building on Microsoft's Diff Transformer, CoDA uses a single query projection with a learned per-head rotation to derive signal and noise streams. Signal minus gated noise, one SDPA call. A factorial ablation shows this reduces the bounded-memory penalty by 5.7× compared to standard GQA.
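Because both streams share the same values, subtracting the two attention outputs equals subtracting the attention maps, so the streams can be folded into one batched SDPA call. A toy sketch (shapes, the QR-sampled rotation, and the gate value are all illustrative, not the trained model's):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, H, T, D = 1, 2, 8, 16
q = torch.randn(B, H, T, D)
k = torch.randn(B, H, T, D)
v = torch.randn(B, H, T, D)

# A per-head orthogonal rotation; random via QR here, learned in the model.
rot, _ = torch.linalg.qr(torch.randn(H, D, D))
q_noise = torch.einsum("bhtd,hde->bhte", q, rot)

# Fold signal and noise streams into the head dim: one SDPA kernel launch.
q2 = torch.cat([q, q_noise], dim=1)               # (B, 2H, T, D)
k2 = torch.cat([k, k], dim=1)
v2 = torch.cat([v, v], dim=1)
out = F.scaled_dot_product_attention(q2, k2, v2)
signal, noise = out[:, :H], out[:, H:]

lam = 0.5                                         # learned gate in the real model
diff_out = signal - lam * noise                   # signal minus gated noise
```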

Value-routing for memory banks. Keys carry RoPE positional encodings, making their cosine similarity position-dependent — the same word at position 100 vs. 5000 looks orthogonal in key-space. Memory banks route on Values instead (RoPE-free, pure semantic content), which is what makes deduplication and EMA blending actually work.
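A quick way to see the problem: apply a minimal RoPE to the same vector at two distant positions and compare cosine similarities (this toy rotation is illustrative, not the model's exact RoPE):

```python
import torch

def rope(x, pos, base=10000.0):
    """Minimal RoPE: rotate dim pairs by position-dependent angles."""
    half = x.shape[-1] // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)
    ang = pos * freqs
    cos, sin = torch.cos(ang), torch.sin(ang)
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

torch.manual_seed(0)
k = torch.randn(64)                  # the "same word", one key vector
cos_sim = torch.nn.functional.cosine_similarity
# Key-space: RoPE makes similarity depend on position distance.
print(float(cos_sim(rope(k, 100), rope(k, 5000), dim=0)))  # well below 1
# Value-space: no RoPE, so identical content is identical.
print(float(cos_sim(k, k, dim=0)))                          # 1.0
```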

Two-phase training. Phase 1 trains unbounded differential attention to baseline quality. Phase 2 switches to bounded cache with gradient flow through evictions, so the write gate learns what to keep and what to discard. Without Phase 2, bounded evaluation is catastrophic (PPL jumps from 5.75 to 2,464).

Try It

Launch the demo →

Load one of the bundled examples and ask a question. Or paste your own document, ingest it, save the state, and come back to it later.

The full source — attention module, memory banks, Triton kernels, training pipeline, and this demo — is available at github.com/anthony-maio/CoDA-GQA-L under the MIT license.


Built by @anthonym21. The CoDA-GQA-L mechanism is described in detail in the project's architecture deep dive.
