Your Model Doesn't Need to Re-Read the Document: Introducing Stateful Neural Databases
The Problem
Large language models have a memory problem. A 70B model serving 128K tokens of context burns 160 GB on KV cache alone. That cache grows linearly with sequence length, making long-context inference expensive. It is also stateless: the cache is discarded between sessions, so every new session re-reads the entire document from scratch.
The Idea: Bounded Memory That Persists
CoDA-GQA-L (Constrained Orthogonal Differential Attention with Grouped-Query Value-Routed Landmark Banks) replaces the standard KV cache with a fixed-size, 384-slot buffer per layer that never grows, regardless of input length.
| Context Length | Standard KV (70B) | CoDA-GQA-L | Compression |
|---|---|---|---|
| 2K | 2.56 GB | 120 MB | 21× |
| 32K | 40 GB | 120 MB | 341× |
| 128K | 160 GB | 120 MB | 1,365× |
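The standard-cache column is plain arithmetic. A quick sketch, assuming a 70B layout of 80 layers, 32 KV heads of dimension 128 in fp16 (an assumed configuration chosen to reproduce the table, not a confirmed model spec):

```python
# Back-of-envelope KV-cache sizing. The layer/head/dim numbers below
# are assumptions picked to match the table; the real 70B model may
# use a different attention configuration.
def kv_cache_bytes(seq_len, layers=80, kv_heads=32, head_dim=128, dtype_bytes=2):
    # 2 tensors (K and V) per layer, each [kv_heads, seq_len, head_dim]
    return 2 * layers * kv_heads * head_dim * dtype_bytes * seq_len

for tokens in (2 * 1024, 32 * 1024, 128 * 1024):
    gib = kv_cache_bytes(tokens) / 1024**3
    print(f"{tokens // 1024}K tokens -> {gib:.1f} GiB")
```

At roughly 1.25 MiB per token, the cache crosses 160 GiB by 128K tokens, while the CoDA-GQA-L buffer stays flat.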
The buffer has three segments:
- Recent window — a ring buffer of exact recent tokens (FIFO eviction)
- Exact landmark bank — an LRU cache of important tokens selected by a learned write gate
- Summary landmark bank — EMA prototypes that compress older context into semantic centroids
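The three segments can be pictured with a minimal sketch. Everything here (slot counts, the 0.5 gate threshold, the modulo routing) is an illustrative stand-in, not the actual CoDA-GQA-L implementation:

```python
from collections import OrderedDict, deque

import torch

class BoundedBuffer:
    """Toy three-segment buffer: FIFO ring + LRU bank + EMA prototypes."""

    def __init__(self, recent=128, exact=128, summary=128, dim=64):
        self.recent = deque(maxlen=recent)           # FIFO ring of exact recent tokens
        self.exact = OrderedDict()                   # LRU bank of gated "important" tokens
        self.exact_cap = exact
        self.prototypes = torch.zeros(summary, dim)  # EMA semantic centroids
        self.ema = 0.9

    def write(self, token_id, value, importance):
        evicted = None
        if len(self.recent) == self.recent.maxlen:
            evicted = self.recent[0]                 # oldest token falls out of the ring
        self.recent.append((token_id, value))
        if importance > 0.5:                         # stand-in for the learned write gate
            self.exact[token_id] = value
            self.exact.move_to_end(token_id)
            if len(self.exact) > self.exact_cap:
                self.exact.popitem(last=False)       # LRU eviction
        if evicted is not None:
            tid, v = evicted
            slot = tid % self.prototypes.shape[0]    # toy routing; the real banks route on values
            self.prototypes[slot] = self.ema * self.prototypes[slot] + (1 - self.ema) * v
```

The total footprint is fixed at construction time: writes can only evict or blend, never grow the buffer.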
The key insight: this bounded state is serializable. You can `torch.save()` it, come back a week later, `torch.load()` it, and query the document without re-reading a single token. We call this pattern stateful neural databases.
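In code, the round trip is ordinary PyTorch serialization. The state layout below is a hypothetical stand-in, not the demo's actual schema:

```python
import torch

# Hypothetical shape of a serialized neural state: one fixed-size
# buffer per layer. Layer count and slot shape are illustrative.
state = {f"layer_{i}": torch.zeros(384, 128) for i in range(36)}

torch.save(state, "report.codastate.pt")      # persist once, after ingest
restored = torch.load("report.codastate.pt")  # reload days later; no re-read
assert all(torch.equal(state[k], restored[k]) for k in state)
```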
What the Demo Does
The Stateful Neural Database demo on Hugging Face Spaces showcases this workflow in three steps:
1. Ingest
Paste or upload a document. The model (Qwen3-4B with CoDA adapters) processes it through bounded attention layers and produces a fixed-size neural state. The state size is the same whether your document is 500 tokens or 50,000.
2. Save
The state serializes to a .pt file (~61 MB at 4B scale). Save it to the built-in library with a label, or download it for later.
3. Query
Load any saved state — from the library dropdown or by uploading a .pt file — and ask questions. The original document is never re-read. Each query deep-copies the state, so multiple questions don't interfere with each other.
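The query-isolation step amounts to a deep copy per question. A minimal sketch, in which `answer()` and the state layout are hypothetical stand-ins:

```python
import copy

import torch

base_state = {"layer_0": torch.zeros(384, 8)}

def answer(question, state):
    # Querying may mutate the buffers in place as new tokens are written.
    state["layer_0"] += 1.0
    return f"answer to {question!r}"

for q in ("Who is the author?", "What is the budget?"):
    working = copy.deepcopy(base_state)  # each query gets its own copy
    answer(q, working)

assert base_state["layer_0"].sum() == 0  # the loaded original is untouched
```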
The demo ships with three bundled example documents so you can skip the ingest step and jump straight to querying. It runs on ZeroGPU with carefully tuned duration budgets to stay within free-tier limits.
Why This Matters
The stateful neural database pattern decouples document processing from document querying:
- Process once, query forever. Ingest a 100-page report today, query it next month from a 61 MB file.
- Constant memory. The state doesn't grow with document length — 384 slots per layer, always.
- Portable. States are plain PyTorch tensors. Move them between machines, store them in S3, version them in Git LFS.
- Composable. In principle, multiple document states could be loaded and cross-referenced (future work).
This isn't retrieval-augmented generation. There's no vector database, no chunk-and-embed pipeline, no retrieval step. The model's attention layers are the database — they've compressed the document into a fixed-size representation that supports direct querying.
Under the Hood
Three ideas make this work:
Differential attention via orthogonal rotation. Building on Microsoft's Diff Transformer, CoDA uses a single query projection with a learned per-head rotation to derive signal and noise streams. Signal minus gated noise, one SDPA call. A factorial ablation shows this reduces the bounded-memory penalty by 5.7× compared to standard GQA.
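A minimal sketch of the signal-minus-gated-noise computation, folding both streams into a single SDPA call by stacking them on the head axis. The rotation parameterization and the scalar gate are simplified stand-ins for CoDA's constrained orthogonal construction:

```python
import torch
import torch.nn.functional as F

def diff_attention(q, k, v, rot, gate):
    # q, k, v: [batch, heads, seq, dim]; rot: [dim, dim] orthogonal; gate: scalar
    q_noise = q @ rot                    # second query stream via a learned rotation
    q2 = torch.cat([q, q_noise], dim=1)  # stack streams on the head axis...
    k2 = torch.cat([k, k], dim=1)
    v2 = torch.cat([v, v], dim=1)
    out = F.scaled_dot_product_attention(q2, k2, v2)  # ...so one SDPA call covers both
    signal, noise = out.chunk(2, dim=1)
    return signal - gate * noise         # subtract the gated noise stream

q = torch.randn(1, 4, 16, 32)
k = torch.randn(1, 4, 16, 32)
v = torch.randn(1, 4, 16, 32)
rot, _ = torch.linalg.qr(torch.randn(32, 32))  # random orthogonal stand-in
out = diff_attention(q, k, v, rot, gate=0.5)
```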
Value-routing for memory banks. Keys carry RoPE positional encodings, making their cosine similarity position-dependent — the same word at position 100 vs. 5000 can look nearly orthogonal in key-space. Memory banks route on Values instead (RoPE-free, pure semantic content), which is what makes deduplication and EMA blending actually work.
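A toy demonstration of the effect, using a textbook RoPE implementation (the helper below is illustrative, not the project's kernel): identical content at two distant positions stops matching in key space, while the RoPE-free values still match exactly.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

def rope(x, pos, base=10000.0):
    # Standard rotary embedding on interleaved pairs of dimensions.
    d = x.shape[-1]
    inv = base ** (-torch.arange(0, d, 2, dtype=torch.float) / d)
    ang = pos * inv
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * torch.cos(ang) - x2 * torch.sin(ang)
    out[..., 1::2] = x1 * torch.sin(ang) + x2 * torch.cos(ang)
    return out

content = torch.randn(64)            # the "same word" appearing twice
k_near = rope(content, pos=100)
k_far = rope(content, pos=5000)
key_sim = F.cosine_similarity(k_near, k_far, dim=0)
value_sim = F.cosine_similarity(content, content, dim=0)  # values skip RoPE
print(f"key-space sim: {key_sim.item():.2f}, value-space sim: {value_sim.item():.2f}")
```

Routing, dedup, and EMA blending all compare vectors by cosine similarity, so they only behave sensibly in the position-free value space.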
Two-phase training. Phase 1 trains unbounded differential attention to baseline quality. Phase 2 switches to bounded cache with gradient flow through evictions, so the write gate learns what to keep and what to discard. Without Phase 2, bounded evaluation is catastrophic (PPL jumps from 5.75 to 2,464).
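Schematically, the phase switch amounts to flipping a bounded flag mid-training. The toy model below is a stand-in (not the real architecture); it uses a soft sigmoid weighting on the kept slots so gradients still flow to the write gate under eviction:

```python
import torch
import torch.nn as nn

class ToyBoundedModel(nn.Module):
    def __init__(self, dim=16, slots=4):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        self.write_gate = nn.Linear(dim, 1)
        self.slots = slots
        self.bounded = False               # Phase 1: unbounded

    def forward(self, x):                  # x: [seq, dim]
        h = self.proj(x)
        if self.bounded:
            # Keep only the top-`slots` tokens by gate score; the soft
            # sigmoid weighting keeps gradients flowing to the gate.
            scores = self.write_gate(h).squeeze(-1)
            keep = scores.topk(self.slots).indices
            h = h[keep] * torch.sigmoid(scores[keep]).unsqueeze(-1)
        return h.mean()

model = ToyBoundedModel()
opt = torch.optim.SGD(model.parameters(), lr=0.1)
for phase in (1, 2):
    model.bounded = phase == 2             # Phase 2: flip to the bounded cache
    for _ in range(5):
        loss = model(torch.randn(32, 16)).pow(2)
        opt.zero_grad()
        loss.backward()
        opt.step()
```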
Try It
Load one of the bundled examples and ask a question. Or paste your own document, ingest it, save the state, and come back to it later.
The full source — attention module, memory banks, Triton kernels, training pipeline, and this demo — is available at github.com/anthony-maio/CoDA-GQA-L under the MIT license.
Built by @anthonym21. The CoDA-GQA-L mechanism is described in detail in the project's architecture deep dive.