Initial release: FlashMemory DS-V4 Retriever

- FlashMemoryRetriever model (retriever.py)
- Minimal demo with mock inputs (demo.py)
- Toy sparse-decode inference reference (toy_flashmemory_inference.py)
- Model weights (flashmemory_ds_v4.safetensors, ~510 MB)

Co-Authored-By: Claude Code <noreply@anthropic.com>

Files changed (8)

.gitattributes +1 -0
LICENSE +21 -0
README.md +320 -0
demo.py +133 -0
requirements.txt +3 -0
retriever.py +505 -0
toy_flashmemory_inference.py +312 -0
weights/flashmemory_ds_v4.safetensors +3 -0

.gitattributes ADDED

	@@ -0,0 +1 @@


+weights/*.safetensors filter=lfs diff=lfs merge=lfs -text

LICENSE ADDED

	@@ -0,0 +1,21 @@

+MIT License
+Copyright (c) 2026 FlashMemory Authors
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.

README.md ADDED

	@@ -0,0 +1,320 @@

+# FlashMemory DS-V4 Retriever
+A standalone, dependency-light reference implementation of the **FlashMemory DS-V4
+Retriever** — a lightweight retriever that sparsifies the **DeepSeek-V4
+Compressed-Sparse-Attention (CSA)** KV cache.
+Given the hidden state of a decode token, the retriever predicts which CSA
+KV-cache chunks (compressed keys) the upcoming tokens will attend to, so that
+only the **top-scoring chunks** need to stay resident on the GPU and the rest can
+be offloaded to CPU / disk. This recovers most of the quality of full attention
+on long-context tasks while keeping a small fraction of the KV cache on-device.
+This release contains the **algorithm + weights + a minimal, runnable PyTorch
+demo**. It depends only on `torch` (plus `numpy` / `safetensors` for convenience).
+> **Scope note.** The full sglang serving integration — KV-cache swap-in/out,
+> attention-sink, threshold fallback, per-request retriever routing — is **not**
+> included here, because it is tightly coupled to the internal DeepSeek-V4 CSA
+> framework and cannot run outside it. This repository provides the retriever
+> **algorithm reference implementation and trained weights only.**
+---
+## Model architecture
+The retriever scores each compressed-K chunk against the decode token's hidden
+state. For a single CSA layer:
+```
+hidden [B, 4096]
+    → wq_a        (4096 → Q_LORA_RANK)
+    → RMSNorm     (q_norm_weight, eps=1e-6)
+    → wq_b        (Q_LORA_RANK → N_HEADS * HEAD_DIM)
+    → reshape     [B, N_HEADS, HEAD_DIM]
+    → RoPE        (YaRN, applied to the last ROPE_DIM=64 dims, base=160000)
+    → Hadamard    (normalized Walsh-Hadamard transform)
+    → q           [B, N_HEADS, HEAD_DIM]
+hidden [B, 4096]
+    → weights_proj (4096 → N_HEADS)
+    → × weight_scale          (= HEAD_DIM^-0.5 * N_HEADS^-0.5)
+    → fused_w     [B, N_HEADS]
+compressed_k [B, N, HEAD_DIM + 4] (uint8)
+    → bytes[:HEAD_DIM]  viewed as float8_e4m3   → dequantize
+    → bytes[HEAD_DIM:]  viewed as float32        → per-chunk scale
+    → k           [B, N, HEAD_DIM]
+score_per_head = relu( einsum('bnd,bhd->bnh', k, q) )           # [B, N, N_HEADS]
+logit          = (score_per_head * fused_w[:, None, :]).sum(-1) # [B, N]
+score          = sigmoid(logit)  ∈ [0, 1]                       # [B, N]
+```
+**Hyperparameters (FlashMemory DS-V4):** `Q_LORA_RANK = 2048`, `N_HEADS = 128`,
+`HEAD_DIM = 128`, `ROPE_DIM = 64`, `ROPE_BASE = 160000`, `ROPE_FACTOR = 16`,
+`ROPE_ORIGINAL_SEQ_LEN = 65536`, `ROPE_BETA_FAST = 32`, `ROPE_BETA_SLOW = 1`,
+`RMS_NORM_EPS = 1e-6`.
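The scoring step at the bottom of the diagram can be traced by hand on toy sizes. The following is a minimal sketch in plain Python (no `torch`), assuming `q`, `k`, and `fused_w` have already been computed; `score_chunks` is a hypothetical helper name, not part of `retriever.py`.

```python
import math

def score_chunks(q, k, fused_w):
    """Toy version of the scoring formula.

    q:       [H][D] query vector per head
    k:       [N][D] dequantized key per chunk
    fused_w: [H]    per-head weight (already multiplied by weight_scale)
    Returns a list of N sigmoid scores.
    """
    scores = []
    for k_n in k:
        logit = 0.0
        for q_h, w_h in zip(q, fused_w):
            dot = sum(a * b for a, b in zip(k_n, q_h))
            logit += max(dot, 0.0) * w_h      # ReLU per head, then weighted sum
        scores.append(1.0 / (1.0 + math.exp(-logit)))
    return scores

# Toy sizes: H = 2 heads, D = 2 dims, N = 3 chunks.
q = [[1.0, 0.0], [0.0, 1.0]]
k = [[2.0, -1.0], [-3.0, -3.0], [0.5, 0.5]]
fused_w = [0.5, -1.0]
scores = score_chunks(q, k, fused_w)
```

Because the ReLU is applied before the weighted sum, a chunk whose dot products are non-positive for every head (the second chunk here) gets a logit of 0 and a score of exactly 0.5. `fused_w` can be negative, so a head can also push a score below 0.5, as it does for the third chunk.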
+### Joint multi-layer checkpoint + ensemble
+FlashMemory DS-V4 is a **joint checkpoint** holding three independent CSA layers
+(`l10`, `l12`, `l20`), each with its own weights. At inference time the per-layer
+sigmoid scores are **ensembled per chunk** — cross-layer `max` (default) or
+`mean` — to produce a single keep/drop decision per chunk.
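As an illustration of the ensemble and the keep/drop decision, here is a small sketch on plain Python lists. The helper names `ensemble` and `keep_mask` are made up for this example and take pre-computed per-layer scores; the real methods on `FlashMemoryRetriever` take `hidden`, `compressed_k`, and `positions` instead.

```python
def ensemble(per_layer, mode="max"):
    """Combine per-layer scores {layer_name: [N scores]} into one [N] list."""
    layers = list(per_layer.values())
    n = len(layers[0])
    if mode == "max":
        return [max(layer[i] for layer in layers) for i in range(n)]
    if mode == "mean":
        return [sum(layer[i] for layer in layers) / len(layers) for i in range(n)]
    raise ValueError(f"unknown ensemble mode: {mode!r}")

def keep_mask(scores, top_k=None, threshold=None):
    """Boolean keep-mask: the top_k highest scores, or every score > threshold."""
    if top_k is not None:
        order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        kept = set(order[:top_k])
        return [i in kept for i in range(len(scores))]
    if threshold is not None:
        return [s > threshold for s in scores]
    raise ValueError("pass either top_k or threshold")

per_layer = {
    "l10": [0.2, 0.9, 0.4, 0.5],
    "l12": [0.6, 0.1, 0.4, 0.5],
    "l20": [0.3, 0.2, 0.7, 0.5],
}
max_scores = ensemble(per_layer, "max")          # [0.6, 0.9, 0.7, 0.5]
top2 = keep_mask(max_scores, top_k=2)            # keeps chunks 1 and 2
above = keep_mask(max_scores, threshold=0.5)     # strict '>', so 0.5 is dropped
```

With `max`, a chunk is kept if any single layer scores it highly; with `mean`, the layers have to agree on average, so the same threshold keeps fewer chunks.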
+---
+## What is FlashMemory DS-V4?
+FlashMemory DS-V4 is part of the latest retraining generation of these retrievers. In the
+project's downstream evaluation it stays close to the full-attention baseline on
+long-context tasks (e.g. RULER, LongMemEval, LongBench V2) while keeping only a
+small fraction of the CSA KV cache on-device (≈90% KV reduction in the deployment
+sweet spot for reasoning-heavy long-context tasks). Precise-needle retrieval
+tasks need an extra threshold-fallback mechanism in the serving layer (not part
+of this standalone release).
+---
+## Installation
+```bash
+pip install -r requirements.txt
+```
+Only `torch` is strictly required to run the model and demo. The FP8 key
+dequantization uses `torch.float8_e4m3fn`, which requires PyTorch ≥ 2.1.
+---
+## Running the demo
+```bash
+python demo.py --ckpt weights/flashmemory_ds_v4.safetensors
+```
+The demo builds **random mock inputs** (a batch of decode-token hidden states, a
+set of `uint8` compressed-K chunks, and token positions), loads the FlashMemory DS-V4
+checkpoint, runs the forward pass, prints the per-layer and ensembled per-chunk
+scores, and demonstrates both **threshold** and **top-K** chunk selection.
+Useful flags:
+| Flag | Default | Meaning |
+|------|---------|---------|
+| `--device` | `cpu` | `cpu` or `cuda` |
+| `--batch` | `2` | number of decode tokens |
+| `--n-chunks` | `64` | number of compressed-K chunks |
+| `--top-k` | `16` | top-K chunks to select |
+| `--threshold` | `0.5` | sigmoid keep threshold |
+| `--ensemble` | `max` | cross-layer ensemble mode (`max` / `mean`) |
+| `--max-position` | `524288` | RoPE table length (raise to `1048576` for 1M context) |
+Example output (CPU, default args):
+```
+[demo] loaded layers=['l10', 'l12', 'l20']  n_heads=128  head_dim=128  max_position=524288
+[demo] per-layer sigmoid score stats (over all chunks):
+    l10: min=0.4474 mean=0.5021 max=0.6416
+    ...
+[demo] threshold selection (sigmoid > 0.5):
+    row 0: keep 64/64 chunks  (keep ratio 100.0%)
+    row 1: keep 49/64 chunks  (keep ratio 76.6%)
+[demo] done. ✅  forward + scoring + selection all ran.
+```
+> The scores above come from **random mock K**, so they cluster near 0.5 — they
+> are only meaningful on real CSA keys. The demo's purpose is to verify the
+> load → forward → selection path end-to-end.
+---
+## Using the model in your own code
+```python
+import torch
+from retriever import FlashMemoryRetriever
+model = FlashMemoryRetriever.from_checkpoint(
+    "weights/flashmemory_ds_v4.safetensors", device="cuda", max_position=524288
+)
+hidden       = torch.randn(B, 4096, device="cuda")          # decode-token hidden states
+compressed_k = ...                                          # [B, N, 132] uint8 CSA keys
+positions    = torch.arange(B, device="cuda")               # int64 token positions
+# Per-layer sigmoid scores: {"l10": [B, N], "l12": [B, N], "l20": [B, N]}
+per_layer = model(hidden, compressed_k, positions)
+# Cross-layer ensembled per-chunk scores [B, N] ∈ [0, 1]
+scores = model.ensemble(hidden, compressed_k, positions, mode="max")
+# Boolean keep-mask [B, N] for the chunks to keep on-device
+keep = model.select_topk(hidden, compressed_k, positions, top_k=512)        # top-K
+keep = model.select_topk(hidden, compressed_k, positions, threshold=0.5)    # threshold
+```
+**`compressed_k` format.** Each chunk is `HEAD_DIM + 4 = 132` `uint8` bytes:
+the first `128` bytes are the `float8_e4m3` quantized key values, the last `4`
+bytes are a single `float32` per-chunk scale. Dequantization is
+`fp8_values.view(float8_e4m3).float() * scale`. See `make_mock_compressed_k` in
+`demo.py` for how to construct a valid tensor.
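To make the byte layout concrete without needing `torch`, the sketch below decodes one chunk by hand with the standard library. It assumes the `float8_e4m3fn` variant (the dtype used in `retriever.py`) and a little-endian `float32` scale, which is what a byte view gives on typical hosts. `decode_e4m3fn` and `dequant_chunk` are hypothetical names for this example.

```python
import struct

def decode_e4m3fn(byte):
    """Decode one float8_e4m3fn byte: 1 sign, 4 exponent, 3 mantissa bits, bias 7."""
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 3) & 0x0F
    man = byte & 0x07
    if exp == 0x0F and man == 0x07:
        return float("nan")                        # only NaN pattern; no infinities
    if exp == 0:
        return sign * (man / 8.0) * 2.0 ** -6      # subnormal range
    return sign * (1.0 + man / 8.0) * 2.0 ** (exp - 7)

def dequant_chunk(chunk, head_dim=128):
    """chunk: head_dim fp8 bytes followed by 4 bytes of float32 scale."""
    assert len(chunk) == head_dim + 4
    (scale,) = struct.unpack("<f", bytes(chunk[head_dim:]))   # assumes little-endian
    return [decode_e4m3fn(b) * scale for b in chunk[:head_dim]]

# Toy chunk with head_dim = 4: fp8 values 1.0, -1.0, 2.0, 1.5 and scale 0.5.
chunk = bytes([0x38, 0xB8, 0x40, 0x3C]) + struct.pack("<f", 0.5)
values = dequant_chunk(chunk, head_dim=4)
```

The largest finite `float8_e4m3fn` value is 448 (byte `0x7E`), so the per-chunk scale is what lets keys with a wider dynamic range fit into one byte per dimension.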
+---
+## Weights
+**Download:** [Hugging Face](https://huggingface.co/<HF_REPO>) — `flashmemory_ds_v4.safetensors` (≈510 MB).
+```bash
+huggingface-cli download <HF_REPO> flashmemory_ds_v4.safetensors --local-dir ./weights
+python demo.py --ckpt ./weights/flashmemory_ds_v4.safetensors
+```
+`from_checkpoint` accepts either a `.pt` (`torch.save` state-dict) or a
+`.safetensors` file. The released `.safetensors` is the **slim** form: it stores
+only the four learned tensors per layer
+(`wq_a.weight`, `wq_b.weight`, `q_norm_weight`, `weights_proj.weight` for
+`l10` / `l12` / `l20`) and **omits the `freqs_cis` RoPE table** (≈400 MB), which
+is recomputed at load time from `max_position`. The slim `.safetensors` and the
+full `.pt` produce bit-for-bit identical outputs (verified by comparing outputs).
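The two sizes quoted above can be sanity-checked with a little arithmetic. This is a rough check under two assumptions that this README does not state: the learned tensors are stored as `float32`, and the full `.pt` carries one `freqs_cis` table per layer.

```python
# Shapes taken from the hyperparameters and the weight list above.
HIDDEN, Q_LORA_RANK, N_HEADS, HEAD_DIM, ROPE_DIM = 4096, 2048, 128, 128, 64
N_LAYERS = 3                 # l10, l12, l20
MAX_POSITION = 524288        # default RoPE table length

params_per_layer = (
    Q_LORA_RANK * HIDDEN                 # wq_a.weight
    + N_HEADS * HEAD_DIM * Q_LORA_RANK   # wq_b.weight
    + Q_LORA_RANK                        # q_norm_weight
    + N_HEADS * HIDDEN                   # weights_proj.weight
)

# Assumption: 4 bytes per parameter (float32 storage).
weights_bytes = N_LAYERS * params_per_layer * 4

# freqs_cis is [MAX_POSITION, ROPE_DIM // 2] complex64, i.e. 8 bytes per entry.
freqs_bytes_per_table = MAX_POSITION * (ROPE_DIM // 2) * 8

print(f"learned weights: {weights_bytes / 1e6:.0f} MB")
print(f"one RoPE table:  {freqs_bytes_per_table / 1e6:.0f} MB")
print(f"three tables:    {N_LAYERS * freqs_bytes_per_table / 1e6:.0f} MB")
```

Under those assumptions the learned weights come to about 510 MB and three RoPE tables to about 403 MB, consistent with the figures quoted for the slim file and the omitted table.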
+---
+## Files
+| File | Purpose |
+|------|---------|
+| `retriever.py` | `FlashMemoryRetriever` model + RoPE/Hadamard utils + FP8 dequant (torch-only, self-contained) |
+| `demo.py` | minimal runnable demo with mock inputs |
+| `toy_flashmemory_inference.py` | toy DeepSeek-V4-FlashMemory sparse-decode loop showing **how the retriever drives memory recall at inference time** (see below) |
+| `requirements.txt` | `torch`, `safetensors`, `numpy` |
+| `LICENSE` | MIT |
+---
+## Toy FlashMemory inference reference (`toy_flashmemory_inference.py`)
+`demo.py` shows a single `hidden → scores` call. `toy_flashmemory_inference.py`
+is the **next step up**: a tiny, fully-runnable illustration of *how the Lightning
+Indexer Retriever is used inside a DeepSeek-V4-FlashMemory style sparse-decode
+loop* to drive "memory recall".
+It is intentionally small and pedagogical. It depends only on `torch` and the
+sibling `retriever.py`, and it **reuses the real FlashMemory DS-V4 retriever verbatim** — none
+of the scoring math is re-implemented.
+### The inference flow it demonstrates
+```
+ ┌──────────┐  compress & store   ┌────────────────────────────┐
+ │ PREFILL  │  historical K/V     │  CSA KV-cache (the memory) │
+ │ (dense   │ ──────────────────► │  N compressed chunks,      │
+ │  attn)   │                     │  each = [132] uint8 fp8-K  │
+ └────┬─────┘                     └──────────────┬─────────────┘
+      │ last hidden state                        │ scored every 64 steps
+      ▼                                          │
+ ┌──────────────────────── DECODE LOOP ─────────┼──────────────────────────┐
+ │ for each decode step t:                       │                          │
+ │   hidden = toy_decoder.step(token, keep_mask) │  (sparse memory attn)   │
+ │                                               │                          │
+ │   every RETRIEVAL_INTERVAL (= 64) steps:      ▼                          │
+ │     scores[N]   = retriever.ensemble(hidden, compressed_k, pos)          │
+ │     keep_mask[N] = top-K  (or  sigmoid > threshold)  of scores           │
+ │     → chunks NOT kept are masked to -inf in the next 64 decode steps     │
+ │       of memory attention  (== "not recalled onto the GPU")             │
+ └──────────────────────────────────────────────────────────────────────────┘
+```
+1. **Prefill (dense).** A short prompt is run through dense memory attention. Its
+   last hidden state seeds the first retrieval cycle (the indexer needs a query
+   hidden state to score against). In a real run, prefill is also where the
+   historical KV is compressed into the `[N, 132]` `uint8` CSA chunks.
+2. **Decode loop.** Every step the toy decoder produces a `[B, 4096]` hidden state
+   and attends over the `N` memory chunks.
+3. **Retrieval cycle (every 64 steps).** The real `FlashMemoryRetriever` scores all
+   `N` compressed-K chunks against the current decode hidden state, ensembles the
+   per-layer (`l10`/`l12`/`l20`) sigmoid scores, and selects the chunks to keep —
+   either **top-K** or **sigmoid > threshold**. This predicts which chunks the
+   *next ~64 tokens* will attend to.
+4. **Sparse attention.** For the next 64 steps, chunks **not** selected have their
+   memory-attention logits set to `-inf`, so they contribute nothing.
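The four steps above boil down to a loop that refreshes the keep-set on a fixed interval and reuses it in between. Here is a bare-bones sketch of that control flow, assuming the retrieval cycle fires on steps 0, 64, 128, and so on (the flow above suggests this but does not pin it down). `score_fn` is a hypothetical stand-in for the retriever's ensembled scores, and `run_decode` is not a function from the toy script.

```python
def run_decode(steps, interval, score_fn, top_k):
    """Return (cycles, used).

    cycles: one frozenset of kept chunk indices per retrieval cycle
    used:   for every decode step, the index of the cycle whose mask it used
    """
    cycles, used = [], []
    for t in range(steps):
        if t % interval == 0:                       # retrieval cycle fires
            scores = score_fn(t)                    # stand-in for retriever.ensemble(...)
            order = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
            cycles.append(frozenset(order[:top_k])) # top-K chunks to keep
        used.append(len(cycles) - 1)                # steps in between reuse the mask
    return cycles, used

# Deterministic fake scores over 8 chunks, purely for illustration.
fake_scores = lambda t: [((i * 7 + t) % 11) / 11 for i in range(8)]
cycles, used = run_decode(steps=192, interval=64, score_fn=fake_scores, top_k=2)
print(len(cycles), "cycles;", used.count(0), "steps used the first mask")
```

With the default `--steps 192` and `--retrieval-interval 64`, this gives three cycles of 64 steps each, matching the `3 cycles` line in the example output further down.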
+### What the masking simulates (important)
+* This toy does **not** perform any real CPU↔GPU KV-cache transfer. The swap-in /
+  swap-out machinery is part of the internal FlashMemory engineering and is **not**
+  included in this release.
+* We **simulate memory recall by masking the FlashMemory Retriever's per-chunk
+  decisions**: a chunk the retriever did not select gets its attention logit set
+  to `-inf`. This is equivalent to *"that chunk's KV was never recalled onto the
+  GPU, so it cannot be attended to"* — for the attention output, masking a chunk
+  out and never loading it produce the same result.
+* The toy's purpose is to make the **decode-time control flow** concrete: where the
+  retriever fires, what it consumes (decode hidden state + compressed CSA keys),
+  what it produces (a keep/drop mask), and how that mask sparsifies the next
+  window of decode steps.
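The equivalence claimed above (masking a chunk to `-inf` gives the same attention output as never loading it) is easy to check numerically. The sketch below uses scalar "values" and plain Python to keep it short; `attend` is a hypothetical helper, not the toy decoder's attention.

```python
import math

def attend(logits, values, keep=None):
    """Softmax attention over chunks; chunks with keep[i] == False get logit -inf.

    At least one chunk must be kept, otherwise the softmax is undefined.
    """
    if keep is not None:
        logits = [l if m else -math.inf for l, m in zip(logits, keep)]
    peak = max(logits)
    weights = [math.exp(l - peak) for l in logits]   # exp(-inf) == 0.0
    total = sum(weights)
    return sum(w / total * v for w, v in zip(weights, values))

logits = [1.0, 2.0, 0.5, -1.0]
values = [10.0, 20.0, 30.0, 40.0]
keep = [True, False, True, False]

masked = attend(logits, values, keep)               # all four chunks present, two masked
never_loaded = attend([1.0, 0.5], [10.0, 30.0])     # only the kept chunks exist
print(masked, never_loaded)
```

Masked chunks get a softmax weight of exactly zero, so they drop out of both the normalizer and the weighted sum, which is why the toy can stand in for a real swap-out without moving any data.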
+### What it is / is NOT
+* **IS:** a minimal, torch-only illustration of the decode-time control flow that
+  drives memory recall with the real FlashMemory DS-V4 retriever.
+* **IS NOT:** a runnable DeepSeek-V4. The "decoder" is a couple of layers of
+  randomly-initialized toy attention/MLP whose only jobs are (a) to emit a
+  `[B, 4096]` hidden state for the retriever and (b) to own a memory attention we
+  can sparsify. The generated tokens are meaningless.
+> **The production version cannot be released.** It depends on the internal sglang
+> + DeepSeek-V4 CSA framework (native FP8 indexer, real compressed KV-cache,
+> attention-sink, threshold fallback, per-request routing, and the actual KV swap
+> engine). This file shows the *algorithmic role* of the retriever only.
+### Run
+```bash
+python toy_flashmemory_inference.py --ckpt weights/flashmemory_ds_v4.safetensors
+```
+Runs on CPU by default; pass `--device cuda` for GPU.
+| Flag | Default | Meaning |
+|------|---------|---------|
+| `--n-chunks` | `256` | number of CSA memory chunks (the long history) |
+| `--steps` | `192` | decode steps to generate |
+| `--retrieval-interval` | `64` | run the retriever every N steps (FlashMemory default) |
+| `--select-mode` | `topk` | `topk` or `threshold` |
+| `--top-k` | `64` | chunks to recall per cycle (`select-mode=topk`) |
+| `--threshold` | `0.5` | sigmoid keep threshold (`select-mode=threshold`) |
+| `--ensemble` | `max` | cross-layer ensemble mode (`max` / `mean`) |
+| `--batch` | `1` | parallel decode sequences |
+Example output (CPU, default args — `top-K=64` out of `256` chunks):
+```
+FlashMemory DS-V4 — toy sparse-decode loop
+[load] weights/flashmemory_ds_v4.safetensors
+[load] layers=['l10', 'l12', 'l20']  n_heads=128  head_dim=128
+[init] decoder: 2 layers, 8 heads  |  CSA memory: 256 chunks [132] uint8
+[decode] 192 steps, retriever every 64 steps (topk [top-K=64], ensemble=max)
+------------------------------------------------------------
+[cycle  0] pos     8..71    |  keep 25.0% (64/256)  |  score mean=0.4910 max=0.5445
+[cycle  1] pos    72..135   |  keep 25.0% (64/256)  |  score mean=0.4910 max=0.5445
+...
+------------------------------------------------------------
+[done] 192 tokens, 3 cycles, avg keep/cycle: 25.0%  →  ~75% CSA KV dropped
+[note] Dropped chunks are masked to -inf in attention (= KV not recalled to GPU).
+```
+> As in `demo.py`, the scores come from **random mock K** and cluster near 0.5;
+> they are only meaningful on real CSA keys. The toy's value is the *control flow*
+> — watch each retrieval cycle report how many chunks were scored, recalled, and
+> masked out.
+---
+## License
+MIT — see [`LICENSE`](./LICENSE).

demo.py ADDED

	@@ -0,0 +1,133 @@

+"""
+demo.py — minimal standalone demo for the FlashMemory DS-V4 Retriever
+=====================================================================
+Builds random mock inputs, loads the FlashMemory DS-V4 joint checkpoint, runs
+a forward pass, and prints per-chunk scores plus a top-K selection summary.
+Run::
+    python demo.py --ckpt weights/flashmemory_ds_v4.safetensors
+Runs on CPU by default; pass ``--device cuda`` to use a GPU.
+"""
+from __future__ import annotations
+import argparse
+import torch
+from retriever import FlashMemoryRetriever, dequant_compressed_k
+def make_mock_compressed_k(
+    batch: int,
+    n_chunks: int,
+    head_dim: int = 128,
+    device: str = "cpu",
+    seed: int = 0,
+) -> torch.Tensor:
+    """Construct a valid mock ``compressed_k`` tensor [B, N, head_dim + 4] uint8.
+    Layout per chunk: ``head_dim`` float8_e4m3 bytes followed by one float32 scale
+    (4 bytes). We build it the same way the real CSA cache stores it:
+      1. sample random key vectors, cast to float8_e4m3, view as uint8;
+      2. sample a small positive per-chunk scale, view its float32 as 4 uint8 bytes;
+      3. concatenate along the last dim.
+    """
+    g = torch.Generator(device=device).manual_seed(seed)
+    # 1) fp8 key bytes
+    k_vals = torch.randn(batch, n_chunks, head_dim, generator=g, device=device) * 0.5
+    k_fp8 = k_vals.to(torch.float8_e4m3fn)
+    fp8_bytes = k_fp8.view(torch.uint8)                       # [B, N, head_dim]
+    # 2) float32 per-chunk scale → 4 bytes
+    scale = (0.05 + 0.15 * torch.rand(batch, n_chunks, 1, generator=g, device=device)).float()
+    scale_bytes = scale.view(torch.uint8)                     # [B, N, 4]
+    compressed = torch.cat([fp8_bytes, scale_bytes], dim=-1)  # [B, N, head_dim + 4]
+    assert compressed.shape[-1] == head_dim + 4
+    return compressed.contiguous()
+def main():
+    ap = argparse.ArgumentParser(description="FlashMemory DS-V4 Retriever demo")
+    ap.add_argument("--ckpt", required=True,
+                    help="path to joint checkpoint (.pt or .safetensors)")
+    ap.add_argument("--device", default="cpu", help="cpu or cuda (default: cpu)")
+    ap.add_argument("--batch", type=int, default=2, help="number of decode tokens")
+    ap.add_argument("--n-chunks", type=int, default=64, help="number of compressed-K chunks")
+    ap.add_argument("--max-position", type=int, default=524288,
+                    help="RoPE table length (raise to 1048576 for 1M context)")
+    ap.add_argument("--top-k", type=int, default=16, help="top-K chunks to select")
+    ap.add_argument("--threshold", type=float, default=0.5, help="sigmoid keep threshold")
+    ap.add_argument("--ensemble", default="max", choices=["max", "mean"],
+                    help="cross-layer ensemble mode")
+    ap.add_argument("--seed", type=int, default=0)
+    args = ap.parse_args()
+    torch.manual_seed(args.seed)
+    device = args.device
+    print(f"[demo] loading checkpoint: {args.ckpt}")
+    model = FlashMemoryRetriever.from_checkpoint(
+        args.ckpt, device=device, max_position=args.max_position
+    )
+    model.eval()
+    print(f"[demo] loaded layers={model.layer_names}  n_heads={model.n_heads}  "
+          f"head_dim={model.head_dim}  max_position={model.max_position}")
+    # ── Mock inputs ─────────────────────────────────────────────────────────
+    B, N = args.batch, args.n_chunks
+    hidden = torch.randn(B, 4096, device=device, dtype=torch.float32)
+    compressed_k = make_mock_compressed_k(B, N, head_dim=model.head_dim,
+                                          device=device, seed=args.seed)
+    # token positions for each decode token (arbitrary; here spaced out)
+    positions = torch.arange(B, device=device, dtype=torch.int64) * 1000 + 4096
+    print(f"\n[demo] mock inputs: hidden={tuple(hidden.shape)} "
+          f"compressed_k={tuple(compressed_k.shape)} ({compressed_k.dtype}) "
+          f"positions={positions.tolist()}")
+    # sanity: show dequant works
+    k_float = dequant_compressed_k(compressed_k, head_dim=model.head_dim)
+    print(f"[demo] dequantized K: shape={tuple(k_float.shape)} "
+          f"mean={k_float.mean().item():+.4f} std={k_float.std().item():.4f}")
+    # ── Per-layer scores ──────────────────────────────────────────────────────
+    per_layer = model(hidden, compressed_k, positions, apply_sigmoid=True)
+    print("\n[demo] per-layer sigmoid score stats (over all chunks):")
+    for name, s in per_layer.items():
+        print(f"    {name}: min={s.min().item():.4f} mean={s.mean().item():.4f} "
+              f"max={s.max().item():.4f}")
+    # ── Cross-layer ensemble ──────────────────────────────────────────────────
+    scores = model.ensemble(hidden, compressed_k, positions, mode=args.ensemble)  # [B, N]
+    print(f"\n[demo] ensembled ({args.ensemble}) per-chunk scores [B={B}, N={N}]:")
+    for b in range(B):
+        row = scores[b]
+        preview = ", ".join(f"{v:.3f}" for v in row[:12].tolist())
+        print(f"    row {b}: [{preview}{', ...' if N > 12 else ''}]")
+    # ── Selection: threshold ──────────────────────────────────────────────────
+    keep_thr = model.select_topk(hidden, compressed_k, positions,
+                                 threshold=args.threshold, mode=args.ensemble)
+    print(f"\n[demo] threshold selection (sigmoid > {args.threshold}):")
+    for b in range(B):
+        n_keep = int(keep_thr[b].sum().item())
+        print(f"    row {b}: keep {n_keep}/{N} chunks  (keep ratio {n_keep / N:.1%})")
+    # ── Selection: top-K ──────────────────────────────────────────────────────
+    keep_topk = model.select_topk(hidden, compressed_k, positions,
+                                  top_k=args.top_k, mode=args.ensemble)
+    print(f"\n[demo] top-K selection (k={args.top_k}):")
+    for b in range(B):
+        idx = keep_topk[b].nonzero(as_tuple=True)[0].tolist()
+        print(f"    row {b}: kept chunk indices = {idx}")
+    print("\n[demo] done. ✅  forward + scoring + selection all ran.")
+if __name__ == "__main__":
+    main()

requirements.txt ADDED

	@@ -0,0 +1,3 @@

+torch>=2.1
+safetensors
+numpy

retriever.py ADDED

	@@ -0,0 +1,505 @@

+"""
+retriever.py — FlashMemory DS-V4 Retriever (standalone reference implementation)
+===============================================================================
+A self-contained, dependency-light (torch only) PyTorch reference implementation
+of the **FlashMemory Retriever** used for sparsifying the DeepSeek-V4
+Compressed-Sparse-Attention (CSA) KV cache.
+Given the hidden state of a decode token, the retriever predicts which CSA
+KV-cache chunks the next tokens will attend to, so that only the top-scoring
+chunks need to stay resident on the GPU.
+    compressed_k [B, N, 132] uint8  →  dequant  →  k  [B, N, HEAD_DIM]
+    hidden [B, 4096]  →  q-proj + RoPE + Hadamard  →  q  [B, N_HEADS, HEAD_DIM]
+                         → weights_proj  →  fused_w  [B, N_HEADS]
+    score = sigmoid( (relu(k @ q^T) · fused_w).sum(heads) )  ∈ [0, 1]
+The shipped checkpoint is a *joint* checkpoint holding three independent CSA
+layers (l10 / l12 / l20). At inference time the per-layer sigmoid scores are
+ensembled per chunk (cross-layer ``max`` by default, ``mean`` also supported).
+This file only depends on ``torch``. The full sglang serving integration
+(KV-cache swap, attention-sink, threshold fallback, per-request routing) is
+NOT part of this open release because it depends on the internal DeepSeek-V4
+CSA framework.
+"""
+from __future__ import annotations
+import math
+from collections import OrderedDict
+from typing import Dict, List, Optional, Union
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+# ─────────────────────────────────────────────────────────────────────────────
+#                       RoPE (YaRN) + Hadamard utilities
+#        (copied from the project's utils.py so this release is self-contained)
+# ─────────────────────────────────────────────────────────────────────────────
+def _yarn_find_correction_dim(n_rot: float, d_model: int, base: float, max_pos: int) -> float:
+    return (d_model * math.log(max_pos / (n_rot * 2 * math.pi))) / (2 * math.log(base))
+def precompute_freqs_cis(
+    dim: int,
+    seqlen: int,
+    base: float,
+    factor: float,
+    original_seq_len: int,
+    beta_fast: float,
+    beta_slow: float,
+) -> torch.Tensor:
+    """YaRN RoPE frequency precomputation.
+    Returns:
+        freqs_cis: [seqlen, dim // 2]  complex64
+    """
+    low = max(math.floor(_yarn_find_correction_dim(beta_fast, dim, base, original_seq_len)), 0)
+    high = min(math.ceil(_yarn_find_correction_dim(beta_slow, dim, base, original_seq_len)), dim // 2 - 1)
+    freqs = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))  # [dim//2]
+    ramp = torch.zeros(dim // 2)
+    for i in range(dim // 2):
+        if i < low:
+            ramp[i] = 0.0
+        elif i >= high:
+            ramp[i] = 1.0
+        else:
+            ramp[i] = (i - low) / max(high - low, 1)
+    mixed = freqs * (1 - ramp) + (freqs / factor) * ramp   # [dim//2]
+    t = torch.arange(seqlen, dtype=torch.float32)
+    angles = torch.outer(t, mixed)                         # [seqlen, dim//2]
+    return torch.polar(torch.ones_like(angles), angles)    # complex64
+def apply_rope(
+    q: torch.Tensor,
+    freqs_cis: torch.Tensor,
+    positions: torch.Tensor,
+    rope_dim: int = 64,
+) -> torch.Tensor:
+    """Pure-PyTorch RoPE applied to the last ``rope_dim`` dims of ``q``.
+    Args:
+        q:         [B, n_heads, head_dim]
+        freqs_cis: [max_pos, rope_dim // 2]  complex64
+        positions: [B]  int64
+        rope_dim:  number of trailing dims to rotate (applied to q[..., -rope_dim:])
+    Returns:
+        q after RoPE, same shape as input.
+    """
+    head_dim = q.shape[-1]
+    q_pass = q[..., : head_dim - rope_dim]
+    q_rope = q[..., head_dim - rope_dim:]
+    q_c = torch.view_as_complex(
+        q_rope.float().reshape(*q_rope.shape[:-1], rope_dim // 2, 2).contiguous()
+    )  # [B, H, rope_dim//2]
+    # Clamp positions into the RoPE table range. The freqs_cis table covers
+    # max_position entries; tokens beyond it get clamped to the last entry
+    # (YaRN extrapolation already makes the tail an approximation, so a few
+    # clamped ultra-long positions are far better than an out-of-bounds gather).
+    positions = positions.clamp(0, freqs_cis.shape[0] - 1)
+    freqs = freqs_cis[positions].unsqueeze(1)  # [B, 1, rope_dim//2]
+    q_rot = torch.view_as_real(q_c * freqs).reshape(*q_rope.shape).to(q.dtype)
+    return torch.cat([q_pass, q_rot], dim=-1)
+def hadamard_transform(x: torch.Tensor) -> torch.Tensor:
+    """Normalized Walsh-Hadamard transform over the last dim (must be a power of 2).
+    x: [..., d] → [..., d]  (normalized by 1/sqrt(d))
+    """
+    *leading, d = x.shape
+    assert d > 0 and (d & (d - 1)) == 0, f"last dim {d} must be a power of 2"
+    h = x.float()
+    s = 1
+    while s < d:
+        h = h.view(*leading, d // (2 * s), 2, s)
+        a, b = h[..., 0, :], h[..., 1, :]
+        h = torch.stack([a + b, a - b], dim=-2).view(*leading, d)
+        s *= 2
+    return h / math.sqrt(d)
+# ─────────────────────────────────────────────────────────────────────────────
+#                         compressed-K dequantization
+# ─────────────────────────────────────────────────────────────────────────────
+def dequant_compressed_k(compressed_k: torch.Tensor, head_dim: int = 128) -> torch.Tensor:
+    """Dequantize compressed CSA keys.
+    Each compressed key is ``head_dim + 4`` bytes:
+        bytes[:head_dim]               float8_e4m3 quantized key values (1 byte each)
+        bytes[head_dim:head_dim + 4]   a single float32 per-chunk scale
+    Args:
+        compressed_k: [..., head_dim + 4]  uint8
+        head_dim:     number of key dims (default 128)
+    Returns:
+        k: [..., head_dim]  float32   ( = fp8_values * scale )
+    """
+    assert compressed_k.dtype == torch.uint8, (
+        f"compressed_k must be uint8, got {compressed_k.dtype}"
+    )
+    assert compressed_k.shape[-1] == head_dim + 4, (
+        f"compressed_k last dim must be {head_dim + 4}, got {compressed_k.shape[-1]}"
+    )
+    fp8_bytes = compressed_k[..., :head_dim].contiguous()       # uint8 [..., head_dim]
+    k_fp8 = fp8_bytes.view(torch.float8_e4m3fn).float()         # [..., head_dim]
+    scale_bytes = compressed_k[..., head_dim:head_dim + 4].contiguous()  # uint8 [..., 4]
+    scale = scale_bytes.view(torch.float32)                     # [..., 1]
+    return k_fp8 * scale                                        # broadcast → [..., head_dim]
+# ─────────────────────────────────────────────────────────────────────────────
+#                           per-layer scorer module
+# ─────────────────────────────────────────────────────────────────────────────
+class _LayerScorer(nn.Module):
+    """Holds one CSA layer's retriever weights and computes its logits.
+    Weights are stored as (non-trainable) buffers so ``.to(device)`` / ``.half()``
+    move them along with the parent module.
+    """
+    def __init__(
+        self,
+        wq_a: torch.Tensor,          # [Q_LORA_RANK, 4096]
+        wq_b: torch.Tensor,          # [N_HEADS * HEAD_DIM, Q_LORA_RANK]
+        q_norm_weight: torch.Tensor, # [Q_LORA_RANK]
+        weights_proj: torch.Tensor,  # [N_HEADS, 4096]
+        n_heads: int,
+        head_dim: int,
+        rope_dim: int,
+        rms_norm_eps: float,
+        weight_scale: float,
+    ):
+        super().__init__()
+        self.register_buffer("wq_a", wq_a.to(torch.float32), persistent=False)
+        self.register_buffer("wq_b", wq_b.to(torch.float32), persistent=False)
+        self.register_buffer("q_norm_weight", q_norm_weight.to(torch.float32), persistent=False)
+        self.register_buffer("weights_proj", weights_proj.to(torch.float32), persistent=False)
+        self.n_heads = n_heads
+        self.head_dim = head_dim
+        self.rope_dim = rope_dim
+        self.rms_norm_eps = rms_norm_eps
+        self.weight_scale = weight_scale
+    def _rmsnorm(self, x: torch.Tensor) -> torch.Tensor:
+        x_f = x.float()
+        norm = torch.sqrt(x_f.pow(2).mean(dim=-1, keepdim=True) + self.rms_norm_eps)
+        return x_f / norm * self.q_norm_weight
+    @torch.no_grad()
+    def logits(
+        self,
+        hidden: torch.Tensor,    # [B, 4096]
+        k_float: torch.Tensor,   # [B, N, head_dim]  (already dequantized)
+        positions: torch.Tensor, # [B]  int64
+        freqs_cis: torch.Tensor, # [max_pos, rope_dim//2]  complex64
+    ) -> torch.Tensor:
+        """Return raw (pre-sigmoid) logits [B, N] for this layer."""
+        x = hidden.float()
+        B = x.shape[0]
+        # ── Q side ──────────────────────────────────────────────────────────
+        q_lora = F.linear(x, self.wq_a)               # [B, Q_LORA_RANK]
+        q_lora = self._rmsnorm(q_lora)                # [B, Q_LORA_RANK]
+        q = F.linear(q_lora, self.wq_b)               # [B, N_HEADS * HEAD_DIM]
+        q = q.view(B, self.n_heads, self.head_dim)    # [B, N_HEADS, HEAD_DIM]
+        # RoPE is applied in bf16 then cast back to float32 to match the trained
+        # / deployed scoring path exactly.
+        q = apply_rope(q.to(torch.bfloat16), freqs_cis, positions.to(torch.int64),
+                       rope_dim=self.rope_dim).float()
+        q = hadamard_transform(q)                     # [B, N_HEADS, HEAD_DIM]
+        per_head_w = F.linear(x, self.weights_proj)   # [B, N_HEADS]
+        fused_w = per_head_w * self.weight_scale      # [B, N_HEADS]
+        # ── Score: relu(k @ q^T) weighted-sum over heads ────────────────────
+        # q: [B, H, D], k_float: [B, N, D] → [B, N, H]
+        scores_per_head = F.relu(torch.einsum("bhd,bnd->bnh", q, k_float))  # [B, N, H]
+        logits = (scores_per_head * fused_w.unsqueeze(1)).sum(-1)           # [B, N]
+        return logits
+# ─────────────────────────────────────────────────────────────────────────────
+#                          FlashMemoryRetriever
+# ─────────────────────────────────────────────────────────────────────────────
+class FlashMemoryRetriever(nn.Module):
+    """Multi-layer FlashMemory retriever (joint checkpoint).
+    Loads a joint checkpoint whose state-dict keys look like
+    ``retrievers.l10.wq_a.weight``, builds one ``_LayerScorer`` per CSA layer,
+    and scores compressed-K chunks against a decode token's hidden state.
+    Typical usage::
+        model = FlashMemoryRetriever.from_checkpoint("flashmemory_ds_v4.safetensors",
+                                                      device="cuda")
+        per_layer = model(hidden_state, compressed_k, positions)  # {"l10": [B,N], ...}
+        scores = model.ensemble(hidden_state, compressed_k, positions, mode="max")  # [B,N]
+    """
+    # RoPE / normalization constants (identical across all CSA layers).
+    HEAD_DIM = 128
+    ROPE_DIM = 64
+    ROPE_BASE = 160000.0
+    ROPE_FACTOR = 16.0
+    ROPE_ORIGINAL_SEQ_LEN = 65536
+    ROPE_BETA_FAST = 32.0
+    ROPE_BETA_SLOW = 1.0
+    RMS_NORM_EPS = 1e-6
+    def __init__(
+        self,
+        layer_states: "OrderedDict[str, Dict[str, torch.Tensor]]",
+        device: Union[str, torch.device] = "cpu",
+        max_position: int = 524288,
+        head_dim: Optional[int] = None,
+    ):
+        """
+        Args:
+            layer_states: ordered mapping ``layer_name -> {"wq_a.weight": ...,
+                "wq_b.weight": ..., "q_norm_weight": ..., "weights_proj.weight": ...}``.
+                Layer names are arbitrary (e.g. ``"l10"``); ordering is preserved.
+            device: device to place the model on.
+            max_position: RoPE table length. Must cover the largest token position
+                ever scored; positions beyond it are clamped (RoPE becomes an
+                approximation). Default 524288; can be raised to 1_048_576 (1M) for
+                full-length DeepSeek-V4 contexts.
+            head_dim: key/head dimension. Defaults to ``HEAD_DIM`` (128).
+        """
+        super().__init__()
+        assert layer_states, "FlashMemoryRetriever needs at least one layer"
+        device = torch.device(device)
+        self.head_dim = head_dim if head_dim is not None else self.HEAD_DIM
+        self.max_position = max_position
+        self.layer_names: List[str] = list(layer_states.keys())
+        # Precompute the (shared) YaRN RoPE table once.
+        freqs_cis = precompute_freqs_cis(
+            dim=self.ROPE_DIM,
+            seqlen=max_position,
+            base=self.ROPE_BASE,
+            factor=self.ROPE_FACTOR,
+            original_seq_len=self.ROPE_ORIGINAL_SEQ_LEN,
+            beta_fast=self.ROPE_BETA_FAST,
+            beta_slow=self.ROPE_BETA_SLOW,
+        )
+        self.register_buffer("freqs_cis", freqs_cis, persistent=False)
+        # Build one scorer per layer.
+        self.scorers = nn.ModuleDict()
+        for name, st in layer_states.items():
+            wq_b = st["wq_b.weight"]
+            n_heads = wq_b.shape[0] // self.head_dim
+            weight_scale = self.head_dim ** -0.5 * n_heads ** -0.5
+            self.scorers[name] = _LayerScorer(
+                wq_a=st["wq_a.weight"],
+                wq_b=wq_b,
+                q_norm_weight=st["q_norm_weight"],
+                weights_proj=st["weights_proj.weight"],
+                n_heads=n_heads,
+                head_dim=self.head_dim,
+                rope_dim=self.ROPE_DIM,
+                rms_norm_eps=self.RMS_NORM_EPS,
+                weight_scale=weight_scale,
+            )
+        self.n_heads = next(iter(self.scorers.values())).n_heads
+        self.to(device)
+    # ── construction helpers ────────────────────────────────────────────────
+    @staticmethod
+    def _split_joint_state(
+        state: Dict[str, torch.Tensor],
+        layers: Optional[List[str]] = None,
+    ) -> "OrderedDict[str, Dict[str, torch.Tensor]]":
+        """Split a joint state-dict (keys ``retrievers.l{ID}.*``) into per-layer dicts."""
+        is_joint = any(k.startswith("retrievers.") for k in state.keys())
+        if not is_joint:
+            raise ValueError(
+                "State dict is not in joint 'retrievers.l{ID}.*' format. "
+                f"Got keys e.g. {list(state.keys())[:3]}"
+            )
+        found = sorted({k.split(".")[1] for k in state if k.startswith("retrievers.")})
+        use_layers = layers if layers is not None else found
+        out: "OrderedDict[str, Dict[str, torch.Tensor]]" = OrderedDict()
+        wanted = ("wq_a.weight", "wq_b.weight", "q_norm_weight", "weights_proj.weight")
+        for lname in use_layers:
+            prefix = f"retrievers.{lname}."
+            sub = {k[len(prefix):]: v for k, v in state.items() if k.startswith(prefix)}
+            if not sub:
+                raise ValueError(
+                    f"Layer {lname!r} not found in checkpoint. Available: {found}"
+                )
+            missing = [w for w in wanted if w not in sub]
+            if missing:
+                raise ValueError(f"Layer {lname!r} missing weights {missing}")
+            out[lname] = {w: sub[w] for w in wanted}
+        return out
+    @classmethod
+    def from_checkpoint(
+        cls,
+        ckpt_path: str,
+        device: Union[str, torch.device] = "cpu",
+        max_position: int = 524288,
+        layers: Optional[List[str]] = None,
+    ) -> "FlashMemoryRetriever":
+        """Load a joint checkpoint and build the retriever.
+        Supports both ``.pt`` (``torch.save`` state-dict) and ``.safetensors``
+        (HuggingFace convention). Only the learned weights (``wq_a/wq_b/
+        q_norm_weight/weights_proj``) are read; the RoPE ``freqs_cis`` table is
+        recomputed locally, so a slim ``.safetensors`` loads identically.
+        Args:
+            ckpt_path: path to the joint checkpoint (``.pt`` or ``.safetensors``).
+            device: device to load onto.
+            max_position: RoPE table length (see ``__init__``).
+            layers: optional subset of layer names (e.g. ``["l10", "l20"]``). If
+                None, all layers found in the checkpoint are used.
+        """
+        if str(ckpt_path).endswith(".safetensors"):
+            from safetensors.torch import load_file
+            state = load_file(ckpt_path, device="cpu")
+        else:
+            state = torch.load(ckpt_path, map_location="cpu", weights_only=True)
+        layer_states = cls._split_joint_state(state, layers=layers)
+        return cls(layer_states, device=device, max_position=max_position)
+    # ── inference ────────────────────────────────────────────────────────────
+    @torch.no_grad()
+    def forward(
+        self,
+        hidden_state: torch.Tensor,   # [B, 4096]
+        compressed_k: torch.Tensor,   # [B, N, head_dim + 4]  uint8
+        positions: torch.Tensor,      # [B]  int64
+        apply_sigmoid: bool = True,
+    ) -> "OrderedDict[str, torch.Tensor]":
+        """Score the compressed-K chunks with every CSA layer.
+        Args:
+            hidden_state: [B, 4096] decode-token hidden states.
+            compressed_k: [B, N, head_dim + 4] uint8 compressed keys (shared across
+                layers in this reference impl — see note below).
+            positions: [B] int64 token positions (for RoPE).
+            apply_sigmoid: if True (default) return sigmoid scores ∈ [0, 1];
+                if False return raw logits.
+        Returns:
+            OrderedDict ``{layer_name: scores [B, N]}``.
+        Note:
+            In the production DeepSeek-V4 CSA system each layer has its *own*
+            compressed-K buffer. This reference impl scores all layers against the
+            single ``compressed_k`` you pass, which is the right behavior for the
+            standalone algorithm demo. If you have per-layer K, call this once per
+            layer with that layer's K, or use ``score_layer``.
+        """
+        device = self.freqs_cis.device
+        hidden_state = hidden_state.to(device)
+        compressed_k = compressed_k.to(device)
+        positions = positions.to(device)
+        k_float = dequant_compressed_k(compressed_k, head_dim=self.head_dim)  # [B, N, D]
+        out: "OrderedDict[str, torch.Tensor]" = OrderedDict()
+        for name, scorer in self.scorers.items():
+            logits = scorer.logits(hidden_state, k_float, positions, self.freqs_cis)
+            out[name] = torch.sigmoid(logits) if apply_sigmoid else logits
+        return out
+    @torch.no_grad()
+    def score_layer(
+        self,
+        layer_name: str,
+        hidden_state: torch.Tensor,
+        compressed_k: torch.Tensor,
+        positions: torch.Tensor,
+        apply_sigmoid: bool = True,
+    ) -> torch.Tensor:
+        """Score a single layer (useful when each layer has its own K)."""
+        device = self.freqs_cis.device
+        k_float = dequant_compressed_k(compressed_k.to(device), head_dim=self.head_dim)
+        logits = self.scorers[layer_name].logits(
+            hidden_state.to(device), k_float, positions.to(device), self.freqs_cis
+        )
+        return torch.sigmoid(logits) if apply_sigmoid else logits
+    @torch.no_grad()
+    def ensemble(
+        self,
+        hidden_state: torch.Tensor,
+        compressed_k: torch.Tensor,
+        positions: torch.Tensor,
+        mode: str = "max",
+    ) -> torch.Tensor:
+        """Cross-layer ensemble of per-chunk sigmoid scores.
+        Args:
+            mode: ``"max"`` (default) or ``"mean"`` over the per-layer sigmoid
+                scores, per chunk.
+        Returns:
+            scores [B, N]  ∈ [0, 1].
+        """
+        if mode not in ("max", "mean"):
+            raise ValueError(f"unknown ensemble mode: {mode!r}")
+        per_layer = self.forward(hidden_state, compressed_k, positions, apply_sigmoid=True)
+        stacked = torch.stack(list(per_layer.values()), dim=0)  # [L, B, N]
+        if mode == "max":
+            return stacked.amax(dim=0)
+        return stacked.mean(dim=0)
+    @torch.no_grad()
+    def select_topk(
+        self,
+        hidden_state: torch.Tensor,
+        compressed_k: torch.Tensor,
+        positions: torch.Tensor,
+        top_k: Optional[int] = None,
+        threshold: Optional[float] = None,
+        mode: str = "max",
+    ) -> torch.Tensor:
+        """Return a boolean keep-mask [B, N] of selected chunks.
+        Exactly one of ``top_k`` / ``threshold`` must be given; otherwise a
+        ``ValueError`` is raised. With ``top_k`` the top-k highest-scoring chunks
+        per row are kept; with ``threshold`` all chunks whose ensembled sigmoid
+        score exceeds the threshold are kept.
+        """
+        # Validate arguments before running the (comparatively expensive) scoring pass.
+        if (top_k is None) == (threshold is None):
+            raise ValueError("Provide exactly one of top_k or threshold")
+        scores = self.ensemble(hidden_state, compressed_k, positions, mode=mode)  # [B, N]
+        B, N = scores.shape
+        if threshold is not None:
+            return scores > threshold
+        k = min(top_k, N)
+        keep = torch.zeros(B, N, dtype=torch.bool, device=scores.device)
+        idx = scores.topk(k, dim=-1).indices
+        keep.scatter_(1, idx, True)
+        return keep
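
Read together, `_LayerScorer.logits`, `forward`, and `ensemble` above amount to the following per-chunk score. This is a summary sketch derived only from the code shown in this diff. The symbols ($x$, $k_n$, $W_a$, $W_b$, $W_p$, $g$, $\mathrm{Had}$) are notation introduced here for illustration, and the internals of `apply_rope` and `hadamard_transform` are not expanded.

```latex
\begin{aligned}
q_h &= \mathrm{Had}\Big(\mathrm{RoPE}_{p}\big(\big[W_b\,\mathrm{RMSNorm}_g(W_a x)\big]_h\big)\Big), & h &= 1,\dots,H \\
w_h &= (W_p x)_h \cdot D^{-1/2}\, H^{-1/2} \\
\ell_n &= \sum_{h=1}^{H} w_h \,\max\!\big(0,\ \langle q_h, k_n \rangle\big), & n &= 1,\dots,N \\
s_n &= \sigma(\ell_n), \qquad
s_n^{\mathrm{ens}} = \max_{l}\ \sigma\big(\ell_n^{(l)}\big)
\quad\text{or}\quad \frac{1}{L}\sum_{l=1}^{L} \sigma\big(\ell_n^{(l)}\big)
\end{aligned}
```

Here $D$ is `head_dim` (128), $H$ is `n_heads`, $p$ is the token position passed in `positions`, $k_n$ is the dequantized key of chunk $n$, and $l$ ranges over the $L$ CSA layers found in the checkpoint.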

toy_flashmemory_inference.py ADDED

	@@ -0,0 +1,312 @@

+"""
+toy_flashmemory_inference.py — Toy sparse-decode loop driven by the FlashMemory Retriever
+=========================================================================================
+A minimal, torch-only illustration of how the FlashMemory Retriever controls CSA
+memory recall during decode. Every 64 steps the retriever scores all N compressed-K
+chunks against the current decode hidden state, selects the top-K (or thresholded)
+ones to keep, and the rest are masked from attention — exactly as if their KV were
+never recalled onto the GPU.
+This is NOT a real DeepSeek-V4. The "decoder" is a few toy layers with random
+weights. But the retriever, its scoring math, and the decode-time control flow
+are all real.
+Run::
+    python toy_flashmemory_inference.py --ckpt weights/flashmemory_ds_v4.safetensors
+"""
+from __future__ import annotations
+import argparse
+import math
+import os
+import sys
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+# Ensure sibling retriever.py is importable (works from any cwd).
+sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
+from retriever import FlashMemoryRetriever, dequant_compressed_k  # noqa: E402
+HIDDEN_DIM = 4096  # fixed: the retriever consumes a [B, 4096] decode hidden state
+# ─────────────────────────────────────────────────────────────────────────────
+#   Mock CSA KV-cache:  N compressed chunks, each [head_dim + 4] uint8
+#   (this is the *indexer's* quantized-K representation that the retriever scores)
+# ─────────────────────────────────────────────────────────────────────────────
+def make_mock_compressed_k(
+    batch: int,
+    n_chunks: int,
+    head_dim: int = 128,
+    device: str = "cpu",
+    seed: int = 0,
+) -> torch.Tensor:
+    """Build a valid mock ``compressed_k`` tensor ``[B, N, head_dim + 4]`` uint8.
+    This mirrors how the real CSA cache stores a compressed key per chunk:
+        bytes[:head_dim]              : float8_e4m3fn quantized key values (1 byte each)
+        bytes[head_dim:head_dim + 4]  : one float32 per-chunk dequant scale
+    In a real FlashMemory run these bytes are produced during *prefill*, when the
+    historical KV is compressed and stored. Here we just sample them randomly —
+    the retriever still runs its exact scoring path over them.
+    """
+    g = torch.Generator(device=device).manual_seed(seed)
+    # 1) fp8 key bytes
+    k_vals = torch.randn(batch, n_chunks, head_dim, generator=g, device=device) * 0.5
+    fp8_bytes = k_vals.to(torch.float8_e4m3fn).view(torch.uint8)          # [B, N, head_dim]
+    # 2) float32 per-chunk scale → 4 uint8 bytes
+    scale = (0.05 + 0.15 * torch.rand(batch, n_chunks, 1, generator=g, device=device)).float()
+    scale_bytes = scale.view(torch.uint8)                                 # [B, N, 4]
+    compressed = torch.cat([fp8_bytes, scale_bytes], dim=-1)              # [B, N, head_dim + 4]
+    assert compressed.shape[-1] == head_dim + 4
+    return compressed.contiguous()
+# ─────────────────────────────────────────────────────────────────────────────
+#   Toy decoder (random weights).  Only exists to emit a [B,4096] hidden state
+#   each step and own a memory cross-attention over N CSA chunks that the
+#   retriever's keep-mask sparsifies.
+# ─────────────────────────────────────────────────────────────────────────────
+def _rmsnorm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
+    norm = torch.rsqrt(x.float().pow(2).mean(-1, keepdim=True) + eps)
+    return (x.float() * norm).to(x.dtype) * weight
+class ToyMemoryDecoder(nn.Module):
+    """A few layers of toy memory cross-attention + MLP (random weights)."""
+    def __init__(
+        self,
+        n_chunks: int,
+        n_layers: int = 2,
+        n_heads: int = 8,
+        vocab_size: int = 512,
+        device: str = "cpu",
+        seed: int = 0,
+    ):
+        super().__init__()
+        torch.manual_seed(seed)
+        self.hidden_dim = HIDDEN_DIM
+        self.n_layers = n_layers
+        self.n_heads = n_heads
+        self.head_dim = self.hidden_dim // n_heads
+        self.n_chunks = n_chunks
+        # Token embedding (toy; vocab is meaningless).
+        self.embed = nn.Embedding(vocab_size, self.hidden_dim)
+        # Decoder-space memory bank: one vector per CSA chunk (separate from the
+        # retriever's compressed_k — both index the same N chunks).
+        self.register_buffer("memory", torch.randn(n_chunks, self.hidden_dim) * 0.02)
+        # Per-layer projections + norms.
+        self.wq = nn.ModuleList(nn.Linear(self.hidden_dim, self.hidden_dim, bias=False) for _ in range(n_layers))
+        self.wk = nn.ModuleList(nn.Linear(self.hidden_dim, self.hidden_dim, bias=False) for _ in range(n_layers))
+        self.wv = nn.ModuleList(nn.Linear(self.hidden_dim, self.hidden_dim, bias=False) for _ in range(n_layers))
+        self.wo = nn.ModuleList(nn.Linear(self.hidden_dim, self.hidden_dim, bias=False) for _ in range(n_layers))
+        self.mlp_up = nn.ModuleList(nn.Linear(self.hidden_dim, 2 * self.hidden_dim, bias=False) for _ in range(n_layers))
+        self.mlp_down = nn.ModuleList(nn.Linear(2 * self.hidden_dim, self.hidden_dim, bias=False) for _ in range(n_layers))
+        self.attn_norm = nn.ParameterList(nn.Parameter(torch.ones(self.hidden_dim)) for _ in range(n_layers))
+        self.mlp_norm = nn.ParameterList(nn.Parameter(torch.ones(self.hidden_dim)) for _ in range(n_layers))
+        self.final_norm = nn.Parameter(torch.ones(self.hidden_dim))
+        self.lm_head = nn.Linear(self.hidden_dim, vocab_size, bias=False)
+        self.to(device)
+        self.eval()
+    @torch.no_grad()
+    def _memory_attention(self, x: torch.Tensor, layer: int, keep_mask: torch.Tensor | None) -> torch.Tensor:
+        """Cross-attention of the current token(s) over the N memory chunks.
+        Args:
+            x:         [B, hidden] current-token hidden state(s).
+            keep_mask: [B, N] bool, True = chunk recalled/kept. ``None`` = keep all
+                       (the dense path used during prefill / cold-start).
+        Chunks with ``keep_mask == False`` get their attention logit set to
+        ``-inf`` → softmax weight 0 → they contribute nothing. THIS is our
+        simulation of "the chunk was not recalled onto the GPU".
+        """
+        B = x.shape[0]
+        H, D = self.n_heads, self.head_dim
+        q = self.wq[layer](x).view(B, H, 1, D)                              # [B, H, 1, D]
+        k = self.wk[layer](self.memory).view(self.n_chunks, H, D).permute(1, 0, 2)  # [H, N, D]
+        v = self.wv[layer](self.memory).view(self.n_chunks, H, D).permute(1, 0, 2)  # [H, N, D]
+        # [B, H, 1, N] attention logits over the N memory chunks.
+        logits = torch.einsum("bhqd,hnd->bhqn", q, k) / math.sqrt(D)
+        if keep_mask is not None:
+            # Broadcast [B, N] → [B, 1, 1, N] and mask the dropped chunks.
+            drop = ~keep_mask.view(B, 1, 1, self.n_chunks)
+            logits = logits.masked_fill(drop, float("-inf"))
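+        attn = torch.softmax(logits, dim=-1)                                # [B, H, 1, N]
+        # If a row keeps no chunks (possible in threshold mode), every logit is -inf
+        # and softmax returns NaN; treat that as "attend to nothing".
+        attn = torch.nan_to_num(attn, nan=0.0)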
+        out = torch.einsum("bhqn,hnd->bhqd", attn, v).reshape(B, self.hidden_dim)
+        return self.wo[layer](out)
+    @torch.no_grad()
+    def step(
+        self,
+        token_ids: torch.Tensor,          # [B] int64
+        keep_mask: torch.Tensor | None,   # [B, N] bool, or None for dense
+    ) -> tuple[torch.Tensor, torch.Tensor]:
+        """One decode step. Returns (hidden [B, 4096], next-token logits [B, vocab])."""
+        x = self.embed(token_ids)                                           # [B, hidden]
+        for layer in range(self.n_layers):
+            x = x + self._memory_attention(_rmsnorm(x, self.attn_norm[layer]), layer, keep_mask)
+            h = _rmsnorm(x, self.mlp_norm[layer])
+            x = x + self.mlp_down[layer](F.gelu(self.mlp_up[layer](h)))
+        hidden = _rmsnorm(x, self.final_norm)                               # [B, 4096] ← feeds retriever
+        return hidden, self.lm_head(hidden)
+    @torch.no_grad()
+    def prefill(self, prefill_ids: torch.Tensor) -> torch.Tensor:
+        """Toy 'prefill': run a short prompt through DENSE memory attention.
+        Returns the last token's hidden state, which seeds the very first
+        retrieval cycle (the indexer needs a query hidden state to score against).
+        Prefill is intentionally dense (keep_mask=None): the model sees the whole
+        history before decoding begins.
+        """
+        hidden = None
+        for t in range(prefill_ids.shape[1]):
+            hidden, _ = self.step(prefill_ids[:, t], keep_mask=None)
+        return hidden                                                       # [B, 4096]
+# ─────────────────────────────────────────────────────────────────────────────
+#   Retrieval helper: scores → keep-mask (top-K or threshold)
+# ─────────────────────────────────────────────────────────────────────────────
+def scores_to_keep_mask(
+    scores: torch.Tensor,           # [B, N] sigmoid scores ∈ [0, 1]
+    select_mode: str,
+    top_k: int,
+    threshold: float,
+) -> torch.Tensor:
+    """Turn per-chunk retriever scores into a boolean keep-mask [B, N]."""
+    B, N = scores.shape
+    if select_mode == "topk":
+        k = min(top_k, N)
+        keep = torch.zeros(B, N, dtype=torch.bool, device=scores.device)
+        idx = scores.topk(k, dim=-1).indices
+        keep.scatter_(1, idx, True)
+        return keep
+    elif select_mode == "threshold":
+        return scores > threshold
+    raise ValueError(f"unknown select_mode: {select_mode!r}")
+# ─────────────────────────────────────────────────────────────────────────────
+#                                   main
+# ─────────────────────────────────────────────────────────────────────────────
+def main():
+    ap = argparse.ArgumentParser(
+        description="Toy DeepSeek-V4-FlashMemory sparse-decode loop driven by the FlashMemory Retriever"
+    )
+    ap.add_argument("--ckpt", required=True,
+                    help="path to the FlashMemory DS-V4 joint checkpoint (.pt or .safetensors)")
+    ap.add_argument("--device", default="cpu", help="cpu or cuda (default: cpu)")
+    ap.add_argument("--batch", type=int, default=1, help="number of parallel decode sequences")
+    ap.add_argument("--n-chunks", type=int, default=256, help="number of CSA memory chunks (the long history)")
+    ap.add_argument("--steps", type=int, default=192, help="number of decode steps to generate")
+    ap.add_argument("--retrieval-interval", type=int, default=64,
+                    help="run the retriever every N decode steps (FlashMemory default 64)")
+    ap.add_argument("--select-mode", default="topk", choices=["topk", "threshold"],
+                    help="how to turn scores into a keep-mask")
+    ap.add_argument("--top-k", type=int, default=64, help="chunks to recall per cycle (select-mode=topk)")
+    ap.add_argument("--threshold", type=float, default=0.5, help="sigmoid keep threshold (select-mode=threshold)")
+    ap.add_argument("--ensemble", default="max", choices=["max", "mean"], help="cross-layer ensemble mode")
+    ap.add_argument("--max-position", type=int, default=524288, help="RoPE table length")
+    ap.add_argument("--n-layers", type=int, default=2, help="toy decoder layers")
+    ap.add_argument("--seed", type=int, default=0)
+    args = ap.parse_args()
+    torch.manual_seed(args.seed)
+    device = args.device
+    B, N = args.batch, args.n_chunks
+    # ── 1. Load retriever ──────────────────────────────────────────────────────
+    print("FlashMemory DS-V4: toy sparse-decode loop")
+    print(f"[load] {args.ckpt}")
+    retriever = FlashMemoryRetriever.from_checkpoint(
+        args.ckpt, device=device, max_position=args.max_position
+    )
+    retriever.eval()
+    print(f"[load] layers={retriever.layer_names}  n_heads={retriever.n_heads}  "
+          f"head_dim={retriever.head_dim}")
+    # ── 2. Build toy decoder + mock CSA memory ─────────────────────────────────
+    decoder = ToyMemoryDecoder(n_chunks=N, n_layers=args.n_layers, device=device, seed=args.seed)
+    compressed_k = make_mock_compressed_k(B, N, head_dim=retriever.head_dim,
+                                          device=device, seed=args.seed)
+    print(f"[init] decoder: {args.n_layers} layers, {decoder.n_heads} heads  |  "
+          f"CSA memory: {N} chunks [{retriever.head_dim + 4}] uint8")
+    # ── 3. Prefill ─────────────────────────────────────────────────────────────
+    prefill_len = 8
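+    # Sample prompt ids from the decoder's actual vocab instead of a hard-coded 512.
+    prefill_ids = torch.randint(0, decoder.embed.num_embeddings, (B, prefill_len), device=device)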
+    last_hidden = decoder.prefill(prefill_ids)
+    base_pos = prefill_len
+    last_pos = torch.full((B,), prefill_len - 1, dtype=torch.int64, device=device)
+    sel_desc = (f"top-K={args.top_k}" if args.select_mode == "topk"
+                else f"sigmoid>{args.threshold}")
+    print(f"\n[decode] {args.steps} steps, retriever every {args.retrieval_interval} steps "
+          f"({args.select_mode} [{sel_desc}], ensemble={args.ensemble})")
+    print("-" * 60)
+    # ── 4. Decode loop ──────────────────────────────────────────────────────────
+    keep_mask = None
+    token = decoder.embed.weight.new_zeros(B, dtype=torch.int64)
+    keep_ratios: list[float] = []
+    cycle = 0
+    for t in range(args.steps):
+        abs_pos = base_pos + t
+        if t % args.retrieval_interval == 0:
+            scores = retriever.ensemble(last_hidden, compressed_k, last_pos, mode=args.ensemble)
+            keep_mask = scores_to_keep_mask(scores, args.select_mode, args.top_k, args.threshold)
+            n_keep = keep_mask.sum(-1)
+            ratio = (n_keep.float() / N)
+            keep_ratios.extend(ratio.tolist())
+            w_lo = abs_pos
+            w_hi = min(abs_pos + args.retrieval_interval, base_pos + args.steps) - 1
+            print(f"[cycle {cycle:>2}] pos {w_lo:>5}..{w_hi:<5}  |  "
+                  f"keep {fmt_ratio(ratio, B)} ({int(n_keep[0])}/{N})  |  "
+                  f"score mean={scores.mean():.4f} max={scores.max():.4f}")
+            cycle += 1
+        hidden, logits = decoder.step(token, keep_mask)
+        token = logits.argmax(-1)
+        last_hidden = hidden
+        last_pos = torch.full((B,), abs_pos, dtype=torch.int64, device=device)
+    # ── 5. Summary ─────────────────────────────────────────────────────────────
+    avg_keep = sum(keep_ratios) / max(len(keep_ratios), 1)
+    print("-" * 60)
+    print(f"[done] {args.steps} tokens, {cycle} cycles, "
+          f"avg keep/cycle: {avg_keep:.1%}  →  ~{1 - avg_keep:.0%} CSA KV dropped")
+    print(f"[note] Dropped chunks are masked to -inf in attention (= KV not recalled to GPU).  "
+          f"Production swap engine not included in this release.")
+def fmt_ratio(t: torch.Tensor, B: int) -> str:
+    vals = t.tolist()
+    return f"{vals[0]:.1%}" if B == 1 else "[" + ", ".join(f"{v:.1%}" for v in vals) + "]"
+if __name__ == "__main__":
+    main()
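
The byte layout described in `make_mock_compressed_k` can be unpacked without torch. The sketch below is illustrative only: `decode_e4m3fn` and `dequant_chunk` are hypothetical helpers, not part of this release. It assumes the scale is stored little-endian and that dequantization is a plain `value * scale`; this diff does not show that logic, which lives in `dequant_compressed_k`.

```python
import math
import struct
from typing import List


def decode_e4m3fn(byte: int) -> float:
    """Decode one float8 E4M3FN byte: 1 sign, 4 exponent (bias 7), 3 mantissa bits."""
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 3) & 0x0F
    man = byte & 0x07
    if exp == 0x0F and man == 0x07:
        return math.nan                          # only NaN pattern; E4M3FN has no inf
    if exp == 0:
        return sign * (man / 8.0) * 2.0 ** -6    # subnormal range
    return sign * (1.0 + man / 8.0) * 2.0 ** (exp - 7)


def dequant_chunk(chunk: bytes, head_dim: int = 128) -> List[float]:
    """Hypothetical helper: split one chunk into fp8 values and its float32 scale."""
    assert len(chunk) == head_dim + 4
    # Assumption: the 4 trailing bytes are a little-endian float32 scale.
    (scale,) = struct.unpack("<f", chunk[head_dim:head_dim + 4])
    # Assumption: dequantized value = fp8 value * per-chunk scale.
    return [decode_e4m3fn(b) * scale for b in chunk[:head_dim]]


# A 4-dim chunk: fp8 bytes for [1.0, -1.0, 2.0, 0.5] followed by scale 0.25.
chunk = bytes([0x38, 0xB8, 0x40, 0x30]) + struct.pack("<f", 0.25)
values = dequant_chunk(chunk, head_dim=4)
print(values)
```

For the real `head_dim` of 128, each chunk is 132 bytes, matching the `[B, N, head_dim + 4]` shape that `FlashMemoryRetriever.forward` expects.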

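The decode loop in `main()` runs the retriever once per `--retrieval-interval` steps and reuses the resulting keep-mask until the next cycle. As a rough sketch of that schedule, the hypothetical helper `retrieval_windows` below reproduces the `w_lo` / `w_hi` arithmetic from the loop for the script's defaults (192 steps, interval 64, 8 prefill tokens).

```python
from typing import List, Tuple


def retrieval_windows(steps: int, interval: int, base_pos: int) -> List[Tuple[int, int, int]]:
    """Return (cycle, first_pos, last_pos) for every retrieval cycle of the decode loop."""
    windows = []
    for cycle, t in enumerate(range(0, steps, interval)):
        lo = base_pos + t                                   # abs_pos at the scoring step
        hi = min(lo + interval, base_pos + steps) - 1       # last position the mask covers
        windows.append((cycle, lo, hi))
    return windows


windows = retrieval_windows(steps=192, interval=64, base_pos=8)
for cycle, lo, hi in windows:
    print(f"cycle {cycle}: positions {lo}..{hi}")
```

With the default `--top-k 64` and `--n-chunks 256`, each of these cycles keeps 64 of 256 chunks, that is 25% of the CSA memory.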
weights/flashmemory_ds_v4.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:ba20d264c309246f824d4471ccc637061b3b0268fe8e4eecc121474a1e5cd02a
+size 509633992
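
The three lines above are a Git LFS pointer, not the weights themselves. After fetching the real file, its size and SHA-256 digest can be checked against the pointer. The sketch below uses only the standard library; `parse_lfs_pointer` and `matches_pointer` are hypothetical helper names, not part of this release.

```python
import hashlib
import os


def parse_lfs_pointer(text: str) -> dict:
    """Parse the 'key value' lines of a Git LFS pointer file."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(path: str, pointer: dict, block: int = 1 << 20) -> bool:
    """Return True if the file at `path` has the pointer's size and SHA-256 digest."""
    if os.path.getsize(path) != pointer["size"]:
        return False
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for part in iter(lambda: f.read(block), b""):   # stream in 1 MiB blocks
            h.update(part)
    return h.hexdigest() == pointer["digest"]


POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:ba20d264c309246f824d4471ccc637061b3b0268fe8e4eecc121474a1e5cd02a
size 509633992
"""
pointer = parse_lfs_pointer(POINTER)
print(pointer["size"], pointer["algo"])
```

Calling `matches_pointer("weights/flashmemory_ds_v4.safetensors", pointer)` on a checkout where the file is still the small pointer stub returns `False`, which is a quick way to notice that the LFS download has not happened yet.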