Harvest exact M-fold attention from n1
Changed files:

- AGILLM-4.md (+27 -0)
- N1_HARVEST.md (+78 -0)
- README.md (+3 -0)
- local_sweep_after_mfold_sdpa.json (+22 -0)
- local_sweep_after_mfold_sublinear.json (+22 -0)
- local_verify_m_fold_agillm4.json (+118 -0)
- nB300_agillm4.py (+55 -2)
- verify_m_fold_agillm4.py (+162 -0)
AGILLM-4.md
CHANGED
@@ -142,6 +142,33 @@ python /workspace/agillm-4/block_sweep_agillm4.py \
 If it is stable, then compare it against SDPA at the same block size with
 `profile_agillm4.py`.
 
+## n1.py Harvest
+
+AGILLM-4 is now starting to import exact, proof-backed improvements from
+`C:\Users\Scott\Downloads\n1.py` while keeping the AGILLM-4 model-scale and
+long-context branch intact.
+
+First harvested feature: exact M-fold expansion attention. When `rank > d_k`,
+the trainer now uses:
+
+```text
+(q @ U) @ (k @ U).T == q @ (U @ U.T) @ k.T
+```
+
+This preserves the function while keeping score/cache key width at `d_k`
+instead of `rank`. The inference path caches `U @ U.T`; the training path
+recomputes it so gradients through `U` remain exact.
+
+Verification:
+
+```bash
+python /workspace/agillm-4/verify_m_fold_agillm4.py \
+  --presets pico_1x,micro_3x \
+  --backends manual,sdpa
+```
+
+See `N1_HARVEST.md` for the staged port order.
+
 ## Intelligence per FLOP
 
 Compute reduction is not enough by itself. AGILLM-4 should spend saved FLOPs on
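The folded form rests entirely on the identity above, which is easy to sanity-check outside the trainer. A minimal standalone sketch (shapes illustrative, float64 so the comparison is tight):

```python
# Check (q @ U) @ (k @ U).T == q @ (U @ U.T) @ k.T for an expansion rank r > d_k.
import torch

torch.manual_seed(0)
d_k, rank, n = 16, 48, 10
q = torch.randn(n, d_k, dtype=torch.float64)
k = torch.randn(n, d_k, dtype=torch.float64)
U = torch.randn(d_k, rank, dtype=torch.float64)

expanded = (q @ U) @ (k @ U).T  # scores computed at rank width
folded = q @ (U @ U.T) @ k.T    # same scores, never leaving d_k width
assert torch.allclose(expanded, folded)
```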
N1_HARVEST.md
ADDED
# n1.py Harvest Plan

Goal: move proven, no-quality-loss trainer improvements from
`C:\Users\Scott\Downloads\n1.py` into AGILLM-4 without replacing the AGILLM-4
long-context/model-scale branch.

## Ported

### 1. Exact M-Fold Expansion Attention

Status: done.

For ranks where `rank > d_k`, AGILLM-4 now computes:

```text
(q @ U) @ (k @ U).T == q @ (U @ U.T) @ k.T
```

This keeps attention scores and KV-cache keys at `d_k` width instead of
`rank` width while preserving the exact expanded-attention function. The
training path recomputes `U @ U.T` with gradients, and the inference/no-grad
path caches the metric until `U` changes.

Verification:

```bash
python agillm-4/verify_m_fold_agillm4.py \
  --presets pico_1x,micro_3x \
  --backends manual,sdpa \
  --cached_len 8 \
  --new_len 4
```

The verifier checks forward output, loss, input gradients, parameter gradients,
cached append equivalence, cache key width, and metric-cache invalidation.

## Next Candidates

### 2. Fused QKV Projection

n1 fuses separate `q/k/v` linear layers into one `qkv` linear while keeping
checkpoint compatibility by folding old state-dict keys on load. This should be
the next port after M-fold because it reduces three projection GEMMs to one.
A rough sketch of the key folding follows.

Risk: checkpoint key migration. Keep this separate from the M-fold port.

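The fold at load time could look like this; the key names here are illustrative assumptions, not the actual n1 or AGILLM-4 state-dict layout:

```python
# Hypothetical migration: concatenate legacy q/k/v weights into one fused
# qkv weight. nn.Linear stores (out_features, in_features), so the fused
# projection is a concatenation along dim 0.
import torch

def fold_qkv_keys(state_dict: dict, prefix: str = "attn.") -> dict:
    q_key, k_key, v_key = (f"{prefix}{n}.weight" for n in ("q", "k", "v"))
    if q_key in state_dict:  # legacy checkpoint detected
        state_dict[f"{prefix}qkv.weight"] = torch.cat(
            [state_dict.pop(q_key), state_dict.pop(k_key), state_dict.pop(v_key)],
            dim=0,
        )
    return state_dict
```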
### 3. Combined ALiBi + Mask Cache

n1 pre-folds ALiBi into the mask once per encoder forward instead of rebuilding
the same layer-independent bias in every block.

Risk: cache semantics differ for KV decode, where the ALiBi slice changes.

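A sketch of the pre-fold, assuming the standard ALiBi slope schedule (the real n1 helper may differ):

```python
# Build causal mask + ALiBi bias once per forward; every block adds the same
# tensor to its scores, so there is no reason to rebuild it per layer.
import torch

def combined_alibi_mask(n_heads: int, seq: int, device=None) -> torch.Tensor:
    slopes = torch.tensor(
        [2.0 ** (-8.0 * (i + 1) / n_heads) for i in range(n_heads)], device=device
    )
    pos = torch.arange(seq, device=device)
    rel = pos.view(1, -1) - pos.view(-1, 1)            # k_pos - q_pos
    bias = slopes.view(-1, 1, 1) * rel                 # negative into the past
    causal = torch.where(rel > 0, float("-inf"), 0.0)  # block future keys
    return (bias + causal).unsqueeze(0)                # (1, heads, seq, seq)
```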
### 4. SAT Speculative Inference

n1 has proof-covered SAT-draft / AR-verify speculative decoding. This belongs
in AGILLM-4 after the SFT result tells us whether chat turns are sane.

Risk: inference control flow and cache rollback complexity.

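The control-flow shape, reduced to its core; `draft_tokens` and `verify_logits` are placeholder callables, and n1's actual accept rule and cache rollback are more involved:

```python
# Greedy draft-and-verify step: accept the drafted prefix the AR model agrees
# with, and replace the first disagreement with the AR model's own token.
def speculative_step(draft_tokens, verify_logits, ctx, k: int = 4):
    draft = draft_tokens(ctx, k)        # k proposed token ids (SAT draft)
    logits = verify_logits(ctx, draft)  # AR logits at each drafted position
    out = []
    for i, tok in enumerate(draft):
        ar_tok = int(logits[i].argmax())
        out.append(ar_tok)
        if ar_tok != tok:               # first mismatch: stop here; any cache
            break                       # entries past this point must roll back
    return out
```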
### 5. Compact Checkpoint

n1 can compact `U` spectra post-training and save compatible checkpoints.

Risk: optimizer state must be dropped or remapped carefully; do this only as a
separate command, never during a live training run.

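For the M-fold path specifically, one compaction is exact, since scores depend on `U` only through `U @ U.T`; whether n1 compacts this way is not stated here, so treat this as a sketch:

```python
# Replace a (d_k, rank) expansion matrix with a (d_k, d_k) equivalent:
# U @ U.T == Uu diag(S)^2 Uu.T, so U' = Uu * S reproduces the exact metric.
import torch

def compact_U(U: torch.Tensor) -> torch.Tensor:
    Uu, S, _ = torch.linalg.svd(U, full_matrices=False)
    return Uu * S  # columns scaled by singular values; metric unchanged

U = torch.randn(16, 48, dtype=torch.float64)
assert torch.allclose(U @ U.T, compact_U(U) @ compact_U(U).T)
```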
### 6. KV Cache Buffer

n1 replaces repeated decode-time `torch.cat` cache growth with preallocated KV
buffers.

Risk: cache object type touches AR, SAT, and future spec decoding paths.

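The shape of the change, assuming a `(batch, heads, seq, width)` layout; the real cache object carries more state:

```python
# Preallocate once, then write each decode step's K/V slice in place instead
# of growing the cache with torch.cat every step.
import torch

class KVBuffer:
    def __init__(self, batch: int, heads: int, capacity: int, width: int, **kw):
        self.k = torch.empty(batch, heads, capacity, width, **kw)
        self.v = torch.empty_like(self.k)
        self.len = 0

    def append(self, k_new: torch.Tensor, v_new: torch.Tensor):
        n = k_new.size(2)
        self.k[:, :, self.len : self.len + n] = k_new  # in-place, no realloc
        self.v[:, :, self.len : self.len + n] = v_new
        self.len += n
        return self.k[:, :, : self.len], self.v[:, :, : self.len]
```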
## Rule

Every harvested feature needs its own AGILLM-4 verifier or profile artifact.
Do not rely on n1's proof suite alone after adapting the implementation.
README.md
CHANGED
@@ -19,9 +19,12 @@ and extended for:
 - longer block-size work on 24GB, B200, and B300 class GPUs
 - AR+SAT every step with sequential backward to reduce peak VRAM
 - SDPA and experimental sublinear local+landmark attention backends
+- exact M-fold expansion attention harvested from n1.py, with a local verifier
 - profiling tools for memory, throughput, AR cost, SAT cost, and optimizer cost
 - synthetic long-context curriculum generation for recall and multi-hop tests
 
 Start with [AGILLM-4.md](AGILLM-4.md) for the training plan and command
 recipes. The current sublinear backend is intentionally experimental: profile it
 against SDPA before using it for a real run.
+
+Current harvest status from n1.py is tracked in [N1_HARVEST.md](N1_HARVEST.md).
local_sweep_after_mfold_sdpa.json
ADDED
[
  {
    "alloc_gb": 0.027,
    "amp": false,
    "attn_backend": "sdpa",
    "batch_size": 1,
    "block": 64,
    "elapsed_s": 0.534,
    "error": null,
    "grad_checkpoint": true,
    "loss": 18.8005,
    "ok": true,
    "peak_alloc_gb": 0.031,
    "peak_reserved_gb": 0.057,
    "reserved_gb": 0.057,
    "sublinear_chunk": 128,
    "sublinear_max_anchors": 256,
    "sublinear_stride": 64,
    "sublinear_window": 256,
    "tokens_per_s_synthetic": 119.8
  }
]
local_sweep_after_mfold_sublinear.json
ADDED
[
  {
    "alloc_gb": 0.027,
    "amp": false,
    "attn_backend": "sublinear",
    "batch_size": 1,
    "block": 64,
    "elapsed_s": 0.635,
    "error": null,
    "grad_checkpoint": true,
    "loss": 18.7676,
    "ok": true,
    "peak_alloc_gb": 0.031,
    "peak_reserved_gb": 0.057,
    "reserved_gb": 0.057,
    "sublinear_chunk": 16,
    "sublinear_max_anchors": 16,
    "sublinear_stride": 8,
    "sublinear_window": 16,
    "tokens_per_s_synthetic": 100.8
  }
]
local_verify_m_fold_agillm4.json
ADDED
[
  {
    "backend": "manual",
    "d": 32,
    "dk": 16,
    "expected_k_width": 16,
    "heads": 2,
    "ok": true,
    "preset": "pico_1x",
    "rank": 16,
    "rows": {
      "cache_k_width": 16.0,
      "cache_v_width": 16.0,
      "cached_append_forward": 5.960464477539063e-08,
      "causal_alibi_forward": 0.0,
      "causal_alibi_loss": 0.0,
      "causal_alibi_param_grad": 0.0,
      "causal_alibi_x_grad": 0.0,
      "none_forward": 0.0,
      "none_loss": 0.0,
      "none_param_grad": 0.0,
      "none_x_grad": 0.0,
      "sat_alibi_forward": 0.0,
      "sat_alibi_loss": 0.0,
      "sat_alibi_param_grad": 0.0,
      "sat_alibi_x_grad": 0.0
    },
    "tol": 0.0002
  },
  {
    "backend": "sdpa",
    "d": 32,
    "dk": 16,
    "expected_k_width": 16,
    "heads": 2,
    "ok": true,
    "preset": "pico_1x",
    "rank": 16,
    "rows": {
      "cache_k_width": 16.0,
      "cache_v_width": 16.0,
      "cached_append_forward": 5.960464477539063e-08,
      "causal_alibi_forward": 7.450580596923828e-08,
      "causal_alibi_loss": 0.0,
      "causal_alibi_param_grad": 1.862645149230957e-09,
      "causal_alibi_x_grad": 2.3283064365386963e-10,
      "none_forward": 8.940696716308594e-08,
      "none_loss": 0.0,
      "none_param_grad": 9.313225746154785e-10,
      "none_x_grad": 6.548361852765083e-11,
      "sat_alibi_forward": 1.1920928955078125e-07,
      "sat_alibi_loss": 1.862645149230957e-09,
      "sat_alibi_param_grad": 9.313225746154785e-10,
      "sat_alibi_x_grad": 3.4924596548080444e-10
    },
    "tol": 0.0002
  },
  {
    "backend": "manual",
    "d": 128,
    "dk": 16,
    "expected_k_width": 16,
    "heads": 8,
    "ok": true,
    "preset": "micro_3x",
    "rank": 48,
    "rows": {
      "cache_k_width": 16.0,
      "cache_v_width": 16.0,
      "cached_append_forward": 6.51925802230835e-08,
      "causal_alibi_forward": 5.960464477539063e-08,
      "causal_alibi_loss": 0.0,
      "causal_alibi_param_grad": 4.656612873077393e-10,
      "causal_alibi_x_grad": 5.820766091346741e-11,
      "metric_cache_cleared_on_train": 0.0,
      "metric_cache_reused": 0.0,
      "none_forward": 5.960464477539063e-08,
      "none_loss": 0.0,
      "none_param_grad": 2.3283064365386963e-10,
      "none_x_grad": 2.1827872842550278e-11,
      "sat_alibi_forward": 1.1920928955078125e-07,
      "sat_alibi_loss": 1.862645149230957e-09,
      "sat_alibi_param_grad": 5.820766091346741e-10,
      "sat_alibi_x_grad": 7.275957614183426e-11
    },
    "tol": 0.0002
  },
  {
    "backend": "sdpa",
    "d": 128,
    "dk": 16,
    "expected_k_width": 16,
    "heads": 8,
    "ok": true,
    "preset": "micro_3x",
    "rank": 48,
    "rows": {
      "cache_k_width": 16.0,
      "cache_v_width": 16.0,
      "cached_append_forward": 7.450580596923828e-08,
      "causal_alibi_forward": 1.043081283569336e-07,
      "causal_alibi_loss": 0.0,
      "causal_alibi_param_grad": 4.656612873077393e-10,
      "causal_alibi_x_grad": 9.458744898438454e-11,
      "metric_cache_cleared_on_train": 0.0,
      "metric_cache_reused": 0.0,
      "none_forward": 8.940696716308594e-08,
      "none_loss": 0.0,
      "none_param_grad": 2.3283064365386963e-10,
      "none_x_grad": 2.9103830456733704e-11,
      "sat_alibi_forward": 1.1920928955078125e-07,
      "sat_alibi_loss": 0.0,
      "sat_alibi_param_grad": 6.984919309616089e-10,
      "sat_alibi_x_grad": 1.1641532182693481e-10
    },
    "tol": 0.0002
  }
]
nB300_agillm4.py
CHANGED
@@ -1154,6 +1154,15 @@ class TuneableAttentionMHA(nn.Module):
         nn.init.orthogonal_(self.U)
         self.proj = nn.Linear(h * self.dk, d, bias=False)
         self.drop = nn.Dropout(0.1)
+        # Exact n1 harvest: for expansion ranks, (q @ U) @ (k @ U).T is
+        # q @ (U @ U.T) @ k.T. This keeps score/cache width at d_k with no
+        # quality change. Inference caches the metric and training recomputes
+        # it so gradients through U are unchanged.
+        self._metric_cache: Optional[torch.Tensor] = None
+        self._metric_cache_ver: int = -1
+        self._metric_cache_param_id: int = -1
+        self._metric_cache_data_ptr: int = -1
+        self._metric_cache_shape: Tuple[int, int] = (-1, -1)
 
     def _proj_qk(self, x):
         B, N, _ = x.shape
@@ -1163,6 +1172,44 @@ class TuneableAttentionMHA(nn.Module):
         B, N, _ = x.shape
         return x.view(B, N, self.h, self.dk).transpose(1, 2)
 
+    def _reshape_heads(self, x):
+        B, N, _ = x.shape
+        return x.view(B, N, self.h, self.dk).transpose(1, 2)
+
+    def _get_metric(self) -> torch.Tensor:
+        if torch.is_grad_enabled():
+            return self.U @ self.U.T
+        cur_ver = self.U._version
+        cur_param_id = id(self.U)
+        cur_data_ptr = int(self.U.data_ptr())
+        cur_shape = tuple(self.U.shape)
+        cache = self._metric_cache
+        if (
+            cache is None
+            or cache.dtype != self.U.dtype
+            or cache.device != self.U.device
+            or self._metric_cache_ver != cur_ver
+            or self._metric_cache_param_id != cur_param_id
+            or self._metric_cache_data_ptr != cur_data_ptr
+            or self._metric_cache_shape != cur_shape
+        ):
+            cache = (self.U @ self.U.T).detach()
+            self._metric_cache = cache
+            self._metric_cache_ver = cur_ver
+            self._metric_cache_param_id = cur_param_id
+            self._metric_cache_data_ptr = cur_data_ptr
+            self._metric_cache_shape = cur_shape
+        return cache
+
+    def train(self, mode: bool = True):
+        if mode:
+            self._metric_cache = None
+            self._metric_cache_ver = -1
+            self._metric_cache_param_id = -1
+            self._metric_cache_data_ptr = -1
+            self._metric_cache_shape = (-1, -1)
+        return super().train(mode)
+
     def _sublinear_attention(self, q, k, v, attn_mask=None):
         """Local-window + landmark attention: O(N * (window + N/stride))."""
         bsz, heads, q_len, _ = q.shape
@@ -1226,9 +1273,15 @@ class TuneableAttentionMHA(nn.Module):
         return torch.cat(outputs, dim=2)
 
     def forward(self, x, mask=None, rel_bias_tokens=None, kv_cache=None, use_cache=False):
-        q = self._proj_qk(self.q(x))
-        k_new = self._proj_qk(self.k(x))
+        q_lin = self.q(x)
+        k_lin = self.k(x)
         v_new = self._reshape_v(self.v(x))
+        if self.r > self.dk:
+            q = self._reshape_heads(q_lin) @ self._get_metric()
+            k_new = self._reshape_heads(k_lin)
+        else:
+            q = self._proj_qk(q_lin)
+            k_new = self._proj_qk(k_lin)
         if kv_cache is None:
             k, v = k_new, v_new
         else:
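One reason the cache keys above are sufficient across optimizer steps: in-place parameter updates bump the tensor's `_version` counter, so a post-step forward under `no_grad` recomputes the metric rather than serving a stale one. A quick sketch of that behavior:

```python
# In-place updates (what optimizer steps do) increment _version, one of the
# invalidation keys checked by _get_metric.
import torch

U = torch.nn.Parameter(torch.randn(4, 8))
before = U._version
with torch.no_grad():
    U.add_(1e-3)            # stand-in for an optimizer step
assert U._version > before  # a stale U @ U.T cannot be served
```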
verify_m_fold_agillm4.py
ADDED
#!/usr/bin/env python3
from __future__ import annotations

import argparse
import json
import math
import os
from pathlib import Path

os.environ.setdefault("AGILLM_SYNTHETIC_TOKENIZER", "1")

import torch

import nB300_agillm4 as nb


def causal_mask_cached(new_len: int, cached_len: int):
    total = cached_len + new_len
    q_pos = torch.arange(cached_len, total, device=nb.DEV).view(new_len, 1)
    k_pos = torch.arange(total, device=nb.DEV).view(1, total)
    mask = torch.where(k_pos > q_pos, float("-inf"), 0.0)
    return mask.view(1, 1, new_len, total)


def old_expanded_forward(mha: nb.TuneableAttentionMHA, x: torch.Tensor, mask=None, rel_bias_tokens=None):
    bsz, seq, _ = x.shape
    q = mha._reshape_heads(mha.q(x)) @ mha.U
    k = mha._reshape_heads(mha.k(x)) @ mha.U
    v = mha._reshape_v(mha.v(x))
    att = (q @ k.transpose(-1, -2)) / math.sqrt(mha.dk)
    if mha.use_relpos and rel_bias_tokens is not None:
        att = att + nb.alibi_bias(mha.h, rel_bias_tokens)[:, :, -seq:, :]
    if mask is not None:
        att = att + mask
    z = (att.softmax(-1) @ v).transpose(1, 2).reshape(bsz, seq, -1)
    return mha.drop(mha.proj(z))


def max_param_grad_diff(mha: nb.TuneableAttentionMHA, ref_grads: dict[str, torch.Tensor]) -> float:
    out = 0.0
    for name, param in mha.named_parameters():
        if param.grad is None:
            continue
        out = max(out, (param.grad.detach() - ref_grads[name]).abs().max().item())
    return out


def verify_case(args, preset: str, backend: str) -> dict:
    torch.manual_seed(args.seed)
    cfg = nb.PRESETS[preset].copy()
    d, h, r = cfg["d"], cfg["heads"], cfg["rank"]
    seq = args.cached_len + args.new_len
    mha = nb.TuneableAttentionMHA(d, h, r, attn_backend=backend).to(nb.DEV).eval()
    rows = {}

    for case_name, mask, rel_tokens in [
        ("none", None, None),
        ("causal_alibi", nb.causal_mask(seq), seq),
        ("sat_alibi", nb.sat_mask(seq), seq),
    ]:
        x_new = torch.randn(2, seq, d, device=nb.DEV, requires_grad=True)
        x_old = x_new.detach().clone().requires_grad_(True)
        y_new = mha(x_new, mask, rel_bias_tokens=rel_tokens)
        y_old = old_expanded_forward(mha, x_old, mask=mask, rel_bias_tokens=rel_tokens)
        loss_new = y_new.square().mean()
        loss_old = y_old.square().mean()
        loss_new.backward()
        new_x_grad = x_new.grad.detach().clone()
        new_param_grads = {
            name: param.grad.detach().clone()
            for name, param in mha.named_parameters()
            if param.grad is not None
        }
        mha.zero_grad(set_to_none=True)
        loss_old.backward()
        old_x_grad = x_old.grad.detach().clone()
        rows[f"{case_name}_forward"] = (y_new - y_old).abs().max().item()
        rows[f"{case_name}_loss"] = abs(loss_new.item() - loss_old.item())
        rows[f"{case_name}_x_grad"] = (new_x_grad - old_x_grad).abs().max().item()
        rows[f"{case_name}_param_grad"] = max_param_grad_diff(mha, new_param_grads)
        mha.zero_grad(set_to_none=True)

    with torch.no_grad():
        prefix = torch.randn(1, args.cached_len, d, device=nb.DEV)
        append = torch.randn(1, args.new_len, d, device=nb.DEV)
        full = torch.cat([prefix, append], dim=1)
        y_full = mha(full, nb.causal_mask(seq), rel_bias_tokens=seq)[:, args.cached_len :]
        _, kvs = mha(prefix, nb.causal_mask(args.cached_len), rel_bias_tokens=args.cached_len, use_cache=True)
        y_cached, kvs2 = mha(
            append,
            causal_mask_cached(args.new_len, args.cached_len),
            rel_bias_tokens=seq,
            kv_cache=kvs,
            use_cache=True,
        )
        k_cached, v_cached = kvs2
        rows["cached_append_forward"] = (y_full - y_cached).abs().max().item()
        rows["cache_k_width"] = float(k_cached.size(-1))
        rows["cache_v_width"] = float(v_cached.size(-1))

        if r > d // h:
            _ = mha(prefix, None, use_cache=False)
            first_cache = mha._metric_cache
            _ = mha(prefix, None, use_cache=False)
            second_cache = mha._metric_cache
            rows["metric_cache_reused"] = 0.0 if (first_cache is second_cache and first_cache is not None) else 1.0
            mha.train(True)
            rows["metric_cache_cleared_on_train"] = 0.0 if mha._metric_cache is None else 1.0

    tol = args.tol
    ok = True
    numeric_rows = {}
    for key, value in rows.items():
        numeric_rows[key] = value
        if key in {"cache_k_width", "cache_v_width"}:
            continue
        ok = ok and value <= tol

    expected_k_width = d // h if r > (d // h) else r
    ok = ok and int(rows["cache_k_width"]) == expected_k_width
    ok = ok and int(rows["cache_v_width"]) == d // h
    return {
        "preset": preset,
        "backend": backend,
        "d": d,
        "heads": h,
        "rank": r,
        "dk": d // h,
        "expected_k_width": expected_k_width,
        "ok": ok,
        "tol": tol,
        "rows": numeric_rows,
    }


def main() -> int:
    parser = argparse.ArgumentParser(description="Verify AGILLM-4 exact M-fold attention harvest from n1.py")
    parser.add_argument("--presets", default="pico_1x,micro_3x")
    parser.add_argument("--backends", default="manual,sdpa")
    parser.add_argument("--cached_len", type=int, default=8)
    parser.add_argument("--new_len", type=int, default=4)
    parser.add_argument("--seed", type=int, default=1234)
    parser.add_argument("--tol", type=float, default=2e-4)
    parser.add_argument("--json_out", default="")
    args = parser.parse_args()

    results = []
    all_ok = True
    for preset in [item.strip() for item in args.presets.split(",") if item.strip()]:
        for backend in [item.strip() for item in args.backends.split(",") if item.strip()]:
            result = verify_case(args, preset, backend)
            results.append(result)
            all_ok = all_ok and result["ok"]
            print(json.dumps(result, sort_keys=True), flush=True)
    if args.json_out:
        Path(args.json_out).write_text(json.dumps(results, indent=2, sort_keys=True), encoding="utf-8")
    return 0 if all_ok else 1


if __name__ == "__main__":
    raise SystemExit(main())