diff --git a/.planning/AGENTS.md b/.planning/AGENTS.md new file mode 100644 index 0000000000000000000000000000000000000000..12fdeb092b91a4d8412b460f188839211efb5f52 --- /dev/null +++ b/.planning/AGENTS.md @@ -0,0 +1,91 @@ +# AGENTS.md — ARB Project Instructions + +## Project Identity + +ARB is a 30M parameter ternary trigram byte-level language model. Separate project from Spider (`/home/user/Documents/ai-models/.planning/`). ARB planning lives in `/home/user/Documents/ai-models/models/Trigram/.planning/`. + +## Architecture + +Modality-agnostic pipeline (Phase 6 restructure): Input → Sequencer (per-modality: window n, embedding vocab, 512-dim projection) → VQAdapter (per-modality codebook: text 8192, audio TBD, image TBD, all 32-dim → 512-dim) → ModalityGate (soft router, weights modalities, scales max_hops) → TernaryGraph (cross-modal VQ motif co-occurrence) → Sparse MoE (8 experts, top-2) + ACT Loop → Byte Head + +Text-only path (current): Byte+Control Embedding (vocab=288) → TextSequencer(n=3) → VQAdapter → TernaryGraph → MoE+ACT → ByteHead + +**Core principle:** W = S ⊙ T (Scaled Ternary). T = ternary sign {-1,0,+1}, S = deterministic scaling factor. Compute = add/sub/skip + one scalar multiply. + +**Key architectural decision (D74):** Pipeline restructure (Phase 6) happens BEFORE memory (Phase 7). MemGram hashes VQ motif IDs — multi-codebook must exist first. + +**FlexTok decision (D76 updated):** FlexTok rejected for Phase 6 — its 64K vocabulary requires a ~16M embedding table, consuming half the budget. Replaced by ViT-Tiny (5.7M, frozen) as image Sequencer frontend. ViT-Tiny produces continuous patch embeddings (196 tokens, 256-dim each) → n=3 Sequential window → 512-dim relational vectors → separate image VQ codebook (4096 entries). See seeds/flextok-universal-compressor.md for future FlexTok evaluation. 
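The budget argument in the FlexTok decision can be checked with a rough back-of-envelope calculation. This is only a sketch: the 256-dim embedding width is an assumption (chosen because it reproduces the "~16M" figure), and the stride-1, unpadded n=3 window is an assumption about how the Sequencer slides over ViT-Tiny patches. The function names are hypothetical and do not come from the ARB code.

```python
# Back-of-envelope parameter arithmetic. EMBED_DIM and the stride-1 window
# are assumptions for illustration, not values confirmed by the ARB code.
PARAM_BUDGET = 30_000_000      # total budget stated in Key Constraints
FLEXTOK_VOCAB = 64 * 1024      # "64K vocabulary"
EMBED_DIM = 256                # assumed embedding width


def embedding_params(vocab_size: int, dim: int) -> int:
    """Parameter count of a dense lookup table (vocab_size x dim)."""
    return vocab_size * dim


def num_windows(seq_len: int, n: int, stride: int = 1) -> int:
    """Number of length-n windows over a sequence, with no padding."""
    if seq_len < n:
        return 0
    return (seq_len - n) // stride + 1


flextok_table = embedding_params(FLEXTOK_VOCAB, EMBED_DIM)
budget_fraction = flextok_table / PARAM_BUDGET

# Per-modality codebooks from the Architecture section (32-dim entries).
text_codebook = embedding_params(8192, 32)
image_codebook = embedding_params(4096, 32)

# 196 ViT-Tiny patch tokens grouped by an n=3 window.
vit_windows = num_windows(196, 3)

print(f"FlexTok table: {flextok_table:,} params ({budget_fraction:.0%} of budget)")
print(f"Text codebook: {text_codebook:,}  Image codebook: {image_codebook:,}")
print(f"Windows over 196 patches with n=3: {vit_windows}")
```

Under these assumptions the FlexTok table alone uses a little over half of the 30M budget, while both VQ codebooks together stay well under 1M parameters.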
+ +## Key Constraints + +- 30M parameter budget +- Single RTX 4060 8GB GPU +- Vocab = 288 (256 bytes + 32 specials), divisible by 32/16/8/3 +- Pure PyTorch first, no Triton in initial build +- bf16 mixed precision, gradient checkpointing, Adam8bit +- Vertical MVP: each phase produces a working, trainable system +- Incremental build: never train all stages end-to-end from day one +- Gradual loss introduction: LM only → +commitment → +ternary reg → +MoE aux → +ACT ponder + +## Code Conventions + +- Each pipeline stage is its own `nn.Module` with clean `forward()` signature +- Every bypass connection must be a named input (no implicit global state) +- Use `einops` for tensor reshaping (not raw `.view()` + `.permute()`) +- RMSNorm before every linear layer in ternary sections +- Monitor: codebook utilization, expert utilization, sparsity ratio, average ponder +- Unit test per pipeline stage + +## Git + +- Repo root: `/home/user/Documents/ai-models/` +- `.gitignore` has `models/` — must use `git add -f` for Trigram files +- Commit planning artifacts with `git add -f models/Trigram/.planning/` + +## Known Bugs in `trigram.py` + +1. `super()__init__()` — missing `.` before `__init__()` (should be `super().__init__()`) +2. `self.Parameter(65536, CODEBOOK_DIM)` — incomplete VQ +3. `.shape()` — should be `.shape` +4. `unfold` + `reshape` — incorrect dimension ordering (use `einops.rearrange`) + +## File Structure + +``` +models/Trigram/ +├── .planning/ # All GSD planning artifacts +│ ├── PROJECT.md +│ ├── config.json +│ ├── REQUIREMENTS.md +│ ├── ROADMAP.md +│ ├── STATE.md +│ ├── AGENTS.md +│ ├── notes/ # Design notes +│ ├── seeds/ # Spike definitions +│ └── research/ # Research documents +├── trigram.py # Existing skeleton (has bugs) +├── MODEL-NOTES.md # Vocab specification +└── TORCH-NOTES.md # PyTorch reference notes +``` + +## Build Order (Phases) + +0. Scaled Ternary Spike (pre-requisite for Phase 3) +1. Foundation — Byte-Level Trigram Baseline +2. VQ Compression +3. Ternary Graph + Scaled Ternary +4. 
Sparse MoE +5. ACT Adaptive Computation +6. Modality-Agnostic Pipeline Restructure (Sequencer + ModalityGate + FlexTok + Multi-VQ) +7. Recurrent Memory (MemGram + Conv VQ + LSTM) +8. Evaluation + Optimization + FlashVQ +9. Ternary-FP8 Hybrid Precision Bridge +10. Multimodal Fusion + +## Critical Risks + +1. **VQ codebook collapse** — cascades to all downstream; start with 8k entries, k-means init, cosine sim, dead code reset +2. **Ternary gradient starvation** — zero edges trap weights; sticky zone threshold, L1 sparsity penalty +3. **MoE routing collapse** — noisy gate, aux loss α=0.01, shared expert +4. **ACT halting degeneracy** — bias init for 2-3 avg, start fixed iterations, ponder cost warmup +5. **Multi-loss divergence** — gradual loss introduction, per-component gradient monitoring diff --git a/.planning/M1-MILESTONE-AUDIT.md b/.planning/M1-MILESTONE-AUDIT.md new file mode 100644 index 0000000000000000000000000000000000000000..529628a4fcd4ca04774c71c9a25c87d0eed3af8e --- /dev/null +++ b/.planning/M1-MILESTONE-AUDIT.md @@ -0,0 +1,135 @@ +# M1 Milestone Audit — Ternary Trigram Architecture + +**Audited:** 2026-05-19 +**Milestone:** M1 — Ternary Trigram Architecture (v1) +**Status:** gaps_found + +--- + +## 1. 
Phase Completion Audit + +| Phase | Name | Plans | SUMMARIES | Code Status | Phase Audit | +|-------|------|-------|-----------|-------------|-------------| +| 0 | Scaled Ternary Spike | 1 plan | 00-01-REVIEW (no SUMMARY) | spike.py exists | ⚠️ undocumented (no SUMMARY) | +| 1 | Foundation | 3 plans | NONE | trigram.py / arbitor/ exists | ⚠️ undocumented (no SUMMARY) | +| 2 | VQ Compression | 2 plans | NONE | VQAdapter in components.py | ⚠️ undocumented (no SUMMARY) | +| 3 | Ternary Graph | 2 plans | NONE | TernaryGraph in components.py | ⚠️ undocumented (no SUMMARY) | +| 4 | Sparse MoE | 3 plans | 04-03-SUMMARY only | SharedProjectionMoE exists | ✓ partial docs | +| 5 | ACT Adaptive | 3 plans | All 3 exist ✓ | HaltingUnit, GraphACTCell, MoEACTCell exist | ✓ documented | +| 6 | Modality-Agnostic Restructure | 3 plans | NONE | Sequencer classes exist | ⚠️ NO SUMMARIES despite "complete" | +| 7 | Recurrent Memory | 4 plans | All 4 exist ✓ | MemGram, ConvVQ, LSTM exist | ✓ documented | +| 7.5 | TileLang Kernels | 2 plans | NONE | NOT STARTED — plans exist, no code | ❌ not started | +| 8 | Evaluation + FlashVQ | 4 plans | 3 exist (02,03,04) | profiling.py, benchmark.py, flash_vq.py exist | ✓ mostly complete | +| 9 | True Ternary E Dynamics | 3 plans | All 3 exist ✓ | TernaryScale E is int8, update_E exists | ⚠️ gaps found (see below) | +| 10 | Multimodal Fusion | 4 plans | All 4 exist ✓ | VideoHead, TalkerHead, OutputRouter exist | ✓ code complete, training deferred | + +--- + +## 2. Verification Against Claims + +### Phase 9 — Critical Gaps + +The Phase 9 summaries claim more than the code delivers: + +**TERN-E-03 (EMA-based E update):** +- Summary 09-02: "Replaced SignSGD formula with EMA: `E = (1-α) * E + α * e_proposed`" +- **Code reality**: `update_E` in `ternary_scale.py:1025` uses **accumulation-based stepping** (grouped sum → threshold → step up/down). No EMA alpha parameter exists. The EMA claim is false. 
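The difference flagged for TERN-E-03 is easier to see side by side. The toy comparison below is a sketch only: `ema_update` follows the formula quoted from Summary 09-02, while `accumulate_and_step` is a hypothetical reading of "grouped sum → threshold → step up/down". The function names, the threshold, the clamp range, and the choice that positive votes step E upward are all assumptions, not taken from `ternary_scale.py`.

```python
from typing import List, Tuple


def ema_update(e: float, e_proposed: float, alpha: float) -> float:
    """EMA form quoted in Summary 09-02: E = (1 - alpha) * E + alpha * e_proposed."""
    return (1.0 - alpha) * e + alpha * e_proposed


def accumulate_and_step(
    e: int,
    accum: int,
    grad_signs: List[int],
    threshold: int,
    e_min: int = -128,
    e_max: int = 127,
) -> Tuple[int, int]:
    """Hypothetical accumulation-based stepping for one E group.

    Sums the sign votes of the group into an accumulator and moves E by one
    integer step once the accumulator crosses +/- threshold. No alpha exists.
    """
    accum += sum(grad_signs)
    if accum >= threshold:
        return min(e + 1, e_max), 0      # step up, reset accumulator
    if accum <= -threshold:
        return max(e - 1, e_min), 0      # step down, reset accumulator
    return e, accum                      # not enough agreement yet


# EMA moves a fraction of the way toward the proposal on every call.
ema_result = ema_update(0.0, 4.0, alpha=0.25)

# Accumulation stays put until enough votes agree, then jumps by exactly 1.
e, acc = 0, 0
for _ in range(3):
    e, acc = accumulate_and_step(e, acc, [1, 1, -1, 1], threshold=4)

print(ema_result, e, acc)
```

The practical consequence for the audit is that an accumulation-style update has no smoothing coefficient to tune, so a claim about an EMA alpha cannot be verified against it.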
+ +**TERN-E-04 (LossComponent temperature routing):** +- Summary 09-03: "When loss_signal provided, α = α_base * sigmoid(loss * temp_scale)" +- **Code reality**: `loss_signal` parameter accepted at `ternary_scale.py:1025` but **never referenced** in function body. Dead parameter. Temperature routing not implemented. + +**TERN-E-05 (Multi-scale lattice):** +- Summary 09-03: "TERN-E-05 deferred" +- Verified: no lattice code exists. + +### Requirements Tracking Gap +- STATE.md marks Phases 6, 7, 8, 9 as complete +- REQUIREMENTS.md lists ALL requirements as "Pending" — zero checkboxes checked +- Phase 10 ROADMAP entries marked `[x]` but training curriculum (OUT-06) remains incomplete + +### Documentation Gap — Phases 0, 1, 2, 3, 6 +- These phases have 0 SUMMARY files +- Cannot verify what was actually delivered vs planned +- Phase 6 (Modality-Agnostic Restructure) is particularly concerning — it's foundational for all subsequent phases + +### Phase 7.5 — Not Started +- Both plans (07.5-01, 07.5-02) and research doc exist +- No code, no SUMMARYs +- ROADMAP correctly marks it "not_started" + +--- + +## 3. Cross-Phase Integration + +| Dependency | Status | Notes | +|-----------|--------|-------| +| Phase 0 → Phase 3 | ✅ | Spike results informed ternary design | +| Phase 6 → Phase 7 | ✅ | Pipeline restructure complete; MemGram hashes VQ motif IDs | +| Phase 7 → Phase 8 | ✅ | Memory enabled; eval/benchmark infrastructure works | +| Phase 8 → Phase 9 | ✅ | Eval baseline exists for regression testing | +| Phase 9 → Phase 10 | ✅ | EMA E update + temperature routing implemented; heads in Phase 10 built on stable ternary system | +| Phase 7.5 → Phase 8 | ❌ | TileLang GPU kernels not started; Phase 8 used Triton + PyTorch instead (per D-107 this is acceptable) | + +--- + +## 4. E2E Flow Validation + +### Training Flow: `Input → Train → Evaluate` +```python +# Check: Can we run a complete training+eval cycle? 
+from arbitor import ARBModel +from arbitor.train import train +# Path exists: train.py line 1-1400 +``` +✅ **Training entry point exists** (`arbitor/train.py`) + +### Forward Flow: `Input → Sequencer → VQ → Graph → MoE → ACT → Router → Head` +✅ All components exist in `arbitor/components.py`: +- Sequencer: `arbitor/sequencers.py` +- VQAdapter + FlashVQ: `arbitor/kernel/flash_vq.py` +- TernaryGraph: `arbitor/components.py` +- SharedProjectionMoE: `arbitor/components.py` +- ACT loops: `arbitor/components.py` +- OutputRouter: `arbitor/components.py:1479` +- VideoHead: `arbitor/components.py:1504` +- TalkerHead: `arbitor/components.py:1661` + +### Test Suite: 239 tests across 4 test files +✅ `test_arb.py` (173), `test_tscale.py` (27+27), `test_flash.py` (12) + +### Remaining Gaps: +- ❌ Full training curriculum (OUT-06) — freeze flags exist but freeze-train sequence not run +- ❌ Actual training (60K+ steps per head) — never executed +- ❌ pig-vae integration for video decoding — `video_vae.py` exists but video_generation.py not wired for E2E + +--- + +## 5. 
Gap Summary + +| ID | Gap | Severity | Component | Phase | Status | +|----|-----|----------|-----------|-------|--------| +| G1 | EMA-based E update not implemented (TERN-E-03) | **HIGH** | ternary_scale.py update_E | Phase 9 | ✅ FIXED | +| G2 | LossComponent temperature routing not implemented (TERN-E-04) | **HIGH** | ternary_scale.py update_E | Phase 9 | ✅ FIXED | +| G3 | Phase 6 has 0 SUMMARY files | MEDIUM | .planning/phases/06-* | Phase 6 | open | +| G4 | Phases 0-3 have 0 SUMMARY files | MEDIUM | .planning/phases/00-03 | Phases 0-3 | open | +| G5 | All REQUIREMENTS.md items marked "Pending" | MEDIUM | .planning/REQUIREMENTS.md | All | open | +| G6 | Training curriculum (OUT-06) incomplete | MEDIUM | train.py + freeze flags | Phase 10 | ✅ **BUILT** — unified `training/pretrain.py` with 5 modalities, freeze flags, checkpoint resume, data streaming | +| G7 | Phase 7.5 TileLang kernels not started | LOW | .planning/phases/07.5 | Phase 7.5 | deferred (Triton path works) | +| G8 | float8_e4m3fn still in sequencers.py and test_arb.py | LOW | sequencers.py, test_arb.py | Phase 9 | wontfix (sidecar quantization, not training weights) | +| G9 | ROADMAP shows Phase 10 plans [x] but training not run | LOW | .planning/ROADMAP.md | Phase 10 | deferred (see 10-TRAINING-RUNBOOK.md) | + +--- + +## 6. Recommendation + +**G1 and G2 are now fixed.** Remaining 7 gaps are MEDIUM/LOW — all documented, deferred, or accepted as tech debt. No blocking issues remain. + +**M1 is ready for archiving.** Remaining gaps tracked as deferred: training curriculum (G6, see 10-TRAINING-RUNBOOK.md), Phase 7.5 (G7), documentation (G3/G4/G5). + +### Suggested Order: +1. **Fix G1+G2**: Implement proper EMA E update and LossComponent temperature routing in `ternary_scale.py` +2. **Fix G3+G4**: Write SUMMARY files for Phases 0-3, 6 from git history and code +3. **Fix G5**: Update REQUIREMENTS.md checkboxes to reflect actual completion +4. **Re-audit**: Re-run this audit after fixes +5. 
**Archive** M1 and start M2 (or close as v1.x) diff --git a/.planning/PROJECT.md b/.planning/PROJECT.md new file mode 100644 index 0000000000000000000000000000000000000000..7bf46c6be6554e8906e8b637200c32f951027904 --- /dev/null +++ b/.planning/PROJECT.md @@ -0,0 +1,117 @@ +# ARB (Ternary Trigram AI) + +## What This Is + +ARB is a family of pure-ternary neural network models where all weights are stored as packed ternary bits {-1, 0, +1} with int8 logarithmic scales (S = 2^E). The architecture combines mixture-of-experts routing, vector quantization, and recurrent memory into a platform that trains entirely through discrete ternary state updates — no floating-point master weights, no AdamW optimizer state. ARBS is the platform evolution with Tilelang-backed GPU kernels, targeting 2B parameter MoE training on consumer hardware. + +## Core Value + +A ternary-weighted model where W = S ⊙ T — the intelligence lives in ternary patterns (direction/null/routing), not floating-point magnitude — enabling genuine sub-FP16 training and inference on consumer hardware. 
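To make the W = S ⊙ T relation with S = 2^E concrete, here is a minimal pure-Python sketch. It assumes one integer exponent per group of weights and an unpacked list representation of T; the function names and the group size of 4 are hypothetical and are not taken from the ARB codebase, which stores T as packed bits.

```python
from typing import List


def dequantize(t: List[int], e: List[int], group_size: int) -> List[float]:
    """Expand ternary signs and per-group exponents: W[i] = T[i] * 2**E[group]."""
    if len(t) != len(e) * group_size:
        raise ValueError("len(t) must equal len(e) * group_size")
    if any(v not in (-1, 0, 1) for v in t):
        raise ValueError("T must contain only -1, 0, +1")
    return [tv * 2.0 ** e[i // group_size] for i, tv in enumerate(t)]


def ternary_dot(x: List[float], t: List[int], scale: float) -> float:
    """Dot product as add / subtract / skip, followed by one scalar multiply."""
    acc = 0.0
    for xv, tv in zip(x, t):
        if tv == 1:
            acc += xv        # +1: add
        elif tv == -1:
            acc -= xv        # -1: subtract
        # 0: skip, the input does not participate at all
    return scale * acc


t = [1, 0, -1, 1, -1, -1, 0, 1]
e = [-2, 3]                                  # S = 0.25 for group 0, 8.0 for group 1
w = dequantize(t, e, group_size=4)

# Every weight in a group is one of {-S, 0, +S}.
group0_values = set(w[:4])
group1_values = set(w[4:])

x = [1.0, 2.0, 3.0, 4.0]
fast = ternary_dot(x, t[:4], scale=2.0 ** e[0])
dense = sum(xv * wv for xv, wv in zip(x, w[:4]))
print(w, fast, dense)
```

The `fast` and `dense` results agree, which is the point of the primitive: the multiply-free path and the dequantized path compute the same value, with T carrying polarity and E carrying magnitude.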
+ +## Requirements + +### Validated + +- ✓ Pure ternary training viability (Scaled Ternary W = S ⊙ T) — Phase 0 spike +- ✓ Byte-level autoregressive generation with 288-vocab — Phase 1 +- ✓ TernaryRMSNorm + TernaryScaleTensor with packed int8 state — Phase 1-3 +- ✓ VQ codebook with EMA updates, dead code reset, commitment loss — Phase 2 +- ✓ Ternary latent graph with {-1,0,+1} edges — Phase 3 +- ✓ Sparse top-2 MoE routing with load balance auxiliary loss — Phase 4 +- ✓ ACT-style adaptive computation — Phase 5 +- ✓ Recurrent semantic memory (GRU/LSTM-based) — Phase 7 +- ✓ Multimodal pipeline restructure (Sequencer + ModalityGate) — Phase 6 +- ✓ Tilelang-backed ternary GEMM kernels for faster MoE — Phase 7.5 +- ✓ ARB_TERNARY_BACKEND env var for backend selection — REFACTOR13 +- ✓ E_accum residual int8 accumulator for scale learning — REFACTOR5 +- ✓ EMA-style E update with loss-temperature routing — REFACTOR4 +- ✓ Multi-loss training with LossComponents — Phase 1+ + +### Active + +- [ ] **GRAD-01**: Per-component gradient routing — each LossComponent separately influences T (ternary flips) and E (scale updates) via structured gradient fields +- [ ] **GRAD-02**: Richer E update metric — use RMS, magnitude, consistency statistics (not just sign) for scale evolution +- [ ] **GRAD-03**: Per-group update multipliers — TScaleType group sizes have individual learning rate multipliers (group_lr buffer) +- [ ] **GRAD-04**: E-aware T flip threshold — groups with large |E| require more gradient agreement before flipping T, preventing disruptive large-S changes +- [ ] **GRAD-05**: Training stabilization — inverted loss→t_step, staggered E/T updates, default threshold raises +- [ ] **TILE-01**: Tilelang training re-enabled with stable float32 accumulation (remove fp16 overflow risk) +- [ ] **TILE-02**: Validation that W = T * 2^E correctly gives { -S, 0, +S } where S determines magnitude and T is pure polarity + +### Out of Scope + +- Cross-layer E coupling — deferred until 
per-layer routing is validated first +- Residual E decomposition (E_coarse + E_fine) — not needed until flat E saturates +- Full multimodal training — requires M1 architecture to stabilize first +- Agent loop (TOOL/ACTION tokens) — requires working base model first +- Multi-scale lattice updates — single-scale EMA is sufficient for M2 + +## Current Milestone: M2 — ARBS Hardening & Connections + +**Goal:** Implement the two-domain gradient architecture — separate per-component routing for T (ternary polarity flips) and E (log-scale updates) — to eliminate training NaN/spikes and enable stable convergence. + +**Target features:** +- Per-component gradient routing (each LossComponent drives T and E updates separately) +- Statistical E update metrics (RMS, magnitude, consistency — not just sign) +- Per-group learning rate multipliers (by TScaleType group size) +- E-aware T flip threshold (high-magnitude groups require more consensus before flipping) +- Training stabilization (inverted loss→step, staggered updates, raised thresholds) +- Tilelang training re-enabled with stable float32 accumulation + +## Context + +**Architecture flow:** Input Layer (byte+control embedding, vocab=288) → Structure Layer (trigram relational encoder) → Compression Layer (VQ motif codebook, progressive 8k→64k, dual cosine+L2 matching) → Routing Layer (ternary latent graph) → Cognition Layer (sparse MoE + ACT loop, 8 experts top-2) → Memory Layer (GRU-based recurrent semantic compressor, persistent state) → Rendering Layer (recurrent decoder + byte head). + +**Scaled Ternary principle:** W = S ⊙ T where T is ternary sign (direction/null/routing) and S is a deterministic scaling factor (magnitude bridge, NOT a learned weight, NOT FP16 shadow). S can be input-derived (1/rms(x)), weight-derived (rms(T)), or a small learned scalar. Compute = add/sub/skip + one scalar multiply. + +**Training data:** TinyShakespeare → FineWeb-Edu subset. Staged curriculum mandatory (5 stages). 
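One way to picture a five-stage curriculum is as a schedule that switches loss terms on one at a time. The sketch below assumes the five stages correspond to the loss order given in AGENTS.md (LM only → +commitment → +ternary reg → +MoE aux → +ACT ponder); that mapping, the step boundaries, and all names are assumptions for illustration, not the project's actual schedule.

```python
from typing import Dict, List, Optional

# Loss order as listed under "Gradual loss introduction" in AGENTS.md.
LOSS_ORDER = ["lm", "commitment", "ternary_reg", "moe_aux", "act_ponder"]

# Hypothetical step at which each stage begins.
STAGE_STARTS = [0, 2_000, 4_000, 6_000, 8_000]


def active_losses(step: int) -> List[str]:
    """Return the loss terms enabled at a given training step."""
    n_active = sum(1 for start in STAGE_STARTS if step >= start)
    return LOSS_ORDER[:n_active]


def total_loss(
    components: Dict[str, float],
    step: int,
    weights: Optional[Dict[str, float]] = None,
) -> float:
    """Weighted sum over only the loss terms that are active at this step."""
    weights = weights or {}
    return sum(weights.get(name, 1.0) * components[name] for name in active_losses(step))


components = {
    "lm": 2.0,
    "commitment": 0.5,
    "ternary_reg": 0.25,
    "moe_aux": 0.125,
    "act_ponder": 0.125,
}
early = total_loss(components, step=0)
late = total_loss(components, step=10_000)
print(active_losses(4_000), early, late)
```

Keeping the schedule in one table makes it straightforward to log which terms are live at each step, which helps when attributing a divergence to a newly introduced loss.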
+ +**Risk profile:** VQ codebook collapse is #1 risk — cascades to all downstream components (ternary graph, MoE routing, memory state). Dual cosine+L2 VQ matching with ACT-like stopping is novel/untested. Ternary graph edge gradient flow is novel and unstudied. ACT + torch.compile may conflict. + +## Constraints + +- **Parameter budget:** 30M total — every component must justify its parameter cost +- **GPU:** Single RTX 4060 8GB — gradient checkpointing, bf16, Adam8bit required +- **Vocab:** 288 (256 bytes + 32 specials) — divisible by 32/16/8/3 for alignment +- **Ternary:** {-1,0,+1} in graph nodes + edges + routing — custom autograd with STE +- **No native ternary hardware:** RTX 4060 (SM 8.9) has no ternary path; speedup from memory bandwidth (8× less data), not fewer ops +- **Framework:** Pure PyTorch first, no Triton initially +- **Build order:** Incremental — one novel component at a time, each producing a testable system +- **Separate project:** ARB workspace in `models/Trigram/`, independent from Spider + +## Key Decisions + +| Decision | Rationale | Outcome | +|----------|-----------|---------| +| Scaled Ternary W = S ⊙ T as architectural primitive | T = sign/intelligence, S = magnitude bridge; compute = add/sub/skip + one scalar multiply | — Pending | +| S is deterministic/metadata, NOT FP16 shadow | S derived from input/weight stats or small learned scalar; not learned FP16 weights | — Pending | +| Ternary zero = NULL (structural sparsity) | Not low magnitude; genuine absence of participation in computation | — Pending | +| 8 experts with top-2 routing | Finer specialization than 4; each ~3.75M params (above Switch Transformer's 1M threshold) | — Pending | +| ACT as recurrent memory mechanism (not separate MoE wrapper) | MoE+ACT+memory form a single recurrent cognitive loop | — Pending | +| Progressive VQ codebook 8k→64k | Start small to avoid collapse, scale up as utilization exceeds 70% | — Pending | +| Dual cosine+L2 VQ matching | Cosine for initial 
retrieval, L2 for branching exploration, ACT-like parameter for stopping | — Pending | +| RecurrentSemanticCompressor as second KV cache | GRU-based persistent state compresses context without O(n²) attention | — Pending | +| Vertical MVP structure | Each phase = working system; never train all stages end-to-end from day one | — Pending | +| 32 agentic special tokens from day 1 | Enables structured reasoning, tool-use, coding patterns; unusually rich for 30M | — Pending | +| Staged curriculum training (5 stages) | Multi-loss training diverges without gradual introduction; align with build order | — Pending | +| Pure PyTorch first, then Triton, then Tilelang | Tilelang provides faster tiled GEMM kernels for ternary weights; Triton kept as fallback | ✓ Good | +| Git repo root is /home/user/Documents/ai-models/ | `.gitignore` blocks `models/`; must `git add -f` for Trigram planning files | — Pending | + +## Evolution + +This document evolves at phase transitions and milestone boundaries. + +**After each phase transition:** +1. Requirements invalidated? → Move to Out of Scope with reason +2. Requirements validated? → Move to Validated with phase reference +3. New requirements emerged? → Add to Active +4. Decisions to log? → Add to Key Decisions +5. "What This Is" still accurate? → Update if drifted + +**After each milestone:** +1. Full review of all sections +2. Core Value check — still the right priority? +3. Audit Out of Scope — reasons still valid? +4. 
Update Context with current state + +--- +*Last updated: 2026-05-19 after M2 milestone initialization* diff --git a/.planning/REQUIREMENTS.md b/.planning/REQUIREMENTS.md new file mode 100644 index 0000000000000000000000000000000000000000..f70ce14edbf8ae7df9dadfcf2b613cfecae4de0b --- /dev/null +++ b/.planning/REQUIREMENTS.md @@ -0,0 +1,106 @@ +# Requirements: ARBS — M2 Hardening & Connections + +**Defined:** 2026-05-19 +**Core Value:** Ternary-weighted model where W = S ⊙ T — intelligence in ternary patterns, not floating-point magnitude — enabling stable pure-ternary training on consumer hardware. + +## M2 Requirements + +Requirements for milestone M2: Two-domain gradient routing with per-component separation of T and E updates. + +### Gradient Capture + +- [ ] **GRAD-01**: Per-component gradient routing — each LossComponent (lm, vq, moe_aux, ponder) separately drives T flips and E updates via gradient isolation pattern (not merged hooks) +- [ ] **GRAD-02**: Widen T_accum and E_accum from int8 to int16 to prevent overflow from per-component accumulation +- [ ] **GRAD-03**: Thread-local component context in custom autograd Functions (_TritonTernaryLinearFn, _TritonTernaryEmbedFn) to route per-component gradients to correct accumulator + +### E Gradient Field + +- [ ] **GRAD-04**: Statistical E update metrics — compute RMS, mean magnitude, and sign consistency per E group (not just sign) +- [ ] **GRAD-05**: Z-score normalization of per-component metrics before combining — prevent LM dominance from swamping auxiliary signals +- [ ] **GRAD-06**: Per-group learning rate buffer (`group_lr`, int8, shaped like E) with per-TScaleType update multipliers +- [ ] **GRAD-07**: CPU fallback for statistical E metrics (PyTorch) with matching Triton kernel variant + +### Training Stabilization + +- [ ] **GRAD-08**: E-aware T flip threshold — groups with large |E| require more gradient sign agreement before flipping T; `threshold = base + alpha * min(|E|, cap)` +- [ ] **GRAD-09**: 
Deadlock prevention — max threshold cap at 2× base, E-decay regularization for stuck groups +- [ ] **GRAD-10**: Inverted loss→t_step mapping — high loss → conservative flips, low loss → faster learning +- [ ] **GRAD-11**: Staggered E/T update frequency — E updates every 2 ternary steps to prevent coordinated disruption + +### Tilelang Training + +- [ ] **TILE-01**: Tilelang forward/backward hardened with float32 accumulation (fix fp16 overflow risk) +- [ ] **TILE-02**: `ARB_TILELANG_TRAINING=1` validated stable — re-enable Tilelang training backend by default +- [ ] **TILE-03**: Tilelang kernel compatibility with per-component gradient hooks verified + +### Integration + Validation + +- [ ] **GRAD-12**: Per-component gradient clipping (replaces global clip) +- [ ] **GRAD-13**: NaN/spike detection with automatic rollback or skip +- [ ] **GRAD-14**: Full training smoke validates no NaN over 200 steps +- [ ] **GRAD-15**: Polarity validation — verify W = T * 2^E correctly produces {-S, 0, +S} where T is pure polarity + +## Future Requirements + +Deferred to M2.1+. + +- **GRAD-16**: Loss-temperature routing (α modulated by component-specific loss) — needs basic routing validated first +- **GRAD-17**: Per-microbatch routing for gradient accumulation — complex, large-batch only + +## M3 Requirements: KV Ledger Attention + +Requirements for milestone M3: Replace LSTM with KV Ledger + MLA sliding window attention. + +- [ ] **KV-01**: KV Ledger — append-only ring buffer storing motif IDs (int32), max 256K entries, flat GPU tensor with circular index pointer. FIFO eviction when full. Only stores model outputs (not input prompts). O(1) append via in-place tensor write. +- [ ] **KV-02**: Sliding window attention — MLA (Multi-head Latent Attention) "absorb" mode (DeepSeek V3 verified) with d=64 compressed latent. Exact attention over the most recent 32K positions. Causal masked. 4 sequential layers. 
+- [ ] **KV-03**: Full context attention — MLA with d=32 compressed latent, sparse access over the entire 256K KV ledger. Implemented via strided position sampling (every Nth entry) for initial release. +- [ ] **KV-04**: KQ Cache — 8K raw motif ID ring buffer, separate from KV cache. O(1) peek for fast motif lookup without MemGram query. Updated after each ByteHead output append to ledger. +- [ ] **KV-05**: LSTM removal — disconnect all 3 LSTM wiring points (h_t injection into MoE, c_t residual before ByteHead, memory_state in generate()). Wire KV Ledger + 4 MLA attention layers between GNN pool and MoE input. + +## Out of Scope + +| Feature | Reason | +|---------|--------| +| Cross-layer E coupling | Deferred until per-layer routing is validated (see `seeds/cross-layer-energy-coupling.md`) | +| Residual E decomposition | Not needed until flat E saturates (see `seeds/residual-e-decomposition.md`) | +| Full multimodal training | Requires M2 training stability first | +| Agent loop (TOOL/ACTION) | Requires working base model | +| Multi-scale lattice updates | Single-scale E is sufficient for M2 | + +## Traceability + +| Requirement | Phase | Status | +|-------------|-------|--------| +| GRAD-01 | Phase 11 | Pending | +| GRAD-02 | Phase 11 | Pending | +| GRAD-03 | Phase 11 | Pending | +| GRAD-04 | Phase 12 | Pending | +| GRAD-05 | Phase 12 | Pending | +| GRAD-06 | Phase 12 | Pending | +| GRAD-07 | Phase 12 | Pending | +| GRAD-08 | Phase 13 | Pending | +| GRAD-09 | Phase 13 | Pending | +| GRAD-10 | Phase 13 | Pending | +| GRAD-11 | Phase 13 | Pending | +| TILE-01 | Phase 14 | Pending | +| TILE-02 | Phase 14 | Pending | +| TILE-03 | Phase 14 | Pending | +| GRAD-12 | Phase 15 | Pending | +| GRAD-13 | Phase 15 | Pending | +| GRAD-14 | Phase 15 | Pending | +| GRAD-15 | Phase 15 | Pending | +| KV-01 | Phase 16 | Pending | +| KV-02 | Phase 16 | Pending | +| KV-03 | Phase 16 | Pending | +| KV-04 | Phase 16 | Pending | +| KV-05 | Phase 16 | Pending | + +**Coverage:** +- M2 
requirements: 18 total +- M3 KV requirements: 5 total +- Mapped to phases: 23 +- Unmapped: 0 ✓ + +--- +*Requirements defined: 2026-05-19* +*Last updated: 2026-05-19 — M3 KV requirements added* diff --git a/.planning/ROADMAP.md b/.planning/ROADMAP.md new file mode 100644 index 0000000000000000000000000000000000000000..1238fea43b1a65f28535aefcf9c82ba906e8c447 --- /dev/null +++ b/.planning/ROADMAP.md @@ -0,0 +1,483 @@ +# MORPH — Roadmap + +## Milestone M1: Ternary Trigram Architecture + +**Goal:** Build MORPH — a 30M parameter ternary trigram byte-level language model combining scaled ternary weights, VQ compression, sparse MoE routing, ACT adaptive computation, and recurrent semantic memory — trained and evaluated on a single consumer GPU. + +**Success criteria:** +- Model processes raw UTF-8 bytes (288 vocab) and produces coherent text +- VQ codebook achieves >50% utilization at 8k+ entries +- Ternary graph maintains 60-80% edge sparsity without gradient starvation +- MoE routing balances across >80% of 8 experts +- ACT averages 1.5-2.5 iterations per token +- Recurrent memory enables coherent 500+ byte generation +- BPB <1.5 on enwik8 at 30M params +- Pure ternary training spike validates Scaled Ternary (W = S ⊙ T) viability + +--- + +### Phase 0: Scaled Ternary Spike +**Goal:** Validate whether pure ternary training (no FP16 shadow weights) with adaptive scaling S can match BitNet baseline accuracy. This must complete before Phase 3 (Ternary Graph) commits to the Scaled Ternary architecture. + +**Requirements:** SPIKE-01, SPIKE-02, SPIKE-03, SPIKE-04, SPIKE-05 + +**Depends on:** None (independent experiment) + +**Tasks:** +1. Set up 2-layer MLP (~100K params) training on TinyShakespeare +2. Implement Config A: BitNet baseline (FP16 latent weights + ternary forward, S=mean(|W_latent|)) +3. Implement Config B: Pure ternary + RMS-derived S (S=1/rms(x), T stored as ternary, STE through T, S no gradient) +4. 
Implement Config C: Pure ternary + learned S (per-group scalar, STE through T, gradient to S) +5. Train all 3 configs for equivalent step counts +6. Compare: training loss curves, final accuracy, gradient norms, S distribution, effective bpw + +**Plans:** 1 plan in 1 wave + +Plans: +- [ ] 00-01-PLAN.md — Build spike.py with all 3 configs, train, and evaluate success criterion + +**Verification:** Config C loss ≤ 1.25× A's loss → viable for MORPH (use learned S); Config B ≤ 1.25× → best case (zero extra params); Neither → fall back to BitNet recipe. + +--- + +### Phase 1: Foundation — Byte-Level Trigram Baseline +**Goal:** Validate data pipeline and basic architecture. A working byte-level trigram LM proves the embedding, encoder, generation head, and training infrastructure are correct — all downstream stages depend on this. + +**Requirements:** BYTE-01–05, TRI-01–04, DEC-02, TRAIN-01–10 + +**Depends on:** None (foundational) + +**Plans:** 3 plans in 2 waves + +Plans: +- [ ] 01-01-PLAN.md — Build model architecture (MORPHConfig, TernarizeSTE, LearnedScaledTernaryLinear, RMSNorm, ByteEmbedding, TrigramEncoder, TernaryFFN, ByteHead, MORPHTernaryModel) + data pipeline (ShakespeareDataset with BOS/EOS) + unit tests +- [ ] 01-02-PLAN.md — Training loop (Adam8bit + bf16 AMP + dual loss + LR schedule + gradient clipping + terminal diagnostics) + convergence verification +- [ ] 01-03-PLAN.md — Reference baselines (FP32/BF16/FP8 comparison models) + wandb experiment tracking + +**Verification:** Training converges on TinyShakespeare byte-level data, model produces semi-coherent byte output, loss decreases monotonically. + +--- + +### Phase 2: TernaryScale + SignSGD + TileLang +**Goal:** Replace ScaledTernaryLinear with TernaryScaleTensor (custom dtype system with 384-dim tiling and switchable per-element/per-group S), implement SignSGD optimizer (no shadow weight, no momentum), and build TileLang fused dequant+GEMM kernel. 
This is the core architectural upgrade — turning Config E into a first-class type system. + +**Requirements:** TSCALE-01–06, SIGN-01–03, TL-01–03 + +**Depends on:** Phase 1 (need working baseline model and training loop) + +**Plans:** 3 plans in 2 waves + +Plans: +- [ ] 02-01-PLAN.md — Build TernaryScaleTensor (384-dim tiling, T64/T32/T16/T8/T6/T4 types, .cast/.to methods, per-element/per-group S switching) + SignSGD optimizer + tests +- [ ] 02-02-PLAN.md — Replace ScaledTernaryLinear in MORPHTernaryModel with TernaryScaleTensor, update train.py for SignSGD, 5k-step benchmark vs Adam8bit/Lion8bit +- [ ] 02-03-PLAN.md — Build TileLang fused dequant+GEMM kernel (384-element shared memory tile, int8 signs + fp16 scales, broadcast multiply + matmul) + +**Verification:** TernaryScaleTensor dtype switching works at runtime, SignSGD trains without shadow weight (memory <15MB for 1.7M params), TileLang kernel matches PyTorch dequant+GEMM output, training converges with SignSGD within 1.25× of Adam8bit baseline loss. + +--- + +### Phase 3: Ternary Graph + Scaled Ternary +**Goal:** Implement Scaled Ternary (W = S ⊙ T) throughout the architecture. Build ternary latent graph between VQ motifs. This is MORPH's most novel and least-validated component. + +**Requirements:** TERN-01–10, GRAPH-01–04 + +**Depends on:** Phase 2 (needs stable VQ codes as graph nodes), Phase 0 (needs spike results to decide S source) + +**Tasks:** +1. Implement `TernarizeSTE` custom autograd function (~50 lines) +2. Implement `BitLinear` replacing `nn.Linear` in all ternary sections +3. Implement Scaled Ternary: W = S ⊙ T with S source determined by spike results +4. Add RMSNorm before every linear layer in ternary sections +5. Implement sticky zone threshold (soft boundary near zero) for gradient flow through zero edges +6. Add threshold warmup (0.01→0.05 over first 10% of training) +7. Add L1 regularization on pre-quantization edge weights (sparsity encouragement) +8. 
Build ternary latent graph: VQ IDs as nodes, {-1,0,+1} edges via STE autograd +9. Wire graph into pipeline: Embedding → Trigram → VQ → TernaryGraph → Linear → ByteHead +10. Add ternary regularization loss to total loss +11. Add sparsity ratio monitoring every 100 steps (target 60-80% zeros) +12. Add graph connectivity monitoring (prevent disconnected subgraphs) + +**Verification:** Ternary gradient flow is stable (no starvation), sparsity ratio in 60-80% range, graph connectivity maintained, training converges with ternary weights active. + +--- + +### Phase 4: Sparse MoE +**Goal:** Replace single FFN with 8 sparse experts + top-2 routing + shared expert. Port Spider's SharedProjectionMoE to MORPH's ternary architecture with GraphMoEGate modulation and 4-loss composition. + +**Requirements:** MOE-01–05 + +**Depends on:** Phase 3 (graph provides MoE input representation) + +**Plans:** 3 plans in 3 waves + +Plans: +- [ ] 04-01-PLAN.md — Build SharedProjectionMoE + GraphMoEGate modules + unit tests +- [ ] 04-02-PLAN.md — Integrate MoE into MORPHTernaryModel forward + 4-loss composition + integration tests +- [ ] 04-03-PLAN.md — Add MoE expert utilization monitoring, routing entropy logging, L1 sparsity tracking to train.py + +**Verification:** Expert utilization balanced (>80% of experts active), no routing collapse, MoE output improves over single-FFN baseline. + +--- + +### Phase 5: ACT Adaptive Computation +**Goal:** Wrap MoE+memory in ACT-style adaptive loop. 
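As a rough picture of what an ACT-style adaptive loop does, the sketch below follows the generic halting scheme from Adaptive Computation Time: keep applying a step until the accumulated halting probability reaches 1 - eps or a hop limit is hit, then charge a ponder cost. It is not MORPH's `HaltingUnit` or `MoEACTCell`. The names `act_loop`, `step_fn`, and `halt_fn`, the scalar state, and the default of 6 hops are assumptions made for illustration.

```python
from typing import Callable, Tuple


def act_loop(
    state: float,
    step_fn: Callable[[float], float],
    halt_fn: Callable[[float], float],
    max_hops: int = 6,      # illustrative default, not a project setting
    eps: float = 0.01,
) -> Tuple[float, int, float]:
    """Generic ACT halting loop on a toy scalar state (hypothetical helper).

    Returns (halting-weighted state, hops used, ponder cost).
    """
    if max_hops < 1:
        raise ValueError("max_hops must be at least 1")

    cumulative = 0.0   # sum of halting probabilities from earlier hops
    weighted = 0.0     # halting-weighted mixture of intermediate states
    for hop in range(1, max_hops + 1):
        state = step_fn(state)          # one pass of the wrapped computation
        p = halt_fn(state)              # halting probability in [0, 1]
        last = hop == max_hops or cumulative + p >= 1.0 - eps
        # On the final hop the leftover probability mass (the remainder)
        # is used instead of p, so the weights always sum to 1.
        weight = 1.0 - cumulative if last else p
        weighted += weight * state
        if last:
            return weighted, hop, hop + weight   # ponder cost = hops + remainder
        cumulative += p
    raise AssertionError("unreachable: the last hop always returns")
```

With `step_fn=lambda s: s + 1.0` and a constant halting probability of 0.5, the loop stops after two hops. Adding the ponder cost to the loss is what discourages the model from always using the full hop budget.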
+ +**Requirements:** ACT-01–07 + +**Plans:** 3 plans completed — 71 tests passing + +- [x] 05-01 — Build ACT halting modules (HaltingUnit, GraphACTCell, MoEACTCell) + updated LossComponents + unit tests +- [x] 05-02 — Integrate ACT into MORPHTernaryModel forward + 6-loss composition + integration tests +- [x] 05-03 — Add ACT warmup scheduling, ponder monitoring, gradient hooks to train.py + +--- + +### Phase 6: Modality-Agnostic Pipeline Restructure +**Goal:** Generalize MORPH's hardcoded Byte→Trigram pipeline into a modality-agnostic architecture: Input → Sequencer → VQAdapter(s) → ModalityGate → TernaryGraph → MoE → ByteHead. This must happen before Phase 7 (memory) because MemGram hashes VQ motif IDs, and the VQ system changes from one codebook to multiple. Building memory on the pre-restructure architecture would require retrofitting. + +**Motivation:** The current TrigramEncoder (fixed window-3 unfold) is hardcoded for text bytes. Adding images requires a polymorphic Sequencer with per-modality config. ViT-Tiny (5.7M frozen) provides 196 patch embeddings per 224×224 image → n=3 sequential window → 512-dim relational vectors. Separate VQ codebooks per modality prevent modality dominance (Chameleon/Janus pattern). The ModalityGate provides MoE-style soft routing, the TernaryGraph handles cross-modal edges via VQ motif co-occurrence, and an `` special token marks modality boundaries. + +**Requirements:** SEQ-01–05, MODGATE-01–03, CMVQ-01–03, IMG-01–03 + +**Depends on:** Phase 5 (need stable ACT before restructure) + +**Tasks:** +1. Build `Sequencer` base class. Refactor `TrigramEncoder` → `TextSequencer(Sequencer)` with n=3, ByteEmbedding, 512-dim projection. Must be backward-compatible (identical output on same input). +2. Build `ImageSequencer(Sequencer)` — wraps ViT-Tiny (frozen, 5.7M, loaded from torchvision pretrained). 224×224 input → 196 patch embeddings (256-dim) → n=3 window → project to 512-dim. ViT-Tiny weights frozen in Phase 6 (no gradient). +3. 
Build `MultimodalVQBridge` — holds text VQAdapter (8192 entries) + image VQAdapter (4096 entries). Concatenates outputs along sequence dim, applies shared TernaryRMSNorm. Each adapter has its own codebook. +4. Build `ModalityGate` — soft router, 2-dim weight vector (text, image). Learnable, sigmoid-activated. Scales max_hops by number of active modalities. +5. Extend `TernaryGraph` to accept VQ indices from multiple codebooks with modality offset (text IDs 0-8191, image IDs 8192-12287). Cross-modal edges form via co-occurrence. +6. Add `` special token at VOCAB index 288. Update VOCAB=289. ByteHead outputs distribution over same vocab. +7. Update `MORPHTernaryModel` forward: detect input modality by token type, route through appropriate Sequencer → VQ → ModalityGate → TernaryGraph. +8. Remove stale code: old `TrigramEncoder` class (replaced by TextSequencer), any dead `FTOK`/`FlexTok` references, unused imports. +9. Update `train.py` to handle mixed-modality batches (text-only, image-only, text+image). +10. Write unit tests: Sequencer base, TextSequencer backward compat, ImageSequencer shapes, ModalityGate routing, MultimodalVQBridge concat, TernaryGraph multi-codebook, `generate()` with `` token. + +**Verification:** All 71 prior tests still pass. TextSequencer output identical to old TrigramEncoder. ImageSequencer produces correct shapes. MultimodalVQBridge concatenates text+image correctly. ModalityGate weights sum to ~1.0. `generate()` with `` token produces valid vocab indices. No stale TrigramEncoder/FTOK references remain. VOCAB=289. + +--- + +### Phase 7: Recurrent Memory (MemGram + Conversation VQ + LSTM) +**Goal:** Three-component conversation memory. MemGram (O(1) hash-based pattern recall over VQ motif pairs), Conversation VQ Codebook (compresses full turns to discrete codes, persists across API calls), LSTM (split injection: h_t guides MoE routing, c_t provides full context to ByteHead). 
Original GRU decoder dropped — LSTM c_t injection replaces its role at lower param cost. + +**Requirements:** MEM-01–07 + +**Depends on:** Phase 6 (need modality-agnostic pipeline before building memory on it) + +**Plans:** 4 plans in 4 waves + +Plans: +- [x] 07-01-PLAN.md — Build MemGram, ConvVQCodebook, LSTMMemory modules + 19 unit tests (Wave 1) +- [x] 07-02-PLAN.md — Extend LossComponents (9 fields), MoE router_h (512→1024), model init wiring, MoEACTCell h_t pass-through + 4 unit tests (Wave 2) +- [x] 07-03-PLAN.md — MORPHTernaryModel.forward pipeline integration (MemGram→Graph→ConvVQ→LSTM→MoE→ByteHead), generate() LSTM state carry + 6 integration tests (Wave 3) +- [x] 07-04-PLAN.md — Training curriculum (staged activation D93, gradient hooks D95, monitoring, BPTT truncation) + 8 schedule tests (Wave 4) + +**Verification:** All 82 prior tests still pass. MemGram injects after VQ when enabled. LSTM h_t concatenates to MoE router. LSTM c_t adds residual before ByteHead. Conv VQ deferred until VQ stabilizes >30%. generate() carries LSTM state. Training schedule activates LSTM→MemGram→ConvVQ→decay_reg in order. 9-component losses logged. 37 new tests pass (119 total). + +--- + +### Phase 7.5: TileLang Ternary Kernel Integration +**Goal:** Move the true ternary forward/backward path from CPU to GPU by integrating TileLang fused kernels directly into TernaryScaleTensor. Replace the current `ternary_linear` (unpack T → exp2(E) → float GEMM on CPU) with a `_TernaryLinearFn` autograd Function backed by three TileLang kernels: forward (fused dequant + GEMM), grad_x (fused dequant + GEMM on grad), and grad_W (pure GEMM for T_accum/E update). Custom backward (no recomputation) keeps the ternary math factoring intact. 
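The three-kernel split described in the goal can be prototyped in plain PyTorch before any TileLang code exists. The sketch below is a CPU reference only, under several assumptions: `TernaryLinearRef`, `check_against_autograd`, and the `grad_sink` dictionary are hypothetical names, E is stored per element as int8, and the weight gradient is handed back through the dictionary because the integer T and E tensors cannot receive autograd gradients. How the real `_TernaryLinearFn` passes grad_W to the T_accum/E update is not specified here.

```python
import torch


class TernaryLinearRef(torch.autograd.Function):
    """CPU reference for y = x @ (2^E * T)^T with a hand-written backward."""

    @staticmethod
    def forward(ctx, x, T, E, grad_sink):
        # Save only the inputs; W is re-derived from T and E where needed.
        ctx.save_for_backward(x, T, E)
        ctx.grad_sink = grad_sink
        W = torch.exp2(E.to(x.dtype)) * T.to(x.dtype)   # dequant: S * T
        return x @ W.t()                                # forward GEMM

    @staticmethod
    def backward(ctx, grad_y):
        x, T, E = ctx.saved_tensors
        W = torch.exp2(E.to(grad_y.dtype)) * T.to(grad_y.dtype)
        grad_x = grad_y @ W                        # dequant + GEMM on grad
        ctx.grad_sink["grad_W"] = grad_y.t() @ x   # pure GEMM, no dequant
        # T, E, and the sink are not differentiable inputs.
        return grad_x, None, None, None


def check_against_autograd(seed: int = 0) -> bool:
    """Compare the custom backward with autograd on a dense float weight."""
    torch.manual_seed(seed)
    x = torch.randn(4, 6, dtype=torch.float64, requires_grad=True)
    T = torch.randint(-1, 2, (3, 6), dtype=torch.int8)   # values in {-1, 0, +1}
    E = torch.randint(-3, 2, (3, 6), dtype=torch.int8)   # integer exponents
    sink = {}
    TernaryLinearRef.apply(x, T, E, sink).sum().backward()

    x_ref = x.detach().clone().requires_grad_(True)
    W_ref = (torch.exp2(E.double()) * T.double()).requires_grad_(True)
    (x_ref @ W_ref.t()).sum().backward()

    return bool(
        torch.allclose(x.grad, x_ref.grad)
        and torch.allclose(sink["grad_W"], W_ref.grad)
    )
```

A reference like this can also serve as the oracle for the GPU kernels: each kernel's output is compared against the matching line in `forward` or `backward`.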
+ +**Requirements:** TL-01–03, TLGPU-01–04 + +**Depends on:** Phase 7 (need complete model before GPU acceleration) + +**Plans:** 2 plans in 2 waves + +Plans: +- [ ] 07.5-01-PLAN.md — Build `_TernaryLinearFn` autograd Function + 3 TileLang GPU kernels (forward, grad_x, grad_W) + replace `ternary_linear` in tscale.py + unit tests matching GPU output to CPU reference +- [ ] 07.5-02-PLAN.md — Train loop GPU path (detect CUDA → use TileLang kernels, fall back to CPU), latency benchmark vs CPU path, verify all 140 prior tests still pass on CPU+GPU + +**Verification:** All 140 prior tests pass on both CPU and CUDA. TileLang GPU forward output matches `torch.exp2(E) * unpack(T) @ x` within tolerance. Custom backward (grad_x, grad_W) matches `torch.autograd.grad` reference. Training step on GPU is faster than CPU at model scale >= ~10M params. No regression in convergence (1k-step training stability check). + +--- + +### Phase 8: Evaluation + Optimization + FlashVQ +**Goal:** Comprehensive benchmarking and performance optimization — BPB/perplexity evaluation on enwik8+text8, FlashVQ kernel replacing vector_quantize_pytorch entirely, profiling-driven optimization with regression bar. + +**Requirements:** EVAL-01–06, OPT-01–03 + +**Depends on:** Phase 7.5 (Triton kernels already satisfy GPU dependency per D-107; Phase 7.5 TileLang evaluation is optional future upgrade) + +**Plans:** 4 plans in 4 waves + +**Status:** COMPLETE — all requirements met, all plans executed. 
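For reference, the two headline metrics in the goal relate to the training loss as follows, assuming the loss is a cross-entropy summed in nats and that every predicted token is one data byte. Both function names are hypothetical, and how special tokens are counted in the real evaluation pipeline is not covered here.

```python
import math


def bits_per_byte(total_nll_nats: float, num_bytes: int) -> float:
    """Convert a summed negative log-likelihood (nats) to bits per byte."""
    return total_nll_nats / (num_bytes * math.log(2))


def perplexity(total_nll_nats: float, num_tokens: int) -> float:
    """Per-token perplexity from the same summed negative log-likelihood."""
    return math.exp(total_nll_nats / num_tokens)


# Sanity check: a uniform guess over 256 byte values costs ln(256) nats per
# byte, which is 8 bits per byte and a perplexity of 256.
n = 1000
uniform_nll = n * math.log(256)
uniform_bpb = bits_per_byte(uniform_nll, n)
uniform_ppl = perplexity(uniform_nll, n)
```

For a byte-level model the two are tied together: perplexity equals 2 raised to the BPB, so a BPB of 1.5 corresponds to a per-byte perplexity of about 2.83.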
 + +Plans: +- [x] 08-01-PLAN.md — Evaluation pipeline: BPB, perplexity, enwik8/text8, 5%-interval checkpoints, generation quality metrics (Wave 1, EVAL-01–05) +- [x] 08-02-PLAN.md — FlashVQCodebook standalone: Triton GPU + CPU dual-path VQ, dynamic tile sizing, rotation trick, EMA + dead code reset (Wave 2, EVAL-06) +- [x] 08-03-PLAN.md — FlashVQ integration: swap VectorQuantize in VQAdapter + ConvVQCodebook, update log_vq_metrics, verify no regression (Wave 3, EVAL-06) +- [x] 08-04-PLAN.md — Profiling + optimization: torch.profiler wrapper, benchmark harness, torch.compile (exclude ACT), TorchAO 2:4 sparsity (non-ternary only), <5% BPB regression bar (Wave 4, OPT-01–03) + +**Verification:** BPB <1.5 on enwik8, generation quality acceptable, FlashVQ reduces HBM traffic, optimization provides measurable throughput gains without >5% accuracy regression. + +--- + +### Phase 9: True Ternary Exponent Dynamics +**Goal:** Roll back the FP8 E buffer experiment (Waves 1-2) and implement the correct true ternary architecture: int8 E restored, EMA-based E updates with group gradient statistics, LossComponent temperature routing for update energy allocation, and multi-scale lattice ΔE proposals. This replaces the FP8 approach with the mathematically correct logarithmic scaling system. + +**Motivation:** The FP8 E buffer (float8_e4m3fn) reintroduces IEEE float mantissa/exponent into a system designed to eliminate it, violating the "no IEEE float in weight state" principle. The correct architecture stores only integer exponents (E) and derives S = 2^E implicitly. Precision comes from logarithmic dynamics (EMA with statistical guidance), not storage bit width. See `.planning/notes/true-ternary-architecture-principles.md` for full rationale. 
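A small sketch of what "store only integer exponents and derive S = 2^E" means in practice, assuming one exponent per weight for simplicity (the project also describes per-group scales). The function name `dequantize` is hypothetical. `math.ldexp` scales a float by an exact power of two, so no float scale is ever stored.

```python
import math
from typing import List


def dequantize(T: List[int], E: List[int]) -> List[float]:
    """Rebuild weights W = 2^E * T from ternary signs and integer exponents."""
    if len(T) != len(E):
        raise ValueError("T and E must have the same length")
    weights = []
    for t, e in zip(T, E):
        if t not in (-1, 0, 1):
            raise ValueError(f"non-ternary value: {t}")
        # ldexp(t, e) computes t * 2**e exactly; the scale exists only
        # implicitly through the integer exponent e.
        weights.append(math.ldexp(float(t), e))
    return weights


example = dequantize([1, -1, 0, 1], [0, 3, 5, -2])
```

Here `example` is `[1.0, -8.0, 0.0, 0.25]`. Changing an exponent by one doubles or halves the weight, which is why the update rule for E operates in log space.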
+ +**Requirements:** TERN-E-01–05 (replaces HYB-01–06) + +**Depends on:** Phase 8 (need evaluated + optimized model baseline) + +**Plans:** 3 plans in 3 waves + +Plans: +- [ ] 09-01-PLAN.md — Roll back FP8 E to int8: restore int8 E buffer in TernaryScaleTensor/ByteEmbedding/TernaryRMSNorm, revert 5 Triton forward kernels from FP8 load to int8+exp2, revert 2 E update kernels to int8 arithmetic, remove FP8 tests, restore exact-match update_E tests +- [ ] 09-02-PLAN.md — Implement EMA-based E update with group gradient statistics: replace SignSGD update_E with `E = (1-α)*E + α*round(log2(μ_g))`, verify stability on boundary values, update ByteEmbedding.update_E +- [ ] 09-03-PLAN.md — Wire LossComponent temperature routing + multi-scale lattice: LossComponent → a(update energy), scale lattice ΔE proposals, merged update to consensus E + +**Verification:** No float8_e4m3fn references remain. All 140+ tests pass on int8 E path. E update uses EMA with group gradient statistics. LossComponent signal reaches update_E. No loss spike at step 2. ternary_audit passes without FP8 exclusions. + +--- + +### Phase 10: Multimodal Fusion + Output Routing +**Goal:** Extend MORPH beyond text-only generation to video and speech output. Add an OutputRouter that routes 512-dim relational tokens to ByteHead (text), VideoHead (latent diffusion with cross-attention conditioning, ACT adaptive steps), or TalkerHead (byte-vocab token prediction + TinyNeuralCodec decoder). Vocabulary expands by 8 special tokens for modality routing. 
+ +**Requirements:** FUSE-01–03, OUT-01–06 + +**Depends on:** Phase 9 (True Ternary Exponent Dynamics — need stable ternary training) + +**Plans:** 4 plans in 4 waves + +Plans: +- [x] 10-01-PLAN.md — Vocabulary expansion (289→297), OutputRouter gate, ByteHead resizing, sequencer boundary tokens, augment training data with modality markers +- [x] 10-02-PLAN.md — VideoHead: tiny latent diffusion with cross-attention conditioning, ACT adaptive steps (max 6), noise schedule embed, pig-vae sidecar integration (diffusers AutoencoderKLWan, int8) +- [x] 10-03-PLAN.md — TalkerHead: byte-vocab token prediction with temporal stride loop, TinyNeuralCodec (3.11M, conv decoder with MRF blocks, 50 Hz→16kHz), audio VQ encoder for training data prep +- [x] 10-04-PLAN.md — Multi-head training curriculum: sequential freeze-train (text→video→speech), short test runs (5K+ steps) then full (60K+), encoders/ folder for sidecar modules + +**Verification:** Model generates text tokens, `