y3i12 committed on Mar 7

Commit

56e82ec

1 Parent(s): 97022c0

Initial commit

Files changed (17)

README.md +307 -296
__init__.py +28 -0
bench.py +176 -0
coherence_eval.py +834 -0
config.py +306 -0
data.py +546 -0
generate.py +195 -0
graft_g2lu.py +300 -0
layers.py +325 -0
lm_eval_wrapper.py +344 -0
mirrored.py +532 -0
model.py +357 -0
scripts/__init__.py +0 -0
scripts/representation_analysis.py +1014 -0
scripts/spectral_analysis.py +969 -0
scripts/spectral_to_csv.py +202 -0
train.py +637 -0

README.md CHANGED

@@ -1,296 +1,307 @@
----
-license: mit
-datasets:
-- Bingsu/openwebtext_20p
-- HuggingFaceFW/fineweb-edu
-language:
-- en
-pipeline_tag: text-generation
----
-# Prisma
-A prototype model that is assembled as a mirrored transformer architecture with nested gating (adds an extra weight to the FFN) and morphological position encoding. It proposes that the model architecture creates different scaffolding, leading to different training regimens and capabilities.
-Prisma is only viable as it piggybacks on pre-trained tokenizers and their weight-tied embeddings, it decomposes the transformer architecture into symmetric **expand** and **compress** phases that share structural weights, connected by a small number of unique **middle** layers. Information expands from tokens to semantics, then compresses back — like light through a prism.
-```
-Token Embeddings
-      |
-  [ Expand  ]  ─── mirror pair 1 (W1, W2 shared) ── G²LU gate (W3·W4)
-  [ Expand  ]  ─── mirror pair 2
-  [  ....   ]  ─── mirror pair N
-      |
-  [ Middle  ]  ─── unique layers (full capacity, not shared)
-      |
-  [Compress ]  ─── mirror pair N (same W1, W2 as expand N)
-  [Compress ]  ─── mirror pair 2
-  [Compress ]  ─── mirror pair 1
-      |
-   LM Head (weight-tied to embeddings)
-```
-## Key Concepts
-**Mirrored layers.** Each expand layer shares W1 (projection) and W2 (output) weights with its corresponding compress layer. The architecture gets 2N virtual layers of processing from N unique parameter sets. At 357M parameters, Prisma runs 41 virtual layers from ~20 unique weight sets + 1 middle layer.
-**G²LU — Gated-Gated Linear Unit.** The gate is itself gated:
-Where typical gated transformers have `y  = W2 @ (W1 @ x * silu(W3 @ x))`, Prisma has:
-```python
-g4 = silu(W4 @ x)              # inner gate
-g3 = silu(W3 @ x * g4)         # outer gate, modulated by inner
-y  = W2 @ (W1 @ x * g3)        # gated output
-```
-One gate in function of the other. Creates quadratic (saddle-surface) decision boundaries instead of linear hyperplanes — each neuron computes a conjunction ("feature A AND feature B") rather than a single threshold. This produces narrow, separated activation channels that resist memorization and tolerate significantly higher learning rates. Part of the parameters saved with mirroring are re-distributed as W4.
-**WoRPE — Word-position Rotary Position Embedding.** Dedicates a small subspace of each attention head to encode position within a word (0 = prefix, 1 = second subword, ...). The information is already in the BPE tokenizer's word-boundary markers — WoRPE surfaces it geometrically so the model doesn't have to rediscover it. No new tokenizer required.
-**Auxiliary skip prediction.** An optional second head predicts t+K tokens ahead, providing gradient signal that rewards structural representations over local memorization. At K=1, functions as a dual-supervision regularizer through an untied projection.
-## Results
-### ~50M scale prototype (WikiText-103, 4 epochs)
-| Model | Params | LR | WikiText PPL | LAMBADA |
-|---|---|---|---|---|
-| Standard SwiGLU | 51M | 1e-4 | 4125 | 0.002 |
-| Prisma (G²LU) | 47M | 1e-4 | 2914 | 0.001 |
-| Prisma (G²LU + WoRPE) | 51M | 1e-2 | 921 | 0.082 |
-*Standard trained 10 epochs; Prisma (G²LU + WoRPE) shown at 1 epoch — the point is LR tolerance, not epoch-matched comparison.*
-The regularization stack (mirroring + G²LU + WoRPE) enables training at **100x the standard learning rate** without instability.
-### ~350M scale prototype — comparison with published models
-Prisma 357M trained on ~30B tokens (OpenWebText 20% + FineWeb-Edu 10BT continued training), compared against published models at similar scale:
-| Model | Params | Train Data | ARC-C\* | ARC-E\* | BoolQ | HellaSwag\* | LAMBADA | PIQA\* | WikiText\*\* | WinoGrande |
-|---|---|---|---|---|---|---|---|---|---|---|
-| GPT-2 medium | 355M | 40B | 0.250 | 0.436 | 0.586 | 0.394 | **0.430** | 0.664 | **26.75** | **0.531** |
-| Baguettotron | 321M | 200B | 0.302 | 0.506 | 0.589 | 0.354 | 0.294 | 0.618 | 30.93 | 0.530 |
-| SmolLM-360M | 360M | 600B | **0.359** | **0.640** | 0.550 | **0.536** | **0.455** | **0.715** | **19.49** | **0.570** |
-| SmolLM2-360M | 360M | 4000B | **0.381** | **0.681** | 0.617 | 0.431 | **0.532** | **0.718** | **15.67** | **0.586** |
-| LFM2-350M | 350M | 10000B | **0.393** | **0.662** | **0.642** | **0.489** | 0.399 | 0.698 | 25.68 | **0.558** |
-| **Prisma** | **357M** | **30B** | 0.290 | **0.548** | **0.620** | **0.427** | 0.362 | **0.670** | 27.40 | 0.506 |
-\* *normalized accuracy* · \*\* *word perplexity*
-**Key findings:**
-- **Beats GPT-2 medium on 5/8 benchmarks** (ARC-C, ARC-E, BoolQ, HellaSwag, PIQA) with 25% less training data.
-- **Beats Baguettotron (200B) on 6/8 benchmarks** — including PPL — with **7x less data.**
-- **BoolQ 0.620** exceeds all models except LFM2 (10000B) and SmolLM2 (4000B). The anti-memorization properties of G²LU force genuine comprehension instead of statistical shortcuts.
-- **ARC-Easy 0.548** — the largest absolute gain over GPT-2 medium (+11.2pp). FineWeb-Edu knowledge absorbed efficiently through G²LU's relational features.
-- Prisma wins on **reasoning benchmarks** (ARC, HellaSwag, PIQA, BoolQ). Models trained on 20-300x more data win on **content prediction** (LAMBADA, PPL). The architecture trades raw memorization for data-efficient knowledge application.
-### Training progression (~350M)
-| Stage | LR | ARC-C | ARC-E | BoolQ | HellaSwag | LAMBADA | PIQA | PPL |
-|---|---|---|---|---|---|---|---|---|
-| Standard 336M baseline | 1e-4 | 0.228 | 0.341 | 0.618 | 0.280 | 0.226 | 0.574 | 77.2 |
-| Prisma 41L (OWT 20%) | 5e-4 | 0.238 | 0.394 | 0.585 | 0.317 | 0.313 | 0.614 | 44.8 |
-| + WoRPE (OWT 20%) | 1e-3 | 0.247 | 0.397 | 0.595 | 0.331 | 0.333 | 0.614 | 43.5 |
-| + continued (FineWeb c1) | 1e-3 | 0.249 | 0.434 | 0.601 | 0.333 | 0.312 | 0.626 | 34.7 |
-| + continued (FineWeb c2) | 1e-3 | 0.290 | 0.548 | 0.620 | 0.427 | 0.362 | 0.670 | 27.4 |
-## Quick Start
-### Install
-```bash
-pip install -r circuits/requirements.txt
-```
-### Train
-```bash
-# Small Prisma (~47M) on WikiText-103
-python -m circuits.train \
-  --arch mirrored --dims 384 --heads 6 --kv-heads 2 --layers 57 --n-middle 1 \
-  --tokenizer facebook/MobileLLM-125M \
-  --word-rope-dims 8 --word-rope-base 10.0 \
-  --data hf:wikitext:wikitext-103-raw-v1:train \
-  --epochs 4 --batch-size 32 --context-length 512 \
-  --lr 1e-2 --warmup-steps 500 --bf16 --gpu 0
-# 324M Prisma on OpenWebText
-python -m circuits.train \
-  --gpu 0 --compile --bf16 --arch mirrored \
-  --dims 1024 --heads 16 --kv-heads 4 --layers 41 --n-middle 1 \
-  --word-rope-dims 8 --word-rope-base 10.0 \
-  --tokenizer facebook/MobileLLM-125M \
-  --data hf:Bingsu/openwebtext_20p --text-column text \
-  --epochs 4 --batch-size 12 --context-length 1024 --grad-accum 42 \
-  --lr 1e-3 --warmup 500 \
-  --log-every 5 --val-every 1000 --save-every 1000 \
-  --checkpoint-dir path/to/checkpoints/your_model/
-```
-### Generate
-```bash
-python -m circuits.generate \
-  --checkpoint path/to/checkpoints/your_model/best.pt \
-  --prompt "A thought observing itself discovers that it" \
-  --max-tokens 256 --temperature 0.8 --top-p 0.95
-```
-### Benchmark
-```bash
-# Single model
-python -m circuits.bench \
-  --checkpoint path/to/checkpoints/your_model/best.pt \
-  --tasks arc_easy,lambada_openai,piqa,hellaswag,winogrande,wikitext \
-  --gpu 0
-```
-## CLI Reference
-### Architecture
-| Flag | Default | Description |
-|---|---|---|
-| `--arch` | `mirrored` | Architecture: `standard`, `mirrored`, `graft_g2lu` (experimental) |
-| `--dims` | 512 | Hidden dimension |
-| `--heads` | 8 | Number of attention heads |
-| `--kv-heads` | — | KV heads for GQA (omit = MHA) |
-| `--layers` | 12 | Total virtual layers (expand + middle + compress) |
-| `--n-middle` | 2 | Unique (non-mirrored) middle layers |
-### Prisma-Specific
-| Flag | Default | Description |
-|---|---|---|
-| `--word-rope-dims` | 0 | Head dims for WoRPE (0 = disabled, try 8) |
-| `--word-rope-base` | 10.0 | WoRPE frequency base |
-| `--aux-skip` | 0 | Skip-ahead prediction distance (0 = disabled) |
-| `--aux-weight` | 0.1 | Weight for auxiliary loss |
-| `--no-g2lu` | — | Disable G²LU, use standard SwiGLU in mirrored arch |
-### Training
-| Flag | Default | Description |
-|---|---|---|
-| `--lr` | 3e-4 | Peak learning rate |
-| `--min-lr` | 0.0 | LR floor for cosine schedule |
-| `--warmup-steps` | 100 | LR warmup steps |
-| `--epochs` | 10 | Training epochs |
-| `--batch-size` | 32 | Micro-batch size |
-| `--grad-accum` | 1 | Gradient accumulation steps |
-| `--context-length` | 512 | Sequence length |
-| `--bf16` / `--fp16` | — | Mixed precision |
-| `--compile` | — | `torch.compile` the model |
-### Data
-| Flag | Default | Description |
-|---|---|---|
-| `--data` | — | Path or `hf:dataset_name` |
-| `--text-column` | `text` | Column name for HF datasets |
-| `--tokenizer` | `gpt2` | Tokenizer name or path |
-| `--num-samples` | — | Limit dataset size |
-## Architecture Details
-### Why Mirroring Works
-Mirroring only works due to the additional gate. W3 and W4 specialize to serve different roles despite sharing weights — spectral analysis confirms the gates swap their stable-rank profiles at the architectural midpoint. The order of mirror layers may be rearrangeable, as the gates adapt to whatever representations flow through them.
-### Why G²LU Works
-Standard SwiGLU creates hyperplane decision boundaries — broad, overlapping activation regions. G²LU's nested gate creates **saddle surfaces** — narrow activation bands with isolation gaps (like a spectral comb filter). This has three effects:
-1. **Anti-memorization.** The gate geometry cannot form sharp, input-specific activations. The model is forced toward broad, relational features.
-2. **Higher LR tolerance.** Narrow activation bands leave headroom between features. Large gradient updates shift features within their bands without colliding.
-3. **Compositional detection.** Each neuron natively computes conjunctions (A AND B), not just thresholds. Might be useful for morphology, syntax, and structural reasoning.
-G²LU can be seen as occupying a point between standard GLU (fixed activation, fixed gate) and KAN (fully learned activations): the activation function is fixed (silu), but its effective shape adapts per-input through the nested gate.
-### Why WoRPE Works
-BPE tokenizers already mark word boundaries (`Ġ` for GPT-2, `▁` for SentencePiece). WoRPE surfaces this information geometrically in a dedicated subspace of the rotary embedding, so the model gets word-internal position for free instead of rediscovering it from attention patterns. Requires G²LU to exploit effectively — the saddle surfaces compute morphological conjunctions ("position-0 AND prefix-pattern") that single gates cannot.
-### Why Everything Works Together
-The optimization landscape of this architecture is substantially more complex than a standard transformer — shared weights must serve both directions, nested gates must coordinate, and the hourglass bottleneck constrains information flow. This appears to be only tractable when anchored by pre-trained, weight-tied embeddings that provide a stable coordinate system. The frozen embeddings give the model fixed reference geometry, allowing convergence despite the architectural complexity.
-## File Map
-```
-circuits/
-  config.py          — CLI arguments, presets, CircuitConfig
-  layers.py          — RMSNorm, RoPE, WoRPE, CausalAttention, SwiGLU
-  model.py           — CircuitTransformer (standard baseline)
-  mirrored.py        — MirroredTransformer, G²LU, MirroredBlock
-  train.py           — Training loop, LR schedule, checkpointing
-  data.py            — MemmapDataset, parallel tokenization, HF/text loading
-  generate.py        — Text generation with KV caching
-  bench.py           — Benchmark runner and comparison tables
-  lm_eval_wrapper.py — EleutherAI lm-eval harness integration
-  graft_g2lu.py      — Surgical G²LU upgrade for pretrained models (experimental/untested)
-  scripts/           — Analysis scripts
-```
-## Origin
-Prisma grew from interpretability research on _layer grafting_ (writing in progress) in Llama 3.2, which suggests that one of the ways that transformers might self organize to process language can be seen as like a mirrored structure that expands from tokens to semantics, then compressing back — bringing the interpretive analogy of seeing it as a biconvex lens with fractures or polarizing filters within its body. If the two halves are symmetric structurally, they can share weights. The gate (fractures/polarizing filters) becomes the minimum surgical unit for changing behavior. A single weightset becomes insufficient due to shared weights, which brought the question of how to properly make two gates efficiently collaborate.
-G²LU emerged from the observation that for a pair of gates to be expressive and atomic, _one gate needs to be in function of the other_.
-WoRPE emerged from noticing, that tokenizers already carry word structure but positional encodings ignore it — providing hints to the model allows faster convergence during training.
-The architecture is a processing engine that plugs into pretrained tokenizer embeddings. The tokenizer is load-bearing infrastructure — Prisma operates within a pre-existing coordinate system.
-## Developer Notes
-This model is the outcome of a POC done by a single individual with limited resources, further investigation, training and tests are being slowly conducted as time and conditions allow.
-The proposed architecture was only fully trained on top of `facebook/MobileLLM-125M` tokenizer and weight-tied embeddings. It might be the case that it doesn't work as expected on untied embeddings and it is highly likely that it is impossible to train a model with this architecture without a pre-trained tokenizer.
-Different arrangements of the architecture (varying middle layer count, mirror depth, width) would likely produce different results. Only this setup — with 1 middle layer — was tested, as a validation of whether the architecture works at all. The extreme case was chosen deliberately: if the bottleneck configuration most prone to failure still produces competitive results, less constrained configurations should too.
-Factorized dimensions for embeddings and an intermediate down proj before the output head were attempted, and nothing useful came out of it.
-It is completely unknown if the architecture is beneficial for larger models (1B+) — observations suggests it might.
-## Training
-- **Architecture**:
-  - 41 layers
-    - 20 with shared W1 and W2
-    - 1 unique
-  - 1024 dimms
-  - 16 GQA heads, 4 KV heads (4:1)
-  - vocab size 32k
-  - RoPE + WoRPE + G²LU
-- **Pretraining tokens**: 30B
-- **Precision**: bfloat16
-- **Tokenizer/Embeddings**: facebook/MobileLLM-125M
-- **Hardware**: 1 H100
-## Disclaimer
-This model is developed as a research model and it hasn't been tested thoroughly regarding synthesis and coherence quality, as its size is somewhat limiting. Use it at your own risk.

+---
+license: mit
+datasets:
+- Bingsu/openwebtext_20p
+- HuggingFaceFW/fineweb-edu
+language:
+- en
+pipeline_tag: text-generation
+---
+# Prisma
+A prototype language model built as a mirrored transformer with nested gating (one extra weight matrix in each FFN) and morphological position encoding. It proposes that the model architecture creates different scaffolding, which leads to different training regimes and capabilities.
+Prisma is only viable because it piggybacks on a pre-trained tokenizer and its weight-tied embeddings. It decomposes the transformer architecture into symmetric **expand** and **compress** phases that share structural weights, connected by a small number of unique **middle** layers. Information expands from tokens to semantics, then compresses back, like light through a prism.
+```
+Token Embeddings
+      |
+  [ Expand  ]  ─── mirror pair 1 (W1, W2 shared) ── G²LU gate (W3·W4)
+  [ Expand  ]  ─── mirror pair 2
+  [  ....   ]  ─── mirror pair N
+      |
+  [ Middle  ]  ─── unique layers (full capacity, not shared)
+      |
+  [Compress ]  ─── mirror pair N (same W1, W2 as expand N)
+  [Compress ]  ─── mirror pair 2
+  [Compress ]  ─── mirror pair 1
+      |
+   LM Head (weight-tied to embeddings)
+```
+## Key Concepts
+**Mirrored layers.** Each expand layer shares W1 (projection) and W2 (output) weights with its corresponding compress layer. The architecture gets 2N virtual layers of processing from N unique parameter sets. At 357M parameters, Prisma runs 41 virtual layers from ~20 unique weight sets + 1 middle layer.
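
The layer arithmetic in the paragraph above can be sketched as follows. This is a minimal illustration under stated assumptions, not code from the repository: `mirrored_schedule` and its `(phase, index)` labels are hypothetical names, and the ordering simply follows the diagram (expand pairs 1..N, the middle layers, then compress pairs N..1).

```python
def mirrored_schedule(total_layers, n_middle):
    """Return the virtual layer order as (phase, index) tuples.

    Hypothetical helper: assumes every non-middle layer belongs to a
    mirror pair, so (total_layers - n_middle) must be even.
    """
    mirrored = total_layers - n_middle
    if mirrored < 0 or mirrored % 2 != 0:
        raise ValueError("total_layers - n_middle must be a non-negative even number")
    n_pairs = mirrored // 2
    expand = [("expand", i) for i in range(1, n_pairs + 1)]
    middle = [("middle", i) for i in range(1, n_middle + 1)]
    # Compress walks the same pairs in reverse, reusing each pair's W1 and W2.
    compress = [("compress", i) for i in range(n_pairs, 0, -1)]
    return expand + middle + compress


schedule = mirrored_schedule(total_layers=41, n_middle=1)
n_pairs = sum(1 for phase, _ in schedule if phase == "expand")
unique_weight_sets = n_pairs + sum(1 for phase, _ in schedule if phase == "middle")
print(len(schedule), n_pairs, unique_weight_sets)
```

With the 41-layer, 1-middle configuration quoted above, this yields 20 mirror pairs and 21 sets of shared or unique W1/W2 weights serving 41 virtual layers.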
+**G²LU — Gated-Gated Linear Unit.** The gate is itself gated:
+Where typical gated transformers have `y  = W2 @ (W1 @ x * silu(W3 @ x))`, Prisma has:
+```python
+g4 = silu(W4 @ x)              # inner gate
+g3 = silu(W3 @ x * g4)         # outer gate, modulated by inner
+y  = W2 @ (W1 @ x * g3)        # gated output
+```
+One gate is a function of the other. This creates quadratic (saddle-surface) decision boundaries instead of linear hyperplanes: each neuron computes a conjunction ("feature A AND feature B") rather than a single threshold. This produces narrow, separated activation channels that resist memorization and tolerate significantly higher learning rates. Part of the parameter budget saved by mirroring is redistributed to W4.
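
The three-line formula above can be made runnable in plain Python for a single input vector. This is a sketch under stated assumptions, not the implementation in `mirrored.py`: the helper names (`matvec`, `random_matrix`, `swiglu`, `g2lu`), the tiny dimensions, and the random weights are all illustrative, and biases and normalization are omitted.

```python
import math
import random


def silu(v):
    """SiLU activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def swiglu(x, W1, W2, W3):
    """Standard gated FFN: y = W2 @ (W1 @ x * silu(W3 @ x))."""
    gate = [silu(v) for v in matvec(W3, x)]
    return matvec(W2, [u * g for u, g in zip(matvec(W1, x), gate)])


def g2lu(x, W1, W2, W3, W4):
    """Nested gate: the W3 gate is modulated by the W4 gate before SiLU."""
    g4 = [silu(v) for v in matvec(W4, x)]                    # inner gate
    g3 = [silu(a * b) for a, b in zip(matvec(W3, x), g4)]    # outer gate
    return matvec(W2, [u * g for u, g in zip(matvec(W1, x), g3)])


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d_model, d_ff = 4, 8  # illustrative sizes only
W1, W3, W4 = (random_matrix(d_ff, d_model, rng) for _ in range(3))
W2 = random_matrix(d_model, d_ff, rng)
x = [rng.uniform(-1.0, 1.0) for _ in range(d_model)]

y = g2lu(x, W1, W2, W3, W4)
print(len(y), len(swiglu(x, W1, W2, W3)))
```

One consequence of the nesting is visible directly: wherever the inner gate `g4` is zero, the outer gate sees `silu(0) = 0`, so that hidden unit contributes nothing regardless of what `W3 @ x` says.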
+**WoRPE — Word-position Rotary Position Embedding.** Dedicates a small subspace of each attention head to encode position within a word (0 = prefix, 1 = second subword, ...). The information is already in the BPE tokenizer's word-boundary markers — WoRPE surfaces it geometrically so the model doesn't have to rediscover it. No new tokenizer required.
+**Auxiliary skip prediction.** An optional second head predicts t+K tokens ahead, providing gradient signal that rewards structural representations over local memorization. At K=1, functions as a dual-supervision regularizer through an untied projection.
+## Results
+### ~50M scale prototype (WikiText-103, 4 epochs)
+| Model | Params | LR | WikiText PPL | LAMBADA |
+|---|---|---|---|---|
+| Standard SwiGLU | 51M | 1e-4 | 4125 | 0.002 |
+| Prisma (G²LU) | 47M | 1e-4 | 2914 | 0.001 |
+| Prisma (G²LU + WoRPE) | 51M | 1e-2 | 921 | 0.082 |
+*Standard trained 10 epochs; Prisma (G²LU + WoRPE) shown at 1 epoch — the point is LR tolerance, not epoch-matched comparison.*
+The regularization stack (mirroring + G²LU + WoRPE) enables training at **100x the standard learning rate** without instability.
+### ~350M scale prototype — comparison with published models
+Prisma 357M trained on ~30B tokens (OpenWebText 20% + FineWeb-Edu 10BT continued training), compared against published models at similar scale:
+| Model | Params | Train Data | ARC-C\* | ARC-E\* | BoolQ | HellaSwag\* | LAMBADA | PIQA\* | WikiText\*\* | WinoGrande |
+|---|---|---|---|---|---|---|---|---|---|---|
+| GPT-2 medium | 355M | 40B | 0.250 | 0.436 | 0.586 | 0.394 | **0.430** | 0.664 | **26.75** | **0.531** |
+| Baguettotron | 321M | 200B | 0.302 | 0.506 | 0.589 | 0.354 | 0.294 | 0.618 | 30.93 | 0.530 |
+| SmolLM-360M | 360M | 600B | **0.359** | **0.640** | 0.550 | **0.536** | **0.455** | **0.715** | **19.49** | **0.570** |
+| SmolLM2-360M | 360M | 4000B | **0.381** | **0.681** | 0.617 | 0.431 | **0.532** | **0.718** | **15.67** | **0.586** |
+| LFM2-350M | 350M | 10000B | **0.393** | **0.662** | **0.642** | **0.489** | 0.399 | 0.698 | 25.68 | **0.558** |
+| **Prisma** | **357M** | **30B** | 0.290 | **0.548** | **0.620** | **0.427** | 0.362 | **0.670** | 27.40 | 0.506 |
+\* *normalized accuracy* · \*\* *word perplexity*
+**Key findings:**
+- **Beats GPT-2 medium on 5/8 benchmarks** (ARC-C, ARC-E, BoolQ, HellaSwag, PIQA) with 25% less training data.
+- **Beats Baguettotron (200B) on 6/8 benchmarks** — including PPL — with **7x less data.**
+- **BoolQ 0.620** exceeds every model in the table except LFM2 (10000B), including SmolLM2 (0.617, trained on 4000B). The anti-memorization properties of G²LU force genuine comprehension instead of statistical shortcuts.
+- **ARC-Easy 0.548** — the largest absolute gain over GPT-2 medium (+11.2pp). FineWeb-Edu knowledge absorbed efficiently through G²LU's relational features.
+- Prisma wins on **reasoning benchmarks** (ARC, HellaSwag, PIQA, BoolQ). Models trained on 20-300x more data win on **content prediction** (LAMBADA, PPL). The architecture trades raw memorization for data-efficient knowledge application.
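
The win counts in the findings above can be rechecked from the comparison table. The sketch below copies three rows of that table verbatim; the names `SCORES`, `HIGHER_IS_BETTER`, and `wins` are hypothetical, and the only assumption is that WikiText word perplexity is the one column where lower is better.

```python
# Only WikiText (word perplexity) is lower-is-better; the rest are accuracies.
HIGHER_IS_BETTER = {
    "ARC-C": True, "ARC-E": True, "BoolQ": True, "HellaSwag": True,
    "LAMBADA": True, "PIQA": True, "WikiText": False, "WinoGrande": True,
}

# Rows copied from the ~350M comparison table.
SCORES = {
    "GPT-2 medium": {"ARC-C": 0.250, "ARC-E": 0.436, "BoolQ": 0.586, "HellaSwag": 0.394,
                     "LAMBADA": 0.430, "PIQA": 0.664, "WikiText": 26.75, "WinoGrande": 0.531},
    "Baguettotron": {"ARC-C": 0.302, "ARC-E": 0.506, "BoolQ": 0.589, "HellaSwag": 0.354,
                     "LAMBADA": 0.294, "PIQA": 0.618, "WikiText": 30.93, "WinoGrande": 0.530},
    "Prisma": {"ARC-C": 0.290, "ARC-E": 0.548, "BoolQ": 0.620, "HellaSwag": 0.427,
               "LAMBADA": 0.362, "PIQA": 0.670, "WikiText": 27.40, "WinoGrande": 0.506},
}


def wins(candidate, baseline):
    """List the tasks on which `candidate` strictly beats `baseline`."""
    won = []
    for task, higher in HIGHER_IS_BETTER.items():
        a, b = SCORES[candidate][task], SCORES[baseline][task]
        if (a > b) if higher else (a < b):
            won.append(task)
    return won


for baseline in ("GPT-2 medium", "Baguettotron"):
    won = wins("Prisma", baseline)
    print(f"Prisma vs {baseline}: {len(won)}/{len(HIGHER_IS_BETTER)} {won}")
```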
+### Training progression (~350M)
+| Stage | LR | ARC-C | ARC-E | BoolQ | HellaSwag | LAMBADA | PIQA | PPL |
+|---|---|---|---|---|---|---|---|---|
+| Standard 336M baseline | 1e-4 | 0.228 | 0.341 | 0.618 | 0.280 | 0.226 | 0.574 | 77.2 |
+| Prisma 41L (OWT 20%) | 5e-4 | 0.238 | 0.394 | 0.585 | 0.317 | 0.313 | 0.614 | 44.8 |
+| + WoRPE (OWT 20%) | 1e-3 | 0.247 | 0.397 | 0.595 | 0.331 | 0.333 | 0.614 | 43.5 |
+| + continued (FineWeb c1) | 1e-3 | 0.249 | 0.434 | 0.601 | 0.333 | 0.312 | 0.626 | 34.7 |
+| + continued (FineWeb c2) | 1e-3 | 0.290 | 0.548 | 0.620 | 0.427 | 0.362 | 0.670 | 27.4 |
+## Quick Start
+### Install
+```bash
+pip install -r Prisma/requirements.txt
+```
+### Train
+```bash
+# Small Prisma (~47M) on WikiText-103
+python -m Prisma.train \
+  --arch mirrored --dims 384 --heads 6 --kv-heads 2 --layers 57 --n-middle 1 \
+  --tokenizer facebook/MobileLLM-125M \
+  --word-rope-dims 8 --word-rope-base 10.0 \
+  --data hf:wikitext:wikitext-103-raw-v1:train \
+  --epochs 4 --batch-size 32 --context-length 512 \
+  --lr 1e-2 --warmup-steps 500 --bf16 --gpu 0
+# 324M Prisma on OpenWebText
+python -m Prisma.train \
+  --gpu 0 --compile --bf16 --arch mirrored \
+  --dims 1024 --heads 16 --kv-heads 4 --layers 41 --n-middle 1 \
+  --word-rope-dims 8 --word-rope-base 10.0 \
+  --tokenizer facebook/MobileLLM-125M \
+  --data hf:Bingsu/openwebtext_20p --text-column text \
+  --epochs 4 --batch-size 12 --context-length 1024 --grad-accum 42 \
+  --lr 1e-3 --warmup-steps 500 \
+  --log-every 5 --val-every 1000 --save-every 1000 \
+  --checkpoint-dir path/to/checkpoints/your_model/
+```
+### Generate
+```bash
+python -m Prisma.generate \
+  --checkpoint path/to/checkpoints/your_model/best.pt \
+  --prompt "A thought observing itself discovers that it" \
+  --max-tokens 256 --temperature 0.8 --top-p 0.95
+```
+### Benchmark
+```bash
+# Single model
+python -m Prisma.bench \
+  --checkpoint path/to/checkpoints/your_model/best.pt \
+  --tasks arc_easy,lambada_openai,piqa,hellaswag,winogrande,wikitext \
+  --gpu 0
+```
+## CLI Reference
+### Architecture
+| Flag | Default | Description |
+|---|---|---|
+| `--arch` | `mirrored` | Architecture: `standard`, `mirrored`, `graft_g2lu` (experimental) |
+| `--dims` | 512 | Hidden dimension |
+| `--heads` | 8 | Number of attention heads |
+| `--kv-heads` | — | KV heads for GQA (omit = MHA) |
+| `--layers` | 12 | Total virtual layers (expand + middle + compress) |
+| `--n-middle` | 2 | Unique (non-mirrored) middle layers |
+### Prisma-Specific
+| Flag | Default | Description |
+|---|---|---|
+| `--word-rope-dims` | 0 | Head dims for WoRPE (0 = disabled, try 8) |
+| `--word-rope-base` | 10.0 | WoRPE frequency base |
+| `--aux-skip` | 0 | Skip-ahead prediction distance (0 = disabled) |
+| `--aux-weight` | 0.1 | Weight for auxiliary loss |
+| `--no-g2lu` | — | Disable G²LU, use standard SwiGLU in mirrored arch |
+### Training
+| Flag | Default | Description |
+|---|---|---|
+| `--lr` | 3e-4 | Peak learning rate |
+| `--min-lr` | 0.0 | LR floor for cosine schedule |
+| `--warmup-steps` | 100 | LR warmup steps |
+| `--epochs` | 10 | Training epochs |
+| `--batch-size` | 32 | Micro-batch size |
+| `--grad-accum` | 1 | Gradient accumulation steps |
+| `--context-length` | 512 | Sequence length |
+| `--bf16` / `--fp16` | — | Mixed precision |
+| `--compile` | — | `torch.compile` the model |
+### Data
+| Flag | Default | Description |
+|---|---|---|
+| `--data` | — | Path or `hf:dataset_name` |
+| `--text-column` | `text` | Column name for HF datasets |
+| `--tokenizer` | `gpt2` | Tokenizer name or path |
+| `--num-samples` | — | Limit dataset size |
+## Architecture Details
+### Why Mirroring Works
+Mirroring only works due to the additional gate. W3 and W4 specialize to serve different roles despite sharing weights — spectral analysis confirms the gates swap their stable-rank profiles at the architectural midpoint. The order of mirror layers may be rearrangeable, as the gates adapt to whatever representations flow through them.
+### Why G²LU Works
+Standard SwiGLU creates hyperplane decision boundaries — broad, overlapping activation regions. G²LU's nested gate creates **saddle surfaces** — narrow activation bands with isolation gaps (like a spectral comb filter). This has three effects:
+1. **Anti-memorization.** The gate geometry cannot form sharp, input-specific activations. The model is forced toward broad, relational features.
+2. **Higher LR tolerance.** Narrow activation bands leave headroom between features. Large gradient updates shift features within their bands without colliding.
+3. **Compositional detection.** Each neuron natively computes conjunctions (A AND B), not just thresholds. This might be useful for morphology, syntax, and structural reasoning.
+G²LU can be seen as occupying a point between standard GLU (fixed activation, fixed gate) and KAN (fully learned activations): the activation function is fixed (silu), but its effective shape adapts per-input through the nested gate.
+### Why WoRPE Works
+BPE tokenizers already mark word boundaries (`Ġ` for GPT-2, `▁` for SentencePiece). WoRPE surfaces this information geometrically in a dedicated subspace of the rotary embedding, so the model gets word-internal position for free instead of rediscovering it from attention patterns. Requires G²LU to exploit effectively — the saddle surfaces compute morphological conjunctions ("position-0 AND prefix-pattern") that single gates cannot.
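
To make the idea of word-internal position concrete, here is a minimal sketch that derives it from the boundary markers mentioned above. It is an illustration under assumptions, not the code in `layers.py`: `word_positions` and `worpe_angles` are hypothetical names, the first token of a sequence is assumed to start a word, punctuation is not treated specially, and the angle formula is the usual rotary frequency schedule applied to the word position, using the `--word-rope-dims 8` and `--word-rope-base 10.0` values from the examples.

```python
WORD_START_MARKERS = ("Ġ", "▁")  # GPT-2 style and SentencePiece style


def word_positions(tokens):
    """Position of each token inside its word (0 = first subword)."""
    positions = []
    pos = 0
    for i, tok in enumerate(tokens):
        if i == 0 or tok.startswith(WORD_START_MARKERS):
            pos = 0          # a marker (or sequence start) opens a new word
        else:
            pos += 1         # continuation subword of the current word
        positions.append(pos)
    return positions


def worpe_angles(pos, dims=8, base=10.0):
    """Assumed RoPE-style rotation angles for one word position.

    dims rotary dimensions give dims // 2 rotation pairs, each with
    frequency base ** (-2 * i / dims).
    """
    return [pos * base ** (-2.0 * i / dims) for i in range(dims // 2)]


tokens = ["Un", "believ", "able", "Ġtoken", "ization"]
print(word_positions(tokens))  # [0, 1, 2, 0, 1]
print(worpe_angles(2))
```

Because the positions reset at every word start, the rotation only distinguishes subwords inside a word; ordinary RoPE in the remaining head dimensions still carries the absolute sequence position.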
+### Why Everything Works Together
+The optimization landscape of this architecture is substantially more complex than a standard transformer — shared weights must serve both directions, nested gates must coordinate, and the hourglass bottleneck constrains information flow. This appears to be only tractable when anchored by pre-trained, weight-tied embeddings that provide a stable coordinate system. The frozen embeddings give the model fixed reference geometry, allowing convergence despite the architectural complexity.
+## File Map
+```
+Prisma/
+  config.py          — CLI arguments, presets, CircuitConfig
+  layers.py          — RMSNorm, RoPE, WoRPE, CausalAttention, SwiGLU
+  model.py           — CircuitTransformer (standard baseline)
+  mirrored.py        — MirroredTransformer, G²LU, MirroredBlock
+  train.py           — Training loop, LR schedule, checkpointing
+  data.py            — MemmapDataset, parallel tokenization, HF/text loading
+  generate.py        — Text generation with KV caching
+  bench.py           — Benchmark runner and comparison tables
+  lm_eval_wrapper.py — EleutherAI lm-eval harness integration
+  graft_g2lu.py      — Surgical G²LU upgrade for pretrained models (experimental/untested)
+  scripts/           — Analysis scripts
+```
+## Origin
+Prisma grew from interpretability research on _layer grafting_ in Llama 3.2 (write-up in progress). That work suggests that one way transformers may self-organize to process language resembles a mirrored structure that expands from tokens to semantics and then compresses back, which invites the analogy of a biconvex lens with fractures or polarizing filters within its body. If the two halves are structurally symmetric, they can share weights. The gate (the fractures or polarizing filters) becomes the minimum surgical unit for changing behavior. Once weights are shared, a single weight set is no longer sufficient, which raised the question of how to make two gates collaborate efficiently.
+G²LU emerged from the observation that for a pair of gates to be expressive and atomic, _one gate needs to be a function of the other_.
+WoRPE emerged from noticing that tokenizers already carry word structure but positional encodings ignore it. Providing these hints to the model allows faster convergence during training.
+The architecture is a processing engine that plugs into pretrained tokenizer embeddings. The tokenizer is load-bearing infrastructure — Prisma operates within a pre-existing coordinate system.
+## Developer Notes
+This model is the outcome of a proof of concept done by a single individual with limited resources. Further investigation, training, and tests are being conducted slowly, as time and conditions allow.
+The proposed architecture was only fully trained on top of the `facebook/MobileLLM-125M` tokenizer and its weight-tied embeddings. It may not work as expected with untied embeddings, and it is highly likely that a model with this architecture cannot be trained without a pre-trained tokenizer.
+Different arrangements of the architecture (varying middle layer count, mirror depth, width) would likely produce different results. Only this setup — with 1 middle layer — was tested, as a validation of whether the architecture works at all. The extreme case was chosen deliberately: if the bottleneck configuration most prone to failure still produces competitive results, less constrained configurations should too.
+Factorized dimensions for embeddings and an intermediate down proj before the output head were attempted, and nothing useful came out of it.
+It is completely unknown whether the architecture is beneficial for larger models (1B+), although observations suggest it might be.
+## Training
+- **Architecture**:
+  - 41 layers
+    - 20 with shared W1 and W2
+    - 1 unique
+  - 1024 dims
+  - 16 GQA heads, 4 KV heads (4:1)
+  - vocab size 32k
+  - RoPE + WoRPE + G²LU
+- **Pretraining tokens**: 30B
+- **Precision**: bfloat16
+- **Tokenizer/Embeddings**: facebook/MobileLLM-125M
+- **Hardware**: 1 H100
+## Disclaimer
+This model is developed as a research model and it hasn't been tested thoroughly regarding synthesis and coherence quality, as its size is somewhat limiting. Use it at your own risk.
+## Citation
+```
+@misc{ivatchkovitch2026prisma,
+      title={Prisma: Interpretability-Inspired Mirrored Transformer Architecture},
+      author={Yuri Ivatchkovitch},
+      year={2026},
+      howpublished={\url{https://huggingface.co/y3i12/Prisma}},
+}
+```

__init__.py ADDED

	@@ -0,0 +1,28 @@

+"""
+Circuits: Minimal Transformer for Semantic Circuitry Experiments.
+A clean, self-contained transformer implementation designed for
+experimenting with neural networks.
+"""
+from .config import CircuitConfig
+from .model import CircuitTransformer, count_parameters
+from .mirrored import MirroredConfig, MirroredTransformer, count_mirrored_parameters
+from .data import get_tokenizer, load_data, create_dataloader, TextDataset
+from .graft_g2lu import G2LU_GraftedModel, G2LU_MLP, load_g2lu_model
+__all__ = [
+    "CircuitConfig",
+    "CircuitTransformer",
+    "count_parameters",
+    "MirroredConfig",
+    "MirroredTransformer",
+    "count_mirrored_parameters",
+    "get_tokenizer",
+    "load_data",
+    "create_dataloader",
+    "TextDataset",
+    "G2LU_GraftedModel",
+    "G2LU_MLP",
+    "load_g2lu_model",
+]

bench.py ADDED

	@@ -0,0 +1,176 @@

+#!/usr/bin/env python3
+"""
+Benchmark Circuit transformer family against standard LM tasks.
+Usage:
+    # Single model
+    python -m circuits.bench --checkpoint circuits/checkpoints/slot_local_mirrored/best.pt --gpu 0
+    # Compare all architectures
+    python -m circuits.bench --compare --gpu 0
+    # Quick sanity check (100 samples per task)
+    python -m circuits.bench --compare --gpu 0 --limit 100
+    # Specific tasks
+    python -m circuits.bench --checkpoint path/to/best.pt --tasks hellaswag,lambada_openai
+"""
+import argparse
+import json
+import time
+import torch
+from pathlib import Path
+import lm_eval
+from lm_eval.api.registry import register_model
+from .lm_eval_wrapper import CircuitLM
+# Register so lm_eval can find it
+register_model("circuit")(CircuitLM)
+DEFAULT_TASKS = "arc_challenge,arc_easy,boolq,hellaswag,lambada_openai,piqa,wikitext,winogrande"
+# Known checkpoints for --compare mode
+CHECKPOINTS = {
+    "standard_12L": "circuits/checkpoints/flat/best.pt",
+    "mirrored_9L_wide": "circuits/checkpoints/hier_wide_2/best.pt",
+    "mirrored_15L_deep": "circuits/checkpoints/hier_resized/best.pt",
+    "slot_local_mirrored": "circuits/checkpoints/slot_local_mirrored/best.pt",
+}
+def run_benchmark(checkpoint: str, tasks: str, device: str, limit: int = None, batch_size: int = 1, compile: bool = False):
+    """Run lm-eval on a single checkpoint."""
+    model_args = f"checkpoint={checkpoint},device={device},batch_size={batch_size},compile={'true' if compile else 'false'}"
+    task_list = tasks.split(",")
+    results = lm_eval.simple_evaluate(
+        model="circuit",
+        model_args=model_args,
+        tasks=task_list,
+        limit=limit,
+    )
+    return results
+def extract_scores(results: dict) -> dict:
+    """Pull headline metrics from lm-eval results."""
+    scores = {}
+    if "results" not in results:
+        return scores
+    for task_name, task_results in results["results"].items():
+        # Get the primary metric (usually acc or acc_norm)
+        if "acc_norm,none" in task_results:
+            scores[task_name] = task_results["acc_norm,none"]
+        elif "acc,none" in task_results:
+            scores[task_name] = task_results["acc,none"]
+        elif "perplexity,none" in task_results:
+            scores[task_name] = task_results["perplexity,none"]
+        elif "word_perplexity,none" in task_results:
+            scores[task_name] = task_results["word_perplexity,none"]
+    return scores
+def print_comparison(all_results: dict, tasks: list):
+    """Pretty-print comparison table."""
+    # Header
+    col_width = max(len(t) for t in tasks) + 2
+    name_width = max(len(n) for n in all_results) + 2
+    header = f"{'Model':<{name_width}}"
+    for task in tasks:
+        header += f"{task:>{col_width}}"
+    header += f"{'  avg':>8}"
+    print("\n" + "=" * len(header))
+    print(header)
+    print("-" * len(header))
+    for name, scores in all_results.items():
+        row = f"{name:<{name_width}}"
+        vals = []
+        for task in tasks:
+            val = scores.get(task, None)
+            if val is not None:
+                row += f"{val:>{col_width}.4f}"
+                vals.append(val)
+            else:
+                row += f"{'N/A':>{col_width}}"
+        avg = sum(vals) / len(vals) if vals else 0
+        row += f"{avg:>8.4f}"
+        print(row)
+    print("=" * len(header))
+def main():
+    parser = argparse.ArgumentParser(description="Benchmark Circuit transformers")
+    parser.add_argument("--checkpoint", type=str, help="Path to single checkpoint")
+    parser.add_argument("--compare", action="store_true", help="Compare all known architectures")
+    parser.add_argument("--tasks", type=str, default=DEFAULT_TASKS, help="Comma-separated task list")
+    parser.add_argument("--gpu", type=int, default=0, help="GPU index")
+    parser.add_argument("--limit", type=int, default=None, help="Limit samples per task (for quick testing)")
+    parser.add_argument("--batch-size", type=int, default=1, help="Batch size")
+    parser.add_argument("--output", type=str, default=None, help="Save results to JSON")
+    parser.add_argument("--compile", action="store_true", help="torch.compile models for faster inference")
+    args = parser.parse_args()
+    device = f"cuda:{args.gpu}"
+    task_list = args.tasks.split(",")
+    if args.compare:
+        all_scores = {}
+        all_raw = {}
+        # Filter to existing checkpoints
+        available = {k: v for k, v in CHECKPOINTS.items() if Path(v).exists()}
+        missing = {k: v for k, v in CHECKPOINTS.items() if not Path(v).exists()}
+        if missing:
+            print(f"Skipping (not found): {', '.join(missing.keys())}")
+        for name, ckpt_path in available.items():
+            print(f"\n{'='*60}")
+            print(f"Evaluating: {name}")
+            print(f"Checkpoint: {ckpt_path}")
+            print(f"{'='*60}")
+            t0 = time.time()
+            results = run_benchmark(ckpt_path, args.tasks, device, args.limit, args.batch_size, args.compile)
+            elapsed = time.time() - t0
+            scores = extract_scores(results)
+            all_scores[name] = scores
+            all_raw[name] = results.get("results", {})
+            print(f"  Completed in {elapsed:.0f}s: {scores}")
+        print_comparison(all_scores, task_list)
+        if args.output:
+            with open(args.output, "w") as f:
+                json.dump({"scores": all_scores, "raw": all_raw}, f, indent=2, default=str)
+            print(f"\nResults saved to {args.output}")
+    elif args.checkpoint:
+        print(f"Evaluating: {args.checkpoint}")
+        t0 = time.time()
+        results = run_benchmark(args.checkpoint, args.tasks, device, args.limit, args.batch_size, args.compile)
+        elapsed = time.time() - t0
+        scores = extract_scores(results)
+        print(f"\nResults ({elapsed:.0f}s):")
+        for task, score in scores.items():
+            print(f"  {task}: {score:.4f}")
+        if args.output:
+            with open(args.output, "w") as f:
+                json.dump(results, f, indent=2, default=str)
+            print(f"\nResults saved to {args.output}")
+    else:
+        parser.print_help()
+if __name__ == "__main__":
+    main()

coherence_eval.py ADDED

	@@ -0,0 +1,834 @@

+#!/usr/bin/env python3
+"""
+Coherence evaluation for language models.
+Measures what standard benchmarks can't see:
+  Tier 1 — Generation diversity (repetition, collapse detection)
+  Tier 2 — Multi-distance prediction (context utilization, skip accuracy)
+  Tier 3 — Semantic consistency (chunk similarity over long generations)
+Usage:
+    # Custom checkpoint
+    python -m circuits.coherence_eval --checkpoint circuits/checkpoints/model/best.pt
+    # HuggingFace model
+    python -m circuits.coherence_eval --model gpt2
+    # Compare models
+    python -m circuits.coherence_eval --model EleutherAI/pythia-160m --gpu 0
+    # Quick test (fewer prompts, shorter generation)
+    python -m circuits.coherence_eval --checkpoint path/to/model.pt --num-prompts 5 --gen-length 256
+    # Run specific tiers
+    python -m circuits.coherence_eval --checkpoint path/to/model.pt --tiers 1,3
+"""
+import argparse
+import json
+import math
+import sys
+import time
+from pathlib import Path
+import torch
+import torch.nn.functional as F
+# ──────────────────────────────────────────────────────────────────────
+# Default prompts — diverse domains, 10-20 tokens each
+# ──────────────────────────────────────────────────────────────────────
+DEFAULT_PROMPTS = [
+    "A thought observing itself discovers that it",
+    "The history of science shows that",
+    "In the middle of the night, the old house",
+    "The relationship between language and thought has been",
+    "When the first settlers arrived, they found",
+    "The mathematical proof begins by assuming",
+    "She opened the door to find",
+    "The economic implications of this policy",
+    "Deep beneath the ocean surface, researchers discovered",
+    "The most important lesson from this experiment is",
+    "According to recent studies, the human brain",
+    "The old library contained books that",
+    "As the temperature continued to rise, the effects on",
+    "The development of artificial intelligence has raised questions about",
+    "In the small village at the foot of the mountain",
+    "The fundamental principles of democracy require",
+    "Looking through the telescope, the astronomer noticed",
+    "The relationship between music and emotion",
+    "During the industrial revolution, working conditions",
+    "The ancient manuscript revealed secrets about",
+]
+# ──────────────────────────────────────────────────────────────────────
+# Model wrapper — unified interface for circuit models and HF models
+# ──────────────────────────────────────────────────────────────────────
+class ModelWrapper:
+    """Unified interface for custom circuit models and HuggingFace models."""
+    def __init__(self, model, tokenizer, device, model_type="hf",
+                 skip_head=None, skip_k=0, max_seq_len=1024, name="unknown"):
+        self.model = model
+        self.tokenizer = tokenizer
+        self.device = device
+        self.model_type = model_type  # "circuit" or "hf"
+        self.skip_head = skip_head
+        self.skip_k = skip_k
+        self.max_seq_len = max_seq_len
+        self.name = name
+    @classmethod
+    def from_checkpoint(cls, path, device):
+        """Load a custom circuit model from checkpoint."""
+        from .config import CircuitConfig
+        from .model import CircuitTransformer
+        from .mirrored import MirroredConfig, MirroredTransformer
+        from .data import get_tokenizer
+        checkpoint = torch.load(path, map_location="cpu", weights_only=False)
+        model_type = checkpoint.get("model_type", "standard")
+        if model_type == "slot_mirrored":
+            # Imported lazily: slotted_mirrored is not among the files added here, so an
+            # unconditional import would break loading of every other checkpoint type.
+            from .slotted_mirrored import SlotMirroredConfig, SlotMirroredTransformer
+            config = SlotMirroredConfig.from_dict(checkpoint["config"])
+            model = SlotMirroredTransformer(config).to(device)
+            arch_desc = f"SlotMirrored ({config.n_slots} slots)"
+        elif model_type == "mirrored":
+            config = MirroredConfig.from_dict(checkpoint["config"])
+            model = MirroredTransformer(config).to(device)
+            arch_desc = "Mirrored"
+        else:
+            config = CircuitConfig.from_dict(checkpoint["config"])
+            model = CircuitTransformer(config).to(device)
+            arch_desc = "Standard"
+        # Handle torch.compile prefix
+        state_dict = checkpoint["model"]
+        if any(k.startswith("_orig_mod.") for k in state_dict):
+            state_dict = {k.removeprefix("_orig_mod."): v for k, v in state_dict.items()}
+        model.load_state_dict(state_dict)
+        model.eval()
+        tokenizer = get_tokenizer()
+        skip_head = model.skip_head if hasattr(model, 'skip_head') else None
+        skip_k = getattr(config, 'aux_skip_k', 0)
+        max_seq_len = config.max_seq_len
+        params = sum(p.numel() for p in model.parameters()) / 1e6
+        name = f"{Path(path).parent.name}/{Path(path).stem} ({arch_desc}, {params:.1f}M)"
+        return cls(model, tokenizer, device, model_type="circuit",
+                   skip_head=skip_head, skip_k=skip_k,
+                   max_seq_len=max_seq_len, name=name)
+    @classmethod
+    def from_pretrained(cls, model_name, device):
+        """Load a HuggingFace model."""
+        from transformers import AutoModelForCausalLM, AutoTokenizer
+        tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
+        model = AutoModelForCausalLM.from_pretrained(
+            model_name, trust_remote_code=True,
+            torch_dtype=torch.float32,
+        ).to(device)
+        model.eval()
+        max_seq_len = getattr(model.config, 'max_position_embeddings', 1024)
+        if tokenizer.pad_token is None:
+            tokenizer.pad_token = tokenizer.eos_token
+        params = sum(p.numel() for p in model.parameters()) / 1e6
+        name = f"{model_name} ({params:.1f}M)"
+        return cls(model, tokenizer, device, model_type="hf",
+                   max_seq_len=max_seq_len, name=name)
+    @property
+    def has_skip_head(self):
+        return self.skip_head is not None and self.skip_k > 0
+    def generate(self, prompt_text, max_new_tokens=512):
+        """Sample a continuation (temperature 0.8, top-k 50, top-p 0.9). Returns (prompt_ids, gen_ids)."""
+        prompt_ids = self.tokenizer.encode(prompt_text, return_tensors="pt").to(self.device)
+        with torch.no_grad():
+            if self.model_type == "hf":
+                output_ids = self.model.generate(
+                    prompt_ids,
+                    max_new_tokens=max_new_tokens,
+                    do_sample=True,
+                    pad_token_id=self.tokenizer.pad_token_id,
+                    temperature=0.8,
+                    top_k=50,
+                    top_p=0.9,
+                    repetition_penalty=1.2,
+                )
+            else:
+                output_ids = self.model.generate(
+                    prompt_ids,
+                    max_new_tokens=max_new_tokens,
+                    temperature=0.8,
+                    top_k=50,
+                    top_p=0.9,
+                    repetition_penalty=1.2,
+                )
+        # Split off the generated part; the prompt IDs are returned alongside it
+        gen_ids = output_ids[0, prompt_ids.shape[1]:]
+        return prompt_ids[0], gen_ids
+    def forward_with_hidden(self, input_ids):
+        """Forward pass returning (logits, hidden_states, skip_logits_or_None).
+        input_ids: [1, L] tensor.
+        """
+        with torch.no_grad():
+            if self.model_type == "hf":
+                outputs = self.model(input_ids, output_hidden_states=True)
+                logits = outputs.logits
+                hidden = outputs.hidden_states[-1]
+                return logits, hidden, None
+            else:
+                # Hook into norm layer to capture pre-lm_head hidden states
+                hidden_capture = {}
+                def hook_fn(module, inp, output):
+                    hidden_capture['h'] = output.detach()
+                handle = self.model.norm.register_forward_hook(hook_fn)
+                try:
+                    output = self.model(input_ids)
+                finally:
+                    # Always detach the hook, even if the forward pass raises
+                    handle.remove()
+                logits = output['logits']
+                hidden = hidden_capture['h']
+                skip_logits = None
+                if self.has_skip_head:
+                    skip_logits = self.skip_head(hidden)
+                return logits, hidden, skip_logits
+    def forward(self, input_ids):
+        """Forward pass returning logits only. input_ids: [1, L] tensor."""
+        with torch.no_grad():
+            if self.model_type == "hf":
+                return self.model(input_ids).logits
+            else:
+                return self.model(input_ids)['logits']
+# ──────────────────────────────────────────────────────────────────────
+# Generation (shared between Tier 1 and Tier 3)
+# ──────────────────────────────────────────────────────────────────────
+def generate_all(wrapper, prompts, gen_length):
+    """Generate from all prompts. Returns list of (prompt_text, prompt_ids, gen_ids)."""
+    results = []
+    for prompt in prompts:
+        prompt_ids, gen_ids = wrapper.generate(prompt, max_new_tokens=gen_length)
+        results.append((prompt, prompt_ids, gen_ids))
+        print(f"  [{len(results)}/{len(prompts)}] {len(gen_ids)} tokens", end="\r")
+    print()
+    return results
+# ──────────────────────────────────────────────────────────────────────
+# Tier 1: Generation Diversity
+# ──────────────────────────────────────────────────────────────────────
+def ngrams(tokens, n):
+    """Extract n-grams from token list."""
+    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
+def compute_diversity(gen_ids):
+    """Compute diversity metrics for a single generation."""
+    tokens = gen_ids.tolist()
+    n = len(tokens)
+    if n < 4:
+        return {"unique_1g": 0, "unique_2g": 0, "unique_3g": 0, "unique_4g": 0,
+                "max_repeat": n, "max_ngram_repeat_span": n, "collapsed": True}
+    results = {}
+    for k in [1, 2, 3, 4]:
+        grams = ngrams(tokens, k)
+        results[f"unique_{k}g"] = len(set(grams)) / len(grams) if grams else 0.0
+    # Max consecutive identical token span
+    max_repeat = 1
+    current = 1
+    for i in range(1, n):
+        if tokens[i] == tokens[i - 1]:
+            current += 1
+            max_repeat = max(max_repeat, current)
+        else:
+            current = 1
+    results["max_repeat"] = max_repeat
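+    # Longest span covered by an n-gram repeated back-to-back (e.g. "a b a b a b" for n=2).
+    # grams[i] == grams[i - ng_size] means tokens[i - ng_size:i + ng_size] holds two copies;
+    # each further consecutive match extends the periodic span by one token.
+    max_ngram_repeat = 1
+    for ng_size in [2, 3, 4, 5, 8]:
+        grams = ngrams(tokens, ng_size)
+        run = 0
+        for i in range(ng_size, len(grams)):
+            if grams[i] == grams[i - ng_size]:
+                run += 1
+                max_ngram_repeat = max(max_ngram_repeat, run + 2 * ng_size - 1)
+            else:
+                run = 0
+    results["max_ngram_repeat_span"] = max_ngram_repeat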
+    # Collapse: unique 4-grams < 50% or max repeat span > 25% of generation
+    results["collapsed"] = (results["unique_4g"] < 0.5) or (max_ngram_repeat > n * 0.25)
+    return results
+def eval_diversity(generations, tokenizer, show_samples=3):
+    """Tier 1: Compute diversity metrics from pre-generated text."""
+    print("\n" + "=" * 60)
+    print("TIER 1: Generation Diversity")
+    print("=" * 60)
+    all_metrics = []
+    sample_texts = []
+    for i, (prompt, prompt_ids, gen_ids) in enumerate(generations):
+        metrics = compute_diversity(gen_ids)
+        metrics["prompt"] = prompt
+        metrics["gen_length"] = len(gen_ids)
+        all_metrics.append(metrics)
+        if i < show_samples:
+            text = tokenizer.decode(gen_ids, skip_special_tokens=True)
+            sample_texts.append((prompt, text))
+    n = len(all_metrics)
+    if n == 0:
+        print("  No generations to evaluate.")
+        return {}
+    # Aggregate
+    agg = {}
+    for key in ["unique_1g", "unique_2g", "unique_3g", "unique_4g",
+                "max_repeat", "max_ngram_repeat_span"]:
+        values = [m[key] for m in all_metrics]
+        agg[key] = {"mean": sum(values) / n, "min": min(values), "max": max(values)}
+    collapse_count = sum(1 for m in all_metrics if m["collapsed"])
+    agg["collapse_rate"] = collapse_count / n
+    avg_len = sum(m["gen_length"] for m in all_metrics) / n
+    # Print
+    print(f"\n  Prompts evaluated: {n}")
+    print(f"  Avg generation length: {avg_len:.0f} tokens")
+    print()
+    print(f"  {'Metric':<24} {'Mean':>8} {'Min':>8} {'Max':>8}")
+    print(f"  {'-' * 50}")
+    for key in ["unique_1g", "unique_2g", "unique_3g", "unique_4g"]:
+        m = agg[key]
+        print(f"  {key:<24} {m['mean']:>8.3f} {m['min']:>8.3f} {m['max']:>8.3f}")
+    for key in ["max_repeat", "max_ngram_repeat_span"]:
+        m = agg[key]
+        print(f"  {key:<24} {m['mean']:>8.1f} {int(m['min']):>8d} {int(m['max']):>8d}")
+    print(f"\n  Collapse rate: {collapse_count}/{n} ({agg['collapse_rate']:.1%})")
+    # Show samples
+    if sample_texts:
+        print(f"\n  --- Sample generations (first {len(sample_texts)}) ---")
+        for prompt, text in sample_texts:
+            print(f"\n  Prompt: \"{prompt}\"")
+            preview = text[:400].replace("\n", " ")
+            if len(text) > 400:
+                preview += "..."
+            print(f"  Output: {preview}")
+    return {"per_prompt": all_metrics, "aggregate": agg}
+# ──────────────────────────────────────────────────────────────────────
+# Tier 2: Multi-Distance Prediction
+# ──────────────────────────────────────────────────────────────────────
+def prepare_eval_sequences(wrapper, num_sequences=50, data_source=None):
+    """Prepare ground truth sequences for Tier 2."""
+    max_len = wrapper.max_seq_len
+    if data_source and Path(data_source).exists():
+        with open(data_source) as f:
+            text = f.read()
+        all_ids = wrapper.tokenizer.encode(text)
+    else:
+        try:
+            from datasets import load_dataset
+            print("  Loading WikiText-103 validation...")
+            ds = load_dataset("wikitext", "wikitext-103-raw-v1",
+                              split="validation", trust_remote_code=True)
+            text = "\n".join(row["text"] for row in ds if row["text"].strip())
+            all_ids = wrapper.tokenizer.encode(text)
+        except Exception as e:
+            print(f"  Could not load eval data: {e}")
+            print(f"  Install 'datasets' or use --eval-data to provide a text file.")
+            return None
+    # Chunk into sequences
+    sequences = []
+    for i in range(0, len(all_ids) - max_len, max_len):
+        seq = torch.tensor(all_ids[i:i + max_len], dtype=torch.long)
+        sequences.append(seq)
+        if len(sequences) >= num_sequences:
+            break
+    if len(sequences) < 2:
+        print("  Not enough text for evaluation sequences.")
+        return None
+    print(f"  Prepared {len(sequences)} sequences of {max_len} tokens")
+    return sequences
+def eval_context_utilization(wrapper, sequences):
+    """Tier 2a: Per-position perplexity grouped by depth bucket."""
+    max_len = wrapper.max_seq_len
+    # Adaptive buckets: fixed boundaries below max_seq_len, closed by max_seq_len itself
+    bucket_bounds = sorted({b for b in (0, 64, 128, 256, 512) if b < max_len} | {max_len})
+    buckets = [(bucket_bounds[i], bucket_bounds[i + 1])
+               for i in range(len(bucket_bounds) - 1)]
+    # Accumulate per-position losses
+    all_losses = []
+    for seq in sequences:
+        input_ids = seq.unsqueeze(0).to(wrapper.device)
+        logits = wrapper.forward(input_ids)
+        shift_logits = logits[0, :-1]
+        shift_labels = input_ids[0, 1:]
+        per_token_loss = F.cross_entropy(shift_logits, shift_labels, reduction='none')
+        all_losses.append(per_token_loss.cpu())
+        print(f"  [{len(all_losses)}/{len(sequences)}]", end="\r")
+    print()
+    # Compute per-bucket stats
+    stacked = torch.stack(all_losses)  # [N, L-1]
+    bucket_results = {}
+    for start, end in buckets:
+        s = min(start, stacked.shape[1])
+        e = min(end, stacked.shape[1])
+        if s >= e:
+            continue
+        bucket_losses = stacked[:, s:e]
+        avg_loss = bucket_losses.mean().item()
+        bucket_results[f"{start}-{end}"] = {
+            "loss": avg_loss,
+            "ppl": math.exp(min(avg_loss, 20)),  # cap to avoid overflow
+            "n_tokens": bucket_losses.numel(),
+        }
+    return bucket_results
+def eval_skip_accuracy(wrapper, sequences, distances):
+    """Tier 2b: Skip head prediction accuracy at various distances."""
+    if not wrapper.has_skip_head:
+        return None
+    results = {f"t+{K}": {"top1": [], "top5": []} for K in distances}
+    for seq in sequences:
+        input_ids = seq.unsqueeze(0).to(wrapper.device)
+        # forward_with_hidden already applies the skip head under no_grad; reuse its output
+        _, _, skip_logits = wrapper.forward_with_hidden(input_ids)  # skip_logits: [1, L, V]
+        for K in distances:
+            if K >= input_ids.shape[1]:
+                continue
+            targets = input_ids[0, K:]       # tokens at t+K
+            preds = skip_logits[0, :-K]      # predictions from position t
+            top1 = (preds.argmax(-1) == targets).float().mean().item()
+            top5_indices = preds.topk(min(5, preds.shape[-1]), dim=-1).indices
+            top5 = (top5_indices == targets.unsqueeze(-1)).any(-1).float().mean().item()
+            results[f"t+{K}"]["top1"].append(top1)
+            results[f"t+{K}"]["top5"].append(top5)
+        print(f"  [{len(results['t+' + str(distances[0])]['top1'])}/{len(sequences)}]", end="\r")
+    print()
+    # Average across sequences
+    avg_results = {}
+    for key in sorted(results.keys(), key=lambda x: int(x.split("+")[1])):
+        vals = results[key]
+        if vals["top1"]:
+            avg_results[key] = {
+                "top1": sum(vals["top1"]) / len(vals["top1"]),
+                "top5": sum(vals["top5"]) / len(vals["top5"]),
+            }
+    return avg_results
+def eval_structural(wrapper, eval_data, distances, num_sequences):
+    """Run Tier 2 evaluation."""
+    print("\n" + "=" * 60)
+    print("TIER 2: Structural Prediction")
+    print("=" * 60)
+    sequences = prepare_eval_sequences(wrapper, num_sequences, eval_data)
+    if sequences is None:
+        return {"context_utilization": None, "skip_accuracy": None}
+    # 2a: Context utilization
+    print("\n  --- 2a: Context Utilization (PPL by position depth) ---")
+    ctx_results = eval_context_utilization(wrapper, sequences)
+    if ctx_results:
+        print(f"\n  {'Depth':<12} {'Loss':>8} {'PPL':>10} {'Tokens':>10}")
+        print(f"  {'-' * 42}")
+        for bucket, vals in ctx_results.items():
+            print(f"  {bucket:<12} {vals['loss']:>8.3f} {vals['ppl']:>10.2f} {vals['n_tokens']:>10}")
+        buckets_list = list(ctx_results.values())
+        if len(buckets_list) >= 2:
+            ratio = buckets_list[0]["ppl"] / buckets_list[-1]["ppl"]
+            print(f"\n  Context utilization ratio (first/last): {ratio:.2f}x")
+            print(f"  (Higher = model benefits more from additional context)")
+    # 2b: Skip accuracy
+    skip_results = None
+    if wrapper.has_skip_head:
+        print(f"\n  --- 2b: Skip Head Accuracy (trained for t+{wrapper.skip_k}) ---")
+        skip_results = eval_skip_accuracy(wrapper, sequences, distances)
+        if skip_results:
+            print(f"\n  {'Distance':<12} {'Top-1':>8} {'Top-5':>8}")
+            print(f"  {'-' * 30}")
+            for key, vals in skip_results.items():
+                trained = " *" if int(key.split("+")[1]) == wrapper.skip_k else ""
+                print(f"  {key:<12} {vals['top1']:>8.4f} {vals['top5']:>8.4f}{trained}")
+            print(f"\n  * = trained distance")
+    else:
+        print("\n  Skip head: not available")
+    return {"context_utilization": ctx_results, "skip_accuracy": skip_results}
+# ──────────────────────────────────────────────────────────────────────
+# Tier 3: Semantic Consistency
+# ──────────────────────────────────────────────────────────────────────
+def compute_chunk_similarity(hidden_states, chunk_size=128):
+    """Compute cosine similarity between chunks of hidden states.
+    hidden_states: [L, D] tensor.
+    """
+    L, D = hidden_states.shape
+    n_chunks = L // chunk_size
+    if n_chunks < 2:
+        return None
+    # Mean-pool each chunk
+    chunks = []
+    for i in range(n_chunks):
+        chunk = hidden_states[i * chunk_size:(i + 1) * chunk_size]
+        chunks.append(chunk.mean(dim=0))
+    chunk_vecs = torch.stack(chunks)
+    chunk_vecs = F.normalize(chunk_vecs, dim=-1)
+    # Pairwise cosine similarity
+    sim_matrix = chunk_vecs @ chunk_vecs.T
+    # Upper triangle (excluding diagonal)
+    mask = torch.triu(torch.ones_like(sim_matrix, dtype=torch.bool), diagonal=1)
+    pairwise_sims = sim_matrix[mask]
+    # Adjacent pairs
+    adjacent = [sim_matrix[i, i + 1].item() for i in range(n_chunks - 1)]
+    # Distant pairs (first quarter vs last quarter)
+    q1 = max(1, n_chunks // 4)
+    distant = []
+    for i in range(q1):
+        for j in range(n_chunks - q1, n_chunks):
+            if i < j:
+                distant.append(sim_matrix[i, j].item())
+    return {
+        "mean_sim": pairwise_sims.mean().item(),
+        "min_sim": pairwise_sims.min().item(),
+        "adjacent_sim": sum(adjacent) / len(adjacent),
+        "distant_sim": sum(distant) / len(distant) if distant else 0.0,
+        "n_chunks": n_chunks,
+    }
+def eval_consistency(wrapper, generations, chunk_size=128):
+    """Tier 3: Semantic consistency of generated text via hidden state similarity."""
+    print("\n" + "=" * 60)
+    print("TIER 3: Semantic Consistency")
+    print("=" * 60)
+    all_metrics = []
+    for i, (prompt, prompt_ids, gen_ids) in enumerate(generations):
+        if gen_ids.shape[0] < chunk_size * 2:
+            continue
+        # Build full sequence: prompt + generated
+        full_ids = torch.cat([prompt_ids, gen_ids]).unsqueeze(0).to(wrapper.device)
+        # Trim to max_seq_len
+        if full_ids.shape[1] > wrapper.max_seq_len:
+            full_ids = full_ids[:, :wrapper.max_seq_len]
+        _, hidden, _ = wrapper.forward_with_hidden(full_ids)
+        # Use only generated part's hidden states
+        gen_start = prompt_ids.shape[0]
+        gen_hidden = hidden[0, gen_start:]  # [gen_len, D]
+        metrics = compute_chunk_similarity(gen_hidden, chunk_size)
+        if metrics is not None:
+            metrics["prompt"] = prompt
+            all_metrics.append(metrics)
+        print(f"  [{len(all_metrics)}/{len(generations)}]", end="\r")
+    print()
+    if not all_metrics:
+        print("  No valid generations for consistency evaluation.")
+        return {}
+    n = len(all_metrics)
+    agg = {}
+    for key in ["mean_sim", "min_sim", "adjacent_sim", "distant_sim"]:
+        values = [m[key] for m in all_metrics]
+        agg[key] = {"mean": sum(values) / n, "min": min(values), "max": max(values)}
+    # Topic drift: how much similarity drops from adjacent to distant chunks
+    drift_vals = [m["adjacent_sim"] - m["distant_sim"] for m in all_metrics]
+    agg["topic_drift"] = {"mean": sum(drift_vals) / n,
+                          "min": min(drift_vals), "max": max(drift_vals)}
+    # Print
+    print(f"\n  Generations evaluated: {n}")
+    print(f"  Chunk size: {chunk_size} tokens")
+    avg_chunks = sum(m["n_chunks"] for m in all_metrics) / n
+    print(f"  Avg chunks per generation: {avg_chunks:.1f}")
+    print()
+    print(f"  {'Metric':<24} {'Mean':>8} {'Min':>8} {'Max':>8}")
+    print(f"  {'-' * 50}")
+    for key in ["mean_sim", "min_sim", "adjacent_sim", "distant_sim", "topic_drift"]:
+        m = agg[key]
+        print(f"  {key:<24} {m['mean']:>8.3f} {m['min']:>8.3f} {m['max']:>8.3f}")
+    return {"per_prompt": all_metrics, "aggregate": agg}
+# ──────────────────────────────────────────────────────────────────────
+# Summary
+# ──────────────────────────────────────────────────────────────────────
+def print_summary(results):
+    """Print composite summary scores."""
+    print("\n" + "=" * 60)
+    print("SUMMARY")
+    print("=" * 60)
+    scores = {}
+    # Diversity score: mean unique-4gram
+    t1 = results.get("tier1_diversity", {})
+    if t1 and "aggregate" in t1:
+        div_score = t1["aggregate"].get("unique_4g", {}).get("mean", None)
+        collapse = t1["aggregate"].get("collapse_rate", None)
+        if div_score is not None:
+            scores["diversity"] = div_score
+            print(f"  Diversity (unique 4-gram):   {div_score:.3f}", end="")
+            if collapse is not None:
+                print(f"  (collapse: {collapse:.0%})", end="")
+            print()
+    # Context utilization ratio
+    t2 = results.get("tier2_structural", {})
+    if t2:
+        ctx = t2.get("context_utilization")
+        if ctx:
+            buckets = list(ctx.values())
+            if len(buckets) >= 2:
+                ratio = buckets[0]["ppl"] / buckets[-1]["ppl"]
+                scores["context_util"] = ratio
+                print(f"  Context utilization:         {ratio:.2f}x")
+        skip = t2.get("skip_accuracy")
+        if skip:
+            # Report the first available distance (keys are sorted by distance);
+            # the trained distance itself is not known inside this function.
+            trained_key = next(iter(skip), None)
+            if trained_key:
+                top5 = skip[trained_key]["top5"]
+                scores["skip_top5"] = top5
+                print(f"  Skip accuracy ({trained_key} top-5):  {top5:.4f}")
+    # Coherence score: mean chunk similarity
+    t3 = results.get("tier3_consistency", {})
+    if t3 and "aggregate" in t3:
+        coh_score = t3["aggregate"].get("mean_sim", {}).get("mean", None)
+        drift = t3["aggregate"].get("topic_drift", {}).get("mean", None)
+        if coh_score is not None:
+            scores["coherence"] = coh_score
+            print(f"  Coherence (chunk sim):       {coh_score:.3f}", end="")
+            if drift is not None:
+                print(f"  (drift: {drift:.3f})", end="")
+            print()
+    results["summary"] = scores
+    return scores
+# ──────────────────────────────────────────────────────────────────────
+# Main
+# ──────────────────────────────────────────────────────────────────────
+def parse_args():
+    parser = argparse.ArgumentParser(
+        description="Coherence evaluation for language models",
+        formatter_class=argparse.RawDescriptionHelpFormatter,
+    )
+    # Model source (mutually exclusive)
+    group = parser.add_mutually_exclusive_group(required=True)
+    group.add_argument("--checkpoint", type=str, help="Path to circuit model checkpoint")
+    group.add_argument("--model", type=str, help="HuggingFace model name or path")
+    # Evaluation config
+    parser.add_argument("--prompts", type=str, help="File with prompts (one per line)")
+    parser.add_argument("--num-prompts", type=int, default=20,
+                        help="Number of prompts to use (default: 20)")
+    parser.add_argument("--gen-length", type=int, default=512,
+                        help="Tokens to generate per prompt (default: 512)")
+    parser.add_argument("--eval-data", type=str,
+                        help="Text file for Tier 2 (default: WikiText-103 validation)")
+    parser.add_argument("--num-sequences", type=int, default=50,
+                        help="Number of sequences for Tier 2 (default: 50)")
+    parser.add_argument("--chunk-size", type=int, default=128,
+                        help="Chunk size for Tier 3 similarity (default: 128)")
+    parser.add_argument("--distances", type=str, default="2,5,10,25,50,100",
+                        help="Skip distances for Tier 2b (default: 2,5,10,25,50,100)")
+    parser.add_argument("--tiers", type=str, default="1,2,3",
+                        help="Which tiers to run (default: 1,2,3)")
+    # Hardware
+    parser.add_argument("--gpu", type=int, default=0, help="GPU index (default: 0)")
+    # Output
+    parser.add_argument("--output", type=str, help="Save results to JSON file")
+    parser.add_argument("--samples", type=int, default=3,
+                        help="Number of sample generations to display (default: 3)")
+    return parser.parse_args()
+def main():
+    args = parse_args()
+    device = torch.device(f"cuda:{args.gpu}" if torch.cuda.is_available() else "cpu")
+    tiers = [int(t) for t in args.tiers.split(",")]
+    distances = [int(d) for d in args.distances.split(",")]
+    # Load model
+    print("=" * 60)
+    print("Coherence Evaluation")
+    print("=" * 60)
+    if args.checkpoint:
+        print(f"Loading: {args.checkpoint}")
+        wrapper = ModelWrapper.from_checkpoint(args.checkpoint, device)
+    else:
+        print(f"Loading: {args.model}")
+        wrapper = ModelWrapper.from_pretrained(args.model, device)
+    print(f"Model: {wrapper.name}")
+    print(f"Device: {device}")
+    print(f"Max seq len: {wrapper.max_seq_len}")
+    if wrapper.has_skip_head:
+        print(f"Skip head: t+{wrapper.skip_k}")
+    print(f"Tiers: {tiers}")
+    # Load prompts
+    if args.prompts:
+        with open(args.prompts) as f:
+            prompts = [line.strip() for line in f if line.strip()]
+    else:
+        prompts = DEFAULT_PROMPTS
+    prompts = prompts[:args.num_prompts]
+    print(f"Prompts: {len(prompts)}")
+    results = {"model": wrapper.name}
+    t0 = time.time()
+    # Generate once for Tier 1 and Tier 3
+    generations = None
+    if 1 in tiers or 3 in tiers:
+        print(f"\nGenerating {args.gen_length} tokens from {len(prompts)} prompts...")
+        generations = generate_all(wrapper, prompts, args.gen_length)
+    # Tier 1
+    if 1 in tiers and generations:
+        results["tier1_diversity"] = eval_diversity(
+            generations, wrapper.tokenizer, show_samples=args.samples)
+    # Tier 2
+    if 2 in tiers:
+        results["tier2_structural"] = eval_structural(
+            wrapper, args.eval_data, distances, args.num_sequences)
+    # Tier 3
+    if 3 in tiers and generations:
+        results["tier3_consistency"] = eval_consistency(
+            wrapper, generations, args.chunk_size)
+    # Summary
+    print_summary(results)
+    elapsed = time.time() - t0
+    print(f"\nTotal time: {elapsed:.0f}s")
+    # Save
+    if args.output:
+        def make_serializable(obj):
+            if isinstance(obj, dict):
+                return {k: make_serializable(v) for k, v in obj.items()}
+            elif isinstance(obj, list):
+                return [make_serializable(v) for v in obj]
+            elif isinstance(obj, torch.Tensor):
+                return obj.tolist()
+            elif isinstance(obj, float) and (math.isnan(obj) or math.isinf(obj)):
+                # JSON has no NaN/Infinity literals, so store them as strings
+                return str(obj)
+            return obj
+        out_path = Path(args.output)
+        out_path.parent.mkdir(parents=True, exist_ok=True)
+        with open(out_path, "w") as f:
+            json.dump(make_serializable(results), f, indent=2)
+        print(f"Results saved to {args.output}")
+if __name__ == "__main__":
+    main()
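
A note on the JSON export at the end of `main()` above: by default Python's `json` module writes non-finite floats as bare `NaN`/`Infinity` tokens, which strict JSON parsers reject, so the script converts them to strings first. The sketch below is a standalone, torch-free reduction of that helper. The tensor branch is dropped and the sample `results` dict is invented for illustration, so treat it as a demonstration of the idea rather than the code in this commit.

```python
import json
import math


def make_serializable(obj):
    """Recursively replace NaN/inf floats with strings (torch branch omitted)."""
    if isinstance(obj, dict):
        return {k: make_serializable(v) for k, v in obj.items()}
    if isinstance(obj, list):
        return [make_serializable(v) for v in obj]
    if isinstance(obj, float) and (math.isnan(obj) or math.isinf(obj)):
        return str(obj)
    return obj


# Hypothetical results dict; the keys only mimic the shape used by the script.
results = {
    "summary": {"context_util": float("inf"), "coherence": 0.5},
    "ppl": [float("nan"), 12.3],
}

# allow_nan=False makes json.dumps raise on any non-finite float left behind,
# so a successful call shows the sanitizer caught all of them.
encoded = json.dumps(make_serializable(results), allow_nan=False)
decoded = json.loads(encoded)
```

Consumers of the saved file then need to treat the strings `"nan"` and `"inf"` as sentinels, since they no longer parse back as floats.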

config.py ADDED

	@@ -0,0 +1,306 @@

+"""
+Configuration for Circuit Transformer experiments.
+"""
+from dataclasses import dataclass, field
+import argparse
+@dataclass
+class CircuitConfig:
+    """Configuration for CircuitTransformer model and training."""
+    # Model architecture
+    vocab_size: int = 50257  # GPT-2 tokenizer
+    hidden_size: int = 256
+    num_heads: int = 8
+    num_kv_heads: int | None = None  # GQA: None = same as num_heads (MHA)
+    num_layers: int = 6
+    max_seq_len: int = 512
+    dropout: float = 0.0
+    # Training
+    batch_size: int = 32
+    learning_rate: float = 3e-4
+    min_lr: float = 0.0
+    weight_decay: float = 0.1
+    warmup_steps: int = 100
+    epochs: int = 10
+    grad_clip: float = 1.0
+    reset: bool = False
+    # Hardware
+    gpu: int = 0
+    fp16: bool = True
+    bf16: bool = False
+    compile: bool = False
+    # Logging
+    log_every: int = 50
+    save_every: int = 5000
+    checkpoint_dir: str = "./circuits/checkpoints"
+    def __post_init__(self):
+        assert self.hidden_size % self.num_heads == 0, \
+            f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})"
+        if self.num_kv_heads is not None:
+            assert self.num_heads % self.num_kv_heads == 0, \
+                f"num_heads ({self.num_heads}) must be divisible by num_kv_heads ({self.num_kv_heads})"
+    # Presets
+    @classmethod
+    def tiny(cls) -> "CircuitConfig":
+        """~2M params"""
+        return cls(hidden_size=128, num_heads=4, num_layers=4)
+    @classmethod
+    def small(cls) -> "CircuitConfig":
+        """~10M params"""
+        return cls(hidden_size=256, num_heads=8, num_layers=6)
+    @classmethod
+    def medium(cls) -> "CircuitConfig":
+        """~50M params"""
+        return cls(hidden_size=512, num_heads=8, num_layers=12)
+    @classmethod
+    def medium_plus(cls) -> "CircuitConfig":
+        """~50M params"""
+        return cls(hidden_size=512, num_heads=8, num_layers=15)
+    @classmethod
+    def medium_wide_9(cls) -> "CircuitConfig":
+        """~50M params"""
+        return cls(hidden_size=640, num_heads=10, num_layers=9)
+    @classmethod
+    def medium_wide_11(cls) -> "CircuitConfig":
+        """~50M params"""
+        return cls(hidden_size=640, num_heads=10, num_layers=11)
+    @classmethod
+    def medium_large(cls) -> "CircuitConfig":
+        """~90M params"""
+        return cls(hidden_size=768, num_heads=12, num_layers=12)
+    @classmethod
+    def large(cls) -> "CircuitConfig":
+        return cls(hidden_size=1280, num_heads=20, num_layers=11)
+    # Auxiliary objectives
+    aux_skip_k: int = 0        # skip-ahead prediction distance (0 = disabled)
+    aux_skip_weight: float = 0.1  # weight for auxiliary skip loss
+    # Word-position RoPE (SemRoPE)
+    word_rope_dims: int = 0        # head dims for word-position RoPE (0 = disabled)
+    word_rope_base: float = 10.0   # frequency base for word-position RoPE
+    # Factorized embedding / MLP head
+    embed_dim: int = 0             # factorized embedding dim (0 = use hidden_size)
+    head_dim: int = 0              # MLP head intermediate dim (0 = linear head)
+    def to_dict(self) -> dict:
+        """Convert to dictionary for serialization."""
+        d = {
+            "vocab_size": self.vocab_size,
+            "hidden_size": self.hidden_size,
+            "num_heads": self.num_heads,
+            "num_layers": self.num_layers,
+            "max_seq_len": self.max_seq_len,
+            "dropout": self.dropout,
+        }
+        if self.num_kv_heads is not None:
+            d["num_kv_heads"] = self.num_kv_heads
+        if self.aux_skip_k > 0:
+            d["aux_skip_k"] = self.aux_skip_k
+            d["aux_skip_weight"] = self.aux_skip_weight
+        if self.word_rope_dims > 0:
+            d["word_rope_dims"] = self.word_rope_dims
+            d["word_rope_base"] = self.word_rope_base
+        if self.embed_dim > 0:
+            d["embed_dim"] = self.embed_dim
+        if self.head_dim > 0:
+            d["head_dim"] = self.head_dim
+        return d
+    @classmethod
+    def from_dict(cls, d: dict) -> "CircuitConfig":
+        """Create from dictionary."""
+        return cls(**{k: v for k, v in d.items() if k in cls.__dataclass_fields__})
+def parse_args() -> tuple[CircuitConfig, argparse.Namespace]:
+    """Parse CLI arguments and return config + extra args."""
+    parser = argparse.ArgumentParser(description="Circuit Transformer Training")
+    # Data
+    parser.add_argument("--data", type=str, required=True,
+                        help="Data source: path/to/file.txt, path/to/dir/, or hf:dataset_name")
+    parser.add_argument("--text-column", type=str, default="text",
+                        help="Column name for HF datasets (default: text)")
+    parser.add_argument("--data-format", type=str, choices=["text", "chat"], default="text",
+                        help="Data format: text (single column) or chat (system + conversations)")
+    parser.add_argument("--num-samples", type=int, default=None,
+                        help="Limit samples from HF dataset")
+    parser.add_argument("--cache-dir", type=str, default="./circuits/.cache",
+                        help="Cache directory for tokenized data")
+    parser.add_argument("--no-cache", action="store_true",
+                        help="Disable data caching")
+    parser.add_argument("--val-split", type=float, default=0.05,
+                        help="Fraction of data for validation (default: 0.05, 0 to disable)")
+    # Model architecture
+    # TODO: Remove `slot_mirrored`
+    parser.add_argument("--arch", type=str, choices=["standard", "mirrored", "graft_g2lu"], default="standard",
+                        help="Model architecture (default: standard)")
+    parser.add_argument("--preset", type=str, choices=["tiny", "small", "medium", "medium_plus", "medium_large", "medium_wide_9", "medium_wide_11", "large"],
+                        help="Use preset configuration")
+    parser.add_argument("--dims", type=int, default=None, help="Hidden size")
+    parser.add_argument("--layers", type=int, default=None, help="Number of layers")
+    parser.add_argument("--heads", type=int, default=None, help="Number of attention heads")
+    parser.add_argument("--kv-heads", type=int, default=None,
+                        help="Number of KV heads for GQA (default: same as --heads for MHA)")
+    parser.add_argument("--context-length", type=int, default=None, help="Max sequence length")
+    parser.add_argument("--dropout", type=float, default=None, help="Dropout rate")
+    parser.add_argument("--tokenizer", type=str, default="gpt2",
+                        help="Tokenizer to use (default: gpt2, e.g. facebook/MobileLLM-125M)")
+    # Mirrored architecture specific
+    parser.add_argument("--n-middle", type=int, default=2,
+                        help="Unique middle layers for mirrored arch (default: 2)")
+    parser.add_argument("--share-attention", action="store_true", default=True,
+                        help="Share attention weights between mirror pairs (default)")
+    parser.add_argument("--no-share-attention", dest="share_attention", action="store_false",
+                        help="Separate attention weights per direction")
+    # G²LU gating
+    parser.add_argument("--no-g2lu", action="store_true",
+                        help="Disable G²LU (use vanilla SwiGLU in mirrored arch)")
+    # Auxiliary objectives
+    parser.add_argument("--aux-skip", type=int, default=0,
+                        help="Skip-ahead prediction distance (0 = disabled, e.g. 5 predicts t+5)")
+    parser.add_argument("--aux-weight", type=float, default=0.1,
+                        help="Weight for auxiliary skip loss (default: 0.1)")
+    # Word-position RoPE (SemRoPE)
+    parser.add_argument("--word-rope-dims", type=int, default=0,
+                        help="Head dims dedicated to word-position RoPE (0=disabled, try 8 or 16)")
+    parser.add_argument("--word-rope-base", type=float, default=10.0,
+                        help="Frequency base for word-position RoPE (default: 10.0)")
+    # Factorized embedding / MLP head
+    parser.add_argument("--embed-dim", type=int, default=0,
+                        help="Factorized embedding dim (0=use hidden_size, e.g. 256)")
+    parser.add_argument("--head-dim", type=int, default=0,
+                        help="MLP head intermediate dim (0=linear head, e.g. 512)")
+    # G²LU gate grafting
+    parser.add_argument("--pretrained", type=str, default=None,
+                        help="HuggingFace model for graft_g2lu (e.g. meta-llama/Llama-3.2-1B)")
+    parser.add_argument("--align-weight", type=float, default=1.0,
+                        help="Alignment loss weight for G²LU grafting (default: 1.0)")
+    parser.add_argument("--graft-warmup", type=int, default=500,
+                        help="Blend warmup steps: SwiGLU→G²LU transition (default: 500)")
+    # Training
+    parser.add_argument("--epochs", type=int, default=None)
+    parser.add_argument("--batch-size", type=int, default=None)
+    parser.add_argument("--lr", type=float, default=None, help="Learning rate")
+    parser.add_argument("--min-lr", type=float, default=None,
+                        help="Minimum learning rate for cosine decay (default: 0)")
+    parser.add_argument("--weight-decay", type=float, default=None)
+    parser.add_argument("--warmup-steps", type=int, default=None)
+    parser.add_argument("--grad-clip", type=float, default=None)
+    parser.add_argument("--grad-accum", type=int, default=1,
+                        help="Gradient accumulation steps (effective batch = batch_size * grad_accum)")
+    # Hardware
+    parser.add_argument("--gpu", type=int, default=0)
+    parser.add_argument("--fp16", action="store_true", help="Use FP16 mixed precision (with GradScaler)")
+    parser.add_argument("--bf16", action="store_true", help="Use BF16 mixed precision (no scaler needed)")
+    parser.add_argument("--no-fp16", action="store_true", help="Disable mixed precision (FP32)")
+    parser.add_argument("--compile", action="store_true", help="Use torch.compile")
+    # Logging/Checkpointing
+    parser.add_argument("--log-every", type=int, default=None)
+    parser.add_argument("--save-every", type=int, default=None)
+    parser.add_argument("--val-every", type=int, default=0,
+                        help="Run validation every N steps (0 = only at epoch end)")
+    parser.add_argument("--checkpoint-dir", type=str, default=None)
+    parser.add_argument("--resume", type=str, default=None, help="Resume from checkpoint")
+    parser.add_argument("--reset", action="store_true", default=False, help="When resuming, reset the step counter and optimizer state")
+    args = parser.parse_args()
+    # Build config from preset or defaults
+    if args.preset:
+        config = getattr(CircuitConfig, args.preset)()
+    else:
+        config = CircuitConfig()
+    # Override with explicit args
+    if args.dims is not None:
+        config.hidden_size = args.dims
+    if args.layers is not None:
+        config.num_layers = args.layers
+    if args.heads is not None:
+        config.num_heads = args.heads
+    if args.kv_heads is not None:
+        config.num_kv_heads = args.kv_heads
+    if args.context_length is not None:
+        config.max_seq_len = args.context_length
+    if args.dropout is not None:
+        config.dropout = args.dropout
+    if args.epochs is not None:
+        config.epochs = args.epochs
+    if args.batch_size is not None:
+        config.batch_size = args.batch_size
+    if args.lr is not None:
+        config.learning_rate = args.lr
+    if args.min_lr is not None:
+        config.min_lr = args.min_lr
+    if args.weight_decay is not None:
+        config.weight_decay = args.weight_decay
+    if args.warmup_steps is not None:
+        config.warmup_steps = args.warmup_steps
+    if args.grad_clip is not None:
+        config.grad_clip = args.grad_clip
+    if args.log_every is not None:
+        config.log_every = args.log_every
+    if args.save_every is not None:
+        config.save_every = args.save_every
+    if args.checkpoint_dir is not None:
+        config.checkpoint_dir = args.checkpoint_dir
+    # Auxiliary objectives
+    if args.aux_skip > 0:
+        config.aux_skip_k = args.aux_skip
+        config.aux_skip_weight = args.aux_weight
+    # Word-position RoPE
+    if args.word_rope_dims > 0:
+        config.word_rope_dims = args.word_rope_dims
+        config.word_rope_base = args.word_rope_base
+    # Factorized embedding / MLP head
+    if args.embed_dim > 0:
+        config.embed_dim = args.embed_dim
+    if args.head_dim > 0:
+        config.head_dim = args.head_dim
+    config.gpu = args.gpu
+    if args.bf16:
+        config.bf16 = True
+        config.fp16 = False
+    elif args.no_fp16:
+        config.fp16 = False
+        config.bf16 = False
+    elif args.fp16:
+        config.fp16 = True
+        config.bf16 = False
+    config.compile = args.compile
+    config.reset = args.reset
+    return config, args
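
The `to_dict` / `from_dict` pair in `CircuitConfig` above follows a pattern worth spelling out: optional features are written to the dictionary only when enabled, and unknown keys are ignored on load, so checkpoints saved by older or newer revisions can still be read. The sketch below reproduces that pattern on `MiniConfig`, a hypothetical stand-in with four of the real fields; `legacy_field` is likewise invented to show the filtering.

```python
from dataclasses import dataclass, fields


@dataclass
class MiniConfig:
    """Hypothetical stand-in for CircuitConfig with a few of its fields."""
    hidden_size: int = 256
    num_heads: int = 8
    aux_skip_k: int = 0           # 0 = auxiliary skip head disabled
    aux_skip_weight: float = 0.1

    def to_dict(self) -> dict:
        d = {"hidden_size": self.hidden_size, "num_heads": self.num_heads}
        # Optional feature: only serialized when it is switched on.
        if self.aux_skip_k > 0:
            d["aux_skip_k"] = self.aux_skip_k
            d["aux_skip_weight"] = self.aux_skip_weight
        return d

    @classmethod
    def from_dict(cls, d: dict) -> "MiniConfig":
        # Drop keys this class does not declare instead of raising TypeError.
        known = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in d.items() if k in known})


saved = MiniConfig(hidden_size=512, aux_skip_k=5).to_dict()
saved["legacy_field"] = 123            # simulated key from another revision
restored = MiniConfig.from_dict(saved)
plain = MiniConfig().to_dict()         # feature disabled: keys are omitted
```

One consequence of this design is that training-only fields that `to_dict` never writes (learning rate, batch size, and so on in the real class) come back as their defaults after a round trip.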

data.py ADDED

	@@ -0,0 +1,546 @@

+"""
+Data loading utilities for Circuit Transformer.
+Supports:
+- Single text file: --data path/to/file.txt
+- Directory of text files: --data path/to/dir/
+- HuggingFace dataset: --data hf:dataset_name
+Caching:
+- HF datasets: memory-mapped binary files (.bin) — O(1) RAM
+- Text files: torch .pt files (legacy, in-memory)
+- Cache location: ./circuits/.cache/ (or custom via cache_dir)
+Parallelism:
+- HF datasets tokenized via dataset.map(num_proc=N) — multiprocessing, bypasses GIL
+- Fast tokenizer uses Rust internally — additional parallelism within each worker
+"""
+import os
+import struct
+import hashlib
+import multiprocessing
+from pathlib import Path
+import numpy as np
+import torch
+from torch.utils.data import Dataset, DataLoader
+DEFAULT_CACHE_DIR = "./circuits/.cache"
+# Memmap binary format:
+#   Header: 8 bytes = [uint32 n_chunks, uint32 max_seq_len]
+#   Data:   n_chunks * max_seq_len * 4 bytes (int32, row-major)
+HEADER_SIZE = 8
+# ---------------------------------------------------------------------------
+# Cache utilities
+# ---------------------------------------------------------------------------
+def _cache_key(data_source: str, max_seq_len: int, num_samples: int | None) -> str:
+    """Generate cache filename from parameters."""
+    key_str = f"{data_source}|{max_seq_len}|{num_samples}"
+    hash_val = hashlib.md5(key_str.encode()).hexdigest()[:12]
+    name = data_source.replace("/", "_").replace(":", "_").replace(".", "_")[-30:]
+    return f"{name}_{max_seq_len}_{hash_val}.bin"
+# ---------------------------------------------------------------------------
+# Dataset classes
+# ---------------------------------------------------------------------------
+class MemmapDataset(Dataset):
+    """Dataset backed by memory-mapped binary file. O(1) RAM regardless of size."""
+    def __init__(self, path, start=None, end=None):
+        self.path = str(path)
+        with open(self.path, 'rb') as f:
+            total, self.max_seq_len = struct.unpack('II', f.read(HEADER_SIZE))
+        self._total = total
+        self.data = np.memmap(
+            self.path, dtype=np.int32, mode='r',
+            offset=HEADER_SIZE, shape=(total, self.max_seq_len),
+        )
+        self.start = start if start is not None else 0
+        self.end = end if end is not None else total
+    def __len__(self):
+        return self.end - self.start
+    def __getitem__(self, idx):
+        tokens = torch.from_numpy(self.data[self.start + idx].copy()).long()
+        return {"input_ids": tokens, "labels": tokens.clone()}
+    def split(self, val_fraction=0.1):
+        """Split into (train, val) datasets. Both share the same memmap file."""
+        total = self.end - self.start
+        n_val = max(1, int(total * val_fraction))
+        train = MemmapDataset(self.path, self.start, self.end - n_val)
+        val = MemmapDataset(self.path, self.end - n_val, self.end)
+        return train, val
+class TextDataset(Dataset):
+    """Simple in-memory dataset from tokenized chunks. For small datasets."""
+    def __init__(self, token_chunks: list[list[int]], max_seq_len: int):
+        self.chunks = token_chunks
+        self.max_seq_len = max_seq_len
+    def __len__(self) -> int:
+        return len(self.chunks)
+    def __getitem__(self, idx: int) -> dict[str, torch.Tensor]:
+        tokens = self.chunks[idx]
+        if len(tokens) < self.max_seq_len:
+            tokens = tokens + [0] * (self.max_seq_len - len(tokens))
+        else:
+            tokens = tokens[: self.max_seq_len]
+        input_ids = torch.tensor(tokens, dtype=torch.long)
+        return {"input_ids": input_ids, "labels": input_ids.clone()}
+    def split(self, val_fraction=0.1):
+        """Shuffle self.chunks in place, then split into (train, val) datasets."""
+        import random
+        random.shuffle(self.chunks)
+        n_val = max(1, int(len(self.chunks) * val_fraction))
+        val = TextDataset(self.chunks[:n_val], self.max_seq_len)
+        train = TextDataset(self.chunks[n_val:], self.max_seq_len)
+        return train, val
+# ---------------------------------------------------------------------------
+# Tokenizer
+# ---------------------------------------------------------------------------
+class _SentencePieceTokenizer:
+    """Minimal tokenizer wrapper using sentencepiece directly.
+    Bypasses transformers tokenizer bugs across versions."""
+    def __init__(self, model_path, name):
+        import sentencepiece as spm
+        self.sp = spm.SentencePieceProcessor()
+        self.sp.Load(model_path)
+        self._vocab_size = self.sp.GetPieceSize()
+        self.eos_token_id = self.sp.eos_id()
+        self.bos_token_id = self.sp.bos_id()
+        self.eos_token = self.sp.IdToPiece(self.eos_token_id)
+        self.bos_token = self.sp.IdToPiece(self.bos_token_id)
+        self.pad_token = None
+        self.pad_token_id = None
+        self.name_or_path = name
+    def __len__(self):
+        return self._vocab_size
+    @property
+    def vocab_size(self):
+        return self._vocab_size
+    def encode(self, text, add_special_tokens=False, return_tensors=None):
+        ids = self.sp.Encode(text)
+        if return_tensors == "pt":
+            return torch.tensor([ids])
+        return ids
+    def decode(self, ids, skip_special_tokens=False):
+        if hasattr(ids, 'tolist'):
+            ids = ids.tolist()
+        return self.sp.Decode(list(ids))
+    def __call__(self, texts, add_special_tokens=False, **kwargs):
+        if isinstance(texts, str):
+            texts = [texts]
+        return {"input_ids": [self.sp.Encode(t) for t in texts]}
+def get_tokenizer(name: str = "gpt2"):
+    """Get tokenizer from HuggingFace, with sentencepiece fallback.
+    Args:
+        name: Tokenizer name or path. Default "gpt2" (50257 vocab).
+              Use e.g. "facebook/MobileLLM-125M" for 32K vocab.
+    """
+    from transformers import AutoTokenizer
+    # Try AutoTokenizer (fast then slow)
+    for use_fast in (True, False):
+        try:
+            tokenizer = AutoTokenizer.from_pretrained(name, use_fast=use_fast,
+                                                      trust_remote_code=True)
+            if isinstance(tokenizer, bool):
+                continue
+            if tokenizer.pad_token is None:
+                tokenizer.pad_token = tokenizer.eos_token
+            return tokenizer
+        except Exception:
+            continue
+    # Fallback: load sentencepiece model directly (bypasses transformers bugs)
+    print(f"AutoTokenizer failed for {name}, falling back to sentencepiece")
+    from huggingface_hub import hf_hub_download
+    model_path = hf_hub_download(name, "tokenizer.model")
+    tokenizer = _SentencePieceTokenizer(model_path, name)
+    if tokenizer.pad_token is None:
+        tokenizer.pad_token = tokenizer.eos_token
+        tokenizer.pad_token_id = tokenizer.eos_token_id
+    return tokenizer
+# ---------------------------------------------------------------------------
+# Streaming memmap writer
+# ---------------------------------------------------------------------------
+def _stream_chunks_to_memmap(tokenized, total_examples, max_seq_len, output_path,
+                             num_workers=1, read_batch=10_000):
+    """Stream tokenized examples into a memory-mapped binary file.
+    Single-process, numpy-batch approach. Reads batches from Arrow dataset,
+    flattens to numpy int32, writes complete chunks to disk.
+    Memory: O(read_batch * avg_seq_len * 4 bytes).
+    No fork, no multiprocessing, no OOM.
+    """
+    from itertools import chain
+    from tqdm import tqdm
+    temp_path = str(output_path) + ".tmp"
+    n_chunks = 0
+    total_tokens = 0
+    carryover = np.array([], dtype=np.int32)
+    n_batches = (total_examples + read_batch - 1) // read_batch
+    with open(temp_path, 'wb') as f:
+        f.write(struct.pack('II', 0, max_seq_len))  # placeholder header
+        for batch_start in tqdm(range(0, total_examples, read_batch),
+                                total=n_batches, desc="Chunking",
+                                mininterval=1.0):
+            batch_end = min(batch_start + read_batch, total_examples)
+            batch_ids = tokenized[batch_start:batch_end]["input_ids"]
+            # Count tokens, flatten Arrow→numpy without intermediate Python list
+            n_tok = sum(len(ids) for ids in batch_ids if ids)
+            if n_tok == 0:
+                del batch_ids
+                continue
+            flat = np.fromiter(
+                chain.from_iterable(ids for ids in batch_ids if ids),
+                dtype=np.int32, count=n_tok,
+            )
+            del batch_ids
+            total_tokens += n_tok
+            # Prepend carryover from previous batch
+            if len(carryover) > 0:
+                flat = np.concatenate([carryover, flat])
+            # Write complete chunks
+            n_complete = len(flat) // max_seq_len
+            if n_complete > 0:
+                f.write(flat[:n_complete * max_seq_len].tobytes())
+                n_chunks += n_complete
+            carryover = flat[n_complete * max_seq_len:].copy()
+            del flat
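+        # Keep the trailing partial chunk only if it has at least 32 tokens,
+        # zero-padded to max_seq_len; shorter remainders are dropped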
+        if len(carryover) >= 32:
+            padded = np.zeros(max_seq_len, dtype=np.int32)
+            padded[:len(carryover)] = carryover
+            f.write(padded.tobytes())
+            n_chunks += 1
+        # Write actual count into header
+        f.seek(0)
+        f.write(struct.pack('II', n_chunks, max_seq_len))
+    # os.replace overwrites an existing target on every platform
+    os.replace(temp_path, str(output_path))
+    size_gb = os.path.getsize(output_path) / 1e9
+    print(f"Total tokens: {total_tokens:,} → {n_chunks:,} chunks ({size_gb:.1f} GB)")
+    return n_chunks
+# ---------------------------------------------------------------------------
+# HuggingFace dataset loader (parallel + memmap)
+# ---------------------------------------------------------------------------
+def _flatten_chat(example):
+    """Convert chat format (system + conversations list) to plain text.
+    Handles datasets like Bespoke-Stratos-17k and OpenThoughts-114k
+    which store data as: system (str) + conversations (list of {from, value}).
+    """
+    parts = []
+    if example.get("system"):
+        parts.append(example["system"].strip())
+    for msg in example.get("conversations", []):
+        value = msg.get("value", "")
+        if value:
+            parts.append(value.strip())
+    return {"text": "\n\n".join(parts)}
+def _estimate_avg_chars(dataset, text_column: str, n_sample: int = 200) -> float:
+    """Estimate average text length from a sample of the dataset."""
+    n = min(n_sample, len(dataset))
+    total = sum(len(dataset[i][text_column] or "") for i in range(n))
+    return total / max(n, 1)
+def _adaptive_params(avg_chars: float, n_examples: int):
+    """Scale worker count, batch sizes based on average example length.
+    Long examples (chain-of-thought reasoning) need smaller batches and fewer
+    workers to avoid OOM on memory-constrained systems (especially WSL).
+    """
+    cpu_count = max(1, multiprocessing.cpu_count() - 1)
+    if avg_chars > 20_000:       # very long (OpenThoughts-style, ~7K+ tokens)
+        num_proc = min(cpu_count, 4)
+        tok_batch = 64
+        read_batch = 500
+    elif avg_chars > 5_000:      # long (detailed SFT, ~1.5K+ tokens)
+        num_proc = min(cpu_count, 8)
+        tok_batch = 256
+        read_batch = 2_000
+    elif avg_chars > 1_000:      # medium (typical SFT)
+        num_proc = min(cpu_count, 16)
+        tok_batch = 500
+        read_batch = 5_000
+    else:                        # short (web text, wiki)
+        num_proc = min(cpu_count, 32)
+        tok_batch = 1000
+        read_batch = 10_000
+    return num_proc, tok_batch, read_batch
+def load_hf_dataset(
+    name: str,
+    split: str,
+    text_column: str,
+    tokenizer,
+    max_seq_len: int,
+    num_samples: int | None = None,
+    hf_config: str | None = None,
+    cache_path: Path | None = None,
+    data_format: str = "text",
+) -> MemmapDataset:
+    """Load HF dataset with parallel tokenization and streaming to memmap.
+    Parallelism:
+    - dataset.map(num_proc=N) uses multiprocessing — bypasses GIL
+    - GPT2TokenizerFast runs Rust tokenization — bypasses GIL
+    - batched=True enables efficient batch processing
+    Memory:
+    - Adaptive batch sizes based on avg example length — prevents OOM on long sequences
+    - Tokenized data in Arrow format (memory-mapped by HuggingFace)
+    - Chunks streamed to binary memmap file — never in RAM
+    """
+    from datasets import load_dataset
+    config_str = f", config={hf_config}" if hf_config else ""
+    print(f"Loading HF dataset: {name} (split={split}{config_str})")
+    dataset = load_dataset(name, hf_config, split=split)
+    if num_samples is not None:
+        dataset = dataset.select(range(min(num_samples, len(dataset))))
+    # Flatten chat format to plain text
+    if data_format == "chat":
+        # Use conservative parallelism for flattening — light operation
+        flat_proc = min(max(1, multiprocessing.cpu_count() - 1), 8)
+        print(f"Flattening {len(dataset):,} chat examples to plain text...")
+        dataset = dataset.map(
+            _flatten_chat,
+            num_proc=flat_proc,
+            remove_columns=dataset.column_names,
+            desc="Flattening chat",
+        )
+        text_column = "text"
+    # Estimate avg example length and adapt parameters
+    avg_chars = _estimate_avg_chars(dataset, text_column)
+    num_proc, tok_batch, read_batch = _adaptive_params(avg_chars, len(dataset))
+    print(f"  Avg example length: ~{avg_chars:,.0f} chars → "
+          f"{num_proc} workers, tok_batch={tok_batch}, read_batch={read_batch}")
+    # Filter empty examples
+    print(f"Filtering empty examples from {len(dataset):,}...")
+    dataset = dataset.filter(
+        lambda x: bool(x[text_column] and x[text_column].strip()),
+        num_proc=num_proc,
+        desc="Filtering",
+    )
+    print(f"  {len(dataset):,} non-empty examples")
+    # Parallel tokenization
+    print(f"Tokenizing {len(dataset):,} examples with {num_proc} workers...")
+    def tokenize_batch(examples):
+        return tokenizer(examples[text_column], add_special_tokens=False)
+    tokenized = dataset.map(
+        tokenize_batch,
+        batched=True,
+        batch_size=tok_batch,
+        num_proc=num_proc,
+        remove_columns=dataset.column_names,
+        desc="Tokenizing",
+    )
+    # Stream to memmap — use temp path if no cache configured
+    if cache_path is None:
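+        import tempfile
+        # mkstemp creates the file atomically; mktemp is deprecated and race-prone
+        fd, tmp_name = tempfile.mkstemp(suffix='.bin')
+        os.close(fd)
+        cache_path = Path(tmp_name)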
+    _stream_chunks_to_memmap(tokenized, len(tokenized), max_seq_len, cache_path,
+                             read_batch=read_batch)
+    return MemmapDataset(cache_path)
+# ---------------------------------------------------------------------------
+# Text file loaders (unchanged — small datasets, in-memory is fine)
+# ---------------------------------------------------------------------------
+def tokenize_text(text: str, tokenizer, max_seq_len: int) -> list[list[int]]:
+    """Tokenize text into chunks of max_seq_len."""
+    tokens = tokenizer.encode(text)
+    chunks = []
+    for i in range(0, len(tokens), max_seq_len):
+        chunk = tokens[i : i + max_seq_len]
+        if len(chunk) >= 32:
+            chunks.append(chunk)
+    return chunks
+def load_text_file(path: str, tokenizer, max_seq_len: int) -> list[list[int]]:
+    """Load and tokenize a single text file."""
+    with open(path, "r", encoding="utf-8") as f:
+        text = f.read()
+    return tokenize_text(text, tokenizer, max_seq_len)
+def load_text_directory(path: str, tokenizer, max_seq_len: int) -> list[list[int]]:
+    """Load and tokenize all .txt files from a directory."""
+    all_chunks = []
+    path = Path(path)
+    for txt_file in sorted(path.glob("**/*.txt")):
+        chunks = load_text_file(str(txt_file), tokenizer, max_seq_len)
+        all_chunks.extend(chunks)
+    return all_chunks
+# ---------------------------------------------------------------------------
+# Main entry point
+# ---------------------------------------------------------------------------
+def load_data(
+    data_source: str,
+    tokenizer,
+    max_seq_len: int,
+    text_column: str = "text",
+    num_samples: int | None = None,
+    cache_dir: str | None = DEFAULT_CACHE_DIR,
+    data_format: str = "text",
+) -> Dataset:
+    """
+    Load data from various sources. Returns a Dataset with .split() support.
+    Args:
+        data_source: Path or HF dataset identifier
+            - "path/to/file.txt" — single file
+            - "path/to/dir/"     — directory of .txt files
+            - "hf:dataset_name"  — HuggingFace dataset (train split)
+            - "hf:dataset:split" — HuggingFace with specific split
+            - "hf:dataset:config:split" — with config and split
+        tokenizer: Tokenizer to use
+        max_seq_len: Maximum sequence length
+        text_column: Column name for HF datasets
+        num_samples: Limit samples from HF dataset
+        cache_dir: Directory for cache files (None to disable)
+        data_format: "text" for plain text, "chat" to flatten chat examples to text first
+    Returns:
+        Dataset object supporting len(), __getitem__(), and split(fraction)
+    """
+    cache_path = None
+    if cache_dir is not None:
+        cache_path = Path(cache_dir) / _cache_key(data_source, max_seq_len, num_samples)
+        cache_path.parent.mkdir(parents=True, exist_ok=True)
+        # Check for memmap cache (.bin)
+        if cache_path.exists():
+            print(f"Loading from cache: {cache_path}")
+            ds = MemmapDataset(cache_path)
+            print(f"  Loaded {len(ds):,} chunks")
+            return ds
+        # Check for legacy cache (.pt)
+        legacy_path = cache_path.with_suffix('.pt')
+        if legacy_path.exists():
+            print(f"Loading from legacy cache: {legacy_path}")
+            data = torch.load(legacy_path, weights_only=False)
+            chunks = data["chunks"]
+            print(f"  Loaded {len(chunks):,} chunks")
+            return TextDataset(chunks, max_seq_len)
+    # Load and tokenize
+    if data_source.startswith("hf:"):
+        parts = data_source[3:].split(":")
+        name = parts[0]
+        hf_config = None
+        split = "train"
+        if len(parts) == 2:
+            split = parts[1]
+        elif len(parts) == 3:
+            hf_config = parts[1]
+            split = parts[2]
+        return load_hf_dataset(
+            name, split, text_column, tokenizer, max_seq_len,
+            num_samples, hf_config=hf_config, cache_path=cache_path,
+            data_format=data_format,
+        )
+    elif os.path.isfile(data_source):
+        chunks = load_text_file(data_source, tokenizer, max_seq_len)
+    elif os.path.isdir(data_source):
+        chunks = load_text_directory(data_source, tokenizer, max_seq_len)
+    else:
+        raise ValueError(f"Unknown data source: {data_source}")
+    # For text files: save legacy cache
+    if cache_dir is not None:
+        legacy_path = cache_path.with_suffix('.pt')
+        torch.save({"chunks": chunks, "data_source": data_source,
+                     "max_seq_len": max_seq_len, "num_samples": num_samples}, legacy_path)
+        print(f"Saved to cache: {legacy_path}")
+    return TextDataset(chunks, max_seq_len)
+# ---------------------------------------------------------------------------
+# DataLoader factory
+# ---------------------------------------------------------------------------
+def create_dataloader(
+    dataset,
+    batch_size: int,
+    max_seq_len: int | None = None,
+    shuffle: bool = True,
+    num_workers: int = 0,
+) -> DataLoader:
+    """Create a DataLoader from a Dataset or list of chunks."""
+    if not isinstance(dataset, Dataset):
+        # Legacy compatibility: list of token chunks
+        dataset = TextDataset(dataset, max_seq_len)
+    return DataLoader(
+        dataset,
+        batch_size=batch_size,
+        shuffle=shuffle,
+        num_workers=num_workers,
+        pin_memory=True,
+    )

generate.py ADDED

	@@ -0,0 +1,195 @@

+#!/usr/bin/env python3
+"""
+Generation script for Circuit Transformer.
+Usage:
+    python -m circuits.generate --checkpoint circuits/checkpoints/latest.pt --prompt "Once upon a time"
+"""
+import argparse
+import torch
+import torch.nn as nn
+from transformers import AutoTokenizer
+from .config import CircuitConfig
+from .model import CircuitTransformer
+from .mirrored import MirroredConfig, MirroredTransformer
+from .graft_g2lu import load_g2lu_model
+from .layers import build_word_start_table
+from .data import get_tokenizer
+def parse_args():
+    parser = argparse.ArgumentParser(description="Generate text with Circuit Transformer")
+    parser.add_argument("--checkpoint", type=str, required=True, help="Path to checkpoint")
+    parser.add_argument("--prompt", type=str, default="", help="Prompt text")
+    parser.add_argument("--max-tokens", type=int, default=100, help="Max tokens to generate")
+    parser.add_argument("--temperature", type=float, default=0.8, help="Sampling temperature")
+    parser.add_argument("--top-k", type=int, default=50, help="Top-k filtering")
+    parser.add_argument("--top-p", type=float, default=0.9, help="Nucleus sampling threshold")
+    parser.add_argument("--repetition-penalty", type=float, default=1.0, help="Repetition penalty (1.0=off, 1.3=default for slot models)")
+    parser.add_argument("--gpu", type=int, default=0, help="GPU index")
+    parser.add_argument("--no-cache", action="store_true", help="Disable KV cache")
+    return parser.parse_args()
+def _migrate_state_dict(state_dict: dict, model: nn.Module) -> dict:
+    """Migrate checkpoint state_dict to match current model architecture.
+    Strips the torch.compile "_orig_mod." prefix and renames legacy gate keys
+    (gate_expand → w3, gate_compress → w4; dual_gate_middle upgrade).
+    """
+    if any(k.startswith("_orig_mod.") for k in state_dict):
+        state_dict = {k.removeprefix("_orig_mod."): v for k, v in state_dict.items()}
+    model_keys = set(model.state_dict().keys())
+    ckpt_keys = set(state_dict.keys())
+    missing = model_keys - ckpt_keys
+    unexpected = ckpt_keys - model_keys
+    if unexpected:
+        print(f"Unexpected checkpoint keys: {sorted(unexpected)}")
+    if not missing and not unexpected:
+        return state_dict  # perfect match, no migration needed
+    migrated = dict(state_dict)
+    migrations = []
+    # Legacy gate keys: gate_expand → w3, gate_compress → w4 (dual_gate_middle upgrade)
+    for key in list(unexpected):
+        if ".ffn.gate_expand.weight" in key:
+            new_key = key.replace(".ffn.gate_expand.weight", ".ffn.w3.weight")
+            if new_key in missing:
+                migrated[new_key] = migrated.pop(key)
+                missing.discard(new_key)
+                unexpected.discard(key)
+                migrations.append(f"  {key} → {new_key}")
+        if ".ffn.gate_compress.weight" in key:
+            new_key = key.replace(".ffn.gate_compress.weight", ".ffn.w4.weight")
+            if new_key in missing:
+                migrated[new_key] = migrated.pop(key)
+                missing.discard(new_key)
+                unexpected.discard(key)
+                migrations.append(f"  {key} → {new_key}")
+    if migrations:
+        print(f"State dict migration ({len(migrations)} keys renamed):")
+        for m in migrations:
+            print(m)
+        # Report remaining missing keys (freshly initialized)
+        still_missing = model_keys - set(migrated.keys())
+        if still_missing:
+            print(f"  New parameters (freshly initialized): {len(still_missing)}")
+            for k in sorted(still_missing):
+                print(f"    {k}")
+    return migrated
+def generate():
+    args = parse_args()
+    # Setup device
+    device = torch.device(f"cuda:{args.gpu}" if torch.cuda.is_available() else "cpu")
+    print(f"Device: {device}")
+    # Load checkpoint
+    print(f"Loading checkpoint: {args.checkpoint}")
+    checkpoint = torch.load(args.checkpoint, map_location="cpu", weights_only=False)
+    # Reconstruct config and model based on architecture type
+    model_type = checkpoint.get("model_type", "standard")
+    is_folded = model_type == "folded"
+    if model_type == "graft_g2lu":
+        model = load_g2lu_model(args.checkpoint, device=device)
+        model.eval()
+        pretrained_name = checkpoint.get("pretrained_name", "unknown")
+        print(f"Architecture: G²LU Graft ({pretrained_name}, {len(model.g2lu_mlps)}L)")
+        tokenizer_name = checkpoint.get("tokenizer_name", pretrained_name)
+        tokenizer = get_tokenizer(tokenizer_name)
+    elif is_folded:
+        from grafting.fold_llama import FoldedLlama
+        model = FoldedLlama.load_from_checkpoint(args.checkpoint, device=device)
+        model.eval()
+        fold_cfg = model.config
+        print(f"Architecture: FoldedLlama ({fold_cfg.model_name}, "
+              f"{fold_cfg.n_expand}E+{fold_cfg.n_middle}M+{fold_cfg.n_compress}C)")
+        tokenizer = AutoTokenizer.from_pretrained(fold_cfg.model_name, trust_remote_code=True)
+    else:
+        if model_type == "mirrored":
+            if checkpoint["config"].get("dual_gate_middle"):
+                checkpoint["config"].pop("dual_gate_middle")
+            config = MirroredConfig.from_dict(checkpoint["config"])
+            model = MirroredTransformer(config).to(device)
+            print(f"Architecture: MirroredTransformer ({model.total_virtual_layers} virtual layers)")
+        else:
+            config = CircuitConfig.from_dict(checkpoint["config"])
+            model = CircuitTransformer(config).to(device)
+            print(f"Architecture: CircuitTransformer ({config.num_layers} layers)")
+        # Strip _orig_mod. prefix (torch.compile'd checkpoints) and rename legacy keys
+        state_dict = _migrate_state_dict(checkpoint["model"], model)
+        model.load_state_dict(state_dict)
+        model.eval()
+        tokenizer_name = checkpoint.get("tokenizer_name", "gpt2")
+        tokenizer = get_tokenizer(tokenizer_name)
+    # Build word-position table if model uses SemRoPE
+    word_start_table_device = None
+    if model_type not in ("graft_g2lu", "folded"):
+        ckpt_config = checkpoint.get("config", {})
+        word_rope_dims = ckpt_config.get("word_rope_dims", 0)
+        if word_rope_dims > 0:
+            word_start_table_device = build_word_start_table(tokenizer, len(tokenizer)).to(device)
+            print(f"Word-position RoPE: {word_rope_dims} dims")
+    # Tokenize prompt
+    if args.prompt:
+        prompt_ids = tokenizer.encode(args.prompt, return_tensors="pt").to(device)
+    else:
+        # Start with BOS/EOS token
+        prompt_ids = torch.tensor([[tokenizer.eos_token_id]], device=device)
+    print(f"\nPrompt: {args.prompt or '<empty>'}")
+    print(f"Prompt tokens: {prompt_ids.shape[1]}")
+    print(f"Generating {args.max_tokens} tokens...")
+    print(f"Temperature: {args.temperature}, Top-k: {args.top_k}, Top-p: {args.top_p}")
+    print("-" * 50)
+    # Generate
+    with torch.no_grad():
+        gen_kwargs = dict(
+            max_new_tokens=args.max_tokens,
+            temperature=args.temperature,
+            top_k=args.top_k,
+            top_p=args.top_p,
+            use_cache=not args.no_cache,
+        )
+        if args.repetition_penalty != 1.0:
+            gen_kwargs["repetition_penalty"] = args.repetition_penalty
+        # HF models need do_sample=True for temperature/top_k/top_p
+        if model_type == "graft_g2lu":
+            if args.temperature > 0 and args.temperature != 1.0:
+                gen_kwargs["do_sample"] = True
+            elif args.top_p < 1.0 or args.top_k > 0:
+                gen_kwargs["do_sample"] = True
+        if word_start_table_device is not None:
+            gen_kwargs["word_start_table"] = word_start_table_device
+        output_ids = model.generate(prompt_ids, **gen_kwargs)
+    # Decode and print
+    generated_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
+    print(generated_text)
+    print("-" * 50)
+    print(f"Total tokens: {output_ids.shape[1]}")
+if __name__ == "__main__":
+    generate()
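
The key renaming performed by `_migrate_state_dict` can be shown on plain dictionaries, without loading a model. This is a minimal sketch under stated assumptions: `rename_legacy_keys` is a hypothetical stand-in, the values are placeholders rather than tensors, and the key names in the usage example are invented for illustration.

```python
# Old suffix -> new suffix, as in the migration code above
LEGACY_RENAMES = {
    ".ffn.gate_expand.weight": ".ffn.w3.weight",
    ".ffn.gate_compress.weight": ".ffn.w4.weight",
}


def rename_legacy_keys(state_dict: dict, model_keys: set) -> dict:
    """Return a copy of state_dict whose keys match model_keys where possible.

    Hypothetical helper: strips the torch.compile "_orig_mod." prefix and
    renames a legacy key only if the model actually expects the new name.
    """
    migrated = {}
    for key, value in state_dict.items():
        key = key.removeprefix("_orig_mod.")
        if key not in model_keys:
            for old, new in LEGACY_RENAMES.items():
                candidate = key.replace(old, new)
                if old in key and candidate in model_keys:
                    key = candidate
                    break
        migrated[key] = value
    return migrated


if __name__ == "__main__":
    checkpoint = {
        "_orig_mod.layers.0.ffn.gate_expand.weight": "A",
        "_orig_mod.layers.0.ffn.w1.weight": "B",
    }
    expected_by_model = {"layers.0.ffn.w3.weight", "layers.0.ffn.w1.weight"}
    print(rename_legacy_keys(checkpoint, expected_by_model))
```

Keys that match neither the model nor a rename rule are passed through unchanged, so a later `load_state_dict` call would still report them.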

graft_g2lu.py ADDED

	@@ -0,0 +1,300 @@

+"""
+G²LU Gate Grafting: Surgically upgrade pretrained SwiGLU models to G²LU.
+Takes any HuggingFace model with SwiGLU (gate_proj + up_proj), freezes all pretrained
+weights, adds a trainable W4 for nested gating, and trains with alignment + LM loss.
+This is grafting applied to the gate mechanism — the same methodology validated for
+full layer replacement, now targeting the minimum surgical unit.
+Usage:
+    python -m circuits.train --arch graft_g2lu --pretrained meta-llama/Llama-3.2-1B \
+        --align-weight 1.0 --graft-warmup 500 --data hf:Bingsu/openwebtext_20p ...
+"""
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+from pathlib import Path
+class G2LU_MLP(nn.Module):
+    """Per-layer MLP wrapper that upgrades SwiGLU to G²LU.
+    Holds references to the original gate_proj (W3, frozen), up_proj (W1, frozen),
+    down_proj (W2, frozen), plus a new w4 (zero-initialized, trainable).
+    Gate ordering: silu(W4@x * silu(W3@x)) — the pretrained gate (W3) acts as
+    structural prior, constraining W4 to operate within the feature subspace the
+    pretrained model already deems relevant. W4's gradients are scaled by silu(W3@x),
+    inheriting the pretrained model's feature selection hierarchy.
+    """
+    def __init__(self, original_mlp: nn.Module):
+        super().__init__()
+        # References to original weights (all frozen)
+        self.gate_proj = original_mlp.gate_proj  # W3 — frozen
+        self.up_proj = original_mlp.up_proj       # W1 — frozen
+        self.down_proj = original_mlp.down_proj   # W2 — frozen
+        # New W4: same shape as gate_proj, zero-initialized, matched dtype
+        self.w4 = nn.Linear(
+            self.gate_proj.in_features,
+            self.gate_proj.out_features,
+            bias=self.gate_proj.bias is not None,
+            dtype=self.gate_proj.weight.dtype,
+            device=self.gate_proj.weight.device,
+        )
+        nn.init.zeros_(self.w4.weight)
+        if self.w4.bias is not None:
+            nn.init.zeros_(self.w4.bias)
+        # Blend alpha: 0 = pure SwiGLU, 1 = full G²LU
+        self._alpha = 0.0
+        # Per-layer alignment loss (collected by parent)
+        self._align_loss = None
+    def forward(self, x):
+        # Pretrained gate (frozen W3) — structural prior
+        w3_gate = F.silu(self.gate_proj(x))
+        # G²LU gate: silu(W4@x * silu(W3@x))
+        # W4 modulated BY pretrained knowledge, not the reverse
+        g2lu_gate = F.silu(self.w4(x) * w3_gate)
+        # Blend warmup: smooth transition from SwiGLU → G²LU
+        if self._alpha < 1.0:
+            gate = (1.0 - self._alpha) * w3_gate + self._alpha * g2lu_gate
+        else:
+            gate = g2lu_gate
+        # Per-layer alignment loss (compare against original SwiGLU gate)
+        self._align_loss = F.mse_loss(gate, w3_gate.detach())
+        return self.down_proj(gate * self.up_proj(x))
+class G2LU_GraftedModel(nn.Module):
+    """Full model wrapper that upgrades a pretrained HF model's MLPs to G²LU.
+    Interface matches CircuitTransformer: forward(input_ids, labels=labels) returns
+    {"loss", "logits", "align_loss"}.
+    """
+    def __init__(
+        self,
+        pretrained_name: str,
+        align_weight: float = 1.0,
+        warmup_steps: int = 500,
+        device: str = "cuda",
+        dtype=torch.bfloat16,
+    ):
+        super().__init__()
+        self.pretrained_name = pretrained_name
+        self.align_weight = align_weight
+        self.warmup_steps = warmup_steps
+        self._current_step = 0
+        # Load pretrained HF model
+        from transformers import AutoModelForCausalLM
+        self.model = AutoModelForCausalLM.from_pretrained(
+            pretrained_name,
+            dtype=dtype,
+            trust_remote_code=True,
+        )
+        # Discover and replace MLPs
+        self.g2lu_mlps = []
+        self._replace_mlps()
+        # Freeze everything, then selectively unfreeze W4 only
+        for param in self.model.parameters():
+            param.requires_grad = False
+        for g2lu in self.g2lu_mlps:
+            for param in g2lu.w4.parameters():
+                param.requires_grad = True
+        self.model.to(device)
+        # Print summary
+        total_params = sum(p.numel() for p in self.model.parameters())
+        trainable = sum(p.numel() for p in self.model.parameters() if p.requires_grad)
+        print(f"G²LU Graft: {pretrained_name}")
+        print(f"  Layers upgraded: {len(self.g2lu_mlps)}")
+        print(f"  Total params: {total_params:,} ({total_params/1e6:.1f}M)")
+        print(f"  Trainable params: {trainable:,} ({trainable/1e6:.1f}M, {100*trainable/total_params:.1f}%)")
+        print(f"  Align weight: {align_weight}, Warmup: {warmup_steps} steps")
+    def _replace_mlps(self):
+        """Walk the model tree and replace SwiGLU MLPs with G²LU wrappers."""
+        # Try common decoder layer paths
+        layers = None
+        for attr_path in ["model.layers", "gpt_neox.layers", "transformer.h"]:
+            obj = self.model
+            try:
+                for attr in attr_path.split("."):
+                    obj = getattr(obj, attr)
+                layers = obj
+                break
+            except AttributeError:
+                continue
+        if layers is None:
+            raise ValueError(
+                f"Could not find decoder layers in {type(self.model).__name__}. "
+                f"Tried: model.layers, gpt_neox.layers, transformer.h"
+            )
+        for i, layer in enumerate(layers):
+            # Try common MLP attribute names
+            mlp = None
+            mlp_attr = None
+            for attr in ["mlp", "feed_forward"]:
+                if hasattr(layer, attr):
+                    mlp = getattr(layer, attr)
+                    mlp_attr = attr
+                    break
+            if mlp is None:
+                continue
+            # Check for SwiGLU signature (gate_proj + up_proj)
+            if hasattr(mlp, "gate_proj") and hasattr(mlp, "up_proj"):
+                g2lu = G2LU_MLP(mlp)
+                setattr(layer, mlp_attr, g2lu)
+                self.g2lu_mlps.append(g2lu)
+        if not self.g2lu_mlps:
+            raise ValueError(
+                "No SwiGLU MLPs found (need gate_proj + up_proj attributes). "
+                "This model may not use gated linear units."
+            )
+    def set_step(self, step: int):
+        """Update blend alpha across all G²LU MLPs."""
+        self._current_step = step
+        alpha = min(step / max(self.warmup_steps, 1), 1.0)
+        for g2lu in self.g2lu_mlps:
+            g2lu._alpha = alpha
+    def trainable_parameters(self):
+        """Yield only unfrozen parameters (for optimizer and grad clipping)."""
+        for param in self.model.parameters():
+            if param.requires_grad:
+                yield param
+    def collect_align_loss(self):
+        """Average per-layer alignment losses."""
+        losses = [g2lu._align_loss for g2lu in self.g2lu_mlps if g2lu._align_loss is not None]
+        if not losses:
+            return torch.tensor(0.0)
+        return torch.stack(losses).mean()
+    def forward(self, input_ids, labels=None, **kwargs):
+        outputs = self.model(input_ids=input_ids, labels=labels, **kwargs)
+        result = {"logits": outputs.logits}
+        align_loss = self.collect_align_loss()
+        result["align_loss"] = align_loss
+        if labels is not None:
+            # Combine LM loss + alignment loss
+            result["loss"] = outputs.loss + self.align_weight * align_loss
+            result["lm_loss"] = outputs.loss
+        else:
+            result["loss"] = align_loss
+        return result
+    def generate(self, input_ids, **kwargs):
+        """Delegate to HF model's .generate()."""
+        return self.model.generate(input_ids=input_ids, **kwargs)
+def save_g2lu_checkpoint(
+    model: G2LU_GraftedModel,
+    optimizer: torch.optim.Optimizer,
+    step: int,
+    epoch: int,
+    loss: float,
+    path: str,
+    epoch_step: int = 0,
+    best_val_loss: float | None = None,
+    scaler=None,
+    tokenizer_name: str | None = None,
+):
+    """Delta save: only trainable params + metadata."""
+    # Unwrap the torch.compile wrapper, if present
+    if hasattr(model, '_orig_mod'):
+        g2lu_model = model._orig_mod
+    else:
+        g2lu_model = model
+    # Extract only requires_grad params
+    delta_sd = {}
+    full_sd = g2lu_model.model.state_dict()
+    for name, param in g2lu_model.model.named_parameters():
+        if param.requires_grad:
+            # Strip _orig_mod. prefix if present
+            clean_name = name.removeprefix("_orig_mod.")
+            delta_sd[clean_name] = full_sd.get(name, param.data).clone()
+    # Also save the w4 weights explicitly (they're part of the replaced modules)
+    for name, val in full_sd.items():
+        clean_name = name.removeprefix("_orig_mod.")
+        if ".w4." in clean_name and clean_name not in delta_sd:
+            delta_sd[clean_name] = val.clone()
+    checkpoint = {
+        "model": delta_sd,
+        "optimizer": optimizer.state_dict(),
+        "step": step,
+        "epoch": epoch,
+        "epoch_step": epoch_step,
+        "loss": loss,
+        "model_type": "graft_g2lu",
+        "pretrained_name": g2lu_model.pretrained_name,
+        "align_weight": g2lu_model.align_weight,
+        "warmup_steps": g2lu_model.warmup_steps,
+        "tokenizer_name": tokenizer_name or g2lu_model.pretrained_name,
+    }
+    if best_val_loss is not None:
+        checkpoint["best_val_loss"] = best_val_loss
+    if scaler is not None:
+        checkpoint["scaler"] = scaler.state_dict()
+    torch.save(checkpoint, path)
+def load_g2lu_model(checkpoint_path: str, device: str = "cuda", dtype=torch.bfloat16):
+    """Delta load: recreate model from pretrained + apply delta weights."""
+    checkpoint = torch.load(checkpoint_path, map_location="cpu", weights_only=False)
+    pretrained_name = checkpoint["pretrained_name"]
+    align_weight = checkpoint.get("align_weight", 1.0)
+    warmup_steps = checkpoint.get("warmup_steps", 500)
+    model = G2LU_GraftedModel(
+        pretrained_name=pretrained_name,
+        align_weight=align_weight,
+        warmup_steps=warmup_steps,
+        device=device,
+        dtype=dtype,
+    )
+    # Load delta weights
+    delta_sd = checkpoint["model"]
+    # Strip _orig_mod. prefix if present
+    delta_sd = {k.removeprefix("_orig_mod."): v for k, v in delta_sd.items()}
+    # Apply delta weights to the model
+    missing, unexpected = model.model.load_state_dict(delta_sd, strict=False)
+    if unexpected:
+        print(f"  Warning: unexpected keys in delta checkpoint: {unexpected[:5]}...")
+    # Set alpha to 1.0 for inference (full G²LU)
+    model.set_step(warmup_steps + 1)
+    return model
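
The blend warmup in `G2LU_MLP.forward` and `set_step` can be checked with scalar arithmetic. The sketch below is illustrative only: it replaces the linear projections with single pre-activation numbers (`w3_x` and `w4_x` are hypothetical stand-ins for `W3@x` and `W4@x`) and uses the standard definition silu(x) = x · sigmoid(x).

```python
import math


def silu(x: float) -> float:
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def warmup_alpha(step: int, warmup_steps: int) -> float:
    """Linear ramp from 0 to 1 over warmup_steps, as in set_step."""
    return min(step / max(warmup_steps, 1), 1.0)


def blended_gate(w3_x: float, w4_x: float, alpha: float) -> float:
    """Scalar version of the gate blend: SwiGLU gate at alpha=0, G2LU gate at alpha=1."""
    w3_gate = silu(w3_x)              # pretrained gate
    g2lu_gate = silu(w4_x * w3_gate)  # nested gate: silu(W4@x * silu(W3@x))
    return (1.0 - alpha) * w3_gate + alpha * g2lu_gate


if __name__ == "__main__":
    for step in (0, 250, 500):
        alpha = warmup_alpha(step, warmup_steps=500)
        # w4_x = 0.0 corresponds to the zero-initialized W4
        print(step, alpha, blended_gate(1.5, 0.0, alpha))
```

With a zero-initialized W4, the nested gate evaluates to silu(0) = 0, so at step 0 the blended gate equals the pretrained SwiGLU gate and it moves away from it only as alpha and W4 grow. This appears to be the motivation for pairing the warmup with the alignment loss, though the commit does not state it explicitly.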

layers.py ADDED Viewed

	@@ -0,0 +1,325 @@

+"""
+Shared building blocks for Circuit Transformer architectures.
+Components:
+- RMSNorm: Root Mean Square Layer Normalization
+- RotaryEmbedding: Rotary Position Embedding (RoPE)
+- CausalAttention: Multi-head causal attention with RoPE + KV cache
+- SwiGLU: Gated feed-forward network
+"""
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+import math
+from functools import lru_cache
+class RMSNorm(nn.Module):
+    """Root Mean Square Layer Normalization."""
+    def __init__(self, dim: int, eps: float = 1e-6):
+        super().__init__()
+        self.eps = eps
+        self.weight = nn.Parameter(torch.ones(dim))
+    def forward(self, x: torch.Tensor) -> torch.Tensor:
+        norm = x.float().pow(2).mean(-1, keepdim=True).add(self.eps).rsqrt()
+        return (x.float() * norm).type_as(x) * self.weight
+def build_word_start_table(tokenizer, vocab_size: int) -> torch.BoolTensor:
+    """Build a boolean table marking which token IDs start a new word.
+    Detects word boundaries from tokenizer's token representations:
+    - Ġ prefix (GPT-2/BPE style)
+    - ▁ prefix (SentencePiece style)
+    - Special tokens (starting with <)
+    """
+    table = torch.zeros(vocab_size, dtype=torch.bool)
+    # Get all token strings — handle both HF and SentencePiece tokenizers
+    if hasattr(tokenizer, 'convert_ids_to_tokens'):
+        tokens = tokenizer.convert_ids_to_tokens(list(range(vocab_size)))
+    elif hasattr(tokenizer, 'sp'):
+        tokens = [tokenizer.sp.IdToPiece(i) for i in range(vocab_size)]
+    else:
+        tokens = [tokenizer.decode([i]) for i in range(vocab_size)]
+    for idx, tok in enumerate(tokens):
+        if tok is None:
+            continue
+        if tok.startswith('Ġ') or tok.startswith('▁') or tok.startswith('<'):
+            table[idx] = True
+        # Whitespace control characters (newline, carriage return, tab) also start new "words"
+        elif len(tok) > 0 and tok[0] in '\n\r\t':
+            table[idx] = True
+    # Token 0 is always a word starter (BOS/padding)
+    table[0] = True
+    return table
+def compute_word_positions(input_ids: torch.Tensor, word_start_table: torch.Tensor) -> torch.Tensor:
+    """Compute position-within-word for each token. Vectorized, no loops.
+    Args:
+        input_ids: [B, L] token IDs
+        word_start_table: [vocab_size] bool tensor from build_word_start_table
+    Returns:
+        [B, L] float tensor: 0, 1, 2, 0, 1, 0, ... (resets at each word boundary)
+    """
+    is_word_start = word_start_table[input_ids]  # [B, L]
+    is_word_start[:, 0] = True  # First token always starts a word
+    B, L = input_ids.shape
+    positions = torch.arange(L, device=input_ids.device, dtype=torch.float32).unsqueeze(0).expand(B, -1)
+    # Fill non-word-start positions with -1, word-start positions with their index
+    fill = torch.where(is_word_start, positions, torch.tensor(-1.0, device=input_ids.device))
+    # cummax propagates the most recent word-start position forward
+    running_start, _ = fill.cummax(dim=1)
+    # Position within word = distance from the most recent word start
+    word_pos = positions - running_start  # [B, L] float: 0, 1, 2, 0, 1, 0, ...
+    return word_pos
+class WordPositionRoPE(nn.Module):
+    """RoPE encoding for position-within-word.
+    Dedicates a small subspace of head dimensions to word-internal position,
+    using separate (lower) frequency bases. Overrides the last `word_dims`
+    of the standard RoPE cos/sin tensors.
+    """
+    def __init__(self, word_dims: int, word_base: float = 10.0):
+        super().__init__()
+        self.word_dims = word_dims
+        word_inv_freq = 1.0 / (word_base ** (torch.arange(0, word_dims, 2).float() / word_dims))
+        self.register_buffer("word_inv_freq", word_inv_freq)
+    def forward(
+        self, cos: torch.Tensor, sin: torch.Tensor, word_positions: torch.Tensor
+    ) -> tuple[torch.Tensor, torch.Tensor]:
+        """Override last word_dims of cos/sin with word-position-derived values.
+        Args:
+            cos, sin: [L, head_dim] from standard RotaryEmbedding
+            word_positions: [B, L] float tensor (position within word)
+        Returns:
+            cos, sin: [B, L, head_dim] with word dims overridden
+        """
+        B, L = word_positions.shape
+        # Compute word angles: [B, L, word_dims/2]
+        angles = word_positions.unsqueeze(-1) * self.word_inv_freq
+        # Duplicate for rotate_half pattern: [B, L, word_dims]
+        word_emb = torch.cat([angles, angles], dim=-1)
+        word_cos = word_emb.cos()
+        word_sin = word_emb.sin()
+        # Expand standard cos/sin to batch dimension: [L, D] -> [B, L, D]
+        cos = cos.unsqueeze(0).expand(B, -1, -1).clone()
+        sin = sin.unsqueeze(0).expand(B, -1, -1).clone()
+        # Override last word_dims with word-position values
+        cos[:, :, -self.word_dims:] = word_cos
+        sin[:, :, -self.word_dims:] = word_sin
+        return cos, sin
+class RotaryEmbedding(nn.Module):
+    """Rotary Position Embedding (RoPE)."""
+    def __init__(self, dim: int, max_seq_len: int = 2048, base: float = 10000.0):
+        super().__init__()
+        self.dim = dim
+        self.max_seq_len = max_seq_len
+        inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
+        self.register_buffer("inv_freq", inv_freq)
+        self._build_cache(max_seq_len)
+    def _build_cache(self, seq_len: int):
+        t = torch.arange(seq_len, device=self.inv_freq.device)
+        freqs = torch.outer(t, self.inv_freq)
+        emb = torch.cat((freqs, freqs), dim=-1)
+        self.register_buffer("cos_cached", emb.cos(), persistent=False)
+        self.register_buffer("sin_cached", emb.sin(), persistent=False)
+    def forward(self, x: torch.Tensor, seq_len: int) -> tuple[torch.Tensor, torch.Tensor]:
+        if seq_len > self.cos_cached.size(0):
+            self._build_cache(seq_len)
+        return self.cos_cached[:seq_len], self.sin_cached[:seq_len]
+def rotate_half(x: torch.Tensor) -> torch.Tensor:
+    """Rotate half the hidden dims."""
+    x1, x2 = x.chunk(2, dim=-1)
+    return torch.cat((-x2, x1), dim=-1)
+def apply_rotary_pos_emb(
+    q: torch.Tensor, k: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor
+) -> tuple[torch.Tensor, torch.Tensor]:
+    """Apply rotary position embedding to queries and keys.
+    Handles both standard [L, D] and batched [B, L, D] cos/sin.
+    Q, K shape: [B, H, L, D]. For batched cos/sin, unsqueeze dim 1 for head broadcast.
+    """
+    if cos.dim() == 3:  # [B, L, D] from WordPositionRoPE
+        cos = cos.unsqueeze(1)  # [B, 1, L, D] — broadcast over heads
+        sin = sin.unsqueeze(1)
+    q_embed = (q * cos) + (rotate_half(q) * sin)
+    k_embed = (k * cos) + (rotate_half(k) * sin)
+    return q_embed, k_embed
+class CausalAttention(nn.Module):
+    """Multi-head attention with causal mask, RoPE, and optional GQA.
+    Supports Grouped Query Attention (GQA) where num_kv_heads < num_heads.
+    Each KV head serves (num_heads // num_kv_heads) query heads.
+    KV cache stored at kv_heads granularity for memory efficiency.
+    """
+    def __init__(
+        self,
+        hidden_size: int,
+        num_heads: int,
+        num_kv_heads: int | None = None,
+        max_seq_len: int = 2048,
+        dropout: float = 0.0,
+        window_size: int | None = None,
+        word_rope_dims: int = 0,
+        word_rope_base: float = 10.0,
+    ):
+        super().__init__()
+        self.hidden_size = hidden_size
+        self.num_heads = num_heads
+        self.num_kv_heads = num_kv_heads or num_heads
+        self.head_dim = hidden_size // num_heads
+        self.num_kv_groups = self.num_heads // self.num_kv_heads
+        self.dropout = dropout
+        self.window_size = window_size
+        assert self.num_heads % self.num_kv_heads == 0, \
+            f"num_heads ({self.num_heads}) must be divisible by num_kv_heads ({self.num_kv_heads})"
+        if word_rope_dims > 0:
+            assert word_rope_dims <= self.head_dim, \
+                f"word_rope_dims ({word_rope_dims}) must be <= head_dim ({self.head_dim})"
+            assert word_rope_dims % 2 == 0, \
+                f"word_rope_dims ({word_rope_dims}) must be even"
+        self.q_proj = nn.Linear(hidden_size, self.num_heads * self.head_dim, bias=False)
+        self.k_proj = nn.Linear(hidden_size, self.num_kv_heads * self.head_dim, bias=False)
+        self.v_proj = nn.Linear(hidden_size, self.num_kv_heads * self.head_dim, bias=False)
+        self.o_proj = nn.Linear(hidden_size, hidden_size, bias=False)
+        self.rotary = RotaryEmbedding(self.head_dim, max_seq_len)
+        # Word-position RoPE (optional)
+        self.word_rope = WordPositionRoPE(word_rope_dims, word_rope_base) if word_rope_dims > 0 else None
+        # Build causal mask (optionally windowed)
+        mask = torch.tril(torch.ones(max_seq_len, max_seq_len))
+        if window_size is not None:
+            # Band mask: position i attends to [max(0, i-window+1), i]
+            band = torch.triu(torch.ones(max_seq_len, max_seq_len), diagonal=-(window_size - 1))
+            mask = mask * band
+        self.register_buffer(
+            "causal_mask",
+            mask.view(1, 1, max_seq_len, max_seq_len),
+            persistent=False,
+        )
+    def _expand_kv(self, kv: torch.Tensor) -> torch.Tensor:
+        """Expand KV heads to match Q heads for GQA. No-op if num_kv_heads == num_heads."""
+        if self.num_kv_groups == 1:
+            return kv
+        B, H_kv, L, D = kv.shape
+        return kv.unsqueeze(2).expand(B, H_kv, self.num_kv_groups, L, D).reshape(B, self.num_heads, L, D)
+    def forward(
+        self, x: torch.Tensor, use_cache: bool = False, past_kv: tuple | None = None,
+        word_positions: torch.Tensor | None = None,
+    ) -> tuple[torch.Tensor, tuple | None]:
+        B, L, _ = x.shape
+        q = self.q_proj(x).view(B, L, self.num_heads, self.head_dim).transpose(1, 2)
+        k = self.k_proj(x).view(B, L, self.num_kv_heads, self.head_dim).transpose(1, 2)
+        v = self.v_proj(x).view(B, L, self.num_kv_heads, self.head_dim).transpose(1, 2)
+        # RoPE: use correct position offset for KV-cached generation
+        offset = past_kv[0].size(2) if past_kv is not None else 0
+        cos, sin = self.rotary(x, offset + L)
+        cos = cos[offset:offset + L]
+        sin = sin[offset:offset + L]
+        # Override word-position dims if enabled
+        if self.word_rope is not None and word_positions is not None:
+            cos, sin = self.word_rope(cos, sin, word_positions)
+        q, k = apply_rotary_pos_emb(q, k, cos, sin)
+        # KV cache at kv_heads granularity (memory efficient for GQA)
+        if past_kv is not None:
+            past_k, past_v = past_kv
+            k = torch.cat([past_k, k], dim=2)
+            v = torch.cat([past_v, v], dim=2)
+        new_kv = (k, v) if use_cache else None
+        dropout_p = self.dropout if self.training else 0.0
+        use_gqa = self.num_kv_groups > 1
+        if self.window_size is not None:
+            # Windowed attention: manual path (SDPA FlashAttention doesn't support arbitrary masks)
+            k_expanded = self._expand_kv(k)
+            v_expanded = self._expand_kv(v)
+            seq_len = k.size(2)
+            attn = torch.matmul(q, k_expanded.transpose(-2, -1)) / math.sqrt(self.head_dim)
+            if seq_len <= self.causal_mask.size(-1):
+                mask = self.causal_mask[:, :, offset:offset + L, :seq_len]
+                attn = attn.masked_fill(mask == 0, float("-inf"))
+            attn = F.softmax(attn, dim=-1)
+            if dropout_p > 0:
+                attn = F.dropout(attn, p=dropout_p)
+            out = torch.matmul(attn, v_expanded)
+        else:
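+            # SDPA: auto-dispatches to FlashAttention2 / memory-efficient / math backend
+            # Native GQA support avoids expanding KV heads (saves memory + enables FlashAttention GQA kernel);
+            # the enable_gqa argument requires PyTorch >= 2.5
+            # With a KV cache this path applies no mask, which is only correct for single-token decode steps (L == 1)
+            is_causal = past_kv is None and L > 1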
+            out = F.scaled_dot_product_attention(
+                q, k, v,
+                dropout_p=dropout_p,
+                is_causal=is_causal,
+                enable_gqa=use_gqa,
+            )
+        out = out.transpose(1, 2).contiguous().view(B, L, self.hidden_size)
+        return self.o_proj(out), new_kv
+class SwiGLU(nn.Module):
+    """SwiGLU feed-forward network."""
+    def __init__(self, hidden_size: int, intermediate_size: int | None = None):
+        super().__init__()
+        intermediate_size = intermediate_size or int(hidden_size * 8 / 3)
+        intermediate_size = ((intermediate_size + 63) // 64) * 64
+        self.w1 = nn.Linear(hidden_size, intermediate_size, bias=False)
+        self.w2 = nn.Linear(intermediate_size, hidden_size, bias=False)
+        self.w3 = nn.Linear(hidden_size, intermediate_size, bias=False)
+    def forward(self, x: torch.Tensor) -> torch.Tensor:
+        return self.w2(F.silu(self.w1(x)) * self.w3(x))
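
Two index rules in `CausalAttention` above are easy to misread from the tensor code: the window rule stated in the band-mask comment ("position i attends to [max(0, i-window+1), i]") and the query-to-KV head mapping implied by `_expand_kv`. The sketch below restates both in plain Python. The helper names `attended_keys` and `kv_head_for_query` are hypothetical and are not part of this commit; they only illustrate the arithmetic.

```python
from __future__ import annotations


def attended_keys(i: int, window_size: int | None = None) -> list[int]:
    """Key positions that query position i may attend to.

    With window_size=None this is plain causal attention (keys 0..i).
    With a window, only the last `window_size` positions up to i remain.
    """
    start = 0 if window_size is None else max(0, i - window_size + 1)
    return list(range(start, i + 1))


def kv_head_for_query(q_head: int, num_heads: int, num_kv_heads: int) -> int:
    """KV head serving a query head when KV heads are repeated group-wise.

    Mirrors unsqueeze(2).expand(...).reshape(...): consecutive query heads
    share one KV head, so the mapping is integer division by the group size.
    """
    if num_heads % num_kv_heads != 0:
        raise ValueError("num_heads must be divisible by num_kv_heads")
    group_size = num_heads // num_kv_heads
    return q_head // group_size


print(attended_keys(4, window_size=2))                 # [3, 4]
print([kv_head_for_query(h, 8, 2) for h in range(8)])  # [0, 0, 0, 0, 1, 1, 1, 1]
```

With 8 query heads and 2 KV heads, heads 0 to 3 read KV head 0 and heads 4 to 7 read KV head 1, which is the layout the `reshape` in `_expand_kv` produces.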

lm_eval_wrapper.py ADDED

	@@ -0,0 +1,344 @@

+"""
+LM-eval harness wrapper for Circuit/Mirrored transformers.
+Usage:
+    # Single model
+    python -m circuits.bench --checkpoint circuits/checkpoints/mirrored/best.pt --gpu 0
+    # Compare all architectures
+    python -m circuits.bench --compare --gpu 0
+"""
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+from typing import List
+from tqdm import tqdm
+from lm_eval.api.model import LM
+from lm_eval.api.instance import Instance
+from .config import CircuitConfig
+from .model import CircuitTransformer
+from .mirrored import MirroredConfig, MirroredTransformer
+from .graft_g2lu import load_g2lu_model
+from .layers import build_word_start_table, compute_word_positions
+from .data import get_tokenizer
+def _migrate_state_dict(state_dict: dict, model: nn.Module) -> dict:
+    """Migrate checkpoint state_dict to match current model architecture.
+    Handles upgrades like SwiGLU → MirroredSwiGLU (dual_gate_middle).
+    """
+    if any(k.startswith("_orig_mod.") for k in state_dict):
+        state_dict = {k.removeprefix("_orig_mod."): v for k, v in state_dict.items()}
+    model_keys = set(model.state_dict().keys())
+    ckpt_keys = set(state_dict.keys())
+    missing = model_keys - ckpt_keys
+    unexpected = ckpt_keys - model_keys
+    if not missing and not unexpected:
+        return state_dict  # perfect match, no migration needed
+    migrated = dict(state_dict)
+    migrations = []
+    # Legacy gate names → current names: gate_expand → w3, gate_compress → w4
+    renames = {
+        ".ffn.gate_expand.weight": ".ffn.w3.weight",
+        ".ffn.gate_compress.weight": ".ffn.w4.weight",
+    }
+    for key in list(unexpected):
+        for old_suffix, new_suffix in renames.items():
+            if old_suffix not in key:
+                continue
+            new_key = key.replace(old_suffix, new_suffix)
+            if new_key in missing:
+                migrated[new_key] = migrated.pop(key)
+                missing.discard(new_key)
+                unexpected.discard(key)
+                migrations.append(f"  {key} → {new_key}")
+    if migrations:
+        print(f"State dict migration ({len(migrations)} keys renamed):")
+        for m in migrations:
+            print(m)
+        # Report remaining missing keys (freshly initialized)
+        still_missing = model_keys - set(migrated.keys())
+        if still_missing:
+            print(f"  New parameters (freshly initialized): {len(still_missing)}")
+            for k in sorted(still_missing):
+                print(f"    {k}")
+    return migrated
+def load_model(checkpoint_path: str, device: str = "cuda"):
+    """Load any circuit model from checkpoint with auto-detection."""
+    checkpoint = torch.load(checkpoint_path, map_location="cpu", weights_only=False)
+    model_type = checkpoint.get("model_type", "standard")
+    if model_type == "graft_g2lu":
+        model = load_g2lu_model(checkpoint_path, device=device)
+        model.eval()
+        n_layers = len(model.g2lu_mlps)
+        arch_name = f"G²LU Graft ({checkpoint['pretrained_name']}, {n_layers}L)"
+        config = model.model.config  # HF config
+        return model, config, arch_name, model_type
+    elif model_type == "mirrored":
+        # Drop the legacy dual_gate_middle flag; it is not a MirroredConfig field
+        checkpoint["config"].pop("dual_gate_middle", None)
+        config = MirroredConfig.from_dict(checkpoint["config"])
+        model = MirroredTransformer(config)
+        arch_name = f"Mirrored ({model.total_virtual_layers}L)"
+    else:
+        config = CircuitConfig.from_dict(checkpoint["config"])
+        model = CircuitTransformer(config)
+        arch_name = f"Standard ({config.num_layers}L)"
+    # _migrate_state_dict strips the _orig_mod. prefix left by torch.compile and renames legacy keys
+    state_dict = checkpoint["model"]
+    state_dict = _migrate_state_dict(state_dict, model)
+    model.load_state_dict(state_dict)
+    model = model.to(device).eval()
+    return model, config, arch_name, model_type
+class CircuitLM(LM):
+    """LM-eval wrapper for Circuit transformer family."""
+    def __init__(
+        self,
+        checkpoint: str,
+        device: str = "cuda",
+        batch_size: int = 1,
+        compile: bool = False,
+    ):
+        super().__init__()
+        self.model, self.config, self.arch_name, self.model_type = load_model(
+            checkpoint, device
+        )
+        # Keep raw reference for .generate() — torch.compile only wraps forward()
+        self._raw_model = self.model
+        if compile:
+            self.model = torch.compile(self.model)
+            print("  torch.compile: enabled")
+        _ckpt = torch.load(checkpoint, map_location="cpu", weights_only=False)
+        _tok_name = _ckpt.get("tokenizer_name", "gpt2")
+        del _ckpt
+        self.tokenizer = get_tokenizer(_tok_name)
+        if self.tokenizer.pad_token is None:
+            self.tokenizer.pad_token = self.tokenizer.eos_token
+        self._device = device
+        self._batch_size = batch_size
+        # Build word-position table if model uses SemRoPE
+        self._word_start_table = None
+        word_rope_dims = getattr(self.config, 'word_rope_dims', 0)
+        if word_rope_dims == 0 and isinstance(self.config, dict):
+            word_rope_dims = self.config.get('word_rope_dims', 0)
+        if word_rope_dims > 0:
+            self._word_start_table = build_word_start_table(
+                self.tokenizer, len(self.tokenizer)
+            ).to(device)
+            print(f"  Word-position RoPE: {word_rope_dims} dims")
+        # Count parameters
+        n_params = sum(p.numel() for p in self.model.parameters())
+        print(f"  Architecture: {self.arch_name}")
+        print(f"  Parameters: {n_params / 1e6:.1f}M")
+    @property
+    def eot_token_id(self):
+        return self.tokenizer.eos_token_id
+    @property
+    def max_length(self):
+        return getattr(self.config, "max_seq_len", None) or getattr(self.config, "max_position_embeddings", 512)
+    @property
+    def max_gen_toks(self):
+        return 256
+    @property
+    def batch_size(self):
+        return self._batch_size
+    @property
+    def device(self):
+        return self._device
+    def tok_encode(self, string: str) -> List[int]:
+        return self.tokenizer.encode(string, add_special_tokens=False)
+    def tok_decode(self, tokens: List[int]) -> str:
+        return self.tokenizer.decode(tokens)
+    def _model_call(self, input_ids: torch.Tensor):
+        with torch.inference_mode(), torch.autocast('cuda', dtype=torch.bfloat16, enabled=self._device != "cpu"):
+            word_positions = None
+            if self._word_start_table is not None:
+                word_positions = compute_word_positions(input_ids, self._word_start_table)
+            output = self.model(input_ids, use_cache=False, word_positions=word_positions)
+        return output["logits"]
+    def _loglikelihood_tokens(self, requests, disable_tqdm=False):
+        results = []
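+        for context_enc, continuation_enc in requests:
+            # Truncate from the left if too long, but keep at least one context
+            # token so the first continuation token still has a logit scoring it
+            full_enc = context_enc + continuation_enc
+            if len(full_enc) > self.max_length:
+                excess = len(full_enc) - self.max_length
+                excess = min(excess, max(len(context_enc) - 1, 0))
+                context_enc = context_enc[excess:]
+                full_enc = context_enc + continuation_enc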
+            input_ids = torch.tensor(
+                [full_enc], dtype=torch.long, device=self._device
+            )
+            logits = self._model_call(input_ids)
+            ctx_len = len(context_enc)
+            cont_logits = logits[:, ctx_len - 1 : -1, :]
+            cont_tokens = input_ids[:, ctx_len:]
+            log_probs = F.log_softmax(cont_logits, dim=-1)
+            token_log_probs = log_probs.gather(
+                2, cont_tokens.unsqueeze(-1)
+            ).squeeze(-1)
+            total_log_prob = token_log_probs.sum().item()
+            is_greedy = (cont_logits.argmax(dim=-1) == cont_tokens).all().item()
+            results.append((total_log_prob, is_greedy))
+        return results
+    def loglikelihood(
+        self, requests: List[Instance], disable_tqdm: bool = False
+    ) -> List[tuple]:
+        results = []
+        for request in tqdm(
+            requests, desc="loglikelihood", disable=disable_tqdm
+        ):
+            context, continuation = request.args
+            # Encode full text together to get correct tokenization,
+            # then split — sentencepiece tokenizes differently at string
+            # boundaries vs mid-sequence (the leading ▁ problem)
+            context_enc = self.tok_encode(context)
+            full_enc = self.tok_encode(context + continuation)
+            continuation_enc = full_enc[len(context_enc):]
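+            if not continuation_enc:
+                # Edge case: continuation was absorbed into context tokens
+                # Fall back to encoding continuation separately
+                continuation_enc = self.tok_encode(continuation)
+            if not context_enc:
+                # Empty context: condition on EOT so the first continuation
+                # token has a preceding position to be scored from
+                context_enc = [self.eot_token_id]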
+            result = self._loglikelihood_tokens([(context_enc, continuation_enc)])
+            results.append(result[0])
+        return results
+    def loglikelihood_rolling(
+        self, requests: List[Instance], disable_tqdm: bool = False
+    ) -> List[float]:
+        results = []
+        for request in tqdm(
+            requests, desc="loglikelihood_rolling", disable=disable_tqdm
+        ):
+            text = request.args[0]
+            encoding = self.tok_encode(text)
+            total_log_prob = 0.0
+            max_len = self.max_length
+            for i in range(0, len(encoding), max_len):
+                chunk = encoding[i : i + max_len]
+                input_ids = torch.tensor(
+                    [chunk], dtype=torch.long, device=self._device
+                )
+                logits = self._model_call(input_ids)
+                shift_logits = logits[:, :-1, :]
+                shift_labels = input_ids[:, 1:]
+                log_probs = F.log_softmax(shift_logits, dim=-1)
+                token_log_probs = log_probs.gather(
+                    2, shift_labels.unsqueeze(-1)
+                ).squeeze(-1)
+                total_log_prob += token_log_probs.sum().item()
+            results.append(total_log_prob)
+        return results
+    def generate_until(
+        self, requests: List[Instance], disable_tqdm: bool = False
+    ) -> List[str]:
+        results = []
+        for request in tqdm(
+            requests, desc="generate_until", disable=disable_tqdm
+        ):
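+            # lm-eval passes generate_until requests as (context, gen_kwargs)
+            context = request.args[0]
+            gen_kwargs = (request.args[1] if len(request.args) > 1 else None) or {}
+            until = gen_kwargs.get("until", [self.tokenizer.eos_token])
+            if isinstance(until, str):
+                until = [until]
+            max_gen = gen_kwargs.get("max_gen_toks", self.max_gen_toks)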
+            context_enc = self.tok_encode(context)
+            # Truncate context from left if needed
+            if len(context_enc) > self.max_length - max_gen:
+                context_enc = context_enc[-(self.max_length - max_gen) :]
+            input_ids = torch.tensor(
+                [context_enc], dtype=torch.long, device=self._device
+            )
+            if self.model_type == "graft_g2lu":
+                # Use HF's native generate with KV caching — much faster than
+                # manual token-by-token without cache (O(n) vs O(n²))
+                with torch.no_grad():
+                    output_ids = self._raw_model.generate(
+                        input_ids,
+                        max_new_tokens=max_gen,
+                        do_sample=False,
+                        use_cache=True,
+                    )
+                generated_text = self.tok_decode(
+                    output_ids[0, input_ids.shape[1] :].tolist()
+                )
+            else:
+                generated_ids = input_ids.clone()
+                with torch.no_grad():
+                    for _ in range(max_gen):
+                        # Truncate if we exceed max_length
+                        if generated_ids.shape[1] > self.max_length:
+                            generated_ids = generated_ids[:, -self.max_length :]
+                        logits = self._model_call(generated_ids)
+                        next_logits = logits[:, -1, :]
+                        next_token = next_logits.argmax(dim=-1, keepdim=True)
+                        generated_ids = torch.cat([generated_ids, next_token], dim=1)
+                        if next_token.item() == self.eot_token_id:
+                            break
+                        current_text = self.tok_decode(
+                            generated_ids[0, len(context_enc) :].tolist()
+                        )
+                        if any(s in current_text for s in until):
+                            break
+                generated_text = self.tok_decode(
+                    generated_ids[0, len(context_enc) :].tolist()
+                )
+            for stop in until:
+                if stop in generated_text:
+                    generated_text = generated_text[: generated_text.index(stop)]
+            results.append(generated_text)
+        return results
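
The slice `logits[:, ctx_len - 1 : -1, :]` in `_loglikelihood_tokens` relies on the usual next-token offset: the logit at position t scores the token at position t + 1. The sketch below spells out that index arithmetic in plain Python. The function name `continuation_positions` is hypothetical and is not part of the wrapper; it also makes explicit the assumption that at least one context token is present, since `ctx_len - 1` would otherwise wrap around to the end of the sequence.

```python
from __future__ import annotations


def continuation_positions(ctx_len: int, total_len: int) -> tuple[list[int], list[int]]:
    """Return (logit positions, target positions) for scoring a continuation.

    The sequence is context followed by continuation, `total_len` tokens long.
    Continuation tokens sit at ctx_len..total_len-1, and each is predicted by
    the logit one position earlier.
    """
    if ctx_len < 1:
        raise ValueError("need at least one context token to score the first target")
    if total_len <= ctx_len:
        raise ValueError("continuation must contain at least one token")
    logit_positions = list(range(ctx_len - 1, total_len - 1))
    target_positions = list(range(ctx_len, total_len))
    return logit_positions, target_positions


# 3 context tokens followed by 2 continuation tokens
print(continuation_positions(3, 5))  # ([2, 3], [3, 4])
```

Both lists always have the same length, which is what lets the wrapper `gather` one log-probability per continuation token.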

mirrored.py ADDED

	@@ -0,0 +1,532 @@

+"""
+Mirrored Transformer: Weight-sharing between expand and compress phases.
+Based on the biconcave lens hypothesis from grafting research:
+- Early layers expand from tokens to semantic space
+- Late layers compress from semantic space back to tokens
+- These phases share structural computation (W₁, W₂)
+- Only the gate (semiotic filter) differs by direction
+Architecture:
+  y = W₂ @ ((W₁ @ x) ⊙ swish((W₃ @ x) ⊙ swish(W₄ @ x)))
+Both gates fire every pass, nested multiplicatively (W₄ modulates W₃). W₁ computed once.
+W₁, W₂ shared between mirror pairs. W₃, W₄ are dual gates.
+~33% FFN parameter savings per mirrored pair vs standard SwiGLU.
+"""
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+import math
+from dataclasses import dataclass, fields
+from .layers import RMSNorm, CausalAttention, SwiGLU
+@dataclass
+class MirroredConfig:
+    """Configuration for Mirrored Transformer."""
+    vocab_size: int = 50257
+    hidden_size: int = 768
+    num_heads: int = 12
+    num_kv_heads: int | None = None  # GQA: None = same as num_heads (MHA)
+    num_layers: int = 12       # effective depth (expand + middle + compress)
+    n_middle: int = 2          # unique middle layers (standard SwiGLU)
+    max_seq_len: int = 512
+    dropout: float = 0.0
+    aux_skip_k: int = 0              # skip-ahead prediction distance (0 = disabled)
+    aux_skip_weight: float = 0.1     # weight for auxiliary skip loss
+    use_g2lu: bool = True              # G²LU nested gates (False = vanilla SwiGLU)
+    word_rope_dims: int = 0          # head dims for word-position RoPE (0 = disabled)
+    word_rope_base: float = 10.0     # frequency base for word-position RoPE
+    embed_dim: int = 0               # factorized embedding dim (0 = use hidden_size)
+    head_dim: int = 0                # MLP head intermediate dim (0 = linear head)
+    def __post_init__(self):
+        assert self.hidden_size % self.num_heads == 0, "hidden_size must be divisible by num_heads"
+        if self.num_kv_heads is not None:
+            assert self.num_heads % self.num_kv_heads == 0, \
+                f"num_heads ({self.num_heads}) must be divisible by num_kv_heads ({self.num_kv_heads})"
+        n_mirror_layers = self.num_layers - self.n_middle
+        assert n_mirror_layers > 0, "num_layers must be greater than n_middle"
+        assert n_mirror_layers % 2 == 0, "num_layers - n_middle must be even"
+        self.n_mirror = n_mirror_layers // 2
+    def to_dict(self) -> dict:
+        """Convert to dictionary for serialization."""
+        return {f.name: getattr(self, f.name) for f in fields(self) if f.name != "n_mirror"}
+    @classmethod
+    def from_dict(cls, d: dict) -> "MirroredConfig":
+        """Create from dictionary."""
+        valid = {f.name for f in fields(cls)}
+        filtered = {k: v for k, v in d.items() if k in valid}
+        return cls(**filtered)
+class MLP(nn.Module):
+    """Feed-forward network with SiLU activation."""
+    def __init__(self, dim, intermediate_size, dropout):
+        super().__init__()
+        self.up_proj = nn.Linear(dim, intermediate_size, bias=False)
+        self.gate_proj = nn.Linear(dim, intermediate_size, bias=False)
+        self.down_proj = nn.Linear(intermediate_size, dim, bias=False)
+        self.dropout = nn.Dropout(dropout)
+    def forward(self, x: torch.Tensor) -> torch.Tensor:
+        return self.dropout(self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x)))
+class MirroredSwiGLU(nn.Module):
+    """SwiGLU with shared base weights and dual gates.
+    Standard SwiGLU:  y = W₂(silu(W₁x) ⊙ W₃x)                   — 3 matrices
+    Mirrored SwiGLU:  y = W₂(W₁x ⊙ silu(W₃x ⊙ silu(W₄x)))       — 2 shared + 2 gates
+    W₁ computed once, reused for both branches.
+    """
+    def __init__(self, hidden_size: int, intermediate_size: int | None = None,
+                 gate_mode: str = 'additive', use_g2lu: bool = True):
+        super().__init__()
+        self.gate_mode = gate_mode
+        self.use_g2lu = use_g2lu
+        self._current_step = 0
+        intermediate_size = intermediate_size or int(hidden_size * 8 / 3)
+        intermediate_size = ((intermediate_size + 63) // 64) * 64
+        # Shared structural transform
+        self.w1 = nn.Linear(hidden_size, intermediate_size, bias=False)
+        self.w2 = nn.Linear(intermediate_size, hidden_size, bias=False)
+        # Gate(s)
+        self.w3 = nn.Linear(hidden_size, intermediate_size, bias=False)
+        if use_g2lu:
+            self.w4 = nn.Linear(hidden_size, intermediate_size, bias=False)
+    def forward(self, x: torch.Tensor) -> torch.Tensor:
+        hidden = self.w1(x)
+        if self.use_g2lu:
+            g4 = F.silu(self.w4(x))
+            g3 = F.silu(self.w3(x) * g4)
+        else:
+            g3 = F.silu(self.w3(x))
+        return self.w2(hidden * g3)
+class MirroredBlock(nn.Module):
+    """Transformer block with shared weights for expand/compress phases.
+    Each MirroredBlock is used TWICE in the forward pass:
+    once during expand (building semantics) and once during compress (encoding output).
+    Shared: attention weights (optional), FFN W₁/W₂
+    Separate: norms (different residual stream statistics), FFN gate
+    """
+    def __init__(self, hidden_size: int, num_heads: int, num_kv_heads: int | None = None,
+                 max_seq_len: int = 2048,
+                 dropout: float = 0.0,
+                 window_size: int | None = None, gate_mode: str = 'additive',
+                 word_rope_dims: int = 0, word_rope_base: float = 10.0,
+                 use_g2lu: bool = True):
+        super().__init__()
+        self.attn = CausalAttention(hidden_size, num_heads, num_kv_heads, max_seq_len, dropout, window_size=window_size,
+                                    word_rope_dims=word_rope_dims, word_rope_base=word_rope_base)
+        # FFN with shared base + direction-specific gates
+        self.ffn = MirroredSwiGLU(hidden_size, gate_mode=gate_mode, use_g2lu=use_g2lu)
+        # Separate norms per direction (residual stream statistics differ)
+        self.expand_attn_norm = RMSNorm(hidden_size)
+        self.expand_ffn_norm = RMSNorm(hidden_size)
+        self.compress_attn_norm = RMSNorm(hidden_size)
+        self.compress_ffn_norm = RMSNorm(hidden_size)
+    def forward(self, x: torch.Tensor, use_cache: bool = False, past_kv: tuple = None,
+                word_positions: torch.Tensor | None = None) -> tuple:
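+        # NOTE: this method always applies the compress_* norms; the expand_*
+        # norms created in __init__ are not referenced here.
+        attn_norm = self.compress_attn_norm
+        ffn_norm = self.compress_ffn_norm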
+        attn = self.attn
+        attn_out, new_kv = attn(attn_norm(x), use_cache, past_kv, word_positions=word_positions)
+        x = x + attn_out
+        x = x + self.ffn(ffn_norm(x))
+        return x, new_kv
+class MiddleBlock(nn.Module):
+    """Standard transformer block for unique middle layers.
+    The FFN is always a MirroredSwiGLU: the nested G²LU gate when
+    use_g2lu=True, otherwise a single SiLU gate. With G²LU enabled, the
+    middle gets the same gating geometry as the mirror pairs.
+    """
+    def __init__(self, hidden_size: int, num_heads: int, num_kv_heads: int | None = None,
+                 max_seq_len: int = 2048,
+                 dropout: float = 0.0,
+                 word_rope_dims: int = 0, word_rope_base: float = 10.0,
+                 use_g2lu: bool = True):
+        super().__init__()
+        self.attn_norm = RMSNorm(hidden_size)
+        self.attn = CausalAttention(hidden_size, num_heads, num_kv_heads, max_seq_len, dropout,
+                                    word_rope_dims=word_rope_dims, word_rope_base=word_rope_base)
+        self.ffn_norm = RMSNorm(hidden_size)
+        self.ffn = MirroredSwiGLU(hidden_size, use_g2lu=use_g2lu)
+    def forward(self, x: torch.Tensor, use_cache: bool = False, past_kv: tuple = None,
+                word_positions: torch.Tensor | None = None) -> tuple:
+        attn_out, new_kv = self.attn(self.attn_norm(x), use_cache, past_kv, word_positions=word_positions)
+        x = x + attn_out
+        x = x + self.ffn(self.ffn_norm(x))
+        return x, new_kv
+class MirroredTransformer(nn.Module):
+    """Transformer with mirrored expand/compress architecture.
+    Forward pass:
+      1. Embed tokens
+      2. Expand phase: mirror_blocks[0..N-1]
+      3. Middle: unique (unshared) blocks
+      4. Compress phase: mirror_blocks[N-1..0] (reversed), reusing the same blocks
+      5. Norm + LM head
+    For a 12-layer model with n_middle=2:
+      - 5 mirror pairs (10 virtual layers) + 2 middle = 12 effective layers
+      - Expand:   blocks[0] → blocks[4]
+      - Middle:   middle[0] → middle[1]
+      - Compress:  blocks[4] → blocks[0]
+    """
+    def __init__(self, config: MirroredConfig):
+        super().__init__()
+        self.config = config
+        # Token embeddings (optionally factorized)
+        embed_dim = getattr(config, 'embed_dim', 0)
+        head_dim = getattr(config, 'head_dim', 0)
+        # Auto-mirror factorization: head uses embed_dim for weight tying
+        if embed_dim > 0 and head_dim == 0:
+            head_dim = embed_dim
+        # G²LU config (needed before projection setup)
+        use_g2lu = getattr(config, 'use_g2lu', True)
+        if embed_dim > 0:
+            self.embed = nn.Embedding(config.vocab_size, embed_dim)
+            self.embed_proj = nn.Linear(embed_dim, config.hidden_size, bias=False)
+            # G²LU gates for up-projection (consistent with mirror blocks)
+            if use_g2lu:
+                self.embed_g3 = nn.Linear(embed_dim, config.hidden_size, bias=False)
+                self.embed_g4 = nn.Linear(embed_dim, config.hidden_size, bias=False)
+            else:
+                self.embed_g3 = None
+                self.embed_g4 = None
+        else:
+            self.embed = nn.Embedding(config.vocab_size, config.hidden_size)
+            self.embed_proj = None
+            self.embed_g3 = None
+            self.embed_g4 = None
+        self.embed_scale = math.sqrt(config.hidden_size)
+        self.window_sizes = [None] * config.n_mirror
+        # Word-position RoPE config
+        word_rope_dims = getattr(config, 'word_rope_dims', 0)
+        word_rope_base = getattr(config, 'word_rope_base', 10.0)
+        # Mirrored blocks (used in both expand and compress phases)
+        self.mirror_blocks = nn.ModuleList([
+            MirroredBlock(
+                config.hidden_size, config.num_heads, config.num_kv_heads,
+                config.max_seq_len,
+                config.dropout,
+                window_size=self.window_sizes[i],
+                word_rope_dims=word_rope_dims, word_rope_base=word_rope_base,
+                use_g2lu=use_g2lu,
+            )
+            for i in range(config.n_mirror)
+        ])
+        # Unique middle blocks (standard transformer, optionally dual-gated)
+        self.middle_blocks = nn.ModuleList([
+            MiddleBlock(config.hidden_size, config.num_heads, config.num_kv_heads,
+                        config.max_seq_len, config.dropout,
+                        word_rope_dims=word_rope_dims, word_rope_base=word_rope_base,
+                        use_g2lu=use_g2lu)
+            for _ in range(config.n_middle)
+        ])
+        # Output (optionally MLP head)
+        self.norm = RMSNorm(config.hidden_size)
+        if head_dim > 0:
+            self.head_down = nn.Linear(config.hidden_size, head_dim, bias=False)
+            self.lm_head = nn.Linear(head_dim, config.vocab_size, bias=False)
+            # G²LU gates for down-projection
+            if use_g2lu:
+                self.head_g3 = nn.Linear(config.hidden_size, head_dim, bias=False)
+                self.head_g4 = nn.Linear(config.hidden_size, head_dim, bias=False)
+            else:
+                self.head_g3 = None
+                self.head_g4 = None
+        else:
+            self.head_down = None
+            self.head_g3 = None
+            self.head_g4 = None
+            self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
+        # Weight tying (when embed and lm_head dimensions match)
+        _e = embed_dim if embed_dim > 0 else config.hidden_size
+        _h = head_dim if head_dim > 0 else config.hidden_size
+        if _e == _h:
+            self.lm_head.weight = self.embed.weight
+        # Auxiliary skip-ahead prediction head
+        self.skip_head = None
+        self.skip_head_down = None
+        self.skip_g3 = None
+        self.skip_g4 = None
+        if config.aux_skip_k > 0:
+            if head_dim > 0:
+                self.skip_head_down = nn.Linear(config.hidden_size, head_dim, bias=False)
+                self.skip_head = nn.Linear(head_dim, config.vocab_size, bias=False)
+                if use_g2lu:
+                    self.skip_g3 = nn.Linear(config.hidden_size, head_dim, bias=False)
+                    self.skip_g4 = nn.Linear(config.hidden_size, head_dim, bias=False)
+            else:
+                self.skip_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
+        # Initialize weights
+        self.apply(self._init_weights)
+    def _init_weights(self, module):
+        if isinstance(module, nn.Linear):
+            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
+            if module.bias is not None:
+                torch.nn.init.zeros_(module.bias)
+        elif isinstance(module, nn.Embedding):
+            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
+    @property
+    def total_virtual_layers(self) -> int:
+        """Total number of virtual layers in the forward pass."""
+        return self.config.n_mirror * 2 + self.config.n_middle
+    def forward(
+        self,
+        input_ids: torch.Tensor,
+        labels: torch.Tensor | None = None,
+        use_cache: bool = False,
+        past_kv: list | None = None,
+        word_positions: torch.Tensor | None = None,
+    ) -> dict:
+        B, L = input_ids.shape
+        # Embed tokens (optionally factorized, with G²LU gating)
+        x = self.embed(input_ids)
+        if self.embed_proj is not None:
+            if self.embed_g3 is not None:
+                g4 = F.silu(self.embed_g4(x))
+                g3 = F.silu(self.embed_g3(x) * g4)
+                x = self.embed_proj(x) * g3
+            else:
+                x = F.silu(self.embed_proj(x))
+        x = x * self.embed_scale
+        new_kv = [] if use_cache else None
+        kv_idx = 0
+        # === Expand phase ===
+        for block in self.mirror_blocks:
+            layer_past = past_kv[kv_idx] if past_kv is not None else None
+            x, kv = block(x, use_cache=use_cache, past_kv=layer_past, word_positions=word_positions)
+            if use_cache:
+                new_kv.append(kv)
+            kv_idx += 1
+        # === Middle phase (unique, non-shared blocks) ===
+        for block in self.middle_blocks:
+            layer_past = past_kv[kv_idx] if past_kv is not None else None
+            x, kv = block(x, use_cache=use_cache, past_kv=layer_past, word_positions=word_positions)
+            if use_cache:
+                new_kv.append(kv)
+            kv_idx += 1
+        # === Compress phase (reversed order) ===
+        for i in reversed(range(len(self.mirror_blocks))):
+            layer_past = past_kv[kv_idx] if past_kv is not None else None
+            x, kv = self.mirror_blocks[i](x, use_cache=use_cache, past_kv=layer_past, word_positions=word_positions)
+            if use_cache:
+                new_kv.append(kv)
+            kv_idx += 1
+        # === Output (optionally MLP head with G²LU gating) ===
+        x = self.norm(x)
+        if self.head_down is not None:
+            if self.head_g3 is not None:
+                g4 = F.silu(self.head_g4(x))
+                g3 = F.silu(self.head_g3(x) * g4)
+                logits = self.lm_head(self.head_down(x) * g3)
+            else:
+                logits = self.lm_head(F.silu(self.head_down(x)))
+        else:
+            logits = self.lm_head(x)
+        result = {"logits": logits}
+        if use_cache:
+            result["past_kv"] = new_kv
+        if labels is not None:
+            shift_logits = logits[:, :-1, :].contiguous()
+            shift_labels = labels[:, 1:].contiguous()
+            loss = F.cross_entropy(
+                shift_logits.view(-1, self.config.vocab_size),
+                shift_labels.view(-1),
+                ignore_index=-100
+            )
+            if self.skip_head is not None:
+                skip_k = self.config.aux_skip_k
+                if self.skip_head_down is not None:
+                    if self.skip_g3 is not None:
+                        g4 = F.silu(self.skip_g4(x))
+                        g3 = F.silu(self.skip_g3(x) * g4)
+                        skip_logits = self.skip_head(self.skip_head_down(x) * g3)[:, :-skip_k, :].contiguous()
+                    else:
+                        skip_logits = self.skip_head(F.silu(self.skip_head_down(x)))[:, :-skip_k, :].contiguous()
+                else:
+                    skip_logits = self.skip_head(x)[:, :-skip_k, :].contiguous()
+                skip_labels = labels[:, skip_k:].contiguous()
+                aux_loss = F.cross_entropy(
+                    skip_logits.view(-1, self.config.vocab_size),
+                    skip_labels.view(-1),
+                    ignore_index=-100
+                )
+                result["aux_loss"] = aux_loss
+                loss = loss + self.config.aux_skip_weight * aux_loss
+            result["loss"] = loss
+        return result
+    @torch.no_grad()
+    def generate(
+        self,
+        prompt_ids: torch.Tensor,
+        max_new_tokens: int = 50,
+        temperature: float = 0.8,
+        top_k: int = 50,
+        top_p: float = 0.9,
+        use_cache: bool = True,
+        word_start_table: torch.Tensor | None = None,
+    ) -> torch.Tensor:
+        """Autoregressive generation with KV caching."""
+        from .layers import compute_word_positions
+        self.eval()
+        generated = prompt_ids.clone()
+        past_kv = None
+        word_pos_counter = 0
+        for _ in range(max_new_tokens):
+            if use_cache and past_kv is not None:
+                input_ids = generated[:, -1:]
+                if word_start_table is not None:
+                    # NOTE: reads batch row 0 only, so word positions assume batch size 1 here
+                    last_token = generated[0, -1].item()
+                    if word_start_table[last_token]:
+                        word_pos_counter = 0
+                    else:
+                        word_pos_counter += 1
+                    word_positions = torch.tensor([[float(word_pos_counter)]], device=input_ids.device)
+                else:
+                    word_positions = None
+            else:
+                input_ids = generated
+                if word_start_table is not None:
+                    word_positions = compute_word_positions(input_ids, word_start_table)
+                else:
+                    word_positions = None
+            output = self(input_ids, use_cache=use_cache, past_kv=past_kv, word_positions=word_positions)
+            logits = output["logits"][:, -1, :]
+            if use_cache:
+                past_kv = output["past_kv"]
+            if temperature > 0:
+                logits = logits / temperature
+                if top_k > 0:
+                    top_k_vals, _ = torch.topk(logits, min(top_k, logits.size(-1)))
+                    min_top_k = top_k_vals[:, -1].unsqueeze(-1)
+                    logits = torch.where(logits < min_top_k, float("-inf"), logits)
+                if top_p < 1.0:
+                    sorted_logits, sorted_indices = torch.sort(logits, descending=True)
+                    cumsum_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
+                    sorted_indices_to_remove = cumsum_probs > top_p
+                    sorted_indices_to_remove[:, 1:] = sorted_indices_to_remove[:, :-1].clone()
+                    sorted_indices_to_remove[:, 0] = False
+                    indices_to_remove = sorted_indices_to_remove.scatter(1, sorted_indices, sorted_indices_to_remove)
+                    logits = logits.masked_fill(indices_to_remove, float("-inf"))
+                probs = F.softmax(logits, dim=-1)
+                next_token = torch.multinomial(probs, num_samples=1)
+            else:
+                next_token = logits.argmax(dim=-1, keepdim=True)
+            generated = torch.cat([generated, next_token], dim=1)
+            if generated.size(1) >= self.config.max_seq_len:
+                break
+        return generated
+def count_mirrored_parameters(model: MirroredTransformer) -> dict:
+    """Count parameters with breakdown by component."""
+    total = sum(p.numel() for p in model.parameters() if p.requires_grad)
+    # Unique params: nn.Module.parameters() already yields tied tensors once, so this normally equals total
+    unique = sum(p.numel() for p in set(p for p in model.parameters() if p.requires_grad))
+    mirror_params = sum(p.numel() for p in model.mirror_blocks.parameters())
+    middle_params = sum(p.numel() for p in model.middle_blocks.parameters())
+    embed_params = model.embed.weight.numel()
+    if model.embed_proj is not None:
+        embed_params += model.embed_proj.weight.numel()
+    head_params = 0
+    if model.head_down is not None:
+        head_params += model.head_down.weight.numel()
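+    # NOTE: with weight tying, lm_head.weight is embed.weight, so it is counted in both "embedding" and "head"
+    head_params += model.lm_head.weight.numel()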
+    # Break down mirror block into shared vs direction-specific
+    shared_attn = 0
+    shared_ffn_base = 0
+    gate_params = 0
+    norm_params = 0
+    for block in model.mirror_blocks:
+        shared_attn += sum(p.numel() for p in block.attn.parameters())
+        shared_ffn_base += block.ffn.w1.weight.numel() + block.ffn.w2.weight.numel()
+        gate_params += block.ffn.w3.weight.numel()
+        if hasattr(block.ffn, 'w4'):
+            gate_params += block.ffn.w4.weight.numel()
+        norm_params += sum(p.numel() for n, p in block.named_parameters() if 'norm' in n)
+    return {
+        "total": total,
+        "unique": unique,
+        "mirror_blocks": mirror_params,
+        "middle_blocks": middle_params,
+        "embedding": embed_params,
+        "head": head_params,
+        "shared_attention": shared_attn,
+        "shared_ffn_base": shared_ffn_base,
+        "direction_gates": gate_params,
+        "norms": norm_params,
+    }
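
The expand, middle, and compress loops in `MirroredTransformer.forward` advance a single `kv_idx`, so each mirror block ends up with two separate KV-cache slots (one per pass). The following is a minimal, framework-free sketch of that ordering; the helper name `virtual_layer_schedule` is hypothetical and not part of the repository.

```python
def virtual_layer_schedule(n_mirror: int, n_middle: int) -> list[tuple[int, str, int]]:
    """Return (kv_idx, phase, block_index) for every virtual layer in forward order.

    Illustrative only: mirrors the three loops in MirroredTransformer.forward,
    assuming one KV-cache slot per virtual layer.
    """
    schedule = []
    kv_idx = 0
    # Expand phase: mirror blocks 0 .. n_mirror-1
    for i in range(n_mirror):
        schedule.append((kv_idx, "expand", i))
        kv_idx += 1
    # Middle phase: unique blocks
    for j in range(n_middle):
        schedule.append((kv_idx, "middle", j))
        kv_idx += 1
    # Compress phase: the same mirror blocks, visited in reverse
    for i in reversed(range(n_mirror)):
        schedule.append((kv_idx, "compress", i))
        kv_idx += 1
    return schedule


if __name__ == "__main__":
    for kv_idx, phase, block in virtual_layer_schedule(n_mirror=2, n_middle=1):
        print(kv_idx, phase, block)
```

The length of the returned list equals `total_virtual_layers` (`n_mirror * 2 + n_middle`), which is also the number of entries expected in `past_kv`.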

model.py ADDED

	@@ -0,0 +1,357 @@

+"""
+Circuit Transformer: Minimal transformer for semantic circuitry experiments.
+Follows patterns from shimmer/lira/gpt.py with extension hooks for future work.
+"""
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+import math
+from .config import CircuitConfig
+from .layers import RMSNorm, RotaryEmbedding, CausalAttention, SwiGLU, WordPositionRoPE
+class TransformerBlock(nn.Module):
+    """Pre-norm transformer block with causal attention."""
+    def __init__(
+        self,
+        hidden_size: int,
+        num_heads: int,
+        num_kv_heads: int | None = None,
+        max_seq_len: int = 2048,
+        dropout: float = 0.0,
+        window_size: int | None = None,
+        word_rope_dims: int = 0,
+        word_rope_base: float = 10.0,
+    ):
+        super().__init__()
+        self.attn_norm = RMSNorm(hidden_size)
+        self.attn = CausalAttention(hidden_size, num_heads, num_kv_heads, max_seq_len, dropout, window_size,
+                                    word_rope_dims=word_rope_dims, word_rope_base=word_rope_base)
+        self.ffn_norm = RMSNorm(hidden_size)
+        self.ffn = SwiGLU(hidden_size)
+    def forward(
+        self, x: torch.Tensor, use_cache: bool = False, past_kv: tuple | None = None,
+        word_positions: torch.Tensor | None = None,
+    ) -> tuple[torch.Tensor, tuple | None]:
+        # Attention with residual
+        attn_out, new_kv = self.attn(self.attn_norm(x), use_cache, past_kv, word_positions=word_positions)
+        x = x + attn_out
+        # FFN with residual
+        x = x + self.ffn(self.ffn_norm(x))
+        return x, new_kv
+class CircuitTransformer(nn.Module):
+    """
+    Minimal transformer for semantic circuitry experiments.
+    Features:
+    - Standard GPT-style architecture (RMSNorm, RoPE, SwiGLU, causal attention)
+    - Weight tying (embed = lm_head)
+    - Extension hooks for future work:
+      - freeze_layers() / unfreeze_layers() for progressive training
+      - get_layer_outputs() for interpretability
+      - window_size param on TransformerBlock for sliding window attention (not set by this class)
+    """
+    def __init__(self, config: CircuitConfig):
+        super().__init__()
+        self.config = config
+        # Token embeddings (optionally factorized)
+        embed_dim = getattr(config, 'embed_dim', 0)
+        head_dim = getattr(config, 'head_dim', 0)
+        # Auto-mirror factorization: head uses embed_dim for weight tying
+        if embed_dim > 0 and head_dim == 0:
+            head_dim = embed_dim
+        if embed_dim > 0:
+            self.embed = nn.Embedding(config.vocab_size, embed_dim)
+            self.embed_proj = nn.Linear(embed_dim, config.hidden_size, bias=False)
+        else:
+            self.embed = nn.Embedding(config.vocab_size, config.hidden_size)
+            self.embed_proj = None
+        self.embed_scale = math.sqrt(config.hidden_size)
+        # Transformer blocks
+        self.layers = nn.ModuleList([
+            TransformerBlock(
+                config.hidden_size,
+                config.num_heads,
+                getattr(config, 'num_kv_heads', None),
+                config.max_seq_len,
+                config.dropout,
+                word_rope_dims=getattr(config, 'word_rope_dims', 0),
+                word_rope_base=getattr(config, 'word_rope_base', 10.0),
+            )
+            for _ in range(config.num_layers)
+        ])
+        # Output (optionally MLP head)
+        self.norm = RMSNorm(config.hidden_size)
+        if head_dim > 0:
+            self.head_down = nn.Linear(config.hidden_size, head_dim, bias=False)
+            self.lm_head = nn.Linear(head_dim, config.vocab_size, bias=False)
+        else:
+            self.head_down = None
+            self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
+        # Weight tying (when embed and lm_head dimensions match)
+        _e = embed_dim if embed_dim > 0 else config.hidden_size
+        _h = head_dim if head_dim > 0 else config.hidden_size
+        if _e == _h:
+            self.lm_head.weight = self.embed.weight
+        # Auxiliary skip-ahead prediction head
+        self.skip_head = None
+        self.skip_head_down = None
+        aux_skip_k = getattr(config, 'aux_skip_k', 0)
+        if aux_skip_k > 0:
+            if head_dim > 0:
+                self.skip_head_down = nn.Linear(config.hidden_size, head_dim, bias=False)
+                self.skip_head = nn.Linear(head_dim, config.vocab_size, bias=False)
+            else:
+                self.skip_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
+        # Track frozen layers
+        self._frozen_layers: set[int] = set()
+        # Initialize weights
+        self.apply(self._init_weights)
+    def _init_weights(self, module):
+        if isinstance(module, nn.Linear):
+            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
+            if module.bias is not None:
+                torch.nn.init.zeros_(module.bias)
+        elif isinstance(module, nn.Embedding):
+            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
+    def forward(
+        self,
+        input_ids: torch.Tensor,
+        labels: torch.Tensor | None = None,
+        use_cache: bool = False,
+        past_kv: list | None = None,
+        word_positions: torch.Tensor | None = None,
+    ) -> dict:
+        """
+        Forward pass.
+        Args:
+            input_ids: [B, L] token IDs
+            labels: [B, L] target token IDs (for loss computation)
+            use_cache: Whether to return KV cache for generation
+            past_kv: Previous KV cache
+            word_positions: [B, L] position within word (from compute_word_positions)
+        Returns:
+            dict with 'logits', optionally 'loss', 'aux_loss', and 'past_kv'
+        """
+        B, L = input_ids.shape
+        # Embed tokens (optionally factorized)
+        x = self.embed(input_ids)
+        if self.embed_proj is not None:
+            x = F.silu(self.embed_proj(x))
+        x = x * self.embed_scale
+        # Process through layers
+        new_kv = [] if use_cache else None
+        for i, layer in enumerate(self.layers):
+            layer_past = past_kv[i] if past_kv is not None else None
+            x, kv = layer(x, use_cache, layer_past, word_positions=word_positions)
+            if use_cache:
+                new_kv.append(kv)
+        # Output (optionally MLP head)
+        x = self.norm(x)
+        if self.head_down is not None:
+            logits = self.lm_head(F.silu(self.head_down(x)))
+        else:
+            logits = self.lm_head(x)
+        result = {"logits": logits}
+        if use_cache:
+            result["past_kv"] = new_kv
+        # Compute loss if labels provided
+        if labels is not None:
+            # Shift for next-token prediction
+            shift_logits = logits[:, :-1, :].contiguous()
+            shift_labels = labels[:, 1:].contiguous()
+            loss = F.cross_entropy(
+                shift_logits.view(-1, self.config.vocab_size),
+                shift_labels.view(-1),
+                ignore_index=-100,
+            )
+            # Auxiliary skip-ahead prediction
+            if self.skip_head is not None:
+                skip_k = getattr(self.config, 'aux_skip_k', 0)
+                skip_weight = getattr(self.config, 'aux_skip_weight', 0.1)
+                if self.skip_head_down is not None:
+                    skip_logits = self.skip_head(F.silu(self.skip_head_down(x)))[:, :-skip_k, :].contiguous()
+                else:
+                    skip_logits = self.skip_head(x)[:, :-skip_k, :].contiguous()
+                skip_labels = labels[:, skip_k:].contiguous()
+                aux_loss = F.cross_entropy(
+                    skip_logits.view(-1, self.config.vocab_size),
+                    skip_labels.view(-1),
+                    ignore_index=-100,
+                )
+                result["aux_loss"] = aux_loss
+                loss = loss + skip_weight * aux_loss
+            result["loss"] = loss
+        return result
+    # === Extension hooks for future experiments ===
+    def freeze_layers(self, indices: list[int]) -> None:
+        """Freeze specific layers (stop gradients)."""
+        for idx in indices:
+            if 0 <= idx < len(self.layers):
+                for param in self.layers[idx].parameters():
+                    param.requires_grad = False
+                self._frozen_layers.add(idx)
+    def unfreeze_layers(self, indices: list[int] | None = None) -> None:
+        """Unfreeze specific layers (or all if indices=None)."""
+        if indices is None:
+            indices = list(self._frozen_layers)
+        for idx in indices:
+            if 0 <= idx < len(self.layers):
+                for param in self.layers[idx].parameters():
+                    param.requires_grad = True
+                self._frozen_layers.discard(idx)
+    def get_layer_outputs(self, input_ids: torch.Tensor) -> list[torch.Tensor]:
+        """Get intermediate outputs from each layer for interpretability."""
+        outputs = []
+        x = self.embed(input_ids)
+        if self.embed_proj is not None:
+            x = F.silu(self.embed_proj(x))
+        x = x * self.embed_scale
+        for layer in self.layers:
+            x, _ = layer(x, use_cache=False, past_kv=None)
+            outputs.append(x.clone())
+        return outputs
+    @torch.no_grad()
+    def generate(
+        self,
+        prompt_ids: torch.Tensor,
+        max_new_tokens: int = 50,
+        temperature: float = 0.8,
+        top_k: int = 50,
+        top_p: float = 0.9,
+        use_cache: bool = True,
+        word_start_table: torch.Tensor | None = None,
+    ) -> torch.Tensor:
+        """
+        Autoregressive generation with KV caching.
+        Args:
+            prompt_ids: [B, L] prompt token IDs
+            max_new_tokens: Maximum tokens to generate
+            temperature: Sampling temperature (0 selects greedy decoding)
+            top_k: Top-k filtering (0 disables)
+            top_p: Nucleus sampling threshold (1.0 disables)
+            use_cache: Use KV cache for faster generation
+            word_start_table: [vocab_size] bool tensor for word-position RoPE
+        Returns:
+            [B, L + n] generated token IDs, with n <= max_new_tokens (stops early at config.max_seq_len)
+        """
+        from .layers import compute_word_positions
+        self.eval()
+        generated = prompt_ids.clone()
+        past_kv = None
+        word_pos_counter = 0  # Track word position during cached generation
+        for _ in range(max_new_tokens):
+            # Get input (full sequence or just last token with cache)
+            if use_cache and past_kv is not None:
+                input_ids = generated[:, -1:]
+                # Compute word position for the single new token
+                if word_start_table is not None:
+                    # NOTE: reads batch row 0 only, so word positions assume batch size 1 here
+                    last_token = generated[0, -1].item()
+                    if word_start_table[last_token]:
+                        word_pos_counter = 0
+                    else:
+                        word_pos_counter += 1
+                    word_positions = torch.tensor([[float(word_pos_counter)]], device=input_ids.device)
+                else:
+                    word_positions = None
+            else:
+                input_ids = generated
+                # Compute word positions for full sequence
+                if word_start_table is not None:
+                    word_positions = compute_word_positions(input_ids, word_start_table)
+                else:
+                    word_positions = None
+            # Forward pass
+            output = self(input_ids, use_cache=use_cache, past_kv=past_kv, word_positions=word_positions)
+            logits = output["logits"][:, -1, :]  # Last position
+            if use_cache:
+                past_kv = output["past_kv"]
+            # Apply temperature
+            if temperature > 0:
+                logits = logits / temperature
+                # Top-k filtering
+                if top_k > 0:
+                    top_k_vals, _ = torch.topk(logits, min(top_k, logits.size(-1)))
+                    min_top_k = top_k_vals[:, -1].unsqueeze(-1)
+                    logits = torch.where(logits < min_top_k, float("-inf"), logits)
+                # Top-p (nucleus) filtering
+                if top_p < 1.0:
+                    sorted_logits, sorted_indices = torch.sort(logits, descending=True)
+                    cumsum_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
+                    # Remove tokens with cumulative prob above threshold
+                    sorted_indices_to_remove = cumsum_probs > top_p
+                    sorted_indices_to_remove[:, 1:] = sorted_indices_to_remove[:, :-1].clone()
+                    sorted_indices_to_remove[:, 0] = False
+                    indices_to_remove = sorted_indices_to_remove.scatter(
+                        1, sorted_indices, sorted_indices_to_remove
+                    )
+                    logits = logits.masked_fill(indices_to_remove, float("-inf"))
+                # Sample
+                probs = F.softmax(logits, dim=-1)
+                next_token = torch.multinomial(probs, num_samples=1)
+            else:
+                # Greedy
+                next_token = logits.argmax(dim=-1, keepdim=True)
+            generated = torch.cat([generated, next_token], dim=1)
+            # Stop if max length reached
+            if generated.size(1) >= self.config.max_seq_len:
+                break
+        return generated
+def count_parameters(model: CircuitTransformer) -> int:
+    """Count trainable parameters."""
+    return sum(p.numel() for p in model.parameters() if p.requires_grad)
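
The top-k and nucleus (top-p) steps in `generate` can be hard to follow in tensor form. Below is a plain-Python sketch of the same filtering logic on a single list of logits. It assumes temperature scaling has already been applied, and the helper names `softmax` and `filter_logits` are hypothetical, not functions from the repository.

```python
import math

NEG_INF = float("-inf")


def softmax(logits):
    """Numerically stable softmax; entries equal to -inf get probability 0."""
    m = max(x for x in logits if x != NEG_INF)
    exps = [math.exp(x - m) if x != NEG_INF else 0.0 for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def filter_logits(logits, top_k=0, top_p=1.0):
    """Mask logits outside the top-k set, then outside the top-p nucleus."""
    out = list(logits)
    if top_k > 0:
        # Keep everything at or above the k-th largest logit
        kth = sorted(out, reverse=True)[min(top_k, len(out)) - 1]
        out = [x if x >= kth else NEG_INF for x in out]
    if top_p < 1.0:
        # Indices sorted by descending logit
        order = sorted(range(len(out)), key=lambda i: out[i], reverse=True)
        probs = softmax([out[i] for i in order])
        cumulative = 0.0
        remove = []
        for p in probs:
            cumulative += p
            remove.append(cumulative > top_p)
        # Shift right by one so the token that crosses the threshold is kept,
        # and the highest-probability token is never removed
        remove = [False] + remove[:-1]
        for flag, i in zip(remove, order):
            if flag:
                out[i] = NEG_INF
    return out
```

Sampling then draws from `softmax(filter_logits(...))`; masked entries have probability zero and are never selected.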

scripts/__init__.py ADDED

File without changes

scripts/representation_analysis.py ADDED

	@@ -0,0 +1,1014 @@

+#!/usr/bin/env python3
+"""
+Representation analysis: CKA and Logit Lens for Prisma / Circuit Transformer.
+CKA (Centered Kernel Alignment):
+  Measures representational similarity between all layer pairs.
+  Produces a heatmap revealing mirror symmetry, phase transitions,
+  and cross-model alignment.
+Logit Lens:
+  Projects intermediate representations to vocabulary space at every layer.
+  Reveals what the model "thinks" at each processing stage -- from raw
+  tokens through the semantic bottleneck back to specific predictions.
+Also computes representation drift (cosine similarity between consecutive layers).
+Usage:
+    # Full analysis (CKA + logit lens)
+    python -m circuits.scripts.representation_analysis \\
+        --checkpoint path/to/checkpoint.pt \\
+        --data hf:HuggingFaceFW/fineweb-edu:sample-10BT:train
+    # Cross-model CKA
+    python -m circuits.scripts.representation_analysis \\
+        --checkpoint path/to/prisma.pt --hf-model gpt2-medium \\
+        --data hf:HuggingFaceFW/fineweb-edu:sample-10BT:train
+    # CKA only (skip logit lens)
+    python -m circuits.scripts.representation_analysis \\
+        --checkpoint path/to/checkpoint.pt \\
+        --data hf:HuggingFaceFW/fineweb-edu:sample-10BT:train \\
+        --no-logit-lens
+"""
+import argparse
+import json
+import sys
+import os
+from pathlib import Path
+from collections import OrderedDict
+import numpy as np
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+import matplotlib
+matplotlib.use("Agg")
+import matplotlib.pyplot as plt
+# ---------------------------------------------------------------------------
+# Model loading
+# ---------------------------------------------------------------------------
+def load_prisma_model(checkpoint_path: str, device: str = "cpu"):
+    """Load a Prisma/Circuit checkpoint, return (model, config_dict, model_type)."""
+    sys.path.insert(0, str(Path(__file__).resolve().parents[2]))
+    from circuits.config import CircuitConfig
+    from circuits.model import CircuitTransformer
+    from circuits.mirrored import MirroredConfig, MirroredTransformer
+    ckpt = torch.load(checkpoint_path, map_location="cpu", weights_only=False)
+    model_type = ckpt.get("model_type", "standard")
+    config_dict = ckpt.get("config", {})
+    if model_type == "mirrored":
+        if config_dict.get("dual_gate_middle"):
+            config_dict.pop("dual_gate_middle")
+        config = MirroredConfig.from_dict(config_dict)
+        model = MirroredTransformer(config)
+    else:
+        config = CircuitConfig.from_dict(config_dict)
+        model = CircuitTransformer(config)
+    state_dict = ckpt["model"]
+    if any(k.startswith("_orig_mod.") for k in state_dict):
+        state_dict = {k.removeprefix("_orig_mod."): v for k, v in state_dict.items()}
+    model.load_state_dict(state_dict, strict=False)
+    model.to(device).eval()
+    return model, config_dict, model_type
+def load_hf_model(model_name: str, device: str = "cpu"):
+    """Load a HuggingFace causal LM."""
+    from transformers import AutoModelForCausalLM
+    model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float32, trust_remote_code=True)
+    model.to(device).eval()
+    return model
+# ---------------------------------------------------------------------------
+# Data loading
+# ---------------------------------------------------------------------------
+def load_data(data_source: str, tokenizer_name: str, num_samples: int = 32,
+              context_length: int = 512, device: str = "cpu"):
+    """Load tokenized data. Returns (input_ids, tokenizer); input_ids is None if no text sample reaches context_length tokens.
+    Supports:
+      - Memmap .bin files (from circuits training cache)
+      - hf:dataset:config:split (streaming from HuggingFace)
+      - Plain text files
+    """
+    sys.path.insert(0, str(Path(__file__).resolve().parents[2]))
+    from circuits.data import get_tokenizer
+    tokenizer = get_tokenizer(tokenizer_name)
+    # Memmap binary file (already tokenized)
+    if data_source.endswith(".bin"):
+        import struct
+        with open(data_source, 'rb') as f:
+            n_chunks, seq_len = struct.unpack('II', f.read(8))
+        data = np.memmap(data_source, dtype=np.int32, mode='r',
+                         offset=8, shape=(n_chunks, seq_len))
+        n = min(num_samples, n_chunks)
+        # Slice to requested context length
+        cl = min(context_length, seq_len)
+        input_ids = torch.from_numpy(data[:n, :cl].copy()).long().to(device)
+        return input_ids, tokenizer
+    # HuggingFace dataset
+    if data_source.startswith("hf:"):
+        from datasets import load_dataset
+        parts = data_source[3:].split(":")
+        ds_name = parts[0]
+        ds_config = parts[1] if len(parts) > 1 else None
+        ds_split = parts[2] if len(parts) > 2 else "train"
+        dataset = load_dataset(ds_name, ds_config, split=ds_split, streaming=True)
+        all_ids = []
+        for item in dataset:
+            text = item.get("text", "")
+            if len(text) < 100:
+                continue
+            ids = tokenizer.encode(text)
+            if len(ids) >= context_length:
+                all_ids.append(ids[:context_length])
+            if len(all_ids) >= num_samples:
+                break
+        if not all_ids:
+            return None, tokenizer
+        return torch.tensor(all_ids, device=device), tokenizer
+    # Plain text file
+    with open(data_source, encoding="utf-8") as f:
+        texts = [line.strip() for line in f if len(line.strip()) > 100]
+    all_ids = []
+    for text in texts:
+        ids = tokenizer.encode(text)
+        if len(ids) >= context_length:
+            all_ids.append(ids[:context_length])
+        if len(all_ids) >= num_samples:
+            break
+    if not all_ids:
+        return None, tokenizer
+    return torch.tensor(all_ids, device=device), tokenizer
+def tokenize_for_hf(texts: list, model_name: str, context_length: int = 512,
+                     device: str = "cpu"):
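+    """Tokenize texts for an HF model. Returns (input_ids, tokenizer).
+    Texts of more than 32 tokens but fewer than context_length are right-padded with EOS;
+    texts of 32 tokens or fewer are dropped. input_ids is None if no text qualifies.
+    """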
+    from transformers import AutoTokenizer
+    tokenizer = AutoTokenizer.from_pretrained(model_name,
+                                              use_fast=False,
+                                              trust_remote_code=True)
+    if tokenizer.pad_token is None:
+        tokenizer.pad_token = tokenizer.eos_token
+    all_ids = []
+    for text in texts:
+        ids = tokenizer.encode(text, max_length=context_length, truncation=True)
+        if len(ids) >= context_length:
+            all_ids.append(ids[:context_length])
+        elif len(ids) > 32:
+            all_ids.append(ids + [tokenizer.eos_token_id] * (context_length - len(ids)))
+    if not all_ids:
+        return None, tokenizer
+    return torch.tensor(all_ids, device=device), tokenizer
+# ---------------------------------------------------------------------------
+# Activation collection
+# ---------------------------------------------------------------------------
+def collect_mirrored_activations(model, input_ids, word_positions=None):
+    """Collect activations from MirroredTransformer at every processing stage."""
+    activations = OrderedDict()
+    with torch.no_grad():
+        x = model.embed(input_ids)
+        if model.embed_proj is not None:
+            if model.embed_g3 is not None:
+                g4 = F.silu(model.embed_g4(x))
+                g3 = F.silu(model.embed_g3(x) * g4)
+                x = model.embed_proj(x) * g3
+            else:
+                x = F.silu(model.embed_proj(x))
+        x = x * model.embed_scale
+        activations["embedding"] = x.detach().cpu()
+        for i, block in enumerate(model.mirror_blocks):
+            x, _ = block(x, word_positions=word_positions)
+            activations[f"expand_{i}"] = x.detach().cpu()
+        for i, block in enumerate(model.middle_blocks):
+            x, _ = block(x, word_positions=word_positions)
+            activations[f"middle_{i}"] = x.detach().cpu()
+        for i in reversed(range(len(model.mirror_blocks))):
+            x, _ = model.mirror_blocks[i](x, word_positions=word_positions)
+            compress_idx = len(model.mirror_blocks) - 1 - i
+            activations[f"compress_{compress_idx}"] = x.detach().cpu()
+        x = model.norm(x)
+        activations["final_norm"] = x.detach().cpu()
+    return activations
+def collect_standard_activations(model, input_ids, word_positions=None):
+    """Collect activations from standard CircuitTransformer."""
+    activations = OrderedDict()
+    with torch.no_grad():
+        x = model.embed(input_ids)
+        if model.embed_proj is not None:
+            x = F.silu(model.embed_proj(x))
+        x = x * model.embed_scale
+        activations["embedding"] = x.detach().cpu()
+        for i, layer in enumerate(model.layers):
+            x, _ = layer(x, word_positions=word_positions)
+            activations[f"layer_{i}"] = x.detach().cpu()
+        x = model.norm(x)
+        activations["final_norm"] = x.detach().cpu()
+    return activations
+def collect_hf_activations(model, input_ids):
+    """Hook-based activation collection for HuggingFace models."""
+    activations = OrderedDict()
+    hooks = []
+    if hasattr(model, 'transformer'):
+        # GPT-2 style
+        blocks = model.transformer.h
+        embed = model.transformer.wte
+        final_norm = model.transformer.ln_f
+    elif hasattr(model, 'model'):
+        # Llama / Mistral style
+        blocks = model.model.layers
+        embed = model.model.embed_tokens
+        final_norm = model.model.norm
+    else:
+        raise ValueError(f"Unsupported HF model: {type(model)}")
+    def make_hook(name):
+        def hook_fn(module, input, output):
+            out = output[0] if isinstance(output, tuple) else output
+            activations[name] = out.detach().cpu()
+        return hook_fn
+    hooks.append(embed.register_forward_hook(make_hook("embedding")))
+    for i, block in enumerate(blocks):
+        hooks.append(block.register_forward_hook(make_hook(f"layer_{i}")))
+    hooks.append(final_norm.register_forward_hook(make_hook("final_norm")))
+    try:
+        with torch.no_grad():
+            model(input_ids)
+    finally:
+        # Always detach the hooks, even if the forward pass raises
+        for h in hooks:
+            h.remove()
+    return activations
+def collect_activations(model, model_type, config_dict, input_ids, device):
+    """Dispatch to the right collector based on model type."""
+    word_positions = None
+    word_rope_dims = config_dict.get("word_rope_dims", 0) if config_dict else 0
+    if word_rope_dims > 0 and model_type in ("standard", "mirrored"):
+        sys.path.insert(0, str(Path(__file__).resolve().parents[2]))
+        from circuits.data import get_tokenizer
+        from circuits.layers import build_word_start_table, compute_word_positions
+        tokenizer_name = config_dict.get("tokenizer_name", "gpt2")
+        # Load the tokenizer named in the checkpoint config (falls back to gpt2)
+        tokenizer = get_tokenizer(tokenizer_name)
+        word_start_table = build_word_start_table(tokenizer, len(tokenizer)).to(device)
+        word_positions = compute_word_positions(input_ids, word_start_table)
+    if model_type == "mirrored":
+        return collect_mirrored_activations(model, input_ids, word_positions)
+    elif model_type == "standard":
+        return collect_standard_activations(model, input_ids, word_positions)
+    else:
+        return collect_hf_activations(model, input_ids)
+# ---------------------------------------------------------------------------
+# Linear CKA
+# ---------------------------------------------------------------------------
+def linear_cka(X: torch.Tensor, Y: torch.Tensor) -> float:
+    """Compute linear CKA between two [N, D] representation matrices.
+    CKA(X, Y) = ||Yc^T Xc||_F^2 / (||Xc^T Xc||_F * ||Yc^T Yc||_F)
+    where Xc and Yc are X and Y with each feature column mean-centered over the N samples.
+    """
+    X = X.float()
+    Y = Y.float()
+    # Center
+    X = X - X.mean(0, keepdim=True)
+    Y = Y - Y.mean(0, keepdim=True)
+    N = X.shape[0]
+    if N < min(X.shape[1], Y.shape[1]):
+        # Kernel formulation (N < D): K=XX^T, L=YY^T — [N,N] matrices
+        K = X @ X.T
+        L = Y @ Y.T
+        numerator = (K * L).sum()
+        denominator = torch.sqrt((K * K).sum() * (L * L).sum())
+    else:
+        # Feature formulation (D <= N)
+        XtY = X.T @ Y
+        XtX = X.T @ X
+        YtY = Y.T @ Y
+        numerator = (XtY * XtY).sum()
+        denominator = torch.sqrt((XtX * XtX).sum() * (YtY * YtY).sum())
+    if denominator < 1e-10:
+        return 0.0
+    return (numerator / denominator).item()
+def compute_cka_matrix(activations: dict, subsample: int = 4) -> tuple:
+    """Compute CKA between all layer pairs. Returns (cka_matrix, layer_names)."""
+    names = list(activations.keys())
+    n_layers = len(names)
+    # Flatten and subsample: [B, L, D] -> [N, D]
+    flat_acts = {}
+    for name, act in activations.items():
+        act_sub = act[:, ::subsample, :]
+        flat_acts[name] = act_sub.reshape(-1, act_sub.shape[-1])
+    cka_matrix = np.zeros((n_layers, n_layers))
+    for i in range(n_layers):
+        cka_matrix[i, i] = 1.0
+        for j in range(i + 1, n_layers):
+            cka_val = linear_cka(flat_acts[names[i]], flat_acts[names[j]])
+            cka_matrix[i, j] = cka_val
+            cka_matrix[j, i] = cka_val
+        if (i + 1) % 5 == 0 or i == n_layers - 1:
+            print(f"  CKA: {i+1}/{n_layers} rows computed")
+    return cka_matrix, names
+def compute_cross_model_cka(acts_a: dict, acts_b: dict) -> tuple:
+    """Cross-model CKA using sample-level (avg-pooled) representations."""
+    names_a = list(acts_a.keys())
+    names_b = list(acts_b.keys())
+    def pool(activations):
+        return {name: act.mean(dim=1) for name, act in activations.items()}
+    pooled_a = pool(acts_a)
+    pooled_b = pool(acts_b)
+    # Ensure same number of samples
+    n_samples = min(
+        next(iter(pooled_a.values())).shape[0],
+        next(iter(pooled_b.values())).shape[0]
+    )
+    cka_matrix = np.zeros((len(names_a), len(names_b)))
+    for i, na in enumerate(names_a):
+        for j, nb in enumerate(names_b):
+            cka_matrix[i, j] = linear_cka(pooled_a[na][:n_samples], pooled_b[nb][:n_samples])
+        if (i + 1) % 5 == 0 or i == len(names_a) - 1:
+            print(f"  Cross-CKA: {i+1}/{len(names_a)} rows computed")
+    return cka_matrix, names_a, names_b
+# ---------------------------------------------------------------------------
+# Logit Lens
+# ---------------------------------------------------------------------------
+def get_unembed_components(model, model_type):
+    """Extract (norm_module, unembed_weight) for logit lens projection."""
+    if model_type in ("standard", "mirrored"):
+        return model.norm, model.embed.weight
+    elif hasattr(model, 'transformer'):
+        return model.transformer.ln_f, model.transformer.wte.weight
+    elif hasattr(model, 'model'):
+        return model.model.norm, model.model.embed_tokens.weight
+    else:
+        raise ValueError(f"Unsupported model: {type(model)}")
+def compute_logit_lens(activations: dict, norm: nn.Module, unembed_weight: torch.Tensor,
+                       labels: torch.Tensor, device: str = "cpu",
+                       chunk_size: int = 2048) -> OrderedDict:
+    """Compute logit lens statistics at every layer.
+    Projects intermediate hidden states through final norm + unembedding.
+    Computes entropy, top-1 probability, correct token rank, and
+    agreement with the final layer's predictions.
+    Args:
+        activations: OrderedDict[name] = [B, L, D]
+        norm: final layer norm module
+        unembed_weight: [V, D] unembedding matrix
+        labels: [B, L-1] next-token labels (input_ids[:, 1:])
+        device: computation device
+        chunk_size: number of positions per batch for projection
+    Returns:
+        OrderedDict[name] = {entropy, top1_prob, correct_rank, ...}
+    """
+    names = list(activations.keys())
+    final_name = names[-1]  # last key; every collector above stores "final_norm" last
+    results = OrderedDict()
+    unembed = unembed_weight.to(device)
+    norm_mod = norm.to(device)
+    labels_flat = labels.reshape(-1).to(device)
+    def process_layer(name, act, apply_norm=True):
+        """Project one layer's activations and compute all metrics."""
+        B, L, D = act.shape
+        flat = act[:, :-1, :].reshape(-1, D)  # [B*(L-1), D]
+        N = flat.shape[0]
+        all_entropy = []
+        all_top1_prob = []
+        all_correct_rank = []
+        all_top1_idx = []
+        for start in range(0, N, chunk_size):
+            end = min(start + chunk_size, N)
+            chunk = flat[start:end].to(device)
+            chunk_labels = labels_flat[start:end]
+            if apply_norm:
+                chunk = norm_mod(chunk)
+            logits = chunk @ unembed.T  # [cs, V]
+            log_probs = F.log_softmax(logits, dim=-1)
+            probs = log_probs.exp()
+            # Entropy
+            entropy = -(probs * log_probs).sum(dim=-1)
+            all_entropy.append(entropy.cpu())
+            # Top-1 probability
+            top1_prob = probs.max(dim=-1).values
+            all_top1_prob.append(top1_prob.cpu())
+            # Correct token rank
+            correct_logits = logits.gather(1, chunk_labels.unsqueeze(1))
+            rank = (logits > correct_logits).sum(dim=-1) + 1
+            all_correct_rank.append(rank.cpu())
+            # Top-1 index
+            all_top1_idx.append(logits.argmax(dim=-1).cpu())
+        entropy_t = torch.cat(all_entropy)
+        top1_t = torch.cat(all_top1_prob)
+        rank_t = torch.cat(all_correct_rank).float()
+        top1_idx = torch.cat(all_top1_idx)
+        return {
+            "entropy": entropy_t.mean().item(),
+            "entropy_std": entropy_t.std().item(),
+            "top1_prob": top1_t.mean().item(),
+            "correct_rank_mean": rank_t.mean().item(),
+            "correct_rank_median": rank_t.median().item(),
+            "log_rank_mean": rank_t.log().mean().item(),
+            "_top1_idx": top1_idx,
+        }
+    # Process all layers
+    for name in names:
+        is_final = (name == final_name)
+        act = activations[name]
+        stats = process_layer(name, act, apply_norm=not is_final)
+        results[name] = stats
+        print(f"  Logit lens: {name:20s}  entropy={stats['entropy']:.2f}  "
+              f"top1={stats['top1_prob']:.4f}  rank={stats['correct_rank_median']:.0f}")
+    # Compute agreement with final layer
+    final_top1 = results[final_name]["_top1_idx"]
+    for name in names:
+        layer_top1 = results[name]["_top1_idx"]
+        agreement = (layer_top1 == final_top1).float().mean().item()
+        results[name]["agreement_with_final"] = agreement
+    # Clean up internal tensors
+    for name in names:
+        del results[name]["_top1_idx"]
+    return results
+# ---------------------------------------------------------------------------
+# Representation drift
+# ---------------------------------------------------------------------------
+def compute_drift(activations: dict) -> OrderedDict:
+    """Cosine similarity and L2 distance between consecutive layers' representations."""
+    names = list(activations.keys())
+    drift = OrderedDict()
+    for i in range(1, len(names)):
+        prev = activations[names[i - 1]]
+        curr = activations[names[i]]
+        # Flatten to [N, D]
+        prev_flat = prev.reshape(-1, prev.shape[-1])
+        curr_flat = curr.reshape(-1, curr.shape[-1])
+        # Mean cosine similarity
+        cos = F.cosine_similarity(prev_flat, curr_flat, dim=-1)
+        drift[names[i]] = {
+            "cos_sim_mean": cos.mean().item(),
+            "cos_sim_std": cos.std().item(),
+            "l2_distance": (curr_flat - prev_flat).norm(dim=-1).mean().item(),
+        }
+    return drift
+# ---------------------------------------------------------------------------
+# Plotting
+# ---------------------------------------------------------------------------
+def _phase_color(name):
+    """Return color based on layer phase."""
+    if "expand" in name:
+        return "steelblue"
+    elif "middle" in name:
+        return "goldenrod"
+    elif "compress" in name:
+        return "coral"
+    elif "embedding" in name:
+        return "gray"
+    elif "final" in name:
+        return "gray"
+    else:
+        return "mediumpurple"
+def _layer_sort_key(name):
+    """Sort key for processing order."""
+    order = {"embedding": -1, "final_norm": 9999}
+    if name in order:
+        return order[name]
+    parts = name.split("_")
+    phase = parts[0]
+    idx = int(parts[1]) if len(parts) > 1 and parts[1].isdigit() else 0
+    phase_offset = {"expand": 0, "middle": 1000, "compress": 2000, "layer": 0}
+    return phase_offset.get(phase, 3000) + idx
+def _short_name(name):
+    """Shorten layer name for plot labels."""
+    if name == "embedding":
+        return "emb"
+    if name == "final_norm":
+        return "out"
+    parts = name.split("_")
+    if parts[0] == "expand":
+        return f"E{parts[1]}"
+    elif parts[0] == "middle":
+        return f"M{parts[1]}"
+    elif parts[0] == "compress":
+        return f"C{parts[1]}"
+    elif parts[0] == "layer":
+        return f"L{parts[1]}"
+    return name[:6]
+def plot_cka_self(cka_matrix: np.ndarray, names: list, output_dir: Path,
+                  model_label: str):
+    """Plot self-CKA heatmap."""
+    n = len(names)
+    short = [_short_name(name) for name in names]
+    fig, ax = plt.subplots(figsize=(max(10, n * 0.35), max(8, n * 0.3)))
+    fig.suptitle(f"{model_label} -- CKA Self-Similarity", fontsize=14)
+    im = ax.imshow(cka_matrix, cmap="inferno", vmin=0, vmax=1, aspect="equal")
+    # Phase separators
+    for i, name in enumerate(names):
+        if i > 0:
+            prev = names[i - 1].split("_")[0]
+            curr = name.split("_")[0]
+            if prev != curr:
+                ax.axhline(i - 0.5, color="white", linewidth=1.5, alpha=0.8)
+                ax.axvline(i - 0.5, color="white", linewidth=1.5, alpha=0.8)
+    ax.set_xticks(range(n))
+    ax.set_xticklabels(short, rotation=90, fontsize=7)
+    ax.set_yticks(range(n))
+    ax.set_yticklabels(short, fontsize=7)
+    plt.colorbar(im, ax=ax, fraction=0.046, pad=0.04, label="CKA")
+    plt.tight_layout()
+    fig.savefig(output_dir / "cka_self.png", dpi=150)
+    plt.close(fig)
+def plot_cka_cross(cka_matrix: np.ndarray, names_a: list, names_b: list,
+                   output_dir: Path, label_a: str, label_b: str):
+    """Plot cross-model CKA heatmap."""
+    short_a = [_short_name(n) for n in names_a]
+    short_b = [_short_name(n) for n in names_b]
+    na, nb = len(names_a), len(names_b)
+    fig, ax = plt.subplots(figsize=(max(10, nb * 0.35), max(8, na * 0.3)))
+    fig.suptitle(f"Cross-CKA: {label_a} vs {label_b}", fontsize=14)
+    im = ax.imshow(cka_matrix, cmap="inferno", vmin=0, vmax=1, aspect="auto")
+    ax.set_xticks(range(nb))
+    ax.set_xticklabels(short_b, rotation=90, fontsize=7)
+    ax.set_xlabel(label_b)
+    ax.set_yticks(range(na))
+    ax.set_yticklabels(short_a, fontsize=7)
+    ax.set_ylabel(label_a)
+    plt.colorbar(im, ax=ax, fraction=0.046, pad=0.04, label="CKA")
+    plt.tight_layout()
+    fig.savefig(output_dir / "cka_cross.png", dpi=150)
+    plt.close(fig)
+def plot_logit_lens(lens_results: OrderedDict, output_dir: Path,
+                    model_label: str):
+    """Plot logit lens summary: entropy, confidence, rank, agreement."""
+    names = list(lens_results.keys())
+    sorted_names = sorted(names, key=_layer_sort_key)
+    short = [_short_name(n) for n in sorted_names]
+    colors = [_phase_color(n) for n in sorted_names]
+    x = range(len(sorted_names))
+    fig, axes = plt.subplots(2, 2, figsize=(16, 10))
+    fig.suptitle(f"{model_label} -- Logit Lens", fontsize=14)
+    # Entropy
+    vals = [lens_results[n]["entropy"] for n in sorted_names]
+    axes[0, 0].bar(x, vals, color=colors, alpha=0.85)
+    axes[0, 0].set_ylabel("Entropy (nats)")
+    axes[0, 0].set_title("Prediction entropy per layer")
+    axes[0, 0].set_xticks(x)
+    axes[0, 0].set_xticklabels(short, rotation=90, fontsize=7)
+    # Top-1 probability
+    vals = [lens_results[n]["top1_prob"] for n in sorted_names]
+    axes[0, 1].bar(x, vals, color=colors, alpha=0.85)
+    axes[0, 1].set_ylabel("Top-1 probability")
+    axes[0, 1].set_title("Prediction confidence per layer")
+    axes[0, 1].set_xticks(x)
+    axes[0, 1].set_xticklabels(short, rotation=90, fontsize=7)
+    # Correct rank (log scale)
+    vals = [lens_results[n]["correct_rank_median"] for n in sorted_names]
+    axes[1, 0].bar(x, vals, color=colors, alpha=0.85)
+    axes[1, 0].set_ylabel("Median rank of correct token")
+    axes[1, 0].set_yscale("log")
+    axes[1, 0].set_title("When does the model find the answer?")
+    axes[1, 0].set_xticks(x)
+    axes[1, 0].set_xticklabels(short, rotation=90, fontsize=7)
+    # Agreement with final layer
+    vals = [lens_results[n]["agreement_with_final"] for n in sorted_names]
+    axes[1, 1].bar(x, vals, color=colors, alpha=0.85)
+    axes[1, 1].set_ylabel("Agreement with final layer")
+    axes[1, 1].set_title("Convergence toward final prediction")
+    axes[1, 1].set_ylim(0, 1.05)
+    axes[1, 1].set_xticks(x)
+    axes[1, 1].set_xticklabels(short, rotation=90, fontsize=7)
+    plt.tight_layout()
+    fig.savefig(output_dir / "logit_lens_summary.png", dpi=150)
+    plt.close(fig)
+def plot_logit_lens_trajectory(activations: dict, norm: nn.Module,
+                                unembed_weight: torch.Tensor, input_ids: torch.Tensor,
+                                tokenizer, output_dir: Path, model_label: str,
+                                device: str = "cpu",
+                                n_positions: int = 6, n_layers: int = 10):
+    """Show top-5 predicted tokens at selected layers for a few positions.
+    Picks positions spread across the first sample and shows how the
+    model's prediction evolves through the network.
+    """
+    names = sorted(activations.keys(), key=_layer_sort_key)
+    # Select layers evenly spread across the network
+    if len(names) > n_layers:
+        indices = np.linspace(0, len(names) - 1, n_layers, dtype=int)
+        selected_layers = [names[i] for i in indices]
+    else:
+        selected_layers = names
+    # Select positions from the first sample
+    seq_len = input_ids.shape[1]
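+    # Skip the first few tokens (little context), but stay in range for short sequences
+    first_pos = min(10, max(0, seq_len - 2))
+    pos_indices = np.linspace(first_pos, seq_len - 2, n_positions, dtype=int)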
+    unembed = unembed_weight.to(device)
+    norm_mod = norm.to(device)
+    final_name = names[-1]
+    fig, axes = plt.subplots(n_positions, 1, figsize=(14, 3 * n_positions))
+    if n_positions == 1:
+        axes = [axes]
+    fig.suptitle(f"{model_label} -- Token prediction trajectory", fontsize=14, y=1.02)
+    for pos_idx, pos in enumerate(pos_indices):
+        ax = axes[pos_idx]
+        actual_token = tokenizer.decode([input_ids[0, pos + 1].item()])
+        context = tokenizer.decode(input_ids[0, max(0, pos - 5):pos + 1].tolist())
+        layer_labels = []
+        top_tokens_per_layer = []
+        for name in selected_layers:
+            is_final = (name == final_name)
+            hidden = activations[name][0, pos:pos + 1, :].to(device)  # [1, D]
+            if not is_final:
+                hidden = norm_mod(hidden)
+            logits = (hidden @ unembed.T).squeeze(0)  # [V]
+            probs = F.softmax(logits, dim=-1)
+            top5_vals, top5_idx = probs.topk(5)
+            tokens_str = []
+            for val, idx in zip(top5_vals, top5_idx):
+                tok = tokenizer.decode([idx.item()]).replace("\n", "\\n")
+                tokens_str.append(f"{tok}({val:.2f})")
+            layer_labels.append(_short_name(name))
+            top_tokens_per_layer.append("\n".join(tokens_str))
+        # Create a text table
+        ax.set_xlim(-0.5, len(layer_labels) - 0.5)
+        ax.set_ylim(-0.5, 5.5)
+        ax.set_xticks(range(len(layer_labels)))
+        ax.set_xticklabels(layer_labels, fontsize=8)
+        ax.set_yticks([])
+        for li, tokens_str in enumerate(top_tokens_per_layer):
+            lines = tokens_str.split("\n")
+            for rank, line in enumerate(lines):
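+                # Highlight the row if it contains the actual next token. Guard against
+                # whitespace-only tokens, whose stripped form would match every row.
+                target = actual_token.strip()
+                is_hit = bool(target) and target in line
+                color = "darkgreen" if is_hit else "black"
+                fontweight = "bold" if is_hit else "normal"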
+                ax.text(li, rank, line, ha="center", va="center", fontsize=7,
+                        color=color, fontweight=fontweight)
+        ax.set_title(f'pos {pos}: "...{context}" -> [{actual_token.strip()}]',
+                     fontsize=9, loc="left")
+        ax.invert_yaxis()
+        ax.spines["top"].set_visible(False)
+        ax.spines["right"].set_visible(False)
+        ax.spines["left"].set_visible(False)
+    plt.tight_layout()
+    fig.savefig(output_dir / "logit_lens_trajectory.png", dpi=150, bbox_inches="tight")
+    plt.close(fig)
+def plot_drift(drift: OrderedDict, output_dir: Path, model_label: str):
+    """Plot representation drift between consecutive layers."""
+    names = list(drift.keys())
+    sorted_names = sorted(names, key=_layer_sort_key)
+    short = [_short_name(n) for n in sorted_names]
+    colors = [_phase_color(n) for n in sorted_names]
+    x = range(len(sorted_names))
+    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
+    fig.suptitle(f"{model_label} -- Representation drift", fontsize=14)
+    # Cosine similarity with previous layer
+    vals = [drift[n]["cos_sim_mean"] for n in sorted_names]
+    axes[0].bar(x, vals, color=colors, alpha=0.85)
+    axes[0].set_ylabel("Cosine similarity with previous layer")
+    axes[0].set_title("How much each layer preserves direction")
+    axes[0].set_xticks(x)
+    axes[0].set_xticklabels(short, rotation=90, fontsize=7)
+    # L2 distance
+    vals = [drift[n]["l2_distance"] for n in sorted_names]
+    axes[1].bar(x, vals, color=colors, alpha=0.85)
+    axes[1].set_ylabel("L2 distance from previous layer")
+    axes[1].set_title("How far each layer moves the representation")
+    axes[1].set_xticks(x)
+    axes[1].set_xticklabels(short, rotation=90, fontsize=7)
+    plt.tight_layout()
+    fig.savefig(output_dir / "representation_drift.png", dpi=150)
+    plt.close(fig)
+# ---------------------------------------------------------------------------
+# Results saving
+# ---------------------------------------------------------------------------
+def save_results(cka_matrix, cka_names, lens_results, drift, cross_cka, output_dir):
+    """Save all numerical results to JSON."""
+    out = {}
+    if cka_matrix is not None:
+        out["cka_self"] = {
+            "names": cka_names,
+            "matrix": cka_matrix.tolist(),
+        }
+    if lens_results:
+        out["logit_lens"] = {name: data for name, data in lens_results.items()}
+    if drift:
+        out["drift"] = {name: data for name, data in drift.items()}
+    if cross_cka is not None:
+        matrix, names_a, names_b = cross_cka
+        out["cka_cross"] = {
+            "names_a": names_a,
+            "names_b": names_b,
+            "matrix": matrix.tolist(),
+        }
+    with open(output_dir / "results.json", "w") as f:
+        json.dump(out, f, indent=2, default=str)
+# ---------------------------------------------------------------------------
+# Main
+# ---------------------------------------------------------------------------
+def main():
+    parser = argparse.ArgumentParser(
+        description="CKA and Logit Lens analysis for Prisma / Circuit Transformer")
+    parser.add_argument("--checkpoint", type=str, required=True,
+                        help="Path to Prisma/Circuit checkpoint")
+    parser.add_argument("--checkpoint-b", type=str, default=None,
+                        help="Second Prisma checkpoint for cross-model CKA")
+    parser.add_argument("--hf-model", type=str, default=None,
+                        help="HuggingFace model for cross-model CKA (e.g. gpt2-medium)")
+    parser.add_argument("--data", type=str, required=True,
+                        help="Data source (hf:dataset:config:split or file path)")
+    parser.add_argument("--num-samples", type=int, default=32,
+                        help="Number of text samples (default: 32)")
+    parser.add_argument("--context-length", type=int, default=512,
+                        help="Sequence length (default: 512)")
+    parser.add_argument("--cka-subsample", type=int, default=4,
+                        help="Position subsampling for CKA (default: 4)")
+    parser.add_argument("--no-logit-lens", action="store_true",
+                        help="Skip logit lens analysis")
+    parser.add_argument("--no-cka", action="store_true",
+                        help="Skip CKA analysis")
+    parser.add_argument("--output-dir", type=str, default=None,
+                        help="Output directory (default: auto)")
+    parser.add_argument("--gpu", type=int, default=0, help="GPU index")
+    args = parser.parse_args()
+    device = f"cuda:{args.gpu}" if torch.cuda.is_available() else "cpu"
+    print(f"Device: {device}")
+    # Output directory
+    if args.output_dir:
+        output_dir = Path(args.output_dir)
+    else:
+        ckpt_name = Path(args.checkpoint).parent.name
+        output_dir = Path("circuits/scripts/representation_output") / ckpt_name
+    output_dir.mkdir(parents=True, exist_ok=True)
+    print(f"Output: {output_dir}")
+    # === Load model A ===
+    print(f"\nLoading: {args.checkpoint}")
+    model_a, config_a, model_type_a = load_prisma_model(args.checkpoint, device)
+    label_a = Path(args.checkpoint).parent.name
+    n_params = sum(p.numel() for p in model_a.parameters())
+    print(f"  Type: {model_type_a}, params: {n_params:,}")
+    # === Load data ===
+    ckpt_data = torch.load(args.checkpoint, map_location="cpu", weights_only=False)
+    tokenizer_name = ckpt_data.get("tokenizer_name", config_a.get("tokenizer_name", "gpt2"))
+    del ckpt_data
+    print(f"\nLoading data ({args.num_samples} samples, ctx={args.context_length})...")
+    result = load_data(
+        args.data, tokenizer_name, args.num_samples, args.context_length, device
+    )
+    if result[0] is None:
+        print("ERROR: No valid samples loaded.")
+        return
+    input_ids, tokenizer = result
+    print(f"  Data shape: {input_ids.shape}")
+    # === Collect activations (model A) ===
+    print(f"\nCollecting activations ({model_type_a})...")
+    acts_a = collect_activations(model_a, model_type_a, config_a, input_ids, device)
+    print(f"  Collected {len(acts_a)} layers")
+    # Free GPU memory
+    del model_a
+    if device.startswith("cuda"):
+        torch.cuda.empty_cache()
+    # === CKA (self) ===
+    cka_matrix = None
+    cka_names = None
+    if not args.no_cka:
+        print(f"\nComputing self-CKA (subsample={args.cka_subsample})...")
+        cka_matrix, cka_names = compute_cka_matrix(acts_a, subsample=args.cka_subsample)
+        plot_cka_self(cka_matrix, cka_names, output_dir, label_a)
+        print(f"  Saved: cka_self.png")
+    # === Cross-model CKA ===
+    cross_cka = None
+    if not args.no_cka and (args.checkpoint_b or args.hf_model):
+        if args.checkpoint_b:
+            print(f"\nLoading comparison: {args.checkpoint_b}")
+            model_b, config_b, model_type_b = load_prisma_model(args.checkpoint_b, device)
+            label_b = Path(args.checkpoint_b).parent.name
+            acts_b = collect_activations(model_b, model_type_b, config_b, input_ids, device)
+            del model_b
+        else:
+            print(f"\nLoading HF model: {args.hf_model}")
+            model_b = load_hf_model(args.hf_model, device)
+            label_b = args.hf_model
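+            # Decode texts from our tokens and re-tokenize for the HF model.
+            # NOTE: cross-model CKA pairs samples by row index, so it assumes
+            # tokenize_for_hf keeps every text (it drops texts of <= 32 tokens).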
+            print(f"  Re-tokenizing for {args.hf_model}...")
+            raw_texts = [tokenizer.decode(input_ids[i].tolist()) for i in range(input_ids.shape[0])]
+            input_ids_b, _ = tokenize_for_hf(
+                raw_texts, args.hf_model, args.context_length, device
+            )
+            if input_ids_b is not None:
+                print(f"  HF data shape: {input_ids_b.shape}")
+                acts_b = collect_hf_activations(model_b, input_ids_b)
+            else:
+                acts_b = None
+            del model_b
+        if device.startswith("cuda"):
+            torch.cuda.empty_cache()
+        if acts_b:
+            print(f"\nComputing cross-model CKA...")
+            cross_matrix, cross_names_a, cross_names_b = compute_cross_model_cka(acts_a, acts_b)
+            cross_cka = (cross_matrix, cross_names_a, cross_names_b)
+            plot_cka_cross(cross_matrix, cross_names_a, cross_names_b,
+                           output_dir, label_a, label_b)
+            print(f"  Saved: cka_cross.png")
+            del acts_b
+    # === Logit lens ===
+    lens_results = None
+    if not args.no_logit_lens:
+        # Reload model for unembedding components (we deleted it for memory)
+        print(f"\nReloading model for logit lens...")
+        model_a, _, _ = load_prisma_model(args.checkpoint, device)
+        norm, unembed_weight = get_unembed_components(model_a, model_type_a)
+        labels = input_ids[:, 1:].cpu()  # next-token labels
+        print(f"Computing logit lens...")
+        lens_results = compute_logit_lens(acts_a, norm, unembed_weight, labels, device)
+        plot_logit_lens(lens_results, output_dir, label_a)
+        print(f"  Saved: logit_lens_summary.png")
+        # Token trajectory visualization
+        print(f"  Generating token trajectories...")
+        plot_logit_lens_trajectory(
+            acts_a, norm, unembed_weight, input_ids.cpu(), tokenizer,
+            output_dir, label_a, device
+        )
+        print(f"  Saved: logit_lens_trajectory.png")
+        del model_a
+        if device.startswith("cuda"):
+            torch.cuda.empty_cache()
+    # === Representation drift ===
+    print(f"\nComputing representation drift...")
+    drift = compute_drift(acts_a)
+    plot_drift(drift, output_dir, label_a)
+    print(f"  Saved: representation_drift.png")
+    # === Save results ===
+    save_results(cka_matrix, cka_names, lens_results, drift, cross_cka, output_dir)
+    print(f"\nAll outputs saved to: {output_dir}")
+    n_plots = len(list(output_dir.glob("*.png")))
+    print(f"  Plots: {n_plots} PNG files")
+    print(f"  Data:  results.json")
+if __name__ == "__main__":
+    main()
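
Note on `linear_cka` in scripts/representation_analysis.py: the function switches between a kernel branch (N < D) and a feature branch (D <= N). The sketch below is an illustration of why the two branches should return the same value, not the repository's implementation. It uses plain Python lists instead of torch tensors, and the helper names (`center`, `matmul`, `transpose`, `frob_inner`, `cka_feature`, `cka_kernel`) are hypothetical.

```python
import math


def center(M):
    """Subtract each column's mean (rows are samples, columns are features)."""
    n = len(M)
    means = [sum(col) / n for col in zip(*M)]
    return [[v - m for v, m in zip(row, means)] for row in M]


def transpose(M):
    return [list(col) for col in zip(*M)]


def matmul(A, B):
    Bt = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]


def frob_inner(A, B):
    """Frobenius inner product: sum of elementwise products."""
    return sum(a * b for ra, rb in zip(A, B) for a, b in zip(ra, rb))


def cka_feature(X, Y):
    """Feature form: works on [D, D] matrices, cheaper when D <= N."""
    Xc, Yc = center(X), center(Y)
    XtY = matmul(transpose(Xc), Yc)
    XtX = matmul(transpose(Xc), Xc)
    YtY = matmul(transpose(Yc), Yc)
    return frob_inner(XtY, XtY) / math.sqrt(frob_inner(XtX, XtX) * frob_inner(YtY, YtY))


def cka_kernel(X, Y):
    """Kernel form: works on [N, N] Gram matrices, cheaper when N < D."""
    Xc, Yc = center(X), center(Y)
    K = matmul(Xc, transpose(Xc))
    L = matmul(Yc, transpose(Yc))
    return frob_inner(K, L) / math.sqrt(frob_inner(K, K) * frob_inner(L, L))


# Toy data: 4 samples, with 2 and 3 features respectively.
X = [[1.0, 2.0], [3.0, 1.0], [0.0, 4.0], [2.0, 2.0]]
Y = [[2.0, 0.0, 1.0], [1.0, 1.0, 0.0], [0.0, 3.0, 2.0], [4.0, 1.0, 1.0]]

feature_value = cka_feature(X, Y)
kernel_value = cka_kernel(X, Y)
scaled_value = cka_feature([[3.0 * v for v in row] for row in X], Y)
print(feature_value, kernel_value, scaled_value)
```

The agreement follows from the trace identity ||Yc^T Xc||_F^2 = tr(Xc Xc^T Yc Yc^T), so picking a branch by shape should only affect cost, not the result. The last value illustrates that linear CKA is unchanged when one representation is rescaled by a constant.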

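
Note on `compute_logit_lens` in scripts/representation_analysis.py: for each position it reports the entropy of the projected distribution, the top-1 probability, and the rank of the correct next token. The sketch below reproduces those three per-position quantities for a single logit vector in plain Python. The function name `lens_metrics` is hypothetical, and the sketch omits the norm and unembedding projection that the script applies first.

```python
import math


def lens_metrics(logits, correct_idx):
    """Return (entropy in nats, top-1 probability, 1-based rank of the correct token)."""
    # Numerically stable log-softmax: subtract the max before exponentiating.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    log_probs = [v - log_z for v in logits]
    probs = [math.exp(lp) for lp in log_probs]

    entropy = -sum(p * lp for p, lp in zip(probs, log_probs))
    top1_prob = max(probs)
    # Rank = 1 + number of tokens with a strictly larger logit, so ties
    # are resolved in favor of the correct token (as in the script).
    rank = 1 + sum(1 for v in logits if v > logits[correct_idx])
    return entropy, top1_prob, rank


uniform = lens_metrics([0.0, 0.0, 0.0, 0.0], correct_idx=0)
peaked = lens_metrics([2.0, 1.0, 0.0], correct_idx=2)
print(uniform, peaked)
```

Because the rank counts only strictly larger logits, a layer that assigns identical logits to every token reports rank 1. This is one reason to read the rank panel together with the entropy panel rather than on its own.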
scripts/spectral_analysis.py ADDED

	@@ -0,0 +1,969 @@

+#!/usr/bin/env python3
+"""
+Spectral analysis of Prisma / Circuit Transformer checkpoints.
+Computes SVD spectra of weight matrices and (optionally) activation covariances,
+revealing how the model organizes information geometrically.
+Analyses:
+  1. Weight spectra     — singular value distributions per matrix
+  2. Effective rank     — how many dimensions carry real signal
+  3. Power-law fit      — Martin & Mahoney alpha exponent (training quality)
+  4. MP bound           — Marchenko-Pastur separation of signal vs noise
+  5. Mirror comparison  — expand vs compress activation spectra (Prisma-specific)
+  6. Embedding alignment— spectral similarity between embed and final hidden states
+  7. Layer-wise summary — effective rank progression through the network (the lens)
+Usage:
+    # Weight-only analysis (no data needed)
+    python -m circuits.scripts.spectral_analysis --checkpoint path/to/checkpoint.pt
+    # Full analysis with activation spectra (needs data)
+    python -m circuits.scripts.spectral_analysis --checkpoint path/to/checkpoint.pt \
+        --data hf:HuggingFaceFW/fineweb-edu:sample-10BT:train --num-samples 512
+    # Compare two checkpoints
+    python -m circuits.scripts.spectral_analysis \
+        --checkpoint path/to/prisma.pt --checkpoint-b path/to/standard.pt
+    # Compare against HuggingFace model
+    python -m circuits.scripts.spectral_analysis \
+        --checkpoint path/to/prisma.pt --hf-model gpt2-medium
+"""
+import argparse
+import json
+import sys
+import os
+from pathlib import Path
+from collections import defaultdict
+import numpy as np
+import torch
+import torch.nn as nn
+import matplotlib
+matplotlib.use("Agg")
+import matplotlib.pyplot as plt
+from matplotlib.gridspec import GridSpec
+# ---------------------------------------------------------------------------
+# Model loading
+# ---------------------------------------------------------------------------
+def load_prisma_model(checkpoint_path: str, device: str = "cpu"):
+    """Load a Prisma/Circuit checkpoint, return (model, config_dict, model_type)."""
+    sys.path.insert(0, str(Path(__file__).resolve().parents[2]))
+    from circuits.config import CircuitConfig
+    from circuits.model import CircuitTransformer
+    from circuits.mirrored import MirroredConfig, MirroredTransformer
+    ckpt = torch.load(checkpoint_path, map_location="cpu", weights_only=False)
+    model_type = ckpt.get("model_type", "standard")
+    config_dict = ckpt.get("config", {})
+    if model_type == "mirrored":
+        if config_dict.get("dual_gate_middle"):
+            config_dict.pop("dual_gate_middle")
+        config = MirroredConfig.from_dict(config_dict)
+        model = MirroredTransformer(config)
+    else:
+        config = CircuitConfig.from_dict(config_dict)
+        model = CircuitTransformer(config)
+    state_dict = ckpt["model"]
+    if any(k.startswith("_orig_mod.") for k in state_dict):
+        state_dict = {k.removeprefix("_orig_mod."): v for k, v in state_dict.items()}
+    model.load_state_dict(state_dict, strict=False)
+    model.to(device).eval()
+    return model, config_dict, model_type
+def load_hf_model(model_name: str, device: str = "cpu"):
+    """Load a HuggingFace causal LM."""
+    from transformers import AutoModelForCausalLM
+    model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float32)
+    model.to(device).eval()
+    return model
+# ---------------------------------------------------------------------------
+# SVD utilities
+# ---------------------------------------------------------------------------
+def compute_singular_values(weight: torch.Tensor) -> np.ndarray:
+    """Compute singular values (sorted descending) of a 2D weight matrix, or None if not 2D."""
+    w = weight.detach().float().cpu()
+    if w.ndim != 2:
+        return None
+    sv = torch.linalg.svdvals(w).numpy()
+    return sv
+def effective_rank(sv: np.ndarray) -> float:
+    """Entropy-based effective rank (Roy & Vetterli, 2007).
+    erank = exp(H(p)) where p_i = sigma_i / sum(sigma)
+    and H is Shannon entropy. Ranges from 1 (rank-1) to min(m,n) (full rank).
+    """
+    sv = sv[sv > 1e-10]
+    if len(sv) == 0:
+        return 0.0
+    p = sv / sv.sum()
+    entropy = -(p * np.log(p)).sum()
+    return float(np.exp(entropy))
+def stable_rank(sv: np.ndarray) -> float:
+    """Stable rank = ||W||_F^2 / ||W||_2^2 = sum(sigma^2) / max(sigma)^2."""
+    if len(sv) == 0 or sv[0] < 1e-10:
+        return 0.0
+    return float((sv ** 2).sum() / (sv[0] ** 2))
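As a sanity check on the two rank measures above, the following sketch reimplements them for plain Python lists using only the standard library. The names `erank` and `srank` are illustrative and not part of the script. A flat spectrum should give an effective rank and a stable rank equal to the number of singular values, and a spectrum with a single nonzero value should give 1 for both.

```python
import math

def erank(sv):
    """Entropy-based effective rank of a list of singular values."""
    sv = [s for s in sv if s > 1e-10]
    if not sv:
        return 0.0
    total = sum(sv)
    p = [s / total for s in sv]
    entropy = -sum(pi * math.log(pi) for pi in p)
    return math.exp(entropy)

def srank(sv):
    """Stable rank: sum of squared singular values over the largest one squared."""
    if not sv or max(sv) < 1e-10:
        return 0.0
    top = max(sv)
    return sum(s * s for s in sv) / (top * top)

flat = [1.0] * 8                               # all directions equally strong
spiked = [5.0, 0.0, 0.0, 0.0]                  # a single dominant direction
decaying = [1.0 / (k + 1) for k in range(8)]   # somewhere in between

for label, sv in [("flat", flat), ("spiked", spiked), ("decaying", decaying)]:
    print(f"{label:<9} erank={erank(sv):.3f} srank={srank(sv):.3f}")
```

The stable rank weights singular values by their squares, so on a decaying spectrum it comes out smaller than the entropy-based effective rank.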
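+def marchenko_pastur_bound(m: int, n: int, sv: np.ndarray) -> float:
+    """Heuristic Marchenko-Pastur-style upper edge for singular values.
+    For an m x n random matrix with i.i.d. entries of standard deviation sigma,
+    the largest singular value concentrates near sigma * (sqrt(m) + sqrt(n)).
+    Here sigma is estimated from the median of the bottom half of the spectrum,
+    and the edge factor (1 + sqrt(gamma)) is squared, with gamma = max(m, n) / min(m, n),
+    so the returned threshold sits above the exact edge for the same sigma estimate.
+    """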
+    gamma = max(m, n) / min(m, n)
+    # Estimate noise level from bottom half of spectrum
+    bottom_half = sv[len(sv) // 2:]
+    if len(bottom_half) == 0:
+        return sv[-1] if len(sv) > 0 else 0.0
+    sigma_est = float(np.median(bottom_half)) / np.sqrt(max(m, n))
+    mp_upper = sigma_est * (1.0 + np.sqrt(gamma)) ** 2 * np.sqrt(min(m, n))
+    return mp_upper
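+def fit_power_law(sv: np.ndarray, fit_fraction: float = 0.8) -> tuple[float, float]:
+    """Fit a power law to the rank-ordered singular values.
+    Returns (alpha, r_squared), where alpha is the negated slope of log(sv)
+    against log(rank). Smaller alpha means a flatter, heavier-tailed spectrum.
+    This rank-size exponent is not the same quantity as the eigenvalue-density
+    exponent of Martin & Mahoney, so the alpha < 2 heuristic applies only loosely.
+    """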
+    sv = sv[sv > 1e-10]
+    if len(sv) < 10:
+        return 0.0, 0.0
+    # Fit to the top `fit_fraction` of the spectrum (exclude noise floor)
+    n_fit = max(10, int(len(sv) * fit_fraction))
+    sv_fit = sv[:n_fit]
+    log_rank = np.log(np.arange(1, n_fit + 1))
+    log_sv = np.log(sv_fit)
+    # Linear regression in log-log space: log(sv) = -alpha * log(rank) + c
+    coeffs = np.polyfit(log_rank, log_sv, 1)
+    alpha = -coeffs[0]
+    # R-squared
+    predicted = np.polyval(coeffs, log_rank)
+    ss_res = ((log_sv - predicted) ** 2).sum()
+    ss_tot = ((log_sv - log_sv.mean()) ** 2).sum()
+    r_sq = 1.0 - ss_res / ss_tot if ss_tot > 0 else 0.0
+    return float(alpha), float(r_sq)
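The log-log regression above can be checked on a synthetic spectrum. This is a minimal sketch using only the standard library and a hypothetical helper `fit_slope`, which uses closed-form least squares instead of `np.polyfit`. Singular values generated as `rank ** -alpha` should give back the same alpha with R² close to 1.

```python
import math

def fit_slope(sv):
    """Least-squares fit of log(sv) = -alpha * log(rank) + c; returns (alpha, r2)."""
    xs = [math.log(r) for r in range(1, len(sv) + 1)]
    ys = [math.log(s) for s in sv]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else 0.0
    return -slope, r2

true_alpha = 0.7  # assumed exponent for the synthetic spectrum
synthetic = [3.0 * r ** (-true_alpha) for r in range(1, 201)]
alpha, r2 = fit_slope(synthetic)
print(f"alpha={alpha:.4f} r2={r2:.4f}")
```

On real weight matrices the fit is rarely this clean, which is why the script reports R² alongside alpha.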
+# ---------------------------------------------------------------------------
+# Weight spectrum analysis
+# ---------------------------------------------------------------------------
+def analyze_weight_spectra(model: nn.Module, model_label: str = "model") -> dict:
+    """Compute SVD spectra for all 2D weight matrices."""
+    results = {}
+    for name, param in model.named_parameters():
+        if param.ndim != 2:
+            continue
+        sv = compute_singular_values(param)
+        if sv is None:
+            continue
+        m, n = param.shape
+        mp_bound = marchenko_pastur_bound(m, n, sv)
+        n_above_mp = int((sv > mp_bound).sum())
+        alpha, r_sq = fit_power_law(sv)
+        results[name] = {
+            "shape": (m, n),
+            "singular_values": sv,
+            "effective_rank": effective_rank(sv),
+            "stable_rank": stable_rank(sv),
+            "spectral_norm": float(sv[0]),
+            "frobenius_norm": float(np.sqrt((sv ** 2).sum())),
+            "mp_bound": mp_bound,
+            "n_above_mp": n_above_mp,
+            "n_total": len(sv),
+            "signal_ratio": n_above_mp / len(sv) if len(sv) > 0 else 0,
+            "alpha": alpha,
+            "alpha_r2": r_sq,
+            "condition_number": float(sv[0] / sv[-1]) if sv[-1] > 1e-10 else float("inf"),
+        }
+    return results
+# ---------------------------------------------------------------------------
+# Activation spectrum analysis
+# ---------------------------------------------------------------------------
+def collect_activations(model, input_ids: torch.Tensor,
+                        word_positions: torch.Tensor = None,
+                        model_type: str = "standard") -> dict[str, torch.Tensor]:
+    """Run a forward pass and collect intermediate activations via hooks."""
+    activations = {}
+    hooks = []
+    def make_hook(name):
+        def hook_fn(module, input, output):
+            if isinstance(output, tuple):
+                out = output[0]
+            else:
+                out = output
+            # Store the full activation tensor; the covariance is computed later
+            activations[name] = out.detach().float().cpu()
+        return hook_fn
+    # Register hooks based on model type
+    if model_type == "mirrored":
+        # Expand phase
+        for i, block in enumerate(model.mirror_blocks):
+            hooks.append(block.register_forward_hook(make_hook(f"expand_{i}")))
+        # Middle
+        for i, block in enumerate(model.middle_blocks):
+            hooks.append(block.register_forward_hook(make_hook(f"middle_{i}")))
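+        # Compress: mirror blocks are reused in reverse, so if the model's forward
+        # calls the same modules again, the hooks above fire twice and "expand_{i}"
+        # ends up holding the compress-phase output. collect_mirrored_activations()
+        # keeps the two phases separate.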
+    else:
+        for i, block in enumerate(model.layers):
+            hooks.append(block.register_forward_hook(make_hook(f"layer_{i}")))
+    # Also hook the embedding output
+    hooks.append(model.embed.register_forward_hook(make_hook("embedding")))
+    with torch.no_grad():
+        kwargs = {}
+        if word_positions is not None:
+            kwargs["word_positions"] = word_positions
+        model(input_ids, **kwargs)
+    for h in hooks:
+        h.remove()
+    return activations
+def collect_mirrored_activations(model, input_ids: torch.Tensor,
+                                  word_positions: torch.Tensor = None) -> dict[str, torch.Tensor]:
+    """Collect activations from a MirroredTransformer, separating expand and compress phases.
+    This manually runs the forward pass to capture compress-phase activations
+    from the reversed mirror blocks.
+    """
+    import math
+    activations = {}
+    with torch.no_grad():
+        # Embed
+        x = model.embed(input_ids)
+        if model.embed_proj is not None:
+            import torch.nn.functional as F
+            if model.embed_g3 is not None:
+                g4 = F.silu(model.embed_g4(x))
+                g3 = F.silu(model.embed_g3(x) * g4)
+                x = model.embed_proj(x) * g3
+            else:
+                x = F.silu(model.embed_proj(x))
+        x = x * model.embed_scale
+        activations["embedding"] = x.detach().float().cpu()
+        # Expand phase
+        for i, block in enumerate(model.mirror_blocks):
+            x, _ = block(x, word_positions=word_positions)
+            activations[f"expand_{i}"] = x.detach().float().cpu()
+        # Middle phase
+        for i, block in enumerate(model.middle_blocks):
+            x, _ = block(x, word_positions=word_positions)
+            activations[f"middle_{i}"] = x.detach().float().cpu()
+        # Compress phase (reversed)
+        for i in reversed(range(len(model.mirror_blocks))):
+            x, _ = model.mirror_blocks[i](x, word_positions=word_positions)
+            compress_idx = len(model.mirror_blocks) - 1 - i
+            activations[f"compress_{compress_idx}"] = x.detach().float().cpu()
+        # Final norm
+        x = model.norm(x)
+        activations["final_norm"] = x.detach().float().cpu()
+    return activations
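The compress-phase names in `collect_mirrored_activations` count in processing order, so `compress_0` is produced by the last mirror block, not by block 0. The sketch below only mimics the loop indices and does not touch a model. The helpers `mirror_schedule` and `weight_sharing_partner` are hypothetical names introduced for illustration.

```python
def mirror_schedule(n_blocks, n_middle=1):
    """Return (activation_name, mirror_block_index) pairs in processing order."""
    schedule = [(f"expand_{i}", i) for i in range(n_blocks)]
    # Middle layers are unique, so they have no mirror block index.
    schedule += [(f"middle_{i}", None) for i in range(n_middle)]
    for i in reversed(range(n_blocks)):
        schedule.append((f"compress_{n_blocks - 1 - i}", i))
    return schedule

def weight_sharing_partner(expand_idx, n_blocks):
    """Name of the compress activation produced by the same block as expand_idx."""
    return f"compress_{n_blocks - 1 - expand_idx}"

for name, block in mirror_schedule(3):
    print(f"{name:<12} block={block}")
print("partner of expand_0:", weight_sharing_partner(0, 3))
```

Any comparison of shared-weight pairs therefore needs to match `expand_i` with `compress_{N-1-i}` rather than with `compress_i`.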
+def activation_spectrum(act: torch.Tensor, max_components: int = 256) -> dict:
+    """Compute eigenspectrum of activation covariance.
+    act: [B, T, D] — reshape to [B*T, D], compute covariance, eigendecompose.
+    """
+    # Flatten batch and sequence
+    flat = act.reshape(-1, act.shape[-1])  # [N, D]
+    N, D = flat.shape
+    if N < 2:
+        return None
+    # Center
+    flat = flat - flat.mean(dim=0, keepdim=True)
+    # Covariance eigenvalues via randomized low-rank PCA of the data matrix
+    # (approximate; only the top n_components are kept)
+    n_components = min(max_components, D, N)
+    try:
+        U, S, Vh = torch.pca_lowrank(flat, q=n_components)
+        eigenvalues = (S ** 2 / (N - 1)).numpy()
+    except Exception:
+        # Fallback: full covariance
+        cov = (flat.T @ flat) / (N - 1)
+        eigenvalues = torch.linalg.eigvalsh(cov).flip(0).numpy()
+        eigenvalues = eigenvalues[:max_components]
+    eigenvalues = eigenvalues[eigenvalues > 1e-10]
+    return {
+        "eigenvalues": eigenvalues,
+        "effective_rank": effective_rank(np.sqrt(np.maximum(eigenvalues, 0))),
+        "total_variance": float(eigenvalues.sum()),
+        "top1_variance_ratio": float(eigenvalues[0] / eigenvalues.sum()) if len(eigenvalues) > 0 else 0,
+        "top10_variance_ratio": float(eigenvalues[:10].sum() / eigenvalues.sum()) if len(eigenvalues) >= 10 else 0,
+        "n_components": len(eigenvalues),
+    }
+# ---------------------------------------------------------------------------
+# Plotting
+# ---------------------------------------------------------------------------
+def plot_weight_spectra(results: dict, output_dir: Path, model_label: str = "model",
+                        results_b: dict = None, model_b_label: str = "model_b"):
+    """Plot singular value distributions for all weight matrices."""
+    # Group by layer/component type
+    groups = defaultdict(list)
+    for name, data in results.items():
+        # Identify the component type
+        if "attn" in name and ("q_proj" in name or "wq" in name):
+            groups["attention_Q"].append((name, data))
+        elif "attn" in name and ("k_proj" in name or "wk" in name):
+            groups["attention_K"].append((name, data))
+        elif "attn" in name and ("v_proj" in name or "wv" in name):
+            groups["attention_V"].append((name, data))
+        elif "attn" in name and ("o_proj" in name or "wo" in name):
+            groups["attention_O"].append((name, data))
+        elif "w1" in name or "up_proj" in name:
+            groups["ffn_W1"].append((name, data))
+        elif "w2" in name or "down_proj" in name:
+            groups["ffn_W2"].append((name, data))
+        elif "w3" in name or "gate_proj" in name:
+            groups["ffn_gate_W3"].append((name, data))
+        elif "w4" in name:
+            groups["ffn_gate_W4"].append((name, data))
+        elif "embed" in name or "wte" in name:
+            groups["embedding"].append((name, data))
+        elif "lm_head" in name:
+            groups["lm_head"].append((name, data))
+        else:
+            groups["other"].append((name, data))
+    # Plot each group
+    for group_name, items in groups.items():
+        if not items:
+            continue
+        fig, axes = plt.subplots(1, 2, figsize=(14, 5))
+        fig.suptitle(f"{model_label} — {group_name} weight spectra", fontsize=13)
+        ax_linear, ax_log = axes
+        cmap = plt.cm.viridis(np.linspace(0.1, 0.9, len(items)))
+        for idx, (name, data) in enumerate(items):
+            sv = data["singular_values"]
+            short_name = name.split(".")[-2] + "." + name.split(".")[-1] if "." in name else name
+            ax_linear.plot(sv, color=cmap[idx], alpha=0.7, linewidth=0.8, label=short_name)
+            ax_log.loglog(np.arange(1, len(sv) + 1), sv, color=cmap[idx], alpha=0.7,
+                         linewidth=0.8, label=short_name)
+            # MP bound
+            ax_linear.axhline(data["mp_bound"], color=cmap[idx], linestyle=":", alpha=0.3)
+        ax_linear.set_xlabel("Rank")
+        ax_linear.set_ylabel("Singular value")
+        ax_linear.set_title("Linear scale")
+        ax_linear.legend(fontsize=6, ncol=2)
+        ax_log.set_xlabel("Rank")
+        ax_log.set_ylabel("Singular value")
+        ax_log.set_title("Log-log scale (power law)")
+        ax_log.legend(fontsize=6, ncol=2)
+        plt.tight_layout()
+        fig.savefig(output_dir / f"weight_spectra_{group_name}.png", dpi=150)
+        plt.close(fig)
+def plot_effective_rank_progression(results: dict, output_dir: Path,
+                                    model_label: str = "model",
+                                    results_b: dict = None,
+                                    model_b_label: str = "model_b"):
+    """Plot effective rank per layer — the biconcave lens in eigenvalues."""
+    # Extract layer-ordered FFN W1 effective ranks (the main signal path)
+    layer_data = []
+    for name, data in sorted(results.items()):
+        if "w1" in name or "up_proj" in name:
+            # Extract layer index
+            parts = name.split(".")
+            layer_label = name
+            for p in parts:
+                if p.isdigit():
+                    layer_label = p
+                    break
+            layer_data.append((name, data["effective_rank"], data["stable_rank"],
+                             data["alpha"], data["signal_ratio"], layer_label))
+    if not layer_data:
+        return
+    fig, axes = plt.subplots(2, 2, figsize=(14, 10))
+    fig.suptitle(f"{model_label} — Layer-wise spectral properties (FFN W1)", fontsize=13)
+    names = [d[0] for d in layer_data]
+    x = range(len(layer_data))
+    short_labels = [d[5] for d in layer_data]
+    # Effective rank
+    axes[0, 0].bar(x, [d[1] for d in layer_data], color="steelblue", alpha=0.8)
+    axes[0, 0].set_ylabel("Effective rank")
+    axes[0, 0].set_title("Effective rank (entropy-based)")
+    axes[0, 0].set_xticks(x)
+    axes[0, 0].set_xticklabels(short_labels, rotation=45, fontsize=7)
+    # Stable rank
+    axes[0, 1].bar(x, [d[2] for d in layer_data], color="coral", alpha=0.8)
+    axes[0, 1].set_ylabel("Stable rank")
+    axes[0, 1].set_title("Stable rank (Frobenius/spectral)")
+    axes[0, 1].set_xticks(x)
+    axes[0, 1].set_xticklabels(short_labels, rotation=45, fontsize=7)
+    # Power-law alpha
+    axes[1, 0].bar(x, [d[3] for d in layer_data], color="mediumpurple", alpha=0.8)
+    axes[1, 0].set_ylabel("Alpha")
+    axes[1, 0].set_title("Power-law exponent (lower = heavier tail = more structure)")
+    axes[1, 0].axhline(2.0, color="red", linestyle="--", alpha=0.5, label="alpha=2 boundary")
+    axes[1, 0].legend(fontsize=8)
+    axes[1, 0].set_xticks(x)
+    axes[1, 0].set_xticklabels(short_labels, rotation=45, fontsize=7)
+    # Signal ratio (above MP)
+    axes[1, 1].bar(x, [d[4] for d in layer_data], color="seagreen", alpha=0.8)
+    axes[1, 1].set_ylabel("Signal ratio")
+    axes[1, 1].set_title("Fraction of singular values above MP bound")
+    axes[1, 1].set_xticks(x)
+    axes[1, 1].set_xticklabels(short_labels, rotation=45, fontsize=7)
+    plt.tight_layout()
+    fig.savefig(output_dir / "layer_progression.png", dpi=150)
+    plt.close(fig)
+def plot_activation_spectra(act_spectra: dict, output_dir: Path,
+                            model_label: str = "model"):
+    """Plot activation eigenspectra across layers."""
+    if not act_spectra:
+        return
+    # Sort layers in processing order
+    order_keys = {"embedding": -1, "final_norm": 999}
+    def sort_key(name):
+        if name in order_keys:
+            return order_keys[name]
+        parts = name.split("_")
+        phase = parts[0]
+        idx = int(parts[1]) if len(parts) > 1 and parts[1].isdigit() else 0
+        phase_offset = {"expand": 0, "middle": 100, "compress": 200, "layer": 0}
+        return phase_offset.get(phase, 300) + idx
+    sorted_names = sorted(act_spectra.keys(), key=sort_key)
+    # -- Eigenvalue distributions --
+    fig, axes = plt.subplots(1, 2, figsize=(14, 6))
+    fig.suptitle(f"{model_label} — Activation eigenspectra", fontsize=13)
+    cmap = plt.cm.coolwarm(np.linspace(0, 1, len(sorted_names)))
+    for idx, name in enumerate(sorted_names):
+        data = act_spectra[name]
+        ev = data["eigenvalues"]
+        axes[0].semilogy(ev / ev.sum(), color=cmap[idx], alpha=0.7, linewidth=1.0, label=name)
+        axes[1].plot(np.cumsum(ev) / ev.sum(), color=cmap[idx], alpha=0.7, linewidth=1.0, label=name)
+    axes[0].set_xlabel("Component")
+    axes[0].set_ylabel("Normalized eigenvalue (log)")
+    axes[0].set_title("Eigenvalue distribution")
+    axes[0].legend(fontsize=6, ncol=2)
+    axes[1].set_xlabel("Component")
+    axes[1].set_ylabel("Cumulative variance explained")
+    axes[1].set_title("Variance concentration")
+    axes[1].axhline(0.9, color="gray", linestyle="--", alpha=0.4, label="90%")
+    axes[1].legend(fontsize=6, ncol=2)
+    plt.tight_layout()
+    fig.savefig(output_dir / "activation_spectra.png", dpi=150)
+    plt.close(fig)
+    # -- Effective rank progression (the lens shape) --
+    fig, ax = plt.subplots(figsize=(12, 5))
+    fig.suptitle(f"{model_label} — Activation effective rank progression", fontsize=13)
+    eranks = [act_spectra[n]["effective_rank"] for n in sorted_names]
+    colors = []
+    for name in sorted_names:
+        if "expand" in name:
+            colors.append("steelblue")
+        elif "middle" in name:
+            colors.append("goldenrod")
+        elif "compress" in name:
+            colors.append("coral")
+        else:
+            colors.append("gray")
+    ax.bar(range(len(sorted_names)), eranks, color=colors, alpha=0.8)
+    ax.set_xticks(range(len(sorted_names)))
+    ax.set_xticklabels(sorted_names, rotation=45, ha="right", fontsize=8)
+    ax.set_ylabel("Effective rank")
+    ax.set_title("Expand (blue) → Middle (gold) → Compress (coral)")
+    plt.tight_layout()
+    fig.savefig(output_dir / "activation_rank_progression.png", dpi=150)
+    plt.close(fig)
+def plot_mirror_comparison(act_spectra: dict, output_dir: Path,
+                           model_label: str = "model"):
+    """Compare expand vs compress activation spectra for each mirror pair."""
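+    # Sort numerically: plain sorted() would put "expand_10" before "expand_2".
+    def layer_idx(name):
+        return int(name.rsplit("_", 1)[1])
+    expand_layers = sorted((n for n in act_spectra if n.startswith("expand_")), key=layer_idx)
+    # compress_k comes from mirror block N-1-k, so reverse to align shared-weight pairs.
+    compress_layers = sorted((n for n in act_spectra if n.startswith("compress_")),
+                             key=layer_idx, reverse=True)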
+    if not expand_layers or not compress_layers:
+        return
+    n_pairs = min(len(expand_layers), len(compress_layers))
+    fig, axes = plt.subplots(1, n_pairs, figsize=(4 * n_pairs, 4), squeeze=False)
+    fig.suptitle(f"{model_label} — Mirror pair activation spectra (expand vs compress)", fontsize=13)
+    for i in range(n_pairs):
+        ax = axes[0, i]
+        exp_ev = act_spectra[expand_layers[i]]["eigenvalues"]
+        comp_ev = act_spectra[compress_layers[i]]["eigenvalues"]
+        n_plot = min(len(exp_ev), len(comp_ev), 100)
+        ax.semilogy(exp_ev[:n_plot] / exp_ev.sum(), color="steelblue", alpha=0.8,
+                    linewidth=1.5, label="expand")
+        ax.semilogy(comp_ev[:n_plot] / comp_ev.sum(), color="coral", alpha=0.8,
+                    linewidth=1.5, label="compress")
+        exp_er = act_spectra[expand_layers[i]]["effective_rank"]
+        comp_er = act_spectra[compress_layers[i]]["effective_rank"]
+        ax.set_title(f"Pair {i}\nerank: {exp_er:.0f} / {comp_er:.0f}", fontsize=10)
+        ax.set_xlabel("Component")
+        if i == 0:
+            ax.set_ylabel("Normalized eigenvalue")
+        ax.legend(fontsize=8)
+    plt.tight_layout()
+    fig.savefig(output_dir / "mirror_pair_comparison.png", dpi=150)
+    plt.close(fig)
+def plot_gate_spectra(results: dict, output_dir: Path, model_label: str = "model"):
+    """Compare W3 vs W4 gate weight spectra (G2LU outer vs inner gate)."""
+    w3_items = [(n, d) for n, d in sorted(results.items()) if "w3" in n and "ffn" in n]
+    w4_items = [(n, d) for n, d in sorted(results.items()) if "w4" in n and "ffn" in n]
+    if not w3_items or not w4_items:
+        return
+    n_pairs = min(len(w3_items), len(w4_items))
+    fig, axes = plt.subplots(2, 1, figsize=(12, 8))
+    fig.suptitle(f"{model_label} — G2LU gate spectra (W3 outer vs W4 inner)", fontsize=13)
+    # Overlay all W3 vs W4
+    cmap_w3 = plt.cm.Blues(np.linspace(0.3, 0.9, n_pairs))
+    cmap_w4 = plt.cm.Reds(np.linspace(0.3, 0.9, n_pairs))
+    for i in range(n_pairs):
+        sv3 = w3_items[i][1]["singular_values"]
+        sv4 = w4_items[i][1]["singular_values"]
+        axes[0].semilogy(sv3, color=cmap_w3[i], alpha=0.6, linewidth=0.8, label=f"W3 pair {i}")
+        axes[0].semilogy(sv4, color=cmap_w4[i], alpha=0.6, linewidth=0.8, label=f"W4 pair {i}")
+    axes[0].set_xlabel("Rank")
+    axes[0].set_ylabel("Singular value (log)")
+    axes[0].set_title("Gate weight spectra")
+    axes[0].legend(fontsize=6, ncol=4)
+    # Effective rank comparison
+    er_w3 = [w3_items[i][1]["effective_rank"] for i in range(n_pairs)]
+    er_w4 = [w4_items[i][1]["effective_rank"] for i in range(n_pairs)]
+    x = np.arange(n_pairs)
+    axes[1].bar(x - 0.15, er_w3, 0.3, color="steelblue", alpha=0.8, label="W3 (outer gate)")
+    axes[1].bar(x + 0.15, er_w4, 0.3, color="coral", alpha=0.8, label="W4 (inner gate)")
+    axes[1].set_xlabel("Mirror pair")
+    axes[1].set_ylabel("Effective rank")
+    axes[1].set_title("Gate effective rank by pair")
+    axes[1].set_xticks(x)
+    axes[1].legend()
+    plt.tight_layout()
+    fig.savefig(output_dir / "gate_spectra.png", dpi=150)
+    plt.close(fig)
+def plot_embedding_alignment(results: dict, act_spectra: dict, output_dir: Path,
+                              model_label: str = "model"):
+    """Compare embedding weight spectrum with final layer activation spectrum."""
+    embed_data = None
+    for name, data in results.items():
+        if "embed" in name.lower() and "proj" not in name.lower() and "g3" not in name.lower() and "g4" not in name.lower():
+            embed_data = data
+            break
+    final_act = act_spectra.get("final_norm") or act_spectra.get("compress_0")
+    if embed_data is None or final_act is None:
+        return
+    fig, axes = plt.subplots(1, 2, figsize=(12, 5))
+    fig.suptitle(f"{model_label} — Embedding vs final activation spectra", fontsize=13)
+    # Normalized comparison
+    sv_embed = embed_data["singular_values"]
+    ev_final = final_act["eigenvalues"]
+    sv_embed_norm = sv_embed / sv_embed.sum()
+    ev_final_norm = ev_final / ev_final.sum()
+    n_plot = min(len(sv_embed_norm), len(ev_final_norm), 200)
+    axes[0].semilogy(sv_embed_norm[:n_plot], color="steelblue", linewidth=1.5,
+                     label=f"Embedding (erank={embed_data['effective_rank']:.0f})")
+    axes[0].semilogy(ev_final_norm[:n_plot], color="coral", linewidth=1.5,
+                     label=f"Final act (erank={final_act['effective_rank']:.0f})")
+    axes[0].set_xlabel("Component")
+    axes[0].set_ylabel("Normalized value (log)")
+    axes[0].set_title("Spectral shape comparison")
+    axes[0].legend()
+    # Cumulative variance
+    axes[1].plot(np.cumsum(sv_embed_norm[:n_plot]), color="steelblue", linewidth=1.5, label="Embedding")
+    axes[1].plot(np.cumsum(ev_final_norm[:n_plot]), color="coral", linewidth=1.5, label="Final activation")
+    axes[1].set_xlabel("Component")
+    axes[1].set_ylabel("Cumulative fraction")
+    axes[1].set_title("Variance concentration")
+    axes[1].axhline(0.9, color="gray", linestyle="--", alpha=0.4)
+    axes[1].legend()
+    plt.tight_layout()
+    fig.savefig(output_dir / "embedding_alignment.png", dpi=150)
+    plt.close(fig)
+def plot_comparison(results_a: dict, results_b: dict,
+                    label_a: str, label_b: str,
+                    output_dir: Path):
+    """Side-by-side comparison of two models' spectral properties."""
+    # Collect effective ranks for FFN W1 / up_proj
+    def extract_ffn_ranks(results):
+        ranks = []
+        for name, data in sorted(results.items()):
+            if ("w1" in name or "up_proj" in name or "c_fc" in name
+                    or "dense_h_to_4h" in name) and "embed" not in name:
+                ranks.append((name, data["effective_rank"], data["stable_rank"], data["alpha"]))
+        return ranks
+    ranks_a = extract_ffn_ranks(results_a)
+    ranks_b = extract_ffn_ranks(results_b)
+    if not ranks_a or not ranks_b:
+        return
+    fig, axes = plt.subplots(1, 3, figsize=(18, 5))
+    fig.suptitle(f"Comparison: {label_a} vs {label_b}", fontsize=13)
+    n = min(len(ranks_a), len(ranks_b))
+    x = np.arange(n)
+    for ax_idx, (metric_idx, ylabel, title) in enumerate([
+        (1, "Effective rank", "Effective rank per layer"),
+        (2, "Stable rank", "Stable rank per layer"),
+        (3, "Alpha", "Power-law alpha per layer"),
+    ]):
+        vals_a = [ranks_a[i][metric_idx] for i in range(n)]
+        vals_b = [ranks_b[i][metric_idx] for i in range(n)]
+        axes[ax_idx].bar(x - 0.15, vals_a, 0.3, color="steelblue", alpha=0.8, label=label_a)
+        axes[ax_idx].bar(x + 0.15, vals_b, 0.3, color="coral", alpha=0.8, label=label_b)
+        axes[ax_idx].set_xlabel("Layer")
+        axes[ax_idx].set_ylabel(ylabel)
+        axes[ax_idx].set_title(title)
+        axes[ax_idx].legend(fontsize=8)
+    plt.tight_layout()
+    fig.savefig(output_dir / "comparison.png", dpi=150)
+    plt.close(fig)
+# ---------------------------------------------------------------------------
+# Summary report
+# ---------------------------------------------------------------------------
+def print_summary(results: dict, model_label: str, act_spectra: dict = None):
+    """Print a concise text summary of spectral analysis."""
+    print(f"\n{'='*70}")
+    print(f"  Spectral Analysis: {model_label}")
+    print(f"{'='*70}")
+    # Group by component type
+    components = defaultdict(list)
+    for name, data in sorted(results.items()):
+        if "w1" in name or "up_proj" in name:
+            components["FFN W1 (up)"].append(data)
+        elif "w2" in name or "down_proj" in name:
+            components["FFN W2 (down)"].append(data)
+        elif "w3" in name:
+            components["FFN W3 (outer gate)"].append(data)
+        elif "w4" in name:
+            components["FFN W4 (inner gate)"].append(data)
+        elif "embed" in name.lower() and "proj" not in name and "g3" not in name and "g4" not in name:
+            components["Embedding"].append(data)
+    print(f"\n{'Component':<25} {'Shape':>12} {'eRank':>8} {'sRank':>8} {'Alpha':>8} {'Sig%':>8} {'Cond#':>10}")
+    print("-" * 85)
+    for comp_name, items in components.items():
+        for i, data in enumerate(items):
+            label = f"{comp_name}" if len(items) == 1 else f"{comp_name}[{i}]"
+            shape_str = f"{data['shape'][0]}x{data['shape'][1]}"
+            cond = f"{data['condition_number']:.0f}" if data['condition_number'] < 1e6 else "inf"
+            print(f"{label:<25} {shape_str:>12} {data['effective_rank']:>8.1f} "
+                  f"{data['stable_rank']:>8.1f} {data['alpha']:>8.3f} "
+                  f"{data['signal_ratio']*100:>7.1f}% {cond:>10}")
+    # Aggregate stats
+    all_alphas = [d["alpha"] for d in results.values() if d["alpha"] > 0]
+    all_eranks = [d["effective_rank"] for d in results.values()]
+    if all_alphas:
+        print(f"\n  Mean alpha: {np.mean(all_alphas):.3f}  (< 2.0 = heavy-tailed = well-structured)")
+        print(f"  Mean effective rank: {np.mean(all_eranks):.1f}")
+    # Activation summary
+    if act_spectra:
+        print(f"\n  Activation spectra:")
+        print(f"  {'Layer':<25} {'eRank':>8} {'Top1%':>8} {'Top10%':>8}")
+        print("  " + "-" * 55)
+        order_keys = {"embedding": -1, "final_norm": 999}
+        def sort_key(name):
+            if name in order_keys:
+                return order_keys[name]
+            parts = name.split("_")
+            phase = parts[0]
+            idx = int(parts[1]) if len(parts) > 1 and parts[1].isdigit() else 0
+            phase_offset = {"expand": 0, "middle": 100, "compress": 200, "layer": 0}
+            return phase_offset.get(phase, 300) + idx
+        for name in sorted(act_spectra.keys(), key=sort_key):
+            data = act_spectra[name]
+            print(f"  {name:<25} {data['effective_rank']:>8.1f} "
+                  f"{data['top1_variance_ratio']*100:>7.1f}% "
+                  f"{data['top10_variance_ratio']*100:>7.1f}%")
+def save_results_json(results: dict, act_spectra: dict, output_path: Path):
+    """Save numerical results (no numpy arrays) to JSON."""
+    out = {}
+    for name, data in results.items():
+        out[name] = {k: v for k, v in data.items() if k != "singular_values"}
+        out[name]["top_10_sv"] = data["singular_values"][:10].tolist()
+    if act_spectra:
+        out["_activations"] = {}
+        for name, data in act_spectra.items():
+            out["_activations"][name] = {k: v for k, v in data.items() if k != "eigenvalues"}
+            out["_activations"][name]["top_10_ev"] = data["eigenvalues"][:10].tolist()
+    with open(output_path, "w") as f:
+        json.dump(out, f, indent=2, default=str)
+# ---------------------------------------------------------------------------
+# Data loading (minimal — just enough tokens for activation analysis)
+# ---------------------------------------------------------------------------
+def load_sample_data(data_source: str, tokenizer_name: str, num_samples: int = 256,
+                     context_length: int = 512, device: str = "cpu"):
+    """Load a small batch of tokenized data for activation analysis."""
+    sys.path.insert(0, str(Path(__file__).resolve().parents[2]))
+    from circuits.data import get_tokenizer
+    tokenizer = get_tokenizer(tokenizer_name)
+    if data_source.startswith("hf:"):
+        from datasets import load_dataset
+        parts = data_source[3:].split(":")
+        ds_name = parts[0]
+        ds_config = parts[1] if len(parts) > 1 else None
+        ds_split = parts[2] if len(parts) > 2 else "train"
+        dataset = load_dataset(ds_name, ds_config, split=ds_split, streaming=True)
+        texts = []
+        for item in dataset:
+            text = item.get("text", "")
+            if text:  # skip records without text so they don't count toward num_samples
+                texts.append(text)
+            if len(texts) >= num_samples:
+                break
+    else:
+        with open(data_source, encoding="utf-8") as f:
+            texts = [line.strip() for line in f if line.strip()][:num_samples]
+    # Tokenize and create batches
+    all_ids = []
+    for text in texts:
+        ids = tokenizer.encode(text)
+        if len(ids) >= context_length:
+            all_ids.append(ids[:context_length])
+        elif len(ids) > 32:
+            all_ids.append(ids + [tokenizer.eos_token_id] * (context_length - len(ids)))
+    if not all_ids:
+        return None, tokenizer
+    input_ids = torch.tensor(all_ids[:num_samples], device=device)
+    return input_ids, tokenizer
+# ---------------------------------------------------------------------------
+# Main
+# ---------------------------------------------------------------------------
+def main():
+    parser = argparse.ArgumentParser(description="Spectral analysis of Prisma checkpoints")
+    parser.add_argument("--checkpoint", type=str, required=True, help="Path to Prisma/Circuit checkpoint")
+    parser.add_argument("--checkpoint-b", type=str, default=None, help="Second checkpoint for comparison")
+    parser.add_argument("--hf-model", type=str, default=None, help="HuggingFace model name for comparison")
+    parser.add_argument("--data", type=str, default=None,
+                        help="Data source for activation analysis (hf:dataset:config:split or path)")
+    parser.add_argument("--num-samples", type=int, default=256, help="Number of samples for activation analysis")
+    parser.add_argument("--context-length", type=int, default=512, help="Context length for activation analysis")
+    parser.add_argument("--output-dir", type=str, default=None, help="Output directory (default: auto)")
+    parser.add_argument("--gpu", type=int, default=0, help="GPU index")
+    parser.add_argument("--no-activations", action="store_true", help="Skip activation analysis even if data provided")
+    args = parser.parse_args()
+    device = f"cuda:{args.gpu}" if torch.cuda.is_available() else "cpu"
+    print(f"Device: {device}")
+    # Output directory
+    if args.output_dir:
+        output_dir = Path(args.output_dir)
+    else:
+        ckpt_name = Path(args.checkpoint).parent.name
+        output_dir = Path("circuits/scripts/spectral_output") / ckpt_name
+    output_dir.mkdir(parents=True, exist_ok=True)
+    print(f"Output: {output_dir}")
+    # ── Load model A ──
+    print(f"\nLoading: {args.checkpoint}")
+    model_a, config_a, model_type_a = load_prisma_model(args.checkpoint, device)
+    label_a = Path(args.checkpoint).parent.name
+    print(f"  Type: {model_type_a}")
+    n_params = sum(p.numel() for p in model_a.parameters())
+    print(f"  Parameters: {n_params:,}")
+    # ── Weight spectra (A) ──
+    print("\nAnalyzing weight spectra...")
+    weight_results_a = analyze_weight_spectra(model_a, label_a)
+    print(f"  Analyzed {len(weight_results_a)} weight matrices")
+    # ── Activation spectra (A) ──
+    act_spectra_a = None
+    if args.data and not args.no_activations:
+        tokenizer_name = torch.load(args.checkpoint, map_location="cpu",
+                                     weights_only=False).get("tokenizer_name", "gpt2")
+        print(f"\nLoading data for activation analysis ({args.num_samples} samples)...")
+        input_ids, tokenizer = load_sample_data(
+            args.data, tokenizer_name, args.num_samples, args.context_length, device
+        )
+        if input_ids is not None:
+            print(f"  Data shape: {input_ids.shape}")
+            # Compute word positions if needed
+            word_positions = None
+            word_rope_dims = config_a.get("word_rope_dims", 0)
+            if word_rope_dims > 0:
+                from circuits.layers import build_word_start_table, compute_word_positions
+                word_start_table = build_word_start_table(tokenizer, len(tokenizer)).to(device)
+                word_positions = compute_word_positions(input_ids, word_start_table)
+            print("  Collecting activations...")
+            if model_type_a == "mirrored":
+                raw_acts = collect_mirrored_activations(model_a, input_ids, word_positions)
+            else:
+                raw_acts = collect_activations(model_a, input_ids, word_positions, model_type_a)
+            print(f"  Computing activation spectra ({len(raw_acts)} layers)...")
+            act_spectra_a = {}
+            for name, act in raw_acts.items():
+                spec = activation_spectrum(act)
+                if spec is not None:
+                    act_spectra_a[name] = spec
+    # ── Model B (optional comparison) ──
+    weight_results_b = None
+    label_b = None
+    if args.checkpoint_b:
+        print(f"\nLoading comparison: {args.checkpoint_b}")
+        model_b, config_b, model_type_b = load_prisma_model(args.checkpoint_b, device)
+        label_b = Path(args.checkpoint_b).parent.name
+        weight_results_b = analyze_weight_spectra(model_b, label_b)
+        del model_b
+    elif args.hf_model:
+        print(f"\nLoading HF model: {args.hf_model}")
+        model_b = load_hf_model(args.hf_model, device)
+        label_b = args.hf_model
+        weight_results_b = analyze_weight_spectra(model_b, label_b)
+        del model_b
+    if device.startswith("cuda"):
+        torch.cuda.empty_cache()
+    # ── Plots ──
+    print("\nGenerating plots...")
+    plot_weight_spectra(weight_results_a, output_dir, label_a)
+    plot_effective_rank_progression(weight_results_a, output_dir, label_a)
+    plot_gate_spectra(weight_results_a, output_dir, label_a)
+    if act_spectra_a:
+        plot_activation_spectra(act_spectra_a, output_dir, label_a)
+        plot_mirror_comparison(act_spectra_a, output_dir, label_a)
+        plot_embedding_alignment(weight_results_a, act_spectra_a, output_dir, label_a)
+    if weight_results_b and label_b:
+        plot_comparison(weight_results_a, weight_results_b, label_a, label_b, output_dir)
+        # Also print summary for B
+        print_summary(weight_results_b, label_b)
+    # ── Summary ──
+    print_summary(weight_results_a, label_a, act_spectra_a)
+    # ── Save ──
+    save_results_json(weight_results_a, act_spectra_a, output_dir / "results.json")
+    if weight_results_b:
+        save_results_json(weight_results_b, None, output_dir / "results_b.json")
+    print(f"\nAll outputs saved to: {output_dir}")
+    print(f"  Plots: {len(list(output_dir.glob('*.png')))} PNG files")
+    print(f"  Data:  results.json")
+if __name__ == "__main__":
+    main()
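
The `--data` option above takes either a file path or an `hf:dataset:config:split` string. As a rough illustration of how `load_sample_data` splits that string, here is a minimal sketch. `parse_data_source` is a hypothetical helper name that does not exist in the repository, the `org/dataset:some-config:validation` spec is a placeholder, and the sketch only mirrors the string handling, not the dataset loading.

```python
def parse_data_source(spec: str) -> dict:
    """Hypothetical helper: mirrors the string handling in load_sample_data."""
    if not spec.startswith("hf:"):
        # Anything without the "hf:" prefix is treated as a local text file.
        return {"kind": "file", "path": spec}
    parts = spec[3:].split(":")
    return {
        "kind": "hf",
        "name": parts[0],
        # A missing config stays None; a missing split falls back to "train".
        "config": parts[1] if len(parts) > 1 else None,
        "split": parts[2] if len(parts) > 2 else "train",
    }


print(parse_data_source("hf:roneneldan/TinyStories"))
print(parse_data_source("hf:org/dataset:some-config:validation"))  # placeholder spec
print(parse_data_source("path/to/corpus.txt"))
```

Because the split is positional, a spec that needs a non-default split must also spell out the config field.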

scripts/spectral_to_csv.py ADDED

	@@ -0,0 +1,202 @@

+"""Convert spectral analysis JSON results to CSV tables for analysis."""
+import json
+import csv
+import sys
+import os
+import re
+from pathlib import Path
+def classify_layer(name, model_type):
+    """Classify a weight matrix by layer index, component type, and phase."""
+    if model_type == "prisma":
+        # mirror_blocks.N.component
+        m = re.match(r'mirror_blocks\.(\d+)\.', name)
+        if m:
+            layer_idx = int(m.group(1))
+            phase = "mirror"
+            if 'attn' in name:
+                comp = 'Q' if 'q_proj' in name else 'K' if 'k_proj' in name else 'V' if 'v_proj' in name else 'O' if 'o_proj' in name else 'attn'
+            elif 'ffn.w3' in name or 'gate_expand' in name:
+                comp = 'W3'
+            elif 'ffn.w4' in name or 'gate_compress' in name:
+                comp = 'W4'
+            elif 'ffn.w1' in name:
+                comp = 'W1'
+            elif 'w2' in name:
+                comp = 'W2'
+            else:
+                comp = 'other'
+            return layer_idx, comp, phase
+        m = re.match(r'middle_blocks\.(\d+)\.', name)
+        if m:
+            layer_idx = int(m.group(1))
+            phase = "middle"
+            if 'attn' in name:
+                comp = 'Q' if 'q_proj' in name else 'K' if 'k_proj' in name else 'V' if 'v_proj' in name else 'O' if 'o_proj' in name else 'attn'
+            elif 'gate' in name:
+                comp = 'W3'
+            elif 'ffn.w1' in name:
+                comp = 'W1'
+            elif 'ffn.w2' in name:
+                comp = 'W2'
+            else:
+                comp = 'other'
+            return layer_idx, comp, phase
+        m = re.match(r'(first|last)_block\.', name)
+        if m:
+            phase = m.group(1)
+            if 'attn' in name:
+                comp = 'Q' if 'q_proj' in name else 'K' if 'k_proj' in name else 'V' if 'v_proj' in name else 'O' if 'o_proj' in name else 'attn'
+            elif 'ffn.w3' in name or 'gate' in name:
+                comp = 'W3'
+            elif 'ffn.w4' in name:
+                comp = 'W4'
+            elif 'ffn.w1' in name:
+                comp = 'W1'
+            elif 'ffn.w2' in name:
+                comp = 'W2'
+            else:
+                comp = 'other'
+            return 0, comp, phase
+        if 'embed' in name:
+            return -1, 'embed', 'embed'
+        if 'head' in name:  # also matches 'lm_head'
+            return 99, 'head', 'head'
+        return -1, 'other', 'other'
+    else:  # GPT-2 style
+        m = re.match(r'transformer\.h\.(\d+)\.', name)
+        if m:
+            layer_idx = int(m.group(1))
+            if 'c_attn' in name:
+                comp = 'QKV'
+            elif 'c_proj' in name and 'mlp' not in name:
+                comp = 'O'
+            elif 'c_fc' in name:
+                comp = 'W1'
+            elif 'mlp.c_proj' in name:
+                comp = 'W2'
+            else:
+                comp = 'other'
+            return layer_idx, comp, "layer"
+        if 'wte' in name:
+            return -1, 'embed', 'embed'
+        if 'wpe' in name:
+            return -1, 'pos_embed', 'embed'
+        return -1, 'other', 'other'
+def json_to_csvs(json_path, output_dir, model_type="prisma"):
+    with open(json_path) as f:
+        data = json.load(f)
+    os.makedirs(output_dir, exist_ok=True)
+    # 1. Full weight matrix summary
+    rows = []
+    for name, info in data.items():
+        if 'activation' in name or name.startswith('_'):
+            continue
+        layer_idx, comp, phase = classify_layer(name, model_type)
+        rows.append({
+            'name': name,
+            'layer_idx': layer_idx,
+            'component': comp,
+            'phase': phase,
+            'shape': 'x'.join(str(s) for s in info['shape']),
+            'effective_rank': round(info['effective_rank'], 2),
+            'stable_rank': round(info['stable_rank'], 3),
+            'spectral_norm': round(info['spectral_norm'], 4),
+            'frobenius_norm': round(info['frobenius_norm'], 4),
+            'alpha': round(info['alpha'], 4),
+            'alpha_r2': round(info['alpha_r2'], 4),
+            'signal_ratio': round(info['signal_ratio'], 4),
+            'condition_number': round(info['condition_number'], 2),
+            'mp_bound': round(info['mp_bound'], 4),
+            'n_above_mp': info['n_above_mp'],
+            'n_total': info['n_total'],
+            'sv_1': round(info['top_10_sv'][0], 4) if info['top_10_sv'] else 0,
+            'sv_2': round(info['top_10_sv'][1], 4) if len(info['top_10_sv']) > 1 else 0,
+            'sv_10': round(info['top_10_sv'][9], 4) if len(info['top_10_sv']) > 9 else 0,
+            'sv1_sv2_ratio': round(info['top_10_sv'][0] / info['top_10_sv'][1], 4) if len(info['top_10_sv']) > 1 and info['top_10_sv'][1] > 0 else 0,
+        })
+    if not rows:
+        raise ValueError(f"No weight-matrix entries found in {json_path}")
+    with open(os.path.join(output_dir, 'weights_full.csv'), 'w', newline='') as f:
+        w = csv.DictWriter(f, fieldnames=rows[0].keys())
+        w.writeheader()
+        w.writerows(sorted(rows, key=lambda r: (r['phase'], r['layer_idx'], r['component'])))
+    # 2. Layer-level FFN summary (W1 progression = the lens)
+    ffn_rows = [r for r in rows if r['component'] == 'W1']
+    with open(os.path.join(output_dir, 'ffn_w1_progression.csv'), 'w', newline='') as f:
+        w = csv.DictWriter(f, fieldnames=['layer_idx', 'phase', 'effective_rank', 'stable_rank', 'alpha', 'alpha_r2', 'signal_ratio', 'condition_number', 'sv1_sv2_ratio'])
+        w.writeheader()
+        for r in sorted(ffn_rows, key=lambda r: (r['phase'], r['layer_idx'])):
+            w.writerow({k: r[k] for k in w.fieldnames})
+    # 3. Gate comparison (W3 vs W4)
+    gate_rows = [r for r in rows if r['component'] in ('W3', 'W4') and r['phase'] == 'mirror']
+    with open(os.path.join(output_dir, 'gate_comparison.csv'), 'w', newline='') as f:
+        w = csv.DictWriter(f, fieldnames=['layer_idx', 'component', 'effective_rank', 'stable_rank', 'alpha', 'alpha_r2', 'signal_ratio', 'sv1_sv2_ratio'])
+        w.writeheader()
+        for r in sorted(gate_rows, key=lambda r: (r['layer_idx'], r['component'])):
+            w.writerow({k: r[k] for k in w.fieldnames})
+    # 4. Attention head comparison (Q, K, V, O per layer)
+    attn_rows = [r for r in rows if r['component'] in ('Q', 'K', 'V', 'O', 'QKV')]
+    with open(os.path.join(output_dir, 'attention_progression.csv'), 'w', newline='') as f:
+        w = csv.DictWriter(f, fieldnames=['layer_idx', 'phase', 'component', 'effective_rank', 'stable_rank', 'alpha', 'signal_ratio', 'condition_number'])
+        w.writeheader()
+        for r in sorted(attn_rows, key=lambda r: (r['phase'], r['layer_idx'], r['component'])):
+            w.writerow({k: r[k] for k in w.fieldnames})
+    # 5. Summary statistics
+    alphas = [r['alpha'] for r in rows if r['alpha'] > 0]
+    eff_ranks = [r['effective_rank'] for r in rows if r['layer_idx'] >= 0]
+    signal_ratios = [r['signal_ratio'] for r in rows if r['layer_idx'] >= 0]
+    summary = {
+        'n_matrices': len(rows),
+        'mean_alpha': round(sum(alphas) / len(alphas), 4) if alphas else 0,
+        'min_alpha': round(min(alphas), 4) if alphas else 0,
+        'max_alpha': round(max(alphas), 4) if alphas else 0,
+        'mean_effective_rank': round(sum(eff_ranks) / len(eff_ranks), 2) if eff_ranks else 0,
+        'mean_signal_ratio': round(sum(signal_ratios) / len(signal_ratios), 4) if signal_ratios else 0,
+        'n_well_trained (alpha<2)': sum(1 for a in alphas if a < 2.0),
+        'n_total_alpha': len(alphas),
+    }
+    with open(os.path.join(output_dir, 'summary.csv'), 'w', newline='') as f:
+        w = csv.DictWriter(f, fieldnames=summary.keys())
+        w.writeheader()
+        w.writerow(summary)
+    print(f"Wrote CSVs to {output_dir}/")
+    print(f"  weights_full.csv        ({len(rows)} matrices)")
+    print(f"  ffn_w1_progression.csv  ({len(ffn_rows)} layers)")
+    print(f"  gate_comparison.csv     ({len(gate_rows)} entries)")
+    print(f"  attention_progression.csv ({len(attn_rows)} entries)")
+    print(f"  summary.csv")
+if __name__ == '__main__':
+    base = "circuits/scripts/spectral_output/mirrored_300M_mk4_cont"
+    # Prisma
+    json_to_csvs(
+        f"{base}/results.json",
+        f"{base}/csv_prisma",
+        model_type="prisma"
+    )
+    # GPT-2 medium
+    if os.path.exists(f"{base}/results_b.json"):
+        json_to_csvs(
+            f"{base}/results_b.json",
+            f"{base}/csv_gpt2",
+            model_type="gpt2"
+        )
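
To make the name-based classification in `classify_layer` easier to follow, here is a reduced sketch covering only the FFN weights of mirror blocks. `classify_mirror_ffn` is a hypothetical name, attention projections are deliberately left out, and parameter names such as `mirror_blocks.3.ffn.w3.weight` are assumed for illustration rather than taken from a real checkpoint.

```python
import re

# Substrings checked in the same order as the mirror-block branch of classify_layer.
FFN_COMPONENTS = (
    ("W3", ("ffn.w3", "gate_expand")),
    ("W4", ("ffn.w4", "gate_compress")),
    ("W1", ("ffn.w1",)),
    ("W2", ("w2",)),
)


def classify_mirror_ffn(name: str):
    """Return (layer_idx, component) for a mirror-block FFN weight, else None."""
    m = re.match(r"mirror_blocks\.(\d+)\.", name)
    if not m:
        return None
    layer_idx = int(m.group(1))
    for comp, needles in FFN_COMPONENTS:
        if any(needle in name for needle in needles):
            return layer_idx, comp
    return layer_idx, "other"


for example in (
    "mirror_blocks.3.ffn.w3.weight",
    "mirror_blocks.0.ffn.gate_compress.weight",
    "middle_blocks.0.ffn.w1.weight",
):
    print(example, "->", classify_mirror_ffn(example))
```

The check order matters: both the legacy gate names (`gate_expand`, `gate_compress`) and the current ones (`w3`, `w4`) map to the same component labels, so older and newer checkpoints land in the same CSV rows.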

train.py ADDED

	@@ -0,0 +1,637 @@

+#!/usr/bin/env python3
+"""
+Training script for Circuit Transformer.
+Usage:
+    python -m circuits.train --data hf:roneneldan/TinyStories --preset tiny --epochs 1 --gpu 0
+    python -m circuits.train --data path/to/corpus.txt --dims 256 --layers 6 --fp16
+"""
+import gc
+import os
+import time
+import math
+import random
+from pathlib import Path
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+from torch.cuda.amp import GradScaler
+from torch.amp import autocast
+from .config import CircuitConfig, parse_args
+from .model import CircuitTransformer, count_parameters
+from .mirrored import MirroredConfig, MirroredTransformer, count_mirrored_parameters
+from .graft_g2lu import G2LU_GraftedModel, save_g2lu_checkpoint
+from .layers import build_word_start_table, compute_word_positions
+from .data import get_tokenizer, load_data, create_dataloader
+def corrupt_tokens(input_ids, ratio, vocab_size):
+    """Replace random tokens with random vocab tokens for denoising autoencoder.
+    Returns (corrupted_ids, mask) where mask is True at corrupted positions.
+    """
+    mask = torch.rand(input_ids.shape, device=input_ids.device) < ratio
+    mask[:, 0] = False  # never corrupt first token (BOS/start)
+    random_tokens = torch.randint(0, vocab_size, input_ids.shape, device=input_ids.device)
+    corrupted = input_ids.clone()
+    corrupted[mask] = random_tokens[mask]
+    return corrupted, mask
+@torch.no_grad()
+def evaluate(config, model, dataloader, device, use_amp=False, amp_dtype=torch.float16, mid_run_eval=False,
+             word_start_table=None):
+    """Run validation and return avg loss + perplexity."""
+    model.eval()
+    total_loss = 0.0
+    n_batches = 0
+    for batch in dataloader:
+        input_ids = batch["input_ids"].to(device)
+        labels = batch["labels"].to(device)
+        word_positions = None
+        if word_start_table is not None:
+            word_positions = compute_word_positions(input_ids, word_start_table)
+        if use_amp:
+            with autocast('cuda', dtype=amp_dtype):
+                output = model(input_ids, labels=labels, word_positions=word_positions)
+        else:
+            output = model(input_ids, labels=labels, word_positions=word_positions)
+        total_loss += output["loss"].item()
+        n_batches += 1
+        if n_batches % (config.log_every * 10) == 0:
+            avg_loss = total_loss / max(n_batches, 1)
+            ppl = math.exp(min(avg_loss, 20))  # cap to avoid overflow
+            print(
+                f"batch {n_batches:6d}/{len(dataloader):6d} | "
+                f"Loss {avg_loss:.4f} | "
+                f"PPL {ppl:8.2f}"
+            )
+        if mid_run_eval and n_batches >= 1500:
+            break
+    if not mid_run_eval:
+        model.train()
+    avg_loss = total_loss / max(n_batches, 1)
+    ppl = math.exp(min(avg_loss, 20))  # cap to avoid overflow
+    return avg_loss, ppl
+def get_lr(step: int, warmup_steps: int, max_steps: int, max_lr: float, min_lr: float = 0.0, delay: int = 0) -> float:
+    """Cosine learning rate schedule with warmup and optional delay.
+    With delay > 0, the schedule is shifted:
+      Steps 0..delay:                    LR = 0 (frozen)
+      Steps delay..delay+warmup:         linear ramp 0 → max_lr
+      Steps delay+warmup..max_steps:     cosine decay max_lr → min_lr
+    """
+    if step < delay:
+        return 0.0
+    effective_step = step - delay
+    effective_max = max(1, max_steps - delay)
+    if effective_step < warmup_steps:
+        return max_lr * effective_step / warmup_steps
+    if effective_step >= effective_max:
+        return min_lr
+    progress = (effective_step - warmup_steps) / (effective_max - warmup_steps)
+    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))
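
A worked example may help when choosing `delay` and `warmup_steps`. The sketch below repeats the logic of `get_lr` so that it runs on its own, then evaluates it at a few steps. The schedule values (100 delay steps, 100 warmup steps, 1100 total steps, 1e-3 peak, 1e-4 floor) are made up for illustration and are not taken from any training run in this commit.

```python
import math


def get_lr(step, warmup_steps, max_steps, max_lr, min_lr=0.0, delay=0):
    # Same logic as the function above, repeated so the example is self-contained.
    if step < delay:
        return 0.0
    effective_step = step - delay
    effective_max = max(1, max_steps - delay)
    if effective_step < warmup_steps:
        return max_lr * effective_step / warmup_steps
    if effective_step >= effective_max:
        return min_lr
    progress = (effective_step - warmup_steps) / (effective_max - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))


# Illustrative schedule (assumed values, not from the repository).
schedule = dict(warmup_steps=100, max_steps=1100, max_lr=1e-3, min_lr=1e-4, delay=100)

# Frozen phase, mid-warmup, peak, cosine midpoint, end of schedule.
for step in (50, 150, 200, 650, 1100):
    print(f"step {step:5d}: lr = {get_lr(step, **schedule):.6f}")
```

Note that the cosine phase spans `max_steps - delay - warmup_steps` steps, so a longer delay shortens the decay rather than extending the run.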
+def save_checkpoint(
+    model: nn.Module,
+    optimizer: torch.optim.Optimizer,
+    step: int,
+    epoch: int,
+    loss: float,
+    config,
+    path: str,
+    model_type: str = "standard",
+    epoch_step: int = 0,
+    best_val_loss: float | None = None,
+    scaler=None,
+    tokenizer_name: str = "gpt2",
+):
+    """Save training checkpoint.
+    Args:
+        epoch: Next epoch to start on resume (completed epoch count).
+        epoch_step: Batches already processed in `epoch` (0 if epoch is complete).
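+        best_val_loss: Best validation loss so far; stored only if not None.
+        scaler: GradScaler whose state is saved for FP16 resume (optional).
+        tokenizer_name: Tokenizer identifier stored alongside the weights.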
+    """
+    checkpoint = {
+        "model": model.state_dict(),
+        "optimizer": optimizer.state_dict(),
+        "step": step,
+        "epoch": epoch,
+        "epoch_step": epoch_step,
+        "loss": loss,
+        "config": config.to_dict(),
+        "model_type": model_type,
+        "tokenizer_name": tokenizer_name,
+    }
+    if best_val_loss is not None:
+        checkpoint["best_val_loss"] = best_val_loss
+    if scaler is not None:
+        checkpoint["scaler"] = scaler.state_dict()
+    torch.save(checkpoint, path)
+def _migrate_state_dict(state_dict: dict, model: nn.Module) -> dict:
+    """Migrate checkpoint state_dict to match current model architecture.
+    Handles upgrades like SwiGLU → MirroredSwiGLU (dual_gate_middle) by renaming legacy gate keys (gate_expand → w3, gate_compress → w4).
+    """
+    model_keys = set(model.state_dict().keys())
+    ckpt_keys = set(state_dict.keys())
+    missing = model_keys - ckpt_keys
+    unexpected = ckpt_keys - model_keys
+    if not missing and not unexpected:
+        return state_dict  # perfect match, no migration needed
+    migrated = dict(state_dict)
+    migrations = []
+    # Legacy gate names → current names: gate_expand → w3, gate_compress → w4
+    for key in list(unexpected):
+        if ".ffn.gate_expand.weight" in key:
+            new_key = key.replace(".ffn.gate_expand.weight", ".ffn.w3.weight")
+            if new_key in missing:
+                migrated[new_key] = migrated.pop(key)
+                missing.discard(new_key)
+                unexpected.discard(key)
+                migrations.append(f"  {key} → {new_key}")
+        if ".ffn.gate_compress.weight" in key:
+            new_key = key.replace(".ffn.gate_compress.weight", ".ffn.w4.weight")
+            if new_key in missing:
+                migrated[new_key] = migrated.pop(key)
+                missing.discard(new_key)
+                unexpected.discard(key)
+                migrations.append(f"  {key} → {new_key}")
+    if migrations:
+        print(f"State dict migration ({len(migrations)} keys renamed):")
+        for m in migrations:
+            print(m)
+        # Report remaining missing keys (freshly initialized)
+        still_missing = model_keys - set(migrated.keys())
+        if still_missing:
+            print(f"  New parameters (freshly initialized): {len(still_missing)}")
+            for k in sorted(still_missing):
+                print(f"    {k}")
+    return migrated
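
The rename step can be tried without PyTorch by treating the state dict as a plain mapping. The following is a minimal sketch under that assumption: `rename_legacy_gate_keys` is a hypothetical name, the string values stand in for tensors, and the `blocks.0...` key names are invented for the example.

```python
# Legacy suffix -> current suffix, as in _migrate_state_dict.
RENAMES = {
    ".ffn.gate_expand.weight": ".ffn.w3.weight",
    ".ffn.gate_compress.weight": ".ffn.w4.weight",
}


def rename_legacy_gate_keys(state_dict: dict, model_keys: set) -> dict:
    """Return a copy of state_dict with legacy gate keys renamed where the model expects them."""
    migrated = dict(state_dict)
    for key in list(state_dict):
        for old, new in RENAMES.items():
            if old not in key:
                continue
            new_key = key.replace(old, new)
            # Rename only if the model wants new_key and the checkpoint lacks it.
            if new_key in model_keys and new_key not in state_dict:
                migrated[new_key] = migrated.pop(key)
    return migrated


checkpoint = {
    "blocks.0.ffn.gate_expand.weight": "A",
    "blocks.0.ffn.gate_compress.weight": "B",
    "blocks.0.ffn.w1.weight": "C",
}
expected_by_model = {
    "blocks.0.ffn.w3.weight",
    "blocks.0.ffn.w4.weight",
    "blocks.0.ffn.w1.weight",
}
print(rename_legacy_gate_keys(checkpoint, expected_by_model))
```

Keys whose renamed form the model does not expect are left as they are, which matches the original function's behavior of only filling entries from the `missing` set.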
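+def load_checkpoint(path: str, model: nn.Module, optimizer: torch.optim.Optimizer | None = None,
+                    scaler=None, reset: bool = False):
+    """Load training checkpoint. Returns dict with resume info.
+    With reset=True only the model weights are restored; optimizer and scaler state are left untouched.
+    """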
+    checkpoint = torch.load(path, map_location="cpu", weights_only=False)
+    state_dict = _migrate_state_dict(checkpoint["model"], model)
+    model.load_state_dict(state_dict, strict=False)
+    if not reset:
+        if optimizer is not None and "optimizer" in checkpoint:
+            optimizer.load_state_dict(checkpoint["optimizer"])
+        if scaler is not None and "scaler" in checkpoint:
+            scaler.load_state_dict(checkpoint["scaler"])
+    return {
+        "step": checkpoint.get("step", 0),
+        "epoch": checkpoint.get("epoch", 0),
+        "epoch_step": checkpoint.get("epoch_step", 0),
+        "best_val_loss": checkpoint.get("best_val_loss", float("inf")),
+    }
+def train():
+    config, args = parse_args()
+    # Setup device
+    device = torch.device(f"cuda:{config.gpu}" if torch.cuda.is_available() else "cpu")
+    print(f"Device: {device}")
+    # Load tokenizer and data
+    print(f"Loading data from: {args.data}")
+    model_type = args.arch
+    tokenizer_name = getattr(args, 'tokenizer', 'gpt2')
+    if model_type == "graft_g2lu":
+        tokenizer_name = args.pretrained
+    tokenizer = get_tokenizer(tokenizer_name)
+    config.vocab_size = len(tokenizer)
+    print(f"Tokenizer: {tokenizer_name} (vocab_size={config.vocab_size})")
+    cache_dir = None if args.no_cache else args.cache_dir
+    dataset = load_data(
+        args.data,
+        tokenizer,
+        config.max_seq_len,
+        text_column=args.text_column,
+        num_samples=args.num_samples,
+        cache_dir=cache_dir,
+        data_format=args.data_format,
+    )
+    print(f"Loaded {len(dataset):,} chunks")
+    # Train/val split
+    val_split = args.val_split
+    if val_split > 0 and len(dataset) > 20:
+        train_dataset, val_dataset = dataset.split(val_split)
+        print(f"Split: {len(train_dataset):,} train / {len(val_dataset):,} val ({val_split:.0%})")
+    else:
+        train_dataset = dataset
+        val_dataset = None
+    # Create dataloaders
+    dataloader = create_dataloader(
+        train_dataset,
+        config.batch_size,
+        shuffle=True,
+    )
+    val_dataloader = None
+    if val_dataset is not None:
+        val_dataloader = create_dataloader(
+            val_dataset,
+            config.batch_size,
+            shuffle=False,
+        )
+    # Create model
+    if model_type == "mirrored":
+        model_config = MirroredConfig(
+            vocab_size=config.vocab_size,
+            hidden_size=config.hidden_size,
+            num_heads=config.num_heads,
+            num_kv_heads=config.num_kv_heads,
+            num_layers=config.num_layers,
+            n_middle=args.n_middle,
+            max_seq_len=config.max_seq_len,
+            dropout=config.dropout,
+            use_g2lu=not getattr(args, 'no_g2lu', False),
+            aux_skip_k=getattr(args, 'aux_skip', 0),
+            aux_skip_weight=getattr(args, 'aux_weight', 0.1),
+            word_rope_dims=getattr(config, 'word_rope_dims', 0),
+            word_rope_base=getattr(config, 'word_rope_base', 10.0),
+            embed_dim=getattr(config, 'embed_dim', 0),
+            head_dim=getattr(config, 'head_dim', 0),
+        )
+        model = MirroredTransformer(model_config).to(device)
+        param_info = count_mirrored_parameters(model)
+        num_params = param_info["unique"]
+        print(f"Model: MirroredTransformer")
+        print(f"  Virtual layers: {model.total_virtual_layers} ({model_config.n_mirror} mirror pairs + {model_config.n_middle} middle)")
+        print(f"  Parameters: {num_params:,} ({num_params/1e6:.1f}M unique)")
+        print(f"  Shared FFN base: {param_info['shared_ffn_base']:,}")
+        print(f"  Direction gates: {param_info['direction_gates']:,}")
+        print(f"  FFN gating: {'G²LU (nested dual gate)' if model_config.use_g2lu else 'SwiGLU (vanilla)'}")
+        if model_config.num_kv_heads is not None:
+            print(f"  GQA: {model_config.num_heads}Q / {model_config.num_kv_heads}KV ({model_config.num_heads // model_config.num_kv_heads}:1 ratio)")
+        if model_config.aux_skip_k > 0:
+            print(f"  Aux skip prediction: t+{model_config.aux_skip_k} (weight={model_config.aux_skip_weight})")
+        if getattr(model_config, 'embed_dim', 0) > 0:
+            std_embed = config.vocab_size * config.hidden_size
+            fact_embed = config.vocab_size * model_config.embed_dim + model_config.embed_dim * config.hidden_size
+            print(f"  Factorized embedding: {model_config.embed_dim} → {config.hidden_size} (saves {(std_embed - fact_embed):,} params)")
+        if getattr(model_config, 'head_dim', 0) > 0:
+            std_head = config.hidden_size * config.vocab_size
+            mlp_head = config.hidden_size * model_config.head_dim + model_config.head_dim * config.vocab_size
+            print(f"  MLP head: {config.hidden_size} → {model_config.head_dim} → vocab (saves {(std_head - mlp_head):,} params)")
+    elif model_type == "graft_g2lu":
+        assert args.pretrained, "--pretrained is required for graft_g2lu architecture"
+        amp_dtype = torch.bfloat16 if config.bf16 else (torch.float16 if config.fp16 else torch.float32)
+        model = G2LU_GraftedModel(
+            pretrained_name=args.pretrained,
+            align_weight=args.align_weight,
+            warmup_steps=args.graft_warmup,
+            device=device,
+            dtype=amp_dtype,
+        )
+        model_config = None  # No CircuitConfig for HF models
+        num_params = sum(p.numel() for p in model.model.parameters() if p.requires_grad)
+    else:
+        model_config = config
+        model = CircuitTransformer(config).to(device)
+        num_params = count_parameters(model)
+        print(f"Model: CircuitTransformer")
+        print(f"  Parameters: {num_params:,} ({num_params/1e6:.1f}M)")
+        if getattr(config, 'aux_skip_k', 0) > 0:
+            print(f"  Aux skip prediction: t+{config.aux_skip_k} (weight={config.aux_skip_weight})")
+        if getattr(config, 'embed_dim', 0) > 0:
+            std_embed = config.vocab_size * config.hidden_size
+            fact_embed = config.vocab_size * config.embed_dim + config.embed_dim * config.hidden_size
+            print(f"  Factorized embedding: {config.embed_dim} → {config.hidden_size} (saves {(std_embed - fact_embed):,} params)")
+        if getattr(config, 'head_dim', 0) > 0:
+            std_head = config.hidden_size * config.vocab_size
+            mlp_head = config.hidden_size * config.head_dim + config.head_dim * config.vocab_size
+            print(f"  MLP head: {config.hidden_size} → {config.head_dim} → vocab (saves {(std_head - mlp_head):,} params)")
+    # Build word-position table if enabled
+    word_rope_dims = getattr(config, 'word_rope_dims', 0)
+    if word_rope_dims > 0:
+        word_start_table = build_word_start_table(tokenizer, len(tokenizer)).to(device)
+        print(f"  Word-position RoPE: {word_rope_dims} dims, base={getattr(config, 'word_rope_base', 10.0)}")
+        print(f"  Word starters in vocab: {word_start_table.sum().item():,} / {len(tokenizer):,}")
+    else:
+        word_start_table = None
+    # Keep raw reference for set_gate_step (torch.compile wraps the model)
+    raw_model = model
+    # Optionally compile
+    if config.compile and hasattr(torch, "compile"):
+        print("Compiling model with torch.compile...")
+        model = torch.compile(raw_model)
+    # Optimizer — with optional staggered warmup and dual-path training
+    grad_accum = getattr(args, 'grad_accum', 1)
+    opt_params = list(raw_model.trainable_parameters()) if model_type == "graft_g2lu" else model.parameters()
+    optimizer = torch.optim.AdamW(
+        opt_params,
+        lr=config.learning_rate,
+        weight_decay=config.weight_decay,
+        betas=(0.9, 0.95),
+    )
+    # Mixed precision
+    use_amp = (config.fp16 or config.bf16) and device.type == "cuda"
+    amp_dtype = torch.bfloat16 if config.bf16 else torch.float16
+    scaler = GradScaler() if (config.fp16 and use_amp) else None
+    if use_amp:
+        print(f"  Mixed precision: {'BF16' if config.bf16 else 'FP16'}" +
+              (" (no scaler)" if scaler is None else " (with GradScaler)"))
+    # Resume from checkpoint
+    start_step = 0
+    start_epoch = 0
+    skip_batches = 0
+    best_val_loss = float("inf")
+    if args.resume:
+        print(f"Resuming from: {args.resume}")
+        resume_info = load_checkpoint(args.resume, model, optimizer, scaler, args.reset)
+        if not args.reset:
+            start_step = resume_info["step"]
+            start_epoch = resume_info["epoch"]
+            skip_batches = resume_info["epoch_step"]
+        best_val_loss = resume_info["best_val_loss"]
+        print(f"Resumed at step {start_step}, epoch {start_epoch}" +
+              (f", skipping {skip_batches} batches" if skip_batches > 0 else ""))
+        if best_val_loss < float("inf"):
+            print(f"  Best val loss so far: {best_val_loss:.4f} (PPL {math.exp(min(best_val_loss, 20)):.2f})")
+    # Setup checkpoint directory
+    checkpoint_dir = Path(config.checkpoint_dir)
+    checkpoint_dir.mkdir(parents=True, exist_ok=True)
+    # Training loop
+    steps_per_epoch = math.ceil(len(dataloader) / grad_accum)
+    max_steps = config.epochs * steps_per_epoch
+    tokens_per_step = config.batch_size * grad_accum * config.max_seq_len
+    total_train_tokens = config.epochs * len(dataloader) * config.batch_size * config.max_seq_len
+    step = start_step
+    model.train()
+    print("\nStarting training:")
+    print(f"  Epochs: {config.epochs}")
+    print(f"  Batch size: {config.batch_size}" +
+          (f" x {grad_accum} accum = {config.batch_size * grad_accum} effective" if grad_accum > 1 else ""))
+    print(f"  Steps per epoch: {steps_per_epoch}" +
+          (f" ({len(dataloader)} micro-batches)" if grad_accum > 1 else ""))
+    print(f"  Total steps: {max_steps}")
+    print(f"  Total tokens: {total_train_tokens:,} ({total_train_tokens/1e6:.1f}M)")
+    if num_params > 0:
+        print(f"  Tokens/param ratio: {total_train_tokens/num_params:.1f}x (Chinchilla=20x)")
+    print(f"  Learning rate: {config.learning_rate}" +
+          (f" → {config.min_lr}" if config.min_lr > 0 else ""))
+    print(f"  Mixed precision: {use_amp}")
+    print(f"  Validation: {'enabled' if val_dataloader else 'disabled'}")
+    print()
+    total_loss = 0.0
+    log_steps = 0
+    total_tokens_seen = step * tokens_per_step
+    # best_val_loss already set in resume section above
+    h_mid_buffer = None
+    last_align_val = float("inf")
+    start_time = time.time()
+    for epoch in range(start_epoch, config.epochs):
+        epoch_start = time.time()
+        epoch_loss = 0.0
+        epoch_steps = 0
+        micro_batches = []
+        epoch_micro_batches = skip_batches if epoch == start_epoch else 0
+        for batch_idx, batch in enumerate(dataloader):
+            # Skip already-processed batches on resume
+            if epoch == start_epoch and batch_idx < skip_batches:
+                continue
+            micro_batches.append(batch)
+            epoch_micro_batches += 1
+            # Accumulate micro-batches (flush at accum boundary or epoch end)
+            if len(micro_batches) < grad_accum and batch_idx < len(dataloader) - 1:
+                continue
+            n_micro = len(micro_batches)
+            actual_tokens = n_micro * config.batch_size * config.max_seq_len
+            # Update learning rate (per-group delays for staggered warmup)
+            for param_group in optimizer.param_groups:
+                delay = param_group.get("delay", 0)
+                param_group["lr"] = get_lr(step, config.warmup_steps, max_steps, config.learning_rate, min_lr=config.min_lr, delay=delay)
+            lr = optimizer.param_groups[0]["lr"]  # for logging
+            loss_ed_val = None
+            loss_align_val = None
+            grad_norm_mid = None
+            absorb_loss_val = None
+            # Update blend alpha for G²LU grafting
+            if model_type == "graft_g2lu":
+                raw_model.set_step(step)
+            # === Standard single-path training with accumulation ===
+            optimizer.zero_grad()
+            accum_loss = 0.0
+            accum_aux = 0.0
+            accum_align = 0.0
+            for mb in micro_batches:
+                mb_ids = mb["input_ids"].to(device)
+                mb_labels = mb["labels"].to(device)
+                word_positions = None
+                if word_start_table is not None:
+                    word_positions = compute_word_positions(mb_ids, word_start_table)
+                if use_amp:
+                    with autocast('cuda', dtype=amp_dtype):
+                        output = model(mb_ids, labels=mb_labels, word_positions=word_positions)
+                else:
+                    output = model(mb_ids, labels=mb_labels, word_positions=word_positions)
+                if scaler:
+                    scaler.scale(output["loss"] / n_micro).backward()
+                else:
+                    (output["loss"] / n_micro).backward()
+                accum_loss += output["loss"].item()
+                if "aux_loss" in output:
+                    accum_aux += output["aux_loss"].item()
+                if "align_loss" in output:
+                    accum_align += output["align_loss"].item()
+            if scaler:
+                scaler.unscale_(optimizer)
+            clip_params = list(raw_model.trainable_parameters()) if model_type == "graft_g2lu" else model.parameters()
+            grad_norm = nn.utils.clip_grad_norm_(clip_params, config.grad_clip).item()
+            if scaler:
+                scaler.step(optimizer)
+                scaler.update()
+            else:
+                optimizer.step()
+            optimizer.zero_grad()
+            loss_val = accum_loss / n_micro
+            aux_loss_val = accum_aux / n_micro if accum_aux > 0 else None
+            align_loss_val = accum_align / n_micro if accum_align > 0 else None
+            total_loss += loss_val
+            epoch_loss += loss_val
+            epoch_steps += 1
+            log_steps += 1
+            total_tokens_seen += actual_tokens
+            step += 1
+            # Logging
+            if step % config.log_every == 0:
+                avg_loss = total_loss / max(log_steps, 1)
+                ppl = math.exp(min(avg_loss, 20))
+                elapsed = time.time() - start_time
+                tok_s = (log_steps * tokens_per_step) / max(elapsed, 1e-6)
+                extra = ""
+                if aux_loss_val is not None:
+                    extra += f" | Aux {aux_loss_val:.3f}"
+                if align_loss_val is not None:
+                    extra += f" | Align {align_loss_val:.4f}"
+                print(
+                    f"Step {step:6d} | "
+                    f"Epoch {epoch+1}/{config.epochs} | "
+                    f"Loss {avg_loss:.4f} | "
+                    f"PPL {ppl:8.2f} | "
+                    f"GradN {grad_norm:.3f} | "
+                    f"LR {lr:.2e} | "
+                    f"Tok/s {tok_s:.0f}"
+                    f"{extra}"
+                )
+                total_loss = 0.0
+                log_steps = 0
+                start_time = time.time()
+            # Checkpointing
+            if step % config.save_every == 0:
+                ckpt_path = checkpoint_dir / f"step_{step:06d}.pt"
+                if model_type == "graft_g2lu":
+                    save_g2lu_checkpoint(raw_model, optimizer, step, epoch, loss_val, str(ckpt_path),
+                                        epoch_step=epoch_micro_batches, best_val_loss=best_val_loss, scaler=scaler, tokenizer_name=tokenizer_name)
+                else:
+                    save_checkpoint(model, optimizer, step, epoch, loss_val, model_config, str(ckpt_path), model_type,
+                                   epoch_step=epoch_micro_batches, best_val_loss=best_val_loss, scaler=scaler, tokenizer_name=tokenizer_name)
+                print(f"  Saved checkpoint: {ckpt_path}")
+                gc.collect()
+                torch.cuda.empty_cache()
+            # Mid-training validation
+            val_every = getattr(args, 'val_every', 0)
+            if val_every > 0 and step % val_every == 0 and val_dataloader:
+                val_loss, val_ppl = evaluate(config, model, val_dataloader, device, use_amp, amp_dtype, mid_run_eval=True, word_start_table=word_start_table)
+                avg_train = epoch_loss / max(epoch_steps, 1)
+                gap = val_loss - avg_train
+                print(f"  [Val @ step {step}] Loss: {val_loss:.4f} | PPL: {val_ppl:.2f} | Gap: {gap:+.4f}")
+                if val_loss < best_val_loss:
+                    best_val_loss = val_loss
+                    best_path = checkpoint_dir / "best.pt"
+                    if model_type == "graft_g2lu":
+                        save_g2lu_checkpoint(raw_model, optimizer, step, epoch, val_loss, str(best_path),
+                                            epoch_step=epoch_micro_batches, best_val_loss=val_loss, scaler=scaler, tokenizer_name=tokenizer_name)
+                    else:
+                        save_checkpoint(model, optimizer, step, epoch, val_loss, model_config, str(best_path), model_type,
+                                       epoch_step=epoch_micro_batches, best_val_loss=val_loss, scaler=scaler, tokenizer_name=tokenizer_name)
+                    print(f"  New best! Saved: {best_path}")
+                gc.collect()
+                torch.cuda.empty_cache()
+            micro_batches = []
+        # --- Epoch summary ---
+        epoch_elapsed = time.time() - epoch_start
+        avg_epoch_loss = epoch_loss / max(epoch_steps, 1)
+        epoch_ppl = math.exp(min(avg_epoch_loss, 20))
+        print(f"\n{'='*70}")
+        print(f"Epoch {epoch+1}/{config.epochs} complete in {epoch_elapsed:.0f}s")
+        print(f"  Train loss: {avg_epoch_loss:.4f} | Train PPL: {epoch_ppl:.2f}")
+        print(f"  Tokens seen: {total_tokens_seen:,} ({total_tokens_seen/1e6:.1f}M)")
+        # Validation
+        if val_dataloader:
+            val_loss, val_ppl = evaluate(config, model, val_dataloader, device, use_amp, amp_dtype, word_start_table=word_start_table)
+            gap = val_loss - avg_epoch_loss
+            print(f"  Val loss:   {val_loss:.4f} | Val PPL:   {val_ppl:.2f} | Gap: {gap:+.4f}")
+            if val_loss < best_val_loss:
+                best_val_loss = val_loss
+                best_path = checkpoint_dir / "best.pt"
+                if model_type == "graft_g2lu":
+                    save_g2lu_checkpoint(raw_model, optimizer, step, epoch + 1, val_loss, str(best_path),
+                                        epoch_step=0, best_val_loss=val_loss, scaler=scaler, tokenizer_name=tokenizer_name)
+                else:
+                    save_checkpoint(model, optimizer, step, epoch + 1, val_loss, model_config, str(best_path), model_type,
+                                   epoch_step=0, best_val_loss=val_loss, scaler=scaler, tokenizer_name=tokenizer_name)
+                print(f"  New best! Saved: {best_path}")
+            # Free validation tensors
+            gc.collect()
+            torch.cuda.empty_cache()
+        print(f"{'='*70}\n")
+        # Save epoch checkpoint
+        ckpt_path = checkpoint_dir / f"epoch_{epoch+1:02d}.pt"
+        if model_type == "graft_g2lu":
+            save_g2lu_checkpoint(raw_model, optimizer, step, epoch + 1, avg_epoch_loss, str(ckpt_path),
+                                epoch_step=0, best_val_loss=best_val_loss, scaler=scaler, tokenizer_name=tokenizer_name)
+        else:
+            save_checkpoint(model, optimizer, step, epoch + 1, avg_epoch_loss, model_config, str(ckpt_path), model_type,
+                           epoch_step=0, best_val_loss=best_val_loss, scaler=scaler, tokenizer_name=tokenizer_name)
+        gc.collect()
+        torch.cuda.empty_cache()
+    # Save final checkpoint
+    if step == start_step:
+        print(f"\nNo training performed (already at step {step}/{max_steps}).")
+        print(f"  To train more epochs, increase --epochs beyond {config.epochs}.")
+    else:
+        final_path = checkpoint_dir / "latest.pt"
+        if model_type == "graft_g2lu":
+            save_g2lu_checkpoint(raw_model, optimizer, step, config.epochs, avg_epoch_loss, str(final_path),
+                                epoch_step=0, best_val_loss=best_val_loss, scaler=scaler, tokenizer_name=tokenizer_name)
+        else:
+            save_checkpoint(model, optimizer, step, config.epochs, avg_epoch_loss, model_config, str(final_path), model_type,
+                           epoch_step=0, best_val_loss=best_val_loss, scaler=scaler, tokenizer_name=tokenizer_name)
+        print("\nTraining complete.")
+        print(f"  Final train loss: {avg_epoch_loss:.4f} | PPL: {epoch_ppl:.2f}")
+        if val_dataloader:
+            print(f"  Best val loss: {best_val_loss:.4f} | PPL: {math.exp(min(best_val_loss, 20)):.2f}")
+        print(f"  Total tokens: {total_tokens_seen:,}")
+        print(f"  Final checkpoint: {final_path}")
+if __name__ == "__main__":
+    train()
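
The accumulation rule in the training loop above (flush once `grad_accum` micro-batches are collected, or when the last batch of the epoch is reached, skipping already-processed batches on resume) can be checked in isolation. The following is a minimal sketch, not part of the commit: `group_micro_batches` is a hypothetical helper introduced here for illustration, and it operates on plain Python values instead of tensors.

```python
import math
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def group_micro_batches(
    batches: Sequence[T], grad_accum: int, skip: int = 0
) -> Iterator[List[T]]:
    """Yield one list of micro-batches per optimizer step.

    Hypothetical helper mirroring the flush condition in train():
    a group is emitted when it holds `grad_accum` items, or when the
    final batch of the epoch is reached with a partial group pending.
    `skip` drops the first batches, as happens on resume.
    """
    pending: List[T] = []
    last_idx = len(batches) - 1
    for idx, batch in enumerate(batches):
        if idx < skip:
            continue  # already processed before the checkpoint
        pending.append(batch)
        if len(pending) < grad_accum and idx < last_idx:
            continue  # keep accumulating
        yield pending
        pending = []


# Ten micro-batches with grad_accum=4 give two full groups and one partial group.
groups = list(group_micro_batches(list(range(10)), grad_accum=4))
steps_per_epoch = math.ceil(10 / 4)

# Each micro-batch loss is divided by the size of its own group, so a
# partial final group still averages to a full-weight optimizer step.
loss_weights = [[1.0 / len(g)] * len(g) for g in groups]
```

Without a resume offset, the number of groups matches `steps_per_epoch = math.ceil(len(dataloader) / grad_accum)` as computed in the script. With `skip > 0` the grouping restarts at the first unskipped batch, so group boundaries may differ from those of an uninterrupted run.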