AbstractPhil committed on
Commit
8ef0f56
·
verified ·
1 Parent(s): eaad7fb

Update README.md

Browse files
Files changed (1) hide show
  1. README.md +196 -189
README.md CHANGED
@@ -7,128 +7,32 @@ license: mit
7
  # Geometric Terrain Statistics Composite
8
 
9
  Such a quaint little tool.
10
-
11
- ```
12
- class GeometricResidualModulator(nn.Module):
13
- def __init__(self, d_model=512, vocab_size=32128, n_geometric_dims=64,
14
- initial_alpha=0.01, n_layers=6):
15
- super().__init__()
16
- self.d_model = d_model
17
- self.n_geometric_dims = n_geometric_dims
18
- self.geometric_embed = nn.Embedding(vocab_size, n_geometric_dims)
19
- self.proj = nn.Linear(n_geometric_dims, d_model, bias=False)
20
- logit = math.log(initial_alpha / (1 - initial_alpha))
21
- self.alpha = nn.Parameter(torch.full((n_layers,), logit))
22
- nn.init.normal_(self.proj.weight, std=0.01)
23
-
24
- def forward(self, residual, token_ids, layer_idx=0):
25
- geo = self.geometric_embed(token_ids)
26
- geo_projected = self.proj(geo)
27
- a = torch.sigmoid(self.alpha[layer_idx])
28
- return (1 - a) * residual + a * geo_projected
29
-
30
- def geometric_residuals(self):
31
- W = self.geometric_embed.weight
32
- W_n = F.normalize(W, dim=1)
33
- idx = torch.randperm(min(W.shape[0], 5000))[:5000]
34
- sample = W_n[idx]
35
- cos_mat = sample @ sample.T
36
- tri = torch.triu_indices(len(idx), len(idx), offset=1)
37
- flat_cos = cos_mat[tri[0], tri[1]]
38
- norms = W.norm(dim=1)
39
- centered = W - W.mean(dim=0)
40
- cov = (centered.T @ centered) / W.shape[0]
41
- eigvals = torch.linalg.eigvalsh(cov)
42
- pr = (eigvals.sum() ** 2) / (eigvals ** 2).sum()
43
- return {
44
- 'cos_mean': flat_cos.mean().item(),
45
- 'cos_std': flat_cos.std().item(),
46
- 'norm_mean': norms.mean().item(),
47
- 'pr_over_dim': (pr / self.n_geometric_dims).item(),
48
- 'alpha': torch.sigmoid(self.alpha).detach().cpu().numpy(),
49
- }
50
-
51
-
52
- class ModulatedT5Encoder(nn.Module):
53
- def __init__(self, t5_encoder, modulator, modulate_layers=None):
54
- super().__init__()
55
- self.encoder = t5_encoder
56
- self.modulator = modulator
57
- if modulate_layers is None:
58
- modulate_layers = list(range(len(t5_encoder.block)))
59
- self.modulate_layers = set(modulate_layers)
60
-
61
- def forward(self, input_ids, attention_mask=None, output_hidden_states=False, **kwargs):
62
- hidden_states = self.encoder.embed_tokens(input_ids)
63
- hidden_states = self.encoder.dropout(hidden_states)
64
-
65
- if attention_mask is not None:
66
- extended_attention_mask = attention_mask[:, None, None, :].to(dtype=hidden_states.dtype)
67
- extended_attention_mask = (1.0 - extended_attention_mask) * torch.finfo(hidden_states.dtype).min
68
- else:
69
- extended_attention_mask = None
70
-
71
- all_hidden_states = [hidden_states] if output_hidden_states else None
72
- position_bias = None
73
- seq_length = input_ids.shape[1]
74
- cache_position = torch.arange(seq_length, device=input_ids.device)
75
-
76
- for i, block in enumerate(self.encoder.block):
77
- if i in self.modulate_layers:
78
- hidden_states = self.modulator(hidden_states, input_ids, layer_idx=i)
79
-
80
- block_output = block(hidden_states, attention_mask=extended_attention_mask,
81
- position_bias=position_bias, cache_position=cache_position)
82
- hidden_states = block_output[0]
83
-
84
- if position_bias is None:
85
- for out in block_output[1:]:
86
- if isinstance(out, torch.Tensor) and out.dim() == 4:
87
- position_bias = out
88
- break
89
-
90
- if output_hidden_states:
91
- all_hidden_states.append(hidden_states)
92
-
93
- hidden_states = self.encoder.final_layer_norm(hidden_states)
94
- hidden_states = self.encoder.dropout(hidden_states)
95
-
96
- if output_hidden_states:
97
- all_hidden_states.append(hidden_states)
98
-
99
- return type('Output', (), {
100
- 'last_hidden_state': hidden_states,
101
- 'hidden_states': tuple(all_hidden_states) if all_hidden_states else None,
102
- })()
103
-
104
-
105
- N_GEO = 64
106
- modulator = GeometricResidualModulator(
107
- d_model=512, vocab_size=32128, n_geometric_dims=N_GEO,
108
- initial_alpha=0.5, n_layers=6,
109
- ).to(device)
110
-
111
- mod_encoder = ModulatedT5Encoder(
112
- t5_encoder=model.encoder, modulator=modulator,
113
- modulate_layers=[0, 1, 2, 3, 4, 5],
114
- )
115
-
116
- ```
117
-
118
 
119
  ## Document Purpose
120
 
121
- Running catalog of geometric measurements across language models. Each metric includes its formula, measurement process, and cross-model results. Designed for expansion as new models and experiments are added.
122
 
123
  ---
124
 
125
  ## I. Models Profiled
126
 
127
- | Model | Params | Vocab | Hidden Dim | Layers | Architecture | Training Data |
128
- |---|---|---|---|---|---|---|
129
- | T5-Small | 60.5M | 32,128 | 512 | 6+6 enc-dec | Transformer (relative PE) | C4 |
130
- | Qwen3.5-0.8B | 853M (752M LM + 100M ViT) | 248,320 | 1024 | DeltaNet + MoE | Multilingual + Vision |
131
- | Qwen3.5-4B | ~4B | 248,320 | 2560 | DeltaNet + MoE | Multilingual + Vision |
 
 
 
 
 
 
 
 
 
 
 
132
 
133
  ---
134
 
@@ -172,7 +76,7 @@ Running catalog of geometric measurements across language models. Each metric in
172
  | Qwen3.5-0.8B | 0.627 | 0.062 | 0.347 | 1.057 |
173
  | Qwen3.5-4B | 0.656 | 0.067 | 0.400 | 1.091 |
174
 
175
- **Note:** T5 embeddings are unnormalized (large magnitudes). Qwen embeddings are near-unit norm. This affects downstream metric scaling but not relational structure.
176
 
177
  ---
178
 
@@ -194,7 +98,7 @@ Vol² = (-1)⁡ · det(D) / (2⁴ · (4!)²) = -det(D) / 9216
194
 Vol = √(Vol²) if Vol² > 0, else invalid
195
  ```
196
 
197
- **Process:** Sample 1000 random 5-token subsets. Compute Cayley-Menger volume for each. Compare to random Gaussian baseline (same norm distribution). Report CV (coefficient of variation = std/mean) and embed/random ratio.
198
 
199
  | Model | Valid/1000 | CV | Embed/Random Ratio |
200
  |---|---|---|---|
@@ -210,10 +114,9 @@ Vol = √(Vol²) if Vol² > 0, else invalid
210
 
211
  **Process (Qwen 0.8B vs 4B):** PCA 4B embeddings (2560β†’1024), Procrustes alignment using 10K anchor tokens, evaluate on 5K held-out tokens.
212
 
213
- | Comparison | Relational Pearson | Digit Structure Pearson |
214
  |---|---|---|
215
- | Qwen 0.8B vs 4B (raw) | 0.920 | 0.904 |
216
- | Qwen 0.8B vs 4B (Procrustes) | higher (post-alignment) | β€” |
217
 
218
  **Finding:** Models at different scales learn the same relational geometry (r=0.92).
219
 
@@ -225,21 +128,15 @@ Vol = √(Vol²) if Vol² > 0, else invalid
225
 
226
 **Formula:** For digit tokens '0'–'9', compute all 45 pairwise cosines. Measure Pearson correlation between |i−j| (numerical distance) and cosine similarity.
227
 
228
- **Process:** Encode each digit as single token, extract embedding, normalize, compute pairwise cosine matrix.
229
-
230
  | Model | |iβˆ’j| Correlation | Adjacent Mean | Non-Adjacent Mean | Gap |
231
  |---|---|---|---|---|
232
  | T5-Small | -0.575 | 0.622 | 0.442 | 0.180 |
233
  | Qwen3.5-0.8B | -0.862 | 0.769 | 0.678 | 0.091 |
234
  | Qwen3.5-4B | -0.871 | 0.790 | 0.731 | 0.059 |
235
 
236
- **Finding:** All models encode a number line. Stronger in Qwen (more training data). T5 has wider gap (adjacent vs non-adjacent more differentiated) despite weaker overall correlation.
237
-
238
- ### IV.2 Semantic Category Clustering
239
 
240
- **Formula:** For tokens in a semantic category, compute mean intra-category pairwise cosine. Compare to global mean pairwise cosine. Lift = intra βˆ’ global.
241
-
242
- **Process (T5-Small):** 8 hand-curated categories (animals, colors, numbers, body, food, emotions, actions, time), single-token words only.
243
 
244
  | Category | N tokens | Intra Cosine | Global | Lift |
245
  |---|---|---|---|---|
@@ -285,8 +182,6 @@ Vol = √(Vol²) if Vol² > 0, else invalid
285
 
286
  ### V.3 Encoder Distance Bands
287
 
288
- **Process:** Group WordNet token pairs by path similarity ranges. Measure mean cosine in each band.
289
-
290
  | WN Similarity Band | N pairs | Static Cosine | Encoder Cosine | Lift |
291
  |---|---|---|---|---|
292
  | [0.50, 0.90) | 23 | 0.244 | 0.728 | +0.484 |
@@ -296,79 +191,148 @@ Vol = √(Vol²) if Vol² > 0, else invalid
296
 
297
  ### V.4 Hypernym Chain Decay
298
 
299
- **Process:** Find WordNet synsets forming hypernym chains (e.g., dog→canine→mammal→organism). Measure cosine between root and ancestor at each depth.
300
-
301
  | Depth | Static Cosine | Encoder Cosine |
302
  |---|---|---|
303
  | 1 | 0.160 | 0.656 |
304
- | 2 | 0.090 | 0.620 |
305
  | 3 | 0.075 | 0.594 |
306
  | 5 | 0.069 | 0.585 |
307
  | 7 | 0.068 | 0.579 |
308
 
309
- **Finding:** Monotonic decay in both spaces. Encoder has much stronger signal and cleaner gradient.
310
-
311
  ---
312
 
313
- ## VI. Inactive Weight Topology (T5-Small / T5-Base)
314
 
315
- ### VI.1 SVD Effective Rank
316
 
317
- **Formula:** Stable rank = ‖W‖²_F / ‖W‖²₂ = Σσᵢ² / σ₁². Measures effective rank without thresholding.
318
 
319
- **Process:** SVD every 2D weight matrix. Report stable rank, participation ratio, active fraction (σᵢ > 0.01·σ₁), and condition number (σ₁/σₙ).
320
 
321
- | Weight Type | Stable Rank (Small) | Stable Rank (Base) |
322
- |---|---|---|
323
- | self_attn_q | 47.6 Β± 16.4 | 58.1 Β± 17.2 |
324
- | self_attn_k | 53.2 Β± 9.2 | 62.4 Β± 18.3 |
325
- | self_attn_v | 75.3 | 97.5 |
326
- | mlp_wi | 15.2 Β± 3.8 | 20.6 Β± 4.9 |
327
- | mlp_wo | 31.3 | 43.9 |
 
 
328
 
329
- ### VI.2 Sparsity Topology
 
 
 
 
 
 
330
 
331
- **Formula:** Fraction of |wᵢⱼ| below threshold.
332
 
333
- | Weight Type | <0.1 (Small) | <0.1 (Base) |
 
 
334
  |---|---|---|
335
- | self_attn_q | **93.7%** | **99.4%** |
336
- | self_attn_k | 19.2% | 30.0% |
337
- | self_attn_v | 12.1% | 16.2% |
338
- | mlp_wi | 11.9% | 16.9% |
339
- | Full model | 18.4% | 27.9% |
 
 
 
340
 
341
- **Finding:** Q matrices are overwhelmingly sparse. The query projection is >93% empty. K matrices are dense. This asymmetry grows with scale. The Q null space is the intervention point for geometric modulation.
 
 
 
 
 
 
 
 
 
 
 
 
 
342
 
343
  ### VI.3 QK Similarity Manifold
344
 
345
 **Formula:** QK = W_Q · W_Kᵀ. Eigendecompose the symmetric part (QK + QKᵀ)/2. Positive eigenvalues = attraction directions. Negative eigenvalues = repulsion directions.
346
 
347
- **Process:** Compute per-layer. Track positive/negative balance and stable rank.
348
 
349
- | Layer (Encoder) | Stable Rank | Positive Eig | Negative Eig | Symmetry Dev |
350
- |---|---|---|---|---|
351
- | 0 | 39.5 | 315 | 197 | 0.993 |
352
- | 2 | 10.1 | 269 | 243 | 1.217 |
353
- | 5 | 5.35 | 274 | 238 | 1.252 |
 
 
 
 
 
354
 
355
- **Finding:** Similarity function narrows through depth (stable rank 39β†’5). Negative eigenvalue count increases β€” deeper layers define more anti-similarity boundaries.
 
 
 
 
 
 
356
 
357
  ### VI.4 MLP Dead Neurons
358
 
359
- **Formula:** Combined importance = ‖wᵢ_up‖₂ · ‖wᵢ_down‖₂. Dead if < 1% of mean.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
360
 
361
- **Finding:** Zero dead neurons across all layers, both encoder and decoder, at both Small and Base scale. T5 is parameter-starved β€” every neuron earns its keep.
362
 
363
- ### VI.5 Position Bias Topology
 
 
 
 
 
 
 
 
 
 
 
364
 
365
- **Process:** T5 uses learned relative position biases: [32 buckets, N heads]. Measure per-head: monotonicity, distance correlation, peak bucket.
366
 
367
- **Encoder (T5-Small):** 3 local heads (peak 0-1, negative dist_corr), 2 global heads (peak 17-18, positive dist_corr), 3 mixed.
 
 
 
 
 
 
368
 
369
- **Decoder (T5-Small):** 4 far-looking heads (peak 31, values up to +48), 4 local heads (peak 0-1, values down to -34.5). Extreme magnitude asymmetry β€” far-looking heads are 10Γ— stronger.
 
 
 
 
 
370
 
371
- **Finding:** This local/global split emerges identically across T5-Small, T5-Base. It's an architectural invariant.
372
 
373
  ---
374
 
@@ -384,12 +348,6 @@ Vol = √(Vol²) if Vol² > 0, else invalid
384
 
385
  ### VII.2 Geometric Embedding Initialization
386
 
387
- **Process:**
388
- 1. Build 3000Γ—3000 Wu-Palmer similarity matrix from WordNet anchors (~6 min)
389
- 2. Eigendecompose β†’ top 64 eigenvectors scaled by √eigenvalue β†’ 64-d embeddings
390
- 3. Project remaining tokens via GPU embedding cosine proxy (10-NN, softmax-weighted, <1 sec)
391
- 4. Procrustes align projection matrix to encoder PCA space
392
-
393
  | Metric | Value |
394
  |---|---|
395
  | WN reconstruction correlation | 0.921 |
@@ -398,8 +356,6 @@ Vol = √(Vol²) if Vol² > 0, else invalid
398
 
399
  ### VII.3 Alpha Convergence
400
 
401
- **Process:** Freeze T5, train only modulator (geometric embed + projection + alpha). Task: summarize definition β†’ lemma word. Track alpha per layer.
402
-
403
 | Start α | Final Mean α | Layer 5 Final | Pearson Δ | CV | Coherent | Basin |
404
  |---|---|---|---|---|---|---|
405
  | 0.01 (20 ep) | **0.067** | **0.107** | **+0.151** | **0.220** | **Yes** | Binding |
@@ -407,8 +363,6 @@ Vol = √(Vol²) if Vol² > 0, else invalid
407
  | 0.70 (20 ep) | 0.695 | 0.640 | -0.029 | 0.482 | No | Separation |
408
  | 0.01 (100 ep) | 0.125 | 0.218 | +0.074 | 0.322 | No | Overfit |
409
 
410
- **Finding:** Two stable attractor basins exist β€” binding (~0.07) and separation (~0.70). The binding basin produces functional results. Starting at 0.01 with early stopping (20 epochs) is optimal.
411
-
412
  ### VII.4 Depth Gradient (Consistent Across All Runs)
413
 
414
  | Layer | 20ep (Ξ±=0.01) | 100ep (Ξ±=0.01) | 20ep (Ξ±=0.20) |
@@ -435,9 +389,35 @@ Vol = √(Vol²) if Vol² > 0, else invalid
435
 
436
  ---
437
 
438
- ## VIII. The 0.29154 Constant
439
 
440
- ### VIII.1 Observations Across Systems
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
441
 
442
  | System | Context | Value |
443
  |---|---|---|
@@ -445,12 +425,13 @@ Vol = √(Vol²) if Vol² > 0, else invalid
445
  | Wormhole Lambda | Vision transformer training | Converges from 0.74 toward ~0.29 |
446
  | Alpha curriculum | Devil's Staircase PE training | Converges to ~0.50 under geometric loss, CE destroys |
447
  | T5 generation | Greedy decode alpha sweep | Stable plateau at 0.291–0.292, semantic phase transition |
 
448
 
449
- ### VIII.2 T5 Generation Phase Transition
450
 
451
  | Alpha | Output (triangle prompt) |
452
  |---|---|
453
- | 0.01–0.10 | "triangle is a polygon with three edges and three vertices. it is one of the basic shapes in geometry." |
454
  | 0.20 | "**a** triangle is a polygon with three edges and three vertices..." |
455
  | 0.28 | "a polygon with three vertices. it is one of the basic shapes in **a graph**." |
456
  | 0.291 | "a triangle is a polygon with a vertice and a vertice. it is one of the basic shapes in **a graph**." |
@@ -462,7 +443,7 @@ Vol = √(Vol²) if Vol² > 0, else invalid
462
 
463
  ---
464
 
465
- ## IX. Universal Geometric Constants
466
 
467
  | Constant | Value | Observed In |
468
  |---|---|---|
@@ -470,11 +451,16 @@ Vol = √(Vol²) if Vol² > 0, else invalid
470
  | Participation / dim | 0.53–0.56 | T5-Small, Qwen 0.8B |
471
  | Binding/separation constant | 0.29154 / 0.70846 | MinimalShunts, CLIP projections, T5 generation, alpha convergence |
472
  | Depth gradient | Monotonic increasing | All modulator training runs |
473
- | Q sparsity scaling | Increases with model scale | T5-Small (93.7%), T5-Base (99.4%) |
 
 
 
 
 
474
 
475
  ---
476
 
477
- ## X. Measurement Toolkit Reference
478
 
479
  | Tool | Input | Output | Requires Inference |
480
  |---|---|---|---|
@@ -484,12 +470,33 @@ Vol = √(Vol²) if Vol² > 0, else invalid
484
  | Digit Manifold | 10 digit token embeddings | |iβˆ’j| correlation, adjacency gap | No |
485
  | SVD Effective Rank | Any 2D weight matrix | Stable rank, condition number | No |
486
  | QK Manifold | W_Q, W_K matrices | Eigenspectrum, pos/neg balance | No |
487
- | Dead Neuron Count | MLP wi, wo matrices | Combined importance distribution | No |
 
 
 
488
  | WordNet Relational | Encoder output (mean-pooled) | Pearson/Spearman vs path similarity | Yes |
489
  | Alpha Convergence | Modulator training loop | Per-layer equilibrium values | Yes (training) |
490
 
491
  ---
492
 
493
- *Last updated: 2026-03-05*
494
- *Models profiled: 3 (T5-Small, Qwen3.5-0.8B, Qwen3.5-4B)*
495
- *Modulator experiments: 4 configurations*
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
7
  # Geometric Terrain Statistics Composite
8
 
9
  Such a quaint little tool.
10
+ # Geometric Terrain Statistics Composite
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
11
 
12
  ## Document Purpose
13
 
14
+ Running catalog of geometric measurements across language and vision models. Each metric includes its formula, measurement process, and cross-model results. Designed for expansion as new models and experiments are added.
15
 
16
  ---
17
 
18
  ## I. Models Profiled
19
 
20
+ | Model | Params | Vocab | Hidden Dim | Layers | Heads | Architecture | Training |
21
+ |---|---|---|---|---|---|---|---|
22
+ | T5-Small | 60.5M | 32,128 | 512 | 6+6 | 8 | Enc-Dec (relative PE, ReLU MLP) | C4 span corruption |
23
+ | T5-Base | 222.9M | 32,128 | 768 | 12+12 | 12 | Enc-Dec (relative PE, ReLU MLP) | C4 span corruption |
24
+ | T5-v1.1-XXL | 11.4B | 32,128 | 4096 | 24+24 | 64 | Enc-Dec (relative PE, **GeGLU** MLP) | C4 (v1.1 variant, no multi-task) |
25
+ | BERT-large | 336.2M | 30,522 | 1024 | 24 | 16 | Encoder-only (absolute PE) | BookCorpus+Wikipedia MLM |
26
+ | CLIP-ViT-B/16 | 85.5M (visual) | — | 768 | 12 | 12 | Vision encoder (fused QKV) | LAION-2B contrastive |
27
+ | DINOv2-large | 302.0M | — | 1024 | 24 | 16 | Vision encoder (separate Q/K/V) | Self-supervised (no labels) |
28
+ | CLIP-ViT-bigG/14 | 1.84B (visual) | — | 1664 | 48 | 16 | Vision encoder (fused QKV) | LAION-2B contrastive |
29
+ | Qwen3.5-0.8B | 853M | 248,320 | 1024 | — | — | DeltaNet + MoE + ViT | Multilingual + Vision |
30
+ | Qwen3.5-4B | ~4B | 248,320 | 2560 | — | — | DeltaNet + MoE + ViT | Multilingual + Vision |
31
+
32
+ **Notes:**
33
+ - T5-v1.1-XXL encoder is the text encoder used by Flux.1 Schnell, Flux.1 Dev, and Flux.2
34
+ - CLIP models use fused QKV (`in_proj_weight`); Q/K/V split by thirds for analysis
35
+ - T5-v1.1 uses GeGLU (wi_0 gate + wi_1 value) instead of ReLU (single wi)
36
 
37
  ---
38
 
 
76
  | Qwen3.5-0.8B | 0.627 | 0.062 | 0.347 | 1.057 |
77
  | Qwen3.5-4B | 0.656 | 0.067 | 0.400 | 1.091 |
78
 
79
+ **Note:** T5 embeddings are unnormalized (large magnitudes). Qwen embeddings are near-unit norm.
80
 
81
  ---
82
 
 
98
 Vol = √(Vol²) if Vol² > 0, else invalid
99
  ```
100
 
101
+ **Process:** Sample 1000 random 5-token subsets. Compute Cayley-Menger volume for each. Report CV (coefficient of variation = std/mean).
102
 
103
  | Model | Valid/1000 | CV | Embed/Random Ratio |
104
  |---|---|---|---|
 
114
 
115
  **Process (Qwen 0.8B vs 4B):** PCA 4B embeddings (2560β†’1024), Procrustes alignment using 10K anchor tokens, evaluate on 5K held-out tokens.
116
 
117
+ | Comparison | Relational Pearson | Pentachoron per-simplex corr |
118
  |---|---|---|
119
+ | Qwen 0.8B vs 4B (raw) | 0.920 | 0.89 |
 
120
 
121
  **Finding:** Models at different scales learn the same relational geometry (r=0.92).
122
 
 
128
 
129
 **Formula:** For digit tokens '0'–'9', compute all 45 pairwise cosines. Measure Pearson correlation between |i−j| (numerical distance) and cosine similarity.
130
 
 
 
131
  | Model | |iβˆ’j| Correlation | Adjacent Mean | Non-Adjacent Mean | Gap |
132
  |---|---|---|---|---|
133
  | T5-Small | -0.575 | 0.622 | 0.442 | 0.180 |
134
  | Qwen3.5-0.8B | -0.862 | 0.769 | 0.678 | 0.091 |
135
  | Qwen3.5-4B | -0.871 | 0.790 | 0.731 | 0.059 |
136
 
137
+ ### IV.2 Semantic Category Clustering (T5-Small)
 
 
138
 
139
+ **Formula:** Mean intra-category pairwise cosine vs global mean pairwise cosine. Lift = intra βˆ’ global.
 
 
140
 
141
  | Category | N tokens | Intra Cosine | Global | Lift |
142
  |---|---|---|---|---|
 
182
 
183
  ### V.3 Encoder Distance Bands
184
 
 
 
185
  | WN Similarity Band | N pairs | Static Cosine | Encoder Cosine | Lift |
186
  |---|---|---|---|---|
187
  | [0.50, 0.90) | 23 | 0.244 | 0.728 | +0.484 |
 
191
 
192
  ### V.4 Hypernym Chain Decay
193
 
 
 
194
  | Depth | Static Cosine | Encoder Cosine |
195
  |---|---|---|
196
  | 1 | 0.160 | 0.656 |
 
197
  | 3 | 0.075 | 0.594 |
198
  | 5 | 0.069 | 0.585 |
199
  | 7 | 0.068 | 0.579 |
200
 
 
 
201
  ---
202
 
203
+ ## VI. Cross-Architecture Inactive Weight Topology
204
 
205
+ ### VI.1 Q/K/V Sparsity (<0.1 threshold)
206
 
207
+ **Formula:** Fraction of |wᵢⱼ| < 0.1 across all weights of that type.
208
 
209
+ **Process:** Iterate all 2D weight matrices, compute abs values, count below threshold. No inference needed.
210
 
211
+ | Model | Q | K | V | O | MLP | Full Model |
212
+ |---|---|---|---|---|---|---|
213
+ | **T5-Small** (512d, 6L) | **93.7%** | 19.2% | 12.1% | 10.4% | 11.9% | 18.4% |
214
+ | **T5-Base** (768d, 12L) | **99.4%** | 30.0% | 16.2% | 13.5% | 16.9% | 27.9% |
215
+ | **T5-v1.1-XXL** (4096d, 24L) | **100.0%** | **65.5%** | 73.1% | 65.4% | ~57% | β€” |
216
+ | BERT-large (1024d, 24L) | 99.1% | 99.1% | 99.9% | 99.9% | 99.4% | 99.3% |
217
+ | DINOv2-large (1024d, 24L) | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% |
218
+ | CLIP-ViT-B/16 (768d, 12L) | β€” (fused) | β€” | β€” | β€” | 100.0% | 100.0% |
219
+ | CLIP-ViT-bigG (1664d, 48L) | β€” (fused) | β€” | β€” | β€” | ~97% | 98.0% |
220
 
221
+ **Key Finding β€” T5 Q/K Asymmetry Scales:**
222
+
223
+ | Model | Q (<0.1) | K (<0.1) | Q/K Ratio |
224
+ |---|---|---|---|
225
+ | T5-Small | 93.7% | 19.2% | **4.9Γ—** |
226
+ | T5-Base | 99.4% | 30.0% | **3.3Γ—** |
227
+ | T5-v1.1-XXL | 100.0% | 65.5% | **1.5Γ—** |
228
 
229
+ T5 has a genuine Q-specific sparsity that scales with model size. Q hit 100.0% at XXL (every single weight below 0.1). This is NOT the BERT/DINOv2 pattern where all weight types are uniformly sparse. The query projection in T5 is **functionally vestigial at scale**.
230
 
231
+ **T5-v1.1-XXL Encoder vs Decoder:**
232
+
233
+ | Component | Encoder | Decoder |
234
  |---|---|---|
235
+ | self_attn_q | 100.0% | 100.0% |
236
+ | self_attn_k | 71.7% | 59.4% |
237
+ | self_attn_v | 76.0% | 70.1% |
238
+ | cross_attn_q | β€” | 100.0% |
239
+ | cross_attn_k | β€” | 63.1% |
240
+ | cross_attn_v | β€” | 71.1% |
241
+
242
+ Q is 100% sparse everywhere β€” self-attention and cross-attention, encoder and decoder.
243
 
244
+ ### VI.2 SVD Effective Rank
245
+
246
+ **Formula:** Stable rank = ‖W‖²_F / ‖W‖²₂ = Σσᵢ² / σ₁². Measures effective rank without thresholding.
247
+
248
+ | Weight Type | T5-Small | T5-Base | T5-v1.1-XXL | BERT-large | DINOv2-large |
249
+ |---|---|---|---|---|---|
250
+ | self_attn_q | 47.6 | 58.1 | 96.8 | 50.8 | 57.7 |
251
+ | self_attn_k | 53.2 | 62.4 | 90.0 | 37.7 | 55.5 |
252
+ | self_attn_v | 75.3 | 97.5 | 204.4 | 113.0 | 94.8 |
253
+ | self_attn_o | 25.4 | 35.0 | 16.4 | 125.0 | 85.6 |
254
+ | mlp_up/gate | 15.2 | 20.6 | 67.9 (gate) / 247.3 (up) | 27.4 | 58.4 |
255
+ | mlp_down | 31.3 | 43.9 | 25.3 | 52.2 | 94.4 |
256
+
257
+ **T5-v1.1-XXL O matrices have very low stable rank (16.4)** β€” the output projection is extremely low-rank despite the 4096-d space. Cross-attention O is even lower at 6.1.
258
 
259
  ### VI.3 QK Similarity Manifold
260
 
261
 **Formula:** QK = W_Q · W_Kᵀ. Eigendecompose the symmetric part (QK + QKᵀ)/2. Positive eigenvalues = attraction directions. Negative eigenvalues = repulsion directions.
262
 
263
+ **Positive Eigenvalue Fraction Trends:**
264
 
265
+ | Model | First Layer | Last Layer | Trend |
266
+ |---|---|---|---|
267
+ | T5-Small encoder | 0.615 | 0.535 | **βˆ’0.080** (decreasing) |
268
+ | T5-v1.1-XXL encoder | 0.510 | 0.503 | **βˆ’0.007** (flat) |
269
+ | T5-v1.1-XXL decoder self | 0.501 | 0.548 | **+0.047** (increasing) |
270
+ | **T5-v1.1-XXL cross-attn** | **0.500** | **0.500** | **0.000 (locked)** |
271
+ | BERT-large | 0.446 | 0.513 | +0.066 (increasing) |
272
+ | CLIP-ViT-B/16 | 0.503 | 0.538 | +0.035 (increasing) |
273
+ | DINOv2-large | 0.498 | 0.548 | +0.050 (increasing) |
274
+ | CLIP-ViT-bigG | 0.498 | 0.582 | +0.084 (increasing) |
275
 
276
+ **Critical Finding β€” Cross-Attention is Perfectly Balanced:**
277
+
278
+ T5-v1.1-XXL cross-attention QK manifold is exactly 0.500 positive / 0.500 negative at ALL 24 layers. Symmetry deviation is 1.414 (= √2) everywhere. This is a locked equilibrium β€” the bridge between encoder and decoder maintains perfect balance between attraction and repulsion at every depth. No other attention type shows this level of stability.
279
+
280
+ **T5-v1.1-XXL encoder self-attention is flat (~0.50 throughout).** Unlike T5-Small which decreased from 0.615 to 0.535, the XXL encoder stays near the equilibrium point. The larger model doesn't need to build anti-similarity boundaries because it has enough capacity to discriminate through other mechanisms.
281
+
282
+ **BERT starts BELOW 0.50 (0.446).** The only model with majority-repulsion from layer 0. MLM bidirectional training creates fundamentally different QK geometry from autoregressive or contrastive training.
283
 
284
  ### VI.4 MLP Dead Neurons
285
 
286
+ **Formula:** Combined importance = ‖wᵢ_up‖₂ · ‖wᵢ_down‖₂ (ReLU) or ‖wᵢ_gate‖₂ · ‖wᵢ_up‖₂ · ‖wᵢ_down‖₂ (GeGLU). Dead if < 1% of mean.
287
+
288
+ | Model | Dead (<1% mean) | Weak (<10% mean) | Notes |
289
+ |---|---|---|---|
290
+ | T5-Small (enc+dec) | 0/24,576 (0.00%) | 0/24,576 (0.00%) | All neurons alive |
291
+ | T5-Base (enc+dec) | 0/73,728 (0.00%) | 0/73,728 (0.00%) | All neurons alive |
292
+ | T5-v1.1-XXL encoder | 0/245,760 (0.00%) | 0/245,760 (0.00%) | All neurons alive |
293
+ | T5-v1.1-XXL decoder | **14/245,760 (0.01%)** | **461/245,760 (0.19%)** | First dead neurons in T5 family |
294
+ | BERT-large | 0/98,304 (0.00%) | 0/98,304 (0.00%) | All neurons alive |
295
+ | DINOv2-large | 0/98,304 (0.00%) | 0/98,304 (0.00%) | All neurons alive |
296
+ | CLIP-ViT-B/16 | **1,316/36,864 (3.57%)** | 1,356/36,864 (3.68%) | Only model with significant dead neurons |
297
+ | CLIP-ViT-bigG | 0/393,216 (0.00%) | **24,163/393,216 (6.14%)** | 0 dead but 6% weak |
298
+
299
+ **Finding:** T5-v1.1-XXL decoder has the first dead neurons in the T5 family β€” 14 neurons in layers 1-2 only. The decoder's early GeGLU layers carved out a tiny amount of capacity. Encoder uses everything. CLIP-ViT-B/16 is the outlier with 3.6% dead neurons β€” contrastive training at small scale produces genuine pruning.
300
+
301
+ ### VI.5 Cross-Layer Weight Correlation
302
 
303
+ **Formula:** cos(flatten(Wα΅’), flatten(Wβ±Ό)) between weight matrices of the same type at different layers.
304
 
305
+ | Model | Q adj mean | K adj mean | MLP_up adj mean |
306
+ |---|---|---|---|
307
+ | T5-Small | ~0.000 | ~0.000 | 0.031–0.045 |
308
+ | T5-Base | ~0.000 | ~0.000 | 0.024–0.036 |
309
+ | T5-v1.1-XXL encoder | 0.0001 | β€” | β€” |
310
+ | T5-v1.1-XXL decoder | βˆ’0.0001 | β€” | β€” |
311
+ | BERT-large | 0.0002 | 0.0003 | 0.032 |
312
+ | CLIP-ViT-B/16 | βˆ’0.0004 (QKV) | β€” | 0.008 |
313
+ | DINOv2-large | βˆ’0.0003 | βˆ’0.0002 | 0.006 |
314
+ | CLIP-ViT-bigG | 0.0000 (QKV) | β€” | 0.055 |
315
+
316
+ **Universal finding:** Attention weights (Q, K, V) are completely uncorrelated across layers (~0.000). Every layer defines an independent similarity function. MLP weights show positive correlation decaying with distance β€” feedforward layers share structure.
317
 
318
+ ### VI.6 Position Bias Topology
319
 
320
+ **T5 uses learned relative position biases:** [32 buckets Γ— N_heads].
321
+
322
+ | Model | Encoder | Decoder |
323
+ |---|---|---|
324
+ | T5-Small (8 heads) | 3 local, 2 global, 3 mixed | 4 local, 4 global, 0 mixed |
325
+ | T5-Base (12 heads) | 4 local, 3 global, 5 mixed | 5 local, 4 global, 3 mixed |
326
+ | T5-v1.1-XXL (64 heads) | **24 local, 2 global, 38 mixed** | **27 local, 37 global, 0 mixed** |
327
 
328
+ **T5-v1.1-XXL position findings:**
329
+ - Encoder: 38/64 mixed heads β€” nuanced position sensitivity at scale
330
+ - **Decoder: ZERO mixed heads** β€” perfect binary crystallization. Every head is either pure local or pure global
331
+ - Decoder is 58% global (37/64) β€” overwhelmingly biased toward long-range attention
332
+ - Encoder range: [-47.2, 11.2] β€” strong local suppression
333
+ - Decoder range: [-28.4, 17.0] β€” more balanced
334
 
335
+ **Finding:** The decoder local/global binary split is scale-invariant (0 mixed at T5-Small, 0 mixed at XXL). Gradient descent crystallizes decoder position heads into two pure modes regardless of capacity.
336
 
337
  ---
338
 
 
348
 
349
  ### VII.2 Geometric Embedding Initialization
350
 
 
 
 
 
 
 
351
  | Metric | Value |
352
  |---|---|
353
  | WN reconstruction correlation | 0.921 |
 
356
 
357
  ### VII.3 Alpha Convergence
358
 
 
 
359
 | Start α | Final Mean α | Layer 5 Final | Pearson Δ | CV | Coherent | Basin |
360
  |---|---|---|---|---|---|---|
361
  | 0.01 (20 ep) | **0.067** | **0.107** | **+0.151** | **0.220** | **Yes** | Binding |
 
363
  | 0.70 (20 ep) | 0.695 | 0.640 | -0.029 | 0.482 | No | Separation |
364
  | 0.01 (100 ep) | 0.125 | 0.218 | +0.074 | 0.322 | No | Overfit |
365
 
 
 
366
  ### VII.4 Depth Gradient (Consistent Across All Runs)
367
 
368
  | Layer | 20ep (Ξ±=0.01) | 100ep (Ξ±=0.01) | 20ep (Ξ±=0.20) |
 
389
 
390
  ---
391
 
392
+ ## VIII. Geometric Field Modulator (Multi-Expert)
393
 
394
+ ### VIII.1 Architecture
395
+
396
+ - Three KSimplexChannel experts: k=1 (edge, 2 features), k=2 (triangle, 4 features), k=4 (pentachoron, 11 features)
397
+ - **Multiplicative gating**: residual × Π(blended_gates) — valid regions pass, invalid suppressed
398
+ - **Soft blending**: per expert gate = (1 − α) + α × expert_gate
399
+ - **Null space**: 25% of residual dimensions untouched by modulator
400
+ - **Alpha clamped**: [0.001, 0.35] β€” hard ceiling below the phase boundary
401
+ - **Gradient scaling**: geometric params at 10% LR, alpha at 50% LR, gates at full LR
402
+ - Params: **38,552** (0.064% of T5-Small)
403
+ - Self-test: validity=0.985, null space preserved, template volumes sane
404
+
405
+ ### VIII.2 Design Rationale (Grounded in Cross-Architecture Data)
406
+
407
+ | Data Point | Design Decision |
408
+ |---|---|
409
+ | Q sparsity 100% at scale | Geometric field can replace Q β€” the model barely uses it |
410
+ | Cross-attn QK locked at 0.500 | Target equilibrium for geometric validity gating |
411
+ | Depth gradient always increasing | Per-layer alpha respects this (low early, high late) |
412
+ | Zero dead MLP neurons | Don't touch MLPs β€” all capacity is in use |
413
+ | Decoder position: binary L/G split | Modulator preserves positional structure (null space) |
414
+ | CV 0.20–0.23 universal | CV monitoring as health check, not loss |
415
+
416
+ ---
417
+
418
+ ## IX. The 0.29154 Constant
419
+
420
+ ### IX.1 Observations Across Systems
421
 
422
  | System | Context | Value |
423
  |---|---|---|
 
425
  | Wormhole Lambda | Vision transformer training | Converges from 0.74 toward ~0.29 |
426
  | Alpha curriculum | Devil's Staircase PE training | Converges to ~0.50 under geometric loss, CE destroys |
427
  | T5 generation | Greedy decode alpha sweep | Stable plateau at 0.291–0.292, semantic phase transition |
428
+ | Alpha training basins | 0.70 start → settled at 0.695 | Mirror constant 1 − 0.29154 = 0.70846, Δ = 0.013 |
429
 
430
+ ### IX.2 T5 Generation Phase Transition
431
 
432
  | Alpha | Output (triangle prompt) |
433
  |---|---|
434
+ | 0.01–0.10 | "...three edges and three vertices. it is one of the basic shapes in geometry." |
435
  | 0.20 | "**a** triangle is a polygon with three edges and three vertices..." |
436
  | 0.28 | "a polygon with three vertices. it is one of the basic shapes in **a graph**." |
437
  | 0.291 | "a triangle is a polygon with a vertice and a vertice. it is one of the basic shapes in **a graph**." |
 
443
 
444
  ---
445
 
446
+ ## X. Universal Geometric Constants
447
 
448
  | Constant | Value | Observed In |
449
  |---|---|---|
 
451
  | Participation / dim | 0.53–0.56 | T5-Small, Qwen 0.8B |
452
  | Binding/separation constant | 0.29154 / 0.70846 | MinimalShunts, CLIP projections, T5 generation, alpha convergence |
453
  | Depth gradient | Monotonic increasing | All modulator training runs |
454
+ | Q sparsity scaling (T5) | 93.7% β†’ 99.4% β†’ 100.0% | T5-Small β†’ T5-Base β†’ T5-v1.1-XXL |
455
+ | Cross-attn QK balance | Locked at 0.500 | T5-v1.1-XXL (all 24 layers) |
456
+ | Attention cross-layer corr | ~0.000 | ALL models profiled (8 models) |
457
+ | MLP cross-layer corr | 0.006–0.055 (positive, decays) | ALL models profiled |
458
+ | Decoder position crystallization | 0 mixed heads | T5-Small, T5-v1.1-XXL |
459
+ | MLP full utilization | 0.00% dead neurons | T5 family (enc), BERT, DINOv2 |
460
 
461
  ---
462
 
463
+ ## XI. Measurement Toolkit Reference
464
 
465
  | Tool | Input | Output | Requires Inference |
466
  |---|---|---|---|
 
470
  | Digit Manifold | 10 digit token embeddings | |iβˆ’j| correlation, adjacency gap | No |
471
  | SVD Effective Rank | Any 2D weight matrix | Stable rank, condition number | No |
472
  | QK Manifold | W_Q, W_K matrices | Eigenspectrum, pos/neg balance | No |
473
+ | Dead Neuron Count | MLP wi/gate/up, wo matrices | Combined importance distribution | No |
474
+ | Cross-Layer Correlation | Same-type weight matrices | Adjacent cosine similarity | No |
475
+ | Position Bias Topology | Relative attention bias tensor | Local/global/mixed head counts | No |
476
+ | Sparsity Topology | Any weight matrix | Fraction below threshold | No |
477
  | WordNet Relational | Encoder output (mean-pooled) | Pearson/Spearman vs path similarity | Yes |
478
  | Alpha Convergence | Modulator training loop | Per-layer equilibrium values | Yes (training) |
479
 
480
  ---
481
 
482
+ ## XII. Scripts Reference
483
+
484
+ | Script | Purpose | Key Outputs |
485
+ |---|---|---|
486
+ | `probe_t5_small_terrain.py` | T5-Small embedding + layer geometry | PR, CV, digit manifold, layer evolution |
487
+ | `probe_t5_wordnet_summarize.py` | T5-Small Γ— WordNet relational alignment | Pearson, Spearman, distance bands, hypernym decay |
488
+ | `probe_t5_wordnet_50seeds.py` | 50-seed stability test (GPU-accelerated) | Confidence intervals for all relational metrics |
489
+ | `probe_t5_inactive_weights.py` | T5-Small/Base inactive weight topology | SVD, sparsity, QK manifold, dead neurons |
490
+ | `cross_architecture_weight_battery.py` | BERT + CLIP + DINOv2 battery | Cross-model comparison table |
491
+ | `probe_flux_t5_g4.py` | T5-v1.1-XXL (Flux encoder) full battery | All layers, encoder + decoder + cross-attn |
492
+ | `geometric_residual_modulator.py` | LERP modulator + training utilities | Modulator class + measurement tools |
493
+ | `geometric_field_modulator.py` | Multi-expert field modulator | KSimplex experts + multiplicative gating |
494
+ | `geometric_modulator_full_pipeline.py` | Self-contained T5 + WordNet + modulator | End-to-end pipeline |
495
+ | `train_modulator.py` | Training loop for alpha convergence | Freeze T5, train modulator, track alpha |
496
+
497
+ ---
498
+
499
+ *Last updated: 2026-03-06*
500
+ *Models profiled: 9 (T5-Small, T5-Base, T5-v1.1-XXL, BERT-large, CLIP-ViT-B/16, DINOv2-large, CLIP-ViT-bigG, Qwen3.5-0.8B, Qwen3.5-4B)*
501
+ *Cross-architecture battery: 7 models, 4 training objectives (MLM, span corruption, contrastive, self-supervised)*
502
+ *Modulator experiments: 4 LERP configurations, 1 field modulator*