---
license: mit
---
# Day 2
# Geometric Terrain Statistics Composite
## Document Purpose
Running catalog of geometric measurements across language and vision models. Each metric includes its formula, measurement process, and cross-model results. Designed for expansion as new models and experiments are added.
---
## I. Models Profiled
| Model | Params | Vocab | Hidden Dim | Layers | Heads | Architecture | Training |
|---|---|---|---|---|---|---|---|
| T5-Small | 60.5M | 32,128 | 512 | 6+6 | 8 | Enc-Dec (relative PE, ReLU MLP) | C4 span corruption |
| T5-Base | 222.9M | 32,128 | 768 | 12+12 | 12 | Enc-Dec (relative PE, ReLU MLP) | C4 span corruption |
| T5-v1.1-XXL | 11.4B | 32,128 | 4096 | 24+24 | 64 | Enc-Dec (relative PE, **GeGLU** MLP) | C4 (v1.1 variant, no multi-task) |
| BERT-large | 336.2M | 30,522 | 1024 | 24 | 16 | Encoder-only (absolute PE) | BookCorpus+Wikipedia MLM |
| CLIP-ViT-B/16 | 85.5M (visual) | – | 768 | 12 | 12 | Vision encoder (fused QKV) | LAION-2B contrastive |
| DINOv2-large | 302.0M | – | 1024 | 24 | 16 | Vision encoder (separate Q/K/V) | Self-supervised (no labels) |
| CLIP-ViT-bigG/14 | 1.84B (visual) | – | 1664 | 48 | 16 | Vision encoder (fused QKV) | LAION-2B contrastive |
| Qwen3.5-0.8B | 853M | 248,320 | 1024 | – | – | DeltaNet + MoE + ViT | Multilingual + Vision |
| Qwen3.5-4B | ~4B | 248,320 | 2560 | – | – | DeltaNet + MoE + ViT | Multilingual + Vision |
| T5Gemma2-1B-1B | 2.1B | 262,144 | 1152 | 27+26 | GQA 4:1 | Adapted enc-dec (Gemma 2, RoPE, GeGLU) | Gemma 2 decoder → enc-dec |
| T5Gemma2-4B-4B | 7.5B | 262,144 | 2560 | 34+34 | GQA 2:1 | Adapted enc-dec (Gemma 2, RoPE, GeGLU) | Gemma 2 decoder → enc-dec |
| SD 1.5 UNet | 860M | – | [320,640,1280,1280] | 16 attn blocks | 8 | Conv UNet + self/cross attn | LDM diffusion (LAION) |
| SDXL UNet | 2.6B | – | [320,640,1280] | 70 attn blocks | [5,10,20] | Conv UNet + self/cross attn | LDM diffusion (internal) |
| SD 1.5 VAE | 83.7M | – | 4 latent ch | [128,256,512,512] | – | Conv autoencoder + mid attn | Reconstruction (LAION) |
| SDXL VAE | 83.7M | – | 4 latent ch | [128,256,512,512] | – | Conv autoencoder + mid attn | Reconstruction (internal) |
| Flux.1 VAE | 83.8M | – | 16 latent ch | [128,256,512,512] | – | Conv autoencoder + mid attn | Reconstruction (BFL) |
| Flux.2 VAE | 84.0M | – | 32 latent ch | [128,256,512,512] | – | Conv autoencoder + mid attn | Reconstruction (BFL) |
**Notes:**
- T5-v1.1-XXL encoder is the text encoder used by Flux.1 Schnell, Flux.1 Dev, and Flux.2
- CLIP models use fused QKV (`in_proj_weight`); Q/K/V split by thirds for analysis
- T5-v1.1 uses GeGLU (wi_0 gate + wi_1 value) instead of ReLU (single wi)
- T5Gemma2 models are Gemma 2 decoder weights adapted to encoder-decoder; include ViT vision tower
- UNet attention: attn1 = self-attention (spatial), attn2 = cross-attention (to text encoder)
- VAE Conv2d weights reshaped to 2D as [out_channels, in_channels * kH * kW] for analysis
- VAE attention exists only at the bottleneck (mid_block): one in encoder, one in decoder
---
## II. Embedding Geometry Metrics
### II.1 Participation Ratio (Effective Dimensionality)
**Formula:** PR = (Σλᵢ)² / Σ(λᵢ²), where λᵢ are eigenvalues of the embedding covariance matrix.
**Process:** Center embeddings (subtract mean), compute covariance C = EᵀE / N, eigendecompose. PR counts the effective number of dimensions used. PR/dim normalizes to [0, 1].
| Model | PR | PR / dim | Dims for 95% var |
|---|---|---|---|
| T5-Small (512d) | 287.2 | **0.561** | 379 (74.0%) |
| Qwen3.5-0.8B (1024d) | 547.7 | **0.535** | 893 (87.2%) |
| Qwen3.5-4B (2560d) | 812.4 | **0.317** | 2125 (83.0%) |
**Finding:** PR/dim ≈ 0.53–0.56 for smaller models. Appears to be a universal attractor for embedding dimensionality utilization.
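The II.1 measurement reduces to a few lines of NumPy. A minimal sketch (illustrative, not the original measurement script):

```python
import numpy as np

def participation_ratio(E):
    """Effective dimensionality of an embedding matrix E [n_tokens, dim].

    PR = (sum lambda_i)^2 / sum(lambda_i^2) over covariance eigenvalues.
    """
    X = E - E.mean(axis=0)            # center embeddings
    C = X.T @ X / X.shape[0]          # covariance [dim, dim]
    eig = np.linalg.eigvalsh(C)       # real eigenvalues, ascending
    eig = np.clip(eig, 0.0, None)     # clamp tiny numerical negatives
    return eig.sum() ** 2 / (eig ** 2).sum()

# sanity check: isotropic Gaussian data uses nearly all dimensions
rng = np.random.default_rng(0)
E = rng.normal(size=(5000, 64))
pr = participation_ratio(E)
print(pr / 64)  # close to 1 for isotropic data
```

For a trained embedding table, `E` would be the `[vocab, dim]` weight matrix; PR/dim well below 1 indicates anisotropy.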
### II.2 Pairwise Cosine Similarity Distribution
**Formula:** cos(eᵢ, eⱼ) = (eᵢ · eⱼ) / (‖eᵢ‖ · ‖eⱼ‖), sampled over 5K random tokens (12.5M pairs).
**Process:** Random sample 5K token embeddings, L2-normalize, compute full pairwise cosine matrix, extract upper triangle.
| Model | Mean | Std | Median | 1% | 99% |
|---|---|---|---|---|---|
| T5-Small | 0.057 | 0.060 | 0.053 | -0.068 | 0.225 |
| Qwen3.5-0.8B | 0.195 | 0.085 | 0.197 | -0.016 | 0.408 |
| Qwen3.5-4B | 0.142 | 0.078 | 0.139 | -0.029 | 0.356 |
**Finding:** T5 is near-orthogonal (span corruption objective). Qwen has positive bias (autoregressive next-token prediction pushes shared "being a token" component).
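The II.2 process can be sketched as follows (a minimal illustration; sampling sizes are the ones stated above):

```python
import numpy as np

def cosine_stats(E, n_sample=5000, seed=0):
    """Sample token embeddings, return (mean, std, median) of the
    off-diagonal pairwise cosine similarities."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(E.shape[0], size=min(n_sample, E.shape[0]), replace=False)
    X = E[idx]
    X = X / np.linalg.norm(X, axis=1, keepdims=True)  # L2-normalize rows
    sims = X @ X.T
    vals = sims[np.triu_indices_from(sims, k=1)]      # upper triangle only
    return vals.mean(), vals.std(), np.median(vals)

# random Gaussian embeddings are near-orthogonal: mean cosine ~ 0
rng = np.random.default_rng(1)
E = rng.normal(size=(2000, 128))
mean, std, med = cosine_stats(E, n_sample=1000)
print(mean)  # near 0
```

A positive mean (as in the Qwen rows) indicates a shared component across tokens; near-zero means the space is close to isotropic.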
### II.3 Embedding Norm Distribution
**Formula:** ‖eᵢ‖₂ = √(Σⱼ eᵢⱼ²)
| Model | Mean Norm | Std | Min | Max |
|---|---|---|---|---|
| T5-Small | 520.15 | 69.84 | 243.31 | 1333.61 |
| Qwen3.5-0.8B | 0.627 | 0.062 | 0.347 | 1.057 |
| Qwen3.5-4B | 0.656 | 0.067 | 0.400 | 1.091 |
**Note:** T5 embeddings are unnormalized (large magnitudes). Qwen embeddings are near-unit norm.
---
## III. Simplex Geometry Metrics
### III.1 Pentachoron Volume (Cayley-Menger Determinant)
**Formula:** For 5 points P₀…P₄, construct the bordered distance matrix:
```
D = | 0    1     1     1     1     1    |
    | 1    0     d₀₁²  d₀₂²  d₀₃²  d₀₄² |
    | 1    d₀₁²  0     d₁₂²  d₁₃²  d₁₄² |
    | 1    d₀₂²  d₁₂²  0     d₂₃²  d₂₄² |
    | 1    d₀₃²  d₁₃²  d₂₃²  0     d₃₄² |
    | 1    d₀₄²  d₁₄²  d₂₄²  d₃₄²  0    |

Vol² = (-1)⁵ · det(D) / (2⁴ · (4!)²) = -det(D) / 9216
Vol = √(Vol²) if Vol² > 0, else invalid
```
**Process:** Sample 1000 random 5-token subsets. Compute Cayley-Menger volume for each. Report CV (coefficient of variation = std/mean).
| Model | Valid/1000 | CV | Embed/Random Ratio |
|---|---|---|---|
| T5-Small | 1000 | **0.233** | 0.855 |
| Qwen3.5-0.8B | 1000 | **0.208** | 0.984 |
| Qwen3.5-4B | 1000 | **0.222** | 0.988 |
**Finding:** CV 0.20–0.23 is a universal attractor. All models pack simplices with similar evenness regardless of architecture, scale, or training data. The "pentachoron packing constant."
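The Cayley-Menger computation above, as a direct NumPy sketch (same bordered-matrix construction and 9216 normalizer):

```python
import numpy as np

def pentachoron_volume(P):
    """4-simplex volume of 5 points P [5, dim] via the Cayley-Menger determinant."""
    d2 = np.sum((P[:, None, :] - P[None, :, :]) ** 2, axis=-1)  # squared distances
    D = np.ones((6, 6))
    D[0, 0] = 0.0
    D[1:, 1:] = d2                      # diagonal of d2 is already zero
    vol2 = -np.linalg.det(D) / 9216.0   # 2^4 * (4!)^2 = 9216
    return np.sqrt(vol2) if vol2 > 0 else None

# regular 4-simplex: vertices are the standard basis of R^5 (edge length sqrt(2));
# its volume is sqrt(5)/24
P = np.eye(5)
v = pentachoron_volume(P)
print(v)  # ~0.09317 = sqrt(5)/24
```

The CV statistic is then just `std/mean` over the volumes of the 1000 sampled 5-token subsets.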
### III.2 Cross-Model Relational Structure
**Formula:** For shared tokens between two models, compute pairwise cosine matrices in each model's embedding space. Pearson correlation between flattened upper triangles measures relational preservation.
**Process (Qwen 0.8B vs 4B):** PCA 4B embeddings (2560→1024), Procrustes alignment using 10K anchor tokens, evaluate on 5K held-out tokens.
| Comparison | Relational Pearson | Pentachoron per-simplex corr |
|---|---|---|
| Qwen 0.8B vs 4B (raw) | 0.920 | 0.89 |
**Finding:** Models at different scales learn the same relational geometry (r=0.92).
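The core of the III.2 comparison is correlating the two cosine structures over shared tokens. A sketch without the PCA/Procrustes alignment step (which only matters for the per-simplex comparison, not the rotation-invariant relational Pearson):

```python
import numpy as np

def relational_pearson(EA, EB, n_sample=500, seed=0):
    """Pearson correlation between the pairwise-cosine structures of two
    embedding spaces over the same (shared) tokens, row-aligned."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(EA.shape[0], size=min(n_sample, EA.shape[0]), replace=False)

    def cosines(E):
        X = E[idx]
        X = X / np.linalg.norm(X, axis=1, keepdims=True)
        S = X @ X.T
        return S[np.triu_indices_from(S, k=1)]  # flattened upper triangle

    return np.corrcoef(cosines(EA), cosines(EB))[0, 1]

# a rotated copy preserves all pairwise cosines exactly, so r = 1
rng = np.random.default_rng(2)
E = rng.normal(size=(1000, 64))
Q, _ = np.linalg.qr(rng.normal(size=(64, 64)))  # random rotation
r = relational_pearson(E, E @ Q)
print(round(r, 4))  # 1.0
```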
---
## IV. Semantic Structure Metrics
### IV.1 Digit Manifold
**Formula:** For digit tokens '0'–'9', compute all 45 pairwise cosines. Measure Pearson correlation between |i−j| (numerical distance) and cosine similarity.
| Model | \|i−j\| Correlation | Adjacent Mean | Non-Adjacent Mean | Gap |
|---|---|---|---|---|
| T5-Small | -0.575 | 0.622 | 0.442 | 0.180 |
| Qwen3.5-0.8B | -0.862 | 0.769 | 0.678 | 0.091 |
| Qwen3.5-4B | -0.871 | 0.790 | 0.731 | 0.059 |
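The IV.1 metric as a sketch (a toy "number line" embedding stands in for real digit-token rows, which would be looked up from the model's embedding table):

```python
import numpy as np

def digit_manifold_corr(digit_embs):
    """Correlation between numeric distance |i-j| and cosine similarity
    for the ten digit embeddings [10, dim]. Negative = ordered manifold."""
    X = digit_embs / np.linalg.norm(digit_embs, axis=1, keepdims=True)
    S = X @ X.T
    dists, cosines = [], []
    for i in range(10):
        for j in range(i + 1, 10):       # 45 unordered pairs
            dists.append(abs(i - j))
            cosines.append(S[i, j])
    return np.corrcoef(dists, cosines)[0, 1]

# toy embeddings on a line: nearby digits get higher cosine
line = np.array([[1.0, 0.1 * i] for i in range(10)])
c = digit_manifold_corr(line)
print(c < 0)  # True: numeric distance anti-correlates with cosine
```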
### IV.2 Semantic Category Clustering (T5-Small)
**Formula:** Mean intra-category pairwise cosine vs global mean pairwise cosine. Lift = intra − global.
| Category | N tokens | Intra Cosine | Global | Lift |
|---|---|---|---|---|
| numbers | 9 | 0.497 | 0.057 | +0.440 |
| colors | 10 | 0.421 | 0.057 | +0.365 |
| time | 10 | 0.351 | 0.057 | +0.294 |
| food | 10 | 0.248 | 0.057 | +0.191 |
| animals | 12 | 0.241 | 0.057 | +0.184 |
| body | 10 | 0.216 | 0.057 | +0.159 |
| emotions | 10 | 0.197 | 0.057 | +0.141 |
| actions | 9 | 0.183 | 0.057 | +0.126 |
---
## V. Encoder Transformation Metrics (T5-Small)
### V.1 Layer-by-Layer Geometry
**Process:** Feed 10 diverse sentences through encoder, capture hidden states at each layer. Measure mean norm and mean pairwise cosine between token positions.
| Layer | Mean Norm | Pairwise Cosine |
|---|---|---|
| 0 (embed) | 377.3 | 0.052 |
| 1 | 761.6 | 0.278 |
| 2 | 1092.6 | 0.330 |
| 3 | 1428.8 | 0.367 |
| 4 | 1829.1 | 0.382 |
| 5 | 2378.3 | 0.419 |
| 6 (post-LN) | 3.3 | 0.211 |
**Finding:** Norms balloon through depth, and the final LayerNorm crushes them to ~3. Pairwise cosine increases monotonically: tokens become MORE similar through depth. The encoder is a convergence funnel.
### V.2 WordNet Relational Alignment
**Process:** Encode 9,362 WordNet definitions via "summarize: {definition}". Mean-pool encoder output. Compare pairwise cosine to WordNet path similarity.
| Representation | Pearson | Spearman |
|---|---|---|
| Static embeddings | 0.078 | 0.015 |
| Encoder output | 0.095 | 0.081 |
**50-seed stability (encoder):** Pearson 0.100 ± 0.008, Spearman 0.090 ± 0.010, CV 0.204 ± 0.006.
### V.3 Encoder Distance Bands
| WN Similarity Band | N pairs | Static Cosine | Encoder Cosine | Lift |
|---|---|---|---|---|
| [0.50, 0.90) | 23 | 0.244 | 0.728 | +0.484 |
| [0.25, 0.50) | 53,112 | 0.077 | 0.573 | +0.496 |
| [0.10, 0.25) | 145,035 | 0.060 | 0.565 | +0.505 |
| [0.05, 0.10) | 295,680 | 0.061 | 0.553 | +0.492 |
### V.4 Hypernym Chain Decay
| Depth | Static Cosine | Encoder Cosine |
|---|---|---|
| 1 | 0.160 | 0.656 |
| 3 | 0.075 | 0.594 |
| 5 | 0.069 | 0.585 |
| 7 | 0.068 | 0.579 |
---
## VI. Cross-Architecture Inactive Weight Topology
### VI.1 Q/K/V Sparsity (<0.1 threshold)
**Formula:** Fraction of |wᵢⱼ| < 0.1 across all weights of that type.
**Process:** Iterate all 2D weight matrices, compute abs values, count below threshold. No inference needed.
| Model | Q | K | V | O | MLP | Full Model |
|---|---|---|---|---|---|---|
| **T5-Small** (512d, 6L) | **93.7%** | 19.2% | 12.1% | 10.4% | 11.9% | 18.4% |
| **T5-Base** (768d, 12L) | **99.4%** | 30.0% | 16.2% | 13.5% | 16.9% | 27.9% |
| **T5-v1.1-XXL** (4096d, 24L) | **100.0%** | **65.5%** | 73.1% | 65.4% | ~57% | – |
| BERT-large (1024d, 24L) | 99.1% | 99.1% | 99.9% | 99.9% | 99.4% | 99.3% |
| DINOv2-large (1024d, 24L) | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% |
| CLIP-ViT-B/16 (768d, 12L) | – (fused) | – | – | – | 100.0% | 100.0% |
| CLIP-ViT-bigG (1664d, 48L) | – (fused) | – | – | – | ~97% | 98.0% |
**Key Finding: T5 Q/K Asymmetry Scales**
| Model | Q (<0.1) | K (<0.1) | Q/K Ratio |
|---|---|---|---|
| T5-Small | 93.7% | 19.2% | **4.9×** |
| T5-Base | 99.4% | 30.0% | **3.3×** |
| T5-v1.1-XXL | 100.0% | 65.5% | **1.5×** |
T5 has a genuine Q-specific sparsity that scales with model size. Q hit 100.0% at XXL (every single weight below 0.1). This is NOT the BERT/DINOv2 pattern where all weight types are uniformly sparse. The query projection in T5 is **functionally vestigial at scale**.
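The VI.1 sparsity scan is a pure weight-space count with no inference. A minimal NumPy sketch (illustrative, not the original tooling; in practice `W` would come from a checkpoint's 2D weight matrices grouped by projection type):

```python
import numpy as np

def sparsity_below(W, threshold=0.1):
    """Fraction of entries with |w_ij| < threshold -- the 'inactive weight'
    measure used throughout Section VI."""
    return np.mean(np.abs(W) < threshold)

# example: weights drawn at scale 0.05 put ~95% of entries below 0.1
rng = np.random.default_rng(3)
W = rng.normal(scale=0.05, size=(512, 512))
frac = sparsity_below(W)
print(frac)  # ~0.95
```

Note the caveat this implies: a "100% sparse" Q at threshold 0.1 means small-magnitude weights, not literally zero weights.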
**T5-v1.1-XXL Encoder vs Decoder:**
| Component | Encoder | Decoder |
|---|---|---|
| self_attn_q | 100.0% | 100.0% |
| self_attn_k | 71.7% | 59.4% |
| self_attn_v | 76.0% | 70.1% |
| cross_attn_q | – | 100.0% |
| cross_attn_k | – | 63.1% |
| cross_attn_v | – | 71.1% |
Q is 100% sparse everywhere: self-attention and cross-attention, encoder and decoder.
### VI.2 SVD Effective Rank
**Formula:** Stable rank = ‖W‖²_F / ‖W‖²₂ = Σσᵢ² / σ₁². Measures effective rank without thresholding.
| Weight Type | T5-Small | T5-Base | T5-v1.1-XXL | BERT-large | DINOv2-large |
|---|---|---|---|---|---|
| self_attn_q | 47.6 | 58.1 | 96.8 | 50.8 | 57.7 |
| self_attn_k | 53.2 | 62.4 | 90.0 | 37.7 | 55.5 |
| self_attn_v | 75.3 | 97.5 | 204.4 | 113.0 | 94.8 |
| self_attn_o | 25.4 | 35.0 | 16.4 | 125.0 | 85.6 |
| mlp_up/gate | 15.2 | 20.6 | 67.9 (gate) / 247.3 (up) | 27.4 | 58.4 |
| mlp_down | 31.3 | 43.9 | 25.3 | 52.2 | 94.4 |
**T5-v1.1-XXL O matrices have very low stable rank (16.4)** – the output projection is extremely low-rank despite the 4096-d space. Cross-attention O is even lower at 6.1.
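The stable-rank formula above, sketched directly (singular values from `numpy.linalg.svd` come sorted descending, so `s[0]` is σ₁):

```python
import numpy as np

def stable_rank(W):
    """||W||_F^2 / ||W||_2^2 = sum(s_i^2) / s_1^2 -- a thresholding-free
    effective-rank proxy for any 2D weight matrix."""
    s = np.linalg.svd(W, compute_uv=False)
    return (s ** 2).sum() / s[0] ** 2

# sanity check: an exactly rank-1 matrix has stable rank 1
u = np.arange(1, 11, dtype=float).reshape(-1, 1)
sr = stable_rank(u @ u.T)
print(sr)  # 1.0
```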
### VI.3 QK Similarity Manifold
**Formula:** QK = W_Q · W_Kᵀ. Eigendecompose the symmetric part (QK + QKᵀ)/2. Positive eigenvalues = attraction directions. Negative eigenvalues = repulsion directions.
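This can be sketched as (a minimal illustration; real use would loop this over each layer's Q and K projection matrices):

```python
import numpy as np

def qk_positive_fraction(Wq, Wk):
    """Fraction of positive eigenvalues of the symmetrized QK interaction.

    Symmetrizing (QK + QK.T)/2 makes the spectrum real; > 0.5 means
    attraction-dominated, < 0.5 repulsion-dominated.
    """
    QK = Wq @ Wk.T
    sym = (QK + QK.T) / 2
    eig = np.linalg.eigvalsh(sym)
    return float(np.mean(eig > 0))

# independent random projections sit near the 0.500 equilibrium,
# which is the baseline the trained models are compared against
rng = np.random.default_rng(4)
d = 256
frac = qk_positive_fraction(rng.normal(size=(d, d)), rng.normal(size=(d, d)))
print(frac)  # close to 0.5
```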
**Positive Eigenvalue Fraction Trends:**
| Model | First Layer | Last Layer | Trend |
|---|---|---|---|
| T5-Small encoder | 0.615 | 0.535 | **−0.080** (decreasing) |
| T5-v1.1-XXL encoder | 0.510 | 0.503 | **−0.007** (flat) |
| T5-v1.1-XXL decoder self | 0.501 | 0.548 | **+0.047** (increasing) |
| **T5-v1.1-XXL cross-attn** | **0.500** | **0.500** | **0.000 (locked)** |
| BERT-large | 0.446 | 0.513 | +0.066 (increasing) |
| CLIP-ViT-B/16 | 0.503 | 0.538 | +0.035 (increasing) |
| DINOv2-large | 0.498 | 0.548 | +0.050 (increasing) |
| CLIP-ViT-bigG | 0.498 | 0.582 | +0.084 (increasing) |
**Critical Finding ā Cross-Attention is Perfectly Balanced:**
T5-v1.1-XXL cross-attention QK manifold is exactly 0.500 positive / 0.500 negative at ALL 24 layers. Symmetry deviation is 1.414 (= √2) everywhere. This is a locked equilibrium – the bridge between encoder and decoder maintains perfect balance between attraction and repulsion at every depth. No other attention type shows this level of stability.
**T5-v1.1-XXL encoder self-attention is flat (~0.50 throughout).** Unlike T5-Small which decreased from 0.615 to 0.535, the XXL encoder stays near the equilibrium point. The larger model doesn't need to build anti-similarity boundaries because it has enough capacity to discriminate through other mechanisms.
**BERT starts BELOW 0.50 (0.446).** The only model with majority-repulsion from layer 0. MLM bidirectional training creates fundamentally different QK geometry from autoregressive or contrastive training.
### VI.4 MLP Dead Neurons
**Formula:** Combined importance = ‖wᵢ_up‖₂ · ‖wᵢ_down‖₂ (ReLU) or ‖wᵢ_gate‖₂ · ‖wᵢ_up‖₂ · ‖wᵢ_down‖₂ (GeGLU). Dead if < 1% of mean.
| Model | Dead (<1% mean) | Weak (<10% mean) | Notes |
|---|---|---|---|
| T5-Small (enc+dec) | 0/24,576 (0.00%) | 0/24,576 (0.00%) | All neurons alive |
| T5-Base (enc+dec) | 0/73,728 (0.00%) | 0/73,728 (0.00%) | All neurons alive |
| T5-v1.1-XXL encoder | 0/245,760 (0.00%) | 0/245,760 (0.00%) | All neurons alive |
| T5-v1.1-XXL decoder | **14/245,760 (0.01%)** | **461/245,760 (0.19%)** | First dead neurons in T5 family |
| BERT-large | 0/98,304 (0.00%) | 0/98,304 (0.00%) | All neurons alive |
| DINOv2-large | 0/98,304 (0.00%) | 0/98,304 (0.00%) | All neurons alive |
| CLIP-ViT-B/16 | **1,316/36,864 (3.57%)** | 1,356/36,864 (3.68%) | Only model with significant dead neurons |
| CLIP-ViT-bigG | 0/393,216 (0.00%) | **24,163/393,216 (6.14%)** | 0 dead but 6% weak |
**Finding:** T5-v1.1-XXL decoder has the first dead neurons in the T5 family – 14 neurons in layers 1-2 only. The decoder's early GeGLU layers carved out a tiny amount of capacity. Encoder uses everything. CLIP-ViT-B/16 is the outlier with 3.6% dead neurons – contrastive training at small scale produces genuine pruning.
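The VI.4 importance score, sketched for the ReLU case (the matrix layouts `W_up [d_ff, d_model]` / `W_down [d_model, d_ff]` are assumptions matching common checkpoint conventions; the GeGLU case just adds a third norm factor):

```python
import numpy as np

def dead_neurons(W_up, W_down, dead_frac=0.01):
    """Count neurons whose combined importance ||up row i|| * ||down col i||
    falls below dead_frac of the mean importance."""
    imp = np.linalg.norm(W_up, axis=1) * np.linalg.norm(W_down, axis=0)
    return int(np.sum(imp < dead_frac * imp.mean()))

rng = np.random.default_rng(5)
W_up = rng.normal(size=(2048, 512))
W_down = rng.normal(size=(512, 2048))
W_up[0] *= 1e-6                      # kill one neuron's input weights
n_dead = dead_neurons(W_up, W_down)
print(n_dead)  # 1
```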
### VI.5 Cross-Layer Weight Correlation
**Formula:** cos(flatten(Wįµ¢), flatten(Wā±¼)) between weight matrices of the same type at different layers.
| Model | Q adj mean | K adj mean | MLP_up adj mean |
|---|---|---|---|
| T5-Small | ~0.000 | ~0.000 | 0.031–0.045 |
| T5-Base | ~0.000 | ~0.000 | 0.024–0.036 |
| T5-v1.1-XXL encoder | 0.0001 | – | – |
| T5-v1.1-XXL decoder | −0.0001 | – | – |
| BERT-large | 0.0002 | 0.0003 | 0.032 |
| CLIP-ViT-B/16 | −0.0004 (QKV) | – | 0.008 |
| DINOv2-large | −0.0003 | −0.0002 | 0.006 |
| CLIP-ViT-bigG | 0.0000 (QKV) | – | 0.055 |
**Universal finding:** Attention weights (Q, K, V) are completely uncorrelated across layers (~0.000). Every layer defines an independent similarity function. MLP weights show positive correlation decaying with distance – feedforward layers share structure.
### VI.6 Position Bias Topology
**T5 uses learned relative position biases:** [32 buckets × N_heads].
| Model | Encoder | Decoder |
|---|---|---|
| T5-Small (8 heads) | 3 local, 2 global, 3 mixed | 4 local, 4 global, 0 mixed |
| T5-Base (12 heads) | 4 local, 3 global, 5 mixed | 5 local, 4 global, 3 mixed |
| T5-v1.1-XXL (64 heads) | **24 local, 2 global, 38 mixed** | **27 local, 37 global, 0 mixed** |
**T5-v1.1-XXL position findings:**
- Encoder: 38/64 mixed heads – nuanced position sensitivity at scale
- **Decoder: ZERO mixed heads** – perfect binary crystallization. Every head is either pure local or pure global
- Decoder is 58% global (37/64) – overwhelmingly biased toward long-range attention
- Encoder range: [−47.2, 11.2] – strong local suppression
- Decoder range: [−28.4, 17.0] – more balanced
**Finding:** The decoder local/global binary split is scale-invariant (0 mixed at T5-Small, 0 mixed at XXL). Gradient descent crystallizes decoder position heads into two pure modes regardless of capacity.
---
## VII. Geometric Residual Modulator
### VII.1 Architecture
- Geometric embedding: [vocab_size, 64] – per-token geometric fingerprint
- Projection: Linear(64, d_model, bias=False) – Procrustes-aligned to encoder PCA space
- Alpha: per-layer learnable LERP coefficient, stored in logit space, applied via sigmoid
- Intervention: residual_out = (1 − α) · residual + α · proj(geo_embed(token_ids))
- Params: 2.09M (3.45% of T5-Small)
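The intervention above, as a shape-level NumPy sketch (a stand-in for the actual trained PyTorch module; all tensor names and sizes here are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def modulate_residual(residual, token_ids, geo_embed, proj, alpha_logit):
    """residual_out = (1 - a) * residual + a * proj(geo_embed[token_ids]),
    with the per-layer LERP coefficient a stored in logit space."""
    a = sigmoid(alpha_logit)               # learnable scalar, one per layer
    geo = geo_embed[token_ids] @ proj      # [seq, 64] @ [64, d_model]
    return (1.0 - a) * residual + a * geo

rng = np.random.default_rng(6)
vocab, d_model = 100, 32
geo_embed = rng.normal(size=(vocab, 64))
proj = rng.normal(size=(64, d_model))
residual = rng.normal(size=(5, d_model))
out = modulate_residual(residual, np.array([1, 2, 3, 4, 5]),
                        geo_embed, proj, alpha_logit=-4.6)  # sigmoid(-4.6) ~ 0.01
print(out.shape)  # (5, 32)
```

Storing alpha in logit space keeps the effective coefficient in (0, 1) without clamping during training.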
### VII.2 Geometric Embedding Initialization
| Metric | Value |
|---|---|
| WN reconstruction correlation | 0.921 |
| Procrustes alignment cosine | 0.372 |
| Eigenvalue cumulative (top 64) | 61.3% |
### VII.3 Alpha Convergence
| Start α | Final Mean α | Layer 5 Final | Pearson Δ | CV | Coherent | Basin |
|---|---|---|---|---|---|---|
| 0.01 (20 ep) | **0.067** | **0.107** | **+0.151** | **0.220** | **Yes** | Binding |
| 0.20 (20 ep) | 0.222 | 0.308 | +0.085 | 0.452 | No | Ridge |
| 0.70 (20 ep) | 0.695 | 0.640 | -0.029 | 0.482 | No | Separation |
| 0.01 (100 ep) | 0.125 | 0.218 | +0.074 | 0.322 | No | Overfit |
### VII.4 Depth Gradient (Consistent Across All Runs)
| Layer | 20ep (α=0.01) | 100ep (α=0.01) | 20ep (α=0.20) |
|---|---|---|---|
| 0 | 0.015 | 0.035 | 0.170 |
| 1 | 0.052 | 0.061 | 0.180 |
| 2 | 0.066 | 0.102 | 0.227 |
| 3 | 0.080 | 0.137 | 0.197 |
| 4 | 0.080 | 0.197 | 0.248 |
| 5 | 0.107 | 0.218 | 0.308 |
**Finding:** Always monotonically increasing. The model wants minimal geometric modulation early and maximum modulation at the deepest layer. Geometry is a final correction, not an initial condition.
### VII.5 Best Result
| Metric | Original | Modulated (20ep, α=0.01 start) | Change |
|---|---|---|---|
| WordNet Pearson | 0.099 | **0.250** | **+152%** |
| WordNet Spearman | 0.085 | **0.245** | **+189%** |
| Semantic Gradient | 0.022 | **0.052** | **+132%** |
| Pentachoron CV | 0.202 | **0.220** | Stayed in band |
| Per-token Preservation | – | 0.730 | – |
| Coherence | Baseline | **Identical on 4/4 tests** | – |
---
## VIII. Geometric Field Modulator (Multi-Expert)
### VIII.1 Architecture
- Three KSimplexChannel experts: k=1 (edge, 2 features), k=2 (triangle, 4 features), k=4 (pentachoron, 11 features)
- **Multiplicative gating**: residual × Π(blended_gates) – valid regions pass, invalid suppressed
- **Soft blending**: per expert gate = (1 − α) + α × expert_gate
- **Null space**: 25% of residual dimensions untouched by modulator
- **Alpha clamped**: [0.001, 0.35] – hard ceiling below the phase boundary
- **Gradient scaling**: geometric params at 10% LR, alpha at 50% LR, gates at full LR
- Params: **38,552** (0.064% of T5-Small)
- Self-test: validity=0.985, null space preserved, template volumes sane
### VIII.2 Design Rationale (Grounded in Cross-Architecture Data)
| Data Point | Design Decision |
|---|---|
| Q sparsity 100% at scale | Geometric field can replace Q ā the model barely uses it |
| Cross-attn QK locked at 0.500 | Target equilibrium for geometric validity gating |
| Depth gradient always increasing | Per-layer alpha respects this (low early, high late) |
| Zero dead MLP neurons | Don't touch MLPs ā all capacity is in use |
| Decoder position: binary L/G split | Modulator preserves positional structure (null space) |
| CV 0.20ā0.23 universal | CV monitoring as health check, not loss |
---
## IX. The 0.29154 Constant
### IX.1 Observations Across Systems
| System | Context | Value |
|---|---|---|
| MinimalShunts | CLIP-L → CLIP-G projection gate | Emergent equilibrium |
| Wormhole Lambda | Vision transformer training | Converges from 0.74 toward ~0.29 |
| Alpha curriculum | Devil's Staircase PE training | Converges to ~0.50 under geometric loss, CE destroys |
| T5 generation | Greedy decode alpha sweep | Stable plateau at 0.291–0.292, semantic phase transition |
| Alpha training basins | 0.70 start → settled at 0.695 | Mirror constant 1 − 0.29154 = 0.70846, Δ = 0.013 |
### IX.2 T5 Generation Phase Transition
| Alpha | Output (triangle prompt) |
|---|---|
| 0.01–0.10 | "...three edges and three vertices. it is one of the basic shapes in geometry." |
| 0.20 | "**a** triangle is a polygon with three edges and three vertices..." |
| 0.28 | "a polygon with three vertices. it is one of the basic shapes in **a graph**." |
| 0.291 | "a triangle is a polygon with a vertice and a vertice. it is one of the basic shapes in **a graph**." |
| 0.2915 | "a triangle is a polygon with a vertice and a vertice. it is one of the basic shapes in **a graph**." |
| 0.292 | "a triangle is a polygon with a vertice and a vertice. it is one of the basic shapes in **the world**." |
| 0.30 | "a polygon with a vertice and a vertice. it is one of the basic shapes in the world." |
**Finding:** 0.29154 marks the phase boundary between structural representation ("graph") and physical representation ("world"). Output is invariant to perturbation in a narrow band centered on the constant.
---
## X. Universal Geometric Constants
| Constant | Value | Observed In |
|---|---|---|
| Pentachoron CV | 0.20–0.23 | T5-Small, Qwen 0.8B, Qwen 4B, trained modulator |
| Participation / dim | 0.53–0.56 | T5-Small, Qwen 0.8B |
| Binding/separation constant | 0.29154 / 0.70846 | MinimalShunts, CLIP projections, T5 generation, alpha convergence |
| Depth gradient | Monotonic increasing | All modulator training runs |
| Q sparsity scaling (T5) | 93.7% → 99.4% → 100.0% | T5-Small → T5-Base → T5-v1.1-XXL |
| Q sparsity asymmetry | **T5 pretraining only** | Present in T5, absent in T5Gemma2, BERT, DINOv2, UNets, VAEs |
| Cross-modal QK balance | **Locked at 0.500** | T5-v1.1-XXL cross-attn, T5Gemma2 (both), SD 1.5 UNet, SDXL UNet (6 models) |
| Self-attn QK: adapted models | **Locked at 0.500** | T5Gemma2 1B (all 53 layers), T5Gemma2 4B (all 68 layers) |
| UNet QK U-gradient | down→repulsion, up→attraction | SD 1.5 (0.451→0.581), SDXL (0.477→0.549) |
| VAE decoder QK | Repulsion-biased | SD 1.5 (0.486), SDXL (0.416), Flux.1 (0.451), Flux.2 (0.416) |
| Attention cross-layer corr | ~0.000 | ALL 17 models, including UNets and VAEs |
| Conv cross-layer corr | ~0.000 | All UNets and VAEs (extends to pure convnets) |
| MLP/FF full utilization | 0.00% dead | T5 family (enc), BERT, DINOv2, UNets, all VAEs |
| Decoder position crystallization | 0 mixed heads | T5-Small, T5-v1.1-XXL |
| VAE spectral invariant | Pearson 0.94–0.98 | All 6 VAE pairs – SV distribution is architecture-determined |
| VAE Procrustes alignment | 70–76% cosine | All 6 pairs – same solution in different coordinate systems |
---
## XI. Measurement Toolkit Reference
| Tool | Input | Output | Requires Inference |
|---|---|---|---|
| Participation Ratio | Embedding matrix | Effective dimensionality | No |
| Cayley-Menger Volume | 5-point subsets of embeddings | Simplex volume + CV | No |
| Pairwise Cosine | Embedding matrix (sampled) | Similarity distribution | No |
| Digit Manifold | 10 digit token embeddings | \|i−j\| correlation, adjacency gap | No |
| SVD Effective Rank | Any 2D weight matrix | Stable rank, condition number | No |
| QK Manifold | W_Q, W_K matrices | Eigenspectrum, pos/neg balance | No |
| Dead Neuron Count | MLP wi/gate/up, wo matrices | Combined importance distribution | No |
| Cross-Layer Correlation | Same-type weight matrices | Adjacent cosine similarity | No |
| Position Bias Topology | Relative attention bias tensor | Local/global/mixed head counts | No |
| Sparsity Topology | Any weight matrix | Fraction below threshold | No |
| WordNet Relational | Encoder output (mean-pooled) | Pearson/Spearman vs path similarity | Yes |
| Alpha Convergence | Modulator training loop | Per-layer equilibrium values | Yes (training) |
---
## XII. T5Gemma2 ā Decoder-Adapted Encoder-Decoder
**Architecture:** Gemma 2 decoder weights adapted to encoder-decoder. GQA (grouped query attention), RoPE, GeGLU MLPs. Multimodal (ViT in encoder).
### XII.1 Sparsity
| Model | Q (<0.1) | K (<0.1) | V (<0.1) | Pattern |
|---|---|---|---|---|
| T5Gemma2 1B-1B | 100.0% | 99.9% | 100.0% | **Uniform** |
| T5Gemma2 4B-4B | 100.0% | 100.0% | 100.0% | **Uniform** |
**Finding:** No Q/K asymmetry. The T5 Q sparsity pattern is ABSENT when the encoder is initialized from decoder weights. The asymmetry is a property of T5's span corruption pretraining, not the encoder-decoder architecture.
### XII.2 QK Manifold
| Model | Encoder Self | Decoder Self | All Layers |
|---|---|---|---|
| T5Gemma2 1B | 0.500 (±0.001) | 0.500 (±0.001) | **Locked** |
| T5Gemma2 4B | 0.500 exact | 0.500 exact | **Locked** |
**Finding:** Perfect 0.500 lock across ALL layers in BOTH encoder and decoder. Symmetry deviation √2 everywhere. The Gemma 2 initialization left the QK matrices near random-matrix equilibrium. The adaptation to encoder-decoder didn't perturb them enough to break Wigner semicircle symmetry.
### XII.3 Other Invariants
- Dead neurons: 0/359,424 (1B), 0/696,320 (4B) – all alive
- Cross-layer Q correlation: ~0.000 – confirmed universal
- MLP utilization: 100% (1 weak neuron each in enc L6 and dec L6 at 4B scale)
- GQA: 4:1 at 1B scale, 2:1 at 4B scale
---
## XIII. Diffusion UNet Weight Topology
### XIII.1 UNet Sparsity
| Model | Self Q | Self K | Self V | Cross Q | Cross K | Cross V |
|---|---|---|---|---|---|---|
| SD 1.5 UNet | **90.5%** | **90.9%** | 97.1% | 96.8% | 94.9% | 98.9% |
| SDXL UNet | 99.9% | 99.9% | 100.0% | 100.0% | 100.0% | 100.0% |
**SD 1.5 is the least sparse model in the entire battery.** 90.5% for self-attention Q – below T5-Small's 93.7%. A parameter-starved model (860M for 512×512 image generation) uses denser weights. SDXL at 3× the params reaches near-100%.
**Sparsity traces the U-path (SD 1.5):** down=88.9%, mid=99.3%, up=89.4%. The bottleneck has the most diffuse weights; the periphery has the densest.
### XIII.2 UNet QK Manifold ā The U-Shape
**Self-attention positive eigenvalue fraction through the UNet path:**
| Position | SD 1.5 | SDXL |
|---|---|---|
| down (early) | 0.509 | ~0.49 |
| down (deep) | **0.451** | **0.483** |
| mid (bottleneck) | **0.483** | **0.477** |
| up (early) | 0.501 | 0.501 |
| up (late) | **0.581** | **0.549** |
The QK manifold traces the U-shape: repulsion-dominated downpath (compressing, discriminating), maximum repulsion at the bottleneck, rising to attraction-dominated uppath (reconstructing, grouping). SD 1.5 shows the wider swing (0.451–0.581, a 0.130 range) because it's more parameter-starved.
**Cross-attention: locked at 0.500 in both UNets.** SD 1.5: mean=0.501, std=0.001. SDXL: mean=0.500, std=0.001. The fifth and sixth confirmations of the cross-modal QK lock.
### XIII.3 Other UNet Invariants
- Dead neurons: 0/23,040 (SD 1.5), 0/163,840 (SDXL)
- Cross-block Q correlation: ~0.000 (both self-attn and cross-attn)
- SDXL cross-attn Q stable rank: 13.97 (lowest of any weight type) – extremely concentrated queries to text
- SDXL cross-attn V: highest stable rank (165.9) and lowest condition number (15.8) – richest value matrices
---
## XIV. VAE Weight Topology
### XIV.1 Cross-VAE Comparison
| VAE | Params | Latent Ch | Enc (<0.1) | Dec (<0.1) | Enc QK pos | Dec QK pos |
|---|---|---|---|---|---|---|
| SD 1.5 | 83.7M | 4 | 98.6% | 99.1% | 0.496 | 0.486 |
| SDXL | 83.7M | 4 | **29.0%** | **38.1%** | 0.502 | **0.416** |
| Flux.1 | 83.8M | 16 | 96.5% | 97.5% | 0.498 | **0.451** |
| Flux.2 | 84.0M | 32 | 94.3% | 94.3% | **0.393** | **0.416** |
**SDXL VAE is the densest model measured.** 29% encoder sparsity at 0.1 threshold. Identical architecture and param count to SD 1.5, but weights are 3× denser. Attention condition numbers reach 1.16M.
### XIV.2 VAE Decoder QK Breaks Toward Repulsion
| VAE | Latent Ch | Decoder QK pos | Interpretation |
|---|---|---|---|
| SD 1.5 | 4 | 0.486 | Slight repulsion |
| SDXL | 4 (1024² target) | **0.416** | Strong repulsion – 4× reconstruction challenge |
| Flux.1 | 16 | **0.451** | Moderate repulsion |
| Flux.2 | 32 | **0.416** | Strong repulsion – most channels to separate |
Decoder bottleneck attention breaks symmetry toward repulsion. Reconstruction requires spatial discrimination: more negative eigenvalues = finer spatial separation. More latent channels or higher target resolution → stronger repulsion.
**Flux.1 decoder anomaly:** Top eigenvalue = 60,807 (typical is 2–150). One attention direction completely dominates. Rank-1 approximation of the attention space.
### XIV.3 VAE Invariants
- Zero dead neurons across all four VAEs
- Conv filter utilization: 100% (active fraction 1.000)
- Cross-layer conv correlation: ~0.000 → universal, extends to pure convnets
- Spectral correlation between VAEs: 0.94–0.98 → architecture determines SV distribution
---
## XV. Procrustes Analysis ā VAE Weight-Space Alignment
### XV.1 Methodology
**Orthogonal Procrustes:** For each common weight matrix (same name, same shape), find the orthogonal R minimizing ‖A − BR‖_F via SVD of B^T A. Report residual (0 = identical up to rotation, √2 = orthogonal) and cosine after alignment.
**Spectral correlation:** Pearson correlation of normalized singular value distributions.
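Both measurements fit in a few lines of NumPy. A minimal sketch, assuming weight tensors have already been flattened to 2-D (not the exact implementation in `procrustes_vae_analysis.py`):

```python
import numpy as np

def procrustes_align(a: np.ndarray, b: np.ndarray):
    """Orthogonal R minimizing ||A - BR||_F, via SVD of B^T A.

    Returns (R, residual, cosine). With A and BR at unit Frobenius norm,
    residual = sqrt(2 - 2*cosine): 0 means identical up to rotation,
    sqrt(2) means orthogonal.
    """
    u, _, vt = np.linalg.svd(b.T @ a)
    r = u @ vt
    br = b @ r
    cos = float((a * br).sum() / (np.linalg.norm(a) * np.linalg.norm(br)))
    return r, float(np.sqrt(max(0.0, 2.0 - 2.0 * cos))), cos

def spectral_corr(a: np.ndarray, b: np.ndarray) -> float:
    """Pearson correlation of the normalized singular value distributions."""
    sa = np.linalg.svd(a, compute_uv=False)
    sb = np.linalg.svd(b, compute_uv=False)
    n = min(len(sa), len(sb))
    return float(np.corrcoef(sa[:n] / sa.sum(), sb[:n] / sb.sum())[0, 1])

# sanity check: a matrix and a rotated copy of itself align perfectly
rng = np.random.default_rng(1)
a = rng.standard_normal((6, 6))
q, _ = np.linalg.qr(rng.standard_normal((6, 6)))
r, resid, cos = procrustes_align(a, a @ q)
```

The sanity check is the whole point of the method: `a` and `a @ q` have near-zero raw cosine for a generic rotation `q`, yet the recovered R restores cosine ≈ 1, which is exactly the raw-vs-aligned gap in the pairwise table.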
### XV.2 Pairwise Results
| Pair | Raw Cosine | Procrustes Cosine | Rotation Gain | Spectral Corr |
|---|---|---|---|---|
| SD1.5 vs SDXL | 0.053 | 0.697 | +0.644 | 0.958 |
| SD1.5 vs Flux.1 | 0.091 | 0.730 | +0.640 | 0.964 |
| **SD1.5 vs Flux.2** | **-0.000** | **0.757** | **+0.757** | **0.979** |
| SDXL vs Flux.1 | 0.024 | 0.675 | +0.650 | 0.939 |
| SDXL vs Flux.2 | -0.001 | 0.705 | +0.705 | 0.937 |
| Flux.1 vs Flux.2 | 0.000 | 0.736 | +0.736 | 0.957 |
### XV.3 Key Findings
**1. Raw cosine is zero.** All pairs. Weights are orthogonal in raw space. Naive comparison says these VAEs share nothing. This is wrong.
**2. After Procrustes rotation, 70–76% of structure aligns.** These models found the SAME geometric solution, expressed in different coordinate systems. Different initialization → different basis → same function.
**3. Spectral correlation is 0.94–0.98.** Singular value distributions are nearly identical across all pairs. The "shape" of each weight matrix (rank structure, energy distribution) is architecture-determined, not training-determined.
**4. SD 1.5 vs Flux.2 is the most alignable pair.** Raw cosine is literally zero, yet it shows the highest Procrustes cosine (0.757) and the highest spectral correlation (0.979). The most different training produces the most alignable weights: shared structure runs deepest where surface differences are greatest.
**5. SDXL is the geometric outlier.** Lowest Procrustes cosine against every other model (0.675–0.705). It found a more distant basin despite being architecturally identical to SD 1.5.
### XV.4 Distance Matrices
**Procrustes Residual (lower = more similar):**
| | SD 1.5 | SDXL | Flux.1 | Flux.2 |
|---|---|---|---|---|
| SD 1.5 | 0.000 | 0.752 | 0.707 | 0.679 |
| SDXL | 0.752 | 0.000 | 0.774 | 0.739 |
| Flux.1 | 0.707 | 0.774 | 0.000 | 0.699 |
| Flux.2 | 0.679 | 0.739 | 0.699 | 0.000 |
**Spectral Correlation (higher = more similar):**
| | SD 1.5 | SDXL | Flux.1 | Flux.2 |
|---|---|---|---|---|
| SD 1.5 | 1.000 | 0.958 | 0.964 | 0.979 |
| SDXL | 0.958 | 1.000 | 0.939 | 0.937 |
| Flux.1 | 0.964 | 0.939 | 1.000 | 0.957 |
| Flux.2 | 0.979 | 0.937 | 0.957 | 1.000 |
### XV.5 Implication for Geometric Transfer
A geometric field modulator trained on one VAE can be ROTATED to work on another via the Procrustes R matrix. 70–76% structural alignment means the modulator captures the shared geometric invariant. The remaining 24–30% is model-specific: the unique basin each training run found.
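A hedged sketch of that transfer, assuming the modulator acts as a linear operator M on VAE feature space (`rotate_modulator` and the shapes here are illustrative, not functions from `geometric_field_modulator.py`):

```python
import numpy as np

def rotate_modulator(m: np.ndarray, r: np.ndarray) -> np.ndarray:
    """Transport an operator M from VAE A's basis to VAE B's basis.

    If B's features are approximately A's features rotated by R (the
    Procrustes solution), conjugation R^T M R re-expresses M in B's
    coordinates while leaving its action unchanged.
    """
    return r.T @ m @ r
```

Conjugation is the natural choice because it is lossless for the aligned 70–76%: rotating into B's basis and back recovers M exactly, and a basis-independent modulator (e.g. the identity) is left untouched.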
---
## XVI. Scripts Reference
| Script | Purpose | Key Outputs |
|---|---|---|
| `probe_t5_small_terrain.py` | T5-Small embedding + layer geometry | PR, CV, digit manifold, layer evolution |
| `probe_t5_wordnet_summarize.py` | T5-Small × WordNet relational alignment | Pearson, Spearman, distance bands, hypernym decay |
| `probe_t5_wordnet_50seeds.py` | 50-seed stability test (GPU-accelerated) | Confidence intervals for all relational metrics |
| `probe_t5_inactive_weights.py` | T5-Small/Base inactive weight topology | SVD, sparsity, QK manifold, dead neurons |
| `cross_architecture_weight_battery.py` | BERT + CLIP + DINOv2 battery | Cross-model comparison table |
| `probe_flux_t5_g4.py` | T5-v1.1-XXL (Flux encoder) full battery | All layers, encoder + decoder + cross-attn |
| `geometric_residual_modulator.py` | LERP modulator + training utilities | Modulator class + measurement tools |
| `geometric_field_modulator.py` | Multi-expert field modulator | KSimplex experts + multiplicative gating |
| `geometric_modulator_full_pipeline.py` | Self-contained T5 + WordNet + modulator | End-to-end pipeline |
| `train_modulator.py` | Training loop for alpha convergence | Freeze T5, train modulator, track alpha |
| `probe_t5gemma2.py` | T5Gemma2 battery (both scales) | GQA handling, adapted enc-dec topology |
| `probe_unet_geometry.py` | SD 1.5 / SDXL UNet battery | U-path QK gradient, cross-attn lock |
| `probe_vae_geometry.py` | All four VAE battery | Conv reshape, bottleneck attention, latent comparison |
| `procrustes_vae_analysis.py` | Pairwise Procrustes on 4 VAEs | Distance matrices, depth profiles, rotation gain |
---
*Last updated: 2026-03-06*
*Models profiled: 17 (T5-Small, T5-Base, T5-v1.1-XXL, BERT-large, CLIP-ViT-B/16, DINOv2-large, CLIP-ViT-bigG, Qwen3.5-0.8B, Qwen3.5-4B, T5Gemma2-1B, T5Gemma2-4B, SD 1.5 UNet, SDXL UNet, SD 1.5 VAE, SDXL VAE, Flux.1 VAE, Flux.2 VAE)*
*Architecture families: 5 (Transformer enc-dec, encoder-only/vision, adapted enc-dec, conv UNet, conv autoencoder)*
*Training objectives: 6 (span corruption, MLM, contrastive, self-supervised, diffusion, reconstruction)*
*Procrustes analysis: 6 VAE pairs, 68 weight matrices each*
*Modulator experiments: 4 LERP configurations, 1 field modulator* |