---
license: mit
---
# Day 2
# Geometric Terrain Statistics Composite
## Document Purpose
Running catalog of geometric measurements across language and vision models. Each metric includes its formula, measurement process, and cross-model results. Designed for expansion as new models and experiments are added.
---
## I. Models Profiled
| Model | Params | Vocab | Hidden Dim | Layers | Heads | Architecture | Training |
|---|---|---|---|---|---|---|---|
| T5-Small | 60.5M | 32,128 | 512 | 6+6 | 8 | Enc-Dec (relative PE, ReLU MLP) | C4 span corruption |
| T5-Base | 222.9M | 32,128 | 768 | 12+12 | 12 | Enc-Dec (relative PE, ReLU MLP) | C4 span corruption |
| T5-v1.1-XXL | 11.4B | 32,128 | 4096 | 24+24 | 64 | Enc-Dec (relative PE, **GeGLU** MLP) | C4 (v1.1 variant, no multi-task) |
| BERT-large | 336.2M | 30,522 | 1024 | 24 | 16 | Encoder-only (absolute PE) | BookCorpus+Wikipedia MLM |
| CLIP-ViT-B/16 | 85.5M (visual) | – | 768 | 12 | 12 | Vision encoder (fused QKV) | LAION-2B contrastive |
| DINOv2-large | 302.0M | – | 1024 | 24 | 16 | Vision encoder (separate Q/K/V) | Self-supervised (no labels) |
| CLIP-ViT-bigG/14 | 1.84B (visual) | – | 1664 | 48 | 16 | Vision encoder (fused QKV) | LAION-2B contrastive |
| Qwen3.5-0.8B | 853M | 248,320 | 1024 | – | – | DeltaNet + MoE + ViT | Multilingual + Vision |
| Qwen3.5-4B | ~4B | 248,320 | 2560 | – | – | DeltaNet + MoE + ViT | Multilingual + Vision |
| T5Gemma2-1B-1B | 2.1B | 262,144 | 1152 | 27+26 | GQA 4:1 | Adapted enc-dec (Gemma 2, RoPE, GeGLU) | Gemma 2 decoder → enc-dec |
| T5Gemma2-4B-4B | 7.5B | 262,144 | 2560 | 34+34 | GQA 2:1 | Adapted enc-dec (Gemma 2, RoPE, GeGLU) | Gemma 2 decoder → enc-dec |
| SD 1.5 UNet | 860M | – | [320,640,1280,1280] | 16 attn blocks | 8 | Conv UNet + self/cross attn | LDM diffusion (LAION) |
| SDXL UNet | 2.6B | – | [320,640,1280] | 70 attn blocks | [5,10,20] | Conv UNet + self/cross attn | LDM diffusion (internal) |
| SD 1.5 VAE | 83.7M | – | 4 latent ch | [128,256,512,512] | – | Conv autoencoder + mid attn | Reconstruction (LAION) |
| SDXL VAE | 83.7M | – | 4 latent ch | [128,256,512,512] | – | Conv autoencoder + mid attn | Reconstruction (internal) |
| Flux.1 VAE | 83.8M | – | 16 latent ch | [128,256,512,512] | – | Conv autoencoder + mid attn | Reconstruction (BFL) |
| Flux.2 VAE | 84.0M | – | 32 latent ch | [128,256,512,512] | – | Conv autoencoder + mid attn | Reconstruction (BFL) |
**Notes:**
- T5-v1.1-XXL encoder is the text encoder used by Flux.1 Schnell, Flux.1 Dev, and Flux.2
- CLIP models use fused QKV (`in_proj_weight`); Q/K/V split by thirds for analysis
- T5-v1.1 uses GeGLU (wi_0 gate + wi_1 value) instead of ReLU (single wi)
- T5Gemma2 models are Gemma 2 decoder weights adapted to encoder-decoder; include ViT vision tower
- UNet attention: attn1 = self-attention (spatial), attn2 = cross-attention (to text encoder)
- VAE Conv2d weights reshaped to 2D as [out_channels, in_channels * kH * kW] for analysis
- VAE attention exists only at the bottleneck (mid_block): one in encoder, one in decoder
---
## II. Embedding Geometry Metrics
### II.1 Participation Ratio (Effective Dimensionality)
**Formula:** PR = (Σλᵢ)² / Σ(λᵢ²), where λᵢ are eigenvalues of the embedding covariance matrix.
**Process:** Center embeddings (subtract mean), compute covariance C = EᵀE / N, eigendecompose. PR counts the effective number of dimensions used. PR/dim normalizes to [0, 1].
| Model | PR | PR / dim | Dims for 95% var |
|---|---|---|---|
| T5-Small (512d) | 287.2 | **0.561** | 379 (74.0%) |
| Qwen3.5-0.8B (1024d) | 547.7 | **0.535** | 893 (87.2%) |
| Qwen3.5-4B (2560d) | 812.4 | **0.317** | 2125 (83.0%) |
**Finding:** PR/dim ≈ 0.53–0.56 for smaller models. Appears to be a universal attractor for embedding dimensionality utilization.
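The II.1 measurement reduces to a few lines of NumPy. A minimal sketch (illustrative, not the original measurement script):

```python
import numpy as np

def participation_ratio(E):
    """Effective dimensionality of an embedding matrix E [n_tokens, dim].

    PR = (sum lambda_i)^2 / sum(lambda_i^2) over covariance eigenvalues.
    """
    X = E - E.mean(axis=0)            # center embeddings
    C = X.T @ X / X.shape[0]          # covariance [dim, dim]
    eig = np.linalg.eigvalsh(C)       # real eigenvalues, ascending
    eig = np.clip(eig, 0.0, None)     # clamp tiny numerical negatives
    return eig.sum() ** 2 / (eig ** 2).sum()

# sanity check: isotropic Gaussian data uses nearly all dimensions
rng = np.random.default_rng(0)
E = rng.normal(size=(5000, 64))
pr = participation_ratio(E)
print(pr / 64)  # close to 1 for isotropic data
```

For a trained embedding table, `E` would be the `[vocab, dim]` weight matrix; PR/dim well below 1 indicates anisotropy.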
### II.2 Pairwise Cosine Similarity Distribution
**Formula:** cos(eᵢ, eⱼ) = (eᵢ · eⱼ) / (‖eᵢ‖ · ‖eⱼ‖), sampled over 5K random tokens (12.5M pairs).
**Process:** Random sample 5K token embeddings, L2-normalize, compute full pairwise cosine matrix, extract upper triangle.
| Model | Mean | Std | Median | 1% | 99% |
|---|---|---|---|---|---|
| T5-Small | 0.057 | 0.060 | 0.053 | -0.068 | 0.225 |
| Qwen3.5-0.8B | 0.195 | 0.085 | 0.197 | -0.016 | 0.408 |
| Qwen3.5-4B | 0.142 | 0.078 | 0.139 | -0.029 | 0.356 |
**Finding:** T5 is near-orthogonal (span corruption objective). Qwen has positive bias (autoregressive next-token prediction pushes shared "being a token" component).
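The II.2 process can be sketched as follows (a minimal illustration; sampling sizes are the ones stated above):

```python
import numpy as np

def cosine_stats(E, n_sample=5000, seed=0):
    """Sample token embeddings, return (mean, std, median) of the
    off-diagonal pairwise cosine similarities."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(E.shape[0], size=min(n_sample, E.shape[0]), replace=False)
    X = E[idx]
    X = X / np.linalg.norm(X, axis=1, keepdims=True)  # L2-normalize rows
    sims = X @ X.T
    vals = sims[np.triu_indices_from(sims, k=1)]      # upper triangle only
    return vals.mean(), vals.std(), np.median(vals)

# random Gaussian embeddings are near-orthogonal: mean cosine ~ 0
rng = np.random.default_rng(1)
E = rng.normal(size=(2000, 128))
mean, std, med = cosine_stats(E, n_sample=1000)
print(mean)  # near 0
```

A positive mean (as in the Qwen rows) indicates a shared component across tokens; near-zero means the space is close to isotropic.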
### II.3 Embedding Norm Distribution
**Formula:** ‖eᵢ‖₂ = √(Σⱼ eᵢⱼ²)
| Model | Mean Norm | Std | Min | Max |
|---|---|---|---|---|
| T5-Small | 520.15 | 69.84 | 243.31 | 1333.61 |
| Qwen3.5-0.8B | 0.627 | 0.062 | 0.347 | 1.057 |
| Qwen3.5-4B | 0.656 | 0.067 | 0.400 | 1.091 |
**Note:** T5 embeddings are unnormalized (large magnitudes). Qwen embeddings are near-unit norm.
---
## III. Simplex Geometry Metrics
### III.1 Pentachoron Volume (Cayley-Menger Determinant)
**Formula:** For 5 points P₀…P₄, construct the bordered distance matrix:
```
D = | 0    1     1     1     1     1    |
    | 1    0     d₀₁²  d₀₂²  d₀₃²  d₀₄² |
    | 1    d₀₁²  0     d₁₂²  d₁₃²  d₁₄² |
    | 1    d₀₂²  d₁₂²  0     d₂₃²  d₂₄² |
    | 1    d₀₃²  d₁₃²  d₂₃²  0     d₃₄² |
    | 1    d₀₄²  d₁₄²  d₂₄²  d₃₄²  0    |

Vol² = (-1)⁵ · det(D) / (2⁴ · (4!)²) = -det(D) / 9216
Vol = √(Vol²) if Vol² > 0, else invalid
```
**Process:** Sample 1000 random 5-token subsets. Compute Cayley-Menger volume for each. Report CV (coefficient of variation = std/mean).
| Model | Valid/1000 | CV | Embed/Random Ratio |
|---|---|---|---|
| T5-Small | 1000 | **0.233** | 0.855 |
| Qwen3.5-0.8B | 1000 | **0.208** | 0.984 |
| Qwen3.5-4B | 1000 | **0.222** | 0.988 |
**Finding:** CV 0.20–0.23 is a universal attractor. All models pack simplices with similar evenness regardless of architecture, scale, or training data. The "pentachoron packing constant."
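The Cayley-Menger computation above, as a direct NumPy sketch (same bordered-matrix construction and 9216 normalizer):

```python
import numpy as np

def pentachoron_volume(P):
    """4-simplex volume of 5 points P [5, dim] via the Cayley-Menger determinant."""
    d2 = np.sum((P[:, None, :] - P[None, :, :]) ** 2, axis=-1)  # squared distances
    D = np.ones((6, 6))
    D[0, 0] = 0.0
    D[1:, 1:] = d2                      # diagonal of d2 is already zero
    vol2 = -np.linalg.det(D) / 9216.0   # 2^4 * (4!)^2 = 9216
    return np.sqrt(vol2) if vol2 > 0 else None

# regular 4-simplex: vertices are the standard basis of R^5 (edge length sqrt(2));
# its volume is sqrt(5)/24
P = np.eye(5)
v = pentachoron_volume(P)
print(v)  # ~0.09317 = sqrt(5)/24
```

The CV statistic is then just `std/mean` over the volumes of the 1000 sampled 5-token subsets.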
### III.2 Cross-Model Relational Structure
**Formula:** For shared tokens between two models, compute pairwise cosine matrices in each model's embedding space. Pearson correlation between flattened upper triangles measures relational preservation.
**Process (Qwen 0.8B vs 4B):** PCA 4B embeddings (2560→1024), Procrustes alignment using 10K anchor tokens, evaluate on 5K held-out tokens.
| Comparison | Relational Pearson | Pentachoron per-simplex corr |
|---|---|---|
| Qwen 0.8B vs 4B (raw) | 0.920 | 0.89 |
**Finding:** Models at different scales learn the same relational geometry (r=0.92).
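The core of the III.2 comparison is correlating the two cosine structures over shared tokens. A sketch without the PCA/Procrustes alignment step (which only matters for the per-simplex comparison, not the rotation-invariant relational Pearson):

```python
import numpy as np

def relational_pearson(EA, EB, n_sample=500, seed=0):
    """Pearson correlation between the pairwise-cosine structures of two
    embedding spaces over the same (shared) tokens, row-aligned."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(EA.shape[0], size=min(n_sample, EA.shape[0]), replace=False)

    def cosines(E):
        X = E[idx]
        X = X / np.linalg.norm(X, axis=1, keepdims=True)
        S = X @ X.T
        return S[np.triu_indices_from(S, k=1)]  # flattened upper triangle

    return np.corrcoef(cosines(EA), cosines(EB))[0, 1]

# a rotated copy preserves all pairwise cosines exactly, so r = 1
rng = np.random.default_rng(2)
E = rng.normal(size=(1000, 64))
Q, _ = np.linalg.qr(rng.normal(size=(64, 64)))  # random rotation
r = relational_pearson(E, E @ Q)
print(round(r, 4))  # 1.0
```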
---
## IV. Semantic Structure Metrics
### IV.1 Digit Manifold
**Formula:** For digit tokens '0'–'9', compute all 45 pairwise cosines. Measure Pearson correlation between |i−j| (numerical distance) and cosine similarity.
| Model | \|i−j\| Correlation | Adjacent Mean | Non-Adjacent Mean | Gap |
|---|---|---|---|---|
| T5-Small | -0.575 | 0.622 | 0.442 | 0.180 |
| Qwen3.5-0.8B | -0.862 | 0.769 | 0.678 | 0.091 |
| Qwen3.5-4B | -0.871 | 0.790 | 0.731 | 0.059 |
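The IV.1 metric as a sketch (a toy "number line" embedding stands in for real digit-token rows, which would be looked up from the model's embedding table):

```python
import numpy as np

def digit_manifold_corr(digit_embs):
    """Correlation between numeric distance |i-j| and cosine similarity
    for the ten digit embeddings [10, dim]. Negative = ordered manifold."""
    X = digit_embs / np.linalg.norm(digit_embs, axis=1, keepdims=True)
    S = X @ X.T
    dists, cosines = [], []
    for i in range(10):
        for j in range(i + 1, 10):       # 45 unordered pairs
            dists.append(abs(i - j))
            cosines.append(S[i, j])
    return np.corrcoef(dists, cosines)[0, 1]

# toy embeddings on a line: nearby digits get higher cosine
line = np.array([[1.0, 0.1 * i] for i in range(10)])
c = digit_manifold_corr(line)
print(c < 0)  # True: numeric distance anti-correlates with cosine
```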
### IV.2 Semantic Category Clustering (T5-Small)
**Formula:** Mean intra-category pairwise cosine vs global mean pairwise cosine. Lift = intra − global.
| Category | N tokens | Intra Cosine | Global | Lift |
|---|---|---|---|---|
| numbers | 9 | 0.497 | 0.057 | +0.440 |
| colors | 10 | 0.421 | 0.057 | +0.365 |
| time | 10 | 0.351 | 0.057 | +0.294 |
| food | 10 | 0.248 | 0.057 | +0.191 |
| animals | 12 | 0.241 | 0.057 | +0.184 |
| body | 10 | 0.216 | 0.057 | +0.159 |
| emotions | 10 | 0.197 | 0.057 | +0.141 |
| actions | 9 | 0.183 | 0.057 | +0.126 |
---
## V. Encoder Transformation Metrics (T5-Small)
### V.1 Layer-by-Layer Geometry
**Process:** Feed 10 diverse sentences through encoder, capture hidden states at each layer. Measure mean norm and mean pairwise cosine between token positions.
| Layer | Mean Norm | Pairwise Cosine |
|---|---|---|
| 0 (embed) | 377.3 | 0.052 |
| 1 | 761.6 | 0.278 |
| 2 | 1092.6 | 0.330 |
| 3 | 1428.8 | 0.367 |
| 4 | 1829.1 | 0.382 |
| 5 | 2378.3 | 0.419 |
| 6 (post-LN) | 3.3 | 0.211 |
**Finding:** Norms balloon through depth, and the final LayerNorm crushes them to ~3. Pairwise cosine increases monotonically: tokens become MORE similar through depth. The encoder is a convergence funnel.
### V.2 WordNet Relational Alignment
**Process:** Encode 9,362 WordNet definitions via "summarize: {definition}". Mean-pool encoder output. Compare pairwise cosine to WordNet path similarity.
| Representation | Pearson | Spearman |
|---|---|---|
| Static embeddings | 0.078 | 0.015 |
| Encoder output | 0.095 | 0.081 |
**50-seed stability (encoder):** Pearson 0.100 ± 0.008, Spearman 0.090 ± 0.010, CV 0.204 ± 0.006.
### V.3 Encoder Distance Bands
| WN Similarity Band | N pairs | Static Cosine | Encoder Cosine | Lift |
|---|---|---|---|---|
| [0.50, 0.90) | 23 | 0.244 | 0.728 | +0.484 |
| [0.25, 0.50) | 53,112 | 0.077 | 0.573 | +0.496 |
| [0.10, 0.25) | 145,035 | 0.060 | 0.565 | +0.505 |
| [0.05, 0.10) | 295,680 | 0.061 | 0.553 | +0.492 |
### V.4 Hypernym Chain Decay
| Depth | Static Cosine | Encoder Cosine |
|---|---|---|
| 1 | 0.160 | 0.656 |
| 3 | 0.075 | 0.594 |
| 5 | 0.069 | 0.585 |
| 7 | 0.068 | 0.579 |
---
## VI. Cross-Architecture Inactive Weight Topology
### VI.1 Q/K/V Sparsity (<0.1 threshold)
**Formula:** Fraction of |wᵢⱼ| < 0.1 across all weights of that type.
**Process:** Iterate all 2D weight matrices, compute abs values, count below threshold. No inference needed.
| Model | Q | K | V | O | MLP | Full Model |
|---|---|---|---|---|---|---|
| **T5-Small** (512d, 6L) | **93.7%** | 19.2% | 12.1% | 10.4% | 11.9% | 18.4% |
| **T5-Base** (768d, 12L) | **99.4%** | 30.0% | 16.2% | 13.5% | 16.9% | 27.9% |
| **T5-v1.1-XXL** (4096d, 24L) | **100.0%** | **65.5%** | 73.1% | 65.4% | ~57% | – |
| BERT-large (1024d, 24L) | 99.1% | 99.1% | 99.9% | 99.9% | 99.4% | 99.3% |
| DINOv2-large (1024d, 24L) | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% |
| CLIP-ViT-B/16 (768d, 12L) | – (fused) | – | – | – | 100.0% | 100.0% |
| CLIP-ViT-bigG (1664d, 48L) | – (fused) | – | – | – | ~97% | 98.0% |
**Key Finding: T5 Q/K Asymmetry Scales**
| Model | Q (<0.1) | K (<0.1) | Q/K Ratio |
|---|---|---|---|
| T5-Small | 93.7% | 19.2% | **4.9×** |
| T5-Base | 99.4% | 30.0% | **3.3×** |
| T5-v1.1-XXL | 100.0% | 65.5% | **1.5×** |
T5 has a genuine Q-specific sparsity that scales with model size. Q hit 100.0% at XXL (every single weight below 0.1). This is NOT the BERT/DINOv2 pattern where all weight types are uniformly sparse. The query projection in T5 is **functionally vestigial at scale**.
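The VI.1 sparsity scan is a pure weight-space count with no inference. A minimal NumPy sketch (illustrative, not the original tooling; in practice `W` would come from a checkpoint's 2D weight matrices grouped by projection type):

```python
import numpy as np

def sparsity_below(W, threshold=0.1):
    """Fraction of entries with |w_ij| < threshold -- the 'inactive weight'
    measure used throughout Section VI."""
    return np.mean(np.abs(W) < threshold)

# example: weights drawn at scale 0.05 put ~95% of entries below 0.1
rng = np.random.default_rng(3)
W = rng.normal(scale=0.05, size=(512, 512))
frac = sparsity_below(W)
print(frac)  # ~0.95
```

Note the caveat this implies: a "100% sparse" Q at threshold 0.1 means small-magnitude weights, not literally zero weights.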
**T5-v1.1-XXL Encoder vs Decoder:**
| Component | Encoder | Decoder |
|---|---|---|
| self_attn_q | 100.0% | 100.0% |
| self_attn_k | 71.7% | 59.4% |
| self_attn_v | 76.0% | 70.1% |
| cross_attn_q | – | 100.0% |
| cross_attn_k | – | 63.1% |
| cross_attn_v | – | 71.1% |
Q is 100% sparse everywhere: self-attention and cross-attention, encoder and decoder.
### VI.2 SVD Effective Rank
**Formula:** Stable rank = ‖W‖²_F / ‖W‖²₂ = Σσᵢ² / σ₁². Measures effective rank without thresholding.
| Weight Type | T5-Small | T5-Base | T5-v1.1-XXL | BERT-large | DINOv2-large |
|---|---|---|---|---|---|
| self_attn_q | 47.6 | 58.1 | 96.8 | 50.8 | 57.7 |
| self_attn_k | 53.2 | 62.4 | 90.0 | 37.7 | 55.5 |
| self_attn_v | 75.3 | 97.5 | 204.4 | 113.0 | 94.8 |
| self_attn_o | 25.4 | 35.0 | 16.4 | 125.0 | 85.6 |
| mlp_up/gate | 15.2 | 20.6 | 67.9 (gate) / 247.3 (up) | 27.4 | 58.4 |
| mlp_down | 31.3 | 43.9 | 25.3 | 52.2 | 94.4 |
**T5-v1.1-XXL O matrices have very low stable rank (16.4)** – the output projection is extremely low-rank despite the 4096-d space. Cross-attention O is even lower at 6.1.
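The stable-rank formula above, sketched directly (singular values from `numpy.linalg.svd` come sorted descending, so `s[0]` is σ₁):

```python
import numpy as np

def stable_rank(W):
    """||W||_F^2 / ||W||_2^2 = sum(s_i^2) / s_1^2 -- a thresholding-free
    effective-rank proxy for any 2D weight matrix."""
    s = np.linalg.svd(W, compute_uv=False)
    return (s ** 2).sum() / s[0] ** 2

# sanity check: an exactly rank-1 matrix has stable rank 1
u = np.arange(1, 11, dtype=float).reshape(-1, 1)
sr = stable_rank(u @ u.T)
print(sr)  # 1.0
```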
### VI.3 QK Similarity Manifold
**Formula:** QK = W_Q · W_Kᵀ. Eigendecompose the symmetric part (QK + QKᵀ)/2. Positive eigenvalues = attraction directions. Negative eigenvalues = repulsion directions.
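This can be sketched as (a minimal illustration; real use would loop this over each layer's Q and K projection matrices):

```python
import numpy as np

def qk_positive_fraction(Wq, Wk):
    """Fraction of positive eigenvalues of the symmetrized QK interaction.

    Symmetrizing (QK + QK.T)/2 makes the spectrum real; > 0.5 means
    attraction-dominated, < 0.5 repulsion-dominated.
    """
    QK = Wq @ Wk.T
    sym = (QK + QK.T) / 2
    eig = np.linalg.eigvalsh(sym)
    return float(np.mean(eig > 0))

# independent random projections sit near the 0.500 equilibrium,
# which is the baseline the trained models are compared against
rng = np.random.default_rng(4)
d = 256
frac = qk_positive_fraction(rng.normal(size=(d, d)), rng.normal(size=(d, d)))
print(frac)  # close to 0.5
```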
**Positive Eigenvalue Fraction Trends:**
| Model | First Layer | Last Layer | Trend |
|---|---|---|---|
| T5-Small encoder | 0.615 | 0.535 | **−0.080** (decreasing) |
| T5-v1.1-XXL encoder | 0.510 | 0.503 | **−0.007** (flat) |
| T5-v1.1-XXL decoder self | 0.501 | 0.548 | **+0.047** (increasing) |
| **T5-v1.1-XXL cross-attn** | **0.500** | **0.500** | **0.000 (locked)** |
| BERT-large | 0.446 | 0.513 | +0.066 (increasing) |
| CLIP-ViT-B/16 | 0.503 | 0.538 | +0.035 (increasing) |
| DINOv2-large | 0.498 | 0.548 | +0.050 (increasing) |
| CLIP-ViT-bigG | 0.498 | 0.582 | +0.084 (increasing) |
**Critical Finding ā Cross-Attention is Perfectly Balanced:**
T5-v1.1-XXL cross-attention QK manifold is exactly 0.500 positive / 0.500 negative at ALL 24 layers. Symmetry deviation is 1.414 (= √2) everywhere. This is a locked equilibrium – the bridge between encoder and decoder maintains perfect balance between attraction and repulsion at every depth. No other attention type shows this level of stability.
**T5-v1.1-XXL encoder self-attention is flat (~0.50 throughout).** Unlike T5-Small which decreased from 0.615 to 0.535, the XXL encoder stays near the equilibrium point. The larger model doesn't need to build anti-similarity boundaries because it has enough capacity to discriminate through other mechanisms.
**BERT starts BELOW 0.50 (0.446).** The only model with majority-repulsion from layer 0. MLM bidirectional training creates fundamentally different QK geometry from autoregressive or contrastive training.
### VI.4 MLP Dead Neurons
**Formula:** Combined importance = ‖wᵢ_up‖₂ · ‖wᵢ_down‖₂ (ReLU) or ‖wᵢ_gate‖₂ · ‖wᵢ_up‖₂ · ‖wᵢ_down‖₂ (GeGLU). Dead if < 1% of mean.
| Model | Dead (<1% mean) | Weak (<10% mean) | Notes |
|---|---|---|---|
| T5-Small (enc+dec) | 0/24,576 (0.00%) | 0/24,576 (0.00%) | All neurons alive |
| T5-Base (enc+dec) | 0/73,728 (0.00%) | 0/73,728 (0.00%) | All neurons alive |
| T5-v1.1-XXL encoder | 0/245,760 (0.00%) | 0/245,760 (0.00%) | All neurons alive |
| T5-v1.1-XXL decoder | **14/245,760 (0.01%)** | **461/245,760 (0.19%)** | First dead neurons in T5 family |
| BERT-large | 0/98,304 (0.00%) | 0/98,304 (0.00%) | All neurons alive |
| DINOv2-large | 0/98,304 (0.00%) | 0/98,304 (0.00%) | All neurons alive |
| CLIP-ViT-B/16 | **1,316/36,864 (3.57%)** | 1,356/36,864 (3.68%) | Only model with significant dead neurons |
| CLIP-ViT-bigG | 0/393,216 (0.00%) | **24,163/393,216 (6.14%)** | 0 dead but 6% weak |
**Finding:** T5-v1.1-XXL decoder has the first dead neurons in the T5 family – 14 neurons in layers 1-2 only. The decoder's early GeGLU layers carved out a tiny amount of capacity. Encoder uses everything. CLIP-ViT-B/16 is the outlier with 3.6% dead neurons – contrastive training at small scale produces genuine pruning.
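The VI.4 importance score, sketched for the ReLU case (the matrix layouts `W_up [d_ff, d_model]` / `W_down [d_model, d_ff]` are assumptions matching common checkpoint conventions; the GeGLU case just adds a third norm factor):

```python
import numpy as np

def dead_neurons(W_up, W_down, dead_frac=0.01):
    """Count neurons whose combined importance ||up row i|| * ||down col i||
    falls below dead_frac of the mean importance."""
    imp = np.linalg.norm(W_up, axis=1) * np.linalg.norm(W_down, axis=0)
    return int(np.sum(imp < dead_frac * imp.mean()))

rng = np.random.default_rng(5)
W_up = rng.normal(size=(2048, 512))
W_down = rng.normal(size=(512, 2048))
W_up[0] *= 1e-6                      # kill one neuron's input weights
n_dead = dead_neurons(W_up, W_down)
print(n_dead)  # 1
```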
### VI.5 Cross-Layer Weight Correlation
**Formula:** cos(flatten(Wįµ¢), flatten(Wā±¼)) between weight matrices of the same type at different layers.
| Model | Q adj mean | K adj mean | MLP_up adj mean |
|---|---|---|---|
| T5-Small | ~0.000 | ~0.000 | 0.031–0.045 |
| T5-Base | ~0.000 | ~0.000 | 0.024–0.036 |
| T5-v1.1-XXL encoder | 0.0001 | – | – |
| T5-v1.1-XXL decoder | −0.0001 | – | – |
| BERT-large | 0.0002 | 0.0003 | 0.032 |
| CLIP-ViT-B/16 | −0.0004 (QKV) | – | 0.008 |
| DINOv2-large | −0.0003 | −0.0002 | 0.006 |
| CLIP-ViT-bigG | 0.0000 (QKV) | – | 0.055 |
**Universal finding:** Attention weights (Q, K, V) are completely uncorrelated across layers (~0.000). Every layer defines an independent similarity function. MLP weights show positive correlation decaying with distance – feedforward layers share structure.
### VI.6 Position Bias Topology
**T5 uses learned relative position biases:** [32 buckets × N_heads].
| Model | Encoder | Decoder |
|---|---|---|
| T5-Small (8 heads) | 3 local, 2 global, 3 mixed | 4 local, 4 global, 0 mixed |
| T5-Base (12 heads) | 4 local, 3 global, 5 mixed | 5 local, 4 global, 3 mixed |
| T5-v1.1-XXL (64 heads) | **24 local, 2 global, 38 mixed** | **27 local, 37 global, 0 mixed** |
**T5-v1.1-XXL position findings:**
- Encoder: 38/64 mixed heads – nuanced position sensitivity at scale
- **Decoder: ZERO mixed heads** – perfect binary crystallization. Every head is either pure local or pure global
- Decoder is 58% global (37/64) – overwhelmingly biased toward long-range attention
- Encoder range: [−47.2, 11.2] – strong local suppression
- Decoder range: [−28.4, 17.0] – more balanced
**Finding:** The decoder local/global binary split is scale-invariant (0 mixed at T5-Small, 0 mixed at XXL). Gradient descent crystallizes decoder position heads into two pure modes regardless of capacity.
---
## VII. Geometric Residual Modulator
### VII.1 Architecture
- Geometric embedding: [vocab_size, 64] – per-token geometric fingerprint
- Projection: Linear(64, d_model, bias=False) – Procrustes-aligned to encoder PCA space
- Alpha: per-layer learnable LERP coefficient, stored in logit space, applied via sigmoid
- Intervention: residual_out = (1 − α) · residual + α · proj(geo_embed(token_ids))
- Params: 2.09M (3.45% of T5-Small)
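The intervention above, as a shape-level NumPy sketch (a stand-in for the actual trained PyTorch module; all tensor names and sizes here are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def modulate_residual(residual, token_ids, geo_embed, proj, alpha_logit):
    """residual_out = (1 - a) * residual + a * proj(geo_embed[token_ids]),
    with the per-layer LERP coefficient a stored in logit space."""
    a = sigmoid(alpha_logit)               # learnable scalar, one per layer
    geo = geo_embed[token_ids] @ proj      # [seq, 64] @ [64, d_model]
    return (1.0 - a) * residual + a * geo

rng = np.random.default_rng(6)
vocab, d_model = 100, 32
geo_embed = rng.normal(size=(vocab, 64))
proj = rng.normal(size=(64, d_model))
residual = rng.normal(size=(5, d_model))
out = modulate_residual(residual, np.array([1, 2, 3, 4, 5]),
                        geo_embed, proj, alpha_logit=-4.6)  # sigmoid(-4.6) ~ 0.01
print(out.shape)  # (5, 32)
```

Storing alpha in logit space keeps the effective coefficient in (0, 1) without clamping during training.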
### VII.2 Geometric Embedding Initialization
| Metric | Value |
|---|---|
| WN reconstruction correlation | 0.921 |
| Procrustes alignment cosine | 0.372 |
| Eigenvalue cumulative (top 64) | 61.3% |
### VII.3 Alpha Convergence
| Start α | Final Mean α | Layer 5 Final | Pearson Δ | CV | Coherent | Basin |
|---|---|---|---|---|---|---|
| 0.01 (20 ep) | **0.067** | **0.107** | **+0.151** | **0.220** | **Yes** | Binding |
| 0.20 (20 ep) | 0.222 | 0.308 | +0.085 | 0.452 | No | Ridge |
| 0.70 (20 ep) | 0.695 | 0.640 | -0.029 | 0.482 | No | Separation |
| 0.01 (100 ep) | 0.125 | 0.218 | +0.074 | 0.322 | No | Overfit |
### VII.4 Depth Gradient (Consistent Across All Runs)
| Layer | 20ep (α=0.01) | 100ep (α=0.01) | 20ep (α=0.20) |
|---|---|---|---|
| 0 | 0.015 | 0.035 | 0.170 |
| 1 | 0.052 | 0.061 | 0.180 |
| 2 | 0.066 | 0.102 | 0.227 |
| 3 | 0.080 | 0.137 | 0.197 |
| 4 | 0.080 | 0.197 | 0.248 |
| 5 | 0.107 | 0.218 | 0.308 |
**Finding:** Always monotonically increasing. The model wants minimal geometric modulation early and maximum modulation at the deepest layer. Geometry is a final correction, not an initial condition.
### VII.5 Best Result
| Metric | Original | Modulated (20ep, α=0.01 start) | Change |
|---|---|---|---|
| WordNet Pearson | 0.099 | **0.250** | **+152%** |
| WordNet Spearman | 0.085 | **0.245** | **+189%** |
| Semantic Gradient | 0.022 | **0.052** | **+132%** |
| Pentachoron CV | 0.202 | **0.220** | Stayed in band |
| Per-token Preservation | – | 0.730 | – |
| Coherence | Baseline | **Identical on 4/4 tests** | – |
---
## VIII. Geometric Field Modulator (Multi-Expert)
### VIII.1 Architecture
- Three KSimplexChannel experts: k=1 (edge, 2 features), k=2 (triangle, 4 features), k=4 (pentachoron, 11 features)
- **Multiplicative gating**: residual × Π(blended_gates) – valid regions pass, invalid suppressed
- **Soft blending**: per expert gate = (1 − α) + α × expert_gate
- **Null space**: 25% of residual dimensions untouched by modulator
- **Alpha clamped**: [0.001, 0.35] – hard ceiling below the phase boundary
- **Gradient scaling**: geometric params at 10% LR, alpha at 50% LR, gates at full LR
- Params: **38,552** (0.064% of T5-Small)
- Self-test: validity=0.985, null space preserved, template volumes sane
### VIII.2 Design Rationale (Grounded in Cross-Architecture Data)
| Data Point | Design Decision |
|---|---|
| Q sparsity 100% at scale | Geometric field can replace Q ā the model barely uses it |
| Cross-attn QK locked at 0.500 | Target equilibrium for geometric validity gating |
| Depth gradient always increasing | Per-layer alpha respects this (low early, high late) |
| Zero dead MLP neurons | Don't touch MLPs ā all capacity is in use |
| Decoder position: binary L/G split | Modulator preserves positional structure (null space) |
| CV 0.20ā0.23 universal | CV monitoring as health check, not loss |
---
## IX. The 0.29154 Constant
### IX.1 Observations Across Systems
| System | Context | Value |
|---|---|---|
| MinimalShunts | CLIP-L → CLIP-G projection gate | Emergent equilibrium |
| Wormhole Lambda | Vision transformer training | Converges from 0.74 toward ~0.29 |
| Alpha curriculum | Devil's Staircase PE training | Converges to ~0.50 under geometric loss, CE destroys |
| T5 generation | Greedy decode alpha sweep | Stable plateau at 0.291–0.292, semantic phase transition |
| Alpha training basins | 0.70 start → settled at 0.695 | Mirror constant 1 − 0.29154 = 0.70846, Δ = 0.013 |
### IX.2 T5 Generation Phase Transition
| Alpha | Output (triangle prompt) |
|---|---|
| 0.01–0.10 | "...three edges and three vertices. it is one of the basic shapes in geometry." |
| 0.20 | "**a** triangle is a polygon with three edges and three vertices..." |
| 0.28 | "a polygon with three vertices. it is one of the basic shapes in **a graph**." |
| 0.291 | "a triangle is a polygon with a vertice and a vertice. it is one of the basic shapes in **a graph**." |
| 0.2915 | "a triangle is a polygon with a vertice and a vertice. it is one of the basic shapes in **a graph**." |
| 0.292 | "a triangle is a polygon with a vertice and a vertice. it is one of the basic shapes in **the world**." |
| 0.30 | "a polygon with a vertice and a vertice. it is one of the basic shapes in the world." |
**Finding:** 0.29154 marks the phase boundary between structural representation ("graph") and physical representation ("world"). Output is invariant to perturbation in a narrow band centered on the constant.
---
## X. Universal Geometric Constants
| Constant | Value | Observed In |
|---|---|---|
| Pentachoron CV | 0.20–0.23 | T5-Small, Qwen 0.8B, Qwen 4B, trained modulator |
| Participation / dim | 0.53–0.56 | T5-Small, Qwen 0.8B |
| Binding/separation constant | 0.29154 / 0.70846 | MinimalShunts, CLIP projections, T5 generation, alpha convergence |
| Depth gradient | Monotonic increasing | All modulator training runs |
| Q sparsity scaling (T5) | 93.7% → 99.4% → 100.0% | T5-Small → T5-Base → T5-v1.1-XXL |
| Q sparsity asymmetry | **T5 pretraining only** | Present in T5, absent in T5Gemma2, BERT, DINOv2, UNets, VAEs |
| Cross-modal QK balance | **Locked at 0.500** | T5-v1.1-XXL cross-attn, T5Gemma2 (both), SD 1.5 UNet, SDXL UNet (6 models) |
| Self-attn QK: adapted models | **Locked at 0.500** | T5Gemma2 1B (all 53 layers), T5Gemma2 4B (all 68 layers) |
| UNet QK U-gradient | down→repulsion, up→attraction | SD 1.5 (0.451→0.581), SDXL (0.477→0.549) |
| VAE decoder QK | Repulsion-biased | SD 1.5 (0.486), SDXL (0.416), Flux.1 (0.451), Flux.2 (0.416) |
| Attention cross-layer corr | ~0.000 | ALL 17 models, including UNets and VAEs |
| Conv cross-layer corr | ~0.000 | All UNets and VAEs (extends to pure convnets) |
| MLP/FF full utilization | 0.00% dead | T5 family (enc), BERT, DINOv2, UNets, all VAEs |
| Decoder position crystallization | 0 mixed heads | T5-Small, T5-v1.1-XXL |
| VAE spectral invariant | Pearson 0.94–0.98 | All 6 VAE pairs – SV distribution is architecture-determined |
| VAE Procrustes alignment | 70–76% cosine | All 6 pairs – same solution in different coordinate systems |
---
## XI. Measurement Toolkit Reference
| Tool | Input | Output | Requires Inference |
|---|---|---|---|
| Participation Ratio | Embedding matrix | Effective dimensionality | No |
| Cayley-Menger Volume | 5-point subsets of embeddings | Simplex volume + CV | No |
| Pairwise Cosine | Embedding matrix (sampled) | Similarity distribution | No |
| Digit Manifold | 10 digit token embeddings | \|i−j\| correlation, adjacency gap | No |
| SVD Effective Rank | Any 2D weight matrix | Stable rank, condition number | No |
| QK Manifold | W_Q, W_K matrices | Eigenspectrum, pos/neg balance | No |
| Dead Neuron Count | MLP wi/gate/up, wo matrices | Combined importance distribution | No |
| Cross-Layer Correlation | Same-type weight matrices | Adjacent cosine similarity | No |
| Position Bias Topology | Relative attention bias tensor | Local/global/mixed head counts | No |
| Sparsity Topology | Any weight matrix | Fraction below threshold | No |
| WordNet Relational | Encoder output (mean-pooled) | Pearson/Spearman vs path similarity | Yes |
| Alpha Convergence | Modulator training loop | Per-layer equilibrium values | Yes (training) |
---
## XII. T5Gemma2 ā Decoder-Adapted Encoder-Decoder
**Architecture:** Gemma 2 decoder weights adapted to encoder-decoder. GQA (grouped query attention), RoPE, GeGLU MLPs. Multimodal (ViT in encoder).
### XII.1 Sparsity
| Model | Q (<0.1) | K (<0.1) | V (<0.1) | Pattern |
|---|---|---|---|---|
| T5Gemma2 1B-1B | 100.0% | 99.9% | 100.0% | **Uniform** |
| T5Gemma2 4B-4B | 100.0% | 100.0% | 100.0% | **Uniform** |
**Finding:** No Q/K asymmetry. The T5 Q sparsity pattern is ABSENT when the encoder is initialized from decoder weights. The asymmetry is a property of T5's span corruption pretraining, not the encoder-decoder architecture.
### XII.2 QK Manifold
| Model | Encoder Self | Decoder Self | All Layers |
|---|---|---|---|
| T5Gemma2 1B | 0.500 (±0.001) | 0.500 (±0.001) | **Locked** |
| T5Gemma2 4B | 0.500 exact | 0.500 exact | **Locked** |
**Finding:** Perfect 0.500 lock across ALL layers in BOTH encoder and decoder. Symmetry deviation √2 everywhere. The Gemma 2 initialization left the QK matrices near random-matrix equilibrium. The adaptation to encoder-decoder didn't perturb them enough to break Wigner semicircle symmetry.
### XII.3 Other Invariants
- Dead neurons: 0/359,424 (1B), 0/696,320 (4B) – all alive
- Cross-layer Q correlation: ~0.000 – confirmed universal
- MLP utilization: 100% (1 weak neuron each in enc L6 and dec L6 at 4B scale)
- GQA: 4:1 at 1B scale, 2:1 at 4B scale
---
## XIII. Diffusion UNet Weight Topology
### XIII.1 UNet Sparsity
| Model | Self Q | Self K | Self V | Cross Q | Cross K | Cross V |
|---|---|---|---|---|---|---|
| SD 1.5 UNet | **90.5%** | **90.9%** | 97.1% | 96.8% | 94.9% | 98.9% |
| SDXL UNet | 99.9% | 99.9% | 100.0% | 100.0% | 100.0% | 100.0% |
**SD 1.5 is the least sparse model in the entire battery.** 90.5% for self-attention Q – below T5-Small's 93.7%. A parameter-starved model (860M for 512×512 image generation) uses denser weights. SDXL at 3× the params reaches near-100%.
**Sparsity traces the U-path (SD 1.5):** down=88.9%, mid=99.3%, up=89.4%. The bottleneck has the most diffuse weights; the periphery has the densest.
### XIII.2 UNet QK Manifold ā The U-Shape
**Self-attention positive eigenvalue fraction through the UNet path:**
| Position | SD 1.5 | SDXL |
|---|---|---|
| down (early) | 0.509 | ~0.49 |
| down (deep) | **0.451** | **0.483** |
| mid (bottleneck) | **0.483** | **0.477** |
| up (early) | 0.501 | 0.501 |
| up (late) | **0.581** | **0.549** |
The QK manifold traces the U-shape: repulsion-dominated downpath (compressing, discriminating), maximum repulsion at the bottleneck, rising to attraction-dominated uppath (reconstructing, grouping). SD 1.5 shows the wider swing (0.451–0.581, a 0.130 range) because it's more parameter-starved.
**Cross-attention: locked at 0.500 in both UNets.** SD 1.5: mean=0.501, std=0.001. SDXL: mean=0.500, std=0.001. The fifth and sixth confirmations of the cross-modal QK lock.
### XIII.3 Other UNet Invariants
- Dead neurons: 0/23,040 (SD 1.5), 0/163,840 (SDXL)
- Cross-block Q correlation: ~0.000 (both self-attn and cross-attn)
- SDXL cross-attn Q stable rank: 13.97 (lowest of any weight type) – extremely concentrated queries to text
- SDXL cross-attn V: highest stable rank (165.9) and lowest condition number (15.8) – richest value matrices
---
## XIV. VAE Weight Topology
### XIV.1 Cross-VAE Comparison
| VAE | Params | Latent Ch | Enc (<0.1) | Dec (<0.1) | Enc QK pos | Dec QK pos |
|---|---|---|---|---|---|---|
| SD 1.5 | 83.7M | 4 | 98.6% | 99.1% | 0.496 | 0.486 |
| SDXL | 83.7M | 4 | **29.0%** | **38.1%** | 0.502 | **0.416** |
| Flux.1 | 83.8M | 16 | 96.5% | 97.5% | 0.498 | **0.451** |
| Flux.2 | 84.0M | 32 | 94.3% | 94.3% | **0.393** | **0.416** |
**SDXL VAE is the densest model measured.** 29% encoder sparsity at 0.1 threshold. Identical architecture and param count to SD 1.5, but weights are 3× denser. Attention condition numbers reach 1.16M.
### XIV.2 VAE Decoder QK Breaks Toward Repulsion
| VAE | Latent Ch | Decoder QK pos | Interpretation |
|---|---|---|---|
| SD 1.5 | 4 | 0.486 | Slight repulsion |
| SDXL | 4 (1024² target) | **0.416** | Strong repulsion – 4× reconstruction challenge |
| Flux.1 | 16 | **0.451** | Moderate repulsion |
| Flux.2 | 32 | **0.416** | Strong repulsion – most channels to separate |
Decoder bottleneck attention breaks symmetry toward repulsion. Reconstruction requires spatial discrimination: more negative eigenvalues = finer spatial separation. More latent channels or higher target resolution → stronger repulsion.
**Flux.1 decoder anomaly:** Top eigenvalue = 60,807 (typical is 2–150). One attention direction completely dominates. Rank-1 approximation of the attention space.
### XIV.3 VAE Invariants
- Zero dead neurons across all four VAEs
- Conv filter utilization: 100% (active fraction 1.000)
- Cross-layer conv correlation: ~0.000 → universal, extends to pure convnets
- Spectral correlation between VAEs: 0.94–0.98 → architecture determines SV distribution
---
## XV. Procrustes Analysis ā VAE Weight-Space Alignment
### XV.1 Methodology
**Orthogonal Procrustes:** For each common weight matrix (same name, same shape), find the orthogonal R minimizing ‖A − BR‖_F via SVD of B^T A. Report residual (0 = identical up to rotation, √2 = orthogonal) and cosine after alignment.
**Spectral correlation:** Pearson correlation of normalized singular value distributions.
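Both measurements fit in a few lines of NumPy. A minimal sketch, assuming weight tensors have already been flattened to 2-D (not the exact implementation in `procrustes_vae_analysis.py`):

```python
import numpy as np

def procrustes_align(a: np.ndarray, b: np.ndarray):
    """Orthogonal R minimizing ||A - BR||_F, via SVD of B^T A.

    Returns (R, residual, cosine). With A and BR at unit Frobenius norm,
    residual = sqrt(2 - 2*cosine): 0 means identical up to rotation,
    sqrt(2) means orthogonal.
    """
    u, _, vt = np.linalg.svd(b.T @ a)
    r = u @ vt
    br = b @ r
    cos = float((a * br).sum() / (np.linalg.norm(a) * np.linalg.norm(br)))
    return r, float(np.sqrt(max(0.0, 2.0 - 2.0 * cos))), cos

def spectral_corr(a: np.ndarray, b: np.ndarray) -> float:
    """Pearson correlation of the normalized singular value distributions."""
    sa = np.linalg.svd(a, compute_uv=False)
    sb = np.linalg.svd(b, compute_uv=False)
    n = min(len(sa), len(sb))
    return float(np.corrcoef(sa[:n] / sa.sum(), sb[:n] / sb.sum())[0, 1])

# sanity check: a matrix and a rotated copy of itself align perfectly
rng = np.random.default_rng(1)
a = rng.standard_normal((6, 6))
q, _ = np.linalg.qr(rng.standard_normal((6, 6)))
r, resid, cos = procrustes_align(a, a @ q)
```

The sanity check is the whole point of the method: `a` and `a @ q` have near-zero raw cosine for a generic rotation `q`, yet the recovered R restores cosine ≈ 1, which is exactly the raw-vs-aligned gap in the pairwise table.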
### XV.2 Pairwise Results
| Pair | Raw Cosine | Procrustes Cosine | Rotation Gain | Spectral Corr |
|---|---|---|---|---|
| SD1.5 vs SDXL | 0.053 | 0.697 | +0.644 | 0.958 |
| SD1.5 vs Flux.1 | 0.091 | 0.730 | +0.640 | 0.964 |
| **SD1.5 vs Flux.2** | **-0.000** | **0.757** | **+0.757** | **0.979** |
| SDXL vs Flux.1 | 0.024 | 0.675 | +0.650 | 0.939 |
| SDXL vs Flux.2 | -0.001 | 0.705 | +0.705 | 0.937 |
| Flux.1 vs Flux.2 | 0.000 | 0.736 | +0.736 | 0.957 |
### XV.3 Key Findings
**1. Raw cosine is zero.** All pairs. Weights are orthogonal in raw space. Naive comparison says these VAEs share nothing. This is wrong.
**2. After Procrustes rotation, 70–76% of structure aligns.** These models found the SAME geometric solution, expressed in different coordinate systems. Different initialization → different basis → same function.
**3. Spectral correlation is 0.94–0.98.** Singular value distributions are nearly identical across all pairs. The "shape" of each weight matrix (rank structure, energy distribution) is architecture-determined, not training-determined.
**4. SD 1.5 vs Flux.2 is the most alignable pair.** Raw cosine is literally zero, yet it shows the highest Procrustes cosine (0.757) and the highest spectral correlation (0.979). The most different training produces the most alignable weights: shared structure runs deepest where surface differences are greatest.
**5. SDXL is the geometric outlier.** Lowest Procrustes cosine against every other model (0.675–0.705). It found a more distant basin despite being architecturally identical to SD 1.5.
### XV.4 Distance Matrices
**Procrustes Residual (lower = more similar):**
| | SD 1.5 | SDXL | Flux.1 | Flux.2 |
|---|---|---|---|---|
| SD 1.5 | 0.000 | 0.752 | 0.707 | 0.679 |
| SDXL | 0.752 | 0.000 | 0.774 | 0.739 |
| Flux.1 | 0.707 | 0.774 | 0.000 | 0.699 |
| Flux.2 | 0.679 | 0.739 | 0.699 | 0.000 |
**Spectral Correlation (higher = more similar):**
| | SD 1.5 | SDXL | Flux.1 | Flux.2 |
|---|---|---|---|---|
| SD 1.5 | 1.000 | 0.958 | 0.964 | 0.979 |
| SDXL | 0.958 | 1.000 | 0.939 | 0.937 |
| Flux.1 | 0.964 | 0.939 | 1.000 | 0.957 |
| Flux.2 | 0.979 | 0.937 | 0.957 | 1.000 |
### XV.5 Implication for Geometric Transfer
A geometric field modulator trained on one VAE can be ROTATED to work on another via the Procrustes R matrix. 70–76% structural alignment means the modulator captures the shared geometric invariant. The remaining 24–30% is model-specific: the unique basin each training run found.
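A hedged sketch of that transfer, assuming the modulator acts as a linear operator M on VAE feature space (`rotate_modulator` and the shapes here are illustrative, not functions from `geometric_field_modulator.py`):

```python
import numpy as np

def rotate_modulator(m: np.ndarray, r: np.ndarray) -> np.ndarray:
    """Transport an operator M from VAE A's basis to VAE B's basis.

    If B's features are approximately A's features rotated by R (the
    Procrustes solution), conjugation R^T M R re-expresses M in B's
    coordinates while leaving its action unchanged.
    """
    return r.T @ m @ r
```

Conjugation is the natural choice because it is lossless for the aligned 70–76%: rotating into B's basis and back recovers M exactly, and a basis-independent modulator (e.g. the identity) is left untouched.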
---
## XVI. Scripts Reference
| Script | Purpose | Key Outputs |
|---|---|---|
| `probe_t5_small_terrain.py` | T5-Small embedding + layer geometry | PR, CV, digit manifold, layer evolution |
| `probe_t5_wordnet_summarize.py` | T5-Small × WordNet relational alignment | Pearson, Spearman, distance bands, hypernym decay |
| `probe_t5_wordnet_50seeds.py` | 50-seed stability test (GPU-accelerated) | Confidence intervals for all relational metrics |
| `probe_t5_inactive_weights.py` | T5-Small/Base inactive weight topology | SVD, sparsity, QK manifold, dead neurons |
| `cross_architecture_weight_battery.py` | BERT + CLIP + DINOv2 battery | Cross-model comparison table |
| `probe_flux_t5_g4.py` | T5-v1.1-XXL (Flux encoder) full battery | All layers, encoder + decoder + cross-attn |
| `geometric_residual_modulator.py` | LERP modulator + training utilities | Modulator class + measurement tools |
| `geometric_field_modulator.py` | Multi-expert field modulator | KSimplex experts + multiplicative gating |
| `geometric_modulator_full_pipeline.py` | Self-contained T5 + WordNet + modulator | End-to-end pipeline |
| `train_modulator.py` | Training loop for alpha convergence | Freeze T5, train modulator, track alpha |
| `probe_t5gemma2.py` | T5Gemma2 battery (both scales) | GQA handling, adapted enc-dec topology |
| `probe_unet_geometry.py` | SD 1.5 / SDXL UNet battery | U-path QK gradient, cross-attn lock |
| `probe_vae_geometry.py` | All four VAE battery | Conv reshape, bottleneck attention, latent comparison |
| `procrustes_vae_analysis.py` | Pairwise Procrustes on 4 VAEs | Distance matrices, depth profiles, rotation gain |
---
*Last updated: 2026-03-06*
*Models profiled: 17 (T5-Small, T5-Base, T5-v1.1-XXL, BERT-large, CLIP-ViT-B/16, DINOv2-large, CLIP-ViT-bigG, Qwen3.5-0.8B, Qwen3.5-4B, T5Gemma2-1B, T5Gemma2-4B, SD 1.5 UNet, SDXL UNet, SD 1.5 VAE, SDXL VAE, Flux.1 VAE, Flux.2 VAE)*
*Architecture families: 5 (Transformer enc-dec, encoder-only/vision, adapted enc-dec, conv UNet, conv autoencoder)*
*Training objectives: 6 (span corruption, MLM, contrastive, self-supervised, diffusion, reconstruction)*
*Procrustes analysis: 6 VAE pairs, 68 weight matrices each*
*Modulator experiments: 4 LERP configurations, 1 field modulator* |