---
license: mit
---

# Day 2
# Geometric Terrain Statistics Composite

## Document Purpose

Running catalog of geometric measurements across language and vision models. Each metric includes its formula, measurement process, and cross-model results. Designed for expansion as new models and experiments are added.

---

## I. Models Profiled

| Model | Params | Vocab | Hidden Dim | Layers | Heads | Architecture | Training |
|---|---|---|---|---|---|---|---|
| T5-Small | 60.5M | 32,128 | 512 | 6+6 | 8 | Enc-Dec (relative PE, ReLU MLP) | C4 span corruption |
| T5-Base | 222.9M | 32,128 | 768 | 12+12 | 12 | Enc-Dec (relative PE, ReLU MLP) | C4 span corruption |
| T5-v1.1-XXL | 11.4B | 32,128 | 4096 | 24+24 | 64 | Enc-Dec (relative PE, **GeGLU** MLP) | C4 (v1.1 variant, no multi-task) |
| BERT-large | 336.2M | 30,522 | 1024 | 24 | 16 | Encoder-only (absolute PE) | BookCorpus+Wikipedia MLM |
| CLIP-ViT-B/16 | 85.5M (visual) | — | 768 | 12 | 12 | Vision encoder (fused QKV) | LAION-2B contrastive |
| DINOv2-large | 302.0M | — | 1024 | 24 | 16 | Vision encoder (separate Q/K/V) | Self-supervised (no labels) |
| CLIP-ViT-bigG/14 | 1.84B (visual) | — | 1664 | 48 | 16 | Vision encoder (fused QKV) | LAION-2B contrastive |
| Qwen3.5-0.8B | 853M | 248,320 | 1024 | — | — | DeltaNet + MoE + ViT | Multilingual + Vision |
| Qwen3.5-4B | ~4B | 248,320 | 2560 | — | — | DeltaNet + MoE + ViT | Multilingual + Vision |
| T5Gemma2-1B-1B | 2.1B | 262,144 | 1152 | 27+26 | GQA 4:1 | Adapted enc-dec (Gemma 2, RoPE, GeGLU) | Gemma 2 decoder → enc-dec |
| T5Gemma2-4B-4B | 7.5B | 262,144 | 2560 | 34+34 | GQA 2:1 | Adapted enc-dec (Gemma 2, RoPE, GeGLU) | Gemma 2 decoder → enc-dec |
| SD 1.5 UNet | 860M | — | [320,640,1280,1280] | 16 attn blocks | 8 | Conv UNet + self/cross attn | LDM diffusion (LAION) |
| SDXL UNet | 2.6B | — | [320,640,1280] | 70 attn blocks | [5,10,20] | Conv UNet + self/cross attn | LDM diffusion (internal) |
| SD 1.5 VAE | 83.7M | — | 4 latent ch | [128,256,512,512] | — | Conv autoencoder + mid attn | Reconstruction (LAION) |
| SDXL VAE | 83.7M | — | 4 latent ch | [128,256,512,512] | — | Conv autoencoder + mid attn | Reconstruction (internal) |
| Flux.1 VAE | 83.8M | — | 16 latent ch | [128,256,512,512] | — | Conv autoencoder + mid attn | Reconstruction (BFL) |
| Flux.2 VAE | 84.0M | — | 32 latent ch | [128,256,512,512] | — | Conv autoencoder + mid attn | Reconstruction (BFL) |

**Notes:**
- T5-v1.1-XXL encoder is the text encoder used by Flux.1 Schnell, Flux.1 Dev, and Flux.2
- CLIP models use fused QKV (`in_proj_weight`); Q/K/V split by thirds for analysis
- T5-v1.1 uses GeGLU (wi_0 gate + wi_1 value) instead of ReLU (single wi)
- T5Gemma2 models are Gemma 2 decoder weights adapted to encoder-decoder; include ViT vision tower
- UNet attention: attn1 = self-attention (spatial), attn2 = cross-attention (to text encoder)
- VAE Conv2d weights reshaped to 2D as [out_channels, in_channels * kH * kW] for analysis
- VAE attention exists only at the bottleneck (mid_block) — one in encoder, one in decoder

---

## II. Embedding Geometry Metrics

### II.1 Participation Ratio (Effective Dimensionality)

**Formula:** PR = (Σλᵢ)² / Σ(λᵢ²), where λᵢ are eigenvalues of the embedding covariance matrix.

**Process:** Center embeddings (subtract mean), compute covariance C = EᵀE / N, eigendecompose. PR counts effective number of dimensions used. PR/dim normalizes to [0, 1].

| Model | PR | PR / dim | Dims for 95% var |
|---|---|---|---|
| T5-Small (512d) | 287.2 | **0.561** | 379 (74.0%) |
| Qwen3.5-0.8B (1024d) | 547.7 | **0.535** | 893 (87.2%) |
| Qwen3.5-4B (2560d) | 812.4 | **0.317** | 2125 (83.0%) |

**Finding:** PR/dim ≈ 0.53–0.56 for smaller models. Appears to be a universal attractor for embedding dimensionality utilization.
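The measurement above reduces to a few lines of NumPy; this is a minimal sketch (function name illustrative, not the project's toolkit code):

```python
import numpy as np

def participation_ratio(E):
    """Effective dimensionality PR = (Σλ)² / Σ(λ²) over the eigenvalues
    of the covariance of the mean-centered embedding matrix E [n_tokens, dim].
    Returns (PR, PR/dim)."""
    Ec = E - E.mean(axis=0)                 # center
    C = Ec.T @ Ec / Ec.shape[0]             # covariance, [dim, dim]
    lam = np.clip(np.linalg.eigvalsh(C), 0.0, None)  # guard round-off negatives
    pr = lam.sum() ** 2 / (lam ** 2).sum()
    return pr, pr / E.shape[1]
```

For isotropic data PR/dim approaches 1; for data concentrated on a single direction PR approaches 1 regardless of ambient dimension.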

### II.2 Pairwise Cosine Similarity Distribution

**Formula:** cos(eᵢ, eⱼ) = (eᵢ · eⱼ) / (‖eᵢ‖ · ‖eⱼ‖), sampled over 5K random tokens (12.5M pairs).

**Process:** Random sample 5K token embeddings, L2-normalize, compute full pairwise cosine matrix, extract upper triangle.

| Model | Mean | Std | Median | 1% | 99% |
|---|---|---|---|---|---|
| T5-Small | 0.057 | 0.060 | 0.053 | -0.068 | 0.225 |
| Qwen3.5-0.8B | 0.195 | 0.085 | 0.197 | -0.016 | 0.408 |
| Qwen3.5-4B | 0.142 | 0.078 | 0.139 | -0.029 | 0.356 |

**Finding:** T5 is near-orthogonal (span corruption objective). Qwen has positive bias (autoregressive next-token prediction pushes shared "being a token" component).
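The sampling procedure can be sketched as follows (a minimal NumPy version; function name illustrative):

```python
import numpy as np

def pairwise_cosine_stats(E, n_sample=5000, seed=0):
    """Summarize the pairwise-cosine distribution over a random token sample.

    L2-normalizes a random subset of rows of E, forms the full cosine
    matrix, and reports statistics of its upper triangle."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(E.shape[0], size=min(n_sample, E.shape[0]), replace=False)
    X = E[idx]
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    cos = X @ X.T
    vals = cos[np.triu_indices_from(cos, k=1)]   # upper triangle, no diagonal
    return {"mean": float(vals.mean()), "std": float(vals.std()),
            "median": float(np.median(vals)),
            "p1": float(np.percentile(vals, 1)),
            "p99": float(np.percentile(vals, 99))}
```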

### II.3 Embedding Norm Distribution

**Formula:** ‖eᵢ‖₂ = √(Σⱼ eᵢⱼ²)

| Model | Mean Norm | Std | Min | Max |
|---|---|---|---|---|
| T5-Small | 520.15 | 69.84 | 243.31 | 1333.61 |
| Qwen3.5-0.8B | 0.627 | 0.062 | 0.347 | 1.057 |
| Qwen3.5-4B | 0.656 | 0.067 | 0.400 | 1.091 |

**Note:** T5 embeddings are unnormalized (large magnitudes). Qwen embeddings are near-unit norm.

---

## III. Simplex Geometry Metrics

### III.1 Pentachoron Volume (Cayley-Menger Determinant)

**Formula:** For 5 points Pā‚€...Pā‚„, construct the bordered distance matrix:

```
D = | 0  1    1    1    1    1   |
    | 1  0    d₀₁² d₀₂² d₀₃² d₀₄²|
    | 1  d₁₀² 0    d₁₂² d₁₃² d₁₄²|
    | 1  d₂₀² d₂₁² 0    d₂₃² d₂₄²|
    | 1  d₃₀² d₃₁² d₃₂² 0    d₃₄²|
    | 1  d₄₀² d₄₁² d₄₂² d₄₃² 0   |

Vol² = (-1)⁵ · det(D) / (2⁴ · (4!)²) = -det(D) / 9216
Vol = √(Vol²) if Vol² > 0, else invalid
```

**Process:** Sample 1000 random 5-token subsets. Compute Cayley-Menger volume for each. Report CV (coefficient of variation = std/mean).

| Model | Valid/1000 | CV | Embed/Random Ratio |
|---|---|---|---|
| T5-Small | 1000 | **0.233** | 0.855 |
| Qwen3.5-0.8B | 1000 | **0.208** | 0.984 |
| Qwen3.5-4B | 1000 | **0.222** | 0.988 |

**Finding:** CV 0.20–0.23 is a universal attractor. All models pack simplices with similar evenness regardless of architecture, scale, or training data. The "pentachoron packing constant."
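The bordered-determinant computation for a single 5-point subset can be sketched directly (function name illustrative; the 1000-subset CV is then a loop over random index draws):

```python
import numpy as np

def pentachoron_volume(points):
    """Cayley-Menger volume of the 4-simplex spanned by 5 points [5, d].

    Returns None when Vol² <= 0 (degenerate or invalid simplex)."""
    diff = points[:, None, :] - points[None, :, :]
    d2 = (diff ** 2).sum(-1)               # squared pairwise distances d_ij²
    D = np.ones((6, 6))                    # bordered 6×6 matrix
    D[0, 0] = 0.0
    D[1:, 1:] = d2
    vol2 = -np.linalg.det(D) / 9216.0      # (-1)⁵ det / (2⁴ · (4!)²)
    return float(np.sqrt(vol2)) if vol2 > 0 else None
```

The CV is then std/mean over the valid volumes from the 1000 sampled subsets.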

### III.2 Cross-Model Relational Structure

**Formula:** For shared tokens between two models, compute pairwise cosine matrices in each model's embedding space. Pearson correlation between flattened upper triangles measures relational preservation.

**Process (Qwen 0.8B vs 4B):** PCA 4B embeddings (2560→1024), Procrustes alignment using 10K anchor tokens, evaluate on 5K held-out tokens.

| Comparison | Relational Pearson | Pentachoron per-simplex corr |
|---|---|---|
| Qwen 0.8B vs 4B (raw) | 0.920 | 0.89 |

**Finding:** Models at different scales learn the same relational geometry (r=0.92).
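The raw relational Pearson (before the PCA/Procrustes alignment step) can be sketched as below, assuming the two embedding matrices are row-aligned over shared tokens (function name illustrative):

```python
import numpy as np

def relational_pearson(EA, EB, n_sample=2000, seed=0):
    """Pearson correlation between two models' pairwise-cosine matrices
    over a shared token sample (flattened upper triangles).

    EA, EB must have the same number of rows, aligned on shared tokens."""
    rng = np.random.default_rng(seed)
    n = min(n_sample, EA.shape[0])
    idx = rng.choice(EA.shape[0], size=n, replace=False)

    def cos_upper(E):
        X = E[idx] / np.linalg.norm(E[idx], axis=1, keepdims=True)
        C = X @ X.T
        return C[np.triu_indices(n, k=1)]

    return float(np.corrcoef(cos_upper(EA), cos_upper(EB))[0, 1])
```

A pure rotation of one embedding space leaves all cosines, and hence the relational Pearson, unchanged.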

---

## IV. Semantic Structure Metrics

### IV.1 Digit Manifold

**Formula:** For digit tokens '0'–'9', compute all 45 pairwise cosines. Measure Pearson correlation between |i−j| (numerical distance) and cosine similarity.

| Model | \|i−j\| Correlation | Adjacent Mean | Non-Adjacent Mean | Gap |
|---|---|---|---|---|
| T5-Small | -0.575 | 0.622 | 0.442 | 0.180 |
| Qwen3.5-0.8B | -0.862 | 0.769 | 0.678 | 0.091 |
| Qwen3.5-4B | -0.871 | 0.790 | 0.731 | 0.059 |
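Given the 10 digit-token embedding rows, the correlation is computed as sketched here (function name illustrative):

```python
import numpy as np

def digit_manifold_corr(digit_embeds):
    """Pearson correlation between numerical distance |i−j| and cosine
    similarity over all 45 digit pairs; digit_embeds: [10, d]."""
    X = digit_embeds / np.linalg.norm(digit_embeds, axis=1, keepdims=True)
    cos = X @ X.T
    dists, sims = [], []
    for i in range(10):
        for j in range(i + 1, 10):
            dists.append(abs(i - j))
            sims.append(cos[i, j])
    return float(np.corrcoef(dists, sims)[0, 1])
```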

### IV.2 Semantic Category Clustering (T5-Small)

**Formula:** Mean intra-category pairwise cosine vs global mean pairwise cosine. Lift = intra − global.

| Category | N tokens | Intra Cosine | Global | Lift |
|---|---|---|---|---|
| numbers | 9 | 0.497 | 0.057 | +0.440 |
| colors | 10 | 0.421 | 0.057 | +0.365 |
| time | 10 | 0.351 | 0.057 | +0.294 |
| food | 10 | 0.248 | 0.057 | +0.191 |
| animals | 12 | 0.241 | 0.057 | +0.184 |
| body | 10 | 0.216 | 0.057 | +0.159 |
| emotions | 10 | 0.197 | 0.057 | +0.141 |
| actions | 9 | 0.183 | 0.057 | +0.126 |

---

## V. Encoder Transformation Metrics (T5-Small)

### V.1 Layer-by-Layer Geometry

**Process:** Feed 10 diverse sentences through encoder, capture hidden states at each layer. Measure mean norm and mean pairwise cosine between token positions.

| Layer | Mean Norm | Pairwise Cosine |
|---|---|---|
| 0 (embed) | 377.3 | 0.052 |
| 1 | 761.6 | 0.278 |
| 2 | 1092.6 | 0.330 |
| 3 | 1428.8 | 0.367 |
| 4 | 1829.1 | 0.382 |
| 5 | 2378.3 | 0.419 |
| 6 (post-LN) | 3.3 | 0.211 |

**Finding:** Norms balloon through depth; the final LayerNorm crushes them to ~3. Pairwise cosine rises monotonically through layers 1–5 (0.05 → 0.42) — tokens become MORE similar with depth — before the final LayerNorm pulls it back to 0.21. The encoder is a convergence funnel.

### V.2 WordNet Relational Alignment

**Process:** Encode 9,362 WordNet definitions via "summarize: {definition}". Mean-pool encoder output. Compare pairwise cosine to WordNet path similarity.

| Representation | Pearson | Spearman |
|---|---|---|
| Static embeddings | 0.078 | 0.015 |
| Encoder output | 0.095 | 0.081 |

**50-seed stability (encoder):** Pearson 0.100 ± 0.008, Spearman 0.090 ± 0.010, CV 0.204 ± 0.006.

### V.3 Encoder Distance Bands

| WN Similarity Band | N pairs | Static Cosine | Encoder Cosine | Lift |
|---|---|---|---|---|
| [0.50, 0.90) | 23 | 0.244 | 0.728 | +0.484 |
| [0.25, 0.50) | 53,112 | 0.077 | 0.573 | +0.496 |
| [0.10, 0.25) | 145,035 | 0.060 | 0.565 | +0.505 |
| [0.05, 0.10) | 295,680 | 0.061 | 0.553 | +0.492 |

### V.4 Hypernym Chain Decay

| Depth | Static Cosine | Encoder Cosine |
|---|---|---|
| 1 | 0.160 | 0.656 |
| 3 | 0.075 | 0.594 |
| 5 | 0.069 | 0.585 |
| 7 | 0.068 | 0.579 |

---

## VI. Cross-Architecture Inactive Weight Topology

### VI.1 Q/K/V Sparsity (<0.1 threshold)

**Formula:** Fraction of |wᵢⱼ| < 0.1 across all weights of that type.

**Process:** Iterate all 2D weight matrices, compute abs values, count below threshold. No inference needed.

| Model | Q | K | V | O | MLP | Full Model |
|---|---|---|---|---|---|---|
| **T5-Small** (512d, 6L) | **93.7%** | 19.2% | 12.1% | 10.4% | 11.9% | 18.4% |
| **T5-Base** (768d, 12L) | **99.4%** | 30.0% | 16.2% | 13.5% | 16.9% | 27.9% |
| **T5-v1.1-XXL** (4096d, 24L) | **100.0%** | **65.5%** | 73.1% | 65.4% | ~57% | — |
| BERT-large (1024d, 24L) | 99.1% | 99.1% | 99.9% | 99.9% | 99.4% | 99.3% |
| DINOv2-large (1024d, 24L) | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% |
| CLIP-ViT-B/16 (768d, 12L) | — (fused) | — | — | — | 100.0% | 100.0% |
| CLIP-ViT-bigG (1664d, 48L) | — (fused) | — | — | — | ~97% | 98.0% |

**Key Finding — T5 Q/K Asymmetry Scales:**

| Model | Q (<0.1) | K (<0.1) | Q/K Ratio |
|---|---|---|---|
| T5-Small | 93.7% | 19.2% | **4.9×** |
| T5-Base | 99.4% | 30.0% | **3.3×** |
| T5-v1.1-XXL | 100.0% | 65.5% | **1.5×** |

T5 has a genuine Q-specific sparsity pattern that strengthens with model size: Q hits 100.0% at XXL (every single weight below 0.1). This is NOT the BERT/DINOv2 pattern, where all weight types are uniformly sparse. The query projection in T5 is **functionally vestigial at scale**.

**T5-v1.1-XXL Encoder vs Decoder:**

| Component | Encoder | Decoder |
|---|---|---|
| self_attn_q | 100.0% | 100.0% |
| self_attn_k | 71.7% | 59.4% |
| self_attn_v | 76.0% | 70.1% |
| cross_attn_q | — | 100.0% |
| cross_attn_k | — | 63.1% |
| cross_attn_v | — | 71.1% |

Q is 100% sparse everywhere — self-attention and cross-attention, encoder and decoder.
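The sparsity measurement reduces to a threshold count over each weight type, pooled across layers — a minimal sketch (the type-name keys are illustrative):

```python
import numpy as np

def sparsity_by_type(weights, threshold=0.1):
    """Pooled inactive-weight fraction per weight type.

    weights: dict mapping a type name (e.g. 'self_attn_q') to a list of
    2D weight arrays, one per layer. No inference needed."""
    out = {}
    for name, mats in weights.items():
        flat = np.concatenate([np.abs(m).ravel() for m in mats])
        out[name] = float((flat < threshold).mean())
    return out
```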

### VI.2 SVD Effective Rank

**Formula:** Stable rank = ‖W‖²_F / ‖W‖²₂ = Σσᵢ² / σ₁². Measures effective rank without thresholding.

| Weight Type | T5-Small | T5-Base | T5-v1.1-XXL | BERT-large | DINOv2-large |
|---|---|---|---|---|---|
| self_attn_q | 47.6 | 58.1 | 96.8 | 50.8 | 57.7 |
| self_attn_k | 53.2 | 62.4 | 90.0 | 37.7 | 55.5 |
| self_attn_v | 75.3 | 97.5 | 204.4 | 113.0 | 94.8 |
| self_attn_o | 25.4 | 35.0 | 16.4 | 125.0 | 85.6 |
| mlp_up/gate | 15.2 | 20.6 | 67.9 (gate) / 247.3 (up) | 27.4 | 58.4 |
| mlp_down | 31.3 | 43.9 | 25.3 | 52.2 | 94.4 |

**T5-v1.1-XXL O matrices have very low stable rank (16.4)** — the output projection is extremely low-rank despite the 4096-d space. Cross-attention O is even lower at 6.1.
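Stable rank is a one-liner over the singular values (a minimal sketch; function name illustrative):

```python
import numpy as np

def stable_rank(W):
    """Stable rank ‖W‖²_F / ‖W‖²₂ = Σσᵢ² / σ₁² — no thresholding needed."""
    s = np.linalg.svd(W, compute_uv=False)
    return float((s ** 2).sum() / s[0] ** 2)
```

An identity matrix has stable rank equal to its dimension; a rank-1 matrix has stable rank exactly 1.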

### VI.3 QK Similarity Manifold

**Formula:** QK = W_Q · W_Kᵀ. Eigendecompose the symmetric part (QK + QKᵀ)/2. Positive eigenvalues = attraction directions. Negative eigenvalues = repulsion directions.

**Positive Eigenvalue Fraction Trends:**

| Model | First Layer | Last Layer | Trend |
|---|---|---|---|
| T5-Small encoder | 0.615 | 0.535 | **−0.080** (decreasing) |
| T5-v1.1-XXL encoder | 0.510 | 0.503 | **−0.007** (flat) |
| T5-v1.1-XXL decoder self | 0.501 | 0.548 | **+0.047** (increasing) |
| **T5-v1.1-XXL cross-attn** | **0.500** | **0.500** | **0.000 (locked)** |
| BERT-large | 0.446 | 0.513 | +0.066 (increasing) |
| CLIP-ViT-B/16 | 0.503 | 0.538 | +0.035 (increasing) |
| DINOv2-large | 0.498 | 0.548 | +0.050 (increasing) |
| CLIP-ViT-bigG | 0.498 | 0.582 | +0.084 (increasing) |

**Critical Finding — Cross-Attention is Perfectly Balanced:**

T5-v1.1-XXL cross-attention QK manifold is exactly 0.500 positive / 0.500 negative at ALL 24 layers. Symmetry deviation is 1.414 (= √2) everywhere. This is a locked equilibrium — the bridge between encoder and decoder maintains perfect balance between attraction and repulsion at every depth. No other attention type shows this level of stability.

**T5-v1.1-XXL encoder self-attention is flat (~0.50 throughout).** Unlike T5-Small which decreased from 0.615 to 0.535, the XXL encoder stays near the equilibrium point. The larger model doesn't need to build anti-similarity boundaries because it has enough capacity to discriminate through other mechanisms.

**BERT starts BELOW 0.50 (0.446).** The only model with majority-repulsion from layer 0. MLM bidirectional training creates fundamentally different QK geometry from autoregressive or contrastive training.
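The positive-eigenvalue fraction per layer can be sketched as below (function name illustrative); for random-matrix-equilibrium weights it sits at ~0.500, the value the cross-attention lock holds:

```python
import numpy as np

def qk_positive_fraction(W_q, W_k):
    """Positive-eigenvalue fraction of the symmetric part of QK = W_Q·W_Kᵀ.

    Above 0.5 → attraction-dominated similarity; below 0.5 → repulsion-dominated."""
    qk = W_q @ W_k.T
    lam = np.linalg.eigvalsh((qk + qk.T) / 2.0)  # symmetric part eigenspectrum
    return float((lam > 0).mean())
```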

### VI.4 MLP Dead Neurons

**Formula:** Combined importance = ‖wᵢ_up‖₂ · ‖wᵢ_down‖₂ (ReLU) or ‖wᵢ_gate‖₂ · ‖wᵢ_up‖₂ · ‖wᵢ_down‖₂ (GeGLU). Dead if < 1% of mean.

| Model | Dead (<1% mean) | Weak (<10% mean) | Notes |
|---|---|---|---|
| T5-Small (enc+dec) | 0/24,576 (0.00%) | 0/24,576 (0.00%) | All neurons alive |
| T5-Base (enc+dec) | 0/73,728 (0.00%) | 0/73,728 (0.00%) | All neurons alive |
| T5-v1.1-XXL encoder | 0/245,760 (0.00%) | 0/245,760 (0.00%) | All neurons alive |
| T5-v1.1-XXL decoder | **14/245,760 (0.01%)** | **461/245,760 (0.19%)** | First dead neurons in T5 family |
| BERT-large | 0/98,304 (0.00%) | 0/98,304 (0.00%) | All neurons alive |
| DINOv2-large | 0/98,304 (0.00%) | 0/98,304 (0.00%) | All neurons alive |
| CLIP-ViT-B/16 | **1,316/36,864 (3.57%)** | 1,356/36,864 (3.68%) | Only model with significant dead neurons |
| CLIP-ViT-bigG | 0/393,216 (0.00%) | **24,163/393,216 (6.14%)** | 0 dead but 6% weak |

**Finding:** T5-v1.1-XXL decoder has the first dead neurons in the T5 family — 14 neurons in layers 1-2 only. The decoder's early GeGLU layers carved out a tiny amount of capacity. Encoder uses everything. CLIP-ViT-B/16 is the outlier with 3.6% dead neurons — contrastive training at small scale produces genuine pruning.
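The combined-importance count can be sketched as follows, covering both the ReLU and GeGLU cases (function name and layout conventions are illustrative):

```python
import numpy as np

def dead_neuron_count(w_down, w_up, w_gate=None, dead_frac=0.01):
    """Per-neuron combined importance; a neuron is dead below dead_frac of the mean.

    w_up / w_gate: [d_ff, d_model] (rows = hidden neurons);
    w_down: [d_model, d_ff] (columns = hidden neurons).
    Pass w_gate for GeGLU MLPs; leave it None for ReLU."""
    imp = np.linalg.norm(w_up, axis=1) * np.linalg.norm(w_down, axis=0)
    if w_gate is not None:                  # GeGLU: gate · up · down
        imp = imp * np.linalg.norm(w_gate, axis=1)
    return int((imp < dead_frac * imp.mean()).sum())
```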

### VI.5 Cross-Layer Weight Correlation

**Formula:** cos(flatten(Wᵢ), flatten(Wⱼ)) between weight matrices of the same type at different layers.

| Model | Q adj mean | K adj mean | MLP_up adj mean |
|---|---|---|---|
| T5-Small | ~0.000 | ~0.000 | 0.031–0.045 |
| T5-Base | ~0.000 | ~0.000 | 0.024–0.036 |
| T5-v1.1-XXL encoder | 0.0001 | — | — |
| T5-v1.1-XXL decoder | −0.0001 | — | — |
| BERT-large | 0.0002 | 0.0003 | 0.032 |
| CLIP-ViT-B/16 | −0.0004 (QKV) | — | 0.008 |
| DINOv2-large | −0.0003 | −0.0002 | 0.006 |
| CLIP-ViT-bigG | 0.0000 (QKV) | — | 0.055 |

**Universal finding:** Attention weights (Q, K, V) are completely uncorrelated across layers (~0.000). Every layer defines an independent similarity function. MLP weights show positive correlation decaying with distance — feedforward layers share structure.
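Adjacent-layer cosines over flattened weights are a short computation (a minimal sketch; function name illustrative):

```python
import numpy as np

def cross_layer_cosine(mats):
    """Cosine similarity between flattened same-type weight matrices at
    adjacent layers; mats is an ordered list of 2D arrays."""
    flats = [m.ravel() / np.linalg.norm(m) for m in mats]
    return [float(flats[i] @ flats[i + 1]) for i in range(len(flats) - 1)]
```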

### VI.6 Position Bias Topology

**T5 uses learned relative position biases:** [32 buckets × N_heads].

| Model | Encoder | Decoder |
|---|---|---|
| T5-Small (8 heads) | 3 local, 2 global, 3 mixed | 4 local, 4 global, 0 mixed |
| T5-Base (12 heads) | 4 local, 3 global, 5 mixed | 5 local, 4 global, 3 mixed |
| T5-v1.1-XXL (64 heads) | **24 local, 2 global, 38 mixed** | **27 local, 37 global, 0 mixed** |

**T5-v1.1-XXL position findings:**
- Encoder: 38/64 mixed heads — nuanced position sensitivity at scale
- **Decoder: ZERO mixed heads** — perfect binary crystallization. Every head is either pure local or pure global
- Decoder is 58% global (37/64) — overwhelmingly biased toward long-range attention
- Encoder range: [-47.2, 11.2] — strong local suppression
- Decoder range: [-28.4, 17.0] — more balanced

**Finding:** The decoder local/global binary split is scale-invariant (0 mixed at T5-Small, 0 mixed at XXL). Gradient descent crystallizes decoder position heads into two pure modes regardless of capacity.

---

## VII. Geometric Residual Modulator

### VII.1 Architecture

- Geometric embedding: [vocab_size, 64] — per-token geometric fingerprint
- Projection: Linear(64, d_model, bias=False) — Procrustes-aligned to encoder PCA space
- Alpha: per-layer learnable LERP coefficient, stored in logit space, applied via sigmoid
- Intervention: residual_out = (1 − α) · residual + α · proj(geo_embed(token_ids))
- Params: 2.09M (3.45% of T5-Small)
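A NumPy sketch of just the intervention step, for orientation — the actual modulator is a trained module, and `geo_proj` here stands in for proj(geo_embed(token_ids)):

```python
import numpy as np

def modulate_residual(residual, geo_proj, alpha_logit):
    """Per-layer geometric LERP as described above:
    out = (1 − α)·residual + α·geo_proj, with α stored in logit space
    and applied via a sigmoid. residual, geo_proj: [seq, d_model]."""
    alpha = 1.0 / (1.0 + np.exp(-alpha_logit))   # sigmoid keeps α in (0, 1)
    return (1.0 - alpha) * residual + alpha * geo_proj
```

Storing α as a logit keeps the blend coefficient unconstrained during optimization while guaranteeing a valid interpolation weight.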

### VII.2 Geometric Embedding Initialization

| Metric | Value |
|---|---|
| WN reconstruction correlation | 0.921 |
| Procrustes alignment cosine | 0.372 |
| Eigenvalue cumulative (top 64) | 61.3% |

### VII.3 Alpha Convergence

| Start α | Final Mean α | Layer 5 Final | Pearson Δ | CV | Coherent | Basin |
|---|---|---|---|---|---|---|
| 0.01 (20 ep) | **0.067** | **0.107** | **+0.151** | **0.220** | **Yes** | Binding |
| 0.20 (20 ep) | 0.222 | 0.308 | +0.085 | 0.452 | No | Ridge |
| 0.70 (20 ep) | 0.695 | 0.640 | -0.029 | 0.482 | No | Separation |
| 0.01 (100 ep) | 0.125 | 0.218 | +0.074 | 0.322 | No | Overfit |

### VII.4 Depth Gradient (Consistent Across All Runs)

| Layer | 20ep (α=0.01) | 100ep (α=0.01) | 20ep (α=0.20) |
|---|---|---|---|
| 0 | 0.015 | 0.035 | 0.170 |
| 1 | 0.052 | 0.061 | 0.180 |
| 2 | 0.066 | 0.102 | 0.227 |
| 3 | 0.080 | 0.137 | 0.197 |
| 4 | 0.080 | 0.197 | 0.248 |
| 5 | 0.107 | 0.218 | 0.308 |

**Finding:** Always monotonically increasing. The model wants minimal geometric modulation early and maximum modulation at the deepest layer. Geometry is a final correction, not an initial condition.

### VII.5 Best Result

| Metric | Original | Modulated (20ep, α=0.01 start) | Change |
|---|---|---|---|
| WordNet Pearson | 0.099 | **0.250** | **+152%** |
| WordNet Spearman | 0.085 | **0.245** | **+189%** |
| Semantic Gradient | 0.022 | **0.052** | **+132%** |
| Pentachoron CV | 0.202 | **0.220** | Stayed in band |
| Per-token Preservation | — | 0.730 | — |
| Coherence | Baseline | **Identical on 4/4 tests** | — |

---

## VIII. Geometric Field Modulator (Multi-Expert)

### VIII.1 Architecture

- Three KSimplexChannel experts: k=1 (edge, 2 features), k=2 (triangle, 4 features), k=4 (pentachoron, 11 features)
- **Multiplicative gating**: residual × Π(blended_gates) — valid regions pass, invalid suppressed
- **Soft blending**: per expert gate = (1 − α) + α × expert_gate
- **Null space**: 25% of residual dimensions untouched by modulator
- **Alpha clamped**: [0.001, 0.35] — hard ceiling below the phase boundary
- **Gradient scaling**: geometric params at 10% LR, alpha at 50% LR, gates at full LR
- Params: **38,552** (0.064% of T5-Small)
- Self-test: validity=0.985, null space preserved, template volumes sane

### VIII.2 Design Rationale (Grounded in Cross-Architecture Data)

| Data Point | Design Decision |
|---|---|
| Q sparsity 100% at scale | Geometric field can replace Q — the model barely uses it |
| Cross-attn QK locked at 0.500 | Target equilibrium for geometric validity gating |
| Depth gradient always increasing | Per-layer alpha respects this (low early, high late) |
| Zero dead MLP neurons | Don't touch MLPs — all capacity is in use |
| Decoder position: binary L/G split | Modulator preserves positional structure (null space) |
| CV 0.20–0.23 universal | CV monitoring as health check, not loss |

---

## IX. The 0.29154 Constant

### IX.1 Observations Across Systems

| System | Context | Value |
|---|---|---|
| MinimalShunts | CLIP-L ↔ CLIP-G projection gate | Emergent equilibrium |
| Wormhole Lambda | Vision transformer training | Converges from 0.74 toward ~0.29 |
| Alpha curriculum | Devil's Staircase PE training | Converges to ~0.50 under geometric loss, CE destroys |
| T5 generation | Greedy decode alpha sweep | Stable plateau at 0.291–0.292, semantic phase transition |
| Alpha training basins | 0.70 start → settled at 0.695 | Mirror constant 1 − 0.29154 = 0.70846, Δ = 0.013 |

### IX.2 T5 Generation Phase Transition

| Alpha | Output (triangle prompt) |
|---|---|
| 0.01–0.10 | "...three edges and three vertices. it is one of the basic shapes in geometry." |
| 0.20 | "**a** triangle is a polygon with three edges and three vertices..." |
| 0.28 | "a polygon with three vertices. it is one of the basic shapes in **a graph**." |
| 0.291 | "a triangle is a polygon with a vertice and a vertice. it is one of the basic shapes in **a graph**." |
| 0.2915 | "a triangle is a polygon with a vertice and a vertice. it is one of the basic shapes in **a graph**." |
| 0.292 | "a triangle is a polygon with a vertice and a vertice. it is one of the basic shapes in **the world**." |
| 0.30 | "a polygon with a vertice and a vertice. it is one of the basic shapes in the world." |

**Finding:** 0.29154 marks the phase boundary between structural representation ("graph") and physical representation ("world"). Output is invariant to perturbation in a narrow band centered on the constant.

---

## X. Universal Geometric Constants

| Constant | Value | Observed In |
|---|---|---|
| Pentachoron CV | 0.20–0.23 | T5-Small, Qwen 0.8B, Qwen 4B, trained modulator |
| Participation / dim | 0.53–0.56 | T5-Small, Qwen 0.8B |
| Binding/separation constant | 0.29154 / 0.70846 | MinimalShunts, CLIP projections, T5 generation, alpha convergence |
| Depth gradient | Monotonic increasing | All modulator training runs |
| Q sparsity scaling (T5) | 93.7% → 99.4% → 100.0% | T5-Small → T5-Base → T5-v1.1-XXL |
| Q sparsity asymmetry | **T5 pretraining only** | Present in T5, absent in T5Gemma2, BERT, DINOv2, UNets, VAEs |
| Cross-modal QK balance | **Locked at 0.500** | T5-v1.1-XXL cross-attn, T5Gemma2 (both), SD 1.5 UNet, SDXL UNet (6 models) |
| Self-attn QK: adapted models | **Locked at 0.500** | T5Gemma2 1B (all 53 layers), T5Gemma2 4B (all 68 layers) |
| UNet QK U-gradient | down→repulsion, up→attraction | SD 1.5 (0.451→0.581), SDXL (0.477→0.549) |
| VAE decoder QK | Repulsion-biased | SD 1.5 (0.486), SDXL (0.416), Flux.1 (0.451), Flux.2 (0.416) |
| Attention cross-layer corr | ~0.000 | ALL 17 models, including UNets and VAEs |
| Conv cross-layer corr | ~0.000 | All UNets and VAEs (extends to pure convnets) |
| MLP/FF full utilization | 0.00% dead | T5 family (enc), BERT, DINOv2, UNets, all VAEs |
| Decoder position crystallization | 0 mixed heads | T5-Small, T5-v1.1-XXL |
| VAE spectral invariant | Pearson 0.94–0.98 | All 6 VAE pairs — SV distribution is architecture-determined |
| VAE Procrustes alignment | 70–76% cosine | All 6 pairs — same solution in different coordinate systems |

---

## XI. Measurement Toolkit Reference

| Tool | Input | Output | Requires Inference |
|---|---|---|---|
| Participation Ratio | Embedding matrix | Effective dimensionality | No |
| Cayley-Menger Volume | 5-point subsets of embeddings | Simplex volume + CV | No |
| Pairwise Cosine | Embedding matrix (sampled) | Similarity distribution | No |
| Digit Manifold | 10 digit token embeddings | \|i−j\| correlation, adjacency gap | No |
| SVD Effective Rank | Any 2D weight matrix | Stable rank, condition number | No |
| QK Manifold | W_Q, W_K matrices | Eigenspectrum, pos/neg balance | No |
| Dead Neuron Count | MLP wi/gate/up, wo matrices | Combined importance distribution | No |
| Cross-Layer Correlation | Same-type weight matrices | Adjacent cosine similarity | No |
| Position Bias Topology | Relative attention bias tensor | Local/global/mixed head counts | No |
| Sparsity Topology | Any weight matrix | Fraction below threshold | No |
| WordNet Relational | Encoder output (mean-pooled) | Pearson/Spearman vs path similarity | Yes |
| Alpha Convergence | Modulator training loop | Per-layer equilibrium values | Yes (training) |

---

## XII. T5Gemma2 — Decoder-Adapted Encoder-Decoder

**Architecture:** Gemma 2 decoder weights adapted to encoder-decoder. GQA (grouped query attention), RoPE, GeGLU MLPs. Multimodal (ViT in encoder).

### XII.1 Sparsity

| Model | Q (<0.1) | K (<0.1) | V (<0.1) | Pattern |
|---|---|---|---|---|
| T5Gemma2 1B-1B | 100.0% | 99.9% | 100.0% | **Uniform** |
| T5Gemma2 4B-4B | 100.0% | 100.0% | 100.0% | **Uniform** |

**Finding:** No Q/K asymmetry. The T5 Q sparsity pattern is ABSENT when the encoder is initialized from decoder weights. The asymmetry is a property of T5's span corruption pretraining, not the encoder-decoder architecture.

### XII.2 QK Manifold

| Model | Encoder Self | Decoder Self | All Layers |
|---|---|---|---|
| T5Gemma2 1B | 0.500 (±0.001) | 0.500 (±0.001) | **Locked** |
| T5Gemma2 4B | 0.500 exact | 0.500 exact | **Locked** |

**Finding:** Perfect 0.500 lock across ALL layers in BOTH encoder and decoder. Symmetry deviation √2 everywhere. The Gemma 2 initialization left the QK matrices near random-matrix equilibrium. The adaptation to encoder-decoder didn't perturb them enough to break Wigner semicircle symmetry.

### XII.3 Other Invariants

- Dead neurons: 0/359,424 (1B), 0/696,320 (4B) — all alive
- Cross-layer Q correlation: ~0.000 — confirmed universal
- MLP utilization: 100% (1 weak neuron each in enc L6 and dec L6 at 4B scale)
- GQA: 4:1 at 1B scale, 2:1 at 4B scale

---

## XIII. Diffusion UNet Weight Topology

### XIII.1 UNet Sparsity

| Model | Self Q | Self K | Self V | Cross Q | Cross K | Cross V |
|---|---|---|---|---|---|---|
| SD 1.5 UNet | **90.5%** | **90.9%** | 97.1% | 96.8% | 94.9% | 98.9% |
| SDXL UNet | 99.9% | 99.9% | 100.0% | 100.0% | 100.0% | 100.0% |

**SD 1.5 is the least sparse UNet in the battery.** 90.5% for self-attention Q — below T5-Small's 93.7%. A parameter-starved model (860M for 512×512 image generation) uses denser weights. SDXL, at 3× the parameters, reaches near-100%.

**Sparsity traces the U-path (SD 1.5):** down=88.9%, mid=99.3%, up=89.4%. The bottleneck has the most diffuse weights; the periphery has the densest.

### XIII.2 UNet QK Manifold — The U-Shape

**Self-attention positive eigenvalue fraction through the UNet path:**

| Position | SD 1.5 | SDXL |
|---|---|---|
| down (early) | 0.509 | ~0.49 |
| down (deep) | **0.451** | **0.483** |
| mid (bottleneck) | **0.483** | **0.477** |
| up (early) | 0.501 | 0.501 |
| up (late) | **0.581** | **0.549** |

The QK manifold traces the U-shape: repulsion-dominated downpath (compressing, discriminating), maximum repulsion at bottleneck, rising to attraction-dominated uppath (reconstructing, grouping). SD 1.5 shows the wider swing (0.451→0.581 = 0.130 range) because it's more parameter-starved.

**Cross-attention: locked at 0.500 in both UNets.** SD 1.5: mean=0.501, std=0.001. SDXL: mean=0.500, std=0.001. The fifth and sixth confirmations of the cross-modal QK lock.

### XIII.3 Other UNet Invariants

- Dead neurons: 0/23,040 (SD 1.5), 0/163,840 (SDXL)
- Cross-block Q correlation: ~0.000 (both self-attn and cross-attn)
- SDXL cross-attn Q stable rank: 13.97 (lowest of any weight type) — extremely concentrated queries to text
- SDXL cross-attn V: highest stable rank (165.9) and lowest condition number (15.8) — richest value matrices

---

## XIV. VAE Weight Topology

### XIV.1 Cross-VAE Comparison

| VAE | Params | Latent Ch | Enc sparsity (<0.1) | Dec sparsity (<0.1) | Enc QK pos | Dec QK pos |
|---|---|---|---|---|---|---|
| SD 1.5 | 83.7M | 4 | 98.6% | 99.1% | 0.496 | 0.486 |
| SDXL | 83.7M | 4 | **29.0%** | **38.1%** | 0.502 | **0.416** |
| Flux.1 | 83.8M | 16 | 96.5% | 97.5% | 0.498 | **0.451** |
| Flux.2 | 84.0M | 32 | 94.3% | 94.3% | **0.393** | **0.416** |

**SDXL VAE is the densest model measured.** 29% encoder sparsity at the 0.1 threshold. Identical architecture and param count to SD 1.5, but the weights are 3× denser. Attention condition numbers reach 1.16M.
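
The sparsity columns count weights below the 0.1 threshold. A minimal sketch, assuming the figure is the fraction of raw weight magnitudes under the threshold (the battery scripts may normalize first); `inactive_fraction` is an illustrative name:

```python
import numpy as np

def inactive_fraction(w: np.ndarray, thresh: float = 0.1) -> float:
    """Fraction of weights with |w| below `thresh` — the per-matrix
    sparsity figure."""
    return float((np.abs(w) < thresh).mean())

# Gaussian weights at transformer-scale std (~0.02) sit almost entirely
# below 0.1; a dense matrix like SDXL VAE's implies far larger magnitudes.
rng = np.random.default_rng(0)
w = 0.02 * rng.standard_normal((512, 512))
frac = inactive_fraction(w)   # near 1.0
```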

### XIV.2 VAE Decoder QK Breaks Toward Repulsion

| VAE | Latent Ch | Decoder QK pos | Interpretation |
|---|---|---|---|
| SD 1.5 | 4 | 0.486 | Slight repulsion |
| SDXL | 4 (1024² target) | **0.416** | Strong repulsion — 4× reconstruction challenge |
| Flux.1 | 16 | **0.451** | Moderate repulsion |
| Flux.2 | 32 | **0.416** | Strong repulsion — most channels to separate |

Decoder bottleneck attention breaks symmetry toward repulsion. Reconstruction requires spatial discrimination — more negative eigenvalues = finer spatial separation. More latent channels or higher target resolution → stronger repulsion.

**Flux.1 decoder anomaly:** Top eigenvalue = 60,807 (typical range: 2–150). One attention direction completely dominates — the attention space is effectively rank-1.

### XIV.3 VAE Invariants

- Zero dead neurons across all four VAEs
- Conv filter utilization: 100% (active fraction 1.000)
- Cross-layer conv correlation: ~0.000 — universal, extends to pure convnets
- Spectral correlation between VAEs: 0.94–0.98 — architecture determines SV distribution
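
The cross-layer correlation invariant is a plain Pearson correlation over flattened weight tensors. A sketch — the battery's exact layer pairing is not specified here, so truncating to the shorter tensor is an assumption:

```python
import numpy as np

def cross_layer_corr(w1: np.ndarray, w2: np.ndarray) -> float:
    """Pearson correlation between two flattened weight tensors,
    truncated to the shorter length."""
    a, b = w1.ravel(), w2.ravel()
    n = min(a.size, b.size)
    return float(np.corrcoef(a[:n], b[:n])[0, 1])

# Independently initialized (or independently trained) layers sit near 0,
# which is the ~0.000 invariant reported across every architecture family.
rng = np.random.default_rng(0)
layer_a = rng.standard_normal(10_000)
layer_b = rng.standard_normal(10_000)
c = cross_layer_corr(layer_a, layer_b)
```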

---

## XV. Procrustes Analysis — VAE Weight-Space Alignment

### XV.1 Methodology

**Orthogonal Procrustes:** For each common weight matrix (same name, same shape), find the orthogonal R minimizing ‖A − BR‖_F via the SVD of BᵀA. Report the residual (0 = identical up to rotation, √2 = orthogonal) and the cosine after alignment.

**Spectral correlation:** Pearson correlation of normalized singular value distributions.
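
Both metrics fit in a few lines of numpy (`scipy.linalg.orthogonal_procrustes` computes the same R); a sketch with illustrative names:

```python
import numpy as np

def procrustes_align(a: np.ndarray, b: np.ndarray):
    """Orthogonal Procrustes: R = U V^T from the SVD of B^T A minimizes
    ||A - B R||_F. Returns (R, residual, cosine), with the residual
    normalized so 0 = identical up to rotation and sqrt(2) = orthogonal."""
    u, _, vt = np.linalg.svd(b.T @ a)
    r = u @ vt
    br = b @ r
    an, brn = a / np.linalg.norm(a), br / np.linalg.norm(br)
    cos = float((an * brn).sum())
    resid = float(np.linalg.norm(an - brn))
    return r, resid, cos

def spectral_corr(a: np.ndarray, b: np.ndarray) -> float:
    """Pearson correlation of normalized singular-value distributions."""
    sa = np.linalg.svd(a, compute_uv=False)
    sb = np.linalg.svd(b, compute_uv=False)
    n = min(len(sa), len(sb))
    sa, sb = sa[:n] / sa.sum(), sb[:n] / sb.sum()
    return float(np.corrcoef(sa, sb)[0, 1])

# A rotated copy of A aligns perfectly: residual ~0, cosine ~1.
rng = np.random.default_rng(0)
a = rng.standard_normal((64, 32))
q, _ = np.linalg.qr(rng.standard_normal((32, 32)))  # random orthogonal basis
_, resid, cos = procrustes_align(a, a @ q)
```

The rotated-copy check is exactly the "different basis, same function" scenario the pairwise table below is probing: raw cosine between `a` and `a @ q` is near zero, yet alignment is perfect.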

### XV.2 Pairwise Results

| Pair | Raw Cosine | Procrustes Cosine | Rotation Gain | Spectral Corr |
|---|---|---|---|---|
| SD1.5 vs SDXL | 0.053 | 0.697 | +0.644 | 0.958 |
| SD1.5 vs Flux.1 | 0.091 | 0.730 | +0.640 | 0.964 |
| **SD1.5 vs Flux.2** | **-0.000** | **0.757** | **+0.757** | **0.979** |
| SDXL vs Flux.1 | 0.024 | 0.675 | +0.650 | 0.939 |
| SDXL vs Flux.2 | -0.001 | 0.705 | +0.705 | 0.937 |
| Flux.1 vs Flux.2 | 0.000 | 0.736 | +0.736 | 0.957 |

### XV.3 Key Findings

**1. Raw cosine is zero.** All pairs. Weights are orthogonal in raw space. Naive comparison says these VAEs share nothing. This is wrong.

**2. After Procrustes rotation, 70–76% of structure aligns.** These models found the SAME geometric solution, expressed in different coordinate systems. Different initialization → different basis → same function.

**3. Spectral correlation is 0.94–0.98.** Singular value distributions are nearly identical across all pairs. The "shape" of each weight matrix — rank structure, energy distribution — is architecture-determined, not training-determined.

**4. SD 1.5 vs Flux.2 is the most alignable pair.** Raw cosine literally zero, but highest Procrustes cosine (0.757) and highest spectral correlation (0.979). The most different training produces the most alignable weights. Shared structure is deepest when surface differences are greatest.

**5. SDXL is the geometric outlier.** Lowest Procrustes cosine with every model (0.675–0.705). Found a more distant basin despite identical architecture to SD 1.5.

### XV.4 Distance Matrices

**Procrustes Residual (lower = more similar):**

| | SD 1.5 | SDXL | Flux.1 | Flux.2 |
|---|---|---|---|---|
| SD 1.5 | 0.000 | 0.752 | 0.707 | 0.679 |
| SDXL | 0.752 | 0.000 | 0.774 | 0.739 |
| Flux.1 | 0.707 | 0.774 | 0.000 | 0.699 |
| Flux.2 | 0.679 | 0.739 | 0.699 | 0.000 |

**Spectral Correlation (higher = more similar):**

| | SD 1.5 | SDXL | Flux.1 | Flux.2 |
|---|---|---|---|---|
| SD 1.5 | 1.000 | 0.958 | 0.964 | 0.979 |
| SDXL | 0.958 | 1.000 | 0.939 | 0.937 |
| Flux.1 | 0.964 | 0.939 | 1.000 | 0.957 |
| Flux.2 | 0.979 | 0.937 | 0.957 | 1.000 |

### XV.5 Implication for Geometric Transfer

A geometric field modulator trained on one VAE can be ROTATED to work on another via the Procrustes R matrix. 70–76% structural alignment means the modulator captures the shared geometric invariant. The remaining 24–30% is model-specific — the unique basin each training run found.
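
A sketch of the rotation step, assuming the modulator's output is an additive weight-space term Δ_A for a matrix of model A with A ≈ BR: the equivalent term in B's basis is Δ_B = Δ_A Rᵀ. All names here are hypothetical:

```python
import numpy as np

def transfer_modulator(delta_a: np.ndarray, r: np.ndarray) -> np.ndarray:
    """Rotate an additive weight modulation from model A's basis into
    model B's via the Procrustes rotation R (A ~ B R  =>  B ~ A R^T)."""
    return delta_a @ r.T

# Round trip: rotating back with R recovers the original modulation.
rng = np.random.default_rng(2)
delta_a = rng.standard_normal((16, 16))
r, _ = np.linalg.qr(rng.standard_normal((16, 16)))   # stand-in orthogonal R
delta_b = transfer_modulator(delta_a, r)
```

Since only 70–76% of structure aligns, the rotated modulator is a warm start on the target VAE, not an exact transfer.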

---

## XVI. Scripts Reference

| Script | Purpose | Key Outputs |
|---|---|---|
| `probe_t5_small_terrain.py` | T5-Small embedding + layer geometry | PR, CV, digit manifold, layer evolution |
| `probe_t5_wordnet_summarize.py` | T5-Small × WordNet relational alignment | Pearson, Spearman, distance bands, hypernym decay |
| `probe_t5_wordnet_50seeds.py` | 50-seed stability test (GPU-accelerated) | Confidence intervals for all relational metrics |
| `probe_t5_inactive_weights.py` | T5-Small/Base inactive weight topology | SVD, sparsity, QK manifold, dead neurons |
| `cross_architecture_weight_battery.py` | BERT + CLIP + DINOv2 battery | Cross-model comparison table |
| `probe_flux_t5_g4.py` | T5-v1.1-XXL (Flux encoder) full battery | All layers, encoder + decoder + cross-attn |
| `geometric_residual_modulator.py` | LERP modulator + training utilities | Modulator class + measurement tools |
| `geometric_field_modulator.py` | Multi-expert field modulator | KSimplex experts + multiplicative gating |
| `geometric_modulator_full_pipeline.py` | Self-contained T5 + WordNet + modulator | End-to-end pipeline |
| `train_modulator.py` | Training loop for alpha convergence | Freeze T5, train modulator, track alpha |
| `probe_t5gemma2.py` | T5Gemma2 battery (both scales) | GQA handling, adapted enc-dec topology |
| `probe_unet_geometry.py` | SD 1.5 / SDXL UNet battery | U-path QK gradient, cross-attn lock |
| `probe_vae_geometry.py` | All four VAE battery | Conv reshape, bottleneck attention, latent comparison |
| `procrustes_vae_analysis.py` | Pairwise Procrustes on 4 VAEs | Distance matrices, depth profiles, rotation gain |

---

*Last updated: 2026-03-06*
*Models profiled: 17 (T5-Small, T5-Base, T5-v1.1-XXL, BERT-large, CLIP-ViT-B/16, DINOv2-large, CLIP-ViT-bigG, Qwen3.5-0.8B, Qwen3.5-4B, T5Gemma2-1B, T5Gemma2-4B, SD 1.5 UNet, SDXL UNet, SD 1.5 VAE, SDXL VAE, Flux.1 VAE, Flux.2 VAE)*
*Architecture families: 5 (Transformer enc-dec, encoder-only/vision, adapted enc-dec, conv UNet, conv autoencoder)*
*Training objectives: 6 (span corruption, MLM, contrastive, self-supervised, diffusion, reconstruction)*
*Procrustes analysis: 6 VAE pairs, 68 weight matrices each*
*Modulator experiments: 4 LERP configurations, 1 field modulator*