AbstractPhil committed
Commit f04e79b · verified · 1 Parent(s): d303fb3

Update README.md

Files changed (1): README.md +186 -13
README.md CHANGED
@@ -3,8 +3,7 @@ license: mit
---

# Day 2
-
- # Geometric Terrain Statistics Composite Update; 9 models

## Document Purpose

@@ -25,11 +24,23 @@ Running catalog of geometric measurements across language and vision models. Eac
| CLIP-ViT-bigG/14 | 1.84B (visual) | — | 1664 | 48 | 16 | Vision encoder (fused QKV) | LAION-2B contrastive |
| Qwen3.5-0.8B | 853M | 248,320 | 1024 | — | — | DeltaNet + MoE + ViT | Multilingual + Vision |
| Qwen3.5-4B | ~4B | 248,320 | 2560 | — | — | DeltaNet + MoE + ViT | Multilingual + Vision |

**Notes:**
- T5-v1.1-XXL encoder is the text encoder used by Flux.1 Schnell, Flux.1 Dev, and Flux.2
- CLIP models use fused QKV (`in_proj_weight`); Q/K/V split by thirds for analysis
- T5-v1.1 uses GeGLU (wi_0 gate + wi_1 value) instead of ReLU (single wi)

---

@@ -449,11 +460,17 @@ T5-v1.1-XXL cross-attention QK manifold is exactly 0.500 positive / 0.500 negati
| Binding/separation constant | 0.29154 / 0.70846 | MinimalShunts, CLIP projections, T5 generation, alpha convergence |
| Depth gradient | Monotonic increasing | All modulator training runs |
| Q sparsity scaling (T5) | 93.7% → 99.4% → 100.0% | T5-Small → T5-Base → T5-v1.1-XXL |
- | Cross-attn QK balance | Locked at 0.500 | T5-v1.1-XXL (all 24 layers) |
- | Attention cross-layer corr | ~0.000 | ALL models profiled (8 models) |
- | MLP cross-layer corr | 0.006–0.055 (positive, decays) | ALL models profiled |
| Decoder position crystallization | 0 mixed heads | T5-Small, T5-v1.1-XXL |
- | MLP full utilization | 0.00% dead neurons | T5 family (enc), BERT, DINOv2 |

---

@@ -476,17 +493,173 @@ T5-v1.1-XXL cross-attention QK manifold is exactly 0.500 positive / 0.500 negati

---

- ## XII. Scripts Reference

- | Script | Purpose |
- |---|---|
- | `bulk_experiments_sloppy_with_results.ipynb` | Original sloppy experiment notebook with scattered results. |
- | `experiment_bulk_claude_generated.ipynb` | Notebook rewritten by Claude for consumption by ablation studies and comparative utility. |


---

*Last updated: 2026-03-06*
- *Models profiled: 9 (T5-Small, T5-Base, T5-v1.1-XXL, BERT-large, CLIP-ViT-B/16, DINOv2-large, CLIP-ViT-bigG, Qwen3.5-0.8B, Qwen3.5-4B)*
- *Cross-architecture battery: 7 models, 4 training objectives (MLM, span corruption, contrastive, self-supervised)*
*Modulator experiments: 4 LERP configurations, 1 field modulator*

---

# Day 2
+ # Geometric Terrain Statistics Composite

## Document Purpose

| CLIP-ViT-bigG/14 | 1.84B (visual) | — | 1664 | 48 | 16 | Vision encoder (fused QKV) | LAION-2B contrastive |
| Qwen3.5-0.8B | 853M | 248,320 | 1024 | — | — | DeltaNet + MoE + ViT | Multilingual + Vision |
| Qwen3.5-4B | ~4B | 248,320 | 2560 | — | — | DeltaNet + MoE + ViT | Multilingual + Vision |
+ | T5Gemma2-1B-1B | 2.1B | 262,144 | 1152 | 27+26 | GQA 4:1 | Adapted enc-dec (Gemma 2, RoPE, GeGLU) | Gemma 2 decoder → enc-dec |
+ | T5Gemma2-4B-4B | 7.5B | 262,144 | 2560 | 34+34 | GQA 2:1 | Adapted enc-dec (Gemma 2, RoPE, GeGLU) | Gemma 2 decoder → enc-dec |
+ | SD 1.5 UNet | 860M | — | [320,640,1280,1280] | 16 attn blocks | 8 | Conv UNet + self/cross attn | LDM diffusion (LAION) |
+ | SDXL UNet | 2.6B | — | [320,640,1280] | 70 attn blocks | [5,10,20] | Conv UNet + self/cross attn | LDM diffusion (internal) |
+ | SD 1.5 VAE | 83.7M | — | 4 latent ch | [128,256,512,512] | — | Conv autoencoder + mid attn | Reconstruction (LAION) |
+ | SDXL VAE | 83.7M | — | 4 latent ch | [128,256,512,512] | — | Conv autoencoder + mid attn | Reconstruction (internal) |
+ | Flux.1 VAE | 83.8M | — | 16 latent ch | [128,256,512,512] | — | Conv autoencoder + mid attn | Reconstruction (BFL) |
+ | Flux.2 VAE | 84.0M | — | 32 latent ch | [128,256,512,512] | — | Conv autoencoder + mid attn | Reconstruction (BFL) |

**Notes:**
- T5-v1.1-XXL encoder is the text encoder used by Flux.1 Schnell, Flux.1 Dev, and Flux.2
- CLIP models use fused QKV (`in_proj_weight`); Q/K/V split by thirds for analysis
- T5-v1.1 uses GeGLU (wi_0 gate + wi_1 value) instead of ReLU (single wi)
+ - T5Gemma2 models are Gemma 2 decoder weights adapted to encoder-decoder; include ViT vision tower
+ - UNet attention: attn1 = self-attention (spatial), attn2 = cross-attention (to text encoder)
+ - VAE Conv2d weights reshaped to 2D as [out_channels, in_channels * kH * kW] for analysis
+ - VAE attention exists only at the bottleneck (mid_block) — one in encoder, one in decoder

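The two reshaping conventions in the notes can be sketched in NumPy; the shapes below are illustrative stand-ins, not actual checkpoint tensors:

```python
import numpy as np

rng = np.random.default_rng(0)

# VAE Conv2d kernel [out_channels, in_channels, kH, kW] flattened to the
# 2D matrix [out_channels, in_channels * kH * kW] used for the analyses.
conv_w = rng.normal(size=(128, 64, 3, 3))      # illustrative shape
conv_2d = conv_w.reshape(conv_w.shape[0], -1)  # (128, 576)

# CLIP fused QKV: in_proj_weight stacks Q, K, V along dim 0,
# so splitting by thirds recovers the three projection matrices.
d_model = 96                                   # illustrative width
in_proj_weight = rng.normal(size=(3 * d_model, d_model))
w_q, w_k, w_v = np.split(in_proj_weight, 3, axis=0)  # each (96, 96)
```

Any spectral or sparsity statistic defined for 2D weight matrices then applies uniformly to conv kernels and attention projections.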
---

| Binding/separation constant | 0.29154 / 0.70846 | MinimalShunts, CLIP projections, T5 generation, alpha convergence |
| Depth gradient | Monotonic increasing | All modulator training runs |
| Q sparsity scaling (T5) | 93.7% → 99.4% → 100.0% | T5-Small → T5-Base → T5-v1.1-XXL |
+ | Q sparsity asymmetry | **T5 pretraining only** | Present in T5, absent in T5Gemma2, BERT, DINOv2, UNets, VAEs |
+ | Cross-modal QK balance | **Locked at 0.500** | T5-v1.1-XXL cross-attn, T5Gemma2 (both), SD 1.5 UNet, SDXL UNet (6 models) |
+ | Self-attn QK: adapted models | **Locked at 0.500** | T5Gemma2 1B (all 53 layers), T5Gemma2 4B (all 68 layers) |
+ | UNet QK U-gradient | down→repulsion, up→attraction | SD 1.5 (0.451→0.581), SDXL (0.477→0.549) |
+ | VAE decoder QK | Repulsion-biased | SD 1.5 (0.486), SDXL (0.416), Flux.1 (0.451), Flux.2 (0.416) |
+ | Attention cross-layer corr | ~0.000 | ALL 17 models, including UNets and VAEs |
+ | Conv cross-layer corr | ~0.000 | All UNets and VAEs (extends to pure convnets) |
+ | MLP/FF full utilization | 0.00% dead | T5 family (enc), BERT, DINOv2, UNets, all VAEs |
| Decoder position crystallization | 0 mixed heads | T5-Small, T5-v1.1-XXL |
+ | VAE spectral invariant | Pearson 0.94–0.98 | All 6 VAE pairs — SV distribution is architecture-determined |
+ | VAE Procrustes alignment | 70–76% cosine | All 6 pairs — same solution in different coordinate systems |

---


---

+ ## XII. T5Gemma2 — Decoder-Adapted Encoder-Decoder

+ **Architecture:** Gemma 2 decoder weights adapted to encoder-decoder. GQA (grouped query attention), RoPE, GeGLU MLPs. Multimodal (ViT in encoder).

+ ### XII.1 Sparsity

+ | Model | Q (<0.1) | K (<0.1) | V (<0.1) | Pattern |
+ |---|---|---|---|---|
+ | T5Gemma2 1B-1B | 100.0% | 99.9% | 100.0% | **Uniform** |
+ | T5Gemma2 4B-4B | 100.0% | 100.0% | 100.0% | **Uniform** |

+ **Finding:** No Q/K asymmetry. The T5 Q sparsity pattern is ABSENT when the encoder is initialized from decoder weights. The asymmetry is a property of T5's span corruption pretraining, not the encoder-decoder architecture.
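The sparsity columns report the fraction of weight magnitudes below the 0.1 threshold. A minimal sketch of that statistic (the `sparsity_at` helper is hypothetical; random matrices stand in for real layers):

```python
import numpy as np

def sparsity_at(w, threshold=0.1):
    """Fraction of entries with |w| below the threshold."""
    return float(np.mean(np.abs(w) < threshold))

rng = np.random.default_rng(0)
# A layer drawn with std 0.02 (a typical transformer init scale) sits
# almost entirely below 0.1; std 0.1 leaves roughly a third above it.
tight = rng.normal(scale=0.02, size=(1024, 1024))
wide = rng.normal(scale=0.1, size=(1024, 1024))
```

Applied per Q/K/V projection of a real checkpoint, this is the statistic the table's percentages correspond to.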
+ ### XII.2 QK Manifold

+ | Model | Encoder Self | Decoder Self | All Layers |
+ |---|---|---|---|
+ | T5Gemma2 1B | 0.500 (±0.001) | 0.500 (±0.001) | **Locked** |
+ | T5Gemma2 4B | 0.500 exact | 0.500 exact | **Locked** |

+ **Finding:** Perfect 0.500 lock across ALL layers in BOTH encoder and decoder. Symmetry deviation √2 everywhere. The Gemma 2 initialization left the QK matrices near random-matrix equilibrium. The adaptation to encoder-decoder didn't perturb them enough to break Wigner semicircle symmetry.
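The positive-fraction statistic is not restated in this section, so the sketch below assumes it is the fraction of positive eigenvalues of the symmetrized Q·Kᵀ product; for i.i.d. random weights that fraction sits near the 0.500 lock:

```python
import numpy as np

def qk_positive_fraction(w_q, w_k):
    """Assumed statistic: fraction of positive eigenvalues of the
    symmetrized W_Q @ W_K.T product (symmetrizing makes them real)."""
    m = w_q @ w_k.T
    eig = np.linalg.eigvalsh((m + m.T) / 2.0)
    return float(np.mean(eig > 0))

rng = np.random.default_rng(0)
w_q = rng.normal(size=(256, 256))
w_k = rng.normal(size=(256, 256))
frac = qk_positive_fraction(w_q, w_k)  # near 0.5 at random-matrix equilibrium
```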
+ ### XII.3 Other Invariants

+ - Dead neurons: 0/359,424 (1B), 0/696,320 (4B) — all alive
+ - Cross-layer Q correlation: ~0.000 — confirmed universal
+ - MLP utilization: 100% (1 weak neuron each in enc L6 and dec L6 at 4B scale)
+ - GQA: 4:1 at 1B scale, 2:1 at 4B scale
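The cross-layer correlation invariant reduces to a one-liner; this sketch assumes plain Pearson correlation between flattened same-shape weight matrices (random stand-ins here, where independence likewise gives ~0.000):

```python
import numpy as np

def cross_layer_corr(w_a, w_b):
    """Pearson correlation between two same-shaped layers' flattened weights."""
    return float(np.corrcoef(w_a.ravel(), w_b.ravel())[0, 1])

rng = np.random.default_rng(0)
layer_a_q = rng.normal(size=(512, 512))   # stand-in for one layer's W_Q
layer_b_q = rng.normal(size=(512, 512))   # stand-in for another layer's W_Q
r = cross_layer_corr(layer_a_q, layer_b_q)  # ~0.000 for independent layers
```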
+ ---

+ ## XIII. Diffusion UNet Weight Topology

+ ### XIII.1 UNet Sparsity

+ | Model | Self Q | Self K | Self V | Cross Q | Cross K | Cross V |
+ |---|---|---|---|---|---|---|
+ | SD 1.5 UNet | **90.5%** | **90.9%** | 97.1% | 96.8% | 94.9% | 98.9% |
+ | SDXL UNet | 99.9% | 99.9% | 100.0% | 100.0% | 100.0% | 100.0% |

+ **SD 1.5 is the least sparse model in the battery apart from the SDXL VAE.** 90.5% for self-attention Q — below T5-Small's 93.7%. A parameter-starved model (860M for 512×512 image generation) uses denser weights. SDXL at 3× the params reaches near-100%.

+ **Sparsity traces the U-path (SD 1.5):** down = 88.9%, mid = 99.3%, up = 89.4%. The bottleneck holds the sparsest weights; the periphery holds the densest.

+ ### XIII.2 UNet QK Manifold — The U-Shape

+ **Self-attention positive eigenvalue fraction through the UNet path:**

+ | Position | SD 1.5 | SDXL |
+ |---|---|---|
+ | down (early) | 0.509 | ~0.49 |
+ | down (deep) | **0.451** | **0.483** |
+ | mid (bottleneck) | **0.483** | **0.477** |
+ | up (early) | 0.501 | 0.501 |
+ | up (late) | **0.581** | **0.549** |

+ The QK manifold traces the U-shape: a repulsion-dominated downpath (compressing, discriminating), maximum repulsion around the deep down blocks and bottleneck, rising to an attraction-dominated uppath (reconstructing, grouping). SD 1.5 shows the wider swing (0.451→0.581, a 0.130 range) because it is more parameter-starved.

+ **Cross-attention: locked at 0.500 in both UNets.** SD 1.5: mean = 0.501, std = 0.001. SDXL: mean = 0.500, std = 0.001. The fifth and sixth confirmations of the cross-modal QK lock.

+ ### XIII.3 Other UNet Invariants

+ - Dead neurons: 0/23,040 (SD 1.5), 0/163,840 (SDXL)
+ - Cross-block Q correlation: ~0.000 (both self-attn and cross-attn)
+ - SDXL cross-attn Q stable rank: 13.97 (lowest of any weight type) — extremely concentrated queries to text
+ - SDXL cross-attn V: highest stable rank (165.9) and lowest condition number (15.8) — richest value matrices
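Stable rank and condition number, as cited in the bullets above, have standard spectral definitions; a small sketch with illustrative matrices (the helper names are mine):

```python
import numpy as np

def stable_rank(w):
    """||W||_F^2 / sigma_max^2: how many directions carry real energy."""
    sv = np.linalg.svd(w, compute_uv=False)
    return float(np.sum(sv**2) / sv[0]**2)

def condition_number(w):
    """sigma_max / sigma_min."""
    sv = np.linalg.svd(w, compute_uv=False)
    return float(sv[0] / sv[-1])

# Identity: all directions equal, so stable rank = dim, condition = 1.
# Near-rank-1: one dominant direction, so stable rank ~ 1 and a large
# condition number (the concentrated cross-attn Q regime).
rank_one_ish = np.outer(np.ones(16), np.ones(16)) + 0.01 * np.eye(16)
```

A low stable rank with a high condition number is exactly the "extremely concentrated" signature described for SDXL's cross-attention queries.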
+ ---

+ ## XIV. VAE Weight Topology

+ ### XIV.1 Cross-VAE Comparison

+ | VAE | Params | Latent Ch | Enc (<0.1) | Dec (<0.1) | Enc QK pos | Dec QK pos |
+ |---|---|---|---|---|---|---|
+ | SD 1.5 | 83.7M | 4 | 98.6% | 99.1% | 0.496 | 0.486 |
+ | SDXL | 83.7M | 4 | **29.0%** | **38.1%** | 0.502 | **0.416** |
+ | Flux.1 | 83.8M | 16 | 96.5% | 97.5% | 0.498 | **0.451** |
+ | Flux.2 | 84.0M | 32 | 94.3% | 94.3% | **0.393** | **0.416** |

+ **SDXL VAE is the densest model measured.** 29% encoder sparsity at the 0.1 threshold. Identical architecture and param count to SD 1.5, but the weights are 3× denser. Attention condition numbers reach 1.16M.

+ ### XIV.2 VAE Decoder QK Breaks Toward Repulsion

+ | VAE | Latent Ch | Decoder QK pos | Interpretation |
+ |---|---|---|---|
+ | SD 1.5 | 4 | 0.486 | Slight repulsion |
+ | SDXL | 4 (1024² target) | **0.416** | Strong repulsion — 4× reconstruction challenge |
+ | Flux.1 | 16 | **0.451** | Moderate repulsion |
+ | Flux.2 | 32 | **0.416** | Strong repulsion — most channels to separate |

+ Decoder bottleneck attention breaks symmetry toward repulsion. Reconstruction requires spatial discrimination; more negative eigenvalues mean finer spatial separation. More latent channels or a higher target resolution → stronger repulsion.

+ **Flux.1 decoder anomaly:** Top eigenvalue = 60,807 (typical is 2–150). One attention direction completely dominates; the attention space is effectively rank-1.

+ ### XIV.3 VAE Invariants

+ - Zero dead neurons across all four VAEs
+ - Conv filter utilization: 100% (active fraction 1.000)
+ - Cross-layer conv correlation: ~0.000 — universal, extends to pure convnets
+ - Spectral correlation between VAEs: 0.94–0.98 — architecture determines SV distribution

+ ---

+ ## XV. Procrustes Analysis — VAE Weight-Space Alignment

+ ### XV.1 Methodology

+ **Orthogonal Procrustes:** For each common weight matrix (same name, same shape), find the orthogonal R minimizing ‖A − BR‖_F via SVD of B^T A. Report the residual (0 = identical up to rotation, √2 = orthogonal) and the cosine after alignment.

+ **Spectral correlation:** Pearson correlation of normalized singular value distributions.
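The methodology maps directly onto NumPy. This sketch mirrors what `scipy.linalg.orthogonal_procrustes` computes; the `procrustes_align` and `spectral_corr` helpers are illustrative, with inputs Frobenius-normalized so the residual follows the 0-to-√2 convention:

```python
import numpy as np

def procrustes_align(a, b):
    """Orthogonal R minimizing ||A - BR||_F, via SVD of B^T A.
    Returns (R, residual, cosine after alignment); both matrices are
    Frobenius-normalized so residual 0 = identical up to rotation and
    sqrt(2) = orthogonal."""
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    u, _, vt = np.linalg.svd(b.T @ a)
    r = u @ vt                       # closest rotation of b's basis onto a's
    br = b @ r
    return r, float(np.linalg.norm(a - br)), float(np.sum(a * br))

def spectral_corr(a, b):
    """Pearson correlation of normalized singular value distributions."""
    sa = np.linalg.svd(a, compute_uv=False)
    sb = np.linalg.svd(b, compute_uv=False)
    return float(np.corrcoef(sa / sa.sum(), sb / sb.sum())[0, 1])

rng = np.random.default_rng(0)
b = rng.normal(size=(64, 64))
q, _ = np.linalg.qr(rng.normal(size=(64, 64)))  # a random rotation
a = b @ q                       # the same matrix in a rotated basis
r_mat, res, cos = procrustes_align(a, b)        # res ~ 0, cos ~ 1
```

The synthetic pair illustrates the paper's point: a pure change of basis has raw cosine near zero but aligns perfectly after rotation, with identical singular values.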
+ ### XV.2 Pairwise Results

+ | Pair | Raw Cosine | Procrustes Cosine | Rotation Gain | Spectral Corr |
+ |---|---|---|---|---|
+ | SD1.5 vs SDXL | 0.053 | 0.697 | +0.644 | 0.958 |
+ | SD1.5 vs Flux.1 | 0.091 | 0.730 | +0.640 | 0.964 |
+ | **SD1.5 vs Flux.2** | **-0.000** | **0.757** | **+0.757** | **0.979** |
+ | SDXL vs Flux.1 | 0.024 | 0.675 | +0.650 | 0.939 |
+ | SDXL vs Flux.2 | -0.001 | 0.705 | +0.705 | 0.937 |
+ | Flux.1 vs Flux.2 | 0.000 | 0.736 | +0.736 | 0.957 |

+ ### XV.3 Key Findings

+ **1. Raw cosine is zero.** All pairs. Weights are orthogonal in raw space. Naive comparison says these VAEs share nothing. This is wrong.

+ **2. After Procrustes rotation, 70–76% of structure aligns.** These models found the SAME geometric solution, expressed in different coordinate systems. Different initialization → different basis → same function.

+ **3. Spectral correlation is 0.94–0.98.** Singular value distributions are nearly identical across all pairs. The "shape" of each weight matrix — rank structure, energy distribution — is architecture-determined, not training-determined.

+ **4. SD 1.5 vs Flux.2 is the most alignable pair.** Raw cosine literally zero, but the highest Procrustes cosine (0.757) and the highest spectral correlation (0.979). The most different training produces the most alignable weights. Shared structure is deepest when surface differences are greatest.

+ **5. SDXL is the geometric outlier.** Lowest Procrustes cosine with every model (0.675–0.705). It found a more distant basin despite identical architecture to SD 1.5.

+ ### XV.4 Distance Matrices

+ **Procrustes residual (lower = more similar):**

+ | | SD 1.5 | SDXL | Flux.1 | Flux.2 |
+ |---|---|---|---|---|
+ | SD 1.5 | 0.000 | 0.752 | 0.707 | 0.679 |
+ | SDXL | 0.752 | 0.000 | 0.774 | 0.739 |
+ | Flux.1 | 0.707 | 0.774 | 0.000 | 0.699 |
+ | Flux.2 | 0.679 | 0.739 | 0.699 | 0.000 |

+ **Spectral correlation (higher = more similar):**

+ | | SD 1.5 | SDXL | Flux.1 | Flux.2 |
+ |---|---|---|---|---|
+ | SD 1.5 | 1.000 | 0.958 | 0.964 | 0.979 |
+ | SDXL | 0.958 | 1.000 | 0.939 | 0.937 |
+ | Flux.1 | 0.964 | 0.939 | 1.000 | 0.957 |
+ | Flux.2 | 0.979 | 0.937 | 0.957 | 1.000 |

+ ### XV.5 Implication for Geometric Transfer

+ A geometric field modulator trained on one VAE can be ROTATED to work on another via the Procrustes R matrix. 70–76% structural alignment means the modulator captures the shared geometric invariant. The remaining 24–30% is model-specific — the unique basin each training run found.
+ ---
656
 
657
 
658
  ---
659
 
660
  *Last updated: 2026-03-06*
661
+ *Models profiled: 17 (T5-Small, T5-Base, T5-v1.1-XXL, BERT-large, CLIP-ViT-B/16, DINOv2-large, CLIP-ViT-bigG, Qwen3.5-0.8B, Qwen3.5-4B, T5Gemma2-1B, T5Gemma2-4B, SD 1.5 UNet, SDXL UNet, SD 1.5 VAE, SDXL VAE, Flux.1 VAE, Flux.2 VAE)*
662
+ *Architecture families: 5 (Transformer enc-dec, encoder-only/vision, adapted enc-dec, conv UNet, conv autoencoder)*
663
+ *Training objectives: 6 (span corruption, MLM, contrastive, self-supervised, diffusion, reconstruction)*
664
+ *Procrustes analysis: 6 VAE pairs, 68 weight matrices each*
665
  *Modulator experiments: 4 LERP configurations, 1 field modulator*