AbstractPhil committed
Commit f04e79b · verified · 1 Parent(s): d303fb3

Update README.md

Files changed (1): README.md +186 -13
README.md CHANGED
@@ -3,8 +3,7 @@ license: mit
---

# Day 2
-
- # Geometric Terrain Statistics Composite Update; 9 models

## Document Purpose

@@ -25,11 +24,23 @@ Running catalog of geometric measurements across language and vision models. Eac
| CLIP-ViT-bigG/14 | 1.84B (visual) | — | 1664 | 48 | 16 | Vision encoder (fused QKV) | LAION-2B contrastive |
| Qwen3.5-0.8B | 853M | 248,320 | 1024 | — | — | DeltaNet + MoE + ViT | Multilingual + Vision |
| Qwen3.5-4B | ~4B | 248,320 | 2560 | — | — | DeltaNet + MoE + ViT | Multilingual + Vision |

**Notes:**
- T5-v1.1-XXL encoder is the text encoder used by Flux.1 Schnell, Flux.1 Dev, and Flux.2
- CLIP models use fused QKV (`in_proj_weight`); Q/K/V split by thirds for analysis
- T5-v1.1 uses GeGLU (wi_0 gate + wi_1 value) instead of ReLU (single wi)

---

@@ -449,11 +460,17 @@ T5-v1.1-XXL cross-attention QK manifold is exactly 0.500 positive / 0.500 negati
| Binding/separation constant | 0.29154 / 0.70846 | MinimalShunts, CLIP projections, T5 generation, alpha convergence |
| Depth gradient | Monotonic increasing | All modulator training runs |
| Q sparsity scaling (T5) | 93.7% → 99.4% → 100.0% | T5-Small → T5-Base → T5-v1.1-XXL |
- | Cross-attn QK balance | Locked at 0.500 | T5-v1.1-XXL (all 24 layers) |
- | Attention cross-layer corr | ~0.000 | ALL models profiled (8 models) |
- | MLP cross-layer corr | 0.006–0.055 (positive, decays) | ALL models profiled |
| Decoder position crystallization | 0 mixed heads | T5-Small, T5-v1.1-XXL |
- | MLP full utilization | 0.00% dead neurons | T5 family (enc), BERT, DINOv2 |

---

@@ -476,17 +493,173 @@ T5-v1.1-XXL cross-attention QK manifold is exactly 0.500 positive / 0.500 negati

---

- ## XII. Scripts Reference

- | Script | Purpose |
- |---|---|
- | `bulk_experiments_sloppy_with_results.ipynb` | Original sloppy experiment notebook with scattered results. |
- | `experiment_bulk_claude_generated.ipynb` | Notebook rewritten by Claude for consumption by ablation studies and comparative utility. |


---

*Last updated: 2026-03-06*
- *Models profiled: 9 (T5-Small, T5-Base, T5-v1.1-XXL, BERT-large, CLIP-ViT-B/16, DINOv2-large, CLIP-ViT-bigG, Qwen3.5-0.8B, Qwen3.5-4B)*
- *Cross-architecture battery: 7 models, 4 training objectives (MLM, span corruption, contrastive, self-supervised)*
*Modulator experiments: 4 LERP configurations, 1 field modulator*

---

# Day 2
+ # Geometric Terrain Statistics Composite

## Document Purpose

| CLIP-ViT-bigG/14 | 1.84B (visual) | — | 1664 | 48 | 16 | Vision encoder (fused QKV) | LAION-2B contrastive |
| Qwen3.5-0.8B | 853M | 248,320 | 1024 | — | — | DeltaNet + MoE + ViT | Multilingual + Vision |
| Qwen3.5-4B | ~4B | 248,320 | 2560 | — | — | DeltaNet + MoE + ViT | Multilingual + Vision |
+ | T5Gemma2-1B-1B | 2.1B | 262,144 | 1152 | 27+26 | GQA 4:1 | Adapted enc-dec (Gemma 2, RoPE, GeGLU) | Gemma 2 decoder → enc-dec |
+ | T5Gemma2-4B-4B | 7.5B | 262,144 | 2560 | 34+34 | GQA 2:1 | Adapted enc-dec (Gemma 2, RoPE, GeGLU) | Gemma 2 decoder → enc-dec |
+ | SD 1.5 UNet | 860M | — | [320,640,1280,1280] | 16 attn blocks | 8 | Conv UNet + self/cross attn | LDM diffusion (LAION) |
+ | SDXL UNet | 2.6B | — | [320,640,1280] | 70 attn blocks | [5,10,20] | Conv UNet + self/cross attn | LDM diffusion (internal) |
+ | SD 1.5 VAE | 83.7M | — | 4 latent ch | [128,256,512,512] | — | Conv autoencoder + mid attn | Reconstruction (LAION) |
+ | SDXL VAE | 83.7M | — | 4 latent ch | [128,256,512,512] | — | Conv autoencoder + mid attn | Reconstruction (internal) |
+ | Flux.1 VAE | 83.8M | — | 16 latent ch | [128,256,512,512] | — | Conv autoencoder + mid attn | Reconstruction (BFL) |
+ | Flux.2 VAE | 84.0M | — | 32 latent ch | [128,256,512,512] | — | Conv autoencoder + mid attn | Reconstruction (BFL) |

**Notes:**
- T5-v1.1-XXL encoder is the text encoder used by Flux.1 Schnell, Flux.1 Dev, and Flux.2
- CLIP models use fused QKV (`in_proj_weight`); Q/K/V split by thirds for analysis
- T5-v1.1 uses GeGLU (wi_0 gate + wi_1 value) instead of ReLU (single wi)
+ - T5Gemma2 models are Gemma 2 decoder weights adapted to encoder-decoder; include ViT vision tower
+ - UNet attention: attn1 = self-attention (spatial), attn2 = cross-attention (to text encoder)
+ - VAE Conv2d weights reshaped to 2D as [out_channels, in_channels * kH * kW] for analysis
+ - VAE attention exists only at the bottleneck (mid_block) — one in encoder, one in decoder

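The two reshaping conventions in the notes can be sketched in NumPy; the shapes below are illustrative stand-ins, not actual checkpoint tensors:

```python
import numpy as np

rng = np.random.default_rng(0)

# VAE Conv2d kernel [out_channels, in_channels, kH, kW] flattened to the
# 2D matrix [out_channels, in_channels * kH * kW] used for the analyses.
conv_w = rng.normal(size=(128, 64, 3, 3))      # illustrative shape
conv_2d = conv_w.reshape(conv_w.shape[0], -1)  # (128, 576)

# CLIP fused QKV: in_proj_weight stacks Q, K, V along dim 0,
# so splitting by thirds recovers the three projection matrices.
d_model = 96                                   # illustrative width
in_proj_weight = rng.normal(size=(3 * d_model, d_model))
w_q, w_k, w_v = np.split(in_proj_weight, 3, axis=0)  # each (96, 96)
```

Any spectral or sparsity statistic defined for 2D weight matrices then applies uniformly to conv kernels and attention projections.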
---

| Binding/separation constant | 0.29154 / 0.70846 | MinimalShunts, CLIP projections, T5 generation, alpha convergence |
| Depth gradient | Monotonic increasing | All modulator training runs |
| Q sparsity scaling (T5) | 93.7% → 99.4% → 100.0% | T5-Small → T5-Base → T5-v1.1-XXL |
+ | Q sparsity asymmetry | **T5 pretraining only** | Present in T5, absent in T5Gemma2, BERT, DINOv2, UNets, VAEs |
+ | Cross-modal QK balance | **Locked at 0.500** | T5-v1.1-XXL cross-attn, T5Gemma2 (both), SD 1.5 UNet, SDXL UNet (6 models) |
+ | Self-attn QK: adapted models | **Locked at 0.500** | T5Gemma2 1B (all 53 layers), T5Gemma2 4B (all 68 layers) |
+ | UNet QK U-gradient | down→repulsion, up→attraction | SD 1.5 (0.451→0.581), SDXL (0.477→0.549) |
+ | VAE decoder QK | Repulsion-biased | SD 1.5 (0.486), SDXL (0.416), Flux.1 (0.451), Flux.2 (0.416) |
+ | Attention cross-layer corr | ~0.000 | ALL 17 models, including UNets and VAEs |
+ | Conv cross-layer corr | ~0.000 | All UNets and VAEs (extends to pure convnets) |
+ | MLP/FF full utilization | 0.00% dead | T5 family (enc), BERT, DINOv2, UNets, all VAEs |
| Decoder position crystallization | 0 mixed heads | T5-Small, T5-v1.1-XXL |
+ | VAE spectral invariant | Pearson 0.94–0.98 | All 6 VAE pairs — SV distribution is architecture-determined |
+ | VAE Procrustes alignment | 70–76% cosine | All 6 pairs — same solution in different coordinate systems |

---


---

+ ## XII. T5Gemma2 — Decoder-Adapted Encoder-Decoder

+ **Architecture:** Gemma 2 decoder weights adapted to encoder-decoder. GQA (grouped query attention), RoPE, GeGLU MLPs. Multimodal (ViT in encoder).

+ ### XII.1 Sparsity

+ | Model | Q (<0.1) | K (<0.1) | V (<0.1) | Pattern |
+ |---|---|---|---|---|
+ | T5Gemma2 1B-1B | 100.0% | 99.9% | 100.0% | **Uniform** |
+ | T5Gemma2 4B-4B | 100.0% | 100.0% | 100.0% | **Uniform** |

+ **Finding:** No Q/K asymmetry. The T5 Q sparsity pattern is ABSENT when the encoder is initialized from decoder weights. The asymmetry is a property of T5's span corruption pretraining, not the encoder-decoder architecture.
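The sparsity columns report the fraction of weight magnitudes below the 0.1 threshold. A minimal sketch of that statistic (the `sparsity_at` helper is hypothetical; random matrices stand in for real layers):

```python
import numpy as np

def sparsity_at(w, threshold=0.1):
    """Fraction of entries with |w| below the threshold."""
    return float(np.mean(np.abs(w) < threshold))

rng = np.random.default_rng(0)
# A layer drawn with std 0.02 (a typical transformer init scale) sits
# almost entirely below 0.1; std 0.1 leaves roughly a third above it.
tight = rng.normal(scale=0.02, size=(1024, 1024))
wide = rng.normal(scale=0.1, size=(1024, 1024))
```

Applied per Q/K/V projection of a real checkpoint, this is the statistic the table's percentages correspond to.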
+ ### XII.2 QK Manifold

+ | Model | Encoder Self | Decoder Self | All Layers |
+ |---|---|---|---|
+ | T5Gemma2 1B | 0.500 (±0.001) | 0.500 (±0.001) | **Locked** |
+ | T5Gemma2 4B | 0.500 exact | 0.500 exact | **Locked** |

+ **Finding:** Perfect 0.500 lock across ALL layers in BOTH encoder and decoder. Symmetry deviation √2 everywhere. The Gemma 2 initialization left the QK matrices near random-matrix equilibrium. The adaptation to encoder-decoder didn't perturb them enough to break Wigner semicircle symmetry.
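The positive-fraction statistic is not restated in this section, so the sketch below assumes it is the fraction of positive eigenvalues of the symmetrized Q·Kᵀ product; for i.i.d. random weights that fraction sits near the 0.500 lock:

```python
import numpy as np

def qk_positive_fraction(w_q, w_k):
    """Assumed statistic: fraction of positive eigenvalues of the
    symmetrized W_Q @ W_K.T product (symmetrizing makes them real)."""
    m = w_q @ w_k.T
    eig = np.linalg.eigvalsh((m + m.T) / 2.0)
    return float(np.mean(eig > 0))

rng = np.random.default_rng(0)
w_q = rng.normal(size=(256, 256))
w_k = rng.normal(size=(256, 256))
frac = qk_positive_fraction(w_q, w_k)  # near 0.5 at random-matrix equilibrium
```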
+ ### XII.3 Other Invariants

+ - Dead neurons: 0/359,424 (1B), 0/696,320 (4B) — all alive
+ - Cross-layer Q correlation: ~0.000 — confirmed universal
+ - MLP utilization: 100% (1 weak neuron each in enc L6 and dec L6 at 4B scale)
+ - GQA: 4:1 at 1B scale, 2:1 at 4B scale
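The cross-layer correlation invariant reduces to a one-liner; this sketch assumes plain Pearson correlation between flattened same-shape weight matrices (random stand-ins here, where independence likewise gives ~0.000):

```python
import numpy as np

def cross_layer_corr(w_a, w_b):
    """Pearson correlation between two same-shaped layers' flattened weights."""
    return float(np.corrcoef(w_a.ravel(), w_b.ravel())[0, 1])

rng = np.random.default_rng(0)
layer_a_q = rng.normal(size=(512, 512))   # stand-in for one layer's W_Q
layer_b_q = rng.normal(size=(512, 512))   # stand-in for another layer's W_Q
r = cross_layer_corr(layer_a_q, layer_b_q)  # ~0.000 for independent layers
```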
+ ---

+ ## XIII. Diffusion UNet Weight Topology

+ ### XIII.1 UNet Sparsity

+ | Model | Self Q | Self K | Self V | Cross Q | Cross K | Cross V |
+ |---|---|---|---|---|---|---|
+ | SD 1.5 UNet | **90.5%** | **90.9%** | 97.1% | 96.8% | 94.9% | 98.9% |
+ | SDXL UNet | 99.9% | 99.9% | 100.0% | 100.0% | 100.0% | 100.0% |

+ **SD 1.5 is the least sparse model in the battery apart from the SDXL VAE.** 90.5% for self-attention Q — below T5-Small's 93.7%. A parameter-starved model (860M for 512×512 image generation) uses denser weights. SDXL at 3× the params reaches near-100%.

+ **Sparsity traces the U-path (SD 1.5):** down = 88.9%, mid = 99.3%, up = 89.4%. The bottleneck holds the sparsest weights; the periphery holds the densest.

+ ### XIII.2 UNet QK Manifold — The U-Shape

+ **Self-attention positive eigenvalue fraction through the UNet path:**

+ | Position | SD 1.5 | SDXL |
+ |---|---|---|
+ | down (early) | 0.509 | ~0.49 |
+ | down (deep) | **0.451** | **0.483** |
+ | mid (bottleneck) | **0.483** | **0.477** |
+ | up (early) | 0.501 | 0.501 |
+ | up (late) | **0.581** | **0.549** |

+ The QK manifold traces the U-shape: a repulsion-dominated downpath (compressing, discriminating), maximum repulsion around the deep down blocks and bottleneck, rising to an attraction-dominated uppath (reconstructing, grouping). SD 1.5 shows the wider swing (0.451→0.581, a 0.130 range) because it is more parameter-starved.

+ **Cross-attention: locked at 0.500 in both UNets.** SD 1.5: mean = 0.501, std = 0.001. SDXL: mean = 0.500, std = 0.001. The fifth and sixth confirmations of the cross-modal QK lock.

+ ### XIII.3 Other UNet Invariants

+ - Dead neurons: 0/23,040 (SD 1.5), 0/163,840 (SDXL)
+ - Cross-block Q correlation: ~0.000 (both self-attn and cross-attn)
+ - SDXL cross-attn Q stable rank: 13.97 (lowest of any weight type) — extremely concentrated queries to text
+ - SDXL cross-attn V: highest stable rank (165.9) and lowest condition number (15.8) — richest value matrices
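Stable rank and condition number, as cited in the bullets above, have standard spectral definitions; a small sketch with illustrative matrices (the helper names are mine):

```python
import numpy as np

def stable_rank(w):
    """||W||_F^2 / sigma_max^2: how many directions carry real energy."""
    sv = np.linalg.svd(w, compute_uv=False)
    return float(np.sum(sv**2) / sv[0]**2)

def condition_number(w):
    """sigma_max / sigma_min."""
    sv = np.linalg.svd(w, compute_uv=False)
    return float(sv[0] / sv[-1])

# Identity: all directions equal, so stable rank = dim, condition = 1.
# Near-rank-1: one dominant direction, so stable rank ~ 1 and a large
# condition number (the concentrated cross-attn Q regime).
rank_one_ish = np.outer(np.ones(16), np.ones(16)) + 0.01 * np.eye(16)
```

A low stable rank with a high condition number is exactly the "extremely concentrated" signature described for SDXL's cross-attention queries.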
+ ---

+ ## XIV. VAE Weight Topology

+ ### XIV.1 Cross-VAE Comparison

+ | VAE | Params | Latent Ch | Enc (<0.1) | Dec (<0.1) | Enc QK pos | Dec QK pos |
+ |---|---|---|---|---|---|---|
+ | SD 1.5 | 83.7M | 4 | 98.6% | 99.1% | 0.496 | 0.486 |
+ | SDXL | 83.7M | 4 | **29.0%** | **38.1%** | 0.502 | **0.416** |
+ | Flux.1 | 83.8M | 16 | 96.5% | 97.5% | 0.498 | **0.451** |
+ | Flux.2 | 84.0M | 32 | 94.3% | 94.3% | **0.393** | **0.416** |

+ **SDXL VAE is the densest model measured.** 29% encoder sparsity at the 0.1 threshold. Identical architecture and param count to SD 1.5, but the weights are 3× denser. Attention condition numbers reach 1.16M.

+ ### XIV.2 VAE Decoder QK Breaks Toward Repulsion

+ | VAE | Latent Ch | Decoder QK pos | Interpretation |
+ |---|---|---|---|
+ | SD 1.5 | 4 | 0.486 | Slight repulsion |
+ | SDXL | 4 (1024² target) | **0.416** | Strong repulsion — 4× reconstruction challenge |
+ | Flux.1 | 16 | **0.451** | Moderate repulsion |
+ | Flux.2 | 32 | **0.416** | Strong repulsion — most channels to separate |

+ Decoder bottleneck attention breaks symmetry toward repulsion. Reconstruction requires spatial discrimination; more negative eigenvalues mean finer spatial separation. More latent channels or a higher target resolution → stronger repulsion.

+ **Flux.1 decoder anomaly:** Top eigenvalue = 60,807 (typical is 2–150). One attention direction completely dominates; the attention space is effectively rank-1.

+ ### XIV.3 VAE Invariants

+ - Zero dead neurons across all four VAEs
+ - Conv filter utilization: 100% (active fraction 1.000)
+ - Cross-layer conv correlation: ~0.000 — universal, extends to pure convnets
+ - Spectral correlation between VAEs: 0.94–0.98 — architecture determines SV distribution

+ ---

+ ## XV. Procrustes Analysis — VAE Weight-Space Alignment

+ ### XV.1 Methodology

+ **Orthogonal Procrustes:** For each common weight matrix (same name, same shape), find the orthogonal R minimizing ‖A − BR‖_F via SVD of B^T A. Report the residual (0 = identical up to rotation, √2 = orthogonal) and the cosine after alignment.

+ **Spectral correlation:** Pearson correlation of normalized singular value distributions.
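The methodology maps directly onto NumPy. This sketch mirrors what `scipy.linalg.orthogonal_procrustes` computes; the `procrustes_align` and `spectral_corr` helpers are illustrative, with inputs Frobenius-normalized so the residual follows the 0-to-√2 convention:

```python
import numpy as np

def procrustes_align(a, b):
    """Orthogonal R minimizing ||A - BR||_F, via SVD of B^T A.
    Returns (R, residual, cosine after alignment); both matrices are
    Frobenius-normalized so residual 0 = identical up to rotation and
    sqrt(2) = orthogonal."""
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    u, _, vt = np.linalg.svd(b.T @ a)
    r = u @ vt                       # closest rotation of b's basis onto a's
    br = b @ r
    return r, float(np.linalg.norm(a - br)), float(np.sum(a * br))

def spectral_corr(a, b):
    """Pearson correlation of normalized singular value distributions."""
    sa = np.linalg.svd(a, compute_uv=False)
    sb = np.linalg.svd(b, compute_uv=False)
    return float(np.corrcoef(sa / sa.sum(), sb / sb.sum())[0, 1])

rng = np.random.default_rng(0)
b = rng.normal(size=(64, 64))
q, _ = np.linalg.qr(rng.normal(size=(64, 64)))  # a random rotation
a = b @ q                       # the same matrix in a rotated basis
r_mat, res, cos = procrustes_align(a, b)        # res ~ 0, cos ~ 1
```

The synthetic pair illustrates the paper's point: a pure change of basis has raw cosine near zero but aligns perfectly after rotation, with identical singular values.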
+ ### XV.2 Pairwise Results

+ | Pair | Raw Cosine | Procrustes Cosine | Rotation Gain | Spectral Corr |
+ |---|---|---|---|---|
+ | SD1.5 vs SDXL | 0.053 | 0.697 | +0.644 | 0.958 |
+ | SD1.5 vs Flux.1 | 0.091 | 0.730 | +0.640 | 0.964 |
+ | **SD1.5 vs Flux.2** | **-0.000** | **0.757** | **+0.757** | **0.979** |
+ | SDXL vs Flux.1 | 0.024 | 0.675 | +0.650 | 0.939 |
+ | SDXL vs Flux.2 | -0.001 | 0.705 | +0.705 | 0.937 |
+ | Flux.1 vs Flux.2 | 0.000 | 0.736 | +0.736 | 0.957 |

+ ### XV.3 Key Findings

+ **1. Raw cosine is zero.** All pairs. Weights are orthogonal in raw space. Naive comparison says these VAEs share nothing. This is wrong.

+ **2. After Procrustes rotation, 70–76% of structure aligns.** These models found the SAME geometric solution, expressed in different coordinate systems. Different initialization → different basis → same function.

+ **3. Spectral correlation is 0.94–0.98.** Singular value distributions are nearly identical across all pairs. The "shape" of each weight matrix — rank structure, energy distribution — is architecture-determined, not training-determined.

+ **4. SD 1.5 vs Flux.2 is the most alignable pair.** Raw cosine literally zero, but the highest Procrustes cosine (0.757) and the highest spectral correlation (0.979). The most different training produces the most alignable weights. Shared structure is deepest when surface differences are greatest.

+ **5. SDXL is the geometric outlier.** Lowest Procrustes cosine with every model (0.675–0.705). It found a more distant basin despite identical architecture to SD 1.5.

+ ### XV.4 Distance Matrices

+ **Procrustes residual (lower = more similar):**

+ | | SD 1.5 | SDXL | Flux.1 | Flux.2 |
+ |---|---|---|---|---|
+ | SD 1.5 | 0.000 | 0.752 | 0.707 | 0.679 |
+ | SDXL | 0.752 | 0.000 | 0.774 | 0.739 |
+ | Flux.1 | 0.707 | 0.774 | 0.000 | 0.699 |
+ | Flux.2 | 0.679 | 0.739 | 0.699 | 0.000 |

+ **Spectral correlation (higher = more similar):**

+ | | SD 1.5 | SDXL | Flux.1 | Flux.2 |
+ |---|---|---|---|---|
+ | SD 1.5 | 1.000 | 0.958 | 0.964 | 0.979 |
+ | SDXL | 0.958 | 1.000 | 0.939 | 0.937 |
+ | Flux.1 | 0.964 | 0.939 | 1.000 | 0.957 |
+ | Flux.2 | 0.979 | 0.937 | 0.957 | 1.000 |

+ ### XV.5 Implication for Geometric Transfer

+ A geometric field modulator trained on one VAE can be ROTATED to work on another via the Procrustes R matrix. 70–76% structural alignment means the modulator captures the shared geometric invariant. The remaining 24–30% is model-specific — the unique basin each training run found.
+ ---
656
 
657
 
658
  ---
659
 
660
  *Last updated: 2026-03-06*
661
+ *Models profiled: 17 (T5-Small, T5-Base, T5-v1.1-XXL, BERT-large, CLIP-ViT-B/16, DINOv2-large, CLIP-ViT-bigG, Qwen3.5-0.8B, Qwen3.5-4B, T5Gemma2-1B, T5Gemma2-4B, SD 1.5 UNet, SDXL UNet, SD 1.5 VAE, SDXL VAE, Flux.1 VAE, Flux.2 VAE)*
662
+ *Architecture families: 5 (Transformer enc-dec, encoder-only/vision, adapted enc-dec, conv UNet, conv autoencoder)*
663
+ *Training objectives: 6 (span corruption, MLM, contrastive, self-supervised, diffusion, reconstruction)*
664
+ *Procrustes analysis: 6 VAE pairs, 68 weight matrices each*
665
  *Modulator experiments: 4 LERP configurations, 1 field modulator*