---
license: mit
---

# Day 1

# Geometric Terrain Statistics Composite

Such a quaint little tool.

```
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class GeometricResidualModulator(nn.Module):
    def __init__(self, d_model=512, vocab_size=32128, n_geometric_dims=64,
                 initial_alpha=0.01, n_layers=6):
        super().__init__()
        self.d_model = d_model
        self.n_geometric_dims = n_geometric_dims
        self.geometric_embed = nn.Embedding(vocab_size, n_geometric_dims)
        self.proj = nn.Linear(n_geometric_dims, d_model, bias=False)
        # Store alpha in logit space so sigmoid(alpha) starts at initial_alpha.
        logit = math.log(initial_alpha / (1 - initial_alpha))
        self.alpha = nn.Parameter(torch.full((n_layers,), logit))
        nn.init.normal_(self.proj.weight, std=0.01)

    def forward(self, residual, token_ids, layer_idx=0):
        geo = self.geometric_embed(token_ids)
        geo_projected = self.proj(geo)
        a = torch.sigmoid(self.alpha[layer_idx])
        # Per-layer LERP between the residual stream and the geometric signal.
        return (1 - a) * residual + a * geo_projected

    def geometric_residuals(self):
        W = self.geometric_embed.weight
        W_n = F.normalize(W, dim=1)
        # Sample up to 5K tokens uniformly from the full vocabulary.
        idx = torch.randperm(W.shape[0])[:5000]
        sample = W_n[idx]
        cos_mat = sample @ sample.T
        tri = torch.triu_indices(len(idx), len(idx), offset=1)
        flat_cos = cos_mat[tri[0], tri[1]]
        norms = W.norm(dim=1)
        centered = W - W.mean(dim=0)
        cov = (centered.T @ centered) / W.shape[0]
        eigvals = torch.linalg.eigvalsh(cov)
        # Participation ratio: effective number of dimensions in use.
        pr = (eigvals.sum() ** 2) / (eigvals ** 2).sum()
        return {
            'cos_mean': flat_cos.mean().item(),
            'cos_std': flat_cos.std().item(),
            'norm_mean': norms.mean().item(),
            'pr_over_dim': (pr / self.n_geometric_dims).item(),
            'alpha': torch.sigmoid(self.alpha).detach().cpu().numpy(),
        }


class ModulatedT5Encoder(nn.Module):
    def __init__(self, t5_encoder, modulator, modulate_layers=None):
        super().__init__()
        self.encoder = t5_encoder
        self.modulator = modulator
        if modulate_layers is None:
            modulate_layers = list(range(len(t5_encoder.block)))
        self.modulate_layers = set(modulate_layers)

    def forward(self, input_ids, attention_mask=None, output_hidden_states=False, **kwargs):
        hidden_states = self.encoder.embed_tokens(input_ids)
        hidden_states = self.encoder.dropout(hidden_states)

        if attention_mask is not None:
            extended_attention_mask = attention_mask[:, None, None, :].to(dtype=hidden_states.dtype)
            extended_attention_mask = (1.0 - extended_attention_mask) * torch.finfo(hidden_states.dtype).min
        else:
            extended_attention_mask = None

        all_hidden_states = [hidden_states] if output_hidden_states else None
        position_bias = None
        seq_length = input_ids.shape[1]
        cache_position = torch.arange(seq_length, device=input_ids.device)

        for i, block in enumerate(self.encoder.block):
            # Modulate the residual stream before the block runs.
            if i in self.modulate_layers:
                hidden_states = self.modulator(hidden_states, input_ids, layer_idx=i)

            block_output = block(hidden_states, attention_mask=extended_attention_mask,
                                 position_bias=position_bias, cache_position=cache_position)
            hidden_states = block_output[0]

            # Reuse the relative position bias computed by the first block.
            if position_bias is None:
                for out in block_output[1:]:
                    if isinstance(out, torch.Tensor) and out.dim() == 4:
                        position_bias = out
                        break

            if output_hidden_states:
                all_hidden_states.append(hidden_states)

        hidden_states = self.encoder.final_layer_norm(hidden_states)
        hidden_states = self.encoder.dropout(hidden_states)

        if output_hidden_states:
            all_hidden_states.append(hidden_states)

        # Lightweight stand-in for the HF BaseModelOutput return type.
        return type('Output', (), {
            'last_hidden_state': hidden_states,
            'hidden_states': tuple(all_hidden_states) if all_hidden_states else None,
        })()


# Assumes `model` is a loaded T5 model (e.g. T5ForConditionalGeneration)
# and `device` is already defined.
N_GEO = 64
modulator = GeometricResidualModulator(
    d_model=512, vocab_size=32128, n_geometric_dims=N_GEO,
    initial_alpha=0.5, n_layers=6,
).to(device)

mod_encoder = ModulatedT5Encoder(
    t5_encoder=model.encoder, modulator=modulator,
    modulate_layers=[0, 1, 2, 3, 4, 5],
)
```

## Document Purpose

Running catalog of geometric measurements across language models. Each metric includes its formula, measurement process, and cross-model results. Designed for expansion as new models and experiments are added.

---

## I. Models Profiled

| Model | Params | Vocab | Hidden Dim | Layers | Architecture | Training Data |
|---|---|---|---|---|---|---|
| T5-Small | 60.5M | 32,128 | 512 | 6+6 enc-dec | Transformer (relative PE) | C4 |
| Qwen3.5-0.8B | 853M (752M LM + 100M ViT) | 248,320 | 1024 | — | DeltaNet + MoE | Multilingual + Vision |
| Qwen3.5-4B | ~4B | 248,320 | 2560 | — | DeltaNet + MoE | Multilingual + Vision |

---

## II. Embedding Geometry Metrics

### II.1 Participation Ratio (Effective Dimensionality)

**Formula:** PR = (Σλᵢ)² / Σ(λᵢ²), where λᵢ are the eigenvalues of the embedding covariance matrix.

**Process:** Center the embeddings (subtract the mean), compute the covariance C = EᵀE / N, and eigendecompose. PR counts the effective number of dimensions in use; PR/dim normalizes it to [0, 1].

| Model | PR | PR / dim | Dims for 95% var |
|---|---|---|---|
| T5-Small (512d) | 287.2 | **0.561** | 379 (74.0%) |
| Qwen3.5-0.8B (1024d) | 547.7 | **0.535** | 893 (87.2%) |
| Qwen3.5-4B (2560d) | 812.4 | **0.317** | 2125 (83.0%) |

**Finding:** PR/dim ≈ 0.53–0.56 for the smaller models. This appears to be a universal attractor for embedding dimensionality utilization.
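
A minimal standalone sketch of the measurement, assuming `E` is any [vocab, dim] embedding matrix (this mirrors the PR computation in `geometric_residuals` above):

```
import torch

def participation_ratio(E: torch.Tensor) -> dict:
    """Effective dimensionality of an embedding matrix E [vocab, dim]."""
    centered = E - E.mean(dim=0)
    cov = (centered.T @ centered) / E.shape[0]
    eigvals = torch.linalg.eigvalsh(cov)          # ascending, real
    pr = (eigvals.sum() ** 2) / (eigvals ** 2).sum()
    # Dimensions needed to reach 95% of total variance
    sorted_vals = eigvals.flip(0)
    cum = sorted_vals.cumsum(0) / sorted_vals.sum()
    dims_95 = int((cum < 0.95).sum().item()) + 1
    return {'pr': pr.item(), 'pr_over_dim': pr.item() / E.shape[1], 'dims_95': dims_95}
```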

### II.2 Pairwise Cosine Similarity Distribution

**Formula:** cos(eᵢ, eⱼ) = (eᵢ · eⱼ) / (‖eᵢ‖ · ‖eⱼ‖), sampled over 5K random tokens (≈12.5M pairs).

**Process:** Randomly sample 5K token embeddings, L2-normalize, compute the full pairwise cosine matrix, and extract the upper triangle.

| Model | Mean | Std | Median | 1% | 99% |
|---|---|---|---|---|---|
| T5-Small | 0.057 | 0.060 | 0.053 | -0.068 | 0.225 |
| Qwen3.5-0.8B | 0.195 | 0.085 | 0.197 | -0.016 | 0.408 |
| Qwen3.5-4B | 0.142 | 0.078 | 0.139 | -0.029 | 0.356 |

**Finding:** T5 is near-orthogonal (span corruption objective). Qwen has a positive bias: autoregressive next-token prediction pushes a shared "being a token" component into every embedding.

### II.3 Embedding Norm Distribution

**Formula:** ‖eᵢ‖₂ = √(Σⱼ eᵢⱼ²)

| Model | Mean Norm | Std | Min | Max |
|---|---|---|---|---|
| T5-Small | 520.15 | 69.84 | 243.31 | 1333.61 |
| Qwen3.5-0.8B | 0.627 | 0.062 | 0.347 | 1.057 |
| Qwen3.5-4B | 0.656 | 0.067 | 0.400 | 1.091 |

**Note:** T5 embeddings are unnormalized (large magnitudes). Qwen embeddings are near unit norm. This affects downstream metric scaling but not relational structure.

---

## III. Simplex Geometry Metrics

### III.1 Pentachoron Volume (Cayley-Menger Determinant)

**Formula:** For 5 points P₀…P₄, construct the bordered squared-distance matrix:

```
D = | 0    1    1    1    1    1   |
    | 1    0    d₀₁² d₀₂² d₀₃² d₀₄²|
    | 1    d₁₀² 0    d₁₂² d₁₃² d₁₄²|
    | 1    d₂₀² d₂₁² 0    d₂₃² d₂₄²|
    | 1    d₃₀² d₃₁² d₃₂² 0    d₃₄²|
    | 1    d₄₀² d₄₁² d₄₂² d₄₃² 0   |

Vol² = (-1)⁵ · det(D) / (2⁴ · (4!)²) = -det(D) / 9216
Vol  = √(Vol²) if Vol² > 0, else invalid
```

**Process:** Sample 1000 random 5-token subsets and compute the Cayley-Menger volume for each. Compare to a random Gaussian baseline with the same norm distribution. Report the CV (coefficient of variation = std/mean) and the embed/random ratio.

| Model | Valid/1000 | CV | Embed/Random Ratio |
|---|---|---|---|
| T5-Small | 1000 | **0.233** | 0.855 |
| Qwen3.5-0.8B | 1000 | **0.208** | 0.984 |
| Qwen3.5-4B | 1000 | **0.222** | 0.988 |

**Finding:** CV 0.20–0.23 is a universal attractor. All models pack simplices with similar evenness regardless of architecture, scale, or training data. The "pentachoron packing constant."
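
A sketch of the volume computation and the CV loop (the 9216 denominator is 2⁴·(4!)² from the formula above; `E` is assumed to be the embedding matrix):

```
import torch

def pentachoron_volume(points: torch.Tensor) -> float:
    """Cayley-Menger volume of the 4-simplex spanned by 5 points [5, dim]."""
    sq = torch.cdist(points, points) ** 2
    D = torch.ones(6, 6, dtype=points.dtype)
    D[0, 0] = 0.0
    D[1:, 1:] = sq
    vol_sq = -torch.linalg.det(D) / 9216.0
    return vol_sq.sqrt().item() if vol_sq > 0 else float('nan')

# CV over 1000 random 5-token subsets
vols = torch.tensor([pentachoron_volume(E[torch.randperm(E.shape[0])[:5]])
                     for _ in range(1000)])
vols = vols[~vols.isnan()]              # drop invalid (Vol² <= 0) simplices
cv = (vols.std() / vols.mean()).item()
```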

### III.2 Cross-Model Relational Structure

**Formula:** For tokens shared between two models, compute the pairwise cosine matrix in each model's embedding space. The Pearson correlation between the flattened upper triangles measures relational preservation.

**Process (Qwen 0.8B vs 4B):** PCA the 4B embeddings (2560→1024), Procrustes-align using 10K anchor tokens, evaluate on 5K held-out tokens.

| Comparison | Relational Pearson | Digit Structure Pearson |
|---|---|---|
| Qwen 0.8B vs 4B (raw) | 0.920 | 0.904 |
| Qwen 0.8B vs 4B (Procrustes) | higher (post-alignment) | — |

**Finding:** Models at different scales learn the same relational geometry (r = 0.92).
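
A sketch of the alignment and the relational score, assuming `E_small` and `E_big_pca` hold the same shared tokens row-for-row (names are illustrative):

```
import torch
import torch.nn.functional as F

def procrustes_align(A: torch.Tensor, B: torch.Tensor) -> torch.Tensor:
    """Orthogonal map R minimizing ||A @ R - B||_F over the anchor rows."""
    U, _, Vt = torch.linalg.svd(A.T @ B)
    return U @ Vt

def relational_pearson(E1: torch.Tensor, E2: torch.Tensor) -> float:
    """Correlation between two models' pairwise cosine structures."""
    n = E1.shape[0]
    iu = torch.triu_indices(n, n, offset=1)
    c1 = (F.normalize(E1, dim=1) @ F.normalize(E1, dim=1).T)[iu[0], iu[1]]
    c2 = (F.normalize(E2, dim=1) @ F.normalize(E2, dim=1).T)[iu[0], iu[1]]
    return torch.corrcoef(torch.stack([c1, c2]))[0, 1].item()

# R = procrustes_align(E_big_pca[:10000], E_small[:10000])    # 10K anchors
# score = relational_pearson(E_big_pca[10000:15000] @ R, E_small[10000:15000])
```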

---

## IV. Semantic Structure Metrics

### IV.1 Digit Manifold

**Formula:** For the digit tokens '0'–'9', compute all 45 pairwise cosines. Measure the Pearson correlation between numerical distance \|i−j\| and cosine similarity.

**Process:** Encode each digit as a single token, extract its embedding, normalize, and compute the pairwise cosine matrix.

| Model | \|i−j\| Correlation | Adjacent Mean | Non-Adjacent Mean | Gap |
|---|---|---|---|---|
| T5-Small | -0.575 | 0.622 | 0.442 | 0.180 |
| Qwen3.5-0.8B | -0.862 | 0.769 | 0.678 | 0.091 |
| Qwen3.5-4B | -0.871 | 0.790 | 0.731 | 0.059 |

**Finding:** All models encode a number line, more strongly in Qwen (more training data). T5 has the wider gap (adjacent vs non-adjacent more differentiated) despite a weaker overall correlation.
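
A sketch of the measurement. It assumes each digit maps to a single token, as the Process above does; the exact token lookup may need model-specific handling (e.g. SentencePiece prefixes):

```
import torch
import torch.nn.functional as F

def digit_manifold(tokenizer, E: torch.Tensor) -> float:
    """Pearson correlation between numerical distance |i-j| and cosine."""
    ids = [tokenizer.convert_tokens_to_ids(str(d)) for d in range(10)]
    digits = F.normalize(E[ids], dim=1)
    cos = digits @ digits.T
    iu = torch.triu_indices(10, 10, offset=1)    # 45 pairs
    sims = cos[iu[0], iu[1]]
    dists = (iu[0] - iu[1]).abs().float()
    return torch.corrcoef(torch.stack([dists, sims]))[0, 1].item()
```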

### IV.2 Semantic Category Clustering

**Formula:** For tokens in a semantic category, compute the mean intra-category pairwise cosine. Compare to the global mean pairwise cosine. Lift = intra − global.

**Process (T5-Small):** 8 hand-curated categories (animals, colors, numbers, body, food, emotions, actions, time), single-token words only.

| Category | N tokens | Intra Cosine | Global | Lift |
|---|---|---|---|---|
| numbers | 9 | 0.497 | 0.057 | +0.440 |
| colors | 10 | 0.421 | 0.057 | +0.365 |
| time | 10 | 0.351 | 0.057 | +0.294 |
| food | 10 | 0.248 | 0.057 | +0.191 |
| animals | 12 | 0.241 | 0.057 | +0.184 |
| body | 10 | 0.216 | 0.057 | +0.159 |
| emotions | 10 | 0.197 | 0.057 | +0.141 |
| actions | 9 | 0.183 | 0.057 | +0.126 |

---

## V. Encoder Transformation Metrics (T5-Small)

### V.1 Layer-by-Layer Geometry

**Process:** Feed 10 diverse sentences through the encoder, capturing hidden states at each layer. Measure the mean norm and the mean pairwise cosine between token positions.

| Layer | Mean Norm | Pairwise Cosine |
|---|---|---|
| 0 (embed) | 377.3 | 0.052 |
| 1 | 761.6 | 0.278 |
| 2 | 1092.6 | 0.330 |
| 3 | 1428.8 | 0.367 |
| 4 | 1829.1 | 0.382 |
| 5 | 2378.3 | 0.419 |
| 6 (post-LN) | 3.3 | 0.211 |

**Finding:** Norms balloon through depth; the final LayerNorm crushes them to ~3. Pairwise cosine increases monotonically: tokens become MORE similar through depth. The encoder is a convergence funnel.
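
A sketch of the capture loop using `transformers` (one sentence shown instead of 10 for brevity; in HF T5 the last element of `hidden_states` is the post-LN output):

```
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, T5EncoderModel

tok = AutoTokenizer.from_pretrained("t5-small")
enc = T5EncoderModel.from_pretrained("t5-small").eval()

batch = tok("The quick brown fox jumps over the lazy dog.", return_tensors="pt")
with torch.no_grad():
    out = enc(**batch, output_hidden_states=True)

for layer, h in enumerate(out.hidden_states):   # embeddings ... post-LN
    h = h[0]                                    # [seq, d_model]
    hn = F.normalize(h, dim=1)
    iu = torch.triu_indices(h.shape[0], h.shape[0], offset=1)
    cos = (hn @ hn.T)[iu[0], iu[1]].mean().item()
    print(f"layer {layer}: norm={h.norm(dim=1).mean():.1f} cos={cos:.3f}")
```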

### V.2 WordNet Relational Alignment

**Process:** Encode 9,362 WordNet definitions via "summarize: {definition}". Mean-pool the encoder output. Compare pairwise cosine to WordNet path similarity.

| Representation | Pearson | Spearman |
|---|---|---|
| Static embeddings | 0.078 | 0.015 |
| Encoder output | 0.095 | 0.081 |

**50-seed stability (encoder):** Pearson 0.100 ± 0.008, Spearman 0.090 ± 0.010, CV 0.204 ± 0.006.
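
A sketch of the two ingredients, assuming NLTK's WordNet corpus is installed and reusing `tok`/`enc` from the V.1 sketch:

```
import torch
from nltk.corpus import wordnet as wn

def wn_path_similarity(a: str, b: str) -> float:
    """e.g. wn_path_similarity('dog.n.01', 'cat.n.01')"""
    return wn.synset(a).path_similarity(wn.synset(b))

def encode_definition(definition: str) -> torch.Tensor:
    """Mean-pooled encoder representation of a WordNet definition."""
    batch = tok(f"summarize: {definition}", return_tensors="pt")
    with torch.no_grad():
        out = enc(**batch)
    mask = batch["attention_mask"][0].unsqueeze(-1).float()   # [seq, 1]
    return (out.last_hidden_state[0] * mask).sum(0) / mask.sum()
```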

### V.3 Encoder Distance Bands

**Process:** Group WordNet token pairs by path-similarity range. Measure the mean cosine in each band.

| WN Similarity Band | N pairs | Static Cosine | Encoder Cosine | Lift |
|---|---|---|---|---|
| [0.50, 0.90) | 23 | 0.244 | 0.728 | +0.484 |
| [0.25, 0.50) | 53,112 | 0.077 | 0.573 | +0.496 |
| [0.10, 0.25) | 145,035 | 0.060 | 0.565 | +0.505 |
| [0.05, 0.10) | 295,680 | 0.061 | 0.553 | +0.492 |

### V.4 Hypernym Chain Decay

**Process:** Find WordNet synsets forming hypernym chains (e.g., dog→canine→mammal→organism). Measure the cosine between the root and its ancestor at each depth.

| Depth | Static Cosine | Encoder Cosine |
|---|---|---|
| 1 | 0.160 | 0.656 |
| 2 | 0.090 | 0.620 |
| 3 | 0.075 | 0.594 |
| 5 | 0.069 | 0.585 |
| 7 | 0.068 | 0.579 |

**Finding:** Monotonic decay in both spaces. The encoder has a much stronger signal and a cleaner gradient.

---

## VI. Inactive Weight Topology (T5-Small / T5-Base)

### VI.1 SVD Effective Rank

**Formula:** Stable rank = ‖W‖²_F / ‖W‖²₂ = Σσᵢ² / σ₁². Measures effective rank without thresholding.

**Process:** SVD every 2D weight matrix. Report the stable rank, participation ratio, active fraction (σᵢ > 0.01·σ₁), and condition number (σ₁/σₙ).

| Weight Type | Stable Rank (Small) | Stable Rank (Base) |
|---|---|---|
| self_attn_q | 47.6 ± 16.4 | 58.1 ± 17.2 |
| self_attn_k | 53.2 ± 9.2 | 62.4 ± 18.3 |
| self_attn_v | 75.3 | 97.5 |
| mlp_wi | 15.2 ± 3.8 | 20.6 ± 4.9 |
| mlp_wo | 31.3 | 43.9 |
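
A sketch of the per-matrix diagnostics:

```
import torch

def weight_rank_stats(W: torch.Tensor) -> dict:
    """SVD-based effective-rank diagnostics for a 2D weight matrix."""
    s = torch.linalg.svdvals(W)                  # descending
    return {
        'stable_rank': ((s ** 2).sum() / s[0] ** 2).item(),
        'participation_ratio': (((s ** 2).sum() ** 2) / (s ** 4).sum()).item(),
        'active_fraction': (s > 0.01 * s[0]).float().mean().item(),
        'condition_number': (s[0] / s[-1]).item(),
    }
```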

### VI.2 Sparsity Topology

**Formula:** Fraction of weights with \|wᵢⱼ\| below a threshold.

| Weight Type | \|w\| < 0.1 (Small) | \|w\| < 0.1 (Base) |
|---|---|---|
| self_attn_q | **93.7%** | **99.4%** |
| self_attn_k | 19.2% | 30.0% |
| self_attn_v | 12.1% | 16.2% |
| mlp_wi | 11.9% | 16.9% |
| Full model | 18.4% | 27.9% |

**Finding:** Q matrices are overwhelmingly sparse: the query projection is >93% near-zero. K matrices are dense. This asymmetry grows with scale. The Q null space is the intervention point for geometric modulation.
| 343 |
+
### VI.3 QK Similarity Manifold
|
| 344 |
+
|
| 345 |
+
**Formula:** QK = W_Q · W_Kᵀ. Eigendecompose the symmetric part (QK + QKᵀ)/2. Positive eigenvalues = attraction directions. Negative eigenvalues = repulsion directions.
|
| 346 |
+
|
| 347 |
+
**Process:** Compute per-layer. Track positive/negative balance and stable rank.
|
| 348 |
+
|
| 349 |
+
| Layer (Encoder) | Stable Rank | Positive Eig | Negative Eig | Symmetry Dev |
|
| 350 |
+
|---|---|---|---|---|
|
| 351 |
+
| 0 | 39.5 | 315 | 197 | 0.993 |
|
| 352 |
+
| 2 | 10.1 | 269 | 243 | 1.217 |
|
| 353 |
+
| 5 | 5.35 | 274 | 238 | 1.252 |
|
| 354 |
+
|
| 355 |
+
**Finding:** Similarity function narrows through depth (stable rank 39→5). Negative eigenvalue count increases — deeper layers define more anti-similarity boundaries.
|
| 356 |
+
|
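
A sketch of the per-layer measurement. The exact "Symmetry Dev" definition is not given above; the relative norm of the antisymmetric part is used here as an assumption:

```
import torch

def qk_manifold_stats(W_q: torch.Tensor, W_k: torch.Tensor) -> dict:
    QK = W_q @ W_k.T
    sym = (QK + QK.T) / 2
    eig = torch.linalg.eigvalsh(sym)
    s = torch.linalg.svdvals(QK)
    return {
        'stable_rank': ((s ** 2).sum() / s[0] ** 2).item(),
        'positive_eig': (eig > 0).sum().item(),
        'negative_eig': (eig < 0).sum().item(),
        # Assumed definition: relative norm of the antisymmetric part.
        'symmetry_dev': ((QK - QK.T).norm() / QK.norm()).item(),
    }
```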

### VI.4 MLP Dead Neurons

**Formula:** Combined importance of neuron i = ‖wᵢ_up‖₂ · ‖wᵢ_down‖₂. A neuron is dead if its importance is < 1% of the mean.

**Finding:** Zero dead neurons across all layers, both encoder and decoder, at both Small and Base scale. T5 is parameter-starved: every neuron earns its keep.
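
A sketch using the HF T5 weight shapes (wi: [d_ff, d_model] up-projection, wo: [d_model, d_ff] down-projection):

```
import torch

def dead_neuron_count(wi: torch.Tensor, wo: torch.Tensor) -> int:
    """Neuron i is dead if ||wi[i, :]|| * ||wo[:, i]|| < 1% of the mean."""
    importance = wi.norm(dim=1) * wo.norm(dim=0)
    return (importance < 0.01 * importance.mean()).sum().item()

# e.g. ff = model.encoder.block[0].layer[1].DenseReluDense
# dead = dead_neuron_count(ff.wi.weight, ff.wo.weight)
```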

### VI.5 Position Bias Topology

**Process:** T5 uses learned relative position biases of shape [32 buckets, N heads]. Measure per head: monotonicity, distance correlation, and peak bucket.

**Encoder (T5-Small):** 3 local heads (peak at buckets 0–1, negative dist_corr), 2 global heads (peak at buckets 17–18, positive dist_corr), 3 mixed.

**Decoder (T5-Small):** 4 far-looking heads (peak at bucket 31, values up to +48), 4 local heads (peak at buckets 0–1, values down to −34.5). Extreme magnitude asymmetry: the far-looking heads are 10× stronger.

**Finding:** This local/global split emerges identically across T5-Small and T5-Base. It is an architectural invariant.

---

## VII. Geometric Residual Modulator

### VII.1 Architecture

- Geometric embedding: [vocab_size, 64] — per-token geometric fingerprint
- Projection: Linear(64, d_model, bias=False) — Procrustes-aligned to encoder PCA space
- Alpha: per-layer learnable LERP coefficient, stored in logit space, applied via sigmoid
- Intervention: residual_out = (1 − α) · residual + α · proj(geo_embed(token_ids))
- Params: 2.09M (3.45% of T5-Small)

### VII.2 Geometric Embedding Initialization

**Process** (a code sketch follows the table):
1. Build a 3000×3000 Wu-Palmer similarity matrix from WordNet anchors (~6 min)
2. Eigendecompose → top 64 eigenvectors scaled by √eigenvalue → 64-d embeddings
3. Project remaining tokens via a GPU embedding cosine proxy (10-NN, softmax-weighted, <1 sec)
4. Procrustes-align the projection matrix to encoder PCA space

| Metric | Value |
|---|---|
| WN reconstruction correlation | 0.921 |
| Procrustes alignment cosine | 0.372 |
| Eigenvalue cumulative (top 64) | 61.3% |
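
A sketch of steps 1–3; the Wu-Palmer matrix `S` and the anchor bookkeeping (`anchor_ids`, the model embedding matrix `E_model`) are assumed:

```
import torch
import torch.nn.functional as F

def anchor_geometry(S: torch.Tensor, k: int = 64) -> torch.Tensor:
    """Spectral embedding of an [n, n] Wu-Palmer similarity matrix."""
    eigvals, eigvecs = torch.linalg.eigh(S)              # ascending
    return eigvecs[:, -k:] * eigvals[-k:].clamp(min=0).sqrt()

def project_non_anchors(E_model: torch.Tensor, anchor_ids: torch.Tensor,
                        anchor_geo: torch.Tensor, k: int = 10) -> torch.Tensor:
    """10-NN softmax-weighted projection via model-embedding cosine proxy."""
    sims = F.normalize(E_model, dim=1) @ F.normalize(E_model[anchor_ids], dim=1).T
    top_sim, top_idx = sims.topk(k, dim=1)
    w = top_sim.softmax(dim=1)
    return torch.einsum('nk,nkd->nd', w, anchor_geo[top_idx])
```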

### VII.3 Alpha Convergence

**Process:** Freeze T5 and train only the modulator (geometric embed + projection + alpha). Task: summarize a definition → the lemma word. Track alpha per layer.

| Start α | Final Mean α | Layer 5 Final | Pearson Δ | CV | Coherent | Basin |
|---|---|---|---|---|---|---|
| 0.01 (20 ep) | **0.067** | **0.107** | **+0.151** | **0.220** | **Yes** | Binding |
| 0.20 (20 ep) | 0.222 | 0.308 | +0.085 | 0.452 | No | Ridge |
| 0.70 (20 ep) | 0.695 | 0.640 | -0.029 | 0.482 | No | Separation |
| 0.01 (100 ep) | 0.125 | 0.218 | +0.074 | 0.322 | No | Overfit |

**Finding:** Two stable attractor basins exist: binding (~0.07) and separation (~0.70). The binding basin produces functional results. Starting at 0.01 with early stopping (20 epochs) is optimal.
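
A sketch of the training setup, reusing `model`, `modulator`, `mod_encoder`, and `device` from the top code block; the `loader` of (definition → lemma) batches and the learning rate are assumptions:

```
import torch

for p in model.parameters():
    p.requires_grad_(False)                               # T5 stays frozen

opt = torch.optim.AdamW(modulator.parameters(), lr=1e-3)  # lr is an assumption

for epoch in range(20):                                   # early stop at 20 epochs
    for batch in loader:
        enc_out = mod_encoder(batch["input_ids"].to(device),
                              attention_mask=batch["attention_mask"].to(device))
        loss = model(encoder_outputs=(enc_out.last_hidden_state,),
                     attention_mask=batch["attention_mask"].to(device),
                     labels=batch["labels"].to(device)).loss
        loss.backward()
        opt.step()
        opt.zero_grad()
    print(epoch, torch.sigmoid(modulator.alpha).tolist())  # per-layer alpha
```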

### VII.4 Depth Gradient (Consistent Across All Runs)

| Layer | 20ep (α=0.01) | 100ep (α=0.01) | 20ep (α=0.20) |
|---|---|---|---|
| 0 | 0.015 | 0.035 | 0.170 |
| 1 | 0.052 | 0.061 | 0.180 |
| 2 | 0.066 | 0.102 | 0.227 |
| 3 | 0.080 | 0.137 | 0.197 |
| 4 | 0.080 | 0.197 | 0.248 |
| 5 | 0.107 | 0.218 | 0.308 |

**Finding:** Alpha increases with depth in every run. The model wants minimal geometric modulation early and maximum modulation at the deepest layer. Geometry is a final correction, not an initial condition.

### VII.5 Best Result

| Metric | Original | Modulated (20ep, α=0.01 start) | Change |
|---|---|---|---|
| WordNet Pearson | 0.099 | **0.250** | **+152%** |
| WordNet Spearman | 0.085 | **0.245** | **+189%** |
| Semantic Gradient | 0.022 | **0.052** | **+132%** |
| Pentachoron CV | 0.202 | **0.220** | Stayed in band |
| Per-token Preservation | — | 0.730 | — |
| Coherence | Baseline | **Identical on 4/4 tests** | — |

---

## VIII. The 0.29154 Constant

### VIII.1 Observations Across Systems

| System | Context | Observed Behavior |
|---|---|---|
| MinimalShunts | CLIP-L ↔ CLIP-G projection gate | Emergent equilibrium |
| Wormhole Lambda | Vision transformer training | Converges from 0.74 toward ~0.29 |
| Alpha curriculum | Devil's Staircase PE training | Converges to ~0.50 under geometric loss; CE destroys it |
| T5 generation | Greedy decode alpha sweep | Stable plateau at 0.291–0.292, semantic phase transition |

### VIII.2 T5 Generation Phase Transition

| Alpha | Output (triangle prompt) |
|---|---|
| 0.01–0.10 | "triangle is a polygon with three edges and three vertices. it is one of the basic shapes in geometry." |
| 0.20 | "**a** triangle is a polygon with three edges and three vertices..." |
| 0.28 | "a polygon with three vertices. it is one of the basic shapes in **a graph**." |
| 0.291 | "a triangle is a polygon with a vertice and a vertice. it is one of the basic shapes in **a graph**." |
| 0.2915 | "a triangle is a polygon with a vertice and a vertice. it is one of the basic shapes in **a graph**." |
| 0.292 | "a triangle is a polygon with a vertice and a vertice. it is one of the basic shapes in **the world**." |
| 0.30 | "a polygon with a vertice and a vertice. it is one of the basic shapes in the world." |

**Finding:** 0.29154 marks the phase boundary between structural representation ("graph") and physical representation ("world"). Output is invariant to perturbation in a narrow band centered on the constant.

---

## IX. Universal Geometric Constants

| Constant | Value | Observed In |
|---|---|---|
| Pentachoron CV | 0.20–0.23 | T5-Small, Qwen 0.8B, Qwen 4B, trained modulator |
| Participation / dim | 0.53–0.56 | T5-Small, Qwen 0.8B |
| Binding/separation constant | 0.29154 / 0.70846 | MinimalShunts, CLIP projections, T5 generation, alpha convergence |
| Depth gradient | Monotonic increasing | All modulator training runs |
| Q sparsity scaling | Increases with model scale | T5-Small (93.7%), T5-Base (99.4%) |

---

## X. Measurement Toolkit Reference

| Tool | Input | Output | Requires Inference |
|---|---|---|---|
| Participation Ratio | Embedding matrix | Effective dimensionality | No |
| Cayley-Menger Volume | 5-point subsets of embeddings | Simplex volume + CV | No |
| Pairwise Cosine | Embedding matrix (sampled) | Similarity distribution | No |
| Digit Manifold | 10 digit token embeddings | \|i−j\| correlation, adjacency gap | No |
| SVD Effective Rank | Any 2D weight matrix | Stable rank, condition number | No |
| QK Manifold | W_Q, W_K matrices | Eigenspectrum, pos/neg balance | No |
| Dead Neuron Count | MLP wi, wo matrices | Combined importance distribution | No |
| WordNet Relational | Encoder output (mean-pooled) | Pearson/Spearman vs path similarity | Yes |
| Alpha Convergence | Modulator training loop | Per-layer equilibrium values | Yes (training) |

---

*Last updated: 2026-03-05*
*Models profiled: 3 (T5-Small, Qwen3.5-0.8B, Qwen3.5-4B)*
*Modulator experiments: 4 configurations*