| --- |
| license: apache-2.0 |
| --- |
| |
| # GeoLIP Deep Embedding Analysis |
|
|
| The first battery is complete. The JSON shows all the embedding sizes that exist within the CV band. |
|
|
| Parse the [sweep.json](https://huggingface.co/AbstractPhil/geolip-deep-embedding-analysis/resolve/main/cv_sweep.json) and find your model attention bands related |
| to your embedding spaces. This can be used as a differentiation utility to determine how much downstream task is required to compensate for the embedding, |
| how much should be reduced, how many layers your embeddings can propagate to, and the effective geometric range of those in conjunction. |
|
|
| ## What This Measures |
|
|
| The Cayley-Menger determinant computes the squared volume of a 4-simplex (pentachoron) formed by 5 randomly sampled embedding vectors. The coefficient of variation (CV) of these volumes across many random samples reveals the geometric operating regime of the embedding space. |
|
|
| - **CV > 0.30**: Volatile β simplex volumes vary wildly, geometric measurements are unstable |
| - **0.13 < CV < 0.30**: Band-valid β volumes carry discriminative structural information |
| - **CV < 0.13**: Degenerate β all simplices look identical, the measurement is blind |
|
|
| The band exists as a function of embedding dimension only. Vocabulary size is irrelevant. Training signal does not move CV β it is a property of the ambient dimensionality. |
|
|
| ## Key Findings |
|
|
| | Dimension | Avg CV | Band Status | |
| |-----------|--------|-------------| |
| | D=8 | 0.605 | Above band β volatile | |
| | D=16 | 0.383 | Above band β entering | |
| | D=24 | 0.304 | **Phase boundary β binding constant 0.29154** | |
| | **D=32** | **0.257** | **Center of band** | |
| | **D=40** | **0.229** | **Center of band** | |
| | **D=48** | **0.207** | **Center of band** | |
| | **D=56** | **0.192** | **In band** | |
| | **D=64** | **0.180** | **In band** | |
| | **D=72** | **0.168** | **In band** | |
| | **D=80** | **0.159** | **In band** | |
| | **D=88** | **0.152** | **In band** | |
| | **D=96** | **0.144** | **In band** | |
| | **D=104** | **0.139** | **In band** | |
| | **D=112** | **0.134** | **In band** | |
| | D=120 | 0.129 | Below band β exiting | |
| | D=128 | 0.125 | Below band | |
| | D=256 | 0.088 | Degenerate | |
| | D=512 | 0.063 | Degenerate | |
| | D=768 | 0.051 | Degenerate | |
|
|
| The standard MHA convention of 64 dims per head sits inside the band. This may be a direct causal relationship β the matmul scaling principle in attention operates at the dimensionality where simplex geometry remains discriminative. |
|
|
| ## Sweep Data |
|
|
| ```json |
| { |
| "sweep": {"step": 8, "low": 8, "high": 2048}, |
| "band": {"lo": 0.13, "hi": 0.30}, |
| "band_results": [... 3014 entries sorted by CV ...], |
| "all_results": [... 65536 entries ...] |
| } |
| ``` |
|
|
| Each entry: `{"V": vocab_size, "D": dim, "CV": value, "in_band": bool}` |
|
|
| ## Download and Nearest Dimensional Lookup |
|
|
| ```python |
| import json |
| import urllib.request |
| |
| URL = "https://huggingface.co/AbstractPhil/geolip-deep-embedding-analysis/resolve/main/cv_sweep.json" |
| |
| def load_sweep(path=None): |
| """Load sweep from local path or download from HF.""" |
| if path: |
| with open(path) as f: |
| return json.load(f) |
| with urllib.request.urlopen(URL) as r: |
| return json.loads(r.read().decode()) |
| |
| def nearest_band_dim(target_dim, sweep=None): |
| """Find the nearest band-valid dimension to your model's embedding dim. |
| |
| Returns the closest D where CV is in band, plus the expected CV range. |
| Use this to determine compartment size for patchwork decomposition. |
| |
| Example: Your model uses D=768. This tells you to decompose into |
| compartments of D=32 (24 compartments) or D=64 (12 compartments). |
| """ |
| if sweep is None: |
| sweep = load_sweep() |
| |
| # Build D -> CV stats from band_results |
| by_dim = {} |
| for r in sweep["band_results"]: |
| d = r["D"] |
| if d not in by_dim: |
| by_dim[d] = [] |
| by_dim[d].append(r["CV"]) |
| |
| band_dims = sorted(by_dim.keys()) |
| if not band_dims: |
| return None |
| |
| # Find nearest |
| nearest = min(band_dims, key=lambda d: abs(d - target_dim)) |
| |
| # Also find best decompositions of target_dim |
| decompositions = [] |
| for d in band_dims: |
| if target_dim % d == 0: |
| n_compartments = target_dim // d |
| cvs = by_dim[d] |
| decompositions.append({ |
| "compartment_dim": d, |
| "n_compartments": n_compartments, |
| "cv_min": round(min(cvs), 4), |
| "cv_max": round(max(cvs), 4), |
| "cv_avg": round(sum(cvs) / len(cvs), 4), |
| }) |
| |
| cvs = by_dim[nearest] |
| return { |
| "target_dim": target_dim, |
| "nearest_band_dim": nearest, |
| "cv_range": [round(min(cvs), 4), round(max(cvs), 4)], |
| "cv_avg": round(sum(cvs) / len(cvs), 4), |
| "valid_decompositions": sorted(decompositions, key=lambda x: x["compartment_dim"]), |
| } |
| |
| |
| # ββ Usage ββ |
| |
| if __name__ == "__main__": |
| for model_dim in [768, 1024, 512, 384, 256, 128]: |
| result = nearest_band_dim(model_dim) |
| print(f"\n{'='*50}") |
| print(f"Model dim: {model_dim}") |
| print(f"Nearest band dim: D={result['nearest_band_dim']} CV={result['cv_avg']:.4f}") |
| if result["valid_decompositions"]: |
| print(f"Valid decompositions:") |
| for dec in result["valid_decompositions"]: |
| print(f" {dec['n_compartments']:3d} Γ D={dec['compartment_dim']:3d} " |
| f"CV={dec['cv_avg']:.4f} [{dec['cv_min']:.4f}-{dec['cv_max']:.4f}]") |
| else: |
| print(f" No exact decompositions β consider padding or truncating") |
| ``` |
|
|
| ## Parse and Filter |
|
|
| ```python |
| import json |
| |
| with open("cv_sweep.json") as f: |
| data = json.load(f) |
| |
| # Filter for any CV range β example: binding constant region |
| lo, hi = 0.290, 0.292 |
| hits = [e for e in data["band_results"] if lo <= e["CV"] <= hi] |
| hits.sort(key=lambda x: x["CV"]) |
| |
| print(f"CV in [{lo}, {hi}]: {len(hits)} entries") |
| for h in hits: |
| print(f" V={h['V']:6d} D={h['D']:4d} CV={h['CV']:.4f}") |
| |
| # Group by D |
| dims = {} |
| for h in hits: |
| dims.setdefault(h["D"], []).append(h) |
| for d in sorted(dims): |
| entries = dims[d] |
| print(f" D={d:3d}: {len(entries)} entries " |
| f"CV={min(e['CV'] for e in entries):.4f}-{max(e['CV'] for e in entries):.4f}") |
| ``` |
|
|
| ## Rescale and Sort |
|
|
| ```python |
| def rescale_sort(sweep=None, group_by="dim"): |
| """Sort and group sweep results for analysis. |
| |
| group_by: 'dim' groups by embedding dimension (recommended) |
| 'cv' groups into cv quartiles within band |
| 'ratio' groups by V/D ratio |
| """ |
| if sweep is None: |
| sweep = load_sweep() |
| |
| band_lo = sweep["band"]["lo"] |
| band_hi = sweep["band"]["hi"] |
| results = [r for r in sweep["all_results"] if r["CV"] is not None] |
| |
| if group_by == "dim": |
| # Group by D, show band status and CV statistics |
| by_dim = {} |
| for r in results: |
| d = r["D"] |
| if d not in by_dim: |
| by_dim[d] = {"in_band": [], "below": [], "above": []} |
| if r["CV"] > band_hi: |
| by_dim[d]["above"].append(r["CV"]) |
| elif r["CV"] < band_lo: |
| by_dim[d]["below"].append(r["CV"]) |
| else: |
| by_dim[d]["in_band"].append(r["CV"]) |
| |
| table = [] |
| for d in sorted(by_dim.keys()): |
| g = by_dim[d] |
| all_cvs = g["in_band"] + g["below"] + g["above"] |
| avg = sum(all_cvs) / len(all_cvs) |
| table.append({ |
| "D": d, |
| "avg_cv": round(avg, 4), |
| "in_band_pct": round(100 * len(g["in_band"]) / len(all_cvs), 1), |
| "n_total": len(all_cvs), |
| "n_in_band": len(g["in_band"]), |
| "status": "IN_BAND" if band_lo < avg < band_hi else |
| "ABOVE" if avg >= band_hi else "BELOW", |
| }) |
| return table |
| |
| elif group_by == "cv": |
| # Quartile analysis within band |
| band = [r for r in results if band_lo < r["CV"] < band_hi] |
| if not band: |
| return [] |
| band.sort(key=lambda r: r["CV"]) |
| n = len(band) |
| return { |
| "total_in_band": n, |
| "q1_low": [r for r in band[:n//4]], |
| "q2_mid_low": [r for r in band[n//4:n//2]], |
| "q3_mid_high": [r for r in band[n//2:3*n//4]], |
| "q4_high": [r for r in band[3*n//4:]], |
| "q1_cv_range": [round(band[0]["CV"], 4), round(band[n//4-1]["CV"], 4)], |
| "q2_cv_range": [round(band[n//4]["CV"], 4), round(band[n//2-1]["CV"], 4)], |
| "q3_cv_range": [round(band[n//2]["CV"], 4), round(band[3*n//4-1]["CV"], 4)], |
| "q4_cv_range": [round(band[3*n//4]["CV"], 4), round(band[-1]["CV"], 4)], |
| } |
| |
| elif group_by == "ratio": |
| # Group by V/D ratio β demonstrates V irrelevance |
| band = [r for r in results if band_lo < r["CV"] < band_hi] |
| by_ratio = {} |
| for r in band: |
| ratio = round(r["V"] / r["D"], 1) |
| if ratio not in by_ratio: |
| by_ratio[ratio] = [] |
| by_ratio[ratio].append(r) |
| return {k: {"count": len(v), "dims": sorted(set(r["D"] for r in v))} |
| for k, v in sorted(by_ratio.items())} |
| |
| |
| # ββ Usage ββ |
| |
| if __name__ == "__main__": |
| table = rescale_sort(group_by="dim") |
| print(f"{'D':>5} {'Avg CV':>8} {'Band%':>6} {'Status'}") |
| print("-" * 40) |
| for row in table: |
| if row["D"] <= 256: |
| print(f"{row['D']:5d} {row['avg_cv']:8.4f} {row['in_band_pct']:5.1f}% {row['status']}") |
| ``` |
|
|
| ## The Binding Constant is D=24 |
|
|
| Filtering the sweep for CV in [0.290, 0.292] β the region around the empirically observed binding constant 0.29154 β returns 12 entries: |
|
|
| | V | D | CV | |
| |---|---|-----| |
| | 24 | 16 | 0.2900 | |
| | 368 | 32 | 0.2903 | |
| | 1632 | 24 | 0.2906 | |
| | 208 | 24 | 0.2908 | |
| | 1096 | 24 | 0.2911 | |
| | 1992 | 24 | 0.2911 | |
| | 200 | 24 | 0.2914 | |
| | 1024 | 24 | 0.2916 | |
| | 760 | 24 | 0.2917 | |
| | 1232 | 24 | 0.2917 | |
| | 776 | 24 | 0.2919 | |
| | 904 | 24 | 0.2920 | |
|
|
| 10 of 12 entries are D=24. The binding constant 0.29154 is the native CV of a 24-dimensional embedding space. It is not a learned value. It is not an empirical coincidence. It is the geometric fingerprint of D=24. |
|
|
| ## The Computational Boundary |
|
|
| D=24 is also the exact dimension where custom SVD kernels hit an 8x performance cliff and eigendecomposition (eigh) collapses. The binding constant marks a dual boundary: |
|
|
| - **Geometric**: the phase transition between volatile simplex volumes (above 0.30) and discriminative geometry (below 0.30) |
| - **Computational**: the resolution limit of compact spectral decomposition kernels |
|
|
| Every time the constant 0.29154 appeared across 17+ pretrained models, the system was measuring the dimensional fingerprint of its own computational ceiling. The constellation encoded this ceiling as a structural constant because it could not compute past it. |
|
|
| D=32 is the first dimension past this wall that remains in band (CV ~0.257). Operating there requires `torch.linalg.det` on a 6Γ6 CM matrix β which compiles regardless of embedding dimension, because the CM matrix is always 6Γ6 for five-point simplices. The pairwise distances are computed via gram matrix (batched matmul, compiles perfectly). Only the `det` call touches linalg, and 6Γ6 is well within kernel range. |
|
|
| ## MHA Activation Geometry |
|
|
| Measuring CV on per-head Q/K/V **activations** (not weights) after training reveals head_dim-dependent geometric behavior: |
| |
| | head_dim | Q activation CV | K activation CV | V activation CV | |
| |----------|----------------|----------------|----------------| |
| | 64 | ~0.32 | ~0.42 | ~0.41 | |
| | 32 | ~0.38 | ~0.45 | ~0.43 | |
| | 16 | ~0.48 | ~0.70 | ~0.53 | |
| | 8 | ~0.65 | ~0.77 | ~0.63 | |
|
|
| Key observations: |
|
|
| - **Embedding activations are always in band** (CV 0.19β0.30) regardless of nominal D β training compresses effective dimensionality into band |
| - **K activations are asymmetrically volatile** β keys spread further than queries to make attention discriminative |
| - **Q activations track head_dim** following the same curve as the embedding sweep β the 64-dim convention keeps Q near band edge |
| - **The Q/K ratio** measures selectivity pressure: too high = brittle attention, too close to 1.0 = uniform attention |
| |
| These ratios can be used as a zero-cost diagnostic on any pretrained transformer: forward one batch, measure per-head activation CV, and immediately identify which heads are geometrically healthy vs collapsing. |
| |
| ## Vocabulary Independence |
| |
| CV at D=32 was verified from V=32 to V=13,000,000. The result is invariant: |
| |
| ``` |
| V= 32 D=32 CV=0.2578 |
| V= 512 D=32 CV=0.2615 |
| V= 8,192 D=32 CV=0.2578 |
| V= 65,536 D=32 CV=0.2663 |
| V= 131,072 D=32 CV=0.2590 |
| V= 500,000 D=32 CV=0.2745 |
| V= 1,000,000 D=32 CV=0.2645 |
| V= 4,000,000 D=32 CV=0.2541 |
| V=13,000,000 D=32 CV=0.2681 |
| ``` |
| |
| Vocabulary size does not gate band membership. The CM determinant samples 5 points β the distribution of simplex volumes depends on ambient dimensionality, not on the number of points in the space. |
| |
| ## Implications for Architecture Design |
| |
| The band is not a training outcome. It is a geometric property of dimensionality. This means: |
| |
| 1. **Embedding compartments must be D=32 to D=64** for Cayley-Menger volumes to carry discriminative information |
| 2. **A 768-dim model** should decompose into 24Γ32 or 12Γ64 compartments, not operate as a monolithic vector |
| 3. **The standard 64-dim attention head** may exist precisely because it sits inside this geometric band |
| 4. **Scaling** comes from composing band-valid units with geometric linkages, not from widening dimensions beyond the band |
| 5. **D=24 (CV=0.29154)** is the phase boundary β any component pushed above this threshold has crossed from structured into volatile geometry |
| 6. **The 6Γ6 CM determinant compiles** at any embedding dimension β the computational bottleneck was in spectral decomposition, not in the geometric measurement itself |
| |
| ## Reproducing |
| |
| ```python |
| # The sweep script that generated this data |
| # Requires: torch |
| |
| import torch, torch.nn as nn, torch.nn.functional as F, math, json |
| |
| def cayley_menger_vol2(points): |
| B, N, D = points.shape |
| gram = torch.bmm(points, points.transpose(1, 2)) |
| norms = torch.diagonal(gram, dim1=1, dim2=2) |
| d2 = F.relu(norms.unsqueeze(2) + norms.unsqueeze(1) - 2 * gram) |
| cm = torch.zeros(B, N+1, N+1, device=points.device, dtype=points.dtype) |
| cm[:, 0, 1:] = 1; cm[:, 1:, 0] = 1; cm[:, 1:, 1:] = d2 |
| k = N - 1 |
| return ((-1)**(k+1)) * torch.linalg.det(cm.float()).to(points.dtype) / ((2**k) * (math.factorial(k)**2)) |
| |
| def cv_metric(weight, n_samples=300): |
| V, D = weight.shape |
| pool = min(V, 512) |
| idx = torch.stack([torch.randperm(pool)[:5] for _ in range(n_samples)]) |
| vol2 = cayley_menger_vol2(weight[:pool][idx]) |
| valid = vol2 > 1e-20 |
| if valid.sum() < 10: return None |
| vols = vol2[valid].sqrt() |
| return (vols.std() / (vols.mean() + 1e-8)).item() |
| ``` |
| |
| ## Citation |
| |
| Part of the [GeoLIP](https://huggingface.co/AbstractPhil) geometric deep learning research. |