File size: 15,118 Bytes
6fb2607 36af25f 402d379 36af25f 402d379 36af25f d8a6cd0 36af25f 402d379 36af25f d8a6cd0 36af25f d8a6cd0 fa015c1 d8a6cd0 fa015c1 d8a6cd0 fa015c1 d8a6cd0 fa015c1 d8a6cd0 fa015c1 d8a6cd0 fa015c1 36af25f 402d379 36af25f d8a6cd0 6fb2607 36af25f 6fb2607 36af25f 6fb2607 36af25f | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 | ---
license: apache-2.0
---
# GeoLIP Deep Embedding Analysis
The first battery is complete. The JSON shows all the embedding sizes that exist within the CV band.
Parse the [sweep.json](https://huggingface.co/AbstractPhil/geolip-deep-embedding-analysis/resolve/main/cv_sweep.json) and find your model attention bands related
to your embedding spaces. This can be used as a differentiation utility to determine how much downstream task is required to compensate for the embedding,
how much should be reduced, how many layers your embeddings can propagate to, and the effective geometric range of those in conjunction.
## What This Measures
The Cayley-Menger determinant computes the squared volume of a 4-simplex (pentachoron) formed by 5 randomly sampled embedding vectors. The coefficient of variation (CV) of these volumes across many random samples reveals the geometric operating regime of the embedding space.
- **CV > 0.30**: Volatile β simplex volumes vary wildly, geometric measurements are unstable
- **0.13 < CV < 0.30**: Band-valid β volumes carry discriminative structural information
- **CV < 0.13**: Degenerate β all simplices look identical, the measurement is blind
The band exists as a function of embedding dimension only. Vocabulary size is irrelevant. Training signal does not move CV β it is a property of the ambient dimensionality.
## Key Findings
| Dimension | Avg CV | Band Status |
|-----------|--------|-------------|
| D=8 | 0.605 | Above band β volatile |
| D=16 | 0.383 | Above band β entering |
| D=24 | 0.304 | **Phase boundary β binding constant 0.29154** |
| **D=32** | **0.257** | **Center of band** |
| **D=40** | **0.229** | **Center of band** |
| **D=48** | **0.207** | **Center of band** |
| **D=56** | **0.192** | **In band** |
| **D=64** | **0.180** | **In band** |
| **D=72** | **0.168** | **In band** |
| **D=80** | **0.159** | **In band** |
| **D=88** | **0.152** | **In band** |
| **D=96** | **0.144** | **In band** |
| **D=104** | **0.139** | **In band** |
| **D=112** | **0.134** | **In band** |
| D=120 | 0.129 | Below band β exiting |
| D=128 | 0.125 | Below band |
| D=256 | 0.088 | Degenerate |
| D=512 | 0.063 | Degenerate |
| D=768 | 0.051 | Degenerate |
The standard MHA convention of 64 dims per head sits inside the band. This may be a direct causal relationship β the matmul scaling principle in attention operates at the dimensionality where simplex geometry remains discriminative.
## Sweep Data
```json
{
"sweep": {"step": 8, "low": 8, "high": 2048},
"band": {"lo": 0.13, "hi": 0.30},
"band_results": [... 3014 entries sorted by CV ...],
"all_results": [... 65536 entries ...]
}
```
Each entry: `{"V": vocab_size, "D": dim, "CV": value, "in_band": bool}`
## Download and Nearest Dimensional Lookup
```python
import json
import urllib.request
URL = "https://huggingface.co/AbstractPhil/geolip-deep-embedding-analysis/resolve/main/cv_sweep.json"
def load_sweep(path=None):
"""Load sweep from local path or download from HF."""
if path:
with open(path) as f:
return json.load(f)
with urllib.request.urlopen(URL) as r:
return json.loads(r.read().decode())
def nearest_band_dim(target_dim, sweep=None):
"""Find the nearest band-valid dimension to your model's embedding dim.
Returns the closest D where CV is in band, plus the expected CV range.
Use this to determine compartment size for patchwork decomposition.
Example: Your model uses D=768. This tells you to decompose into
compartments of D=32 (24 compartments) or D=64 (12 compartments).
"""
if sweep is None:
sweep = load_sweep()
# Build D -> CV stats from band_results
by_dim = {}
for r in sweep["band_results"]:
d = r["D"]
if d not in by_dim:
by_dim[d] = []
by_dim[d].append(r["CV"])
band_dims = sorted(by_dim.keys())
if not band_dims:
return None
# Find nearest
nearest = min(band_dims, key=lambda d: abs(d - target_dim))
# Also find best decompositions of target_dim
decompositions = []
for d in band_dims:
if target_dim % d == 0:
n_compartments = target_dim // d
cvs = by_dim[d]
decompositions.append({
"compartment_dim": d,
"n_compartments": n_compartments,
"cv_min": round(min(cvs), 4),
"cv_max": round(max(cvs), 4),
"cv_avg": round(sum(cvs) / len(cvs), 4),
})
cvs = by_dim[nearest]
return {
"target_dim": target_dim,
"nearest_band_dim": nearest,
"cv_range": [round(min(cvs), 4), round(max(cvs), 4)],
"cv_avg": round(sum(cvs) / len(cvs), 4),
"valid_decompositions": sorted(decompositions, key=lambda x: x["compartment_dim"]),
}
# ββ Usage ββ
if __name__ == "__main__":
for model_dim in [768, 1024, 512, 384, 256, 128]:
result = nearest_band_dim(model_dim)
print(f"\n{'='*50}")
print(f"Model dim: {model_dim}")
print(f"Nearest band dim: D={result['nearest_band_dim']} CV={result['cv_avg']:.4f}")
if result["valid_decompositions"]:
print(f"Valid decompositions:")
for dec in result["valid_decompositions"]:
print(f" {dec['n_compartments']:3d} Γ D={dec['compartment_dim']:3d} "
f"CV={dec['cv_avg']:.4f} [{dec['cv_min']:.4f}-{dec['cv_max']:.4f}]")
else:
print(f" No exact decompositions β consider padding or truncating")
```
## Parse and Filter
```python
import json
with open("cv_sweep.json") as f:
data = json.load(f)
# Filter for any CV range β example: binding constant region
lo, hi = 0.290, 0.292
hits = [e for e in data["band_results"] if lo <= e["CV"] <= hi]
hits.sort(key=lambda x: x["CV"])
print(f"CV in [{lo}, {hi}]: {len(hits)} entries")
for h in hits:
print(f" V={h['V']:6d} D={h['D']:4d} CV={h['CV']:.4f}")
# Group by D
dims = {}
for h in hits:
dims.setdefault(h["D"], []).append(h)
for d in sorted(dims):
entries = dims[d]
print(f" D={d:3d}: {len(entries)} entries "
f"CV={min(e['CV'] for e in entries):.4f}-{max(e['CV'] for e in entries):.4f}")
```
## Rescale and Sort
```python
def rescale_sort(sweep=None, group_by="dim"):
"""Sort and group sweep results for analysis.
group_by: 'dim' groups by embedding dimension (recommended)
'cv' groups into cv quartiles within band
'ratio' groups by V/D ratio
"""
if sweep is None:
sweep = load_sweep()
band_lo = sweep["band"]["lo"]
band_hi = sweep["band"]["hi"]
results = [r for r in sweep["all_results"] if r["CV"] is not None]
if group_by == "dim":
# Group by D, show band status and CV statistics
by_dim = {}
for r in results:
d = r["D"]
if d not in by_dim:
by_dim[d] = {"in_band": [], "below": [], "above": []}
if r["CV"] > band_hi:
by_dim[d]["above"].append(r["CV"])
elif r["CV"] < band_lo:
by_dim[d]["below"].append(r["CV"])
else:
by_dim[d]["in_band"].append(r["CV"])
table = []
for d in sorted(by_dim.keys()):
g = by_dim[d]
all_cvs = g["in_band"] + g["below"] + g["above"]
avg = sum(all_cvs) / len(all_cvs)
table.append({
"D": d,
"avg_cv": round(avg, 4),
"in_band_pct": round(100 * len(g["in_band"]) / len(all_cvs), 1),
"n_total": len(all_cvs),
"n_in_band": len(g["in_band"]),
"status": "IN_BAND" if band_lo < avg < band_hi else
"ABOVE" if avg >= band_hi else "BELOW",
})
return table
elif group_by == "cv":
# Quartile analysis within band
band = [r for r in results if band_lo < r["CV"] < band_hi]
if not band:
return []
band.sort(key=lambda r: r["CV"])
n = len(band)
return {
"total_in_band": n,
"q1_low": [r for r in band[:n//4]],
"q2_mid_low": [r for r in band[n//4:n//2]],
"q3_mid_high": [r for r in band[n//2:3*n//4]],
"q4_high": [r for r in band[3*n//4:]],
"q1_cv_range": [round(band[0]["CV"], 4), round(band[n//4-1]["CV"], 4)],
"q2_cv_range": [round(band[n//4]["CV"], 4), round(band[n//2-1]["CV"], 4)],
"q3_cv_range": [round(band[n//2]["CV"], 4), round(band[3*n//4-1]["CV"], 4)],
"q4_cv_range": [round(band[3*n//4]["CV"], 4), round(band[-1]["CV"], 4)],
}
elif group_by == "ratio":
# Group by V/D ratio β demonstrates V irrelevance
band = [r for r in results if band_lo < r["CV"] < band_hi]
by_ratio = {}
for r in band:
ratio = round(r["V"] / r["D"], 1)
if ratio not in by_ratio:
by_ratio[ratio] = []
by_ratio[ratio].append(r)
return {k: {"count": len(v), "dims": sorted(set(r["D"] for r in v))}
for k, v in sorted(by_ratio.items())}
# ββ Usage ββ
if __name__ == "__main__":
table = rescale_sort(group_by="dim")
print(f"{'D':>5} {'Avg CV':>8} {'Band%':>6} {'Status'}")
print("-" * 40)
for row in table:
if row["D"] <= 256:
print(f"{row['D']:5d} {row['avg_cv']:8.4f} {row['in_band_pct']:5.1f}% {row['status']}")
```
## The Binding Constant is D=24
Filtering the sweep for CV in [0.290, 0.292] β the region around the empirically observed binding constant 0.29154 β returns 12 entries:
| V | D | CV |
|---|---|-----|
| 24 | 16 | 0.2900 |
| 368 | 32 | 0.2903 |
| 1632 | 24 | 0.2906 |
| 208 | 24 | 0.2908 |
| 1096 | 24 | 0.2911 |
| 1992 | 24 | 0.2911 |
| 200 | 24 | 0.2914 |
| 1024 | 24 | 0.2916 |
| 760 | 24 | 0.2917 |
| 1232 | 24 | 0.2917 |
| 776 | 24 | 0.2919 |
| 904 | 24 | 0.2920 |
10 of 12 entries are D=24. The binding constant 0.29154 is the native CV of a 24-dimensional embedding space. It is not a learned value. It is not an empirical coincidence. It is the geometric fingerprint of D=24.
## The Computational Boundary
D=24 is also the exact dimension where custom SVD kernels hit an 8x performance cliff and eigendecomposition (eigh) collapses. The binding constant marks a dual boundary:
- **Geometric**: the phase transition between volatile simplex volumes (above 0.30) and discriminative geometry (below 0.30)
- **Computational**: the resolution limit of compact spectral decomposition kernels
Every time the constant 0.29154 appeared across 17+ pretrained models, the system was measuring the dimensional fingerprint of its own computational ceiling. The constellation encoded this ceiling as a structural constant because it could not compute past it.
D=32 is the first dimension past this wall that remains in band (CV ~0.257). Operating there requires `torch.linalg.det` on a 6Γ6 CM matrix β which compiles regardless of embedding dimension, because the CM matrix is always 6Γ6 for five-point simplices. The pairwise distances are computed via gram matrix (batched matmul, compiles perfectly). Only the `det` call touches linalg, and 6Γ6 is well within kernel range.
## MHA Activation Geometry
Measuring CV on per-head Q/K/V **activations** (not weights) after training reveals head_dim-dependent geometric behavior:
| head_dim | Q activation CV | K activation CV | V activation CV |
|----------|----------------|----------------|----------------|
| 64 | ~0.32 | ~0.42 | ~0.41 |
| 32 | ~0.38 | ~0.45 | ~0.43 |
| 16 | ~0.48 | ~0.70 | ~0.53 |
| 8 | ~0.65 | ~0.77 | ~0.63 |
Key observations:
- **Embedding activations are always in band** (CV 0.19β0.30) regardless of nominal D β training compresses effective dimensionality into band
- **K activations are asymmetrically volatile** β keys spread further than queries to make attention discriminative
- **Q activations track head_dim** following the same curve as the embedding sweep β the 64-dim convention keeps Q near band edge
- **The Q/K ratio** measures selectivity pressure: too high = brittle attention, too close to 1.0 = uniform attention
These ratios can be used as a zero-cost diagnostic on any pretrained transformer: forward one batch, measure per-head activation CV, and immediately identify which heads are geometrically healthy vs collapsing.
## Vocabulary Independence
CV at D=32 was verified from V=32 to V=13,000,000. The result is invariant:
```
V= 32 D=32 CV=0.2578
V= 512 D=32 CV=0.2615
V= 8,192 D=32 CV=0.2578
V= 65,536 D=32 CV=0.2663
V= 131,072 D=32 CV=0.2590
V= 500,000 D=32 CV=0.2745
V= 1,000,000 D=32 CV=0.2645
V= 4,000,000 D=32 CV=0.2541
V=13,000,000 D=32 CV=0.2681
```
Vocabulary size does not gate band membership. The CM determinant samples 5 points β the distribution of simplex volumes depends on ambient dimensionality, not on the number of points in the space.
## Implications for Architecture Design
The band is not a training outcome. It is a geometric property of dimensionality. This means:
1. **Embedding compartments must be D=32 to D=64** for Cayley-Menger volumes to carry discriminative information
2. **A 768-dim model** should decompose into 24Γ32 or 12Γ64 compartments, not operate as a monolithic vector
3. **The standard 64-dim attention head** may exist precisely because it sits inside this geometric band
4. **Scaling** comes from composing band-valid units with geometric linkages, not from widening dimensions beyond the band
5. **D=24 (CV=0.29154)** is the phase boundary β any component pushed above this threshold has crossed from structured into volatile geometry
6. **The 6Γ6 CM determinant compiles** at any embedding dimension β the computational bottleneck was in spectral decomposition, not in the geometric measurement itself
## Reproducing
```python
# The sweep script that generated this data
# Requires: torch
import torch, torch.nn as nn, torch.nn.functional as F, math, json
def cayley_menger_vol2(points):
B, N, D = points.shape
gram = torch.bmm(points, points.transpose(1, 2))
norms = torch.diagonal(gram, dim1=1, dim2=2)
d2 = F.relu(norms.unsqueeze(2) + norms.unsqueeze(1) - 2 * gram)
cm = torch.zeros(B, N+1, N+1, device=points.device, dtype=points.dtype)
cm[:, 0, 1:] = 1; cm[:, 1:, 0] = 1; cm[:, 1:, 1:] = d2
k = N - 1
return ((-1)**(k+1)) * torch.linalg.det(cm.float()).to(points.dtype) / ((2**k) * (math.factorial(k)**2))
def cv_metric(weight, n_samples=300):
V, D = weight.shape
pool = min(V, 512)
idx = torch.stack([torch.randperm(pool)[:5] for _ in range(n_samples)])
vol2 = cayley_menger_vol2(weight[:pool][idx])
valid = vol2 > 1e-20
if valid.sum() < 10: return None
vols = vol2[valid].sqrt()
return (vols.std() / (vols.mean() + 1e-8)).item()
```
## Citation
Part of the [GeoLIP](https://huggingface.co/AbstractPhil) geometric deep learning research. |