Reading the Voids: Topological Contribution Signals in Frozen Geometric Codebooks
AbstractPhil · May 2026 · geolip-svae / geolip-core Methodology and ablation tooling developed with Mirel.
TL;DR
A frozen PatchSVAE battery induces a codebook — a small cloud of sign-canonicalized axes on projective space ℝPᴰ⁻¹. The shipped classifier that labels a codebook's "omega phase" reads only its H0 connectivity and discards everything else the topology probe computes, including the H1 loops and H2 voids of the axis cloud.
We turned every discarded quantity into an independently-toggleable contribution signal, fixed a metric-alignment bug (the probe measured persistence on the raw sphere, not the projective space the axes actually live on), and ablated all 15 signals across 41 frozen batteries spanning six model classes and three substrate types.
The headline result survives a dimension control that kills every geometric signal:
At fixed latent dimension, the H2 voids separate symbolic substrates from continuous ones — a distinction the geometry cannot make. Image codebooks are void-sparse; symbolic-vocabulary codebooks are void-rich. Voids are a substrate fingerprint.
We package the finding as omega_phase_v2, a two-axis taxonomy: a cross-dimension regime axis (the geometry) and a within-dimension void-character axis (the topology).
1. Background: what a codebook is
In the geolip-svae framework, a PatchSVAE "battery" is a tiny, frozen per-patch reconstruction codec. On a forward pass it emits a structured latent M of shape (B, N, V, D) — for each of N patches, a V × D frame whose rows live in a D-dimensional singular space.
The codebook is what you get by aggregating M across a calibration set, then collapsing antipodal row pairs (cosine < −0.9) into single representatives. The survivors are unit vectors, sign-canonicalized so that the first non-zero coordinate is positive — i.e. points on the real projective space ℝPᴰ⁻¹, where a vector and its antipode are the same point. This is the aleph convention: a codebook entry is an address (which axis) plus a sign bit (which antipode), and the system's discriminative readout, the AlephAssessor, reads M by that signed-projective address rather than by raw cosine.
A codebook is therefore a small point cloud — typically 20–250 axes — on a low-dimensional projective sphere. Its shape is a fingerprint of what the frozen battery learned to represent.
2. The gap: voids computed, then thrown away
The framework already analyzes that shape. run_topology_analysis runs three probes over the axis cloud: a kNN connectivity sweep, a local-intrinsic-dimension PCA, and — via ripser — persistent homology to dimension two, producing H0 (connected components), H1 (loops), and H2 (voids/cavities).
But the classifier that consumes this, _classify_omega_phase, looks at H0 only: how many components persist, whether they form pairs or clusters, where the giant component appears. The H1 loops and H2 voids are computed at full cost, written into the report JSON — and then read by nothing. No phase label, no telemetry, no decision depends on them.
That is the waste this work addresses. Reconstruction MSE tells you a codec is faithful. The codebook topology tells you how the latent is organized — and the voids are the part of that organization we were paying to compute and choosing to ignore. A void in the axis cloud is a region of representational space the vocabulary encircles but never occupies: a hole in what the battery can place there. The question is whether that hole means anything.
3. Method
3.1 A metric-alignment fix
Before adding any signal, one correction. The codebook axes are projective — compared everywhere else in the system by the folded angle arccos|cos θ| ∈ [0, π/2] on ℝPᴰ⁻¹, against the uniform-packing baseline uniform_projective_angle(D), within the envelope dev_critical(D) = 0.02·√D. But the stock topology probe computes persistence on the raw sphere angle arccos(cos θ) ∈ [0, π]. For sign-canonicalized representatives those are not the same metric, and loops/voids measured on the raw sphere double-count antipodal geometry the system treats as identified.
Every signal below is recomputed on the projective distance d(a,b) = arccos|⟨a,b⟩|, and deviation-type signals are normalized by dev_critical so that |x| > 1 means exactly what the architectural rigidity barrier means: out of envelope. The topology now lives in the same space as the aleph address.
3.2 Fifteen contribution signals
Each discarded quantity becomes a named signal with an explicit formula, the omega/aleph rule it preserves, and its intended use. Grouped:
Geometry / dispersion — proj_deviation (mean projective angle minus the uniform baseline), deviation_envelopes (the same in dev_critical units), angular_iqr (inter-quartile spread of pairwise projective angles).
Connectivity — percolation_ratio (the angle at which a giant component forms, relative to the uniform baseline), giant_frac_at_uniform (largest component fraction with the graph thresholded at the uniform baseline rather than an arbitrary degree).
Local geometry — local_dim_ratio (median participation-ratio dimension of each axis's k-NN offset cloud, divided by D: → 0 if axes lie on a curve, → 1 if they fill the tangent space).
Loops (H1) — betti1 (loop count per axis), h1_total_persistence, h1_max_persistence, h1_persistence_entropy.
Voids (H2) — betti2 (void count per axis), h2_total_persistence, h2_max_persistence, h2_persistence_entropy. The headline.
Aleph structure — pairing_fraction (fraction of rows that collapsed into antipodal pairs — the realization of the sign bit).
A subtle but decisive design point emerged during ablation (§5.1): count- and sum-type signals must be intensive. Raw Betti numbers and total persistence scale with the number of axes, so on a zoo spanning 16 to 256 axes their variance is dominated by codebook size, not structure. All four are normalized per-axis (betti2 / n_axes, total persistence / n_axes); max and entropy are already scale-free.
3.3 Ablation metrics
For each signal across the battery zoo we report three things, because they answer different questions:
- cv = std / |mean|: scale-free spread. Does the signal vary at all?
- |ρ| = |Spearman| with reconstruction MSE: does it track quality / detect a broken codebook? (Spearman, not Pearson — recon MSE is heavy-tailed, spanning 1e-7 to 0.9.)
- η² = one-way ANOVA SS_between / SS_total: variance explained by model class. This is the right tool for "does it separate substrates," since class is nominal and a scalar correlation cannot capture it.
4. The zoo
41 frozen batteries from AbstractPhil/geolip-SVAE, auto-discovered as every folder with a checkpoints/best.pt, spanning:
| class | n | examples | latent |
|---|---|---|---|
| byte_trigram | 12 | byte_trigram_proto, radmag sweep, wikitext | D=4, V=32 |
| a-class | 6 | v14_noise, v15_omega_noise, johanna, alexandria | D=16, large |
| image | 8 | h2_linear_tiny_imagenet 64/128/256, v11_prod, imagenet | mixed D |
| s-class | 4 | freckles, fresnel | D=4 |
| h2-class | 2 | h2_64_1channel, h2_64_5channel | D=4 |
| sentencepiece | 1 | sentencepiece_proto_v1 | D=4 |
| unknown | 8 | SVD-mode t1/t2, noise-image variants | D=4 |
Two checkpoints had to be recovered by stripping a torch.compile _orig_mod. key prefix; three model_type='v2' checkpoints are unrecoverable (the architecture was removed) and are excluded.
5. Results
5.1 Validation
On synthetic clouds the signals respond exactly as the geometry demands: a ring in a 2-plane yields betti1 = 1, h1_max_persistence = 0.625, h1_persistence_entropy = 0 — a single dominant loop, no spurious spread; a tight cluster collapses to near-zero persistence with betti2 = 0; a uniform cloud shows many diffuse high-entropy features (the random-cloud noise floor). The deviation envelope reads −0.27 for uniform packing and −26.5 for the degenerate cluster. The machinery measures what it claims to.
5.2 Across dimension, geometry dominates
Ranked by class-separation η² over all 41 batteries:
| signal | η² | abs ρ (recon) |
|---|---|---|
| local_dim_ratio | 0.672 | 0.105 |
| giant_frac_at_uniform | 0.617 | 0.224 |
| proj_deviation | 0.595 | 0.148 |
| angular_iqr | 0.574 | 0.385 |
| deviation_envelopes | 0.548 | 0.104 |
| h1_max_persistence | 0.424 | 0.006 |
| betti2 | 0.337 | 0.471 |
| h2_max_persistence | 0.240 | 0.269 |
| h2_persistence_entropy | 0.187 | 0.267 |
The geometric signals lead decisively. But the per-class means expose what they actually separate:
| class | local_dim_ratio | giant_frac | angular_iqr |
|---|---|---|---|
| a-class | 0.104 | 0.424 | 0.077 |
| image | 0.322 | 0.974 | 0.330 |
| byte_trigram | 0.614 | 1.000 | 0.529 |
| s-class | 0.455 | 1.000 | 0.664 |
| h2-class | 0.577 | 1.000 | 0.561 |
The separation is almost entirely a-class versus everything else — the large, D=16, frequently omega-noise-trained models against the small D=4 substrates, which bunch together. This is a dimension × training-health axis, not a content axis. giant_frac in particular is isolating the omega-noise subset (those four clouds fail to percolate at the uniform baseline; everyone else's percolates). The D=4 substrates that share a dimension — byte_trigram, s-class, h2-class, sentencepiece — sit nearly on top of each other on every geometric signal.
And the high recon-|ρ| of betti2 (0.47) is the same story: the most void-saturated clouds (v11_prod at 10.2 voids/axis, v19_fresnel_tiny at 10.4, v14_noise at 8.25) are also the higher-MSE ones. Void density is a real noise-like-codebook detector, but it is class-confounded with recon, not a clean quality law.
5.3 The control: hold dimension fixed
Recon MSE is the wrong target for the structural question — a well-trained a-class model and a well-trained byte_trigram model both reconstruct fine but are topologically unrelated. So we re-ran the η² ablation within the D=4 cohort only (n_axes < 100), which removes the dimension axis. The ranking inverts:
| signal | η² cross-D | η² within-D=4 |
|---|---|---|
| local_dim_ratio | 0.671 | 0.268 ↓ |
| giant_frac_at_uniform | 0.615 | 0.309 ↓ |
| angular_iqr | 0.574 | 0.243 ↓ |
| h2_persistence_entropy | 0.182 | 0.452 ↑ |
| h1_total_persistence | 0.083 | 0.447 ↑ |
| h2_max_persistence | 0.191 | 0.418 ↑ |
| betti2 (void density) | 0.335 | 0.395 ↑ |
| proj_deviation / dev_env | 0.60 | 0.62 → |
Every geometric signal collapses — confirming they were dimension proxies. The void and loop signals rise to the top. The only geometric survivor is deviation, and it survives for a concrete reason: s-class codebooks are distinctively under-dispersed (negative envelopes) while image codebooks are over-dispersed.
5.4 The finding: voids are a substrate fingerprint
Within fixed dimension, the per-class void means say it plainly:
| class | h2_persistence_entropy | h2_max_persistence | betti2 / axis |
|---|---|---|---|
| image | 0.313 | 0.020 | 0.069 |
| s-class | 0.649 | 0.046 | 0.222 |
| h2-class | 0.815 | 0.080 | 0.261 |
| byte_trigram | 0.905 | 0.080 | 0.402 |
| sentencepiece | 0.916 | 0.065 | 0.346 |
At the same latent dimension and the same "space-filling" regime, the image codebooks are void-sparse and the symbolic-vocabulary codebooks are void-rich. A symbolic/combinatorial vocabulary builds an axis cloud that encircles topological cavities — there are configurations of representation the vocabulary structures around but never lands on. A continuous-image vocabulary builds a smoother, void-poor cloud. This distinction is invisible to every geometric signal (they are flat across the D=4 substrates) and only the persistent homology recovers it. That is the answer to what the voids are for.
5.5 Refinement: readout or content?
Two recovered SVD-mode batteries (t1_ps4_d4_v32_h128_svd and its sm32 sibling) sharpen the claim. They are the most void-structured small-D models in the entire zoo — h2_max_persistence ≈ 0.141, the highest H2 persistence anywhere, above every symbolic substrate — and omega_phase_v2 places them in void_structured alongside byte_trigram, despite their not being symbolic-text models. The clean split is therefore one diagonal of a 2×2:
| continuous / image | symbolic | |
|---|---|---|
| linear readout | void-sparse (h2_linear_tiny, h2_max 0.02–0.045) |
void-structured (byte_trigram, sentencepiece) |
| SVD readout | sparse-ish (h2_h64…svd, h2_max 0.034) |
— |
| SVD, unknown content | void-structured, strongest (t1, h2_max 0.141) |
— |
Void-sparsity is specifically the linear-readout-on-continuous corner; void-structure arises from symbolic content and/or an SVD readout. Whether the t1 voids are a readout signature or a content signature hinges on one fact — t1's training dataset — and is left open here. The honest current claim is the conditional one: at fixed dimension, the H2 voids distinguish void-sparse continuous-linear codebooks from the void-structured rest.
6. omega_phase_v2: a two-axis taxonomy
The ablation says the signal is not one taxonomy but two orthogonal axes plus a dispersion qualifier, so the classifier is built that way rather than as a flat label:
- regime (cross-dimension geometry:
local_dim_ratio,giant_frac,angular_iqr) →collapsed_fragmented·concentrated·space_filling·transitional. The dominant axis across dimensions; collapses within a fixed one. - void_character (within-dimension substrate signal:
betti2-density +h2_persistence_entropy) →void_sparse(continuous/image) ·void_structured(symbolic and/or SVD) ·void_saturated(noise-like) ·void_mixed. The signal the geometry cannot see. - dispersion (the
dev_criticalenvelope) →under_dispersed(s-class) ·in_envelope·over_dispersed.
Validated against the real battery values, the labels land where the structure dictates:
| battery | regime | dispersion | void_character |
|---|---|---|---|
| byte_trigram_proto_128 | space_filling | in_envelope | void_structured |
| sentencepiece_proto | space_filling | in_envelope | void_structured |
| h2_linear_tiny_imagenet_64/128/256 | space_filling | in/over | void_sparse |
| v40_freckles (s-class) | space_filling | under_dispersed | void_sparse |
| v15_omega_noise | collapsed_fragmented | over_dispersed | void_structured |
| v11_prod / v14_noise | concentrated | in_envelope | void_saturated |
| t1_…_svd | space_filling | in_envelope | void_structured |
Thresholds are empirical from this zoo and tunable; omega_phase_v2 is descriptive telemetry, deliberately not a loss and not a proof.
7. Limitations
This is an honest map of a small, heterogeneous zoo, not a controlled study.
- Small per-class n. η² is biased upward for tiny groups;
sentencepiece(n=1) andh2-class(n=2) are unreliable. The robust comparisons are byte_trigram (12), image (8), a-class (6), s-class (4). The image void-sparsity holds because it is consistent across all three D=4 image batteries, not because n is large. - Class confounds dimension and training health. The cross-dimension separation is mostly a dimension/training axis; the substrate claim is made only within fixed dimension.
- Readout vs content is unresolved. The t1 SVD result shows void-structure is not exclusively symbolic; the disambiguating experiment (cross the full 2×2) is future work.
- Descriptive, not generative. None of this is yet a training signal. The codebook extraction (antipodal collapse, ripser) is non-differentiable; these are measurement and classification signals.
8. Conclusion
The voids in a frozen codebook were being computed and discarded. Measured in the correct projective metric and normalized to be scale-free, they turn out to carry a substrate fingerprint that no geometric measurement of the same cloud can recover: at fixed latent dimension, symbolic vocabularies build void-rich codebooks and continuous-image vocabularies build void-sparse ones. The result is robust to a dimension control that flattens every competing signal, and it is operationalized in a two-axis phase taxonomy that cleanly separates codebook regime (a geometric, cross-dimension property) from void character (a topological, within-dimension substrate property).
The broader point: a frozen ~57k-parameter codec is not just a faithful reconstructor. Its codebook has a topology, and that topology remembers what kind of thing it learned to encode. We were already paying to measure it. Now we read it.
Appendix: signal definitions
All distances projective: d(a,b) = arccos|⟨a,b⟩| ∈ [0, π/2]; axes sign-canonicalized; persistences normalized by π/2; counts and sums normalized per-axis.
| signal | formula | rule preserved |
|---|---|---|
proj_deviation |
mean d − uniform_projective_angle(D) |
projective baseline |
deviation_envelopes |
proj_deviation / (0.02·√D) |
dev_critical / rigidity barrier |
angular_iqr |
p75 − p25 of pairwise d |
projective spread |
percolation_ratio |
θ_giant(proj) / uniform_projective_angle(D) |
baseline-relative connection |
giant_frac_at_uniform |
largest component / n at θ = uniform | baseline-pinned threshold |
local_dim_ratio |
median k-NN participation-ratio dim / D | tangent occupancy |
betti1 |
(finite H1 features) / n_axes | intensive loop count |
h1_{total,max}_persistence |
Σ or max (death−birth) over H1 / (π/2) [/n for total] | projective persistence |
h1_persistence_entropy |
normalized Shannon entropy of H1 persistences | spectrum regularity |
betti2 |
(finite H2 features) / n_axes | intensive void density |
h2_{total,max}_persistence |
Σ or max (death−birth) over H2 / (π/2) [/n for total] | projective void strength |
h2_persistence_entropy |
normalized Shannon entropy of H2 persistences | void-spectrum regularity |
pairing_fraction |
n_pairs / (n_pairs + n_unpaired) | aleph ± bit (cos < −0.9 collapse) |
Scripts
Install
github https://github.com/AbstractEyes/geolip-svae
pypackage ripser
Tooling: codebook_contributions.py (signals, omega_phase_v2, ablation harness) and battery_ablation.py (cross-battery driver). numpy + scipy + ripser; torch-free analysis layer over the frozen geolip-svae batteries.

