Reading the Voids: Topological Contribution Signals in Frozen Geometric Codebooks

Community Article Published May 31, 2026

How the H2 cavities of a frozen autoencoder's codebook encode a substrate fingerprint that no geometric measurement can see — and a two-axis phase taxonomy built on it.

AbstractPhil · May 2026 · geolip-svae / geolip-core Methodology and ablation tooling developed with Mirel.


TL;DR

A frozen PatchSVAE battery induces a codebook — a small cloud of sign-canonicalized axes on projective space ℝPᴰ⁻¹. The shipped classifier that labels a codebook's "omega phase" reads only its H0 connectivity and discards everything else the topology probe computes, including the H1 loops and H2 voids of the axis cloud.

We turned every discarded quantity into an independently-toggleable contribution signal, fixed a metric-alignment bug (the probe measured persistence on the raw sphere, not the projective space the axes actually live on), and ablated all 15 signals across 41 frozen batteries spanning six model classes and three substrate types.

The headline result survives a dimension control that kills every geometric signal:

At fixed latent dimension, the H2 voids separate symbolic substrates from continuous ones — a distinction the geometry cannot make. Image codebooks are void-sparse; symbolic-vocabulary codebooks are void-rich. Voids are a substrate fingerprint.

We package the finding as omega_phase_v2, a two-axis taxonomy: a cross-dimension regime axis (the geometry) and a within-dimension void-character axis (the topology).


1. Background: what a codebook is

In the geolip-svae framework, a PatchSVAE "battery" is a tiny, frozen per-patch reconstruction codec. On a forward pass it emits a structured latent M of shape (B, N, V, D) — for each of N patches, a V × D frame whose rows live in a D-dimensional singular space.

The codebook is what you get by aggregating M across a calibration set, then collapsing antipodal row pairs (cosine < −0.9) into single representatives. The survivors are unit vectors, sign-canonicalized so that the first non-zero coordinate is positive — i.e. points on the real projective space ℝPᴰ⁻¹, where a vector and its antipode are the same point. This is the aleph convention: a codebook entry is an address (which axis) plus a sign bit (which antipode), and the system's discriminative readout, the AlephAssessor, reads M by that signed-projective address rather than by raw cosine.

A codebook is therefore a small point cloud — typically 20–250 axes — on a low-dimensional projective sphere. Its shape is a fingerprint of what the frozen battery learned to represent.

2. The gap: voids computed, then thrown away

The framework already analyzes that shape. run_topology_analysis runs three probes over the axis cloud: a kNN connectivity sweep, a local-intrinsic-dimension PCA, and — via ripserpersistent homology to dimension two, producing H0 (connected components), H1 (loops), and H2 (voids/cavities).

But the classifier that consumes this, _classify_omega_phase, looks at H0 only: how many components persist, whether they form pairs or clusters, where the giant component appears. The H1 loops and H2 voids are computed at full cost, written into the report JSON — and then read by nothing. No phase label, no telemetry, no decision depends on them.

That is the waste this work addresses. Reconstruction MSE tells you a codec is faithful. The codebook topology tells you how the latent is organized — and the voids are the part of that organization we were paying to compute and choosing to ignore. A void in the axis cloud is a region of representational space the vocabulary encircles but never occupies: a hole in what the battery can place there. The question is whether that hole means anything.

3. Method

3.1 A metric-alignment fix

Before adding any signal, one correction. The codebook axes are projective — compared everywhere else in the system by the folded angle arccos|cos θ| ∈ [0, π/2] on ℝPᴰ⁻¹, against the uniform-packing baseline uniform_projective_angle(D), within the envelope dev_critical(D) = 0.02·√D. But the stock topology probe computes persistence on the raw sphere angle arccos(cos θ) ∈ [0, π]. For sign-canonicalized representatives those are not the same metric, and loops/voids measured on the raw sphere double-count antipodal geometry the system treats as identified.

Every signal below is recomputed on the projective distance d(a,b) = arccos|⟨a,b⟩|, and deviation-type signals are normalized by dev_critical so that |x| > 1 means exactly what the architectural rigidity barrier means: out of envelope. The topology now lives in the same space as the aleph address.

3.2 Fifteen contribution signals

Each discarded quantity becomes a named signal with an explicit formula, the omega/aleph rule it preserves, and its intended use. Grouped:

Geometry / dispersionproj_deviation (mean projective angle minus the uniform baseline), deviation_envelopes (the same in dev_critical units), angular_iqr (inter-quartile spread of pairwise projective angles).

Connectivitypercolation_ratio (the angle at which a giant component forms, relative to the uniform baseline), giant_frac_at_uniform (largest component fraction with the graph thresholded at the uniform baseline rather than an arbitrary degree).

Local geometrylocal_dim_ratio (median participation-ratio dimension of each axis's k-NN offset cloud, divided by D: → 0 if axes lie on a curve, → 1 if they fill the tangent space).

Loops (H1)betti1 (loop count per axis), h1_total_persistence, h1_max_persistence, h1_persistence_entropy.

Voids (H2)betti2 (void count per axis), h2_total_persistence, h2_max_persistence, h2_persistence_entropy. The headline.

Aleph structurepairing_fraction (fraction of rows that collapsed into antipodal pairs — the realization of the sign bit).

A subtle but decisive design point emerged during ablation (§5.1): count- and sum-type signals must be intensive. Raw Betti numbers and total persistence scale with the number of axes, so on a zoo spanning 16 to 256 axes their variance is dominated by codebook size, not structure. All four are normalized per-axis (betti2 / n_axes, total persistence / n_axes); max and entropy are already scale-free.

3.3 Ablation metrics

For each signal across the battery zoo we report three things, because they answer different questions:

  • cv = std / |mean|: scale-free spread. Does the signal vary at all?
  • |ρ| = |Spearman| with reconstruction MSE: does it track quality / detect a broken codebook? (Spearman, not Pearson — recon MSE is heavy-tailed, spanning 1e-7 to 0.9.)
  • η² = one-way ANOVA SS_between / SS_total: variance explained by model class. This is the right tool for "does it separate substrates," since class is nominal and a scalar correlation cannot capture it.

4. The zoo

41 frozen batteries from AbstractPhil/geolip-SVAE, auto-discovered as every folder with a checkpoints/best.pt, spanning:

class n examples latent
byte_trigram 12 byte_trigram_proto, radmag sweep, wikitext D=4, V=32
a-class 6 v14_noise, v15_omega_noise, johanna, alexandria D=16, large
image 8 h2_linear_tiny_imagenet 64/128/256, v11_prod, imagenet mixed D
s-class 4 freckles, fresnel D=4
h2-class 2 h2_64_1channel, h2_64_5channel D=4
sentencepiece 1 sentencepiece_proto_v1 D=4
unknown 8 SVD-mode t1/t2, noise-image variants D=4

Two checkpoints had to be recovered by stripping a torch.compile _orig_mod. key prefix; three model_type='v2' checkpoints are unrecoverable (the architecture was removed) and are excluded.

5. Results

5.1 Validation

On synthetic clouds the signals respond exactly as the geometry demands: a ring in a 2-plane yields betti1 = 1, h1_max_persistence = 0.625, h1_persistence_entropy = 0 — a single dominant loop, no spurious spread; a tight cluster collapses to near-zero persistence with betti2 = 0; a uniform cloud shows many diffuse high-entropy features (the random-cloud noise floor). The deviation envelope reads −0.27 for uniform packing and −26.5 for the degenerate cluster. The machinery measures what it claims to.

5.2 Across dimension, geometry dominates

Ranked by class-separation η² over all 41 batteries:

signal η² abs ρ (recon)
local_dim_ratio 0.672 0.105
giant_frac_at_uniform 0.617 0.224
proj_deviation 0.595 0.148
angular_iqr 0.574 0.385
deviation_envelopes 0.548 0.104
h1_max_persistence 0.424 0.006
betti2 0.337 0.471
h2_max_persistence 0.240 0.269
h2_persistence_entropy 0.187 0.267

The geometric signals lead decisively. But the per-class means expose what they actually separate:

class local_dim_ratio giant_frac angular_iqr
a-class 0.104 0.424 0.077
image 0.322 0.974 0.330
byte_trigram 0.614 1.000 0.529
s-class 0.455 1.000 0.664
h2-class 0.577 1.000 0.561

The separation is almost entirely a-class versus everything else — the large, D=16, frequently omega-noise-trained models against the small D=4 substrates, which bunch together. This is a dimension × training-health axis, not a content axis. giant_frac in particular is isolating the omega-noise subset (those four clouds fail to percolate at the uniform baseline; everyone else's percolates). The D=4 substrates that share a dimension — byte_trigram, s-class, h2-class, sentencepiece — sit nearly on top of each other on every geometric signal.

And the high recon-|ρ| of betti2 (0.47) is the same story: the most void-saturated clouds (v11_prod at 10.2 voids/axis, v19_fresnel_tiny at 10.4, v14_noise at 8.25) are also the higher-MSE ones. Void density is a real noise-like-codebook detector, but it is class-confounded with recon, not a clean quality law.

5.3 The control: hold dimension fixed

Recon MSE is the wrong target for the structural question — a well-trained a-class model and a well-trained byte_trigram model both reconstruct fine but are topologically unrelated. So we re-ran the η² ablation within the D=4 cohort only (n_axes < 100), which removes the dimension axis. The ranking inverts:

signal η² cross-D η² within-D=4
local_dim_ratio 0.671 0.268
giant_frac_at_uniform 0.615 0.309
angular_iqr 0.574 0.243
h2_persistence_entropy 0.182 0.452
h1_total_persistence 0.083 0.447
h2_max_persistence 0.191 0.418
betti2 (void density) 0.335 0.395
proj_deviation / dev_env 0.60 0.62

Every geometric signal collapses — confirming they were dimension proxies. The void and loop signals rise to the top. The only geometric survivor is deviation, and it survives for a concrete reason: s-class codebooks are distinctively under-dispersed (negative envelopes) while image codebooks are over-dispersed.

5.4 The finding: voids are a substrate fingerprint

Within fixed dimension, the per-class void means say it plainly:

class h2_persistence_entropy h2_max_persistence betti2 / axis
image 0.313 0.020 0.069
s-class 0.649 0.046 0.222
h2-class 0.815 0.080 0.261
byte_trigram 0.905 0.080 0.402
sentencepiece 0.916 0.065 0.346

At the same latent dimension and the same "space-filling" regime, the image codebooks are void-sparse and the symbolic-vocabulary codebooks are void-rich. A symbolic/combinatorial vocabulary builds an axis cloud that encircles topological cavities — there are configurations of representation the vocabulary structures around but never lands on. A continuous-image vocabulary builds a smoother, void-poor cloud. This distinction is invisible to every geometric signal (they are flat across the D=4 substrates) and only the persistent homology recovers it. That is the answer to what the voids are for.

5.5 Refinement: readout or content?

Two recovered SVD-mode batteries (t1_ps4_d4_v32_h128_svd and its sm32 sibling) sharpen the claim. They are the most void-structured small-D models in the entire zooh2_max_persistence ≈ 0.141, the highest H2 persistence anywhere, above every symbolic substrate — and omega_phase_v2 places them in void_structured alongside byte_trigram, despite their not being symbolic-text models. The clean split is therefore one diagonal of a 2×2:

continuous / image symbolic
linear readout void-sparse (h2_linear_tiny, h2_max 0.02–0.045) void-structured (byte_trigram, sentencepiece)
SVD readout sparse-ish (h2_h64…svd, h2_max 0.034)
SVD, unknown content void-structured, strongest (t1, h2_max 0.141)

Void-sparsity is specifically the linear-readout-on-continuous corner; void-structure arises from symbolic content and/or an SVD readout. Whether the t1 voids are a readout signature or a content signature hinges on one fact — t1's training dataset — and is left open here. The honest current claim is the conditional one: at fixed dimension, the H2 voids distinguish void-sparse continuous-linear codebooks from the void-structured rest.

6. omega_phase_v2: a two-axis taxonomy

The ablation says the signal is not one taxonomy but two orthogonal axes plus a dispersion qualifier, so the classifier is built that way rather than as a flat label:

  • regime (cross-dimension geometry: local_dim_ratio, giant_frac, angular_iqr) → collapsed_fragmented · concentrated · space_filling · transitional. The dominant axis across dimensions; collapses within a fixed one.
  • void_character (within-dimension substrate signal: betti2-density + h2_persistence_entropy) → void_sparse (continuous/image) · void_structured (symbolic and/or SVD) · void_saturated (noise-like) · void_mixed. The signal the geometry cannot see.
  • dispersion (the dev_critical envelope) → under_dispersed (s-class) · in_envelope · over_dispersed.

Validated against the real battery values, the labels land where the structure dictates:

battery regime dispersion void_character
byte_trigram_proto_128 space_filling in_envelope void_structured
sentencepiece_proto space_filling in_envelope void_structured
h2_linear_tiny_imagenet_64/128/256 space_filling in/over void_sparse
v40_freckles (s-class) space_filling under_dispersed void_sparse
v15_omega_noise collapsed_fragmented over_dispersed void_structured
v11_prod / v14_noise concentrated in_envelope void_saturated
t1_…_svd space_filling in_envelope void_structured

Thresholds are empirical from this zoo and tunable; omega_phase_v2 is descriptive telemetry, deliberately not a loss and not a proof.

7. Limitations

This is an honest map of a small, heterogeneous zoo, not a controlled study.

  • Small per-class n. η² is biased upward for tiny groups; sentencepiece (n=1) and h2-class (n=2) are unreliable. The robust comparisons are byte_trigram (12), image (8), a-class (6), s-class (4). The image void-sparsity holds because it is consistent across all three D=4 image batteries, not because n is large.
  • Class confounds dimension and training health. The cross-dimension separation is mostly a dimension/training axis; the substrate claim is made only within fixed dimension.
  • Readout vs content is unresolved. The t1 SVD result shows void-structure is not exclusively symbolic; the disambiguating experiment (cross the full 2×2) is future work.
  • Descriptive, not generative. None of this is yet a training signal. The codebook extraction (antipodal collapse, ripser) is non-differentiable; these are measurement and classification signals.

8. Conclusion

The voids in a frozen codebook were being computed and discarded. Measured in the correct projective metric and normalized to be scale-free, they turn out to carry a substrate fingerprint that no geometric measurement of the same cloud can recover: at fixed latent dimension, symbolic vocabularies build void-rich codebooks and continuous-image vocabularies build void-sparse ones. The result is robust to a dimension control that flattens every competing signal, and it is operationalized in a two-axis phase taxonomy that cleanly separates codebook regime (a geometric, cross-dimension property) from void character (a topological, within-dimension substrate property).

The broader point: a frozen ~57k-parameter codec is not just a faithful reconstructor. Its codebook has a topology, and that topology remembers what kind of thing it learned to encode. We were already paying to measure it. Now we read it.

Appendix: signal definitions

All distances projective: d(a,b) = arccos|⟨a,b⟩| ∈ [0, π/2]; axes sign-canonicalized; persistences normalized by π/2; counts and sums normalized per-axis.

signal formula rule preserved
proj_deviation mean duniform_projective_angle(D) projective baseline
deviation_envelopes proj_deviation / (0.02·√D) dev_critical / rigidity barrier
angular_iqr p75 − p25 of pairwise d projective spread
percolation_ratio θ_giant(proj) / uniform_projective_angle(D) baseline-relative connection
giant_frac_at_uniform largest component / n at θ = uniform baseline-pinned threshold
local_dim_ratio median k-NN participation-ratio dim / D tangent occupancy
betti1 (finite H1 features) / n_axes intensive loop count
h1_{total,max}_persistence Σ or max (death−birth) over H1 / (π/2) [/n for total] projective persistence
h1_persistence_entropy normalized Shannon entropy of H1 persistences spectrum regularity
betti2 (finite H2 features) / n_axes intensive void density
h2_{total,max}_persistence Σ or max (death−birth) over H2 / (π/2) [/n for total] projective void strength
h2_persistence_entropy normalized Shannon entropy of H2 persistences void-spectrum regularity
pairing_fraction n_pairs / (n_pairs + n_unpaired) aleph ± bit (cos < −0.9 collapse)

Scripts

Colab Notebook Here

Scripts

Install

github https://github.com/AbstractEyes/geolip-svae
pypackage ripser

Tooling: codebook_contributions.py (signals, omega_phase_v2, ablation harness) and battery_ablation.py (cross-battery driver). numpy + scipy + ripser; torch-free analysis layer over the frozen geolip-svae batteries.

Community

Sign up or log in to comment