geolip-svae-batteries

The Johanna Formula Subset is here: https://huggingface.co/AbstractPhil/geolip-svae-batteries/blob/main/JOHANNA_FORMULA_CATALOG.md

F-class SVAE batteries: an experimental nursery for miniature noise-reconstruction autoencoders.

This repository hosts a sweep of deliberately-small SVAE models. Most of them will collapse. That's the point.


Context: what is an SVAE battery?

An SVAE (Spectral Variational AutoEncoder) is an image-reconstruction model whose internal representation is an SVD-decomposed matrix per patch. Each patch is encoded into a matrix M ∈ ℝ^(V×D) with rows on the unit hypersphere, then SVD-decomposed into (U, S, Vᵀ). The model reconstructs by passing M̂ = U·diag(S)·Vᵀ back through a decoder.

The singular values S are called omega tokens, a coordinate-system-free representation that, at the right scale, exhibits a universal attractor: S₀ ≈ 5.1, effective rank ≈ 15.88, Cayley-Menger pentachoron CV in the 0.13–0.30 band. These constants hold across 48+ measurements on different image distributions.
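
As a minimal sketch of what these quantities are, the omega tokens and two of the attractor diagnostics can be computed from a single patch matrix as below. This assumes NumPy and the standard entropy-based effective rank; the Cayley-Menger pentachoron CV is project-specific and not reproduced here.

import numpy as np

def omega_tokens(M: np.ndarray):
    """Sphere-normalize the rows of a (V, D) patch matrix, SVD it in fp64,
    and return the singular values (omega tokens) plus basic diagnostics."""
    M = M / np.linalg.norm(M, axis=1, keepdims=True)            # rows on the unit hypersphere
    S = np.linalg.svd(M.astype(np.float64), compute_uv=False)   # singular values, descending
    p = S / S.sum()                                             # spectrum as a distribution
    erank = float(np.exp(-(p * np.log(p + 1e-12)).sum()))       # entropy-based effective rank
    return S, float(S[0]), erank                                # omega tokens, S0, erank

S, S0, erank = omega_tokens(np.random.randn(256, 16))           # A-class shape: V=256, D=16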

See the paper on HuggingFace for full theory: Omega Tokens paper.

A battery is an SVAE that has achieved the attractor: its omega tokens are consistently structured, reconstruction is reversible, the internal geometry is legible to downstream wrappers. Batteries can be chained, stacked, and channeled.


Class taxonomy

The SVAE lineage breaks into three classes, defined by behavior rather than architecture:

A-class (Johanna, Fresnel): the workhorses

  • 17M parameters, V=256, D=16, patch=16, hidden=768, depth=4
  • Teachable downstream. Omega attractor emerges cleanly. Can be used as a frozen encoder for classification, retrieval, diffusion conditioning.
  • Cost: ~30 GB VRAM, ~4 hr/epoch on H100. Physically impossible to stack multiple instances on consumer hardware.
  • Johanna trains at S=128, 1.28M noise samples × 16 noise types per epoch. Reaches MSE 0.029 on pure Gaussian noise reconstruction.

S-class (Freckles): the crown that can't be worn alone

  • 2.5M parameters, per-patch internals V=48, D=4, patch_size=4, hidden=384, depth=4. These stay fixed regardless of resolution.
  • The distinctive property is patch count. At 64×64 that's 256 patches; at 256×256 it's 4,096 patches ("Freckles-4096"); at 512×512 it would be 16,384 (see the sketch after this list). All per-patch weights (encoder, decoder, cross-attention QKV, alpha) are dimensioned by D=4, not by N, so adding patches is essentially free. The architecture is resolution-complete by construction: v40 (64×64) transferred to 256×256 in a single fine-tuning epoch with the per-patch spectrum nearly unchanged (S₀ within 0.4%, erank identical).
  • Superior reconstruction fidelity at every resolution tested. Beats A-class on pure recon metrics.
  • But: "too disorderly stored." A single Freckles instance cannot teach downstream β€” attempting to use it as a frozen encoder fails to converge. The full Freckles array is required for anything beyond self-reconstruction.
  • S-class carries omega tokens internally but they're not legible to a wrapper without the full infrastructure.
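
The patch-count arithmetic behind the resolution claims in the list above, as a quick sanity check (patch_count is an illustrative helper, not a function from the training code):

def patch_count(img_size: int, patch_size: int = 4) -> int:
    """Patches per square image at the S-class patch_size of 4."""
    return (img_size // patch_size) ** 2

assert patch_count(64) == 256       # Freckles at 64x64
assert patch_count(256) == 4096     # "Freckles-4096"
assert patch_count(512) == 16384    # projected 512x512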

F-class (this repo): the experimental nursery

  • 2K–645K parameters per model. 30× to 8000× smaller than A-class.
  • Designed to fail. Most F-class configurations will collapse, oscillate, or produce illegible geometry. This is not a bug in the sweep; it's the sweep's purpose.
  • Research question: what is the minimum viable architecture at which any legible battery behavior emerges? Can small batteries be stacked where large batteries cannot?

F-class is an experimental nursery. We cast a wide net, let most configurations collapse, and examine the survivors (if any) to understand the boundary conditions.


Methodology

This project uses an unusual training methodology that's worth explaining before reading the code or results.

There are no boundaries, only physics

The model isn't clipped or forced anywhere. The training setup provides a playground with physics:

  • MSE loss = gravity. Forces reconstruction to be reversible. Whatever arrangement the model finds must map back to the input.
  • Soft-hand = friction. Gentle substrate conditioning that rewards geometric coherence without targeting any specific state.
  • Sphere-norm + fp64 SVD = coordinate discipline. Signposts that make movement lossless and measurable. Not walls.
  • Cross-attention with bounded α = spectral coordination. Lets omega tokens communicate across patches without dominating.

There are no walls. No clipping-as-control. No forced convergence targets. The model finds its own home in this playground; the job of the training setup is to provide the physics.

Soft-hand guides against CV-EMA, not against CV

The central innovation of this sweep is how the soft-hand operates.

Traditional regularization targets a specific value: "drive CV toward 0.25." This forces the model toward an arrangement that has no structural reason to exist at F-class scale: CV's meaning at small D differs from its meaning in the A-class omega-attractor regime.

We do not target CV. We track CV's own exponential moving average (cv_ema) and reward the model for being geometrically coherent relative to its own trajectory:

cv_ema = (1 - alpha) * cv_ema + alpha * current_cv     # alpha = 0.01, slow
sigma = cv_sigma_scale * cv_ema                        # scale-adaptive width
prox = exp(-(current_cv - cv_ema)**2 / (2 * sigma**2))
loss = (1 + boost * prox) * recon_loss                 # NO penalty term

When the model's CV tracks its own trend smoothly, proximity → 1 and the recon gradient is boosted (the soft-hand "rewards" the arrangement). When CV deviates chaotically from its trend, proximity → 0 and the recon gradient is unboosted (no punishment, just no reward).

There is no penalty term. The model is never pushed toward any specific state. We only reward geometric legibility.
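
As a self-contained illustration, the update above can be written as a single step function. This is a sketch assuming PyTorch; soft_hand_step and the default values for cv_sigma_scale and boost are illustrative, not the sweep's actual hyperparameters.

import math
import torch

def soft_hand_step(recon_loss: torch.Tensor, current_cv: float, cv_ema: float,
                   alpha: float = 0.01, cv_sigma_scale: float = 0.1, boost: float = 0.5):
    """One soft-hand step: track CV against its own slow EMA and boost the recon
    gradient when it stays close. Reward only; there is no penalty branch."""
    cv_ema = (1 - alpha) * cv_ema + alpha * current_cv    # slow EMA of the model's own CV
    sigma = cv_sigma_scale * cv_ema                       # scale-adaptive width
    prox = math.exp(-(current_cv - cv_ema) ** 2 / (2 * sigma ** 2 + 1e-12))
    loss = (1 + boost * prox) * recon_loss                # boosted recon loss, nothing subtracted
    return loss, cv_ema, prox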

Discharges are orbital perturbations, not failures

Gradient spikes and sudden loss excursions in SVD-learning architectures (Phil calls them "discharges") are a known phenomenon. Previous architectures treated them as failure modes to be eliminated via clipping, damping, or early stopping.

Here, we treat them as orbital perturbations. The model orbits around some arrangement it has found; occasional perturbations are the natural dynamics of that orbit. A battery at its edge exhibits perturbations on the way to stability. Clipping them out prevents the model from settling.

We log discharges (max_grad, prox, recon_w excursions) but don't prevent them. Runs complete regardless of how chaotic their trajectory looks. Collapse isn't death; it's a data point.

Collapse is a data point

F-class collapse manifests as:

  • test_mse stuck near or above the noise baseline (≈1.0 for unit-variance inputs)
  • erank collapsing to 1 or pegging at D (no differentiation of singular values)
  • cv_ema jittering without a stable trajectory (prox low throughout)
  • alpha_mean collapsing to 0 or saturating at max_alpha

Collapsed runs are not killed early. They complete their full training budget, their full history is preserved, and they are uploaded alongside surviving runs for post-hoc analysis. Understanding how a configuration fails is as useful as understanding which ones succeed.
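
The collapse signatures listed above can be checked mechanically from a run's final_report.json. A rough sketch, assuming the metric names above are top-level JSON keys and using thresholds that are illustrative rather than the project's own:

def looks_collapsed(report: dict, D: int, noise_baseline: float = 1.0) -> bool:
    """Heuristic collapse check over final_report.json fields (illustrative thresholds)."""
    mse_stuck = report["best_test_mse"] >= 0.95 * noise_baseline    # never beat the noise floor
    erank_flat = report["final_erank"] <= 1.05                      # spectrum collapsed to rank 1
    erank_pegged = report["final_erank"] >= 0.99 * D                # spectrum never differentiated
    alpha_dead = report["final_alpha_mean"] <= 1e-3                 # cross-attention switched off
    return mse_stuck or erank_flat or erank_pegged or alpha_dead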


Sweep design

The sweep contains 19 configurations across 7 tiers. All configurations use:

  • Pure Adam optimizer (no weight decay, per lineage convention)
  • fp64 SVD throughout (Gram + eigh + fp64 floors, never fp32 in the decomposition pipeline; sketched after this list)
  • Cosine LR schedule
  • 1.28M noise samples × 16 noise types per epoch, 30 epochs
  • 2 alignment epochs (pure MSE) before soft-hand activates
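
A minimal sketch of the fp64 decomposition path named in the list above (Gram + eigh + fp64 floors), assuming PyTorch; the floor value is illustrative and only the singular values are recovered here.

import torch

def singular_values_fp64(M: torch.Tensor, floor: float = 1e-12) -> torch.Tensor:
    """Singular values via the Gram matrix + eigh, entirely in float64."""
    M64 = M.double()                        # never fp32 inside the decomposition
    gram = M64.T @ M64                      # (D, D) Gram matrix
    evals = torch.linalg.eigvalsh(gram)     # ascending eigenvalues, fp64
    evals = torch.clamp(evals, min=floor)   # fp64 floor before the sqrt
    return evals.sqrt().flip(0)             # singular values, descending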

Tier 1: D=16 spine at small scale

Does the universal attractor (validated at D=16) survive substrate shrinkage?

Tier 2: D-sweep at matched substrate

D ∈ {8, 4, 2} at identical (V, hidden, depth, patch). Does battery behavior survive D < 16? D=2 is below the CM dimension floor and likely collapses; whether D=4 and D=8 survive is an open question that A-class never tested.

Tier 3: Substrate axis test

At (V=64, D=8), we compare wider substrate (hidden=128), deeper substrate (depth=2), and starved substrate (hidden=32). Which axis carries the self-assembly work?

Tier 4: Big patchworks, tiny internals

256 patches per image, small per-cell substrate. Tests "many weak cells" as an alternative to "few strong cells."

Tier 5: Small patchworks, stronger cells

4 patches per image, larger per-cell substrate. Inverse question.

Tier 6: Unusual shapes

V=256, D=2 (extreme V/D ratio); V=D=16 (square); and V=8, D=16 (wide). These are chaos measurements; most are expected to produce illegible geometry.

Tier 7: Extreme smallness

10K and 2K parameter models. This tier exists to establish what total collapse looks like, as a reference point.


Repository structure

geolip-svae-batteries/
├── README.md                              # this file
├── sweep_summary.json                     # combined results from all completed runs
└── johanna-F-S{S}-V{V}-D{D}-h{H}-d{d}-p{P}/
    ├── config.json                        # full RunConfig snapshot
    ├── final_report.json                  # metrics history, final geometric state
    ├── checkpoints/
    │   ├── best.pt                        # lowest test MSE checkpoint
    │   └── epoch_NNNN.pt                  # periodic saves
    └── tensorboard/
        └── events.out.tfevents.*          # full TB trace

Run naming convention

johanna-F-S{img_size}-V{matrix_v}-D{D}-h{hidden}-d{depth}-p{patch_size}

Example: johanna-F-S64-V64-D8-h64-d1-p16 = image size 64, matrix V=64, spectral dim D=8, hidden 64, depth 1, patch size 16.
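
A small helper for turning a run directory name back into its configuration fields, assuming the convention above (parse_run_name is illustrative and not part of the repo):

import re

RUN_NAME = re.compile(
    r"johanna-F-S(?P<img_size>\d+)-V(?P<matrix_v>\d+)-D(?P<D>\d+)"
    r"-h(?P<hidden>\d+)-d(?P<depth>\d+)-p(?P<patch_size>\d+)"
)

def parse_run_name(name: str) -> dict:
    """Map an F-class run directory name to its integer config fields."""
    m = RUN_NAME.fullmatch(name)
    if m is None:
        raise ValueError(f"not an F-class run name: {name}")
    return {k: int(v) for k, v in m.groupdict().items()}

parse_run_name("johanna-F-S64-V64-D8-h64-d1-p16")
# {'img_size': 64, 'matrix_v': 64, 'D': 8, 'hidden': 64, 'depth': 1, 'patch_size': 16}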

Metrics captured in final_report.json

Reconstruction:

  • best_test_mse: minimum test MSE achieved
  • final_epoch_mse: test MSE at the final epoch
  • final_recon_ema_obs: observable EMA of the training recon loss

Geometry:

  • final_S0, final_SD, final_ratio: first and last singular values, and their ratio
  • final_erank: effective rank of the singular values
  • final_row_cv, final_cv_ema: CV and its guidance EMA
  • final_S_delta: mean absolute change in S after cross-attention
  • final_alpha_mean, final_alpha_std: cross-attention spectral coordination strength

Full history (history): recorded every report_every batches; all of the above plus a phase tag (ALGN during alignment, HAND after) and the per-report max gradient.
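
To pull a single run's metrics without the readout script, a plain huggingface_hub download is enough. A sketch assuming the metric names above are top-level JSON keys; the run name below is the example from the naming convention and may not correspond to an uploaded run.

import json
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="AbstractPhil/geolip-svae-batteries",
    filename="johanna-F-S64-V64-D8-h64-d1-p16/final_report.json",  # example name; may not exist
)
with open(path) as f:
    report = json.load(f)
print(report["best_test_mse"], report["final_erank"], report["final_cv_ema"])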


Reading the results

We avoid posting per-run numerical tables in text form because most runs will look similar-but-degenerate, and the signal is in the shape of the trajectories, not individual numbers.

Load the provided readout script to pool results locally:

# in Colab
exec(open('johanna_F_readout.py').read())
report = readout("AbstractPhil/geolip-svae-batteries")

# Compact survival table: one line per run
survival_table(report)

# Full geometric metrics per run
detailed_table(report)

# Trajectory plot (top 12 runs by best_test_mse)
plot_trajectories(report, metric='test_mse')
plot_trajectories(report, metric='cv_ema')

What "success" means here

A surviving F-class run is not one that beats Johanna. It is one that:

  1. Reconstructs meaningfully: best_test_mse drops below the noise baseline (< 0.95 for unit-variance inputs).
  2. Produces legible geometry: cv_ema stabilizes into a narrow trajectory (prox stays high during the HAND phase); erank differentiates singular values without collapsing or saturating.
  3. Doesn't thrash: the gradient trajectory is bounded; discharges exist but don't dominate.

A surviving F-class run is a proof-of-concept that battery behavior can occur at miniature scale: the kind of behavior that could eventually be stacked into a multi-cell noise solver without the A-class coal-plant cost.

The sweep is not expected to produce many survivors. It is expected to produce a map of where the boundary of viability sits.


Related work

  • Omega Tokens paper: theoretical foundation, A-class results, universal attractor constants
  • geolip-deep-embedding-analysis: CM-CV dimension sweep (65,536 configs), reference for CV band bounds
  • procrustes-analysis: cross-modal alignment study (17 models), source of the binding constant 0.29154
  • AbstractPhil/geolip-SVAE: Johanna (A-class) and Freckles (S-class) training runs

Citation & contact

This is independent AI research. The methodology here, "no boundaries, just physics," is unconventional. If a configuration appears to violate standard ML intuitions (no gradient clipping below certain thresholds, no early stopping on collapse, no CV-targeting), that is by design.

The batteries find their own homes.
