# Potential Downstream Utilities Clause
**Status:** Forward-looking. Each utility takes the Omega substrate as a
load-bearing assumption — regime-independence of reconstruction quality
across input scale, the projective-axis codebook as a deterministic
property of trained sphere-solvers, and hardware-determined throughput
limits independent of model behavior. Utilities that would work
equivalently on any encoder are excluded; this is a list of capabilities
that are *enabled* by Omega, not capabilities incidentally compatible
with it.
**Methodology.** Per the post-000108 research stage, every utility
section ends with a falsifiable prediction — what would have to be true
for the utility to NOT work. Construction precedes proof. The first
build that fails its prediction tells us where the substrate's
boundary actually is.
---
## 1. Classification
**The utility.** A projective codebook of `n_axes` directions on
ℝP^(D-1) is a vocabulary of feature primitives. Image → patch grid → M
tensor → per-patch projection onto codebook axes → activation pattern
of shape `[B, n_patches, V, n_axes]`. A linear or shallow head over
this representation performs classification.
**Why Omega.** The codebook is model-intrinsic and regime-flat. A
classifier trained on activation patterns at 64×64 should generalize
to 512×512 inputs at inference without retraining, because the
codebook itself doesn't change with input size. Standard CLIP-style
models do not give this property — their representations drift with
input resolution; their pooling operations bake in a particular spatial
extent.
**Specific construction.** Train classifier head on per-patch axis
activations averaged across patches (or attended-over). For
fine-grained tasks, retain the spatial structure: classifier sees the
full `[n_patches, n_axes]` matrix as a 2D feature map. Per-patch
aggregation was already validated in scratchpad 000104: `patch_idx=0` fails
because it discards spatial signal, while patch-mean recovers most of the
gap.
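
The patch-mean construction above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the project's implementation: the `[B, n_patches, V, D]` layout of the M tensor, the use of absolute cosine as the activation, and the function names `axis_activations` and `pooled_features` are all assumptions.

```python
import numpy as np

def axis_activations(m_rows, codebook):
    """Project M-tensor rows onto codebook axes.

    m_rows:   [B, n_patches, V, D] array (assumed layout of the M tensor)
    codebook: [n_axes, D] array of axis directions
    returns:  [B, n_patches, V, n_axes] absolute cosine activations
    """
    rows = m_rows / np.linalg.norm(m_rows, axis=-1, keepdims=True)
    axes = codebook / np.linalg.norm(codebook, axis=-1, keepdims=True)
    # abs() makes the activation sign-invariant, as expected for projective axes
    return np.abs(rows @ axes.T)

def pooled_features(acts):
    """Average over patches and rows: [B, n_axes], independent of patch count."""
    return acts.mean(axis=(1, 2))

rng = np.random.default_rng(0)
codebook = rng.normal(size=(32, 16))
small = rng.normal(size=(2, 16, 4, 16))     # stand-in for a small input (16 patches)
large = rng.normal(size=(2, 1024, 4, 16))   # stand-in for a large input (1024 patches)
f_small = pooled_features(axis_activations(small, codebook))
f_large = pooled_features(axis_activations(large, codebook))
```

Both feature matrices have shape `[B, n_axes]`, so the same linear head can consume either one. Whether the activation statistics actually match across resolutions is exactly what the prediction for this section tests.
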
**Falsifiable prediction.** A classifier trained on 64×64 activation
patterns achieves comparable accuracy on 512×512 test inputs (within
2 percentage points) without any architectural adaptation. If accuracy
drops sharply with input resolution, the codebook activations are not
in fact regime-invariant in the way reconstruction is, and Omega
covers reconstruction but not classification — a meaningful boundary.
---
## 2. Diffusion
**The utility.** Discrete diffusion in axis-index space. Each patch's
M-tensor row gets quantized to its nearest codebook axis (or top-k
mixture). The "noise" process is gradual randomization of axis
assignments; the "denoise" process is a transformer that predicts
axis indices from corrupted sequences. Sampling = run denoiser to
clean axis sequence → reconstruct image via codebook → decoder.
**Why Omega.** Three properties combine here. The codebook is a
finite, deterministic vocabulary, so discrete diffusion is well-defined
without extra quantizer training. The decoder is regime-flat, so a
diffusion model trained on 64×64 axis sequences can sample at any
resolution by predicting longer sequences and decoding at the target
size. The codebook's projective structure means antipodal axes carry
equivalent information, which meaningfully reduces the effective
vocabulary size for the diffusion target.
**Specific construction.** Diffusion target: `[n_patches, top_k]`
discrete indices into codebook. Loss: cross-entropy over axis indices.
Backbone: any transformer that handles variable-length token sequences
(patch count varies with target resolution). Conditioning: optional
class label or text embedding via cross-attention.
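
As a rough sketch of the diffusion target and forward process, assuming a uniform-resampling corruption (the source does not specify the noise schedule) and hypothetical helper names `quantize_top_k` and `corrupt`:

```python
import numpy as np

def unit(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def quantize_top_k(rows, codebook, k=1):
    """rows [n_patches, D], codebook [n_axes, D] -> [n_patches, k] axis indices."""
    # abs() scores a direction and its antipode identically
    scores = np.abs(unit(rows) @ unit(codebook).T)
    return np.argsort(-scores, axis=-1)[:, :k]

def corrupt(indices, t, n_axes, rng):
    """Assumed forward process: resample each index uniformly with probability t."""
    mask = rng.random(indices.shape) < t
    noise = rng.integers(0, n_axes, size=indices.shape)
    return np.where(mask, noise, indices), mask

rng = np.random.default_rng(0)
codebook = unit(rng.normal(size=(32, 16)))
rows = -2.0 * codebook[[3, 7]]                 # antipodal, rescaled copies of axes 3 and 7
clean = quantize_top_k(rows, codebook, k=2)    # shape [2, 2]
noisy, mask = corrupt(clean, t=0.5, n_axes=32, rng=rng)
```

A denoiser would take `noisy` as input and be trained with cross-entropy against `clean`, for example only at the positions where `mask` is true.
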
**Falsifiable prediction.** A diffusion model trained on 64×64 axis
sequences from h2-64 produces coherent samples at 256×256 by sampling
longer sequences and decoding at the target size, without retraining.
If samples at non-native resolution show mode collapse or boundary
artifacts beyond what the encoder-decoder pair produces directly,
the codebook's discreteness is interfering with the regime-flat
reconstruction — narrower than expected.
---
## 3. Processing (image-to-image edits in axis space)
**The utility.** Operations applied to codebook activations rather
than pixels. Image → encode → edit activations → decode. Style
transfer, denoising, inpainting, semantic editing all become
manipulations of the `[n_patches, V, n_axes]` activation tensor,
followed by reconstruction.
**Why Omega.** Edits made at one resolution are coherent when decoded
at another, because the codebook is the same vocabulary at every
scale. A 64×64 inpaint mask can produce a 512×512 inpainted output by
upsampling the edited activations and decoding at the target size.
Critically, the activation edits respect the geometric constraints
that produced the codebook — operations that move activations *off*
the codebook produce reconstruction artifacts that are themselves a
useful signal.
**Specific construction.** Define edit operations as activation-tensor
transformations: zero-out (denoise), substitute axis-set (style
transfer), spatial-gather + redistribute (inpaint), interpolate
between two images' activations (semantic morph). Provide a
`process_at_scale` API mirroring `reconstruct_at_scale`.
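
One possible shape for such an API, sketched under several assumptions: activations are stored on an `[H, W, V, n_axes]` patch grid, upsampling is nearest-neighbour, and the decode step is left to the existing decoder. The signature of `process_at_scale` shown here is hypothetical, since the source names the function but does not define it.

```python
import numpy as np

def morph(acts_a, acts_b, alpha):
    """Semantic morph: linear interpolation between two activation tensors."""
    return (1.0 - alpha) * acts_a + alpha * acts_b

def keep_top_k(acts, k):
    """Zero-out edit: keep only the k strongest axes in each row."""
    thresh = np.sort(acts, axis=-1)[..., -k][..., None]
    return np.where(acts >= thresh, acts, 0.0)

def upsample_grid(acts, factor):
    """Nearest-neighbour upsampling of the patch grid (first two axes)."""
    return acts.repeat(factor, axis=0).repeat(factor, axis=1)

def process_at_scale(acts, edit, factor):
    """Apply an edit at the source scale, then upsample for decoding."""
    return upsample_grid(edit(acts), factor)

rng = np.random.default_rng(0)
acts_a = rng.random((8, 8, 4, 32))
acts_b = rng.random((8, 8, 4, 32))
edited = process_at_scale(
    acts_a, lambda a: keep_top_k(morph(a, acts_b, 0.5), k=4), factor=8
)
```

The result has an 8x larger patch grid and would be handed to the decoder at the target size. Nearest-neighbour upsampling is the simplest choice, and the prediction for this section is a test of whether it preserves enough geometric structure.
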
**Falsifiable prediction.** Style transfer applied to 64×64
activations and decoded at 512×512 produces output indistinguishable
in style consistency from the same operation applied directly to a
512×512 encoding. If the upsampled-edit path produces worse style
transfer than the direct-encode path, the activation upsampling is
losing geometric structure that the encoder captures — and Omega's
regime-flatness has a stricter envelope than reconstruction MSE
alone reveals.
---
## 4. Solving
**The utility.** The most direct framing: use the trained sphere-solver
to solve geometric problems on its native manifold. Given a set of
points in ℝ^D, encode them via the model's projection path to get
their representation on ℝP^(D-1). Given a set of vectors, solve for
the codebook axes that span them. Given two sets of points, find the
optimal projective alignment via Procrustes on their codebooks.
**Why Omega.** This is the closest utility to the model's identity
claim. The model is named "sphere-solver" because that's what it is —
a parametric solver for "what's the best projective representation of
this data on the unit sphere?" The Omega finding is that this solver
is regime-independent: the same machinery handles 64 input points or
65,536 input points and produces structurally consistent answers.
**Specific construction.** Expose three solver primitives:
- `project(points, model) → axes`: encode arbitrary point clouds via
the model's encoder to get their codebook representation
- `align(codebook_a, codebook_b) → rotation`: Procrustes-align two
codebooks (already implemented in tests/framework.py)
- `solve_basis(target_vectors, model) → axis_indices`: given target
vectors, find the codebook axes that best span them
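
A textbook sketch of two of these primitives follows. It is not the implementation in `tests/framework.py`, which may differ. `align` uses the standard SVD solution to the orthogonal Procrustes problem and assumes the two codebooks list corresponding axes in the same order with consistent signs. `solve_basis` uses a simple projection-energy heuristic, and its `k` parameter is an added assumption.

```python
import numpy as np

def align(codebook_a, codebook_b):
    """Orthogonal Procrustes: R minimizing ||codebook_a @ R - codebook_b||_F."""
    u, _, vt = np.linalg.svd(codebook_a.T @ codebook_b)
    return u @ vt

def solve_basis(target_vectors, codebook, k):
    """Heuristic: indices of the k axes with the most squared projection energy."""
    energy = ((target_vectors @ codebook.T) ** 2).sum(axis=0)
    return np.sort(np.argsort(-energy)[:k])

rng = np.random.default_rng(0)
codebook_a = rng.normal(size=(32, 8))
codebook_a /= np.linalg.norm(codebook_a, axis=1, keepdims=True)
q, _ = np.linalg.qr(rng.normal(size=(8, 8)))    # random orthogonal matrix
codebook_b = codebook_a @ q                     # rotated copy of codebook_a
rotation = align(codebook_a, codebook_b)
residual = np.linalg.norm(codebook_a @ rotation - codebook_b)
```

In this synthetic case the recovered `rotation` matches `q` and the residual is at floating-point level. Real codebooks would first need axis matching and sign resolution, which this sketch does not attempt.
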
**Falsifiable prediction.** Procrustes alignment between codebooks of
the same model on different calibration distributions yields a
rotation distance below 0.1 (already verified at U5 — calibration
deviations differ by ~0.003). Cross-model alignment between two
sphere-solvers trained on the same data yields a rotation distance
below 0.3 (predicted, not yet measured). If cross-model alignment
turns out to be no better than a random orthogonal rotation, codebook
structure is data-driven, not architecture-driven, and the solver's
"intrinsic" status is overstated.
---
## 5. Distillation
Two directions, distinct enough to enumerate separately.
### 5a. Distillation INTO sphere-solvers
**The utility.** Train a sphere-solver student to match a non-Omega
teacher's representations. Student inherits regime-flatness
automatically; teacher's representational quality flows into a
deployable encoder that handles arbitrary resolution without extra
machinery.
**Why Omega.** Standard distillation produces a student whose
behavior interpolates the teacher's at training scale. A
sphere-solver student, by virtue of its architecture, additionally
inherits regime-flatness — the student behaves consistently at
inference scales the teacher was never tested on. This is a
distillation result that wouldn't follow from teacher quality alone.
**Specific construction.** Loss combines reconstruction (the
sphere-solver's native objective) with representation matching
against the teacher's pooled features at intermediate resolution.
Student emerges with both teacher-like representations AND
resolution-agnosticism. Teacher candidates: CLIP, DINOv2, Whisper
(per the Bertenstein cross-modal alignment work).
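
The combined objective could look like the sketch below. The cosine form of the matching term, the weight `lam`, and the assumption that student and teacher features already share a dimension (for example via a projection layer) are illustrative choices, not details taken from the source.

```python
import numpy as np

def distill_loss(x, x_hat, student_feat, teacher_feat, lam=1.0):
    """Reconstruction MSE plus a cosine representation-matching term.

    x, x_hat:                   arrays of identical shape (input and reconstruction)
    student_feat, teacher_feat: [B, F] pooled features
    returns: (total, recon, match)
    """
    recon = np.mean((x - x_hat) ** 2)
    s = student_feat / np.linalg.norm(student_feat, axis=-1, keepdims=True)
    t = teacher_feat / np.linalg.norm(teacher_feat, axis=-1, keepdims=True)
    match = np.mean(1.0 - np.sum(s * t, axis=-1))   # 0 when directions agree
    return recon + lam * match, recon, match

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3, 8, 8))
feat = rng.normal(size=(4, 16))
total, recon, match = distill_loss(x, x, feat, 2.0 * feat)
```

NumPy is used only to show the arithmetic. An actual training loop would express the same terms in an autodiff framework.
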
**Falsifiable prediction.** A sphere-solver student distilled from
DINOv2 at 224×224 produces representations that, when evaluated on a
standard linear-probe benchmark at 448×448, match or exceed direct
DINOv2 at 448×448. If the student degrades at non-training scale
the way the teacher does, distillation didn't transfer
regime-flatness — it transferred only representational quality, and
the architectural Omega property is more fragile than the
training-from-scratch case suggests.
### 5b. Distillation FROM sphere-solvers (codebook freezing)
**The utility.** Extract a codebook artifact, freeze it, train cheap
downstream models that consume codebook activations rather than
re-running the encoder. The codebook becomes a portable feature
vocabulary; downstream models are 1-2 orders of magnitude smaller.
**Why Omega.** U5's verdict (as_is_packaging) makes this trivially
feasible — codebooks are stable artifacts, model-intrinsic and
calibration-insensitive. The downstream model never sees the original
encoder; it only sees activation patterns over a fixed vocabulary.
Resolution-agnosticism is inherited because the codebook is the same
at every scale.
**Specific construction.** Pipeline: (1) extract codebook once, save
as safetensors+JSON. (2) Pre-compute activation patterns for
training corpus. (3) Train any standard architecture (MLP, small
transformer, CNN) with axis activations as input. Codebook stays
frozen forever after step 1.
**Falsifiable prediction.** Already validated by U5 + the geolip-core
pipeline. Failure mode would be: a downstream model trained on
codebook activations underperforms an end-to-end model of similar
parameter count. This is predicted not to happen in the regime-flat use case
(where end-to-end models lack regime-flatness anyway), but it might
in the standard fixed-resolution regime, where an end-to-end model has
the advantage of training all of its parameters.
---
## 6. Tokenization for downstream LLMs / multimodal models
**The utility.** The codebook is a discrete vocabulary of size
`n_axes` (typically 27–230). Images → axis activation sequences →
discrete tokens fed to autoregressive language models. The geolip-svae
becomes an image tokenizer for the existing multimodal-LLM ecosystem.
**Why Omega.** Three properties matter. Vocabulary size is small
compared to standard learned image tokenizers (VQ-VAE typically
~8K-16K codes). With ~30 axes, a dense encoding spends ~30 tokens per
patch, so a 512-token budget covers ~17 patches (512 / 30); with a
top-k=4 mixture per patch, the same budget covers 128 patches
(512 / 4). Resolution-agnosticism means the same
tokenizer handles any input image without retraining. Calibration
insensitivity means the tokenizer is a fixed component, not a
learned-per-task module.
**Specific construction.** Wrap codebook quantization as a tokenizer
class with `encode(image) → token_sequence` and `decode(token_sequence,
target_size) → image` methods. Define special tokens for image-start,
image-end, optionally row-start markers for spatial structure.
Integrate via the standard Hugging Face `transformers` tokenizer interface.
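
A minimal sketch of the token layout, assuming one axis index per patch (top-1) and a hypothetical class name `AxisTokenizer`. The image-to-axis-grid step and the axis-grid-to-image step at a target size are left to the encoder and decoder, so the `encode` and `decode` here operate on axis-index grids rather than on images.

```python
import numpy as np

class AxisTokenizer:
    """Maps a grid of per-patch axis indices to a flat token sequence and back."""

    def __init__(self, n_axes):
        self.n_axes = n_axes
        self.img_start = n_axes        # special tokens sit after the axis ids
        self.img_end = n_axes + 1
        self.row_start = n_axes + 2
        self.vocab_size = n_axes + 3

    def encode(self, axis_grid):
        """axis_grid: [H, W] integer array of axis indices."""
        tokens = [self.img_start]
        for row in np.asarray(axis_grid):
            tokens.append(self.row_start)
            tokens.extend(int(i) for i in row)
        tokens.append(self.img_end)
        return tokens

    def decode(self, tokens):
        """Rebuild the [H, W] grid, skipping image-start and image-end markers."""
        rows, current = [], None
        for tok in tokens:
            if tok == self.row_start:
                current = []
                rows.append(current)
            elif tok < self.n_axes and current is not None:
                current.append(tok)
        return np.array(rows, dtype=np.int64)

tokenizer = AxisTokenizer(n_axes=30)
grid = np.array([[1, 2], [3, 4]])
tokens = tokenizer.encode(grid)
restored = tokenizer.decode(tokens)
```

With row markers, an `H x W` grid costs `H * W + H + 2` tokens, which is the overhead to weigh against the token budget discussed above.
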
**Falsifiable prediction.** A small (~100M param) decoder-only LLM
trained on text + axis-token sequences performs image captioning at
the same quality as CLIP+LLM with comparable compute. If quality is
significantly lower, axis tokenization is losing image content that
continuous embeddings preserve, and the discreteness has a real
cost. If quality matches, the small vocabulary is a free reduction
in token budget for image content.
---
## 7. Anomaly / OOD detection
**The utility.** Self-validating inference. Compute the codebook of
the input itself (not the model's reference codebook) and measure
deviation from the reference. Inputs whose induced codebook
substantially deviates from the model's training-derived codebook
are out-of-distribution; the deviation magnitude is the OOD score.
**Why Omega.** A regime-flat model has a well-defined "in-distribution"
surface in codebook space. The `is_projective_clean` check already
captures this internally for codebook validation. Inverted, the same
machinery becomes an inference-time validity flag: every prediction
ships with a confidence signal derived from the input's geometric
compatibility with the codebook.
**Specific construction.** At inference, extract a per-batch codebook
from the input M tensor and compute Procrustes distance to the
attached reference codebook. Add to InferenceEngine as
`engine.validity_score(images) → float` and threshold-based
`engine.predict_with_confidence(images) → (recon, confidence)`.
The throughput sweep already shows MSE ratio is a candidate validity
signal — Procrustes distance on a per-batch codebook is the
finer-grained version.
**Falsifiable prediction.** Inputs with codebook Procrustes distance
> 0.5 from reference produce reconstructions with MSE > 5× native
floor. If correlation between codebook deviation and reconstruction
quality is weak (correlation < 0.5), the codebook deviation is
measuring something independent of model competence, and it isn't a
useful inference-time validity signal.
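
The prediction could be checked with a helper like the one below. It assumes Pearson correlation, since the source does not say which correlation measure is intended, and the function name and toy numbers are purely illustrative and are not measurements.

```python
import numpy as np

def check_validity_signal(distances, mse_ratios,
                          dist_thresh=0.5, mse_thresh=5.0, min_corr=0.5):
    """Compare codebook deviation against reconstruction degradation.

    distances:  per-input Procrustes distance to the reference codebook
    mse_ratios: per-input reconstruction MSE divided by the native floor
    """
    distances = np.asarray(distances, dtype=float)
    mse_ratios = np.asarray(mse_ratios, dtype=float)
    corr = float(np.corrcoef(distances, mse_ratios)[0, 1])   # Pearson r (assumed)
    far = distances > dist_thresh
    # fraction of far-from-reference inputs that also reconstruct poorly
    hit_rate = float(np.mean(mse_ratios[far] > mse_thresh)) if far.any() else float("nan")
    return {"corr": corr, "hit_rate": hit_rate, "useful": corr >= min_corr}

# Toy values for illustration only
result = check_validity_signal([0.1, 0.2, 0.6, 0.8], [1.0, 1.2, 6.0, 9.0])
```
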
---
## 8. Cross-modal alignment
**The utility.** Multiple sphere-solvers trained on different
modalities (image, audio, text-as-noise) project into compatible
codebook spaces after Procrustes alignment. Cross-modal retrieval,
joint generation, and modality translation operate in shared axis
space rather than via a learned joint embedding.
**Why Omega.** The Bertenstein work demonstrated this with frozen
expert encoders projecting through a shared text hub. Today's finding
strengthens the claim: cross-modal alignment is *between codebooks*
(deterministic artifacts) rather than between learned projections.
Each modality's sphere-solver produces a codebook on its own
ℝP^(D-1); alignment is a fixed rotation, not a trained mapping.
**Specific construction.** Train sphere-solvers per modality. Extract
codebooks. Compute pairwise Procrustes alignments to a chosen
reference modality. At inference, project inputs through their native
sphere-solver, apply the cross-modal rotation, and operate in shared
axis space. No joint training required after the per-modality stage.
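
The inference path can be sketched as follows, assuming each sphere-solver yields one D-dimensional vector per input and that the cross-modal rotation has already been computed. The helper names `to_shared_space` and `retrieve` are hypothetical, and the data is synthetic.

```python
import numpy as np

def unit(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def to_shared_space(vectors, rotation, reference_codebook):
    """Rotate a modality's vectors into the reference frame, then read off axis activations."""
    return np.abs(unit(vectors @ rotation) @ unit(reference_codebook).T)

def retrieve(query_acts, gallery_acts):
    """Index of the most cosine-similar gallery item for each query."""
    return np.argmax(unit(query_acts) @ unit(gallery_acts).T, axis=1)

rng = np.random.default_rng(0)
reference_codebook = rng.normal(size=(32, 8))
image_vecs = rng.normal(size=(5, 8))               # already in the reference frame
q, _ = np.linalg.qr(rng.normal(size=(8, 8)))       # stand-in cross-modal rotation
text_vecs = image_vecs @ q.T                       # same content, rotated frame
image_acts = to_shared_space(image_vecs, np.eye(8), reference_codebook)
text_acts = to_shared_space(text_vecs, q, reference_codebook)
matches = retrieve(text_acts, image_acts)
```

In this synthetic setup every text query retrieves its paired image because the two modalities differ only by the rotation. Whether real modalities are related this simply is what the prediction for this section tests.
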
**Falsifiable prediction.** Image-text retrieval via codebook
alignment matches CLIP-style joint-embedding retrieval at comparable
compute on standard benchmarks (MS-COCO, Flickr30K). If retrieval is
significantly worse, cross-modal information lives in the relations
*between* codebook activations rather than in the codebooks
themselves, and the alignment-only approach is missing structure that
joint training captures.
---
## 9. Self-supervised pretraining recipes
**The utility.** Bootstrap foundation models on structured noise
alone. The h2-64 batteries already train on noise distributions and
develop projective-clean codebooks; this generalizes to a recipe for
training sphere-solver foundation models without curated real-world
data.
**Why Omega.** The projective-axis codebook emerges deterministically
from sphere-normalized SVD training, regardless of input distribution
(per U5: gaussian and sixteen-noise calibrations produce essentially
identical codebooks for the same model). The model's geometric
substrate is largely independent of training corpus identity. This
suggests a useful inverse: a foundation model can be pretrained on
synthetic/structured noise and then fine-tuned to specific modalities
via the cross-modal alignment recipe (Section 8).
**Specific construction.** Define a noise curriculum that exercises
the geometric primitives — gaussian, fractal, structured-but-random,
adversarial noise. Train sphere-solver to high reconstruction quality
on this curriculum. Verify the codebook is projective-clean (built-in
quality check). Release as foundation model.
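
Two of the curriculum's noise families can be generated as below. This sketch covers only Gaussian and fractal (1/f^beta) noise; the structured-but-random and adversarial families are not attempted, and the round-robin mixing in `curriculum_batch` is an assumption rather than the project's schedule.

```python
import numpy as np

def gaussian_noise(size, rng):
    return rng.normal(size=(size, size))

def fractal_noise(size, rng, beta=2.0):
    """1/f^beta noise: shape a white-noise spectrum, then invert the FFT."""
    fx = np.fft.fftfreq(size)[:, None]
    fy = np.fft.fftfreq(size)[None, :]
    freq = np.sqrt(fx ** 2 + fy ** 2)
    freq[0, 0] = 1.0                     # avoid dividing by zero at the DC term
    # amplitude ~ f^(-beta/2) gives a power spectrum ~ f^(-beta)
    spectrum = np.fft.fft2(rng.normal(size=(size, size))) / freq ** (beta / 2.0)
    img = np.fft.ifft2(spectrum).real
    return (img - img.mean()) / img.std()

def curriculum_batch(batch_size, size, rng):
    """Alternate generators so each batch mixes noise families."""
    gens = [gaussian_noise, fractal_noise]
    return np.stack([gens[i % len(gens)](size, rng) for i in range(batch_size)])

rng = np.random.default_rng(0)
batch = curriculum_batch(4, 64, rng)
```
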
**Falsifiable prediction.** A sphere-solver foundation model
pretrained on noise alone, fine-tuned on ImageNet via 1% of the
parameters (a small adapter on top of the frozen encoder), matches
or exceeds equivalent-compute models pretrained directly on
ImageNet. If noise-pretraining produces worse downstream performance
than ImageNet-pretraining at fixed compute, the geometric substrate
isn't sufficient on its own — there's content in real-world
distributions the model needs to see during pretraining to learn
effectively.
---
## 10. Continual learning / model-merging
**The utility.** Codebooks from independently-trained models are
comparable artifacts. Merging two models = aligning their codebooks
via Procrustes, optionally extending the joint axis set to cover
union-of-features. Continual learning becomes "extend the codebook
when novel structure appears" rather than "retrain to incorporate new
data."
**Why Omega.** Model identity in the geolip-svae family is largely
captured by the codebook (calibration insensitivity confirms this).
Two models trained on different distributions but the same
architecture have different codebooks; aligning them via Procrustes
gives a principled way to combine them without the parameter
interference that plagues standard model-merging methods.
**Specific construction.** Operations on Codebook artifacts:
- `Codebook.merge(other) → Codebook`: union of axes after Procrustes
alignment, with antipodal-pair re-collapse to deduplicate
- `Codebook.diff(other) → axes`: axes in `self` that don't have a
near-equivalent in `other` after alignment — the novel structure
- `Codebook.extend(novel_axes) → Codebook`: append new axes,
re-validate projective-cleanness
- Continual learning loop: train, extract codebook, diff against
prior codebook, decide whether to keep new axes, re-emit updated
codebook.
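
The `diff` and `merge` operations might be sketched like this, written as free functions over `[n_axes, D]` arrays rather than as `Codebook` methods. The sketch assumes the two codebooks are already Procrustes-aligned and uses a hypothetical `tol` threshold on absolute cosine to decide when two axes count as equivalent.

```python
import numpy as np

def unit(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def diff(axes_self, axes_other, tol=0.95):
    """Axes in axes_self with no near-equivalent (|cos| >= tol) in axes_other."""
    sim = np.abs(unit(axes_self) @ unit(axes_other).T)   # abs() collapses antipodal pairs
    return axes_self[sim.max(axis=1) < tol]

def merge(axes_self, axes_other, tol=0.95):
    """Union of axes: keep axes_self, append only the novel axes from axes_other."""
    return np.vstack([axes_self, diff(axes_other, axes_self, tol)])

a = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
b = np.array([[0.0, -1.0, 0.0], [0.0, 0.0, 1.0]])   # first axis is antipodal to a[1]
novel = diff(b, a)
merged = merge(a, b)
```

Here `b[0]` is dropped as a duplicate of `a[1]`, so the merged codebook has three axes. The sketch does not deduplicate within a single codebook or re-validate projective-cleanness, both of which the full operations would need.
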
**Falsifiable prediction.** Two h2-64 batteries (different noise
distributions) merge into a combined codebook with deviation in the
0.20–0.23 CV band. If the merge produces a codebook that *fails*
projective-cleanness, the two codebooks live on incompatible
projective subspaces and merging is not just a Procrustes alignment
— there's content-level interference that requires retraining.
---
## What this clause does NOT cover
Excluded by methodology — these are useful applications of geolip-svae
but do not depend on the Omega substrate in a load-bearing way:
- **Standard feature extraction** for downstream tasks where the input
resolution and modality are fixed. Any encoder can do this; nothing
Omega-dependent.
- **Adversarial robustness** as a downstream goal. Possibly correlated
with codebook quality but not enabled by it specifically.
- **Reinforcement learning state representations.** The geometric
substrate provides nothing the RL community can't get from a
standard VAE.
- **Generative pretraining for autoregressive language modeling.**
Sphere-solvers are not autoregressive; the pathway from this substrate
to LLM pretraining is speculative.
---
## Build-order considerations
If utilities will be built in sequence rather than in parallel, the
priority ordering by *information value per build* is:
1. **§7 OOD detection** — already mostly present in the codebook
machinery, easiest to ship. Validates the validity-flag framing
from this morning's pivot.
2. **§5b distillation FROM sphere-solvers** — also mostly present,
needs only API wrapping. Demonstrates the codebook as portable
artifact for the public release.
3. **§4 solving primitives** — exposes the model's identity claim
directly. The `project / align / solve_basis` triple is a clean
API surface.
4. **§1 classification** — first non-trivial test of regime-flatness
beyond reconstruction. Falsifiable prediction is sharp.
5. **§6 tokenization** — bridge to mainstream multimodal architectures.
Higher build cost but high impact for adoption.
6. **§8 cross-modal alignment** — extends Bertenstein under the new
framing. Build cost is moderate; depends on having multiple
modality-specific sphere-solvers trained.
7. **§5a distillation INTO sphere-solvers** — significant training
investment. Defer until after smaller utilities validate.
8. **§2 diffusion** — substantial build, novel pathway, high uncertainty.
Worth doing once the codebook artifact patterns are mature.
9. **§9 self-supervised pretraining** — biggest investment, most
speculative, but if it works it's the largest payoff.
10. **§3 processing** — depends on §1 + §2 maturity for activation
edits to be principled. Last of the deployment-oriented builds.
11. **§10 model-merging** — research utility rather than deployment
utility. Useful when there are many trained sphere-solvers to
consolidate.
The first three are all near-term and reuse existing machinery;
together they constitute a release-ready feature set. The remainder
are the multi-month research agenda.