Create OMEGA_PROGRESSION.md

	# Potential Downstream Utilities Clause

	Status: Forward-looking. Each utility takes the Omega substrate as a
	load-bearing assumption — regime-independence of reconstruction quality
	across input scale, the projective-axis codebook as a deterministic
	property of trained sphere-solvers, and hardware-determined throughput
	limits independent of model behavior. Utilities that would work
	equivalently on any encoder are excluded; this is a list of capabilities
	that are enabled by Omega, not capabilities incidentally compatible
	with it.

	Methodology. Per the post-000108 research stage, every utility
	section ends with a falsifiable prediction — what would have to be true
	for the utility to NOT work. Construction precedes proof. The first
	build that fails its prediction tells us where the substrate's
	boundary actually is.

	---

	## 1. Classification

	The utility. A projective codebook of `n_axes` directions on
	ℝP^(D-1) is a vocabulary of feature primitives. Image → patch grid → M
	tensor → per-patch projection onto codebook axes → activation pattern
	of shape `[B, n_patches, V, n_axes]`. A linear or shallow head over
	this representation performs classification.

	Why Omega. The codebook is model-intrinsic and regime-flat. A
	classifier trained on activation patterns at 64×64 should generalize
	to 512×512 inputs at inference without retraining, because the
	codebook itself doesn't change with input size. Standard CLIP-style
	models do not give this property — their representations drift with
	input resolution; their pooling operations bake in a particular spatial
	extent.

	Specific construction. Train classifier head on per-patch axis
	activations averaged across patches (or attended-over). For
	fine-grained tasks, retain the spatial structure: classifier sees the
	full `[n_patches, n_axes]` matrix as a 2D feature map. Per-patch
	aggregation already validated in scratchpad 000104 — patch_idx=0 fails
	because it discards spatial signal; patch-mean recovers most of the
	gap.
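The projection and patch-mean pooling described above can be sketched as follows. This is a minimal illustration, not the project's implementation: the M-tensor layout `[B, n_patches, V, D]`, the function names `axis_activations` and `patch_mean_features`, and the use of absolute cosine as the activation (so that antipodal directions map to the same value) are all assumptions.

```python
import numpy as np

def axis_activations(m, codebook):
    """Project each M-tensor row onto every codebook axis.

    m:        [B, n_patches, V, D] array (assumed layout)
    codebook: [n_axes, D] array of axis directions
    returns:  [B, n_patches, V, n_axes] activations in [0, 1]
    """
    m_unit = m / np.linalg.norm(m, axis=-1, keepdims=True)
    axes = codebook / np.linalg.norm(codebook, axis=-1, keepdims=True)
    # |cos| is sign-invariant, so v and -v give the same activation.
    return np.abs(np.einsum("bpvd,ad->bpva", m_unit, axes))

def patch_mean_features(acts):
    """Average over patches: [B, n_patches, V, n_axes] -> [B, V * n_axes]."""
    pooled = acts.mean(axis=1)
    return pooled.reshape(pooled.shape[0], -1)

rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 16))          # illustrative sizes only
small = rng.normal(size=(2, 16, 4, 16))      # e.g. a 4x4 patch grid
large = rng.normal(size=(2, 1024, 4, 16))    # e.g. a 32x32 patch grid
feat_small = patch_mean_features(axis_activations(small, codebook))
feat_large = patch_mean_features(axis_activations(large, codebook))
```

The pooled feature width depends only on `V` and `n_axes`, not on the patch count, so the same linear head accepts both inputs. Whether accuracy actually transfers is what the prediction below tests.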

	Falsifiable prediction. A classifier trained on 64×64 activation
	patterns achieves comparable accuracy on 512×512 test inputs (within
	2 percentage points) without any architectural adaptation. If accuracy
	drops sharply with input resolution, the codebook activations are not
	in fact regime-invariant in the way reconstruction is, and Omega
	covers reconstruction but not classification — a meaningful boundary.

	---

	## 2. Diffusion

	The utility. Discrete diffusion in axis-index space. Each patch's
	M-tensor row gets quantized to its nearest codebook axis (or top-k
	mixture). The "noise" process is gradual randomization of axis
	assignments; the "denoise" process is a transformer that predicts
	axis indices from corrupted sequences. Sampling = run denoiser to
	clean axis sequence → reconstruct image via codebook → decoder.

	Why Omega. Three properties combine here. The codebook is a
	finite, deterministic vocabulary, so discrete diffusion is well-defined
	without extra quantizer training. The decoder is regime-flat, so a
	diffusion model trained on 64×64 axis sequences can sample at any
	resolution by predicting longer sequences and decoding at the target
	size. The codebook's projective structure means antipodal axes carry
	equivalent information, which meaningfully reduces the effective
	vocabulary size for the diffusion target.

	Specific construction. Diffusion target: `[n_patches, top_k]`
	discrete indices into codebook. Loss: cross-entropy over axis indices.
	Backbone: any transformer that handles variable-length token sequences
	(patch count varies with target resolution). Conditioning: optional
	class label or text embedding via cross-attention.
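A rough sketch of the diffusion target and forward corruption, under stated assumptions: activations are a `[n_patches, n_axes]` array, the corruption replaces each index with a uniformly random axis at a given probability (one common choice for discrete diffusion; the text only specifies "gradual randomization"), and the names `quantize_top_k` and `corrupt` are hypothetical.

```python
import numpy as np

def quantize_top_k(acts, top_k=1):
    """acts: [n_patches, n_axes] -> [n_patches, top_k] axis indices, strongest first."""
    order = np.argsort(-acts, axis=-1)
    return order[:, :top_k]

def corrupt(indices, n_axes, noise_level, rng):
    """Replace each index with a uniformly random axis with probability noise_level."""
    mask = rng.random(indices.shape) < noise_level
    random_axes = rng.integers(0, n_axes, size=indices.shape)
    return np.where(mask, random_axes, indices)

rng = np.random.default_rng(0)
acts = rng.random((64, 30))                  # illustrative: 64 patches, 30 axes
clean = quantize_top_k(acts, top_k=4)        # diffusion target
half_noisy = corrupt(clean, n_axes=30, noise_level=0.5, rng=rng)
```

A denoiser would be trained with cross-entropy to recover `clean` from `corrupt(clean, ...)` at sampled noise levels. Because the functions are shape-agnostic in the patch dimension, the same code applies to longer sequences at a larger target resolution.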

	Falsifiable prediction. A diffusion model trained on 64×64 axis
	sequences from h2-64 produces coherent samples at 256×256 by sampling
	longer sequences and decoding at the target size, without retraining.
	If samples at non-native resolution show mode collapse or boundary
	artifacts beyond what the encoder-decoder pair produces directly,
	the codebook's discreteness is interfering with the regime-flat
	reconstruction, and the substrate's envelope is narrower than expected.

	---

	## 3. Processing (image-to-image edits in axis space)

	The utility. Operations applied to codebook activations rather
	than pixels. Image → encode → edit activations → decode. Style
	transfer, denoising, inpainting, semantic editing all become
	manipulations of the `[n_patches, V, n_axes]` activation tensor,
	followed by reconstruction.

	Why Omega. Edits made at one resolution are coherent when decoded
	at another, because the codebook is the same vocabulary at every
	scale. A 64×64 inpaint mask can produce a 512×512 inpainted output by
	upsampling the edited activations and decoding at the target size.
	Critically, the activation edits respect the geometric constraints
	that produced the codebook — operations that move activations off
	the codebook produce reconstruction artifacts that are themselves a
	useful signal.

	Specific construction. Define edit operations as activation-tensor
	transformations: zero-out (denoise), substitute axis-set (style
	transfer), spatial-gather + redistribute (inpaint), interpolate
	between two images' activations (semantic morph). Provide a
	`process_at_scale` API mirroring `reconstruct_at_scale`.
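Three of the edit operations can be sketched as plain array transformations. This is an illustration under assumptions: activations for one image are a `[n_patches, V, n_axes]` array on a square patch grid, upsampling is nearest-neighbour (the text does not specify a method), and the function names are hypothetical rather than the `process_at_scale` API itself.

```python
import numpy as np

def zero_out(acts, threshold):
    """Denoise-style edit: suppress activations below threshold."""
    return np.where(acts >= threshold, acts, 0.0)

def morph(acts_a, acts_b, t):
    """Semantic morph: linear interpolation between two activation tensors."""
    return (1.0 - t) * acts_a + t * acts_b

def upsample_patches(acts, grid, factor):
    """Nearest-neighbour upsampling of a [grid*grid, V, n_axes] tensor."""
    v, n_axes = acts.shape[1:]
    g = acts.reshape(grid, grid, v, n_axes)
    g = g.repeat(factor, axis=0).repeat(factor, axis=1)
    return g.reshape(grid * factor * grid * factor, v, n_axes)

rng = np.random.default_rng(0)
acts_a = rng.random((16, 4, 30))   # 4x4 patch grid, V=4, 30 axes (illustrative)
acts_b = rng.random((16, 4, 30))
edited = morph(acts_a, acts_b, t=0.25)
edited_large = upsample_patches(edited, grid=4, factor=8)  # 32x32 patch grid
```

The edit happens once at the small grid, and only the upsampled result is handed to the decoder. That is the path the falsifiable prediction compares against editing a direct large-scale encoding.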

	Falsifiable prediction. Style transfer applied to 64×64
	activations and decoded at 512×512 produces output indistinguishable
	in style consistency from the same operation applied directly to a
	512×512 encoding. If the upsampled-edit path produces worse style
	transfer than the direct-encode path, the activation upsampling is
	losing geometric structure that the encoder captures — and Omega's
	regime-flatness has a stricter envelope than reconstruction MSE
	alone reveals.

	---

	## 4. Solving

	The utility. The most direct framing: use the trained sphere-solver
	to solve geometric problems on its native manifold. Given a set of
	points in ℝ^D, encode them via the model's projection path to get
	their representation on ℝP^(D-1). Given a set of vectors, solve for
	the codebook axes that span them. Given two sets of points, find the
	optimal projective alignment via Procrustes on their codebooks.

	Why Omega. This is the closest utility to the model's identity
	claim. The model is named "sphere-solver" because that's what it is —
	a parametric solver for "what's the best projective representation of
	this data on the unit sphere?" The Omega finding is that this solver
	is regime-independent: the same machinery handles 64 input points or
	65,536 input points and produces structurally consistent answers.

	Specific construction. Expose three solver primitives:
	- `project(points, model) → axes`: encode arbitrary point clouds via
	the model's encoder to get their codebook representation
	- `align(codebook_a, codebook_b) → rotation`: Procrustes-align two
	codebooks (already implemented in tests/framework.py)
	- `solve_basis(target_vectors, model) → axis_indices`: given target
	vectors, find the codebook axes that best span them
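For reference, the `align` primitive can be sketched with the standard orthogonal Procrustes solution. This is not the implementation in tests/framework.py. It assumes both codebooks are `[n_axes, D]` arrays whose rows are already matched in order and sign, and `rotation_distance` is an assumed metric, since the text does not define how rotation distance is measured.

```python
import numpy as np

def align(codebook_a, codebook_b):
    """Orthogonal Procrustes: R minimising ||codebook_a @ R - codebook_b||_F.

    Assumes rows are in corresponding order and sign. The result is
    orthogonal and may include a reflection.
    """
    u, _, vt = np.linalg.svd(codebook_a.T @ codebook_b)
    return u @ vt

def rotation_distance(rotation):
    """Frobenius distance from the identity, scaled by sqrt(D) (assumed metric)."""
    d = rotation.shape[0]
    return float(np.linalg.norm(rotation - np.eye(d)) / np.sqrt(d))

rng = np.random.default_rng(0)
codebook_a = rng.normal(size=(30, 8))                      # illustrative sizes
true_rotation, _ = np.linalg.qr(rng.normal(size=(8, 8)))   # random orthogonal matrix
codebook_b = codebook_a @ true_rotation
recovered = align(codebook_a, codebook_b)
```

When one codebook is an exact orthogonal transform of the other, the SVD solution recovers that transform. Real codebooks would also need axis matching and antipodal sign resolution before this step.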

	Falsifiable prediction. Procrustes alignment between codebooks of
	the same model on different calibration distributions yields a
	rotation distance below 0.1 (already verified at U5 — calibration
	deviations differ by ~0.003). Cross-model alignment between two
	sphere-solvers trained on the same data yields a rotation distance
	below 0.3 (predicted, not yet measured). If the cross-model alignment
	turns out to be indistinguishable from a random orthogonal rotation,
	codebook structure is data-driven rather than architecture-driven,
	and the solver's "intrinsic" status is overstated.

	---

	## 5. Distillation

	Two directions, distinct enough to enumerate separately.

	### 5a. Distillation INTO sphere-solvers

	The utility. Train a sphere-solver student to match a non-Omega
	teacher's representations. Student inherits regime-flatness
	automatically; teacher's representational quality flows into a
	deployable encoder that handles arbitrary resolution without extra
	machinery.

	Why Omega. Standard distillation produces a student whose
	behavior interpolates the teacher's at training scale. A
	sphere-solver student, by virtue of its architecture, additionally
	inherits regime-flatness — the student behaves consistently at
	inference scales the teacher was never tested on. This is a
	distillation result that wouldn't follow from teacher quality alone.

	Specific construction. Loss combines reconstruction (the
	sphere-solver's native objective) with representation matching
	against the teacher's pooled features at intermediate resolution.
	Student emerges with both teacher-like representations AND
	resolution-agnosticism. Teacher candidates: CLIP, DINOv2, Whisper
	(per the Bertenstein cross-modal alignment work).
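One way the combined objective might look, as a sketch only: reconstruction is mean squared error, representation matching is cosine distance, and student features are assumed to be already projected to the teacher's width. The weighting parameter `weight` and the function name `distillation_loss` are hypothetical; the text does not fix the matching term.

```python
import numpy as np

def distillation_loss(recon, target, student_feat, teacher_feat, weight=1.0):
    """Reconstruction MSE plus weighted cosine distance to teacher features.

    recon, target:              arrays of identical shape
    student_feat, teacher_feat: [B, F] pooled features (same width assumed)
    """
    recon_term = np.mean((recon - target) ** 2)
    s = student_feat / np.linalg.norm(student_feat, axis=-1, keepdims=True)
    t = teacher_feat / np.linalg.norm(teacher_feat, axis=-1, keepdims=True)
    match_term = np.mean(1.0 - np.sum(s * t, axis=-1))
    return float(recon_term + weight * match_term)

rng = np.random.default_rng(0)
image = rng.random((2, 3, 32, 32))
teacher = rng.normal(size=(2, 128))
perfect = distillation_loss(image, image, teacher, teacher)
worse = distillation_loss(image + 0.1, image, teacher, -teacher)
```

In an actual training loop the same two terms would be written in an autodiff framework. NumPy is used here only to make the arithmetic explicit.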

	Falsifiable prediction. A sphere-solver student distilled from
	DINOv2 at 224×224 produces representations that, when evaluated on a
	standard linear-probe benchmark at 448×448, match or exceed direct
	DINOv2 at 448×448. If the student degrades at non-training scale
	the way the teacher does, distillation didn't transfer
	regime-flatness — it transferred only representational quality, and
	the architectural Omega property is more fragile than the
	training-from-scratch case suggests.

	### 5b. Distillation FROM sphere-solvers (codebook freezing)

	The utility. Extract a codebook artifact, freeze it, train cheap
	downstream models that consume codebook activations rather than
	re-running the encoder. The codebook becomes a portable feature
	vocabulary; downstream models are 1-2 orders of magnitude smaller.

	Why Omega. U5's verdict (as_is_packaging) makes this trivially
	feasible — codebooks are stable artifacts, model-intrinsic and
	calibration-insensitive. The downstream model never sees the original
	encoder; it only sees activation patterns over a fixed vocabulary.
	Resolution-agnosticism is inherited because the codebook is the same
	at every scale.

	Specific construction. Pipeline: (1) extract codebook once, save
	as safetensors+JSON. (2) Pre-compute activation patterns for
	training corpus. (3) Train any standard architecture (MLP, small
	transformer, CNN) with axis activations as input. Codebook stays
	frozen forever after step 1.
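Step (1) of the pipeline could be sketched like this. To keep the example dependency-free, a NumPy `.npy` file stands in for the safetensors file named above, with a JSON sidecar for metadata. The file names, metadata fields, and function names are assumptions.

```python
import json
import tempfile
from pathlib import Path

import numpy as np

def save_codebook(codebook, directory, name="codebook"):
    """Write the axes plus a JSON sidecar describing them."""
    directory = Path(directory)
    np.save(directory / f"{name}.npy", codebook)
    meta = {
        "n_axes": int(codebook.shape[0]),
        "dim": int(codebook.shape[1]),
        "dtype": str(codebook.dtype),
    }
    (directory / f"{name}.json").write_text(json.dumps(meta))

def load_codebook(directory, name="codebook"):
    """Load the axes, check them against the sidecar, and mark them read-only."""
    directory = Path(directory)
    meta = json.loads((directory / f"{name}.json").read_text())
    codebook = np.load(directory / f"{name}.npy")
    if codebook.shape != (meta["n_axes"], meta["dim"]):
        raise ValueError("codebook shape does not match its metadata")
    codebook.setflags(write=False)  # frozen: downstream code cannot mutate it
    return codebook, meta

with tempfile.TemporaryDirectory() as tmp:
    original = np.random.default_rng(0).normal(size=(30, 8)).astype(np.float32)
    save_codebook(original, tmp)
    frozen, meta = load_codebook(tmp)
```

Marking the loaded array read-only is a cheap guard for the "frozen forever" rule: any downstream code that tries to write into the codebook raises an error instead of silently drifting.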

	Falsifiable prediction. Already validated by U5 and the geolip-core
	pipeline. The failure mode would be a downstream model trained on
	codebook activations underperforming an end-to-end model of similar
	parameter count. This is predicted not to happen in the regime-flat
	use case (where end-to-end models lack regime-flatness anyway), but
	it might happen in the standard fixed-resolution regime, where the
	end-to-end model can train all of its parameters freely.

	---

	## 6. Tokenization for downstream LLMs / multimodal models

	The utility. The codebook is a discrete vocabulary of size
	`n_axes` (typically 27–230). Images → axis activation sequences →
	discrete tokens fed to autoregressive language models. The geolip-svae
	becomes an image tokenizer for the existing multimodal-LLM ecosystem.

	Why Omega. Three properties matter. Vocabulary size is small
	compared to standard learned image tokenizers (VQ-VAE typically
	~8K-16K codes). With ~30 axes, a 512-token budget covers ~17 patches
	if every axis activation is emitted as a token (512 / 30), or 128
	patches if each patch is reduced to a top-k=4 mixture (512 / 4).
	Resolution-agnosticism means the same tokenizer handles any input
	image without retraining. Calibration insensitivity means the
	tokenizer is a fixed component, not a learned-per-task module.
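The token-budget figures follow from integer division, assuming the ~17 figure corresponds to emitting one token per axis per patch and the ~128 figure to four tokens per patch. The helper name is hypothetical.

```python
def patches_per_budget(token_budget, tokens_per_patch):
    """How many whole patches fit in a fixed token budget."""
    return token_budget // tokens_per_patch

dense = patches_per_budget(512, 30)  # one token per axis, ~30 axes -> 17 patches
top_k = patches_per_budget(512, 4)   # top-k=4 mixture per patch -> 128 patches
```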

	Specific construction. Wrap codebook quantization as a tokenizer
	class with `encode(image) → token_sequence` and `decode(token_sequence,
	target_size) → image` methods. Define special tokens for image-start,
	image-end, optionally row-start markers for spatial structure.
	Integrate via standard transformers/HuggingFace tokenizer interface.
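A toy version of the token layout, to make the special-token scheme concrete. This sketch covers only the mapping between a grid of per-patch axis indices and a flat token sequence. The image encoder and the `decode(token_sequence, target_size) → image` step are out of scope here, and the id assignment (axes first, then three special tokens) is an assumption. It does not implement the HuggingFace tokenizer interface.

```python
import numpy as np

class AxisTokenizer:
    """Ids 0..n_axes-1 are axes; then IMG_START, IMG_END, ROW_START."""

    def __init__(self, n_axes):
        self.n_axes = n_axes
        self.img_start = n_axes
        self.img_end = n_axes + 1
        self.row_start = n_axes + 2
        self.vocab_size = n_axes + 3

    def encode(self, axis_grid):
        """axis_grid: [rows, cols] integer array of per-patch axis indices."""
        tokens = [self.img_start]
        for row in np.asarray(axis_grid):
            tokens.append(self.row_start)
            tokens.extend(int(a) for a in row)
        tokens.append(self.img_end)
        return tokens

    def decode_to_grid(self, tokens):
        """Inverse of encode: recover the [rows, cols] axis grid."""
        rows, current = [], None
        for tok in tokens:
            if tok == self.row_start:
                current = []
                rows.append(current)
            elif tok < self.n_axes and current is not None:
                current.append(tok)
        return np.array(rows, dtype=np.int64)

tok = AxisTokenizer(n_axes=30)
grid = np.array([[0, 5, 29], [7, 7, 1]])
tokens = tok.encode(grid)
restored = tok.decode_to_grid(tokens)
```

Row-start markers cost one extra token per row but let the sequence carry grid shape, which matters when the patch count varies with input resolution.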

	Falsifiable prediction. A small (~100M param) decoder-only LLM
	trained on text + axis-token sequences performs image captioning at
	the same quality as CLIP+LLM with comparable compute. If quality is
	significantly lower, axis tokenization is losing image content that
	continuous embeddings preserve, and the discreteness has a real
	cost. If quality matches, the small vocabulary is a free reduction
	in token budget for image content.

	---

	## 7. Anomaly / OOD detection

	The utility. Self-validating inference. Compute the codebook of
	the input itself (not the model's reference codebook) and measure
	deviation from the reference. Inputs whose induced codebook
	substantially deviates from the model's training-derived codebook
	are out-of-distribution; the deviation magnitude is the OOD score.

	Why Omega. A regime-flat model has a well-defined "in-distribution"
	surface in codebook space. The `is_projective_clean` check already
	captures this internally for codebook validation. Inverted, the same
	machinery becomes an inference-time validity flag: every prediction
	ships with a confidence signal derived from the input's geometric
	compatibility with the codebook.

	Specific construction. At inference, extract a per-batch codebook
	from the input M tensor and compute Procrustes distance to the
	attached reference codebook. Add to InferenceEngine as
	`engine.validity_score(images) → float` and threshold-based
	`engine.predict_with_confidence(images) → (recon, confidence)`.
	The throughput sweep already shows MSE ratio is a candidate validity
	signal — Procrustes distance on a per-batch codebook is the
	finer-grained version.
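A sketch of how the score and the thresholded wrapper might fit together, written as standalone functions rather than `InferenceEngine` methods. Extraction of the per-batch codebook is assumed to happen elsewhere. The distance is the relative residual after the best orthogonal alignment, and the linear mapping from distance to confidence is an illustrative choice, not something the text specifies.

```python
import numpy as np

def procrustes_distance(batch_codebook, reference_codebook):
    """Relative residual after the best orthogonal alignment (assumed metric)."""
    u, _, vt = np.linalg.svd(batch_codebook.T @ reference_codebook)
    aligned = batch_codebook @ (u @ vt)
    residual = np.linalg.norm(aligned - reference_codebook)
    return float(residual / np.linalg.norm(reference_codebook))

def predict_with_confidence(recon, batch_codebook, reference_codebook, threshold=0.5):
    """Return the reconstruction with a confidence in [0, 1]; 0 at or past threshold."""
    distance = procrustes_distance(batch_codebook, reference_codebook)
    confidence = max(0.0, 1.0 - distance / threshold)
    return recon, confidence

rng = np.random.default_rng(0)
reference = rng.normal(size=(30, 8))                   # illustrative sizes
rotation, _ = np.linalg.qr(rng.normal(size=(8, 8)))
in_dist = reference @ rotation                          # same structure, rotated
out_dist = rng.normal(size=(30, 8))                     # unrelated codebook
_, conf_in = predict_with_confidence(None, in_dist, reference)
_, conf_out = predict_with_confidence(None, out_dist, reference)
```

A rotated copy of the reference scores near full confidence because alignment removes the rotation; an unrelated codebook leaves a large residual.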

	Falsifiable prediction. Inputs with codebook Procrustes distance
	> 0.5 from reference produce reconstructions with MSE > 5× native
	floor. If correlation between codebook deviation and reconstruction
	quality is weak (correlation < 0.5), the codebook deviation is
	measuring something independent of model competence, and it isn't a
	useful inference-time validity signal.
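The weak-correlation criterion can be checked mechanically once per-batch distances and MSE ratios have been collected. The sketch below assumes Pearson correlation, since the text does not say which coefficient is meant, and uses synthetic numbers purely to exercise the function.

```python
import numpy as np

def validity_signal_is_useful(distances, mse_ratios, min_correlation=0.5):
    """Return (r, ok): Pearson r between deviation and MSE ratio, and r >= threshold."""
    r = float(np.corrcoef(distances, mse_ratios)[0, 1])
    return r, r >= min_correlation

distances = np.linspace(0.0, 1.0, 8)
tracking = 1.0 + 10.0 * distances                  # MSE ratio rises with deviation
unrelated = np.array([1.0, 5.0] * 4)               # MSE ratio ignores deviation
r_good, ok_good = validity_signal_is_useful(distances, tracking)
r_bad, ok_bad = validity_signal_is_useful(distances, unrelated)
```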

	---

	## 8. Cross-modal alignment

	The utility. Multiple sphere-solvers trained on different
	modalities (image, audio, text-as-noise) project into compatible
	codebook spaces after Procrustes alignment. Cross-modal retrieval,
	joint generation, and modality translation operate in shared axis
	space rather than via a learned joint embedding.

	Why Omega. The Bertenstein work demonstrated this with frozen
	expert encoders projecting through a shared text hub. Today's finding
	strengthens the claim: cross-modal alignment is between codebooks
	(deterministic artifacts) rather than between learned projections.
	Each modality's sphere-solver produces a codebook on its own
	ℝP^(D-1); alignment is a fixed rotation, not a trained mapping.

	Specific construction. Train sphere-solvers per modality. Extract
	codebooks. Compute pairwise Procrustes alignments to a chosen
	reference modality. At inference, project inputs through their native
	sphere-solver, apply the cross-modal rotation, and operate in shared
	axis space. No joint training required after the per-modality stage.
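The inference path above can be sketched as a fixed rotation followed by nearest-neighbour search. Assumptions: codebooks are `[n_axes, D]` arrays with rows already matched across modalities, similarity is absolute cosine (sign-invariant, in keeping with the projective setting), and the function names are hypothetical.

```python
import numpy as np

def fit_rotation(source_codebook, reference_codebook):
    """Orthogonal map taking the source codebook onto the reference codebook."""
    u, _, vt = np.linalg.svd(source_codebook.T @ reference_codebook)
    return u @ vt

def retrieve(query, gallery, rotation):
    """Rotate query [D] into the gallery's space; return index of best |cos| match."""
    q = query @ rotation
    q = q / np.linalg.norm(q)
    g = gallery / np.linalg.norm(gallery, axis=-1, keepdims=True)
    return int(np.argmax(np.abs(g @ q)))

rng = np.random.default_rng(0)
image_codebook = rng.normal(size=(30, 8))               # illustrative sizes
hidden, _ = np.linalg.qr(rng.normal(size=(8, 8)))       # unknown modality offset
text_codebook = image_codebook @ hidden

image_items = rng.normal(size=(10, 8))
text_items = image_items @ hidden                       # paired items, synthetic
rotation = fit_rotation(image_codebook, text_codebook)
match = retrieve(image_items[3], text_items, rotation)
```

The rotation is fitted from the codebooks alone and never sees the paired items, which is the sense in which alignment is fixed rather than trained. The synthetic pairing here is exact by construction; real modalities would only be related approximately.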

	Falsifiable prediction. Image-text retrieval via codebook
	alignment matches CLIP-style joint-embedding retrieval at comparable
	compute on standard benchmarks (MS-COCO, Flickr30K). If retrieval is
	significantly worse, cross-modal information lives in the relations
	between codebook activations rather than in the codebooks
	themselves, and the alignment-only approach is missing structure that
	joint training captures.

	---

	## 9. Self-supervised pretraining recipes

	The utility. Bootstrap foundation models on structured noise
	alone. The h2-64 batteries already train on noise distributions and
	develop projective-clean codebooks; this generalizes to a recipe for
	training sphere-solver foundation models without curated real-world
	data.

	Why Omega. The projective-axis codebook emerges deterministically
	from sphere-normalized SVD training, regardless of input distribution
	(per U5: gaussian and sixteen-noise calibrations produce essentially
	identical codebooks for the same model). The model's geometric
	substrate is largely independent of training corpus identity. This
	suggests a useful inverse: a foundation model can be pretrained on
	synthetic/structured noise and then fine-tuned to specific modalities
	via the cross-modal alignment recipe (Section 8).

	Specific construction. Define a noise curriculum that exercises
	the geometric primitives — gaussian, fractal, structured-but-random,
	adversarial noise. Train sphere-solver to high reconstruction quality
	on this curriculum. Verify the codebook is projective-clean (built-in
	quality check). Release as foundation model.
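A minimal sketch of a noise curriculum sampler, covering three of the four families named above (adversarial noise needs a model in the loop and is omitted). The specific generators are assumptions: 1/f spectral shaping stands in for "fractal" and random piecewise-constant blocks for "structured-but-random".

```python
import numpy as np

def gaussian_noise(rng, size):
    """White Gaussian noise image."""
    return rng.normal(size=(size, size))

def fractal_noise(rng, size, exponent=1.0):
    """1/f^exponent noise via spectral shaping of white noise."""
    white = rng.normal(size=(size, size))
    fy = np.fft.fftfreq(size)[:, None]
    fx = np.fft.fftfreq(size)[None, :]
    radius = np.sqrt(fx ** 2 + fy ** 2)
    radius[0, 0] = 1.0  # avoid division by zero at the DC term
    shaped = np.fft.ifft2(np.fft.fft2(white) / radius ** exponent).real
    return (shaped - shaped.mean()) / shaped.std()

def block_noise(rng, size, block=8):
    """Structured-but-random: piecewise-constant random blocks."""
    coarse = rng.normal(size=(size // block, size // block))
    return np.kron(coarse, np.ones((block, block)))

CURRICULUM = [gaussian_noise, fractal_noise, block_noise]

def sample_batch(rng, batch_size, size=64):
    """Draw each image from a uniformly chosen generator."""
    choices = rng.integers(0, len(CURRICULUM), size=batch_size)
    return np.stack([CURRICULUM[c](rng, size) for c in choices])

rng = np.random.default_rng(0)
batch = sample_batch(rng, batch_size=4, size=64)
```

A staged curriculum would replace the uniform choice with a schedule over generators; uniform mixing is the simplest baseline.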

	Falsifiable prediction. A sphere-solver foundation model
	pretrained on noise alone, fine-tuned on ImageNet via 1% of the
	parameters (a small adapter on top of the frozen encoder), matches
	or exceeds equivalent-compute models pretrained directly on
	ImageNet. If noise-pretraining produces worse downstream performance
	than ImageNet-pretraining at fixed compute, the geometric substrate
	isn't sufficient on its own — there's content in real-world
	distributions the model needs to see during pretraining to learn
	effectively.

	---

	## 10. Continual learning / model-merging

	The utility. Codebooks from independently-trained models are
	comparable artifacts. Merging two models = aligning their codebooks
	via Procrustes, optionally extending the joint axis set to cover
	union-of-features. Continual learning becomes "extend the codebook
	when novel structure appears" rather than "retrain to incorporate new
	data."

	Why Omega. Model identity in the geolip-svae family is largely
	captured by the codebook (calibration insensitivity confirms this).
	Two models trained on different distributions but the same
	architecture have different codebooks; aligning them via Procrustes
	gives a principled way to combine them without the parameter
	interference that plagues standard model-merging methods.

	Specific construction. Operations on Codebook artifacts:
	- `Codebook.merge(other) → Codebook`: union of axes after Procrustes
	alignment, with antipodal-pair re-collapse to deduplicate
	- `Codebook.diff(other) → axes`: axes in `self` that don't have a
	near-equivalent in `other` after alignment — the novel structure
	- `Codebook.extend(novel_axes) → Codebook`: append new axes,
	re-validate projective-cleanness
	- Continual learning loop: train, extract codebook, diff against
	prior codebook, decide whether to keep new axes, re-emit updated
	codebook.
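The `diff` and `merge` operations might look like the following on raw axis arrays. This is a sketch under assumptions: Procrustes alignment has already been applied, "near-equivalent" means absolute cosine at or above a tolerance, antipodal pairs are collapsed by flipping each axis so its largest-magnitude component is positive, and duplicates within a single codebook are not handled. The free functions stand in for the `Codebook` methods listed above.

```python
import numpy as np

def _canonical(axes):
    """Unit-normalise; flip each axis so its largest-magnitude component is positive."""
    axes = axes / np.linalg.norm(axes, axis=-1, keepdims=True)
    lead = axes[np.arange(len(axes)), np.argmax(np.abs(axes), axis=-1)]
    return axes * np.sign(lead)[:, None]

def diff(own, other, tol=0.99):
    """Axes in `own` with no near-equivalent (|cos| >= tol) in `other`."""
    own, other = _canonical(own), _canonical(other)
    best = np.abs(own @ other.T).max(axis=-1)
    return own[best < tol]

def merge(own, other, tol=0.99):
    """Union of axes, dropping duplicates and antipodal copies from `other`."""
    own = _canonical(own)
    return np.concatenate([own, diff(other, own, tol)], axis=0)

rng = np.random.default_rng(0)
own = rng.normal(size=(5, 8))                                   # illustrative sizes
other = np.concatenate([-own[:2], rng.normal(size=(3, 8))])     # 2 antipodal copies + 3 new
novel = diff(other, own)
merged = merge(own, other)
```

The continual-learning loop then reduces to calling `diff` against the prior codebook, deciding which rows of `novel` to keep, and re-validating projective-cleanness on the result.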

	Falsifiable prediction. Two h2-64 batteries (different noise
	distributions) merge into a combined codebook with deviation in the
	0.20–0.23 CV band. If the merge produces a codebook that fails
	projective-cleanness, the two codebooks live on incompatible
	projective subspaces and merging is not just a Procrustes alignment
	— there's content-level interference that requires retraining.

	---

	## What this clause does NOT cover

	Excluded by methodology — these are useful applications of geolip-svae
	but do not depend on the Omega substrate in a load-bearing way:

	- Standard feature extraction for downstream tasks where the input
	resolution and modality are fixed. Any encoder can do this; nothing
	Omega-dependent.
	- Adversarial robustness as a downstream goal. Possibly correlated
	with codebook quality but not enabled by it specifically.
	- Reinforcement learning state representations. The geometric
	substrate provides nothing the RL community can't get from a
	standard VAE.
	- Generative pretraining for autoregressive language modeling.
	Sphere-solvers are not autoregressive; the pathway from this
	substrate to LLM pretraining is speculative.

	---

	## Build-order considerations

	If utilities will be built in sequence rather than in parallel, the
	priority ordering by information value per build is:

	1. §7 OOD detection — already mostly present in the codebook
	machinery, easiest to ship. Validates the validity-flag framing
	from this morning's pivot.
	2. §5b distillation FROM sphere-solvers — also mostly present,
	needs only API wrapping. Demonstrates the codebook as portable
	artifact for the public release.
	3. §4 solving primitives — exposes the model's identity claim
	directly. The `project / align / solve_basis` triple is a clean
	API surface.
	4. §1 classification — first non-trivial test of regime-flatness
	beyond reconstruction. Falsifiable prediction is sharp.
	5. §6 tokenization — bridge to mainstream multimodal architectures.
	Higher build cost but high impact for adoption.
	6. §8 cross-modal alignment — extends Bertenstein under the new
	framing. Build cost is moderate; depends on having multiple
	modality-specific sphere-solvers trained.
	7. §5a distillation INTO sphere-solvers — significant training
	investment. Defer until after smaller utilities validate.
	8. §2 diffusion — substantial build, novel pathway, high uncertainty.
	Worth doing once the codebook artifact patterns are mature.
	9. §9 self-supervised pretraining — biggest investment, most
	speculative, but if it works it's the largest payoff.
	10. §3 processing — depends on §1 + §2 maturity for activation
	edits to be principled. Last in sequence.
	11. §10 model-merging — research utility rather than deployment
	utility. Useful when there are many trained sphere-solvers to
	consolidate.

	The first three are all near-term and reuse existing machinery;
	together they constitute a release-ready feature set. The remainder
	are the multi-month research agenda.