---
license: mit
---

Actionable Utility

So far, nearly every shape appears to have the potential to teach the system something useful for downstream tasks.

The large array of math will require a streamlined series of sweeps, run in a highly optimized environment.

Since I don't have expensive hardware at my disposal, I have to take drastic steps to make this feasible.

The Expert-Tuning Solution

So, I won't TRAIN the models using a pair of experts. However, I can TUNE the settings based on the alignment-cascade capacity that two experts are most likely to enable simultaneously with the current build.

In a sense, the experts will predict the settings most likely to be near-optimal by making a quick soup of candidate parameters.

This should provide the yields I require, assuming I pick experts with relationally similar math. So: a parameter-narrowing soup for now; eventually the system should be able to self-attenuate its parameters directly, arriving at the best suggested settings from the get-go.

The models in this experiment set will never be trained by the experts; the experts only select the most likely parameters. The models will never see an expert opinion directly, nor will they receive gradients from anything expert-related. Everything runs in a vacuum.

Flows, Routes, Patterns, Trajectories, Magnitudes, Etc

Every mathematical component will have a flow-attenuation mechanism specifically aligned to the curation of that math.

This enables two core features. Primarily, it gives direct access to attuned flow matching through deep structure. Secondarily, it allows direct curative control for analysis, using invariants as direct diagnostics.

In other words, debug tools.

The result is a deep and robust capacity for debug analysis, plus the additional capacity to learn and regulate momentum learning from those observer patterns.

GeoLIP Spectral Encoder — Test Manifest

Geometric Primitives for Constellation-Anchored Classification

Target: CIFAR-10 (baseline), then generalize
Constraint: Zero or minimal learned encoder params. All learning in constellation anchors, patchwork, classifier.
Metric: Val accuracy, CV convergence, anchor activation, InfoNCE lock, train/val gap
Baseline to beat: 88.0% (conv encoder + SquaredReLU + full trainer, 1.6M params)
Current best spectral: 46.8% (STFT + Cholesky + SVD, v4, 137K params, CE-only carry)


STATUS KEY

  • [ ] — Not started
  • [R] — Running
  • [X] — Completed
  • [F] — Failed (with reason)
  • [S] — Skipped (with reason)
  • [P] — Partially completed

COMPLETED EXPERIMENTS (prior sessions + this session)

Conv Encoder Baselines (Form 1 Core)

  • Linear baseline, 100 epochs → 67.0%, 422K params, overfits at E31
  • MLP baseline, 100 epochs → 65.0%, 687K params, overfits at E10
  • Core CE-only, 100 epochs → 63.4%, 820K params, CV=0.70, never converges
  • Core CE+CV, 100 epochs → 62.7%, 820K params, CV=0.61, worse than CE-only
  • Core 32 anchors, interrupted E20 → 59.2%, 1.8M params, slow convergence
  • Full trainer GELU, 100 epochs → 88.0%, 1.6M params (original proven result)
  • Full trainer SquaredReLU, 100 epochs → 88.0%, 1.6M params, E96 best

Spectral Encoder Experiments

  • [F] Spectral v1: flat FFT → 768-d → single constellation → collapsed
    • Cause: concat norm √48≈6.93 vs anchor norm 1, not on same sphere
  • [F] Spectral v2: per-band constellation (48×64=3072 anchors) → ~35%
    • Cause: 3072 tri dims too diffuse, InfoNCE dead at 0.45, no cross-band structure
  • [F] Spectral v3: FFT → 8 channels (spherical mean) → 128 anchors → 27%
    • Cause: cos≈0.99, spherical mean collapsed all images to same point
  • [P] Spectral v4: STFT + Cholesky + SVD → S^43 → 64 anchors → 46.8% (still running)
    • CE carrying alone, CosineEmbeddingLoss frozen at 0.346, InfoNCE dead at 0.15
    • Cholesky+SVD signature IS discriminative, contrastive losses unable to contribute

CATEGORY 1: SIGNAL DECOMPOSITION TO GEOMETRY

1.1 Wavelet Scattering Transform (Mallat)

Formula: S_J[p]x(u) = |||x ∗ ψ_{λ₁}| ∗ ψ_{λ₂}| … | ∗ φ_{2^J}(u)
Library: kymatio (pip install kymatio)
GitHub: https://github.com/kymatio/kymatio
Expected output: ~10K-dim feature vector for 32×32
Literature baseline: ~82% CIFAR-10 with SVM, ~70.5% with linear
Properties: Deterministic, Lipschitz-continuous, approximately energy-preserving

  • 1.1a Scattering order 2, J=2, L=8 → L2 normalize → flat constellation on S^d
    • Hypothesis: scattering features are rich enough that flat constellation should work
    • Compare: direct linear classifier on scattering vs constellation pipeline
  • 1.1b Scattering → JL projection to S^127 → constellation (64 anchors)
    • JL preserves distances; S^127 matches our proven dim
  • 1.1c Scattering → JL → S^43 → Cholesky/SVD signature → constellation
    • Stack v4's geometric signature on top of scattering features
  • 1.1d Scattering order 1 vs order 2 ablation
    • Order 1 is ~Gabor magnitude; order 2 adds inter-frequency structure
  • 1.1e Scattering + InfoNCE: does augmentation invariance help or hurt?
    • Scattering is already translation-invariant; InfoNCE may be redundant
  • 1.1f Scattering hybrid: scattering front-end + lightweight learned projection + constellation
    • Test minimal learned params needed to bridge the 82→88% gap
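As a sanity check on the "~10K-dim" estimate above, the standard scattering path count can be worked out directly. This is a sketch of the textbook count; kymatio's exact output size may differ slightly depending on path pruning.

```python
# Count scattering coefficients for a 32x32 input with J=2 scales,
# L=8 orientations (the 1.1a configuration), up to order 2.
def scattering_output_size(J=2, L=8, side=32, channels=3):
    order0 = 1
    order1 = J * L                       # one path per (scale, orientation)
    order2 = (J * (J - 1) // 2) * L**2   # second scale must be coarser than the first
    paths = order0 + order1 + order2
    spatial = (side // 2**J) ** 2        # final low-pass is downsampled by 2^J
    return channels * paths * spatial

print(scattering_output_size())             # RGB: 3 * 81 * 64 = 15552
print(scattering_output_size(channels=1))   # grayscale: 81 * 64 = 5184
```

So the "~10K-dim" figure sits between the grayscale (5,184) and RGB (15,552) counts for this configuration.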

1.2 Gabor Filter Banks

Formula: g(x,y) = exp(−(x′² + γ²y′²)/(2σ²)) · exp(i(2πx′/λ + ψ))
Expected: S scales × K orientations → S×K magnitude responses
Properties: Deterministic, O(N·S·K), first-order scattering ≈ Gabor modulus

  • 1.2a Gabor bank (4 scales × 8 orientations = 32 filters) → L2 norm → S^31
    • Each filter response is a spatial map; pool to scalar per filter
  • 1.2b Gabor → per-filter spatial statistics (mean, std, skew, kurtosis) → S^127
    • 32 filters × 4 stats = 128-d, matches conv encoder output dim
  • 1.2c Gabor vs scattering order 1 A/B test
    • Validate that scattering order 1 ≈ Gabor + modulus
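A minimal numpy sketch of 1.2a, assuming mean-pooled magnitude responses; the filter scales and kernel size here are illustrative choices, not tuned values.

```python
import numpy as np

def gabor_kernel(sigma, theta, lam, gamma=0.5, psi=0.0, size=15):
    """Complex Gabor kernel: Gaussian envelope times complex sinusoid."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    xp = x * np.cos(theta) + y * np.sin(theta)
    yp = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(xp**2 + gamma**2 * yp**2) / (2 * sigma**2))
    return envelope * np.exp(1j * (2 * np.pi * xp / lam + psi))

def gabor_bank_features(img, scales=(1.5, 2.5, 4.0, 6.0), n_orient=8):
    """4 scales x 8 orientations -> 32 pooled magnitudes -> unit vector on S^31."""
    feats = []
    for sigma in scales:
        for k in range(n_orient):
            g = gabor_kernel(sigma, theta=np.pi * k / n_orient, lam=2 * sigma)
            # FFT convolution, magnitude response, mean-pool to one scalar
            F = np.fft.fft2(img, s=img.shape) * np.fft.fft2(g, s=img.shape)
            feats.append(np.abs(np.fft.ifft2(F)).mean())
    v = np.array(feats)
    return v / np.linalg.norm(v)

img = np.random.rand(32, 32)
v = gabor_bank_features(img)   # 32-d point on S^31
```

For 1.2b, swap the mean-pool for (mean, std, skew, kurtosis) over each magnitude map to get 128-d.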

1.3 Radon Transform

Formula: Rf(ω,t) = ∫ f(x) δ(x·ω − t) dx
Properties: Deterministic, exactly invertible via filtered back-projection

  • 1.3a Radon at K angles → sinogram → L2 norm per angle → K points on S^d
    • K angles = K geometric addresses, constellation measures the cloud
  • 1.3b Radon → 1D wavelet per projection (= ridgelet) → aggregate to S^d
    • Composition: Radon → Ridgelet, captures linear singularities
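A rough sketch of 1.3a, approximating the Radon transform by image rotation plus axis sums; using scipy.ndimage.rotate is an assumption about available tooling, and a dedicated Radon routine would be more accurate.

```python
import numpy as np
from scipy.ndimage import rotate

def radon_points(img, n_angles=16):
    """Radon at K angles -> one L2-normalized projection per angle (K points on S^d)."""
    points = []
    for angle in np.linspace(0.0, 180.0, n_angles, endpoint=False):
        rotated = rotate(img, angle, reshape=False, order=1)
        proj = rotated.sum(axis=0)            # line integrals at this angle
        points.append(proj / np.linalg.norm(proj))
    return np.stack(points)                   # (K, width), each row on S^{width-1}

cloud = radon_points(np.random.rand(32, 32), n_angles=16)
```

The constellation then measures this K-point cloud instead of a single embedding.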

1.4 Curvelet Transform

Formula: c_{j,l,k} = ⟨f, φ_{j,l,k}⟩, parabolic scaling: width ≈ length²
Properties: Deterministic, exactly invertible (tight frame), O(N² log N)

  • 1.4a Curvelet energy per (scale, orientation) band → L2 norm → S^d
    • Captures directional frequency that scattering misses
  • 1.4b Curvelet + scattering concatenation → JL → constellation
    • Test complementarity of isotropic (scattering) + anisotropic (curvelet) features

1.5 Persistent Homology (TDA)

Formula: Track birth/death of β₀ (components), β₁ (loops) across a filtration
Library: giotto-tda or ripser
Properties: Deterministic, O(n³), captures topology no other transform sees

  • 1.5a Sublevel set filtration on grayscale → persistence image → L2 norm → S^d
  • 1.5b PH on scattering feature maps (topology of the representation)
    • Captures whether scattering features form clusters, loops, voids
  • 1.5c PH Betti curve as additional channel in multi-signature pipeline
  • 1.5d PH standalone classification baseline on CIFAR-10
    • Literature suggests ~60-70% standalone; valuable as complementary signal

1.6 STFT Variants (improving v4)

  • 1.6a 2D STFT via patch-wise FFT (overlapping patches) instead of row/col STFT
    • True spatial-frequency decomposition vs row+col approximation
  • 1.6b STFT with larger n_fft=32 (current: 16) → more frequency resolution
  • 1.6c STFT preserving phase (not just magnitude) via analytic signal
    • Phase encodes spatial structure; current pipeline discards it
  • 1.6d Multi-window STFT (different window sizes for different frequency ranges)

CATEGORY 2: MANIFOLD STRUCTURES

2.1 Hopf Fibration

Formula: h(z₁,z₂) = (2z̄₁z₂, |z₁|²−|z₂|²) : S³ → S²
Properties: Deterministic, O(1), hierarchical (base + fiber)

  • 2.1a Encode 4-d feature vectors on S³ → Hopf project to S² + fiber coordinate
    • Coarse triangulation on S², fine discrimination in fiber
  • 2.1b Quaternionic Hopf S⁷ → S⁴ for 8-d features
    • Natural for 8-channel spectral decomposition (v3/v4 channel count)
  • 2.1c Hopf foliation spherical codes for anchor initialization
    • Replace uniform_hypersphere_init with Hopf-structured codes
  • 2.1d Hierarchical constellation: coarse anchors on base S², fine anchors per fiber
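The Hopf map itself is a one-liner; a numpy sketch of the base + fiber split from 2.1a (the fiber coordinate here is one convention among several):

```python
import numpy as np

def hopf(z1, z2):
    """Hopf map S^3 -> S^2 for (z1, z2) complex with |z1|^2 + |z2|^2 = 1."""
    w = 2.0 * np.conj(z1) * z2
    return np.array([w.real, w.imag, np.abs(z1)**2 - np.abs(z2)**2])

# A 4-d unit feature vector viewed as a point on S^3, i.e. two complex numbers.
v = np.random.randn(4)
v /= np.linalg.norm(v)
z1, z2 = v[0] + 1j * v[1], v[2] + 1j * v[3]
base = hopf(z1, z2)                                       # coarse address on S^2
fiber = np.angle(z1) if abs(z1) > 1e-9 else np.angle(z2)  # fine coordinate along the fiber
```

The map is invariant under a common phase on (z₁, z₂), which is exactly the fiber direction the base coordinate discards.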

2.2 Grassmannian Class Representations

Formula: Class = k-dim subspace of ℝⁿ, distances via principal angles
Properties: Requires SVD, O(nk²)

  • 2.2a Replace class vectors with class subspaces on Gr(k,n)
    • Each class owns a k-dim subspace; classification = nearest subspace
    • Literature: +1.3% on ImageNet over single class vectors
  • 2.2b Grassmannian distance metrics ablation: geodesic vs chordal vs projection
  • 2.2c Per-class anchor subspace: each anchor defines a subspace, not a point

2.3 Flag Manifold (Nested Subspace Hierarchy)

Formula: V₁ ⊂ V₂ ⊂ … ⊂ Vₖ, nested subspaces
Properties: Generalizes the Grassmannian, natural for multi-resolution

  • 2.3a Flag decomposition of frequency channels (DC ⊂ low ⊂ mid ⊂ high)
    • Test whether nesting constraint improves spectral encoder
  • 2.3b Flag-structured anchors: coarse-to-fine anchor hierarchy

2.4 Von Mises-Fisher Mixture

Formula: f(x; μ, κ) = C_p(κ) exp(κ μᵀx), soft clustering on S^d
Properties: Natural density model for hyperspherical data

  • 2.4a Replace hard nearest-anchor assignment with vMF soft posteriors
    • p(j|x) = α_j f(x;μ_j,κ_j) / Σ α_k f(x;μ_k,κ_k)
    • Learned κ per anchor = adaptive influence radius
  • 2.4b vMF mixture EM for anchor initialization (replace uniform hypersphere init)
  • 2.4c vMF concentration κ as a diagnostic: track per-class κ convergence
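A numpy sketch of the 2.4a soft posterior. Note the vMF normalizer C_p(κ) is dropped here, which is exact only when all κ_j are equal; with learned per-anchor κ this becomes an approximation.

```python
import numpy as np

def vmf_soft_assign(x, mu, kappa, log_alpha=None):
    """Soft posterior p(j|x) over anchors under a vMF mixture on S^d.

    C_p(kappa) is omitted: exact for shared kappa, approximate otherwise.
    """
    logits = kappa * (mu @ x)          # kappa_j * mu_j^T x
    if log_alpha is not None:
        logits = logits + log_alpha    # optional mixture weights
    logits -= logits.max()             # stabilize the softmax
    p = np.exp(logits)
    return p / p.sum()

# 64 anchors on S^15, one embedding
mu = np.random.randn(64, 16)
mu /= np.linalg.norm(mu, axis=1, keepdims=True)
x = np.random.randn(16); x /= np.linalg.norm(x)
post = vmf_soft_assign(x, mu, kappa=np.full(64, 10.0))
```

With shared κ this reduces to a temperature-scaled softmax over anchor similarities; learned per-anchor κ is what gives each anchor an adaptive influence radius.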

2.5 Optimal Anchor Placement

  • 2.5a E₈ lattice anchors for 8-d constellation (240 maximally separated points)
  • 2.5b Spherical t-design initialization vs uniform hypersphere init
  • 2.5c Thomson problem solver for N anchors on S^d (energy minimization)
    • Compare: QR + iterative repulsion (current) vs Coulomb energy minimization

CATEGORY 3: COMPACT REPRESENTATIONS

3.1 Random Fourier Features

Formula: z(x) = √(2/D) [cos(ω₁ᵀx + b₁), …, cos(ω_Dᵀx + b_D)]
Properties: Pseudo-deterministic, preserves kernel structure, maps to S^d via cos/sin

  • 3.1a RFF on raw pixels → S^d → constellation
    • Baseline: how much does nonlinear kernel approximation help raw pixels?
  • 3.1b RFF on scattering features → constellation
    • Composition: scattering (linear invariants) → RFF (nonlinear kernel)
  • 3.1c Fourier feature positional encoding (Tancik/Mildenhall style)
    • γ(v) = [cos(2πBv), sin(2πBv)]ᵀ explicitly maps to hypersphere
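A minimal numpy sketch of the RFF map, approximating an RBF kernel exp(−γ‖x−y‖²); the feature count and γ are illustrative.

```python
import numpy as np

def rff_features(X, D=512, gamma=1.0, seed=0):
    """Random Fourier features: z(x)^T z(y) approximates exp(-gamma ||x-y||^2)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(d, D))  # omega ~ N(0, 2*gamma*I)
    b = rng.uniform(0, 2 * np.pi, size=D)
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

X = 0.3 * np.random.default_rng(1).normal(size=(2, 8))
Z = rff_features(X, D=4096)
approx = Z[0] @ Z[1]                                   # z(x)^T z(y)
exact = np.exp(-1.0 * np.sum((X[0] - X[1]) ** 2))      # k(x, y)
```

The approximation error shrinks like 1/√D, so D trades compute for kernel fidelity.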

3.2 Johnson-Lindenstrauss Projection

Formula: f(x) = (1/√k)Ax, preserves distances with k = O(ε⁻² log n)
Properties: Pseudo-deterministic, near-isometric

  • 3.2a JL from scattering (~10K) to 128-d → L2 norm → constellation
    • Test: does JL + L2 norm preserve enough structure?
  • 3.2b JL target dimension sweep: 32, 64, 128, 256, 512
    • Find minimum k where constellation accuracy saturates
  • 3.2c Fast JL (randomized Hadamard) vs Gaussian JL speed/accuracy tradeoff
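A numpy sketch of 3.2a (Gaussian JL followed by L2 normalization); the input dimensions are placeholders for scattering-sized vectors.

```python
import numpy as np

def jl_project(X, k=128, seed=0):
    """Gaussian JL: near-isometric projection to k dims, then L2 normalize to S^{k-1}."""
    rng = np.random.default_rng(seed)
    A = rng.normal(size=(X.shape[1], k)) / np.sqrt(k)
    Y = X @ A
    return Y / np.linalg.norm(Y, axis=1, keepdims=True)

X = np.random.default_rng(2).normal(size=(50, 10000))   # scattering-sized vectors
Y = jl_project(X, k=128)
# Pairwise geometry survives up to ~1/sqrt(k) distortion; for near-orthogonal
# high-dim inputs, projected distances stay close to sqrt(2).
```

The 3.2b sweep is then just this function over k ∈ {32, 64, 128, 256, 512}.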

3.3 Compressed Sensing on Scattering Coefficients

Formula: y = Φx, recover via ℓ₁ minimization if x is k-sparse
Properties: Exact recovery for sparse signals, O(k log(N/k)) measurements

  • 3.3a Measure sparsity of scattering coefficients (how many are near-zero?)
    • If sparse: CS can compress much more than JL
  • 3.3b CS measurement matrix → L2 norm → constellation
    • Compare: CS vs JL at same target dimension

3.4 Spherical Harmonics

Formula: Y_l^m(θ,φ), complete basis on S², (l_max+1)² coefficients
Properties: Deterministic, native Fourier basis on the sphere, exactly invertible

  • 3.4a Expand constellation triangulation profile in spherical harmonics
    • Which angular frequencies carry discriminative info?
  • 3.4b Spherical harmonic coefficients of embedding distribution as class signature
  • 3.4c Hyperspherical harmonics for S^15 and S^43 (higher-dim generalization)

CATEGORY 4: INVERTIBLE GEOMETRIC TRANSFORMS

4.1 Stereographic Projection

Formula: σ(x) = x_{1:n}/(1−x_{n+1}), σ⁻¹(y) = (2y, ‖y‖²−1)/(‖y‖²+1)
Properties: Conformal bijection S^n ∖ {pole} ↔ ℝⁿ, preserves angles

  • 4.1a Stereographic → Euclidean scattering → inverse stereographic → S^d
    • Apply scattering in flat space, project back to sphere
  • 4.1b Stereographic projection as constellation readout alternative
    • Instead of triangulation distances, read local coordinates via stereographic
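Both maps in numpy, matching the formulas in 4.1; the round trip is exact away from the north pole.

```python
import numpy as np

def stereographic(x):
    """S^n minus north pole -> R^n: sigma(x) = x[:-1] / (1 - x[-1])."""
    return x[:-1] / (1.0 - x[-1])

def inv_stereographic(y):
    """R^n -> S^n: sigma^{-1}(y) = (2y, ||y||^2 - 1) / (||y||^2 + 1)."""
    s = y @ y
    return np.append(2.0 * y, s - 1.0) / (s + 1.0)

x = np.random.randn(8)
x /= np.linalg.norm(x)
y = stereographic(x)          # flat coordinates for Euclidean processing
x_back = inv_stereographic(y) # round trip recovers x exactly
```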

4.2 Exponential / Logarithmic Maps

Formula: exp_p(v) = cos(‖v‖)·p + sin(‖v‖)·v/‖v‖
Formula: log_p(q) = arccos(⟨q,p⟩) · (q−⟨q,p⟩p)/‖q−⟨q,p⟩p‖
Properties: Deterministic, locally invertible, O(n)

  • 4.2a Replace triangulation (1−cos) with log map coordinates at each anchor
    • Log map gives direction + distance in tangent space (richer than scalar distance)
    • Each anchor contributes d-dim tangent vector instead of 1-d distance
  • 4.2b Log map triangulation → parallel transport to common tangent space → aggregate
    • Geometrically principled alternative to patchwork concatenation
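A numpy sketch of the exp/log pair, producing the per-anchor tangent feature proposed in 4.2a:

```python
import numpy as np

def log_map(p, q):
    """Tangent vector at p pointing toward q, with length = geodesic distance."""
    c = np.clip(p @ q, -1.0, 1.0)
    u = q - c * p                      # component of q orthogonal to p
    n = np.linalg.norm(u)
    return np.zeros_like(p) if n < 1e-12 else np.arccos(c) * u / n

def exp_map(p, v):
    """Follow the geodesic from p in direction v for length ||v||."""
    n = np.linalg.norm(v)
    return p if n < 1e-12 else np.cos(n) * p + np.sin(n) * v / n

p = np.random.randn(16); p /= np.linalg.norm(p)   # an anchor
q = np.random.randn(16); q /= np.linalg.norm(q)   # an embedding
v = log_map(p, q)                                 # d-dim tangent feature (4.2a)
q_back = exp_map(p, v)                            # round trip recovers q
```

Each anchor contributes this full tangent vector instead of the single scalar 1−cos distance.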

4.3 Parallel Transport

Formula: Γ^q_p(v) = v − ((⟨v,p⟩+⟨v,q⟩)/(1+⟨p,q⟩))·(p+q) on S^n
Properties: Isometric between tangent spaces, exactly invertible

  • 4.3a Compute log maps at K anchors → parallel transport all to north pole → aggregate
    • Creates a canonical tangent-space representation independent of anchor positions
  • 4.3b Parallel transport as inter-anchor communication in constellation
    • How does the same input look from different anchor tangent spaces?
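A numpy sketch of spherical parallel transport; for tangent v at p the result is tangent at q with the same length, and the formula breaks down only when p and q are antipodal.

```python
import numpy as np

def parallel_transport(v, p, q):
    """Transport tangent vector v from T_p S^n to T_q S^n along the geodesic."""
    denom = 1.0 + p @ q                # undefined when q = -p (antipodal)
    return v - ((v @ p + v @ q) / denom) * (p + q)

rng = np.random.default_rng(0)
p = rng.normal(size=8); p /= np.linalg.norm(p)
q = rng.normal(size=8); q /= np.linalg.norm(q)
v = rng.normal(size=8); v -= (v @ p) * p        # make v tangent at p
w = parallel_transport(v, p, q)
# w is tangent at q with ||w|| = ||v|| (transport is an isometry)
```

For 4.3a, each anchor's log-map output is transported this way to a shared pole before aggregation.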

4.4 Möbius Transformations

Formula: h_ω(z) = ((1−‖ω‖²)/‖z−ω‖²)·(z−ω) − ω
Properties: Conformal automorphism of S^d (for ‖ω‖ < 1), invertible, O(d)

  • 4.4a Möbius "geometric attention": transform sphere to zoom into anchor regions
    • Expand region near anchor, compress far regions
    • Each anchor applies its own Möbius transform before measuring distance
  • 4.4b Composition of Möbius transforms as normalizing flow on S^d
    • Learned flow that warps embedding distribution toward better separation
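A numpy sketch of the Möbius transform for 4.4a; note the (z−ω) factor, without which the expression would not return a point on the sphere.

```python
import numpy as np

def mobius(z, w):
    """Conformal warp of S^d parameterized by w with ||w|| < 1."""
    d = z - w
    return ((1.0 - w @ w) / (d @ d)) * d - w

z = np.random.randn(8); z /= np.linalg.norm(z)   # a point on S^7
w = 0.4 * np.random.randn(8)
w *= min(1.0, 0.9 / np.linalg.norm(w))           # keep ||w|| < 1
out = mobius(z, w)                               # stays on the unit sphere
```

Per-anchor warps of this form expand the region the anchor cares about before the distance is measured.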

4.5 Procrustes + Polar Decomposition

Formula: R* = argmin_R ‖RA−B‖_F = UVᵀ from SVD(BᵀA)
Formula: A = UP (rotation × stretch)

  • 4.5a Procrustes-align channel cloud to canonical pose before Cholesky/SVD
    • Remove rotation variability, isolate shape information
  • 4.5b Polar decomposition of channel matrix: U (rotation) + P (stretch) as separate features
    • U encodes orientation of frequency cloud; P encodes shape/scale
    • Both are geometric, both are deterministic from the channel matrix
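A numpy sketch of 4.5b via SVD, the standard construction of the polar decomposition; the 8×8 matrix stands in for a channel matrix.

```python
import numpy as np

def polar_decompose(A):
    """A = U P with U orthogonal (rotation) and P symmetric PSD (stretch)."""
    Us, s, Vt = np.linalg.svd(A)
    U = Us @ Vt                        # closest orthogonal matrix to A (Procrustes)
    P = Vt.T @ np.diag(s) @ Vt         # shape/scale of the channel cloud
    return U, P

A = np.random.default_rng(3).normal(size=(8, 8))   # e.g. an 8-channel matrix
U, P = polar_decompose(A)
```

U and P are the two separate feature streams: orientation of the frequency cloud and its shape/scale.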

CATEGORY 5: MATRIX DECOMPOSITION SIGNATURES

5.1 Already Tested

  • Cholesky of Gram matrix → 36 lower-tri values (in v4, working)
  • SVD singular values → 8 values (in v4, working)
  • Concatenated 44-d signature on S^43 → 46.8% with CE-only

5.2 Remaining Decompositions

  • 5.2a QR decomposition: Q (rotation) and R diagonal (scale per channel)
    • R diagonal = per-channel magnitude; Q = inter-channel angular structure
  • 5.2b Schur decomposition: T diagonal = eigenvalues, T off-diagonal = coupling
    • For the Gram matrix: Schur gives eigenstructure in triangular form
  • 5.2c Eigendecomposition of Gram: eigenvalues as spectral signature
    • Compare: eigenvalues vs SVD singular values vs Cholesky diagonal
    • These are related but not identical (λ_i = σ_i² for Gram = AᵀA)
  • 5.2d NMF of magnitude spectrum: parts-based decomposition
    • Requires iterative optimization (not fully deterministic)
    • But finds additive, non-negative parts — texture components
  • 5.2e Tucker tensor decomposition of spatial×frequency×channel tensor
    • 3D structure: (H, W, freq_bins) per color channel
    • Core tensor encodes interactions between spatial, frequency, channel modes
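A quick numpy check of the 5.2c relationship between the three decompositions of the same Gram matrix; the dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.normal(size=(64, 8))           # e.g. 64 spatial positions x 8 channels
G = A.T @ A                            # Gram matrix

lam = np.sort(np.linalg.eigvalsh(G))[::-1]     # eigenvalues of G, descending
sigma = np.linalg.svd(A, compute_uv=False)     # singular values of A
L = np.linalg.cholesky(G)                      # lower-triangular factor

# lambda_i(G) = sigma_i(A)^2, while the Cholesky diagonal is a different
# (ordering-dependent) summary of the same Gram geometry.
```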

CATEGORY 6: INFORMATION-THEORETIC LOSSES

6.1 Already Tested

  • InfoNCE (self-contrastive, two augmented views) — dead at 0.15 in spectral v4
  • CosineEmbeddingLoss — frozen at 0.346 (margin-saturated)
  • CV loss (Cayley-Menger volume) — running but not in 0.18-0.25 band

6.2 Loss Modifications

  • 6.2a Drop contrastive losses entirely, CE-only + geometric losses
    • v4 shows CE is the only contributor; contrastive is dead weight
    • Hypothesis: removing dead losses may speed convergence
  • 6.2b Class-conditional InfoNCE: positive = same class, not same image
    • Requires labels but gives much stronger supervision signal
  • 6.2c vMF-based contrastive loss: replace dot-product similarity with vMF log-likelihood
    • κ-adaptive: high-κ for nearby pairs, low-κ for far pairs
  • 6.2d Fisher-Rao distance as loss: d_FR(p,q) = 2·arccos(∫√(pq))
    • Natural distance for distributions on the sphere
  • 6.2e Sliced spherical Wasserstein distance as distribution matching loss
    • Matches embedding distribution to target (e.g., uniform on sphere)
  • 6.2f Geometric autograd (from GM3): tangential projection + separation preservation
    • Adam + geometric autograd > AdamW on geometric tasks (proven)
    • Operates on gradient direction, not loss value

6.3 Anchor Management

  • 6.3a Anchor push frequency sweep: every 10, 25, 50, 100, 200 batches
  • 6.3b Anchor push with vMF-weighted centroids instead of hard class centroids
  • 6.3c Anchor birth/death: add anchors where density is high, remove where unused
  • 6.3d Anchor dropout sweep: 0%, 5%, 15%, 30%, 50%

CATEGORY 7: COMPOSITE PIPELINE TESTS

7.1 The Reference Pipeline (from research article)

  • 7.1a Scattering(J=2,L=8) → JL(128) → L2 norm → constellation(64) → classify
    • The "canonical" pipeline; expected ~75-80% based on literature
  • 7.1b Same as 7.1a but with learned 2-layer projection replacing JL
    • Minimal learned params (~16K), test if projection adaptation matters
  • 7.1c Scattering → curvelet energy → concat → JL → constellation
    • Test complementarity

7.2 Hybrid: Spectral + Scattering

  • 7.2a STFT channels (v4) + scattering features → concat → JL → S^d → constellation
    • STFT gives spatial-frequency; scattering gives multi-scale invariants
  • 7.2b Scattering → Cholesky Gram + SVD signature → constellation
    • Apply v4's geometric signature to scattering output instead of STFT

7.3 Multi-Signature Constellation

  • 7.3a Parallel extraction: scattering + Gabor + Radon → separate constellations → fusion
    • Each primitive captures different geometric aspect
    • Fusion: concatenate patchwork outputs → shared classifier
  • 7.3b Hierarchical constellation: scattering → coarse anchors → residual → fine anchors
    • Two-stage: first stage identifies broad category, second refines

7.4 Minimal Learned Params Tests

  • 7.4a Best deterministic pipeline + 1 learned linear layer (d_in → 128) before constellation
    • Measure: how much does a single projection layer help?
    • Count: exact learned param count
  • 7.4b Same as 7.4a but with SquaredReLU + LayerNorm (the proven patchwork block)
  • 7.4c Sweep learned projection sizes: 0, 1K, 5K, 10K, 50K, 100K params
    • Find the elbow where adding params stops helping

PRIORITY QUEUE (recommended execution order)

Tier 1: Highest Expected Impact

  1. 1.1a — Scattering + flat constellation (the literature leader)
  2. 1.1b — Scattering + JL → S^127 + constellation
  3. 6.2a — Drop dead contrastive losses from v4, measure CE-only ceiling
  4. 2.4a — vMF soft assignment replacing hard nearest-anchor
  5. 4.2a — Log map triangulation (richer than scalar distance)

Tier 2: High Expected Impact

  1. 7.1a — Full reference pipeline
  2. 1.1f — Scattering hybrid with minimal learned projection
  3. 1.2b — Gabor spatial statistics → S^127
  4. 5.2c — Eigendecomposition vs SVD vs Cholesky ablation
  5. 2.1b — Quaternionic Hopf S⁷→S⁴ for 8-channel data

Tier 3: Exploratory

  1. 1.5a — Persistent homology standalone
  2. 3.1b — RFF on scattering features
  3. 4.4a — Möbius geometric attention
  4. 7.3a — Multi-signature parallel constellations
  5. 2.2a — Grassmannian class subspaces

Tier 4: Deep Exploration

  1. 1.3a — Radon cloud on S^d
  2. 1.4b — Curvelet + scattering concat
  3. 2.3a — Flag decomposition of frequency channels
  4. 4.3a — Parallel transport aggregation
  5. 3.4c — Hyperspherical harmonics analysis

RUNNING SCOREBOARD

| Experiment | Val Acc | Params (learned) | CV | Anchors Active | InfoNCE | Key Finding |
|---|---|---|---|---|---|---|
| Linear baseline | 67.0% | 423K | | | | Overfits E31 |
| MLP baseline | 65.0% | 687K | | | | Overfits E10 |
| Core CE-only | 63.4% | 820K | 0.70 | | | CV never converges |
| Core CE+CV | 62.7% | 820K | 0.61 | | | CV hurts accuracy |
| Full GELU | 88.0% | 1.6M | 0.14-0.17 | 64/64 | 1.00 | Reference |
| Full SquaredReLU | 88.0% | 1.6M | 0.15 | 64/64 | 1.00 | Matches GELU |
| Spectral v1 (flat FFT) | FAIL | | | 1/64 | | Norm mismatch |
| Spectral v2 (per-band) | ~35% | 1.2M | 0.17-0.19 | 900/3072 | 0.45 | Too diffuse |
| Spectral v3 (sph mean) | ~27% | 130K | 0.27-0.34 | 110/128 | 0.35 | Collapsed to point |
| Spectral v4 (STFT+Chol+SVD) | 46.8% | 137K | 0.52-0.66 | 53/64 | 0.15 | CE-only carry |
| *Scattering baseline* | *~82%* | *0* | | | | *Literature (SVM)* |

Italicized entries are literature values, not our runs


NOTES & INSIGHTS

Why contrastive losses die on deterministic encoders

The STFT/FFT faithfully reports every pixel-level difference between augmented views. Two crops of the same image produce signatures as different as two different images. Without a learned layer to absorb augmentation variance, InfoNCE has nothing to align. Solutions: (a) augmentation-invariant features (scattering), (b) thin learned projection, (c) class-conditional contrastive (6.2b), (d) drop contrastive entirely (6.2a).

The Cholesky insight

L diagonal encodes "new angular information per tier given all lower tiers." This IS discriminative (proved by v4 reaching 46.8% with CE alone). The 44-d signature on S^43 carries real inter-channel geometry. Next question: is the STFT front-end the bottleneck, or the 44-d signature?

Scattering is the clear next step

82% on CIFAR-10 with zero learned params (literature) vs our 46.8%. Scattering is translation-invariant AND deformation-stable (Lipschitz). This directly addresses the augmentation sensitivity problem. kymatio provides GPU-accelerated PyTorch implementation.

The dimension question

  • S^15 (band_dim=16) vs S^43 (signature) vs S^127 (conv encoder output)
  • E₈ lattice gives 240 optimal anchors on S^7
  • Proven CV attractor at ~0.20 is on S^15
  • Need to test which target sphere dimension is optimal for spectral features


Last updated: 2026-03-18, session with Opus Next: run scattering baseline (1.1a), then decide pipeline direction