---
license: mit
---
# New Paradigm: Hypersphere Encoding Observer
This... may not work, but it's an interesting thought experiment.
https://en.wikipedia.org/wiki/Plasmoid
https://en.wikipedia.org/wiki/Field-reversed_configuration
Let's hypothesize for a second.
WHAT HAPPENS when we invert the field with an adjacent large-bodied resonant structure?
I have many theories as to why certain universal elements behave the way they do and why others do not, and the plasmoid system is intriguing in this instance.
We are training encodings in these spectrums predominantly through magnitude flow. Now, spherical alignment doesn't allow JUST magnitude flow
to behave within the constraints. However, we're hardware-bound to a reasonable set of expectations for reasonable utilization capacity. Otherwise,
I'd be hunting for spectral alignment substructures based on atomic spectrum assessments to find the exact optimization spectrography for
an Adam replacement.
In any case, this reminds me of one of my earlier experiments: the induction/excitation/repulsion/resonance structure.
Physics and energy transfer in a void is complex interaction. VERY complex. So complex that I couldn't begin to represent the actual required
valuations for just measuring a grape's electromagnetic (neural magnetic?) field.
My hypothesis is this: waveform resonance is not just analyzable, but can be analyzed within a REASONABLE spectrum. Scatterpoint2D is a fair
approximation of a wavelet attempt at this, and the results are... quite good. The geometric induction eats the information up like candy.
EVEN SO, the standard conv stack defeats it with prelearned statistics. Attach conv to scatterpoint and you get better results, but
you're still operating conv on an encoding spectrum. Back to square one in that case.
## The multispectral resonance
Let's pretend for a second that I know what I'm talking about, which I don't half of the time thanks to Schrödinger (thanks, by the way).
Say, hypothetically, we begin with 5 spheres. Each sphere is completely identical, and yet each is issued a different waveform resonance point. Theoretically,
this is more than enough information, if the correct geometric alignments are formatted from the inputted pixel data, to represent any pixel space
in any spectrum on any plane, by simply mapping to differentiations between these 5 spheres.
Yet there's something faulty here. Something in the mix causes causal differentiation to accumulate, faults to form, overlap to blend, internal mechanisms
to fail, and boom: you get a shared average. Schrödinger rears his head yet again. Run it 500 times and you'll always see 20%, due to the very nature of random selection
and the law of averages. If you leave it without a task, it drifts in whichever direction happens to occur. There is no direction, so the math goes where it naturally goes.
WHY? Is it the data? Is it the representation space? Is it the latent control?
## The data
The data, no matter how small, can be represented in one form or another, whether through differential equations, similarity assessments, or something else.
It can be noise, images, counting potatoes; it doesn't matter. If you want to predict SOMETHING with SOMETHING ELSE, you can represent it somehow.
## The features
If the features themselves are faulty, the image should still be capable of being represented within a reasonable spectrum of differentiation, without a huge barrage of statistics to calculate them.
The degree of variance is high enough, and yet tight enough, to prevent causal corruption.
## The control
These systems show superior latent control and yet STILL provide little utility to latents extracted DIRECTLY in the math of the pathway,
so it's not that, at least not in the shallow sense.
## The most likely probability
**Indirect utility yields INDIRECT RESULTS.**
The math isn't wrong. The math is more often TOO CORRECT for the WRONG REASONS. AI needs to be able to predict. If you predict the math, you get the math.
If you predict the task, you get the task. If there is no direct causal relation, you're wasting compute. Simple.
If there is no geometric structure... There is nothing to grab onto, so very little information.
The end result is: no matter how I solve X, if I decompose the solution TOO FAR, I'll end up with a less useful value.
5 means little when you decompose it from 250,000,000 pixels if the mechanism can't correctly represent the attenuation needed to statistically accumulate that 5.
Transformers are good at this, because they capture and represent. However, statistically, transformers destroy geometric specificity for generic utility.
Even if you CAN represent that accumulation in a valid fashion, the structural undertones of the system will still be learning 5 SOMEWHERE in a statistics-accumulation fashion.
# I propose the OBSERVER wrap the entire structure.
From stem to stern. Observe what the model sees stage by stage, watch the layers, watch the output, synthesize correct responses to that information.
Directly encoding information in conjunctive relation to the model itself causes a huge series of discontinuities.
It works with post-training memory. It works with a series of high-yield experiments based on specifics.
**YET, when trained simultaneously while ACCEPTING THE OUTPUT OF AN ENCODER without the process of accumulation, the observer faults.**
I propose that the observer must see everything, similar to how David diffusion saw multiple layers within a structure, seeing the legitimate logit output of
every single stage of a model throughout the cycle.
I believe, if the observer sees everything, the analysis will not fail.
That is my next direction.
# Baseline Sweep Complete
Based on the results, I require a unique waveform variation of cross-entropy that directly aligns with the BCE from the learned hypersphere.
I believe, based on the CV, that a gate can be formatted and tuned to the task, but this will not solve the underlying statistics collapse if the head doesn't match.
FlowMagnitude is powerful for Conv and statistics processing, but pure geometric features in the methods tested did not fit the spectrum directly.
Adjacently, I believe I had an insight during a dream. Statistically, we need more information. Simply put, there is not enough represented and useful information
to capture with a Conv without a full transformer spectrum, hence why transformers solve many problems that brute-force CONV stacks simply fail at.
The spectral analysis shows that MOST of the primary spectra can be learned in part, and each requires its own format of stack, which I can handle by
building a full representative spectrum of consumer-capable transformers and conv stacks.
The oddity of geometric structure being fully encoded BY the transformer does not destroy the underlying spectrum, if the formula FED INTO the transformer
represents the necessary statistical data that can be consumed by the geometric head. The format changes a little, but not by much, since the geometric observer
system in its current format is meant to be an observer that provides influence, not a direct controller, YET.
## Anchors performed perfectly to specification
![image](https://cdn-uploads.huggingface.co/production/uploads/630cf55b15433862cfc9556f/BCb-aOqBxHA26QwVK94mS.png)
The entire spectrum showed the anchors began roughly rounded and blobby, slowly forming and building magnitude rigidity with hypersphere association.
Absolutely perfect. The BCE worked perfectly. I'm floored at how well that element worked, even if it's not perfect yet. The math is aligned.
## SVD 3x3 kernel optimization
The 3x3 SVD kernel has shown real potential for use.
It's not as stable as the eigendecomposition path, but it's most definitely powerful. I'll be working to stabilize the variants using fp64 and a rounding spectrum over time.
With the variation come limitations; however, this variation tested 15000x faster than torch.linalg.svd for 3x3 kernels.
It houses a series of buffers optimized through Triton, with a fundamentally different approach to SVD, more similar to the eigendecomposition path.
This does not make SVD any more accurate with respect to the required numeric stability, which means it must be consumed and processed correctly.
With the initial sweeps complete, I can conclude one primary element: 800k params is not enough for CIFAR-10,
so I'll be adjusting the entire notebook spectrum.
The results will be better sorted for the next sweep; the current sweep doesn't have easy dropdown/hide settings in the GUI. The next one should be easy to see and use.
## Actionable Utility
Almost all tested shapes have a potential to teach the system for tasks.
The large array of math will require a streamlined series of sweeps to run in a very optimal environment.
Due to the lack of expensive hardware at my disposal, I have to take drastic steps for this.
## The Expert-Tuning Solution
So, I won't TRAIN the models using a pair of experts. However, I can TUNE the settings based on the most likely alignment-cascade
capacity that the two experts can enable simultaneously with the current build.
In a sense, the experts will say which settings are most likely to be optimal, by making a quick soup.
This should provide the yields I require, assuming I pick experts with relationally similar math. So... parameter-narrowing soup for now;
eventually the system should be able to directly self-attenuate toward the best suggested parameters at the get-go.
The models themselves for this experiment set will never be trained by the experts; only the params are selected by what is most likely.
The models will never see an expert opinion directly, nor will they be given gradients from anything expert-related. Everything in a vacuum.
## Flows, Routes, Patterns, Trajectories, Magnitudes, Etc
Everything mathematical will have a represented flow-attenuation mechanism specifically aligned to the curation of that math.
This will enable two core features: primarily, access to directly attuned flow matching through deep structure; secondarily,
direct curative control for analysis utilizing invariants in direct diagnostics.
In other words, debug tools.
This will result in a very deep and robust capacity for debug analysis, as well as additional capacity to learn and regulate
momentum learning from those observer patterns.
# GeoLIP Spectral Encoder — Test Manifest
## Geometric Primitives for Constellation-Anchored Classification
**Target**: CIFAR-10 (baseline), then generalize
**Constraint**: Zero or minimal learned encoder params. All learning in constellation anchors, patchwork, classifier.
**Metric**: Val accuracy, CV convergence, anchor activation, InfoNCE lock, train/val gap
**Baseline to beat**: 88.0% (conv encoder + SquaredReLU + full trainer, 1.6M params)
**Current best spectral**: 46.8% (STFT + Cholesky + SVD, v4, 137K params, CE-only carry)
---
## STATUS KEY
- `[ ]` — Not started
- `[R]` — Running
- `[X]` — Completed
- `[F]` — Failed (with reason)
- `[S]` — Skipped (with reason)
- `[P]` — Partially completed
---
## COMPLETED EXPERIMENTS (prior sessions + this session)
### Conv Encoder Baselines (Form 1 Core)
- [X] Linear baseline, 100 epochs → **67.0%**, 422K params, overfits at E31
- [X] MLP baseline, 100 epochs → **65.0%**, 687K params, overfits at E10
- [X] Core CE-only, 100 epochs → **63.4%**, 820K params, CV=0.70, never converges
- [X] Core CE+CV, 100 epochs → **62.7%**, 820K params, CV=0.61, worse than CE-only
- [X] Core 32 anchors, interrupted E20 → **59.2%**, 1.8M params, slow convergence
- [X] Full trainer GELU, 100 epochs → **88.0%**, 1.6M params (original proven result)
- [X] Full trainer SquaredReLU, 100 epochs → **88.0%**, 1.6M params, E96 best
### Spectral Encoder Experiments
- [F] Spectral v1: flat FFT → 768-d → single constellation → **collapsed**
- Cause: concat norm √48≈6.93 vs anchor norm 1, not on same sphere
- [F] Spectral v2: per-band constellation (48×64=3072 anchors) → **~35%**
- Cause: 3072 tri dims too diffuse, InfoNCE dead at 0.45, no cross-band structure
- [F] Spectral v3: FFT → 8 channels (spherical mean) → 128 anchors → **27%**
- Cause: cos≈0.99, spherical mean collapsed all images to same point
- [P] Spectral v4: STFT + Cholesky + SVD → S^43 → 64 anchors → **46.8%** (still running)
- CE carrying alone, CosineEmbeddingLoss frozen at 0.346, InfoNCE dead at 0.15
- Cholesky+SVD signature IS discriminative, contrastive losses unable to contribute
---
## CATEGORY 1: SIGNAL DECOMPOSITION TO GEOMETRY
### 1.1 Wavelet Scattering Transform (Mallat)
**Formula**: S_J[p]x(u) = |||x * ψ_{λ₁}| * ψ_{λ₂}| ... | * φ_{2^J}(u)
**Library**: kymatio (pip install kymatio)
**Github**: https://github.com/kymatio/kymatio
**Expected output**: ~10K-dim feature vector for 32×32
**Literature baseline**: ~82% CIFAR-10 with SVM, ~70.5% with linear
**Properties**: Deterministic, Lipschitz-continuous, approximately energy-preserving
- [ ] **1.1a** Scattering order 2, J=2, L=8 → L2 normalize → flat constellation on S^d
- Hypothesis: scattering features are rich enough that flat constellation should work
- Compare: direct linear classifier on scattering vs constellation pipeline
- [ ] **1.1b** Scattering → JL projection to S^127 → constellation (64 anchors)
- JL preserves distances; S^127 matches our proven dim
- [ ] **1.1c** Scattering → JL → S^43 → Cholesky/SVD signature → constellation
- Stack v4's geometric signature on top of scattering features
- [ ] **1.1d** Scattering order 1 vs order 2 ablation
- Order 1 is ~Gabor magnitude; order 2 adds inter-frequency structure
- [ ] **1.1e** Scattering + InfoNCE: does augmentation invariance help or hurt?
- Scattering is already translation-invariant; InfoNCE may be redundant
- [ ] **1.1f** Scattering hybrid: scattering front-end + lightweight learned projection + constellation
- Test minimal learned params needed to bridge the 82→88% gap
### 1.2 Gabor Filter Banks
**Formula**: g(x,y) = exp(−(x'²+γ²y'²)/(2σ²)) · exp(i(2πx'/λ+ψ))
**Expected**: S scales × K orientations → S×K magnitude responses
**Properties**: Deterministic, O(N·S·K), first-order scattering ≈ Gabor modulus
- [ ] **1.2a** Gabor bank (4 scales × 8 orientations = 32 filters) → L2 norm → S^31
- Each filter response is a spatial map; pool to scalar per filter
- [ ] **1.2b** Gabor → per-filter spatial statistics (mean, std, skew, kurtosis) → S^127
- 32 filters × 4 stats = 128-d, matches conv encoder output dim
- [ ] **1.2c** Gabor vs scattering order 1 A/B test
- Validate that scattering order 1 ≈ Gabor + modulus
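Since the Gabor bank is fully deterministic, 1.2a/1.2b can be prototyped in a few lines. A minimal numpy sketch (kernel size, scales, wavelengths, and the stat pooling here are illustrative choices, not the tested configuration): build 4×8 real Gabor kernels, take magnitude responses via FFT convolution, pool four spatial stats per filter, and L2-normalize the 128-d vector onto S^127.

```python
import numpy as np

def gabor_kernel(size, sigma, theta, lam, gamma=0.5, psi=0.0):
    """Real part of g(x,y) = exp(-(x'^2 + gamma^2 y'^2)/(2 sigma^2)) * cos(2 pi x'/lam + psi)."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xp = x * np.cos(theta) + y * np.sin(theta)
    yp = -x * np.sin(theta) + y * np.cos(theta)
    return np.exp(-(xp**2 + gamma**2 * yp**2) / (2 * sigma**2)) * np.cos(2 * np.pi * xp / lam + psi)

def gabor_features(img, scales=(2, 3, 4, 5), n_orient=8):
    """32 magnitude responses -> 4 spatial stats each -> 128-d -> L2 norm (a point on S^127)."""
    H, W = img.shape
    feats = []
    for s in scales:
        for k in range(n_orient):
            g = gabor_kernel(9, sigma=s, theta=k * np.pi / n_orient, lam=2.0 * s)
            # FFT-based circular convolution; keep the magnitude response
            resp = np.abs(np.fft.ifft2(np.fft.fft2(img) * np.fft.fft2(g, s=(H, W))))
            m, sd = resp.mean(), resp.std() + 1e-8
            z = (resp - m) / sd
            feats += [m, sd, (z**3).mean(), (z**4).mean()]  # mean, std, skew, kurtosis
    v = np.asarray(feats)
    return v / (np.linalg.norm(v) + 1e-12)

rng = np.random.default_rng(0)
f = gabor_features(rng.standard_normal((32, 32)))
```

The 4 stats × 32 filters = 128 dims line up with the conv encoder output dim mentioned in 1.2b.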
### 1.3 Radon Transform
**Formula**: Rf(ω,t) = ∫ f(x) δ(x·ω − t) dx
**Properties**: Deterministic, exactly invertible via filtered back-projection
- [ ] **1.3a** Radon at K angles → sinogram → L2 norm per angle → K points on S^d
- K angles = K geometric addresses, constellation measures the cloud
- [ ] **1.3b** Radon → 1D wavelet per projection (= ridgelet) → aggregate to S^d
- Composition: Radon → Ridgelet, captures linear singularities
### 1.4 Curvelet Transform
**Formula**: c_{j,l,k} = ⟨f, φ_{j,l,k}⟩, parabolic scaling: width ≈ length²
**Properties**: Deterministic, exactly invertible (tight frame), O(N² log N)
- [ ] **1.4a** Curvelet energy per (scale, orientation) band → L2 norm → S^d
- Captures directional frequency that scattering misses
- [ ] **1.4b** Curvelet + scattering concatenation → JL → constellation
- Test complementarity of isotropic (scattering) + anisotropic (curvelet) features
- Test complementarity of isotropic (scattering) + anisotropic (curvelet) features
### 1.5 Persistent Homology (TDA)
**Formula**: Track birth/death of β₀ (components), β₁ (loops) across filtration
**Library**: giotto-tda or ripser
**Properties**: Deterministic, O(n³), captures topology no other transform sees
- [ ] **1.5a** Sublevel set filtration on grayscale → persistence image → L2 norm → S^d
- [ ] **1.5b** PH on scattering feature maps (topology of the representation)
- Captures whether scattering features form clusters, loops, voids
- [ ] **1.5c** PH Betti curve as additional channel in multi-signature pipeline
- [ ] **1.5d** PH standalone classification baseline on CIFAR-10
- Literature suggests ~60-70% standalone; valuable as complementary signal
### 1.6 STFT Variants (improving v4)
- [ ] **1.6a** 2D STFT via patch-wise FFT (overlapping patches) instead of row/col STFT
- True spatial-frequency decomposition vs row+col approximation
- [ ] **1.6b** STFT with larger n_fft=32 (current: 16) → more frequency resolution
- [ ] **1.6c** STFT preserving phase (not just magnitude) via analytic signal
- Phase encodes spatial structure; current pipeline discards it
- [ ] **1.6d** Multi-window STFT (different window sizes for different frequency ranges)
---
## CATEGORY 2: MANIFOLD STRUCTURES
### 2.1 Hopf Fibration
**Formula**: h(z₁,z₂) = (2z̄₁z₂, |z₁|²−|z₂|²) : S³ → S²
**Properties**: Deterministic, O(1), hierarchical (base + fiber)
- [ ] **2.1a** Encode 4-d feature vectors on S³ → Hopf project to S² + fiber coordinate
- Coarse triangulation on S², fine discrimination in fiber
- [ ] **2.1b** Quaternionic Hopf S⁷ → S⁴ for 8-d features
- Natural for 8-channel spectral decomposition (v3/v4 channel count)
- [ ] **2.1c** Hopf foliation spherical codes for anchor initialization
- Replace uniform_hypersphere_init with Hopf-structured codes
- [ ] **2.1d** Hierarchical constellation: coarse anchors on base S², fine anchors per fiber
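The Hopf map itself is a one-liner, and the base/fiber split behind 2.1a can be checked numerically: multiplying both complex components by a common phase e^{it} moves along the fiber but leaves the base point on S² fixed. A minimal numpy sketch:

```python
import numpy as np

def hopf_map(q):
    """h(z1, z2) = (2 conj(z1) z2, |z1|^2 - |z2|^2): unit (z1, z2) in C^2 (= S^3) -> S^2 in R^3."""
    z1, z2 = q
    w = 2.0 * np.conj(z1) * z2
    return np.array([w.real, w.imag, abs(z1)**2 - abs(z2)**2])

rng = np.random.default_rng(0)
v = rng.standard_normal(4)
v /= np.linalg.norm(v)                            # a point on S^3
q = np.array([v[0] + 1j * v[1], v[2] + 1j * v[3]])
p = hopf_map(q)                                   # base point on S^2
p2 = hopf_map(q * np.exp(1j * 0.7))               # move along the fiber: same base point
```

For a 4-d feature, (p, fiber phase) is exactly the coarse-base + fine-fiber coordinate pair that 2.1a proposes.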
### 2.2 Grassmannian Class Representations
**Formula**: Class = k-dim subspace of ℝⁿ, distances via principal angles
**Properties**: Requires SVD, O(nk²)
- [ ] **2.2a** Replace class vectors with class subspaces on Gr(k,n)
- Each class owns a k-dim subspace; classification = nearest subspace
- Literature: +1.3% on ImageNet over single class vectors
- [ ] **2.2b** Grassmannian distance metrics ablation: geodesic vs chordal vs projection
- [ ] **2.2c** Per-class anchor subspace: each anchor defines a subspace, not a point
### 2.3 Flag Manifold (Nested Subspace Hierarchy)
**Formula**: V₁ ⊂ V₂ ⊂ ... ⊂ Vₖ, nested subspaces
**Properties**: Generalizes Grassmannian, natural for multi-resolution
- [ ] **2.3a** Flag decomposition of frequency channels (DC ⊂ low ⊂ mid ⊂ high)
- Test whether nesting constraint improves spectral encoder
- [ ] **2.3b** Flag-structured anchors: coarse-to-fine anchor hierarchy
### 2.4 Von Mises-Fisher Mixture
**Formula**: f(x; μ, κ) = C_p(κ) exp(κ μᵀx), soft clustering on S^d
**Properties**: Natural density model for hyperspherical data
- [ ] **2.4a** Replace hard nearest-anchor assignment with vMF soft posteriors
- p(j|x) = α_j f(x;μ_j,κ_j) / Σ α_k f(x;μ_k,κ_k)
- Learned κ per anchor = adaptive influence radius
- [ ] **2.4b** vMF mixture EM for anchor initialization (replace uniform hypersphere init)
- [ ] **2.4c** vMF concentration κ as a diagnostic: track per-class κ convergence
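The soft posterior in 2.4a is just a log-space softmax. A minimal numpy sketch (anchor count and κ value are illustrative): one caveat is baked into a comment below — the vMF normalizer C_p(κ_j) is a Bessel-function ratio and is omitted here, which is exact only when κ is shared across anchors; a learned per-anchor κ needs it included.

```python
import numpy as np

def vmf_soft_assign(x, mu, kappa, alpha):
    """Soft anchor posteriors p(j|x) proportional to alpha_j * exp(kappa_j * mu_j . x).
    NOTE: the vMF normalizer C_p(kappa_j) is omitted; that term cancels only
    when kappa is shared across all anchors (a stated simplification)."""
    logits = np.log(alpha) + kappa * (mu @ x)  # (n_anchors,)
    logits -= logits.max()                     # numerical stability before exp
    w = np.exp(logits)
    return w / w.sum()

rng = np.random.default_rng(0)
n, d = 64, 16
mu = rng.standard_normal((n, d))
mu /= np.linalg.norm(mu, axis=1, keepdims=True)    # 64 anchors on S^15
x = rng.standard_normal(d)
x /= np.linalg.norm(x)
p = vmf_soft_assign(x, mu, kappa=np.full(n, 20.0), alpha=np.full(n, 1.0 / n))
```

As κ grows, p concentrates on the nearest anchor and recovers the current hard assignment; small κ gives the "adaptive influence radius" behavior.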
### 2.5 Optimal Anchor Placement
- [ ] **2.5a** E₈ lattice anchors for 8-d constellation (240 maximally separated points)
- [ ] **2.5b** Spherical t-design initialization vs uniform hypersphere init
- [ ] **2.5c** Thomson problem solver for N anchors on S^d (energy minimization)
- Compare: QR + iterative repulsion (current) vs Coulomb energy minimization
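The Coulomb-energy side of 2.5c can be sketched with projected gradient descent: apply pairwise 1/r² repulsion, then renormalize each anchor back to the sphere. Step count and learning rate below are illustrative, not tuned; this is a sketch of the idea, not the repo's initializer.

```python
import numpy as np

def coulomb_energy(a):
    """Thomson objective: sum of 1/||a_i - a_j|| over all anchor pairs."""
    d = np.linalg.norm(a[:, None] - a[None, :], axis=-1)
    iu = np.triu_indices(len(a), 1)
    return (1.0 / d[iu]).sum()

def thomson_anchors(n, d, steps=200, lr=0.1, seed=0):
    """Spread n anchors on S^(d-1): pairwise repulsion step, then project to the sphere."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((n, d))
    a /= np.linalg.norm(a, axis=1, keepdims=True)
    for _ in range(steps):
        diff = a[:, None, :] - a[None, :, :]               # (n, n, d)
        dist = np.linalg.norm(diff, axis=-1) + np.eye(n)   # diagonal padded; diff there is 0
        force = (diff / dist[..., None] ** 3).sum(axis=1)  # net repulsive force per anchor
        a = a + lr * force
        a /= np.linalg.norm(a, axis=1, keepdims=True)      # back onto the sphere
    return a

init = thomson_anchors(64, 16, steps=0)    # the random init, same seed
opt = thomson_anchors(64, 16, steps=200)
```

Comparing `coulomb_energy(init)` against `coulomb_energy(opt)` is the direct A/B against the current QR + iterative repulsion init.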
---
## CATEGORY 3: COMPACT REPRESENTATIONS
### 3.1 Random Fourier Features
**Formula**: z(x) = √(2/D) [cos(ω₁ᵀx+b₁), ..., cos(ω_Dᵀx+b_D)]
**Properties**: Pseudo-deterministic, preserves kernel structure, maps to S^d via cos/sin
- [ ] **3.1a** RFF on raw pixels → S^d → constellation
- Baseline: how much does nonlinear kernel approximation help raw pixels?
- [ ] **3.1b** RFF on scattering features → constellation
- Composition: scattering (linear invariants) → RFF (nonlinear kernel)
- [ ] **3.1c** Fourier feature positional encoding (Tancik/Mildenhall style)
- γ(v) = [cos(2πBv), sin(2πBv)]ᵀ explicitly maps to hypersphere
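The kernel-preservation claim is easy to sanity-check: with Gaussian frequencies, z(x)·z(y) converges to the RBF kernel exp(−‖x−y‖²/(2σ²)). A minimal numpy sketch (D and σ are illustrative):

```python
import numpy as np

def rff(X, D, sigma, seed=0):
    """z(x) = sqrt(2/D) cos(Wx + b), W ~ N(0, 1/sigma^2), b ~ U[0, 2pi).
    Then z(x) . z(y) approximates exp(-||x - y||^2 / (2 sigma^2))."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((D, X.shape[1])) / sigma
    b = rng.uniform(0.0, 2.0 * np.pi, D)
    return np.sqrt(2.0 / D) * np.cos(X @ W.T + b)

rng = np.random.default_rng(1)
X = rng.standard_normal((2, 8))
Z = rff(X, D=20000, sigma=2.0)
approx = Z[0] @ Z[1]                                            # RFF kernel estimate
exact = np.exp(-np.sum((X[0] - X[1]) ** 2) / (2 * 2.0 ** 2))    # true RBF kernel value
```

Since k(x,x) = 1, each z(x) has norm close to 1 already; an explicit L2 normalize puts it exactly on the sphere before the constellation.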
### 3.2 Johnson-Lindenstrauss Projection
**Formula**: f(x) = (1/√k)Ax, preserves distances with k = O(ε⁻² log n)
**Properties**: Pseudo-deterministic, near-isometric
- [ ] **3.2a** JL from scattering (~10K) to 128-d → L2 norm → constellation
- Test: does JL + L2 norm preserve enough structure?
- [ ] **3.2b** JL target dimension sweep: 32, 64, 128, 256, 512
- Find minimum k where constellation accuracy saturates
- [ ] **3.2c** Fast JL (randomized Hadamard) vs Gaussian JL speed/accuracy tradeoff
### 3.3 Compressed Sensing on Scattering Coefficients
**Formula**: y = Φx, recover via ℓ₁ minimization if x is k-sparse
**Properties**: Exact recovery for sparse signals, O(k log(N/k)) measurements
- [ ] **3.3a** Measure sparsity of scattering coefficients (how many are near-zero?)
- If sparse: CS can compress much more than JL
- [ ] **3.3b** CS measurement matrix → L2 norm → constellation
- Compare: CS vs JL at same target dimension
### 3.4 Spherical Harmonics
**Formula**: Y_l^m(θ,φ), complete basis on S², (l_max+1)² coefficients
**Properties**: Deterministic, native Fourier on sphere, exactly invertible
- [ ] **3.4a** Expand constellation triangulation profile in spherical harmonics
- Which angular frequencies carry discriminative info?
- [ ] **3.4b** Spherical harmonic coefficients of embedding distribution as class signature
- [ ] **3.4c** Hyperspherical harmonics for S^15 and S^43 (higher-dim generalization)
---
## CATEGORY 4: INVERTIBLE GEOMETRIC TRANSFORMS
### 4.1 Stereographic Projection
**Formula**: σ(x) = x_{1:n}/(1−x_{n+1}), σ⁻¹(y) = (2y, ‖y‖²−1)/(‖y‖²+1)
**Properties**: Conformal bijection S^n\{pole} ↔ ℝⁿ, preserves angles
- [ ] **4.1a** Stereographic → Euclidean scattering → inverse stereographic → S^d
- Apply scattering in flat space, project back to sphere
- [ ] **4.1b** Stereographic projection as constellation readout alternative
- Instead of triangulation distances, read local coordinates via stereographic
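Both directions of the stereographic bijection are a few lines each, and the round trip is exact away from the pole. A minimal numpy sketch of the formulas above:

```python
import numpy as np

def stereo(x):
    """sigma(x) = x_{1:n} / (1 - x_{n+1}): S^n minus the north pole -> R^n."""
    return x[:-1] / (1.0 - x[-1])

def stereo_inv(y):
    """sigma^{-1}(y) = (2y, ||y||^2 - 1) / (||y||^2 + 1): any y in R^n -> S^n."""
    n2 = y @ y
    return np.concatenate([2.0 * y, [n2 - 1.0]]) / (n2 + 1.0)

rng = np.random.default_rng(0)
x = rng.standard_normal(9)
x /= np.linalg.norm(x)        # a point on S^8
x_back = stereo_inv(stereo(x))
z = stereo_inv(np.array([0.3, -1.2, 0.5]))   # arbitrary flat point lands on S^3
```

This is the bijection 4.1a leans on: do the scattering work in flat coordinates, then `stereo_inv` puts the result back on the sphere for the constellation.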
### 4.2 Exponential / Logarithmic Maps
**Formula**: exp_p(v) = cos(‖v‖)·p + sin(‖v‖)·v/‖v‖
**Formula**: log_p(q) = arccos(⟨q,p⟩) · (q−⟨q,p⟩p)/‖q−⟨q,p⟩p‖
**Properties**: Deterministic, locally invertible, O(n)
- [ ] **4.2a** Replace triangulation (1−cos) with log map coordinates at each anchor
- Log map gives direction + distance in tangent space (richer than scalar distance)
- Each anchor contributes d-dim tangent vector instead of 1-d distance
- [ ] **4.2b** Log map triangulation → parallel transport to common tangent space → aggregate
- Geometrically principled alternative to patchwork concatenation
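The exp/log pair above inverts exactly (away from antipodes), which is what makes the 4.2a readout lossless per anchor: the tangent vector carries the geodesic distance as its norm and the direction on top of it. A minimal numpy sketch:

```python
import numpy as np

def exp_map(p, v, eps=1e-12):
    """exp_p(v) = cos(||v||) p + sin(||v||) v/||v||, for v tangent at p."""
    t = np.linalg.norm(v)
    if t < eps:
        return p
    return np.cos(t) * p + np.sin(t) * v / t

def log_map(p, q, eps=1e-12):
    """log_p(q) = arccos(<q,p>) * (q - <q,p> p) / ||q - <q,p> p||."""
    c = np.clip(q @ p, -1.0, 1.0)
    u = q - c * p                       # component of q orthogonal to p
    n = np.linalg.norm(u)
    if n < eps:
        return np.zeros_like(p)
    return np.arccos(c) * u / n

rng = np.random.default_rng(0)
p = rng.standard_normal(16); p /= np.linalg.norm(p)   # anchor on S^15
q = rng.standard_normal(16); q /= np.linalg.norm(q)   # embedding on S^15
v = log_map(p, q)                                     # d-dim tangent readout at the anchor
```

`v @ p` is zero (it lives in the tangent space at p), `‖v‖` is the geodesic distance, and `exp_map(p, v)` recovers q exactly.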
### 4.3 Parallel Transport
**Formula**: Γ^q_p(v) = v − ((⟨v,p⟩+⟨v,q⟩)/(1+⟨p,q⟩))·(p+q) on S^n
**Properties**: Isometric between tangent spaces, exactly invertible
- [ ] **4.3a** Compute log maps at K anchors → parallel transport all to north pole → aggregate
- Creates a canonical tangent-space representation independent of anchor positions
- [ ] **4.3b** Parallel transport as inter-anchor communication in constellation
- How does the same input look from different anchor tangent spaces?
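The transport formula's two defining properties, landing in the tangent space at q and preserving norms (isometry), can both be verified numerically. A minimal numpy sketch (for v tangent at p, where ⟨v,p⟩ = 0, the formula reduces to the common v − (⟨v,q⟩/(1+⟨p,q⟩))(p+q) form; it breaks down only for antipodal p, q):

```python
import numpy as np

def transport(p, q, v):
    """Parallel transport of v from T_p S^n to T_q S^n along the connecting geodesic:
    v - ((<v,p> + <v,q>) / (1 + <p,q>)) (p + q). Undefined when q = -p."""
    return v - ((v @ p + v @ q) / (1.0 + p @ q)) * (p + q)

rng = np.random.default_rng(0)
p = rng.standard_normal(8); p /= np.linalg.norm(p)
q = rng.standard_normal(8); q /= np.linalg.norm(q)
v = rng.standard_normal(8)
v -= (v @ p) * p              # make v tangent at p
w = transport(p, q, v)        # same vector, expressed in the tangent space at q
```

This is the primitive behind 4.3a: log maps taken at K different anchors all get transported into one shared tangent space before aggregation.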
### 4.4 Möbius Transformations
**Formula**: h_ω(z) = [(1−‖ω‖²)/‖z−ω‖²](z−ω) − ω
**Properties**: Conformal automorphism of S^d, invertible, O(d)
- [ ] **4.4a** Möbius "geometric attention": transform sphere to zoom into anchor regions
- Expand region near anchor, compress far regions
- Each anchor applies its own Möbius transform before measuring distance
- [ ] **4.4b** Composition of Möbius transforms as normalizing flow on S^d
- Learned flow that warps embedding distribution toward better separation
### 4.5 Procrustes + Polar Decomposition
**Formula**: R* = argmin_R ‖RA−B‖_F = UVᵀ from SVD(BᵀA)
**Formula**: A = UP (rotation × stretch)
- [ ] **4.5a** Procrustes-align channel cloud to canonical pose before Cholesky/SVD
- Remove rotation variability, isolate shape information
- [ ] **4.5b** Polar decomposition of channel matrix: U (rotation) + P (stretch) as separate features
- U encodes orientation of frequency cloud; P encodes shape/scale
- Both are geometric, both are deterministic from the channel matrix
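The polar split in 4.5b falls straight out of one SVD, the same SVD that solves the Procrustes problem above (R* = UVᵀ from SVD(BᵀA)). A minimal numpy sketch: from A = W S Vᵀ, take U = WVᵀ as the (semi-)orthogonal rotation and P = V S Vᵀ as the symmetric PSD stretch.

```python
import numpy as np

def polar(A):
    """Polar decomposition A = U P: U (semi-)orthogonal rotation, P symmetric PSD stretch,
    built from the SVD A = W S V^T as U = W V^T, P = V S V^T."""
    W, S, Vt = np.linalg.svd(A, full_matrices=False)
    U = W @ Vt
    P = Vt.T @ np.diag(S) @ Vt
    return U, P

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 8))   # e.g. an 8-channel matrix from the spectral front-end
U, P = polar(A)
```

U carries the orientation of the channel cloud and P its shape/scale, exactly the two separate feature streams 4.5b proposes; both are deterministic given A.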
---
## CATEGORY 5: MATRIX DECOMPOSITION SIGNATURES
### 5.1 Already Tested
- [X] Cholesky of Gram matrix → 36 lower-tri values (in v4, working)
- [X] SVD singular values → 8 values (in v4, working)
- [X] Concatenated 44-d signature on S^43 → 46.8% with CE-only
### 5.2 Remaining Decompositions
- [ ] **5.2a** QR decomposition: Q (rotation) and R diagonal (scale per channel)
- R diagonal = per-channel magnitude; Q = inter-channel angular structure
- [ ] **5.2b** Schur decomposition: T diagonal = eigenvalues, T off-diagonal = coupling
- For the Gram matrix: Schur gives eigenstructure in triangular form
- [ ] **5.2c** Eigendecomposition of Gram: eigenvalues as spectral signature
- Compare: eigenvalues vs SVD singular values vs Cholesky diagonal
- These are related but not identical (λ_i = σ_i² for Gram = AᵀA)
- [ ] **5.2d** NMF of magnitude spectrum: parts-based decomposition
- Requires iterative optimization (not fully deterministic)
- But finds additive, non-negative parts — texture components
- [ ] **5.2e** Tucker tensor decomposition of spatial×frequency×channel tensor
- 3D structure: (H, W, freq_bins) per color channel
- Core tensor encodes interactions between spatial, frequency, channel modes
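The λ_i = σ_i² relation in 5.2c, and the fact that the Cholesky diagonal is a genuinely different (though determinant-linked) view, can both be demonstrated in a few lines. A minimal numpy sketch with an illustrative 64-position × 8-channel matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((64, 8))     # e.g. 64 spatial positions x 8 channels
G = A.T @ A                          # 8x8 Gram matrix

sigma = np.linalg.svd(A, compute_uv=False)   # singular values of A, descending
lam = np.linalg.eigvalsh(G)[::-1]            # eigenvalues of G, flipped to descending
chol_diag = np.diag(np.linalg.cholesky(G))   # Cholesky diagonal: a third signature
```

The eigenvalues of the Gram matrix equal the squared singular values term by term, while the Cholesky diagonal only shares their product (the determinant), so the three signatures in the ablation are not interchangeable.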
---
## CATEGORY 6: INFORMATION-THEORETIC LOSSES
### 6.1 Already Tested
- [X] InfoNCE (self-contrastive, two augmented views) — dead at 0.15 in spectral v4
- [X] CosineEmbeddingLoss — frozen at 0.346 (margin-saturated)
- [X] CV loss (Cayley-Menger volume) — running but not in 0.18-0.25 band
### 6.2 Loss Modifications
- [ ] **6.2a** Drop contrastive losses entirely, CE-only + geometric losses
- v4 shows CE is the only contributor; contrastive is dead weight
- Hypothesis: removing dead losses may speed convergence
- [ ] **6.2b** Class-conditional InfoNCE: positive = same class, not same image
- Requires labels but gives much stronger supervision signal
- [ ] **6.2c** vMF-based contrastive loss: replace dot-product similarity with vMF log-likelihood
- κ-adaptive: high-κ for nearby pairs, low-κ for far pairs
- [ ] **6.2d** Fisher-Rao distance as loss: d_FR(p,q) = 2·arccos(∫√(pq))
- Natural distance for distributions on the sphere
- [ ] **6.2e** Sliced spherical Wasserstein distance as distribution matching loss
- Matches embedding distribution to target (e.g., uniform on sphere)
- [ ] **6.2f** Geometric autograd (from GM3): tangential projection + separation preservation
- Adam + geometric autograd > AdamW on geometric tasks (proven)
- Operates on gradient direction, not loss value
### 6.3 Anchor Management
- [ ] **6.3a** Anchor push frequency sweep: every 10, 25, 50, 100, 200 batches
- [ ] **6.3b** Anchor push with vMF-weighted centroids instead of hard class centroids
- [ ] **6.3c** Anchor birth/death: add anchors where density is high, remove where unused
- [ ] **6.3d** Anchor dropout sweep: 0%, 5%, 15%, 30%, 50%
---
## CATEGORY 7: COMPOSITE PIPELINE TESTS
### 7.1 The Reference Pipeline (from research article)
- [ ] **7.1a** Scattering(J=2,L=8) → JL(128) → L2 norm → constellation(64) → classify
- The "canonical" pipeline; expected ~75-80% based on literature
- [ ] **7.1b** Same as 7.1a but with learned 2-layer projection replacing JL
- Minimal learned params (~16K), test if projection adaptation matters
- [ ] **7.1c** Scattering → curvelet energy → concat → JL → constellation
- Test complementarity
### 7.2 Hybrid: Spectral + Scattering
- [ ] **7.2a** STFT channels (v4) + scattering features → concat → JL → S^d → constellation
- STFT gives spatial-frequency; scattering gives multi-scale invariants
- [ ] **7.2b** Scattering → Cholesky Gram + SVD signature → constellation
- Apply v4's geometric signature to scattering output instead of STFT
### 7.3 Multi-Signature Constellation
- [ ] **7.3a** Parallel extraction: scattering + Gabor + Radon → separate constellations → fusion
- Each primitive captures different geometric aspect
- Fusion: concatenate patchwork outputs → shared classifier
- [ ] **7.3b** Hierarchical constellation: scattering → coarse anchors → residual → fine anchors
- Two-stage: first stage identifies broad category, second refines
### 7.4 Minimal Learned Params Tests
- [ ] **7.4a** Best deterministic pipeline + 1 learned linear layer (d_in → 128) before constellation
- Measure: how much does a single projection layer help?
- Count: exact learned param count
- [ ] **7.4b** Same as 7.4a but with SquaredReLU + LayerNorm (the proven patchwork block)
- [ ] **7.4c** Sweep learned projection sizes: 0, 1K, 5K, 10K, 50K, 100K params
- Find the elbow where adding params stops helping
---
## PRIORITY QUEUE (recommended execution order)
### Tier 1: Highest Expected Impact
1. **1.1a** — Scattering + flat constellation (the literature leader)
2. **1.1b** — Scattering + JL → S^127 + constellation
3. **6.2a** — Drop dead contrastive losses from v4, measure CE-only ceiling
4. **2.4a** — vMF soft assignment replacing hard nearest-anchor
5. **4.2a** — Log map triangulation (richer than scalar distance)
### Tier 2: High Expected Impact
6. **7.1a** — Full reference pipeline
7. **1.1f** — Scattering hybrid with minimal learned projection
8. **1.2b** — Gabor spatial statistics → S^127
9. **5.2c** — Eigendecomposition vs SVD vs Cholesky ablation
10. **2.1b** — Quaternionic Hopf S⁷→S⁴ for 8-channel data
### Tier 3: Exploratory
11. **1.5a** — Persistent homology standalone
12. **3.1b** — RFF on scattering features
13. **4.4a** — Möbius geometric attention
14. **7.3a** — Multi-signature parallel constellations
15. **2.2a** — Grassmannian class subspaces
### Tier 4: Deep Exploration
16. **1.3a** — Radon cloud on S^d
17. **1.4b** — Curvelet + scattering concat
18. **2.3a** — Flag decomposition of frequency channels
19. **4.3a** — Parallel transport aggregation
20. **3.4c** — Hyperspherical harmonics analysis
---
## RUNNING SCOREBOARD
| Experiment | Val Acc | Params (learned) | CV | Anchors Active | InfoNCE | Key Finding |
|---|---|---|---|---|---|---|
| Linear baseline | 67.0% | 423K | — | — | — | Overfits E31 |
| MLP baseline | 65.0% | 687K | — | — | — | Overfits E10 |
| Core CE-only | 63.4% | 820K | 0.70 | — | — | CV never converges |
| Core CE+CV | 62.7% | 820K | 0.61 | — | — | CV hurts accuracy |
| Full GELU | 88.0% | 1.6M | 0.14-0.17 | 64/64 | 1.00 | Reference |
| Full SquaredReLU | 88.0% | 1.6M | 0.15 | 64/64 | 1.00 | Matches GELU |
| Spectral v1 (flat FFT) | FAIL | — | — | 1/64 | — | Norm mismatch |
| Spectral v2 (per-band) | ~35% | 1.2M | 0.17-0.19 | 900/3072 | 0.45 | Too diffuse |
| Spectral v3 (sph mean) | ~27% | 130K | 0.27-0.34 | 110/128 | 0.35 | Collapsed to point |
| Spectral v4 (STFT+Chol+SVD) | 46.8% | 137K | 0.52-0.66 | 53/64 | 0.15 | CE-only carry |
| *Scattering baseline* | *~82%* | *0* | *—* | *—* | *—* | *Literature (SVM)* |
*Italicized entries are literature values, not our runs*
---
## NOTES & INSIGHTS
### Why contrastive losses die on deterministic encoders
The STFT/FFT faithfully reports every pixel-level difference between augmented views.
Two crops of the same image produce signatures as different as two different images.
Without a learned layer to absorb augmentation variance, InfoNCE has nothing to align.
Solutions: (a) augmentation-invariant features (scattering), (b) thin learned projection,
(c) class-conditional contrastive (6.2b), (d) drop contrastive entirely (6.2a).
### The Cholesky insight
L diagonal encodes "new angular information per tier given all lower tiers."
This IS discriminative (proved by v4 reaching 46.8% with CE alone).
The 44-d signature on S^43 carries real inter-channel geometry.
Next question: is the STFT front-end the bottleneck, or the 44-d signature?
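To make the 36 + 8 = 44-d arithmetic concrete, here is a minimal numpy sketch of how such a signature could be assembled; the actual v4 normalization, scaling, and STFT front-end are not shown in this README, so the details here (mean-normalized Gram, jitter, plain L2 normalize) are assumptions, not the tested code.

```python
import numpy as np

def chol_svd_signature(channels):
    """v4-style signature sketch: channels is (n_positions, 8).
    Gram -> Cholesky lower triangle (8*9/2 = 36 values), concatenated with
    the 8 singular values -> 44-d, L2-normalized onto S^43."""
    G = channels.T @ channels / len(channels)          # 8x8 Gram matrix
    G += 1e-6 * np.eye(G.shape[0])                     # jitter to keep Cholesky PD
    L = np.linalg.cholesky(G)
    tri = L[np.tril_indices(8)]                        # 36 lower-triangular values
    s = np.linalg.svd(channels, compute_uv=False)      # 8 singular values
    sig = np.concatenate([tri, s])
    return sig / np.linalg.norm(sig)

rng = np.random.default_rng(0)
sig = chol_svd_signature(rng.standard_normal((256, 8)))
```

The diagonal of L is the "new angular information per tier" term; everything below the diagonal encodes how each channel projects onto the tiers before it.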
### Scattering is the clear next step
82% on CIFAR-10 with zero learned params (literature) vs our 46.8%.
Scattering is translation-invariant AND deformation-stable (Lipschitz).
This directly addresses the augmentation sensitivity problem.
kymatio provides GPU-accelerated PyTorch implementation.
### The dimension question
S^15 (band_dim=16) vs S^43 (signature) vs S^127 (conv encoder output)
E₈ lattice gives 240 optimal anchors on S^7
Proven CV attractor at ~0.20 is on S^15
Need to test which target sphere dimension is optimal for spectral features
---
*Last updated: 2026-03-18, session with Opus*
*Next: run scattering baseline (1.1a), then decide pipeline direction*