---
license: mit
---
# New Paradigm: Hypersphere Encoding Observer
This... may not work, but it's an interesting thought experiment.
https://en.wikipedia.org/wiki/Plasmoid
https://en.wikipedia.org/wiki/Field-reversed_configuration
Let's hypothesize for a second.
WHAT HAPPENS when we invert the field with an adjacent large-bodied resonant structure?
I have many theories as to why certain universal elements behave the way they do, and the plasmoid system is intriguing in this instance.
We are training encodings in these spectrums predominantly through magnitude flow. Now, spherical alignment doesn't allow JUST magnitude flow
to behave within the constraints. However, we're hardware bound to a reasonable set of expectations for reasonable utilization capacity. Otherwise,
I'd be running for spectral alignment substructures based on atomic spectrum assessments to find the exacting optimization spectrography for
an Adam replacement.
In any case, this reminds me of one of my earlier experiments. The induction/excitation/repulsion/resonance structure.
Physics and energy transfer in a void is complex interaction. VERY complex. So complex that I couldn't begin to represent the actual required
valuations for just measuring a grape's electromagnetic (neural magnetic?) field.
My hypothesis is this: waveform resonance is not just analyzable, but can be analyzed within a REASONABLE spectrum. Scatterpoint2D has a fair
approximation of a wavelet attempt at this, and the results are... quite good. The geometric induction eats the information up like candy.
EVEN SO, the standard conv stack defeats it with prelearned statistics. Attach conv to scatterpoint and you get better results, but
you're still operating conv on an encoding spectrum. Back to square one in that case.
## The multispectral resonance
Let's pretend for a second I know what I'm talking about, which I don't half of the time thanks to Schrödinger - thanks by the way.
Say hypothetically we begin with 5 spheres. Each sphere is completely identical, and yet each issues a different waveform resonance point. Theoretically,
this is more than enough information if the correct geometric alignments are formatted from the pixel data inputted, to represent any pixel space
in any spectrum on any plane - by simply mapping to differentiations between these 5 spheres.
Yet there's something faulty here. Something in the mix causes causal differentiation to accumulate, faults to form, overlap to blend, internal mechanisms
to fail, and boom - you get a shared average. Schrödinger rears his head yet again. Run it 500 times and you'll always see 20%, due to the very nature of random selection
and the law of averages. If you leave it without a task, it drifts in whichever direction happens. There is no direction, so the math goes naturally.
WHY? Is it the data? Is it the representation space? Is it the latent control?
## The data
The data, no matter how small, can be represented in one form or another. Whether it be through differential equations, similarity assessments, or something else.
It can be noise, images, counting potatoes, it doesn't matter. If you want to predict SOMETHING with SOMETHING ELSE you can represent it somehow.
## The features
If the features themselves are faulty, the image should still be capable of being represented within a reasonable spectrum of differentiation, without huge barrage statistics to calculate them.
The degree of variance is high enough and yet tight enough to prevent causal corruption.
## The control
These systems show superior latent control and yet STILL provide little utility to latents extracted DIRECTLY in the math of the pathway,
so it's not that specifically in the shallow sense.
## The most likely probability
**Indirect utility yields INDIRECT RESULTS.**
The math isn't wrong. The math is more often TOO CORRECT for the WRONG REASONS. AI needs to be able to predict. If you predict the math, you get the math.
If you predict the task, you get the task. If there is no direct causal relation, you're wasting compute. Simple.
If there is no geometric structure... There is nothing to grab onto, so very little information.
The end result is: no matter how I solve X, if I decompose the solution TOO FAR I'll end up with a less useful value.
5 means little when you decomposed it from 250,000,000 pixels, if the mechanism can't correctly represent the attenuation to statistically accumulate that 5.
Transformers are good at this, because they capture and represent. However, statistically, transformers destroy geometric specificity for generic utility.
Even if you CAN represent that accumulation in a valid fashion, the structural undertones of the system will still be learning 5 SOMEWHERE in a statistics-accumulation fashion.
# I propose the OBSERVER wrap the entire structure.
From stem to stern. Observe what the model sees stage by stage, watch the layers, watch the output, synthesize correct responses to that information.
Directly encoding information in conjunctive relation to the model itself causes a huge series of discontinuities.
It works with post-training memory. It works with a series of high-yield experiments based on specifics.
**YET, when trained simultaneously ACCEPTING THE OUTPUT OF AN ENCODER without the process of accumulation, the observer faults**
I propose the necessity for the observer to see everything, similar to how David diffusion saw multiple layers within a structure to see the legitimate logistics output of
every single stage of a model throughout the cycle.
I believe, if the observer sees everything, the analysis will not fail.
That is my next direction.
# Baseline Sweep Complete
Based on the results, I require a unique waveform variation of cross-entropy that directly aligns with the BCE from the learned hypersphere.
I believe based on the CV, a gate can be formatted and tuned to the task, but this will not solve the underlying statistics collapse if the head doesn't match.
FlowMagnitude is powerful for Conv and statistics processing, but pure geometric features in the methods tested did not fit the spectrum directly.
Adjacently, I believe I had an insight during a dream. Statistically, we need more information. Simply put, there is not enough represented and useful information
to capture with a Conv without a full transformer spectrum, hence why the transformers solve many problems that brute-force CONV stacks simply fail at.
The spectral analysis shows that MOST of the primary spectral analysis can be learned in part, and they each require their own format of stack, which I can handle
building a full representative spectrum of consumer-capable transformers and conv stacks.
The oddity of geometric structure being fully encoded BY the transformer does not destroy the underlying spectrum, if the formula FED INTO the transformer
represents the necessary statistical data that can be consumed by the geometric head. The format changes a little, but not by much, since the geometric observer
system in its current format is meant to be an observer that provides influence, and not a direct controller YET.
## Anchors performed perfectly to specification

The entire spectrum showed the anchors began roughly rounded and blobby, slowly forming and building magnitude rigidity with hypersphere association.
Absolutely perfect. The BCE worked perfectly. I'm floored at how well that element worked, even if it's not perfect yet. The math is aligned.
## SVD 3x3 kernel optimization
The 3x3 SVD kernel has shown more than potential for use.
It's not as stable as eigen, but it most definitely has powerful potential. I'll be working to stabilize the variants using fp64 and rounding spectrum over time.
With the variation comes limitations; however, this variation is tested to be 15000x faster than torch.linalg.svd for 3x3 kernels.
It houses a series of buffers optimized through Triton, with a fundamentally different approach to SVD more similar to eigen.
This does not make SVD any more accurate to the required numeric stability, which means it must be consumed and processed correctly.
With the initial sweeps complete, I can conclude one primary element; 800k params is not enough for Cifar10,
so I'll be adjusting the entire notebook spectrum.
The results will be better sorted for the next sweep, as the current sweep doesn't have easy dropdown/hide settings in the GUI. The next one should be easy to see and use.
## Actionable Utility
Almost all tested shapes have a potential to teach the system for tasks.
The large array of math will require a streamlined series of sweeps to run in a very optimal environment.
Due to the lack of expensive hardware at my disposal, I have to take drastic steps for this.
## The Expert-Tuning Solution
So, I won't TRAIN the models using a pair of experts. However, I can TUNE the settings based on the most likely alignment cascade
capacity that the two experts can enable simultaneously with the current build.
So in a sense, the experts will say what the settings are most likely going to be most optimized at, by making a quick soup.
This should provide the necessary yields that I require, assuming I pick experts with relationally similar math. So... parameter-narrowing soup for now;
eventually the system should be able to directly self-attenuate toward the best suggested parameters at the get-go.
The models themselves for this experiment set will never be trained by the experts, only the params selected by what is most likely.
The models will never see an expert opinion directly, nor will they be given gradients from anything expert-related. Everything in a vacuum.
## Flows, Routes, Patterns, Trajectories, Magnitudes, Etc
Everything mathematically will have a represented flow attenuation mechanism specifically aligned to the curation of that math.
This will enable two core features, primarily the access to directly attuned flow matching through deep structure. Secondarily, it will
allow for direct curative control for analysis utilizing invariants in direct diagnostics.
In other words, debug tools.
This will result in a very deep and robust capacity for debug analysis, as well as additional capacity to learn and regulate
momentum learning from those observer patterns.
# GeoLIP Spectral Encoder — Test Manifest
## Geometric Primitives for Constellation-Anchored Classification
**Target**: CIFAR-10 (baseline), then generalize
**Constraint**: Zero or minimal learned encoder params. All learning in constellation anchors, patchwork, classifier.
**Metric**: Val accuracy, CV convergence, anchor activation, InfoNCE lock, train/val gap
**Baseline to beat**: 88.0% (conv encoder + SquaredReLU + full trainer, 1.6M params)
**Current best spectral**: 46.8% (STFT + Cholesky + SVD, v4, 137K params, CE-only carry)
---
## STATUS KEY
- `[ ]` — Not started
- `[R]` — Running
- `[X]` — Completed
- `[F]` — Failed (with reason)
- `[S]` — Skipped (with reason)
- `[P]` — Partially completed
---
## COMPLETED EXPERIMENTS (prior sessions + this session)
### Conv Encoder Baselines (Form 1 Core)
- [X] Linear baseline, 100 epochs → **67.0%**, 422K params, overfits at E31
- [X] MLP baseline, 100 epochs → **65.0%**, 687K params, overfits at E10
- [X] Core CE-only, 100 epochs → **63.4%**, 820K params, CV=0.70, never converges
- [X] Core CE+CV, 100 epochs → **62.7%**, 820K params, CV=0.61, worse than CE-only
- [X] Core 32 anchors, interrupted E20 → **59.2%**, 1.8M params, slow convergence
- [X] Full trainer GELU, 100 epochs → **88.0%**, 1.6M params (original proven result)
- [X] Full trainer SquaredReLU, 100 epochs → **88.0%**, 1.6M params, E96 best
### Spectral Encoder Experiments
- [F] Spectral v1: flat FFT → 768-d → single constellation → **collapsed**
- Cause: concat norm √48≈6.93 vs anchor norm 1, not on same sphere
- [F] Spectral v2: per-band constellation (48×64=3072 anchors) → **~35%**
- Cause: 3072 tri dims too diffuse, InfoNCE dead at 0.45, no cross-band structure
- [F] Spectral v3: FFT → 8 channels (spherical mean) → 128 anchors → **27%**
- Cause: cos≈0.99, spherical mean collapsed all images to same point
- [P] Spectral v4: STFT + Cholesky + SVD → S^43 → 64 anchors → **46.8%** (still running)
- CE carrying alone, CosineEmbeddingLoss frozen at 0.346, InfoNCE dead at 0.15
- Cholesky+SVD signature IS discriminative, contrastive losses unable to contribute
---
## CATEGORY 1: SIGNAL DECOMPOSITION TO GEOMETRY
### 1.1 Wavelet Scattering Transform (Mallat)
**Formula**: S_J[p]x(u) = |||x * ψ_{λ₁}| * ψ_{λ₂}| ... | * φ_{2^J}(u)
**Library**: kymatio (pip install kymatio)
**Github**: https://github.com/kymatio/kymatio
**Expected output**: ~10K-dim feature vector for 32×32
**Literature baseline**: ~82% CIFAR-10 with SVM, ~70.5% with linear
**Properties**: Deterministic, Lipschitz-continuous, approximately energy-preserving
- [ ] **1.1a** Scattering order 2, J=2, L=8 → L2 normalize → flat constellation on S^d
- Hypothesis: scattering features are rich enough that flat constellation should work
- Compare: direct linear classifier on scattering vs constellation pipeline
- [ ] **1.1b** Scattering → JL projection to S^127 → constellation (64 anchors)
- JL preserves distances; S^127 matches our proven dim
- [ ] **1.1c** Scattering → JL → S^43 → Cholesky/SVD signature → constellation
- Stack v4's geometric signature on top of scattering features
- [ ] **1.1d** Scattering order 1 vs order 2 ablation
- Order 1 is ~Gabor magnitude; order 2 adds inter-frequency structure
- [ ] **1.1e** Scattering + InfoNCE: does augmentation invariance help or hurt?
- Scattering is already translation-invariant; InfoNCE may be redundant
- [ ] **1.1f** Scattering hybrid: scattering front-end + lightweight learned projection + constellation
- Test minimal learned params needed to bridge the 82→88% gap
### 1.2 Gabor Filter Banks
**Formula**: g(x,y) = exp(−(x'²+γ²y'²)/(2σ²)) · exp(i(2πx'/λ+ψ))
**Expected**: S scales × K orientations → S×K magnitude responses
**Properties**: Deterministic, O(N·S·K), first-order scattering ≈ Gabor modulus
- [ ] **1.2a** Gabor bank (4 scales × 8 orientations = 32 filters) → L2 norm → S^31
- Each filter response is a spatial map; pool to scalar per filter
- [ ] **1.2b** Gabor → per-filter spatial statistics (mean, std, skew, kurtosis) → S^127
- 32 filters × 4 stats = 128-d, matches conv encoder output dim
- [ ] **1.2c** Gabor vs scattering order 1 A/B test
- Validate that scattering order 1 ≈ Gabor + modulus
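For 1.2a/1.2b, a minimal numpy sketch of the bank (the σ/λ pairing, the mean-|response| pooling, and the 11×11 kernel size are my assumptions here, not fixed choices for the sweep):

```python
import numpy as np

def gabor_kernel(size, sigma, theta, lam, gamma=0.5, psi=0.0):
    """Real part of g(x,y) = exp(-(x'^2 + gamma^2 y'^2)/(2 sigma^2)) * cos(2 pi x'/lam + psi)."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    xp = x * np.cos(theta) + y * np.sin(theta)    # rotated coordinates
    yp = -x * np.sin(theta) + y * np.cos(theta)
    return np.exp(-(xp**2 + gamma**2 * yp**2) / (2 * sigma**2)) * np.cos(2 * np.pi * xp / lam + psi)

def correlate_valid(img, kern):
    """Plain 'valid' cross-correlation (FFT would be faster; direct loop for clarity)."""
    kh, kw = kern.shape
    H, W = img.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = (img[i:i + kh, j:j + kw] * kern).sum()
    return out

def gabor_signature(img, scales=(2.0, 4.0), n_orient=8, size=11):
    """Pool |response| to one scalar per filter, then L2-normalize onto the sphere (1.2a)."""
    feats = []
    for sigma in scales:
        for k in range(n_orient):
            kern = gabor_kernel(size, sigma, theta=np.pi * k / n_orient, lam=2 * sigma)
            feats.append(np.abs(correlate_valid(img, kern)).mean())
    v = np.asarray(feats)
    return v / (np.linalg.norm(v) + 1e-12)

rng = np.random.default_rng(0)
sig = gabor_signature(rng.standard_normal((32, 32)))   # 2 scales x 8 orientations -> S^15
assert sig.shape == (16,) and abs(np.linalg.norm(sig) - 1.0) < 1e-6
```

For 1.2b, the single `.mean()` pool would be replaced by the four spatial statistics per filter.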
### 1.3 Radon Transform
**Formula**: Rf(ω,t) = ∫ f(x) δ(x·ω − t) dx
**Properties**: Deterministic, exactly invertible via filtered back-projection
- [ ] **1.3a** Radon at K angles → sinogram → L2 norm per angle → K points on S^d
- K angles = K geometric addresses, constellation measures the cloud
- [ ] **1.3b** Radon → 1D wavelet per projection (= ridgelet) → aggregate to S^d
- Composition: Radon → Ridgelet, captures linear singularities
### 1.4 Curvelet Transform
**Formula**: c_{j,l,k} = ⟨f, φ_{j,l,k}⟩, parabolic scaling: width ≈ length²
**Properties**: Deterministic, exactly invertible (tight frame), O(N² log N)
- [ ] **1.4a** Curvelet energy per (scale, orientation) band → L2 norm → S^d
- Captures directional frequency that scattering misses
- [ ] **1.4b** Curvelet + scattering concatenation → JL → constellation
- Test complementarity of isotropic (scattering) + anisotropic (curvelet) features
### 1.5 Persistent Homology (TDA)
**Formula**: Track birth/death of β₀ (components), β₁ (loops) across filtration
**Library**: giotto-tda or ripser
**Properties**: Deterministic, O(n³), captures topology no other transform sees
- [ ] **1.5a** Sublevel set filtration on grayscale → persistence image → L2 norm → S^d
- [ ] **1.5b** PH on scattering feature maps (topology of the representation)
- Captures whether scattering features form clusters, loops, voids
- [ ] **1.5c** PH Betti curve as additional channel in multi-signature pipeline
- [ ] **1.5d** PH standalone classification baseline on CIFAR-10
- Literature suggests ~60-70% standalone; valuable as complementary signal
### 1.6 STFT Variants (improving v4)
- [ ] **1.6a** 2D STFT via patch-wise FFT (overlapping patches) instead of row/col STFT
- True spatial-frequency decomposition vs row+col approximation
- [ ] **1.6b** STFT with larger n_fft=32 (current: 16) → more frequency resolution
- [ ] **1.6c** STFT preserving phase (not just magnitude) via analytic signal
- Phase encodes spatial structure; current pipeline discards it
- [ ] **1.6d** Multi-window STFT (different window sizes for different frequency ranges)
---
## CATEGORY 2: MANIFOLD STRUCTURES
### 2.1 Hopf Fibration
**Formula**: h(z₁,z₂) = (2z̄₁z₂, |z₁|²−|z₂|²) : S³ → S²
**Properties**: Deterministic, O(1), hierarchical (base + fiber)
- [ ] **2.1a** Encode 4-d feature vectors on S³ → Hopf project to S² + fiber coordinate
- Coarse triangulation on S², fine discrimination in fiber
- [ ] **2.1b** Quaternionic Hopf S⁷ → S⁴ for 8-d features
- Natural for 8-channel spectral decomposition (v3/v4 channel count)
- [ ] **2.1c** Hopf foliation spherical codes for anchor initialization
- Replace uniform_hypersphere_init with Hopf-structured codes
- [ ] **2.1d** Hierarchical constellation: coarse anchors on base S², fine anchors per fiber
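Before 2.1a runs, a sanity sketch of the map itself (pure numpy; where the 4-d feature vector comes from is left open):

```python
import numpy as np

def hopf_map(q):
    """h(z1,z2) = (2*conj(z1)*z2, |z1|^2 - |z2|^2): S^3 -> S^2,
    with q = (a,b,c,d) read as z1 = a+bi, z2 = c+di."""
    z1, z2 = complex(q[0], q[1]), complex(q[2], q[3])
    w = 2 * np.conj(z1) * z2
    return np.array([w.real, w.imag, abs(z1)**2 - abs(z2)**2])

rng = np.random.default_rng(1)
q = rng.standard_normal(4); q /= np.linalg.norm(q)       # point on S^3
p = hopf_map(q)
assert abs(np.linalg.norm(p) - 1.0) < 1e-9               # base point lands on S^2

# Fiber invariance: a common unit phase on (z1, z2) moves along the fiber and
# leaves the S^2 base point fixed -- that residual phase IS the fiber coordinate.
c, s = np.cos(0.7), np.sin(0.7)
q2 = np.array([q[0]*c - q[1]*s, q[0]*s + q[1]*c,
               q[2]*c - q[3]*s, q[2]*s + q[3]*c])
assert np.allclose(hopf_map(q2), p, atol=1e-9)
```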
### 2.2 Grassmannian Class Representations
**Formula**: Class = k-dim subspace of ℝⁿ, distances via principal angles
**Properties**: Requires SVD, O(nk²)
- [ ] **2.2a** Replace class vectors with class subspaces on Gr(k,n)
- Each class owns a k-dim subspace; classification = nearest subspace
- Literature: +1.3% on ImageNet over single class vectors
- [ ] **2.2b** Grassmannian distance metrics ablation: geodesic vs chordal vs projection
- [ ] **2.2c** Per-class anchor subspace: each anchor defines a subspace, not a point
### 2.3 Flag Manifold (Nested Subspace Hierarchy)
**Formula**: V₁ ⊂ V₂ ⊂ ... ⊂ Vₖ, nested subspaces
**Properties**: Generalizes Grassmannian, natural for multi-resolution
- [ ] **2.3a** Flag decomposition of frequency channels (DC ⊂ low ⊂ mid ⊂ high)
- Test whether nesting constraint improves spectral encoder
- [ ] **2.3b** Flag-structured anchors: coarse-to-fine anchor hierarchy
### 2.4 Von Mises-Fisher Mixture
**Formula**: f(x; μ, κ) = C_p(κ) exp(κ μᵀx), soft clustering on S^d
**Properties**: Natural density model for hyperspherical data
- [ ] **2.4a** Replace hard nearest-anchor assignment with vMF soft posteriors
- p(j|x) = α_j f(x;μ_j,κ_j) / Σ α_k f(x;μ_k,κ_k)
- Learned κ per anchor = adaptive influence radius
- [ ] **2.4b** vMF mixture EM for anchor initialization (replace uniform hypersphere init)
- [ ] **2.4c** vMF concentration κ as a diagnostic: track per-class κ convergence
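A sketch of the 2.4a soft posterior (the κ-dependent normalizer C_p(κ) is dropped, which is only exact when all κ_j are equal; learned per-anchor κ would need log C_p(κ_j) restored):

```python
import numpy as np

def vmf_posteriors(x, mu, kappa, log_alpha=None):
    """Soft anchor assignment p(j|x) ∝ α_j exp(κ_j μ_jᵀx), via a stable softmax."""
    logits = kappa * (mu @ x)
    if log_alpha is not None:
        logits = logits + log_alpha
    logits = logits - logits.max()           # numerical stability
    p = np.exp(logits)
    return p / p.sum()

rng = np.random.default_rng(2)
J, d = 8, 16
mu = rng.standard_normal((J, d))
mu /= np.linalg.norm(mu, axis=1, keepdims=True)          # anchors on S^15
x = mu[3] + 0.05 * rng.standard_normal(d)
x /= np.linalg.norm(x)                                   # embedding near anchor 3
p = vmf_posteriors(x, mu, kappa=np.full(J, 30.0))
assert abs(p.sum() - 1.0) < 1e-9 and p.argmax() == 3
```

κ here is the adaptive influence radius: higher κ sharpens the posterior toward hard nearest-anchor assignment.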
### 2.5 Optimal Anchor Placement
- [ ] **2.5a** E₈ lattice anchors for 8-d constellation (240 maximally separated points)
- [ ] **2.5b** Spherical t-design initialization vs uniform hypersphere init
- [ ] **2.5c** Thomson problem solver for N anchors on S^d (energy minimization)
- Compare: QR + iterative repulsion (current) vs Coulomb energy minimization
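A rough sketch of 2.5c's Coulomb energy minimization (projected gradient with a retraction to the sphere; the step size and step count are guesses, and the current QR + iterative repulsion init is not reproduced here):

```python
import numpy as np

def thomson_anchors(n, d, steps=2000, lr=0.005, seed=0):
    """Minimize E = sum_{i<j} 1/||a_i - a_j|| over n points on S^(d-1)."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((n, d))
    A /= np.linalg.norm(A, axis=1, keepdims=True)
    for _ in range(steps):
        diff = A[:, None, :] - A[None, :, :]               # a_i - a_j, shape (n, n, d)
        dist = np.linalg.norm(diff, axis=-1) + np.eye(n)   # 1 on diagonal avoids 0/0
        grad = -(diff / dist[..., None]**3).sum(axis=1)    # dE/da_i
        A -= lr * grad                                     # repulsion step
        A /= np.linalg.norm(A, axis=1, keepdims=True)      # retract back to the sphere
    return A

anchors = thomson_anchors(12, 3)
cos = anchors @ anchors.T - 2 * np.eye(12)   # push diagonal below all off-diagonals
assert np.allclose(np.linalg.norm(anchors, axis=1), 1.0)
assert cos.max() < 0.7   # well separated (icosahedral optimum is about 0.447)
```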
---
## CATEGORY 3: COMPACT REPRESENTATIONS
### 3.1 Random Fourier Features
**Formula**: z(x) = √(2/D) [cos(ω₁ᵀx+b₁), ..., cos(ωDᵀx+bD)]
**Properties**: Pseudo-deterministic, preserves kernel structure, maps to S^d via cos/sin
- [ ] **3.1a** RFF on raw pixels → S^d → constellation
- Baseline: how much does nonlinear kernel approximation help raw pixels?
- [ ] **3.1b** RFF on scattering features → constellation
- Composition: scattering (linear invariants) → RFF (nonlinear kernel)
- [ ] **3.1c** Fourier feature positional encoding (Tancik/Mildenhall style)
- γ(v) = [cos(2πBv), sin(2πBv)]ᵀ explicitly maps to hypersphere
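A sketch of the RFF map above, checked against the kernel it approximates (sampling ω ~ N(0, 2γI) corresponds to the Gaussian kernel k(x,y) = exp(−γ‖x−y‖²); D and γ are arbitrary picks here):

```python
import numpy as np

def rff(X, D=2048, gamma=0.5, seed=0):
    """z(x) = sqrt(2/D) cos(Wx + b), so z(x)·z(y) ≈ exp(-gamma ||x-y||^2)."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(X.shape[1], D))  # omega ~ N(0, 2*gamma*I)
    b = rng.uniform(0, 2 * np.pi, size=D)
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

rng = np.random.default_rng(3)
X = rng.standard_normal((2, 8))
Z = rff(X)
approx = float(Z[0] @ Z[1])
exact = float(np.exp(-0.5 * np.sum((X[0] - X[1]) ** 2)))
assert abs(approx - exact) < 0.1   # Monte Carlo kernel approximation, error ~ 1/sqrt(D)
```

Each z(x) has expected squared norm 1, so the L2-normalize-onto-S^d step after this is a small correction rather than a distortion.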
### 3.2 Johnson-Lindenstrauss Projection
**Formula**: f(x) = (1/√k)Ax, preserves distances with k = O(ε⁻² log n)
**Properties**: Pseudo-deterministic, near-isometric
- [ ] **3.2a** JL from scattering (~10K) to 128-d → L2 norm → constellation
- Test: does JL + L2 norm preserve enough structure?
- [ ] **3.2b** JL target dimension sweep: 32, 64, 128, 256, 512
- Find minimum k where constellation accuracy saturates
- [ ] **3.2c** Fast JL (randomized Hadamard) vs Gaussian JL speed/accuracy tradeoff
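For 3.2a/3.2b, the distance-preservation claim is directly checkable (the 10K input dim is a stand-in for flattened scattering coefficients):

```python
import numpy as np

def jl_project(X, k, seed=0):
    """Gaussian JL: f(x) = (1/sqrt(k)) A x with A_ij ~ N(0,1); near-isometric
    when k = O(eps^-2 log n)."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((X.shape[1], k))
    return X @ A / np.sqrt(k)

rng = np.random.default_rng(4)
X = rng.standard_normal((20, 10_000))        # stand-in for ~10K scattering coefficients
Y = jl_project(X, k=1024)

# Pairwise distance distortion stays near 1; sweeping k (3.2b) finds where it saturates.
dX = np.linalg.norm(X[:, None] - X[None, :], axis=-1) + np.eye(20)
dY = np.linalg.norm(Y[:, None] - Y[None, :], axis=-1) + np.eye(20)
distortion = np.abs(dY / dX - 1.0) * (1 - np.eye(20))
assert distortion.max() < 0.2
```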
### 3.3 Compressed Sensing on Scattering Coefficients
**Formula**: y = Φx, recover via ℓ₁ minimization if x is k-sparse
**Properties**: Exact recovery for sparse signals, O(k log(N/k)) measurements
- [ ] **3.3a** Measure sparsity of scattering coefficients (how many are near-zero?)
- If sparse: CS can compress much more than JL
- [ ] **3.3b** CS measurement matrix → L2 norm → constellation
- Compare: CS vs JL at same target dimension
### 3.4 Spherical Harmonics
**Formula**: Y_l^m(θ,φ), complete basis on S², (l_max+1)² coefficients
**Properties**: Deterministic, native Fourier on sphere, exactly invertible
- [ ] **3.4a** Expand constellation triangulation profile in spherical harmonics
- Which angular frequencies carry discriminative info?
- [ ] **3.4b** Spherical harmonic coefficients of embedding distribution as class signature
- [ ] **3.4c** Hyperspherical harmonics for S^15 and S^43 (higher-dim generalization)
---
## CATEGORY 4: INVERTIBLE GEOMETRIC TRANSFORMS
### 4.1 Stereographic Projection
**Formula**: σ(x) = x_{1:n}/(1−x_{n+1}), σ⁻¹(y) = (2y, ‖y‖²−1)/(‖y‖²+1)
**Properties**: Conformal bijection S^n\{pole} ↔ ℝⁿ, preserves angles
- [ ] **4.1a** Stereographic → Euclidean scattering → inverse stereographic → S^d
- Apply scattering in flat space, project back to sphere
- [ ] **4.1b** Stereographic projection as constellation readout alternative
- Instead of triangulation distances, read local coordinates via stereographic
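The bijection in 4.1 is cheap to sanity-check (pure numpy; the pole-removed domain means points with x_{n+1} ≈ 1 need care in a real pipeline):

```python
import numpy as np

def stereo(x):
    """σ: S^n minus the north pole -> R^n, σ(x) = x_{1:n} / (1 - x_{n+1})."""
    return x[:-1] / (1.0 - x[-1])

def stereo_inv(y):
    """σ^{-1}(y) = (2y, ||y||^2 - 1) / (||y||^2 + 1); always lands back on S^n."""
    n2 = y @ y
    return np.concatenate([2 * y, [n2 - 1.0]]) / (n2 + 1.0)

rng = np.random.default_rng(5)
x = rng.standard_normal(5); x /= np.linalg.norm(x)       # point on S^4
assert np.allclose(stereo_inv(stereo(x)), x, atol=1e-8)  # round trip is identity
# any Euclidean point maps onto the sphere exactly
assert abs(np.linalg.norm(stereo_inv(rng.standard_normal(4))) - 1.0) < 1e-12
```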
### 4.2 Exponential / Logarithmic Maps
**Formula**: exp_p(v) = cos(‖v‖)·p + sin(‖v‖)·v/‖v‖
**Formula**: log_p(q) = arccos(⟨q,p⟩) · (q−⟨q,p⟩p)/‖q−⟨q,p⟩p‖
**Properties**: Deterministic, locally invertible, O(n)
- [ ] **4.2a** Replace triangulation (1−cos) with log map coordinates at each anchor
- Log map gives direction + distance in tangent space (richer than scalar distance)
- Each anchor contributes d-dim tangent vector instead of 1-d distance
- [ ] **4.2b** Log map triangulation → parallel transport to common tangent space → aggregate
- Geometrically principled alternative to patchwork concatenation
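A numpy sketch of the 4.2a readout primitive, the log map at a single anchor (per-anchor stacking and the patchwork side are omitted):

```python
import numpy as np

def log_map(p, q, eps=1e-12):
    """log_p(q): tangent vector at p pointing toward q, with length = geodesic distance."""
    c = np.clip(p @ q, -1.0, 1.0)
    v = q - c * p                                # component of q orthogonal to p
    return np.arccos(c) * v / (np.linalg.norm(v) + eps)

def exp_map(p, v, eps=1e-12):
    """exp_p(v) = cos(||v||) p + sin(||v||) v/||v||."""
    t = np.linalg.norm(v)
    if t < eps:
        return p
    return np.cos(t) * p + np.sin(t) * v / t

rng = np.random.default_rng(6)
p = rng.standard_normal(8); p /= np.linalg.norm(p)   # anchor on S^7
q = rng.standard_normal(8); q /= np.linalg.norm(q)   # embedding on S^7
v = log_map(p, q)
assert abs(p @ v) < 1e-9                             # v lives in the tangent space at p
assert np.allclose(exp_map(p, v), q, atol=1e-8)      # exp/log round trip
```

The point of 4.2a is visible in the shapes: `v` is a full d-dim tangent vector, where the current triangulation keeps only the scalar 1−cos.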
### 4.3 Parallel Transport
**Formula**: Γ^q_p(v) = v − ((⟨v,p⟩+⟨v,q⟩)/(1+⟨p,q⟩))·(p+q) on S^n
**Properties**: Isometric between tangent spaces, exactly invertible
- [ ] **4.3a** Compute log maps at K anchors → parallel transport all to north pole → aggregate
- Creates a canonical tangent-space representation independent of anchor positions
- [ ] **4.3b** Parallel transport as inter-anchor communication in constellation
- How does the same input look from different anchor tangent spaces?
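The transport formula above preserves tangency and length, which is quick to verify (a minimal sketch of the 4.3a primitive, before any pole-aggregation logic):

```python
import numpy as np

def transport(p, q, v):
    """Parallel transport of tangent vector v from T_p S^n to T_q S^n:
    Γ(v) = v - ((⟨v,p⟩ + ⟨v,q⟩) / (1 + ⟨p,q⟩)) (p + q)."""
    return v - ((v @ p + v @ q) / (1.0 + p @ q)) * (p + q)

rng = np.random.default_rng(7)
p = rng.standard_normal(6); p /= np.linalg.norm(p)
q = rng.standard_normal(6); q /= np.linalg.norm(q)
v = rng.standard_normal(6)
v -= (v @ p) * p                                   # project v into T_p
w = transport(p, q, v)
assert abs(w @ q) < 1e-9                           # result is tangent at q
assert abs(np.linalg.norm(w) - np.linalg.norm(v)) < 1e-9  # transport is an isometry
```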
### 4.4 Möbius Transformations
**Formula**: h_ω(z) = [(1−‖ω‖²)/‖z−ω‖²](z−ω) − ω
**Properties**: Conformal automorphism of S^d, invertible, O(d)
- [ ] **4.4a** Möbius "geometric attention": transform sphere to zoom into anchor regions
- Expand region near anchor, compress far regions
- Each anchor applies its own Möbius transform before measuring distance
- [ ] **4.4b** Composition of Möbius transforms as normalizing flow on S^d
- Learned flow that warps embedding distribution toward better separation
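The sphere-to-sphere property of h_ω holds exactly for ‖ω‖ < 1, which is the invariant any 4.4a implementation should assert (a minimal numpy check; learning ω is out of scope here):

```python
import numpy as np

def mobius(z, w):
    """h_w(z) = [(1 - ||w||^2) / ||z - w||^2] (z - w) - w; for ||w|| < 1 this is a
    conformal automorphism of the unit sphere that magnifies the region near w/||w||."""
    return ((1.0 - w @ w) / np.sum((z - w) ** 2)) * (z - w) - w

rng = np.random.default_rng(8)
z = rng.standard_normal(4); z /= np.linalg.norm(z)        # point on S^3
w = rng.standard_normal(4); w *= 0.5 / np.linalg.norm(w)  # ||w|| = 0.5 < 1
hz = mobius(z, w)
assert abs(np.linalg.norm(hz) - 1.0) < 1e-10              # the sphere maps to itself
```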
### 4.5 Procrustes + Polar Decomposition
**Formula**: R* = argmin_R ‖RA−B‖_F = UVᵀ from SVD(BᵀA)
**Formula**: A = UP (rotation × stretch)
- [ ] **4.5a** Procrustes-align channel cloud to canonical pose before Cholesky/SVD
- Remove rotation variability, isolate shape information
- [ ] **4.5b** Polar decomposition of channel matrix: U (rotation) + P (stretch) as separate features
- U encodes orientation of frequency cloud; P encodes shape/scale
- Both are geometric, both are deterministic from the channel matrix
---
## CATEGORY 5: MATRIX DECOMPOSITION SIGNATURES
### 5.1 Already Tested
- [X] Cholesky of Gram matrix → 36 lower-tri values (in v4, working)
- [X] SVD singular values → 8 values (in v4, working)
- [X] Concatenated 44-d signature on S^43 → 46.8% with CE-only
### 5.2 Remaining Decompositions
- [ ] **5.2a** QR decomposition: Q (rotation) and R diagonal (scale per channel)
- R diagonal = per-channel magnitude; Q = inter-channel angular structure
- [ ] **5.2b** Schur decomposition: T diagonal = eigenvalues, T off-diagonal = coupling
- For the Gram matrix: Schur gives eigenstructure in triangular form
- [ ] **5.2c** Eigendecomposition of Gram: eigenvalues as spectral signature
- Compare: eigenvalues vs SVD singular values vs Cholesky diagonal
- These are related but not identical (λ_i = σ_i² for Gram = AᵀA)
- [ ] **5.2d** NMF of magnitude spectrum: parts-based decomposition
- Requires iterative optimization (not fully deterministic)
- But finds additive, non-negative parts — texture components
- [ ] **5.2e** Tucker tensor decomposition of spatial×frequency×channel tensor
- 3D structure: (H, W, freq_bins) per color channel
- Core tensor encodes interactions between spatial, frequency, channel modes
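For 5.2c, the eigenvalue/singular-value/Cholesky relationship is checkable directly (shapes mimic v4's 8-channel Gram; the 64×8 matrix is a random stand-in for real spatial×channel data):

```python
import numpy as np

rng = np.random.default_rng(9)
A = rng.standard_normal((64, 8))        # stand-in: 64 spatial positions x 8 channels
G = A.T @ A                             # 8x8 Gram matrix, SPD almost surely

evals = np.sort(np.linalg.eigvalsh(G))[::-1]       # Gram eigenvalues, descending
svals = np.linalg.svd(A, compute_uv=False)         # singular values of A, descending
L = np.linalg.cholesky(G)                          # lower-triangular factor

assert np.allclose(evals, svals**2, atol=1e-6)     # lambda_i(G) = sigma_i(A)^2
assert np.allclose(L @ L.T, G, atol=1e-8)          # Cholesky reconstructs the Gram
assert np.all(np.diag(L) > 0)   # diag = new magnitude per channel given earlier channels
```

So the three signatures are related but carry different emphasis: eigen/SVD values are rotation-invariant summaries, while the Cholesky factor keeps a channel ordering.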
---
## CATEGORY 6: INFORMATION-THEORETIC LOSSES
### 6.1 Already Tested
- [X] InfoNCE (self-contrastive, two augmented views) — dead at 0.15 in spectral v4
- [X] CosineEmbeddingLoss — frozen at 0.346 (margin-saturated)
- [X] CV loss (Cayley-Menger volume) — running but not in 0.18-0.25 band
### 6.2 Loss Modifications
- [ ] **6.2a** Drop contrastive losses entirely, CE-only + geometric losses
- v4 shows CE is the only contributor; contrastive is dead weight
- Hypothesis: removing dead losses may speed convergence
- [ ] **6.2b** Class-conditional InfoNCE: positive = same class, not same image
- Requires labels but gives much stronger supervision signal
- [ ] **6.2c** vMF-based contrastive loss: replace dot-product similarity with vMF log-likelihood
- κ-adaptive: high-κ for nearby pairs, low-κ for far pairs
- [ ] **6.2d** Fisher-Rao distance as loss: d_FR(p,q) = 2·arccos(∫√(pq))
- Natural distance for distributions on the sphere
- [ ] **6.2e** Sliced spherical Wasserstein distance as distribution matching loss
- Matches embedding distribution to target (e.g., uniform on sphere)
- [ ] **6.2f** Geometric autograd (from GM3): tangential projection + separation preservation
- Adam + geometric autograd > AdamW on geometric tasks (proven)
- Operates on gradient direction, not loss value
### 6.3 Anchor Management
- [ ] **6.3a** Anchor push frequency sweep: every 10, 25, 50, 100, 200 batches
- [ ] **6.3b** Anchor push with vMF-weighted centroids instead of hard class centroids
- [ ] **6.3c** Anchor birth/death: add anchors where density is high, remove where unused
- [ ] **6.3d** Anchor dropout sweep: 0%, 5%, 15%, 30%, 50%
---
## CATEGORY 7: COMPOSITE PIPELINE TESTS
### 7.1 The Reference Pipeline (from research article)
- [ ] **7.1a** Scattering(J=2,L=8) → JL(128) → L2 norm → constellation(64) → classify
- The "canonical" pipeline; expected ~75-80% based on literature
- [ ] **7.1b** Same as 7.1a but with learned 2-layer projection replacing JL
- Minimal learned params (~16K), test if projection adaptation matters
- [ ] **7.1c** Scattering → curvelet energy → concat → JL → constellation
- Test complementarity
### 7.2 Hybrid: Spectral + Scattering
- [ ] **7.2a** STFT channels (v4) + scattering features → concat → JL → S^d → constellation
- STFT gives spatial-frequency; scattering gives multi-scale invariants
- [ ] **7.2b** Scattering → Cholesky Gram + SVD signature → constellation
- Apply v4's geometric signature to scattering output instead of STFT
### 7.3 Multi-Signature Constellation
- [ ] **7.3a** Parallel extraction: scattering + Gabor + Radon → separate constellations → fusion
- Each primitive captures different geometric aspect
- Fusion: concatenate patchwork outputs → shared classifier
- [ ] **7.3b** Hierarchical constellation: scattering → coarse anchors → residual → fine anchors
- Two-stage: first stage identifies broad category, second refines
### 7.4 Minimal Learned Params Tests
- [ ] **7.4a** Best deterministic pipeline + 1 learned linear layer (d_in → 128) before constellation
- Measure: how much does a single projection layer help?
- Count: exact learned param count
- [ ] **7.4b** Same as 7.4a but with SquaredReLU + LayerNorm (the proven patchwork block)
- [ ] **7.4c** Sweep learned projection sizes: 0, 1K, 5K, 10K, 50K, 100K params
- Find the elbow where adding params stops helping
---
## PRIORITY QUEUE (recommended execution order)
### Tier 1: Highest Expected Impact
1. **1.1a** — Scattering + flat constellation (the literature leader)
2. **1.1b** — Scattering + JL → S^127 + constellation
3. **6.2a** — Drop dead contrastive losses from v4, measure CE-only ceiling
4. **2.4a** — vMF soft assignment replacing hard nearest-anchor
5. **4.2a** — Log map triangulation (richer than scalar distance)
### Tier 2: High Expected Impact
6. **7.1a** — Full reference pipeline
7. **1.1f** — Scattering hybrid with minimal learned projection
8. **1.2b** — Gabor spatial statistics → S^127
9. **5.2c** — Eigendecomposition vs SVD vs Cholesky ablation
10. **2.1b** — Quaternionic Hopf S⁷→S⁴ for 8-channel data
### Tier 3: Exploratory
11. **1.5a** — Persistent homology standalone
12. **3.1b** — RFF on scattering features
13. **4.4a** — Möbius geometric attention
14. **7.3a** — Multi-signature parallel constellations
15. **2.2a** — Grassmannian class subspaces
### Tier 4: Deep Exploration
16. **1.3a** — Radon cloud on S^d
17. **1.4b** — Curvelet + scattering concat
18. **2.3a** — Flag decomposition of frequency channels
19. **4.3a** — Parallel transport aggregation
20. **3.4c** — Hyperspherical harmonics analysis
---
## RUNNING SCOREBOARD
| Experiment | Val Acc | Params (learned) | CV | Anchors Active | InfoNCE | Key Finding |
|---|---|---|---|---|---|---|
| Linear baseline | 67.0% | 423K | — | — | — | Overfits E31 |
| MLP baseline | 65.0% | 687K | — | — | — | Overfits E10 |
| Core CE-only | 63.4% | 820K | 0.70 | — | — | CV never converges |
| Core CE+CV | 62.7% | 820K | 0.61 | — | — | CV hurts accuracy |
| Full GELU | 88.0% | 1.6M | 0.14-0.17 | 64/64 | 1.00 | Reference |
| Full SquaredReLU | 88.0% | 1.6M | 0.15 | 64/64 | 1.00 | Matches GELU |
| Spectral v1 (flat FFT) | FAIL | — | — | 1/64 | — | Norm mismatch |
| Spectral v2 (per-band) | ~35% | 1.2M | 0.17-0.19 | 900/3072 | 0.45 | Too diffuse |
| Spectral v3 (sph mean) | ~27% | 130K | 0.27-0.34 | 110/128 | 0.35 | Collapsed to point |
| Spectral v4 (STFT+Chol+SVD) | 46.8% | 137K | 0.52-0.66 | 53/64 | 0.15 | CE-only carry |
| *Scattering baseline* | *~82%* | *0* | *—* | *—* | *—* | *Literature (SVM)* |
*Italicized entries are literature values, not our runs*
---
## NOTES & INSIGHTS
### Why contrastive losses die on deterministic encoders
The STFT/FFT faithfully reports every pixel-level difference between augmented views.
Two crops of the same image produce signatures as different as two different images.
Without a learned layer to absorb augmentation variance, InfoNCE has nothing to align.
Solutions: (a) augmentation-invariant features (scattering), (b) thin learned projection,
(c) class-conditional contrastive (6.2b), (d) drop contrastive entirely (6.2a).
### The Cholesky insight
L diagonal encodes "new angular information per tier given all lower tiers."
This IS discriminative (proved by v4 reaching 46.8% with CE alone).
The 44-d signature on S^43 carries real inter-channel geometry.
Next question: is the STFT front-end the bottleneck, or the 44-d signature?
### Scattering is the clear next step
82% on CIFAR-10 with zero learned params (literature) vs our 46.8%.
Scattering is translation-invariant AND deformation-stable (Lipschitz).
This directly addresses the augmentation sensitivity problem.
kymatio provides GPU-accelerated PyTorch implementation.
### The dimension question
S^15 (band_dim=16) vs S^43 (signature) vs S^127 (conv encoder output)
E₈ lattice gives 240 optimal anchors on S^7
Proven CV attractor at ~0.20 is on S^15
Need to test which target sphere dimension is optimal for spectral features
---
*Last updated: 2026-03-18, session with Opus*
*Next: run scattering baseline (1.1a), then decide pipeline direction* |