---
license: mit
tags:
- svd
- procrustes
- triton
- kernel
- optimization
- encoding
- alignment
- analysis
- utility
- code
---
# Modular kernel
Make sure torch and triton are installed

https://huggingface.co/AbstractPhil/svd-triton/blob/main/kernel.py


# What this is
SVD from torch is... exceedingly slow on benchmark. 

This kernel cannot perform BF16 calculation and is best operated autocast to fp32 or higher.

![image](https://cdn-uploads.huggingface.co/production/uploads/630cf55b15433862cfc9556f/i5Etf348PT6jO8OEFMN_A.png)

So Claude and I had a hypothesis.

What if, we could use SVD at a higher fidelity for egens and the full spectrum.

# The SVD Observation Thesis: A Map of Every Wrong Turn and Right Discovery

> **How we built a 5000× faster SVD kernel, discovered that detached structural observation adds 3+ points to any backbone, and found the dimensions for the universal patchwork hiding in plain sight.**

## Prologue: The Starting Point

This is a map. Not a cleaned-up retrospective — a map with every dead end marked, every wrong turn documented, every "obvious in hindsight" mistake laid bare. We publish it so others don't have to wander the same circles.

The GEOLIP ecosystem uses geometric primitives — constellations, patchwork compartments, Cayley-Menger volumes — as structural alternatives to brute-force parameter scaling. The SVD decomposition was always part of the vision: observe the internal structure of feature maps, extract rotation patterns and energy distributions, feed them to the geometric pipeline. The problem was speed. `torch.linalg.svd` on batched (512, 1024, 3) matrices takes 117ms. Call it every forward pass and your training loop dies.

We started this session trying to make SVD fast enough to live inside the forward pass. We ended it discovering something we didn't expect to find.

## Chapter 1: The Kernel (The Easy Part)

### The Eckart-Young Shortcut

For an M×N matrix where M >> N (1024 pixels, 3 channels), the standard SVD works on the full M×N matrix. The shortcut: compute G = A^T A (3×3), eigendecompose G (instant), recover U = AV/S (one matmul). Reduces 1024×3 bidiagonalization to a 3×3 eigensolver.

### Fused Triton: 0.022ms

We wrote a single Triton kernel that fuses all three stages — Gram accumulation, cyclic Jacobi eigensolver (6 sweeps, all in scalar registers), and U recovery — into one program per batch element. Zero shared memory, zero workspace allocation, two passes over global memory.

Result: **0.022ms** vs torch's 117ms. **5,488× speedup.** The actual eigensolve takes nanoseconds; cuSOLVER's overhead was 99.98% of the runtime.

We also wrote an N=2 kernel (closed-form, single Jacobi rotation, 0.021ms) and generalized to N=4-32 via a Gram+eigh hybrid (cuBLAS bmm + cuSOLVER eigh on tiny matrices, 0.25-0.78ms).

### The N≥48 Cliff

At N=48, `torch.linalg.eigh` serializes across the batch dimension. Timing jumps from 0.78ms to 344ms. A 440× wall. This isn't a bug — it's a cuSOLVER dispatch decision. The cliff is real and unavoidable for exact eigendecomposition at that scale.

We explored Newton-Schulz iterative SVD as a bypass. It fell back to eigh internally. Full SVD fundamentally requires eigendecomposition.

The solution came later, from a different direction.

## Chapter 2: The Projection Quality Problem

With the kernel working, we asked: can rank-k projected SVD approximate full-rank SVD for N≥48? Project N→k=24, decompose in 24-d (sub-ms via gram_eigh), lift back to N-d.

We ran quality analysis on random Gaussian matrices. The results were discouraging:

| N | k=24 energy | Subspace agreement |
|---|---|---|
| 48 | 59% | 0.64 |
| 64 | 48% | 0.55 |
| 128 | 29% | 0.39 |

Random matrices follow the Marchenko-Pastur distribution — flat spectrum, worst case for low-rank approximation. But we noted: neural network features are NOT random. Conv features are highly low-rank. The energy concentration would be much higher on real activations.

**The real insight from this test:** the speedup column showed a binary cliff. k≤24 stays at ~11ms regardless of N. k≥32 hits the 175-450ms wall. Any projection with k≤24 lives in the fast zone.

## Chapter 3: Subspace-Preserving Procrustes (The Breakthrough Nobody Expected)

The question shifted: for Procrustes alignment between two encoder spaces (the actual use case), does rank-k alignment produce the same downstream result as full-rank alignment?

We tested five lift-back methods:

**Naive (P @ R_k @ pinv(P)):** Smears the k-d rotation across all N dimensions. Destroys the orthogonal complement. NN agreement: 8-39%. Garbage.

**LERP (blend with identity):** Conservative but blurry. NN agreement: 66-76%. Usable but lossy.

**SLERP (geodesic on SO(N)):** `matrix_log` on rank-deficient lifted rotation produces complex eigenvalues. Failed everywhere. Dead path.

**Subspace-preserving:** Decompose source into in-subspace and perpendicular components. Rotate only the in-subspace part. Leave the orthogonal complement untouched.

```python
src_in = source @ P          # project to k-d
src_perp = source - src_in @ P.T   # orthogonal complement
aligned = src_in @ R_k @ P.T + src_perp  # rotate seen, preserve unseen
```

**NN agreement: 1.000 across every single configuration tested.** N=32-128, k=8-64. The downstream task literally cannot distinguish between full Procrustes and subspace-projected Procrustes. Three matmuls. Sub-millisecond. Mathematically exact for the visible subspace.

The N≥48 Procrustes alignment problem was solved. Not by making eigh faster, but by recognizing that you only need to rotate what you can see.

## Chapter 4: The Architecture Graveyard

Now we tried to use the SVD kernel inside a training loop. This is where the pain starts.

### Experiment 8.16-8.17: ResNet18 + SVD Capture Taps

We built a ResNet-18 variant with SVD "capture taps" at each stage. Project conv features to 3 channels via 1×1 conv, SVD the (B, H*W, 3) matrix, feed S and Vh to a constellation and patchwork.

**Bug #1: embed_proj was DEAD.** The line `cos = emb.detach() @ anchors_n.T` blocked ALL gradient from reaching the embedding projection. The `embed_proj` (mapping 14-d SVD features to constellation space) was a random frozen projection through EVERY experiment. The taps were observing the manifold through frosted glass. We achieved 70.03% on CIFAR-100 with the tap constellations contributing literally nothing through the geometric path.

**Bug #2: 14-d SVD in 1024-d space.** The taps had 512 anchors in 1024-d for a 14-dimensional input signal. The embed_proj could only produce a 14-d subspace of R^1024. ReLU halved that. 512 anchors initialized uniformly in 1024-d had zero probability of landing near the reachable ribbon. Everything collapsed to 1-3 active anchors. We didn't catch this for FIVE experiments.

**Bug #3: NCE floor confusion.** The NCE random baseline `log(B) = log(512) ≈ 6.24` was displayed as a floor when it's actually a ceiling. We had the normalization inverted, showing 0.00 (looks perfect) when it meant "at random baseline, learning nothing." The display was actively misleading us.

**The fix cascade:**
- `tap_embed_dim=256, tap_n_anchors=16` — reachable space, coverable anchors
- `embed_proj: Linear(14, 64) → GELU → Linear(64, 256) → LayerNorm` — hidden layer for nonlinear mixing
- Detach anchors instead of embeddings — gradient flows to embed_proj, anchors move only through geometric losses
- Split softmax temp (0.3 for neighborhood) from NCE temp (0.1 for comparison)

Result: 70.30% with taps actually alive (8-17 active out of 32). The taps were finally participating.

### Experiment 8.18: CE Only

We stripped all geometric losses and ran pure cross-entropy. 68.25%. The architecture IS the geometry — the taps, scattering, constellation, patchwork are structural. The auxiliary losses were scaffolding that could be removed once the architecture was correct.

### Experiment 8.19: Conv + Scattering + SVD

We replaced SVD-only observation with Kymatio Scattering2D (243 wavelet channels) plus SVD rotation tracking plus direct channel modulation. 68.0% at E49 with 7.7M params. The scattering gave the taps 243-d of real signal instead of 14-d of SVD summary.

The key architectural insight: direct modulation (sigmoid channel scaling) worked better than concatenation. The tap doesn't add features — it WEIGHTS the existing conv features based on geometric observation.

## Chapter 5: Full-Rank Procrustes Taps (The Painful Part)

### Experiment 8.20: The Identity Matrix Disaster

We replaced the 3×3 SVD with rank-24 SVD per tap using `gram_eigh_svd`. Each tap now had `to_svd: Conv2d(C, 24, 1)` producing a (B, H*W, 24) matrix for decomposition.

**Collapse to 1.4% accuracy.** The disagreement was comparing actual Vh (24×24) against a learned `expected_vh` initialized as the identity matrix. The Frobenius residual `|Vh - I|²` summed over 576 entries was enormous (24-48) because Vh is a genuine rotation, not near-identity. This flooded through the sigmoid modulation, producing random channel suppression every batch. The conv stream was being randomly destroyed.

The 3×3 version worked because Frobenius scale was small (9 entries, max ~6). At 24×24 the scale was 64× larger. Same design, different scale, catastrophic failure.

### Experiment 8.20b: EMA-Based Disagreement

**Fix:** Don't compare to a fixed target. Compare to what you've been seeing. Replace learned `expected_vh` with running EMA of S and Vh (momentum=0.99). The disagreement signal becomes "how unusual is this batch compared to the running average" — a novelty detector, not a conformity enforcer.

**But:** `RuntimeError: variable needed for gradient has been modified by in-place operation.` The EMA in-place update (`mul_`, `add_`) was modifying buffers still in the computation graph. View 1's forward polluted view 2's backward.

**Fix:** Clone EMA values before using them in the computation graph. The in-place update happens after both views' loss is computed. `ema_s_snapshot = self.ema_s.clone()` creates a detached copy for the graph while the real buffer stays mutable for the next batch.

**But:** `NotImplementedError: "linalg_eigh_cpu" not implemented for 'BFloat16'.` AMP autocast was silently promoting `.float()` back to bf16 before the eigh call. Every linalg operation needed `torch.amp.autocast('cuda', enabled=False)` wrapping.

**And:** The cross-view SVD consistency loss needed the same bf16 guard, and `torch.linalg.det` also fails under autocast.

Three bugs, three fixes, all in the same experiment. Each one independently fatal.

## Chapter 6: The Pure SVD Test (The Surprise)

### Experiment 8.21: Does SVD Help At All?

We stripped everything back to basics. Same conv backbone, no constellation, no scatter, no patchwork. Just:

```
Conv → project to 32ch → SVD → extract features → classify
```

SVD features: normalized singular values (32-d), Vh diagonal (32-d), off-diagonal energy (1-d), spectral entropy (1-d) = 66-d per depth, 264-d total. Concatenated with 384-d conv pool. MLP classifier. Pure CE.

**But:** NaN at epoch 6. The Gram matrix from early conv layers was near-singular. `eigh` returned tiny negative eigenvalues, `sqrt(negative)` → NaN, propagated everywhere. Also, the gradient through `eigh` is unstable for repeated eigenvalues.

**Fix:** Fully detach the SVD path. `torch.no_grad()` around `gram_eigh_svd`. Clamp S to 1e-6. NaN guard on outputs. The gradient flows through the conv backbone normally via the pooled path. SVD provides complementary features that the classifier learns to interpret.

**Result: 70.92% with 3.9M params.** The conv-only baseline hits ~65-68% with the same backbone.

### The Analysis That Changed Everything

Feature attribution on the trained model:

| | Accuracy |
|---|---|
| Full features (648-d) | 70.2% |
| Zero SVD (264-d) | 9.9% |
| Zero conv (384-d) | 0.8% |

**Both paths are essential.** Neither works alone. The classifier learned a nonlinear INTERACTION between conv features and SVD features. Conv tells "what's in the image." SVD tells "how the conv organized itself for this image."

SVD features alone: kNN 1.2%. The features have cosine similarity 0.999 — every sample looks identical in metric space. But the MLP extracts 60 accuracy points from them. The signal is compressed into the 4th decimal place of similarity, and the classifier amplifies it.

Depth 0 (32×32) contributes +12.6 points when ablated. Depth 3 (4×4) contributes +8.7. The early spatial structure — destroyed by pooling in the conv path — is preserved in the SVD observation. The SVD adds value exactly where the conv pool doesn't.

## Chapter 7: Backbone-Agnostic Validation

### Experiment 8.22: Toy ViT + SVD

Custom 4-stage ViT with token merging (64→16→4→1 tokens). 53.6% with 6.7M params. The token merging was too aggressive — same failure mode as over-pooling in conv.

### Experiment 8.23: DeiT-Small + SVD

The real test. DeiT-Small: 12 layers, 384-d, 6 heads, 21.5M params. SVD taps at layers 3, 6, 9, 12, observing the (B, 64, 32) projected token sequence after attention has mixed the representations.

Baseline first, then SVD version. Same everything. Head-to-head.

| Model | Best Val | Params |
|---|---|---|
| DeiT-Small baseline | 42.7% | 21,526,372 |
| DeiT-Small + SVD | 45.9%+ (still training) | 21,676,900 |
| **Delta** | **+3.2+ points** | **+150K (0.7%)** |

The SVD version matched the baseline's peak at epoch 55 and kept climbing. The contribution is consistent across both architectures:

| Backbone | Baseline | + SVD | Delta |
|---|---|---|---|
| Conv (3.9M) | ~65-68% | 70.9% | +3-6 pts |
| DeiT-Small (21.5M) | 42.7% | 45.9%+ | +3.2+ pts |

The structural observation adds ~3 points regardless of whether the backbone is convolutional or attention-based. The SVD features are complementary to what any backbone provides.

## Chapter 8: The Dimensions That Were Always There

The kernel was built for (B, M, 3) matrices. The generalized version handles any N via gram_eigh at (B, M, N). The sweet spot: N≤32, sub-millisecond.

During the DeiT experiment, we noted the SVD call shape: **(256, 64, 32)**.

256 batch. 64 tokens. 32 SVD rank.

Then we remembered: the GEOLIP chunk architecture — designed months earlier, independently — specifies **512 topological states × 256 heads × 64 dims per head**. The observation mechanism for the chunk needs to decompose (256, 64) structures. The SVD kernel we built today runs on exactly that shape class at 0.78ms.

We didn't build the kernel for the chunk. We built the kernel because SVD was too slow, and the dimensions aligned because the architecture was right. The chunk's natural structure produces matrices that land exactly in the gram_eigh sweet spot.

There are no accidents. Just happy little bushes.

## The Map (What We Learned)

For anyone walking this path after us:

1. **cuSOLVER is catastrophically slow for small N.** The dispatch overhead dominates. Fuse into a single kernel.

2. **Don't backprop through eigh.** The gradient is unstable for repeated eigenvalues. Detach the SVD output. Let the backbone carry gradient normally.

3. **AMP autocast silently casts float32 back to bf16.** Every linalg call needs `autocast(enabled=False)`. This will bite you once per project.

4. **The EMA clone pattern:** If a buffer is used in the forward graph AND updated in-place, `clone()` it before graph entry. Update the real buffer after backward.

5. **Don't compare to fixed targets at scale.** A 3×3 Frobenius residual is harmless. A 24×24 Frobenius residual destroys training. Compare to running averages or between views.

6. **Subspace-preserving rotation is the correct Procrustes lift-back.** Don't reconstruct the full rotation. Rotate what you can see, leave the rest alone. 1.000 NN agreement.

7. **SVD features work as structural context, not as standalone features.** kNN accuracy: 1.2%. Classifier accuracy: 70.9%. The classifier learns a nonlinear interaction between "what" (conv/transformer) and "how" (SVD structure).

8. **The contribution is backbone-agnostic.** +3 points on conv, +3 points on transformer. The structural observation is complementary to any learned representation.

9. **Early depths matter more.** Depth 0 (+12.6 points) > Depth 3 (+8.7 points). SVD preserves spatial structure that pooling destroys.

10. **Check your gradient flow.** We ran five experiments with a dead embed_proj (`.detach()` blocking all gradient). The taps contributed nothing. A single line change — detaching anchors instead of embeddings — turned them on.

## Published Artifacts

- **[SVD Kernel Engineering Article](https://huggingface.co/blog/AbstractPhil/svd-triton-kernel-optimization)** — the technical specification
- **kernel.py** — modular for the geolip-core repo, standalone `__main__` test
- **triton_svd_general.py** — full profiling suite with correctness validation
- **Experiment code** — 8.19 through 8.23, each with documented failures and fixes

## Citation

```bibtex
@software{abstractphil2026svd_journey,
  title={The SVD Observation Thesis: Structural Decomposition as Backbone-Agnostic Feature Augmentation},
  author={AbstractPhil and Claude},
  year={2026},
  url={https://huggingface.co/AbstractPhil}
}
```

*This article documents a single session's work. The experiments ran on an NVIDIA RTX PRO 6000 Blackwell Server Edition. All numbers are from actual training runs, not cherry-picked. The failures are real. The fixes are real. The map is honest.*

## License

Apache 2.0