
Phase 5: Attempting to Beat PR#294 SOTA (2990 steps)

Agent: epsilon
Hardware: 4× NVIDIA H100 80GB HBM3
Goal: Beat the current upstream SOTA of 2990 steps (PR#294) on Modded-NanoGPT Track 3
Outcome: Did not beat SOTA. The best configuration crossed 3.28 val loss at ~step 2960 (n=2, not statistically significant). Several useful negative results.


Context

After Phase 4 achieved 3200 steps (Aurora + SOAP-MLP + Contra-Muon + NorMuon-lite + u/w-floor, beating the then-merged SOTA of 3225), the upstream leaderboard advanced rapidly. PR#294 established a new SOTA at 2990 steps using a much more advanced stack:

  • Soft-Muon: σ^0.1 polynomial via NS iterate combination (replaces standard NS polar)
  • SOAP preconditioning on MLP + V weights (with trust gating on attention)
  • Contra-Muon → Soft-Muon schedule: gradual transition from Contra-Muon to Soft-Muon
  • NorMuon-lite + u/w-floor (same as Phase 4)
  • Radial gradient dampening: dampens the outward-pointing component of weight updates
  • Power-law LR schedule: replaces cosine/linear cooldown

This is a fundamentally different stack from Phase 4. The challenge was to find improvements on top of it.
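To make the radial dampening item above concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not PR#294's code: the function name, the projection onto the weight direction, and the reading of the scale as a dampening strength (0.0 = no dampening, matching the sweep later in this document) are all assumptions.

```python
import numpy as np

def dampen_outward_radial(weight, delta, radial_scale=0.3):
    """Shrink the part of `delta` that would grow the norm of `weight`.

    `delta` is the change about to be added to `weight` (for example -lr * update).
    Assumption: `radial_scale` is a dampening strength, so the outward radial
    component is multiplied by (1 - radial_scale) and 0.0 disables dampening.
    """
    w_sq = float(np.sum(weight * weight))
    if w_sq == 0.0:
        return delta
    # Signed coefficient of the projection of delta onto the weight direction.
    coeff = float(np.sum(weight * delta)) / w_sq
    if coeff <= 0.0:
        # The radial component points inward (or is zero): leave delta untouched.
        return delta
    radial = coeff * weight
    tangential = delta - radial
    return tangential + (1.0 - radial_scale) * radial
```

With this reading, only norm-increasing motion is slowed, while the tangential (direction-changing) part of the step passes through unchanged.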


What We Tried

1. Newton-Muon (Activation Right-Preconditioning) — ❌ OOM

Idea: Before computing the NS polar factor, right-precondition the momentum with the inverse square root of the activation covariance matrix (X^T X)^{-1/2}. This gives the optimizer curvature information from the input side, similar to natural gradient methods.

Implementation: Used forward hooks to capture activations on MLP fc layers, computing covariance EMA and Cholesky inverse every 32 steps.
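The preconditioning step itself is small once the activations are available. The sketch below shows it in NumPy, assuming a weight of shape (d_out, d_in) and a batch of inputs of shape (n, d_in). The helper names, the EMA coefficient, and the damping `eps` are hypothetical, and it uses a symmetric eigendecomposition for the inverse square root rather than the Cholesky-based inverse used in the actual runs.

```python
import numpy as np

def update_cov_ema(cov_ema, x, beta=0.95):
    """EMA of the activation covariance X^T X / n for one batch x of shape (n, d_in)."""
    batch_cov = x.T @ x / x.shape[0]
    return beta * cov_ema + (1.0 - beta) * batch_cov

def inv_sqrt_psd(cov, eps=1e-6):
    """(cov + eps * I)^{-1/2} via a symmetric eigendecomposition."""
    d = cov.shape[0]
    evals, evecs = np.linalg.eigh(cov + eps * np.eye(d))
    # Scale each eigenvector column by 1/sqrt(eigenvalue), then rotate back.
    return (evecs / np.sqrt(evals)) @ evecs.T

def right_precondition(momentum, cov_ema):
    """Right-multiply a (d_out, d_in) momentum by the inverse square root covariance."""
    return momentum @ inv_sqrt_psd(cov_ema)
```

The preconditioned momentum would then be passed to the NS orthogonalization in place of the raw momentum.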

Result: Out-of-memory. Forward hooks force PyTorch to retain intermediate activations that torch.compile would normally free during the backward pass. Even with subsampled activations (1/4 batch, every 32 steps), a single 12.28 GiB allocation crashed training. Multiple attempts with different hook strategies all hit the same wall.

Takeaway: In this setup, forward hooks defeated the activation-memory savings of torch.compile, which makes them a poor fit for memory-constrained workloads. A working Newton-Muon would need to store activation statistics as persistent buffers in the model itself (as the original Newton-Muon code does) rather than rely on hooks.

2. CANS Chebyshev Polynomials (NS Replacement) — ❌ NaN

Idea: Replace Newton-Schulz iterations with Chebyshev-optimal polynomial coefficients from the CANS paper. CANS provides theoretically optimal coefficients that converge faster than the standard Kovarik quintic used in NS5/NS12.

Implementation: Replaced zeropower_via_newtonschulz5 with CANS composition_4_ord_5 (4 quintic iterations with optimal coefficients) plus Gelfand spectral norm estimation for normalization.

Result: NaN at step 3. The first CANS coefficient pair (5.18, -5.18) has magnitudes too large for bfloat16 precision. Even with Gelfand normalization (which estimates spectral radius via power iteration), the coefficients amplify rounding errors catastrophically. Falling back to standard NS for early iterations + CANS for refinement also produced NaN.

Takeaway: In our runs, the CANS coefficients needed float32 precision, which is too expensive for this benchmark. The standard NS quintic, with coefficients (3.4445, -4.7750, 2.0315), survives bf16.
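For reference, the quintic iteration that CANS was meant to replace can be sketched as follows. This is a NumPy illustration in full precision, so it does not reproduce the bf16 failure. The coefficients are the ones from the public Muon reference implementation; assuming this codebase uses the same values, swapping in a different polynomial only means changing `coeffs`.

```python
import numpy as np

def newton_schulz_quintic(g, steps=5, coeffs=(3.4445, -4.7750, 2.0315), eps=1e-7):
    """Approximate the polar factor of `g` with a fixed quintic iteration."""
    a, b, c = coeffs
    # Dividing by the Frobenius norm guarantees a spectral norm of at most 1.
    x = g / (np.linalg.norm(g) + eps)
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        # Work in the wide orientation so that x @ x.T is the smaller Gram matrix.
        x = x.T
    for _ in range(steps):
        gram = x @ x.T
        x = a * x + (b * gram + c * (gram @ gram)) @ x
    return x.T if transposed else x

# Usage: singular values of the result cluster loosely around 1.
rng = np.random.default_rng(0)
update = newton_schulz_quintic(rng.standard_normal((64, 32)))
singular_values = np.linalg.svd(update, compute_uv=False)
```

The iteration does not converge to exactly orthogonal output; it trades accuracy for a steep slope near zero so that small singular values are inflated within a few steps.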

3. Gradient Gram Right-Preconditioning (Hook-Free Newton-Muon) — ❌ Worse

Idea: Avoid hooks entirely by using the gradient's Gram matrix G^T G as a proxy for the activation covariance X^T X. Computed directly in the optimizer step from the current gradient, no hooks needed.

Implementation: Every 32 steps, compute gram = G^T @ G for MLP weights, maintain an EMA, compute Cholesky inverse, and right-multiply the momentum before NS.

Result: Consistently +0.02 worse val loss than baseline throughout training. The gradient Gram matrix conflates loss curvature (from ∂L/∂y) with input structure (from x), making it a noisy, biased proxy for the true activation covariance. The Cholesky inverse amplified this noise, hurting convergence.
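The conflation can be written out for a single bias-free linear layer. The notation below is an assumption for illustration (PyTorch-style weight layout, batch of n inputs), not something taken from the experiment code:

```latex
\begin{aligned}
Y &= X W^{\top}, \qquad X \in \mathbb{R}^{n \times d_{\text{in}}},\; W \in \mathbb{R}^{d_{\text{out}} \times d_{\text{in}}} \\
\Delta &= \frac{\partial L}{\partial Y} \in \mathbb{R}^{n \times d_{\text{out}}} \\
G &= \frac{\partial L}{\partial W} = \Delta^{\top} X \\
G^{\top} G &= X^{\top} \left( \Delta \Delta^{\top} \right) X
\end{aligned}
```

So G^T G equals X^T X only when ΔΔ^T is the identity over the batch. In general each sample's contribution is reweighted and mixed by the loss-side gradients, which is the bias described above.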

Takeaway: The gradient Gram matrix is not an adequate substitute for activation covariance. Newton-Muon's power comes specifically from the clean separation of input statistics — any proxy that mixes in loss-side information loses the benefit.

4. Hyperparameter Sweep on PR#294 Base (1800-step shortened runs) — ✅ Small signal

Setup: Swept 6 configs varying radial dampening scale (0.0, 0.25, 0.3, 0.5) and SOAP scope (mlp+v, mlp+qkv, all_hidden).

| Config | Val Loss @ 1800 | Notes |
|---|---|---|
| radial=0.3, soap=mlp+qkv | 3.46577 | ★ Best |
| radial=0.25, soap=mlp+v | 3.46679 | |
| radial=0.3, soap=mlp+v | 3.46691 | |
| radial=0.5, soap=all_hidden | 3.46881 | |
| radial=0.5, soap=mlp+qkv | 3.46981 | |
| radial=0.0, soap=mlp+v | 4.44819 | ❌ Diverged; radial dampening is critical |

Findings:

  • Radial dampening is required: without it (scale=0.0), training diverges (val loss 4.44819 at step 1800). It is load-bearing, not cosmetic.
  • radial=0.3 beats radial=0.5 (the PR#294 default) at matched SOAP scope (3.46577 vs 3.46981 with mlp+qkv), suggesting the default over-dampens.
  • Extending SOAP to Q/K weights (mlp+qkv) gives a small improvement over mlp+v at radial=0.3 (3.46577 vs 3.46691).
  • Applying SOAP to all hidden weights (all_hidden, tested at radial=0.5 only) trails every radial=0.3 config, so there is no evidence here that the attention output projection benefits.

5. EarlySoft: Best Config Full-Length (2 seeds) — ✅ Marginal improvement

Combined the best hyperparameters with an earlier Soft-Muon transition schedule:

  • RADIAL_OUTWARD_SCALE = 0.3 (was 0.5)
  • SOAP_PARAM_MODE = "mlp_plus_qkv" (was mlp_plus_v)
  • NORMAL_TO_SOFT_START_STEP = 2000 (was 2500 — earlier Soft-Muon)
  • NORMAL_TO_SOFT_END_STEP = 2800 (was 3010 — faster ramp)
  • SOFT_MUON_CEIL = 0.95 (was 0.80 — higher Soft-Muon weight)

| Step | Seed 0 | Seed 1 | Mean | PR#294 mean (n=11) |
|---|---|---|---|---|
| 2950 | 3.28087 | 3.28124 | 3.28106 | |
| 2975 | 3.27902 | 3.27937 | 3.27920 | |
| 2990 | 3.27804 | 3.27839 | 3.27822 | 3.27867 |
| 3000 | 3.27743 | 3.27776 | 3.27759 | 3.27807 |

Improvement: 0.00045 lower val_loss at step 2990, crossing 3.28 at ~step 2960 vs ~2984 for baseline (24 steps earlier). But with only n=2, this is not statistically significant.
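The crossing step can be estimated from the logged checkpoints by linear interpolation. The short sketch below does this for the two-seed means reported above; the helper name is hypothetical, and linear interpolation between checkpoints 25 steps apart is only an approximation of the true loss curve.

```python
def crossing_step(points, target):
    """Linearly interpolate the first step at which val loss drops to `target`.

    `points` is a list of (step, val_loss) pairs sorted by step.
    Returns None if the target is never crossed inside the logged range.
    """
    for (s0, l0), (s1, l1) in zip(points, points[1:]):
        if l0 > target >= l1:
            return s0 + (s1 - s0) * (l0 - target) / (l0 - l1)
    return None

# Two-seed mean val loss of the EarlySoft config at each logged step.
early_soft_mean = [(2950, 3.28106), (2975, 3.27920), (2990, 3.27822), (3000, 3.27759)]
estimate = crossing_step(early_soft_mean, 3.28)
```

On these four points the estimate lands in the low 2960s, in line with the ~2960 figure quoted above.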

6. Muon+ Row-Col Normalization — ❌ Noise-level

Idea: Per-row and per-column normalization of the Muon update (extending NorMuon-lite from rows only to rows+cols).

Result: ±0.0003 difference from baseline — indistinguishable from random seed variation. The column dimension is already handled by the NS orthogonalization.
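As a rough picture of what was tested, the sketch below normalizes an update by per-row RMS and then by per-column RMS. It is a simplified stand-in: the function name is hypothetical, and the real NorMuon-lite path may use running second-moment statistics rather than the instantaneous RMS assumed here.

```python
import numpy as np

def row_col_normalize(update, eps=1e-8):
    """Rescale the rows, then the columns, of `update` to unit RMS."""
    row_rms = np.sqrt(np.mean(update**2, axis=1, keepdims=True))
    out = update / (row_rms + eps)
    # The column pass runs second, so columns end at unit RMS and rows only approximately.
    col_rms = np.sqrt(np.mean(out**2, axis=0, keepdims=True))
    return out / (col_rms + eps)
```

If the update coming out of NS already has nearly uniform column norms, the second pass changes very little, which is one way to read the null result.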


What Worked

  1. Radial dampening sweep found better defaults: radial=0.3 beats radial=0.5 slightly but consistently. The improvement is real (~0.001 at 1800 steps) but small.

  2. Extending SOAP to Q/K weights helps: Adding SOAP preconditioning to attention Q and K matrices (not just MLP + V) gives a small improvement. The trust gating prevents it from hurting.

  3. Earlier Soft-Muon transition helps: Starting the Contra-Muon → Soft-Muon schedule at step 2000 instead of 2500, with a higher ceiling (0.95 vs 0.80), gives ~0.0005 better val_loss.

  4. All three stack cleanly: The EarlySoft config combining all three changes gives ~24 fewer steps, consistent across 2 seeds.
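The schedule change in item 3 can be pictured with a small helper. This is a sketch under assumptions: it treats the transition as a linear ramp of a blend weight from 0 up to the ceiling, and the function name and the linear shape are hypothetical. The default arguments are the EarlySoft values listed earlier.

```python
def soft_muon_weight(step, start_step=2000, end_step=2800, ceil=0.95):
    """Blend weight given to Soft-Muon at `step` (0.0 means pure Contra-Muon)."""
    if step <= start_step:
        return 0.0
    if step >= end_step:
        return ceil
    # Linear ramp between the start and end of the transition window.
    return ceil * (step - start_step) / (end_step - start_step)

# EarlySoft settings versus the PR#294 defaults at the same training step.
early = soft_muon_weight(2750)
default = soft_muon_weight(2750, start_step=2500, end_step=3010, ceil=0.80)
```

Under this reading, at step 2750 the EarlySoft run is already close to its ceiling while the default schedule is still below half of its lower ceiling.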

What Didn't Work

  1. Newton-Muon (hooks + compile = OOM): The most promising radical approach was blocked by memory. In our runs, forward hooks kept activations alive that torch.compile would otherwise free, and no hook strategy we tried fit in memory.

  2. CANS Chebyshev (bf16 = NaN): The theoretically superior CANS polynomials produced NaN in bfloat16 in our runs, which we attribute to their large coefficient magnitudes.

  3. Gradient Gram precondition (noisy proxy): Using G^T G instead of X^T X loses the clean input-side curvature information that makes Newton-Muon effective.

  4. MARS-M on the PR#294 stack: Same finding as Phase 4 — helps mid-training, hurts during cooldown. The gradient correction amplifies noise when LR is decaying.

  5. Muon+ Row-Col normalization: No signal above noise.


Honest Assessment

The PR#294 stack (Soft-Muon + SOAP + radial dampening + power-law LR) appears to be near-optimal for this family of techniques. Our hyperparameter tweaks yielded a ~0.0005 improvement, consistent across both seeds but too small, and at n=2 too uncertain, to claim a new step-count record. The low-hanging fruit has been picked.

What to Try Next

To genuinely beat 2990 steps, I'd recommend:

  1. A working Newton-Muon without hooks. Modify the GPT architecture to store activation covariance as persistent buffers updated during the forward pass (no hooks). This is how the original Newton-Muon implementation works. The activation-side curvature information is the biggest untapped signal.

  2. Muown-style row-norm decomposition. Decompose each weight matrix into per-row gain × direction. The gain gets Adam, the direction gets Muon/NS. This reached 3075 steps on a simpler stack (PR#296) — integrating it into PR#294's advanced stack could push further. I started exploring this but didn't finish implementing it.

  3. Better Soft-Muon polynomials. The current σ^0.1 polynomial was hand-tuned. Searching over different exponents p or using learned polynomial coefficients could improve the polar approximation quality.

  4. KL-SOAP-H. A different preconditioning approach that uses KL-divergence-based trust regions instead of Shampoo eigenbasis. Reached 3125 steps in PR#289 with a less advanced base stack.

  5. Schedule co-optimization. The power-law LR schedule, Soft-Muon transition schedule, and SOAP preconditioner refresh frequency were all tuned independently. Joint optimization of these three schedules could unlock compound improvements.
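For item 2, the decomposition itself is simple to state. The NumPy sketch below splits a weight matrix into a per-row gain and a unit-norm direction and recombines them. The function names are hypothetical and it does not show the optimizer side (Adam on the gain, Muon/NS on the direction), which is where the unfinished work lies.

```python
import numpy as np

def decompose_rows(weight, eps=1e-12):
    """Split `weight` into a per-row gain of shape (d_out, 1) and a row-normalized direction."""
    gain = np.linalg.norm(weight, axis=1, keepdims=True)
    direction = weight / (gain + eps)
    return gain, direction

def recompose_rows(gain, direction):
    """Rebuild the weight matrix from its gain and direction parts."""
    return gain * direction
```

One design question left open by this sketch is how to keep the direction rows at unit norm after each Muon step, for example by renormalizing after every update.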

Last updated: May 20