Add comment explaining _coeffs_list and Polar Express vs former NS [skip-build]

Documents what _coeffs_list is (precomputed Polar Express coefficients, minimax-optimal
via Remez/equioscillation) and contrasts it with the former hardcoded NS coefficients:
the former coefficients produced US'V^T with scattered singular values, whereas Polar
Express converges to the polar factor UV^T. Also removes the unused loguru import.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Files changed (1)

torch-ext/optimizer/newton_schulz.py +12 -1

@@ -4,7 +4,6 @@ from math import inf, sqrt
 import numpy as np
 from .matmul_transpose_triton import matmul_transpose_assign
-from loguru import logger
 COMM_DTYPE = torch.bfloat16
 DEFAULT_CHUNK_SIZE_RATIO = 4
@@ -78,6 +77,18 @@ def _optimal_composition(l, num_iters, safety_factor_eps=0, cushion=0):
     return coefficients
+# Precomputed Polar Express coefficients (a, b, c) for 10 quintic Newton-Schulz
+# iterations. Each tuple is the minimax-optimal (Remez/equioscillation) odd quintic
+# approximant to x->1 over the current singular-value interval, computed once at
+# import time and reused across all optimizer steps.
+#
+# Contrast with the former hardcoded NS coefficients (5 fixed tuples):
+#   - Former: empirically tuned to maximize slope at zero; did not converge
+#     singular values to 1, yielding US'V^T with S' ~ Uniform(0.5, 1.5) instead
+#     of the true polar factor UV^T.
+#   - Polar Express: analytically optimal per step, adapting to the shrinking
+#     singular-value interval [l, u] as iterations progress; converges all
+#     singular values to 1, producing the exact polar factor UV^T.
 _coeffs_list = _optimal_composition(l=1e-3, num_iters=10, safety_factor_eps=1e-2, cushion=0.02)
 # This code is adapted from:
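
To make the contrast in the comment concrete, here is a minimal NumPy sketch of how a list of (a, b, c) tuples drives a quintic Newton-Schulz iteration. It is not the code in newton_schulz.py: the helper name `quintic_ns` is hypothetical, and the two tuples are assumptions chosen for illustration. `CLASSICAL` is the textbook Taylor-based quintic and `TUNED` is the tuple commonly quoted for the Muon optimizer. Neither is this repository's former coefficients or the Polar Express values returned by `_optimal_composition`.

```python
import numpy as np


def quintic_ns(X, coeffs):
    """Apply X <- a*X + b*(X X^T) X + c*(X X^T)^2 X once per (a, b, c) tuple.

    Hypothetical helper for illustration. On X = U S V^T each step maps every
    singular value s to a*s + b*s**3 + c*s**5 and leaves U and V unchanged.
    """
    # The Frobenius norm bounds the spectral norm, so all singular values are <= 1.
    X = X / (np.linalg.norm(X) + 1e-7)
    for a, b, c in coeffs:
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X


# Assumption: textbook Taylor-based quintic. a + b + c == 1, so s = 1 is a fixed point.
CLASSICAL = (15 / 8, -10 / 8, 3 / 8)
# Assumption: tuned tuple commonly quoted for Muon. a + b + c is about 0.701,
# so s = 1 is not a fixed point and singular values cannot settle at 1.
TUNED = (3.4445, -4.7750, 2.0315)

rng = np.random.default_rng(0)
G = rng.standard_normal((6, 8))

# Reference polar factor U V^T from a full SVD.
U, _, Vt = np.linalg.svd(G, full_matrices=False)
polar = U @ Vt

# With a fixed-point-preserving tuple, enough steps recover the polar factor.
X = quintic_ns(G, [CLASSICAL] * 40)
err_classical = np.abs(X - polar).max()

# With the tuned tuple the singular values stay scattered instead of converging.
sv_tuned = np.linalg.svd(quintic_ns(G, [TUNED] * 5), compute_uv=False)
print(err_classical, sv_tuned.min(), sv_tuned.max())
```

The classical tuple needs many steps because it only grows small singular values by a factor of about 1.875 per step. Choosing a different tuple per step for the current interval [l, u], as the comment describes, is what aims to reach the same limit in far fewer iterations.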