| ================================================================================ |
| Generalized Batched Thin SVD - Profiling Suite |
| Device: NVIDIA RTX PRO 6000 Blackwell Server Edition |
| ================================================================================ |
|
|
| ====================================================================== |
| CORRECTNESS VALIDATION (B=64, M=1024) |
| ====================================================================== |
| [auto] N= 2: S_err=1.91e-05 recon=9.54e-07 (ref=4.83e-06) orth=1.43e-06 desc=True [PASS] |
| [triton] N= 2: S_err=1.91e-05 recon=9.54e-07 (ref=4.83e-06) orth=1.43e-06 desc=True [PASS] |
| [auto] N= 3: S_err=4.01e-05 recon=2.38e-06 (ref=8.34e-06) orth=1.13e-06 desc=True [PASS] |
| [triton] N= 3: S_err=4.01e-05 recon=2.38e-06 (ref=8.34e-06) orth=1.13e-06 desc=True [PASS] |
| [auto] N= 4: S_err=4.01e-05 recon=2.38e-06 (ref=9.06e-06) orth=1.73e-06 desc=True [PASS] |
| [gram] N= 4: S_err=4.01e-05 recon=2.38e-06 (ref=9.06e-06) orth=1.73e-06 desc=True [PASS] |
| [auto] N= 5: S_err=5.15e-05 recon=3.81e-06 (ref=9.30e-06) orth=1.79e-06 desc=True [PASS] |
| [gram] N= 5: S_err=5.15e-05 recon=3.81e-06 (ref=9.30e-06) orth=1.79e-06 desc=True [PASS] |
| [auto] N= 6: S_err=6.29e-05 recon=2.86e-06 (ref=1.24e-05) orth=1.67e-06 desc=True [PASS] |
| [gram] N= 6: S_err=6.29e-05 recon=2.86e-06 (ref=1.24e-05) orth=1.67e-06 desc=True [PASS] |
| [auto] N= 8: S_err=9.54e-05 recon=3.58e-06 (ref=1.50e-05) orth=1.67e-06 desc=True [PASS] |
| [gram] N= 8: S_err=9.54e-05 recon=3.58e-06 (ref=1.50e-05) orth=1.67e-06 desc=True [PASS] |
| [newton] N= 8: S_err=9.54e-05 recon=3.58e-06 (ref=1.50e-05) orth=1.67e-06 desc=True [PASS] |
| [auto] N= 10: S_err=8.39e-05 recon=4.05e-06 (ref=1.41e-05) orth=1.67e-06 desc=True [PASS] |
| [gram] N= 10: S_err=8.39e-05 recon=4.05e-06 (ref=1.41e-05) orth=1.67e-06 desc=True [PASS] |
| [newton] N= 10: S_err=8.39e-05 recon=4.05e-06 (ref=1.41e-05) orth=1.67e-06 desc=True [PASS] |
| [auto] N= 16: S_err=1.41e-04 recon=4.29e-06 (ref=2.57e-05) orth=1.91e-06 desc=True [PASS] |
| [gram] N= 16: S_err=1.41e-04 recon=4.29e-06 (ref=2.57e-05) orth=1.91e-06 desc=True [PASS] |
| [newton] N= 16: S_err=1.41e-04 recon=4.29e-06 (ref=2.57e-05) orth=1.91e-06 desc=True [PASS] |
| [auto] N= 32: S_err=1.79e-04 recon=3.67e-06 (ref=3.17e-05) orth=2.03e-06 desc=True [PASS] |
| [gram] N= 32: S_err=1.79e-04 recon=3.67e-06 (ref=3.17e-05) orth=2.03e-06 desc=True [PASS] |
| [newton] N= 32: S_err=1.79e-04 recon=3.67e-06 (ref=3.17e-05) orth=2.03e-06 desc=True [PASS] |
| [auto] N= 48: S_err=3.05e-04 recon=4.24e-05 (ref=4.74e-05) orth=4.46e-06 desc=True [PASS] |
| [gram] N= 48: S_err=3.05e-04 recon=4.24e-05 (ref=4.74e-05) orth=4.46e-06 desc=True [PASS] |
| [newton] N= 48: S_err=3.05e-04 recon=4.24e-05 (ref=4.74e-05) orth=4.46e-06 desc=True [PASS] |
| [auto] N= 64: S_err=4.27e-04 recon=5.72e-05 (ref=6.32e-05) orth=5.24e-06 desc=True [PASS] |
| [gram] N= 64: S_err=4.27e-04 recon=5.72e-05 (ref=6.32e-05) orth=5.24e-06 desc=True [PASS] |
| [newton] N= 64: S_err=4.27e-04 recon=5.72e-05 (ref=6.32e-05) orth=5.24e-06 desc=True [PASS] |
| [auto] N= 96: S_err=1.17e-03 recon=1.07e-04 (ref=9.39e-05) orth=2.74e-06 desc=True [PASS] |
| [gram] N= 96: S_err=1.17e-03 recon=1.07e-04 (ref=9.39e-05) orth=2.74e-06 desc=True [PASS] |
| [newton] N= 96: S_err=1.17e-03 recon=1.07e-04 (ref=9.39e-05) orth=2.74e-06 desc=True [PASS] |
| [auto] N=128: S_err=1.42e-03 recon=1.63e-04 (ref=1.27e-04) orth=4.17e-06 desc=True [PASS] |
| [gram] N=128: S_err=1.42e-03 recon=1.63e-04 (ref=1.27e-04) orth=4.17e-06 desc=True [PASS] |
| [newton] N=128: S_err=1.42e-03 recon=1.63e-04 (ref=1.27e-04) orth=4.17e-06 desc=True [PASS] |
|
|
| ALL PASSED |
|
|
| ======================================================================================================================== |
| PROCRUSTES ALIGNMENT: 5 methods of applying rank-k rotation to N-d space |
| cos = mean cosine similarity after alignment (higher = better, full = ceiling) |
| NN = nearest-neighbor agreement with full Procrustes (1.0 = identical downstream) |
| ======================================================================================================================== |
|
|
| N=32: |
| k full pinv lerp (α) slerp (α) subspc stay_k │ nn_pv nn_lr nn_sl nn_ss |
| ───────────────────────────────────────────────────────────────────────────────────────────────────────── |
| 8 0.4359 0.2142 0.4248 0.3 0.2142 err 0.4299 0.4215 │ 0.177 0.681 0.177 1.000 |
| 16 0.4370 0.2967 0.4259 0.3 0.2967 err 0.4316 0.4252 │ 0.300 0.678 0.300 1.000 |
| 24 0.4405 0.3864 0.4365 0.3 0.3864 err 0.4369 0.4384 │ 0.555 0.772 0.555 1.000 |
|
|
| N=48: |
| k full pinv lerp (α) slerp (α) subspc stay_k │ nn_pv nn_lr nn_sl nn_ss |
| ───────────────────────────────────────────────────────────────────────────────────────────────────────── |
| 8 0.4421 0.1764 0.4306 0.3 0.1764 err 0.4350 0.4192 │ 0.102 0.702 0.102 1.000 |
| 16 0.4422 0.2494 0.4290 0.3 0.2494 err 0.4354 0.4292 │ 0.230 0.667 0.230 1.000 |
| 24 0.4432 0.3047 0.4294 0.3 0.3047 err 0.4366 0.4315 │ 0.326 0.676 0.326 1.000 |
| 32 0.4476 0.3621 0.4397 0.3 0.3621 err 0.4429 0.4425 │ 0.454 0.728 0.454 1.000 |
|
|
| N=64: |
| k full pinv lerp (α) slerp (α) subspc stay_k │ nn_pv nn_lr nn_sl nn_ss |
| ───────────────────────────────────────────────────────────────────────────────────────────────────────── |
| 8 0.4475 0.1602 0.4356 0.3 0.1602 err 0.4390 0.4323 │ 0.102 0.708 0.102 1.000 |
| 16 0.4444 0.2178 0.4300 0.3 0.2178 err 0.4355 0.4299 │ 0.164 0.658 0.164 1.000 |
| 24 0.4453 0.2678 0.4295 0.3 0.2678 err 0.4363 0.4332 │ 0.241 0.665 0.241 1.000 |
| 32 0.4468 0.3091 0.4324 0.3 0.3091 err 0.4390 0.4374 │ 0.312 0.680 0.312 1.000 |
|
|
| N=96: |
| k full pinv lerp (α) slerp (α) subspc stay_k │ nn_pv nn_lr nn_sl nn_ss |
| ───────────────────────────────────────────────────────────────────────────────────────────────────────── |
| 16 0.4267 0.1644 0.4035 0.3 0.1644 err 0.4077 0.4020 │ 0.132 0.721 0.132 1.000 |
| 24 0.4259 0.2023 0.4014 0.3 0.2023 err 0.4069 0.4034 │ 0.200 0.709 0.200 1.000 |
| 32 0.4241 0.2363 0.3996 0.3 0.2363 err 0.4057 0.4056 │ 0.241 0.688 0.241 1.000 |
| 48 0.4238 0.2978 0.4050 0.3 0.2978 err 0.4080 0.4139 │ 0.394 0.717 0.394 1.000 |
|
|
| N=128: |
| k full pinv lerp (α) slerp (α) subspc stay_k │ nn_pv nn_lr nn_sl nn_ss |
| ───────────────────────────────────────────────────────────────────────────────────────────────────────── |
| 16 0.4068 0.1380 0.3740 0.3 0.1380 err 0.3770 0.3763 │ 0.129 0.757 0.129 1.000 |
| 24 0.4072 0.1679 0.3733 0.3 0.1679 err 0.3774 0.3778 │ 0.169 0.739 0.169 1.000 |
| 32 0.4064 0.1860 0.3730 0.3 0.1860 err 0.3778 0.3736 │ 0.217 0.723 0.217 1.000 |
| 48 0.4073 0.2397 0.3783 0.3 0.2397 err 0.3812 0.3868 │ 0.310 0.733 0.310 1.000 |
| 64 0.4102 0.2781 0.3853 0.3 0.2781 err 0.3880 0.3937 │ 0.394 0.729 0.394 1.000 |
|
|
| ───────────────────────────────────────────────────────────────────────────────────────────────────────── |
| WINNER PER CONFIG (closest cos to full, highest NN agreement): |
| ───────────────────────────────────────────────────────────────────────────────────────────────────────── |
| N= 32 k= 8: best_cos=subspace (0.4299, gap=0.0060) best_nn=subspace (1.000) |
| N= 32 k= 16: best_cos=subspace (0.4316, gap=0.0054) best_nn=subspace (1.000) |
| N= 32 k= 24: best_cos=subspace (0.4369, gap=0.0037) best_nn=subspace (1.000) |
| N= 48 k= 8: best_cos=subspace (0.4350, gap=0.0071) best_nn=subspace (1.000) |
| N= 48 k= 16: best_cos=subspace (0.4354, gap=0.0068) best_nn=subspace (1.000) |
| N= 48 k= 24: best_cos=subspace (0.4366, gap=0.0066) best_nn=subspace (1.000) |
| N= 48 k= 32: best_cos=subspace (0.4429, gap=0.0047) best_nn=subspace (1.000) |
| N= 64 k= 8: best_cos=subspace (0.4390, gap=0.0085) best_nn=subspace (1.000) |
| N= 64 k= 16: best_cos=subspace (0.4355, gap=0.0089) best_nn=subspace (1.000) |
| N= 64 k= 24: best_cos=subspace (0.4363, gap=0.0090) best_nn=subspace (1.000) |
| N= 64 k= 32: best_cos=subspace (0.4390, gap=0.0078) best_nn=subspace (1.000) |
| N= 96 k= 16: best_cos=subspace (0.4077, gap=0.0190) best_nn=subspace (1.000) |
| N= 96 k= 24: best_cos=subspace (0.4069, gap=0.0190) best_nn=subspace (1.000) |
| N= 96 k= 32: best_cos=subspace (0.4057, gap=0.0184) best_nn=subspace (1.000) |
| N= 96 k= 48: best_cos=subspace (0.4080, gap=0.0158) best_nn=subspace (1.000) |
| N=128 k= 16: best_cos=subspace (0.3770, gap=0.0298) best_nn=subspace (1.000) |
| N=128 k= 24: best_cos=subspace (0.3774, gap=0.0298) best_nn=subspace (1.000) |
| N=128 k= 32: best_cos=subspace (0.3778, gap=0.0286) best_nn=subspace (1.000) |
| N=128 k= 48: best_cos=subspace (0.3812, gap=0.0261) best_nn=subspace (1.000) |
| N=128 k= 64: best_cos=subspace (0.3880, gap=0.0222) best_nn=subspace (1.000) |
|
|
| ==================================================================================================== |
| PROJECTION QUALITY ANALYSIS - B=256, M=1024 |
| Question: can rank-k SVD approximate rank-N SVD? |
| ==================================================================================================== |
|
|
| N=32: |
| k Energy% Recon_proj Recon_trunc S_rel_err Subspace Proj ms Full ms Speedup |
| ──────────────────────────────────────────────────────────────────────────────────────────────── |
| 8 30.99% 8.65e-01 8.31e-01 0.5622 0.4432 7.849ms 0.508ms 0.1x |
| 12 44.74% 7.89e-01 7.43e-01 0.4606 0.5508 10.556ms 0.508ms 0.0x |
| 16 57.56% 7.05e-01 6.51e-01 0.3379 0.6432 11.222ms 0.508ms 0.0x |
| 24 80.59% 4.41e-01 4.41e-01 0.0000 1.0000 0.510ms 0.508ms 1.0x |
|
|
| N=48: |
| k Energy% Recon_proj Recon_trunc S_rel_err Subspace Proj ms Full ms Speedup |
| ──────────────────────────────────────────────────────────────────────────────────────────────── |
| 8 22.33% 9.11e-01 8.81e-01 0.7880 0.3642 7.901ms 172.136ms 21.8x |
| 12 32.39% 8.65e-01 8.22e-01 0.6575 0.4454 10.668ms 172.136ms 16.1x |
| 16 41.87% 8.15e-01 7.62e-01 0.4125 0.5193 11.490ms 172.136ms 15.0x |
| 24 59.24% 7.05e-01 6.38e-01 0.3178 0.6433 11.497ms 172.136ms 15.0x |
| 32 74.71% 5.76e-01 5.03e-01 0.3076 0.7575 180.615ms 172.136ms 1.0x |
|
|
| N=64: |
| k Energy% Recon_proj Recon_trunc S_rel_err Subspace Proj ms Full ms Speedup |
| ──────────────────────────────────────────────────────────────────────────────────────────────── |
| 8 17.83% 9.34e-01 9.06e-01 0.9635 0.3152 7.917ms 182.058ms 23.0x |
| 12 25.91% 9.00e-01 8.61e-01 0.6937 0.3898 10.693ms 182.058ms 17.0x |
| 16 33.58% 8.64e-01 8.15e-01 0.6025 0.4484 11.311ms 182.058ms 16.1x |
| 24 47.78% 7.89e-01 7.23e-01 0.3495 0.5505 11.207ms 182.058ms 16.2x |
| 32 60.64% 7.05e-01 6.27e-01 0.3116 0.6438 176.453ms 182.058ms 1.0x |
| 48 82.74% 4.99e-01 4.15e-01 0.3090 0.8138 204.625ms 182.058ms 0.9x |
|
|
| N=96: |
| k Energy% Recon_proj Recon_trunc S_rel_err Subspace Proj ms Full ms Speedup |
| ──────────────────────────────────────────────────────────────────────────────────────────────── |
| 8 13.09% 9.56e-01 9.32e-01 1.2033 0.2583 8.035ms 295.451ms 36.8x |
| 16 24.83% 9.11e-01 8.67e-01 0.8721 0.3637 11.426ms 295.451ms 25.9x |
| 24 35.57% 8.64e-01 8.02e-01 0.5587 0.4475 11.238ms 295.451ms 26.3x |
| 32 45.45% 8.15e-01 7.38e-01 0.4710 0.5163 175.186ms 295.451ms 1.7x |
| 48 62.97% 7.05e-01 6.08e-01 0.3243 0.6407 200.525ms 295.451ms 1.5x |
| 64 77.83% 5.75e-01 4.71e-01 0.3073 0.7578 306.531ms 295.451ms 1.0x |
|
|
| N=128: |
| k Energy% Recon_proj Recon_trunc S_rel_err Subspace Proj ms Full ms Speedup |
| ──────────────────────────────────────────────────────────────────────────────────────────────── |
| 8 10.60% 9.68e-01 9.46e-01 1.4678 0.2251 8.085ms 436.551ms 54.0x |
| 16 20.19% 9.34e-01 8.93e-01 1.0025 0.3145 11.509ms 436.551ms 37.9x |
| 24 29.04% 9.00e-01 8.42e-01 0.7155 0.3867 11.432ms 436.551ms 38.2x |
| 32 37.26% 8.64e-01 7.92e-01 0.5374 0.4447 174.994ms 436.551ms 2.5x |
| 48 52.05% 7.89e-01 6.93e-01 0.3598 0.5498 198.286ms 436.551ms 2.2x |
| 64 64.91% 7.05e-01 5.92e-01 0.3121 0.6407 305.364ms 436.551ms 1.4x |
| 96 85.61% 4.99e-01 3.79e-01 0.3011 0.8136 452.623ms 436.551ms 1.0x |
|
|
| ────────────────────────────────────────────────────────────────────── |
| SUMMARY: Recommended target_rank per N |
| (≥99% energy, ≥0.99 subspace cos, best speedup) |
| ────────────────────────────────────────────────────────────────────── |
| N= 32: best k= 24 → 80.6% energy, subspace=1.0000 (below 99% threshold) |
| N= 48: best k= 32 → 74.7% energy, subspace=0.7575 (below 99% threshold) |
| N= 64: best k= 48 → 82.7% energy, subspace=0.8138 (below 99% threshold) |
| N= 96: best k= 64 → 77.8% energy, subspace=0.7578 (below 99% threshold) |
| N=128: best k= 96 → 85.6% energy, subspace=0.8136 (below 99% threshold) |
|
|
| ============================================================================================================== |
| N-DIMENSION SWEEP - NVIDIA RTX PRO 6000 Blackwell Server Edition |
| B=512, M=1024 |
| ============================================================================================================== |
| N Triton Gram Newton Proj→24 Proj→16 Torch Best Speedup |
| ───────────────────────────────────────────────────────────────────────────────────────────────────────── |
| 2 0.020ms 0.227ms - - - 79.040ms triton 3859.1x |
| 3 0.022ms 0.242ms - - - 118.394ms triton 5394.2x |
| 4 - 0.255ms - - - 125.263ms gram 490.6x |
| 5 - 0.258ms - - - 144.426ms gram 560.8x |
| 6 - 0.269ms - - - 155.042ms gram 576.9x |
| 7 - 0.280ms - - - 163.771ms gram 584.2x |
| 8 - 0.291ms 0.290ms - - 168.934ms newton 582.1x |
| 10 - 0.380ms 0.379ms - - 190.292ms newton 502.2x |
| 12 - 0.400ms 0.400ms - - 213.394ms gram 534.1x |
| 16 - 0.429ms 0.428ms - - 230.670ms newton 538.6x |
| 20 - 0.597ms 0.596ms - - 253.657ms newton 425.6x |
| 24 - 0.651ms 0.651ms - 0.652ms 272.293ms newton 418.5x |
| 32 - 0.795ms 0.794ms 0.800ms 22.025ms 303.023ms newton 381.8x |
| 48 - 344.049ms 344.202ms 22.439ms 22.481ms 550.746ms proj24 24.5x |
| 64 - 365.206ms 365.148ms 21.749ms 22.173ms 609.352ms proj24 28.0x |
| 96 - 590.636ms 590.664ms 21.862ms 22.353ms 973.819ms proj24 44.5x |
| 128 - 868.144ms 868.262ms 22.085ms 22.469ms 1421.924ms proj24 64.4x |
|
|
| ================================================================================ |
| SUMMARY |
| ================================================================================ |
|
|
| Strategy by N: |
| N=2: Fused Triton (closed-form Jacobi rotation) |
| N=3: Fused Triton (cyclic Jacobi in registers) |
| N=4-32: Gram + eigh (bmm + cuSOLVER eigh) - sub-ms |
| N=48+: Projected SVD (N→k, cheap SVD, lift back) - check quality table |
|
|
| Standalone utilities: |
| newton_schulz_invsqrt(G) - batched G^{-1/2} via pure bmm |
| projected_svd(A, target_rank=k) - rank-k approximate SVD |
| projected_svd_quality(A, target_rank) - measure approximation quality |
|
|
| Key question answered: energy_ratio and subspace_cos in quality table |
|
|
| Results saved to svd_general_profile.json |
| ================================================================================ |
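
The N=2 entry in "Strategy by N" refers to a closed-form Jacobi rotation. The log does not show the kernel itself, so the following is only a plain-Python sketch of that idea for a single M x 2 matrix, assuming the rotation is applied to the 2 x 2 Gram matrix A^T A. The function name `thin_svd_2col` and the list-of-rows input format are hypothetical; the profiled code is a fused, batched Triton kernel and may differ in detail.

```python
import math


def thin_svd_2col(rows):
    """Thin SVD of an M x 2 matrix given as a list of (a0, a1) rows.

    Returns (U, S, V): U is a list of M (u0, u1) pairs, S = [s0, s1] in
    descending order, and V is a 2 x 2 nested list whose columns are the
    right singular vectors, so that A ~= U * diag(S) * V^T.
    """
    # Gram matrix G = A^T A (symmetric 2 x 2).
    g00 = sum(a0 * a0 for a0, _ in rows)
    g11 = sum(a1 * a1 for _, a1 in rows)
    g01 = sum(a0 * a1 for a0, a1 in rows)

    # A single Jacobi rotation diagonalizes a symmetric 2 x 2 matrix.
    # With this angle, v0 = (c, s) belongs to the larger eigenvalue,
    # so the singular values come out in descending order.
    theta = 0.5 * math.atan2(2.0 * g01, g00 - g11)
    c, s = math.cos(theta), math.sin(theta)

    # Eigenvalues of G along v0 = (c, s) and v1 = (-s, c).
    lam0 = c * c * g00 + 2.0 * c * s * g01 + s * s * g11
    lam1 = s * s * g00 - 2.0 * c * s * g01 + c * c * g11

    # Singular values are square roots of the eigenvalues (clamped at 0
    # to guard against tiny negative values from rounding).
    s0 = math.sqrt(max(lam0, 0.0))
    s1 = math.sqrt(max(lam1, 0.0))

    # Left singular vectors: u_i = A v_i / s_i. A zero singular value
    # leaves the corresponding column as zeros in this sketch.
    inv0 = 1.0 / s0 if s0 > 0.0 else 0.0
    inv1 = 1.0 / s1 if s1 > 0.0 else 0.0
    U = [((a0 * c + a1 * s) * inv0, (-a0 * s + a1 * c) * inv1)
         for a0, a1 in rows]

    V = [[c, -s],
         [s, c]]
    return U, [s0, s1], V


if __name__ == "__main__":
    U, S, V = thin_svd_2col([(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)])
    print("singular values:", S)
```

Forming the Gram matrix squares the condition number of A, so this route trades some accuracy in the small singular value for a branch-free closed form.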