svd-triton / kernel_profile_out.txt
================================================================================
Generalized Batched Thin SVD - Profiling Suite
Device: NVIDIA RTX PRO 6000 Blackwell Server Edition
================================================================================
======================================================================
CORRECTNESS VALIDATION (B=64, M=1024)
======================================================================
[auto] N= 2: S_err=1.91e-05 recon=9.54e-07 (ref=4.83e-06) orth=1.43e-06 desc=True [PASS]
[triton] N= 2: S_err=1.91e-05 recon=9.54e-07 (ref=4.83e-06) orth=1.43e-06 desc=True [PASS]
[auto] N= 3: S_err=4.01e-05 recon=2.38e-06 (ref=8.34e-06) orth=1.13e-06 desc=True [PASS]
[triton] N= 3: S_err=4.01e-05 recon=2.38e-06 (ref=8.34e-06) orth=1.13e-06 desc=True [PASS]
[auto] N= 4: S_err=4.01e-05 recon=2.38e-06 (ref=9.06e-06) orth=1.73e-06 desc=True [PASS]
[gram] N= 4: S_err=4.01e-05 recon=2.38e-06 (ref=9.06e-06) orth=1.73e-06 desc=True [PASS]
[auto] N= 5: S_err=5.15e-05 recon=3.81e-06 (ref=9.30e-06) orth=1.79e-06 desc=True [PASS]
[gram] N= 5: S_err=5.15e-05 recon=3.81e-06 (ref=9.30e-06) orth=1.79e-06 desc=True [PASS]
[auto] N= 6: S_err=6.29e-05 recon=2.86e-06 (ref=1.24e-05) orth=1.67e-06 desc=True [PASS]
[gram] N= 6: S_err=6.29e-05 recon=2.86e-06 (ref=1.24e-05) orth=1.67e-06 desc=True [PASS]
[auto] N= 8: S_err=9.54e-05 recon=3.58e-06 (ref=1.50e-05) orth=1.67e-06 desc=True [PASS]
[gram] N= 8: S_err=9.54e-05 recon=3.58e-06 (ref=1.50e-05) orth=1.67e-06 desc=True [PASS]
[newton] N= 8: S_err=9.54e-05 recon=3.58e-06 (ref=1.50e-05) orth=1.67e-06 desc=True [PASS]
[auto] N= 10: S_err=8.39e-05 recon=4.05e-06 (ref=1.41e-05) orth=1.67e-06 desc=True [PASS]
[gram] N= 10: S_err=8.39e-05 recon=4.05e-06 (ref=1.41e-05) orth=1.67e-06 desc=True [PASS]
[newton] N= 10: S_err=8.39e-05 recon=4.05e-06 (ref=1.41e-05) orth=1.67e-06 desc=True [PASS]
[auto] N= 16: S_err=1.41e-04 recon=4.29e-06 (ref=2.57e-05) orth=1.91e-06 desc=True [PASS]
[gram] N= 16: S_err=1.41e-04 recon=4.29e-06 (ref=2.57e-05) orth=1.91e-06 desc=True [PASS]
[newton] N= 16: S_err=1.41e-04 recon=4.29e-06 (ref=2.57e-05) orth=1.91e-06 desc=True [PASS]
[auto] N= 32: S_err=1.79e-04 recon=3.67e-06 (ref=3.17e-05) orth=2.03e-06 desc=True [PASS]
[gram] N= 32: S_err=1.79e-04 recon=3.67e-06 (ref=3.17e-05) orth=2.03e-06 desc=True [PASS]
[newton] N= 32: S_err=1.79e-04 recon=3.67e-06 (ref=3.17e-05) orth=2.03e-06 desc=True [PASS]
[auto] N= 48: S_err=3.05e-04 recon=4.24e-05 (ref=4.74e-05) orth=4.46e-06 desc=True [PASS]
[gram] N= 48: S_err=3.05e-04 recon=4.24e-05 (ref=4.74e-05) orth=4.46e-06 desc=True [PASS]
[newton] N= 48: S_err=3.05e-04 recon=4.24e-05 (ref=4.74e-05) orth=4.46e-06 desc=True [PASS]
[auto] N= 64: S_err=4.27e-04 recon=5.72e-05 (ref=6.32e-05) orth=5.24e-06 desc=True [PASS]
[gram] N= 64: S_err=4.27e-04 recon=5.72e-05 (ref=6.32e-05) orth=5.24e-06 desc=True [PASS]
[newton] N= 64: S_err=4.27e-04 recon=5.72e-05 (ref=6.32e-05) orth=5.24e-06 desc=True [PASS]
[auto] N= 96: S_err=1.17e-03 recon=1.07e-04 (ref=9.39e-05) orth=2.74e-06 desc=True [PASS]
[gram] N= 96: S_err=1.17e-03 recon=1.07e-04 (ref=9.39e-05) orth=2.74e-06 desc=True [PASS]
[newton] N= 96: S_err=1.17e-03 recon=1.07e-04 (ref=9.39e-05) orth=2.74e-06 desc=True [PASS]
[auto] N=128: S_err=1.42e-03 recon=1.63e-04 (ref=1.27e-04) orth=4.17e-06 desc=True [PASS]
[gram] N=128: S_err=1.42e-03 recon=1.63e-04 (ref=1.27e-04) orth=4.17e-06 desc=True [PASS]
[newton] N=128: S_err=1.42e-03 recon=1.63e-04 (ref=1.27e-04) orth=4.17e-06 desc=True [PASS]
ALL PASSED
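The log does not define its validation columns. As a rough sketch only, assuming S_err is the largest absolute difference from a reference set of singular values, recon is the largest absolute reconstruction error, orth is the largest deviation of Vh Vh^T from the identity, and desc checks descending order, the checks could look like the NumPy code below. The name `svd_metrics` and all four definitions are assumptions, not taken from the profiled code. NumPy works in float64 by default, so its numbers are not comparable to the ones in the log.

```python
import numpy as np

def svd_metrics(A, U, S, Vh):
    """Hypothetical quality checks for a batched thin SVD A ~ U diag(S) Vh."""
    # Reference singular values from NumPy (batched over the leading axis).
    S_ref = np.linalg.svd(A, compute_uv=False)
    s_err = np.abs(S - S_ref).max()
    # Reconstruction: scale the columns of U by S, then multiply by Vh.
    recon = np.abs((U * S[:, None, :]) @ Vh - A).max()
    # Orthogonality of the right singular vectors: Vh Vh^T should be I.
    n = Vh.shape[-1]
    orth = np.abs(Vh @ np.swapaxes(Vh, -1, -2) - np.eye(n)).max()
    # Singular values should be sorted in descending order per matrix.
    desc = bool(np.all(S[:, :-1] >= S[:, 1:]))
    return {"S_err": s_err, "recon": recon, "orth": orth, "desc": desc}

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 64, 8))          # (B, M, N) synthetic batch
U, S, Vh = np.linalg.svd(A, full_matrices=False)
metrics = svd_metrics(A, U, S, Vh)
```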
========================================================================================================================
PROCRUSTES ALIGNMENT: 5 methods of applying rank-k rotation to N-d space
cos = mean cosine similarity after alignment (higher = better, full = ceiling)
NN = nearest-neighbor agreement with full Procrustes (1.0 = identical downstream)
========================================================================================================================
N=32:
k full pinv lerp (α) slerp (α) subspc stay_k │ nn_pv nn_lr nn_sl nn_ss
─────────────────────────────────────────────────────────────────────────────────────────────────────────
8 0.4359 0.2142 0.4248 0.3 0.2142 err 0.4299 0.4215 │ 0.177 0.681 0.177 1.000
16 0.4370 0.2967 0.4259 0.3 0.2967 err 0.4316 0.4252 │ 0.300 0.678 0.300 1.000
24 0.4405 0.3864 0.4365 0.3 0.3864 err 0.4369 0.4384 │ 0.555 0.772 0.555 1.000
N=48:
k full pinv lerp (α) slerp (α) subspc stay_k │ nn_pv nn_lr nn_sl nn_ss
─────────────────────────────────────────────────────────────────────────────────────────────────────────
8 0.4421 0.1764 0.4306 0.3 0.1764 err 0.4350 0.4192 │ 0.102 0.702 0.102 1.000
16 0.4422 0.2494 0.4290 0.3 0.2494 err 0.4354 0.4292 │ 0.230 0.667 0.230 1.000
24 0.4432 0.3047 0.4294 0.3 0.3047 err 0.4366 0.4315 │ 0.326 0.676 0.326 1.000
32 0.4476 0.3621 0.4397 0.3 0.3621 err 0.4429 0.4425 │ 0.454 0.728 0.454 1.000
N=64:
k full pinv lerp (α) slerp (α) subspc stay_k │ nn_pv nn_lr nn_sl nn_ss
─────────────────────────────────────────────────────────────────────────────────────────────────────────
8 0.4475 0.1602 0.4356 0.3 0.1602 err 0.4390 0.4323 │ 0.102 0.708 0.102 1.000
16 0.4444 0.2178 0.4300 0.3 0.2178 err 0.4355 0.4299 │ 0.164 0.658 0.164 1.000
24 0.4453 0.2678 0.4295 0.3 0.2678 err 0.4363 0.4332 │ 0.241 0.665 0.241 1.000
32 0.4468 0.3091 0.4324 0.3 0.3091 err 0.4390 0.4374 │ 0.312 0.680 0.312 1.000
N=96:
k full pinv lerp (α) slerp (α) subspc stay_k │ nn_pv nn_lr nn_sl nn_ss
─────────────────────────────────────────────────────────────────────────────────────────────────────────
16 0.4267 0.1644 0.4035 0.3 0.1644 err 0.4077 0.4020 │ 0.132 0.721 0.132 1.000
24 0.4259 0.2023 0.4014 0.3 0.2023 err 0.4069 0.4034 │ 0.200 0.709 0.200 1.000
32 0.4241 0.2363 0.3996 0.3 0.2363 err 0.4057 0.4056 │ 0.241 0.688 0.241 1.000
48 0.4238 0.2978 0.4050 0.3 0.2978 err 0.4080 0.4139 │ 0.394 0.717 0.394 1.000
N=128:
k full pinv lerp (α) slerp (α) subspc stay_k │ nn_pv nn_lr nn_sl nn_ss
─────────────────────────────────────────────────────────────────────────────────────────────────────────
16 0.4068 0.1380 0.3740 0.3 0.1380 err 0.3770 0.3763 │ 0.129 0.757 0.129 1.000
24 0.4072 0.1679 0.3733 0.3 0.1679 err 0.3774 0.3778 │ 0.169 0.739 0.169 1.000
32 0.4064 0.1860 0.3730 0.3 0.1860 err 0.3778 0.3736 │ 0.217 0.723 0.217 1.000
48 0.4073 0.2397 0.3783 0.3 0.2397 err 0.3812 0.3868 │ 0.310 0.733 0.310 1.000
64 0.4102 0.2781 0.3853 0.3 0.2781 err 0.3880 0.3937 │ 0.394 0.729 0.394 1.000
═════════════════════════════════════════════════════════════════════════════════════════════════════════
WINNER PER CONFIG (closest cos to full, highest NN agreement):
═════════════════════════════════════════════════════════════════════════════════════════════════════════
N= 32 k= 8: best_cos=subspace (0.4299, gap=0.0060) best_nn=subspace (1.000)
N= 32 k= 16: best_cos=subspace (0.4316, gap=0.0054) best_nn=subspace (1.000)
N= 32 k= 24: best_cos=subspace (0.4369, gap=0.0037) best_nn=subspace (1.000)
N= 48 k= 8: best_cos=subspace (0.4350, gap=0.0071) best_nn=subspace (1.000)
N= 48 k= 16: best_cos=subspace (0.4354, gap=0.0068) best_nn=subspace (1.000)
N= 48 k= 24: best_cos=subspace (0.4366, gap=0.0066) best_nn=subspace (1.000)
N= 48 k= 32: best_cos=subspace (0.4429, gap=0.0047) best_nn=subspace (1.000)
N= 64 k= 8: best_cos=subspace (0.4390, gap=0.0085) best_nn=subspace (1.000)
N= 64 k= 16: best_cos=subspace (0.4355, gap=0.0089) best_nn=subspace (1.000)
N= 64 k= 24: best_cos=subspace (0.4363, gap=0.0090) best_nn=subspace (1.000)
N= 64 k= 32: best_cos=subspace (0.4390, gap=0.0078) best_nn=subspace (1.000)
N= 96 k= 16: best_cos=subspace (0.4077, gap=0.0190) best_nn=subspace (1.000)
N= 96 k= 24: best_cos=subspace (0.4069, gap=0.0190) best_nn=subspace (1.000)
N= 96 k= 32: best_cos=subspace (0.4057, gap=0.0184) best_nn=subspace (1.000)
N= 96 k= 48: best_cos=subspace (0.4080, gap=0.0158) best_nn=subspace (1.000)
N=128 k= 16: best_cos=subspace (0.3770, gap=0.0298) best_nn=subspace (1.000)
N=128 k= 24: best_cos=subspace (0.3774, gap=0.0298) best_nn=subspace (1.000)
N=128 k= 32: best_cos=subspace (0.3778, gap=0.0286) best_nn=subspace (1.000)
N=128 k= 48: best_cos=subspace (0.3812, gap=0.0261) best_nn=subspace (1.000)
N=128 k= 64: best_cos=subspace (0.3880, gap=0.0222) best_nn=subspace (1.000)
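For orientation, the "full" column is the ceiling the rank-k variants are compared against. The sketch below shows the classical orthogonal Procrustes solution and a mean cosine score on synthetic data. It is an illustration under assumptions: the helper names, the synthetic inputs, and the exact form of the cosine metric are not taken from the profiled code, and none of the five rank-k methods in the tables is reproduced here.

```python
import numpy as np

def procrustes_rotation(X, Y):
    """Orthogonal R minimizing ||X @ R - Y||_F (classical orthogonal Procrustes)."""
    U, _, Vh = np.linalg.svd(X.T @ Y)
    return U @ Vh

def mean_cosine(P, Q):
    """Mean row-wise cosine similarity between two point sets."""
    num = (P * Q).sum(axis=1)
    den = np.linalg.norm(P, axis=1) * np.linalg.norm(Q, axis=1)
    return float((num / den).mean())

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 32))
# Synthetic target: a randomly rotated, slightly noisy copy of X.
Q, _ = np.linalg.qr(rng.standard_normal((32, 32)))
Y = X @ Q + 0.1 * rng.standard_normal((500, 32))

R = procrustes_rotation(X, Y)
cos_before = mean_cosine(X, Y)       # unaligned baseline
cos_after = mean_cosine(X @ R, Y)    # after full-rank alignment
```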
====================================================================================================
PROJECTION QUALITY ANALYSIS - B=256, M=1024
Question: can rank-k SVD approximate rank-N SVD?
====================================================================================================
N=32:
k Energy% Recon_proj Recon_trunc S_rel_err Subspace Proj ms Full ms Speedup
────────────────────────────────────────────────────────────────────────────────────────────────
8 30.99% 8.65e-01 8.31e-01 0.5622 0.4432 7.849ms 0.508ms 0.1x
12 44.74% 7.89e-01 7.43e-01 0.4606 0.5508 10.556ms 0.508ms 0.0x
16 57.56% 7.05e-01 6.51e-01 0.3379 0.6432 11.222ms 0.508ms 0.0x
24 80.59% 4.41e-01 4.41e-01 0.0000 1.0000 0.510ms 0.508ms 1.0x
N=48:
k Energy% Recon_proj Recon_trunc S_rel_err Subspace Proj ms Full ms Speedup
────────────────────────────────────────────────────────────────────────────────────────────────
8 22.33% 9.11e-01 8.81e-01 0.7880 0.3642 7.901ms 172.136ms 21.8x
12 32.39% 8.65e-01 8.22e-01 0.6575 0.4454 10.668ms 172.136ms 16.1x
16 41.87% 8.15e-01 7.62e-01 0.4125 0.5193 11.490ms 172.136ms 15.0x
24 59.24% 7.05e-01 6.38e-01 0.3178 0.6433 11.497ms 172.136ms 15.0x
32 74.71% 5.76e-01 5.03e-01 0.3076 0.7575 180.615ms 172.136ms 1.0x
N=64:
k Energy% Recon_proj Recon_trunc S_rel_err Subspace Proj ms Full ms Speedup
────────────────────────────────────────────────────────────────────────────────────────────────
8 17.83% 9.34e-01 9.06e-01 0.9635 0.3152 7.917ms 182.058ms 23.0x
12 25.91% 9.00e-01 8.61e-01 0.6937 0.3898 10.693ms 182.058ms 17.0x
16 33.58% 8.64e-01 8.15e-01 0.6025 0.4484 11.311ms 182.058ms 16.1x
24 47.78% 7.89e-01 7.23e-01 0.3495 0.5505 11.207ms 182.058ms 16.2x
32 60.64% 7.05e-01 6.27e-01 0.3116 0.6438 176.453ms 182.058ms 1.0x
48 82.74% 4.99e-01 4.15e-01 0.3090 0.8138 204.625ms 182.058ms 0.9x
N=96:
k Energy% Recon_proj Recon_trunc S_rel_err Subspace Proj ms Full ms Speedup
────────────────────────────────────────────────────────────────────────────────────────────────
8 13.09% 9.56e-01 9.32e-01 1.2033 0.2583 8.035ms 295.451ms 36.8x
16 24.83% 9.11e-01 8.67e-01 0.8721 0.3637 11.426ms 295.451ms 25.9x
24 35.57% 8.64e-01 8.02e-01 0.5587 0.4475 11.238ms 295.451ms 26.3x
32 45.45% 8.15e-01 7.38e-01 0.4710 0.5163 175.186ms 295.451ms 1.7x
48 62.97% 7.05e-01 6.08e-01 0.3243 0.6407 200.525ms 295.451ms 1.5x
64 77.83% 5.75e-01 4.71e-01 0.3073 0.7578 306.531ms 295.451ms 1.0x
N=128:
k Energy% Recon_proj Recon_trunc S_rel_err Subspace Proj ms Full ms Speedup
────────────────────────────────────────────────────────────────────────────────────────────────
8 10.60% 9.68e-01 9.46e-01 1.4678 0.2251 8.085ms 436.551ms 54.0x
16 20.19% 9.34e-01 8.93e-01 1.0025 0.3145 11.509ms 436.551ms 37.9x
24 29.04% 9.00e-01 8.42e-01 0.7155 0.3867 11.432ms 436.551ms 38.2x
32 37.26% 8.64e-01 7.92e-01 0.5374 0.4447 174.994ms 436.551ms 2.5x
48 52.05% 7.89e-01 6.93e-01 0.3598 0.5498 198.286ms 436.551ms 2.2x
64 64.91% 7.05e-01 5.92e-01 0.3121 0.6407 305.364ms 436.551ms 1.4x
96 85.61% 4.99e-01 3.79e-01 0.3011 0.8136 452.623ms 436.551ms 1.0x
──────────────────────────────────────────────────────────────────────
SUMMARY: Recommended target_rank per N
(≥99% energy, ≥0.99 subspace cos, best speedup)
──────────────────────────────────────────────────────────────────────
N= 32: best k= 24 → 80.6% energy, subspace=1.0000 (below 99% threshold)
N= 48: best k= 32 → 74.7% energy, subspace=0.7575 (below 99% threshold)
N= 64: best k= 48 → 82.7% energy, subspace=0.8138 (below 99% threshold)
N= 96: best k= 64 → 77.8% energy, subspace=0.7578 (below 99% threshold)
N=128: best k= 96 → 85.6% energy, subspace=0.8136 (below 99% threshold)
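The Energy% and Recon_trunc columns in the quality tables are consistent with the usual truncated-SVD identity: if Energy is the share of squared singular values held by the top k, then the relative Frobenius error of the best rank-k approximation is sqrt(1 - Energy). For N=32, k=8 this gives sqrt(1 - 0.3099) ≈ 0.831, matching the logged 8.31e-01. The sketch below checks that identity on synthetic data. The function names are hypothetical and the definition of Energy is inferred from the numbers, not confirmed by the source.

```python
import numpy as np

def energy_ratio(S, k):
    """Share of squared singular-value mass captured by the top-k values."""
    return float((S[:k] ** 2).sum() / (S ** 2).sum())

def truncation_error(A, k):
    """Relative Frobenius error of the best rank-k approximation of A."""
    U, S, Vh = np.linalg.svd(A, full_matrices=False)
    A_k = (U[:, :k] * S[:k]) @ Vh[:k]
    return float(np.linalg.norm(A - A_k) / np.linalg.norm(A))

rng = np.random.default_rng(0)
A = rng.standard_normal((1024, 32))   # synthetic tall matrix, M=1024, N=32
S = np.linalg.svd(A, compute_uv=False)
k = 8
e = energy_ratio(S, k)
err = truncation_error(A, k)
# For the optimal truncation, err**2 + e equals 1 up to rounding.
identity_gap = abs(err ** 2 + e - 1.0)
```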
==============================================================================================================
N-DIMENSION SWEEP - NVIDIA RTX PRO 6000 Blackwell Server Edition
B=512, M=1024
==============================================================================================================
N Triton Gram Newton Proj→24 Proj→16 Torch Best Speedup
──────────────────────────────────────────────────────────────────────────────────────────────────────────
2 0.020ms 0.227ms - - - 79.040ms triton 3859.1x
3 0.022ms 0.242ms - - - 118.394ms triton 5394.2x
4 - 0.255ms - - - 125.263ms gram 490.6x
5 - 0.258ms - - - 144.426ms gram 560.8x
6 - 0.269ms - - - 155.042ms gram 576.9x
7 - 0.280ms - - - 163.771ms gram 584.2x
8 - 0.291ms 0.290ms - - 168.934ms newton 582.1x
10 - 0.380ms 0.379ms - - 190.292ms newton 502.2x
12 - 0.400ms 0.400ms - - 213.394ms gram 534.1x
16 - 0.429ms 0.428ms - - 230.670ms newton 538.6x
20 - 0.597ms 0.596ms - - 253.657ms newton 425.6x
24 - 0.651ms 0.651ms - 0.652ms 272.293ms newton 418.5x
32 - 0.795ms 0.794ms 0.800ms 22.025ms 303.023ms newton 381.8x
48 - 344.049ms 344.202ms 22.439ms 22.481ms 550.746ms proj24 24.5x
64 - 365.206ms 365.148ms 21.749ms 22.173ms 609.352ms proj24 28.0x
96 - 590.636ms 590.664ms 21.862ms 22.353ms 973.819ms proj24 44.5x
128 - 868.144ms 868.262ms 22.085ms 22.469ms 1421.924ms proj24 64.4x
================================================================================
SUMMARY
================================================================================
Strategy by N:
N=2: Fused Triton (closed-form Jacobi rotation)
N=3: Fused Triton (cyclic Jacobi in registers)
N=4-32: Gram + eigh (bmm + cuSOLVER eigh) - sub-ms
N=48+: Projected SVD (N→k, cheap SVD, lift back) - check quality table
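The Gram + eigh strategy listed above can be sketched in NumPy as follows. This is a minimal illustration of the standard technique, not the profiled kernel: form the N x N Gram matrix A^T A, take its symmetric eigendecomposition, and recover U from A V / S. The name `gram_svd` and the `eps` guard are assumptions. A known trade-off is that forming A^T A squares the condition number, so accuracy degrades for ill-conditioned inputs.

```python
import numpy as np

def gram_svd(A, eps=1e-12):
    """Thin SVD of tall matrices A (B, M, N) via the N x N Gram matrix."""
    G = np.swapaxes(A, -1, -2) @ A                  # (B, N, N), symmetric PSD
    evals, V = np.linalg.eigh(G)                    # eigenvalues in ascending order
    evals = evals[:, ::-1]                          # flip to descending
    V = V[:, :, ::-1]                               # reorder eigenvectors to match
    S = np.sqrt(np.clip(evals, 0.0, None))          # singular values of A
    U = (A @ V) / np.maximum(S, eps)[:, None, :]    # U = A V diag(1/S)
    return U, S, np.swapaxes(V, -1, -2)

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 256, 6))                # synthetic (B, M, N) batch
U, S, Vh = gram_svd(A)
recon_err = np.abs((U * S[:, None, :]) @ Vh - A).max()
S_ref = np.linalg.svd(A, compute_uv=False)
```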
Standalone utilities:
newton_schulz_invsqrt(G) - batched G^{-1/2} via pure bmm
projected_svd(A, target_rank=k) - rank-k approximate SVD
projected_svd_quality(A, target_rank) - measure approximation quality
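One common way to obtain G^{-1/2} with matrix multiplications only is the coupled Newton-Schulz iteration. The sketch below assumes that technique, with Frobenius-norm scaling and a fixed iteration count. The function name, the scaling choice, and `iters=20` are illustrative and may differ from what `newton_schulz_invsqrt` does in the profiled code.

```python
import numpy as np

def newton_schulz_invsqrt_sketch(G, iters=20):
    """Approximate G^{-1/2} for batched SPD matrices G (B, N, N) using only matmuls."""
    n = G.shape[-1]
    I = np.eye(n)
    # Scale by the Frobenius norm so every eigenvalue lies in (0, 1],
    # which keeps the iteration inside its convergence region.
    c = np.linalg.norm(G, axis=(-2, -1), keepdims=True)
    Y = G / c
    Z = np.broadcast_to(I, G.shape).copy()
    for _ in range(iters):
        T = 0.5 * (3.0 * I - Z @ Y)
        Y = Y @ T          # tends to (G / c)^{1/2}
        Z = T @ Z          # tends to (G / c)^{-1/2}
    return Z / np.sqrt(c)  # undo the scaling: G^{-1/2} = (G / c)^{-1/2} / sqrt(c)

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 64, 8))
G = np.swapaxes(A, -1, -2) @ A / 64.0     # well-conditioned SPD test batch
W = newton_schulz_invsqrt_sketch(G)
residual = np.abs(W @ G @ W - np.eye(8)).max()   # W G W should be the identity
```

Convergence slows as the smallest scaled eigenvalue approaches zero, so a fixed iteration count is only adequate for reasonably well-conditioned G.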
Key question answered: energy_ratio and subspace_cos in quality table
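The log describes projected SVD only as "N→k, cheap SVD, lift back". In rows where the projection path appears to be taken, Recon_proj sits close to sqrt(1 - k/N) (0.865 logged against 0.866 for N=32, k=8), which is what a data-independent orthonormal projection gives on roughly isotropic inputs. The sketch below assumes exactly that: a random orthonormal N x k projection, a thin SVD of the projected matrix, and a lift of the right singular vectors back to N dimensions. The function name and the choice of projection are assumptions, not the profiled implementation.

```python
import numpy as np

def projected_svd_sketch(A, k, rng):
    """Rank-k approximate thin SVD of A (M, N) via a random orthonormal projection."""
    N = A.shape[1]
    # Hypothetical choice: orthonormal basis of a random k-dim subspace of R^N.
    P, _ = np.linalg.qr(rng.standard_normal((N, k)))    # (N, k), P.T @ P = I
    Y = A @ P                                           # project to k columns: (M, k)
    U, S, Wh = np.linalg.svd(Y, full_matrices=False)    # cheap SVD on k columns
    Vh = Wh @ P.T                                       # lift back to N dims: (k, N)
    return U, S, Vh

rng = np.random.default_rng(0)
M, N, k = 1024, 32, 8
A = rng.standard_normal((M, N))                         # synthetic isotropic input
U, S, Vh = projected_svd_sketch(A, k, rng)
recon = np.linalg.norm(A - (U * S) @ Vh) / np.linalg.norm(A)
expected = np.sqrt(1.0 - k / N)                         # about 0.866 for k=8, N=32
```

Because the projection ignores the data, this approximation is generally worse than optimal truncation at the same k, which matches Recon_proj being at least as large as Recon_trunc throughout the tables.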
Results saved to svd_general_profile.json
================================================================================