DiT engines: replace BF16 with FP16-mixed (FP32 islands) + add FP32 variants

Drop dit_bf16.trt for all three SA3 DiTs (sa3-sm-music, sa3-sm-sfx,
sa3-m). BF16 rounding error compounds over the 8 pingpong sampling
steps: cos-sim vs PT FP32 drifts from 0.99 after a single step to 0.81
on the final latent, and the output is audibly degraded.

Replace with dit_fp16mixed.trt (the canonical engine from now on): an
FP16 trunk with FP32 islands around every RMSNorm chain
(Pow+ReduceMean+Sqrt+Mul), every attention Softmax, and the RoPE region
(anything reachable from a Cast(to=FP32) feeding Cos/Sin/Einsum). That is
140 RMSNorms + 40 Softmaxes per sm-music block, more for medium (sa3-m).
Built with STRONGLY_TYPED so TRT honors the explicit dtypes (no
auto-promotion). Matches MLX's "FP16 with implicit FP32 reductions via
fused kernels" recipe.
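
To illustrate why the RMSNorm chain is one of the FP32 islands, here is a
minimal NumPy sketch (not the engine-building code). The function names,
the eps value, and the test input are assumptions made for illustration;
only the Pow+ReduceMean+Sqrt+Mul pattern comes from the description above.

```python
import numpy as np

def rmsnorm_ref(x, weight, eps=1e-6):
    """FP64 reference RMSNorm."""
    x64 = x.astype(np.float64)
    mean_sq = np.mean(x64 * x64, axis=-1, keepdims=True)
    return x64 / np.sqrt(mean_sq + eps) * weight.astype(np.float64)

def rmsnorm_all_fp16(x, weight, eps=1e-6):
    """RMSNorm with every intermediate stored in FP16 (no FP32 island)."""
    x16 = x.astype(np.float16)
    with np.errstate(over="ignore", invalid="ignore"):
        sq = x16 * x16  # Pow: overflows to inf once |x| exceeds ~255
        mean_sq = np.mean(sq, axis=-1, keepdims=True).astype(np.float16)
        inv_rms = np.float16(1.0) / np.sqrt(mean_sq + np.float16(eps))
        return (x16 * inv_rms * weight.astype(np.float16)).astype(np.float16)

def rmsnorm_fp32_island(x, weight, eps=1e-6):
    """FP16 in, FP16 out, with Pow+ReduceMean+Sqrt+Mul computed in FP32."""
    x16 = x.astype(np.float16)
    x32 = x16.astype(np.float32)  # cast to FP32 on entering the island
    mean_sq = np.mean(x32 * x32, axis=-1, keepdims=True)
    inv_rms = 1.0 / np.sqrt(mean_sq + np.float32(eps))
    y32 = x32 * inv_rms * weight.astype(np.float32)
    return y32.astype(np.float16)  # cast back to FP16 on leaving the island

def max_abs_err(y, ref):
    """Largest absolute deviation from the FP64 reference."""
    return float(np.max(np.abs(y.astype(np.float64) - ref)))
```

With activations large enough that their squares exceed the FP16 range, the
all-FP16 path collapses while the island path stays close to the reference.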

Cos-sim vs PT FP32: 0.99997 after a single step, 0.998 after all 8 steps.
RMS-curve correlation: 0.998. Audio is effectively indistinguishable from
the FP32 PyTorch reference at roughly 1/15 the wall time (43 ms vs 630 ms).
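
For readers who want to reproduce this kind of comparison, a minimal
sketch of a per-step cosine-similarity check follows. It assumes the
latents from the reference and the engine under test have already been
dumped as arrays, one per sampling step; the helper names are
hypothetical and this is not the script used to produce the numbers above.

```python
import numpy as np

def cos_sim(a, b):
    """Cosine similarity between two tensors, flattened and computed in FP64."""
    a64 = np.asarray(a, dtype=np.float64).ravel()
    b64 = np.asarray(b, dtype=np.float64).ravel()
    return float(a64 @ b64 / (np.linalg.norm(a64) * np.linalg.norm(b64)))

def per_step_cos_sim(ref_latents, test_latents):
    """Compare latents step by step; both inputs are sequences of arrays."""
    return [cos_sim(r, t) for r, t in zip(ref_latents, test_latents)]
```

Tracking the value at every step, rather than only on the final latent,
is what makes compounding drift visible.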

Also adds dit_fp32.trt (1.8 GB for sa3-sm-*, 5.8 GB for sa3-m) for users
who want bit-for-bit parity with the PyTorch reference at the cost of ~3x
slower inference.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Files changed (6)

tensorRT/sm_90/sa3-m/{dit_bf16.trt → dit_fp16mixed.trt} +2 -2
tensorRT/sm_90/sa3-m/dit_fp32.trt +3 -0
tensorRT/sm_90/sa3-sm-music/{dit_bf16.trt → dit_fp16mixed.trt} +2 -2
tensorRT/sm_90/sa3-sm-music/dit_fp32.trt +3 -0
tensorRT/sm_90/sa3-sm-sfx/{dit_bf16.trt → dit_fp16mixed.trt} +2 -2
tensorRT/sm_90/sa3-sm-sfx/dit_fp32.trt +3 -0

tensorRT/sm_90/sa3-m/{dit_bf16.trt → dit_fp16mixed.trt} RENAMED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:382d5a1703c262a8c3c4ad8b4d12088fbec64f0d438224586d25ec2e5a843d9e
-size 2919180828
+oid sha256:399f3fa18e21f86528322a4543fed17999f6bd95589886f2d4f8f3e2c77fc425
+size 2914585244

tensorRT/sm_90/sa3-m/dit_fp32.trt ADDED

@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:f6787419c134298f40bcd3a7e4b3fbc9427b880dc801e2a66a85a39fd326f964
+size 5820343524

tensorRT/sm_90/sa3-sm-music/{dit_bf16.trt → dit_fp16mixed.trt} RENAMED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:ccaa56b55add49864765a29fea858819d635c00da6a7fbafe0b1b293446fda09
-size 935362708
+oid sha256:bb7be6e8d74392f4acfd954098c098f0c6f82d171870497b270e8a8281cb25f3
+size 935602284

tensorRT/sm_90/sa3-sm-music/dit_fp32.trt ADDED

@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:35ea9b5f039abfc0b1fd2294ab814a9e26da0567342e29a700da1ee85ab4636d
+size 1842306180

tensorRT/sm_90/sa3-sm-sfx/{dit_bf16.trt → dit_fp16mixed.trt} RENAMED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:8b356bc539acba1ac80ac569aa02aa774820bbca94d925be578eeda734dc9e6d
-size 935367708
+oid sha256:1646a9d73ffe75098bd29a1e51770c38f2e70115eb86c2bbbcb7f7f2a9c89b82
+size 935536684

tensorRT/sm_90/sa3-sm-sfx/dit_fp32.trt ADDED

@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:e63e18d15e97689ad2afd8a2adf55b30f79568aa9114dba0bbf50931c9a2c3cf
+size 1842314452