Cortexelus Claude Opus 4.7 (1M context) committed on
Commit 97651ef · 1 Parent(s): 5f27021

DiT engines: replace BF16 with FP16-mixed (FP32 islands) + add FP32 variants

Drop dit_bf16.trt for all three SA3 DiTs (sm-music, sm-sfx, sa3-m). BF16
rounding error compounds over the 8 pingpong sampling steps: cos-sim vs
PT FP32 drifts from 0.99 (single step) to 0.81 (final latent), and the
output is audibly degraded.
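
The compounding effect described above can be illustrated with a small pure-Python sketch. This is a toy under stated assumptions, not the SA3 DiT or the pingpong sampler: `toy_step` is a hypothetical stand-in for one sampling step, and `to_bf16` emulates bfloat16 storage by rounding a float32 bit pattern to its top 16 bits.

```python
import math
import struct

def to_bf16(x):
    """Round a finite float to the nearest bfloat16 value (ties to even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # rounding bias for the 16 dropped bits
    bits &= 0xFFFF0000                   # keep sign, exponent, 7 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def cos_sim(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def toy_step(x, i):
    """Hypothetical nonlinear update standing in for one sampling step."""
    return [v + 0.1 * math.sin(3.0 * v + i) for v in x]

def cos_sim_drift(steps=8, n=64):
    """Cos-sim between a full-precision and a bf16-rounded trajectory, per step."""
    ref = [math.sin(0.37 * i) for i in range(n)]
    low = [to_bf16(v) for v in ref]
    out = []
    for i in range(steps):
        ref = toy_step(ref, i)
        low = [to_bf16(v) for v in toy_step(low, i)]  # round after every step
        out.append(cos_sim(ref, low))
    return out

if __name__ == "__main__":
    for step, c in enumerate(cos_sim_drift(), start=1):
        print(f"step {step}: cos-sim = {c:.6f}")
```

The numbers this toy prints are not comparable to the 0.99 and 0.81 figures above; it only shows how the per-step comparison against a full-precision trajectory can be set up.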

Replace with dit_fp16mixed.trt (the canonical engine from now on): an FP16
trunk with FP32 islands around every RMSNorm chain (Pow+ReduceMean+Sqrt+Mul),
every attention Softmax, and the RoPE region (anything reachable from a
Cast(to=FP32) feeding Cos/Sin/Einsum). 140 RMSNorms + 40 Softmaxes per
sm-music block, more for medium. Built with STRONGLY_TYPED so TRT honors
the explicit dtypes (no auto-promotion). Matches MLX's "FP16 with implicit
FP32 reductions via fused kernels" recipe.
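
One common reason to keep the RMSNorm reduction in FP32 is that squaring moderately large FP16 activations can exceed the FP16 range (max 65504). The sketch below shows this on a toy vector. It is an illustration under assumptions, not the engine's graph: the input value 300 is hypothetical, the learned gain is omitted, and `to_fp16` / `to_fp32` emulate storage rounding with Python's `struct` module.

```python
import math
import struct

def to_fp16(x):
    """Round to the nearest IEEE half; out-of-range values become +/-inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)

def to_fp32(x):
    """Round to the nearest IEEE single (inputs here stay well within range)."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def rmsnorm_all_fp16(x, eps=1e-5):
    """RMSNorm with every intermediate stored in FP16 (no island)."""
    acc = 0.0
    for v in x:
        acc = to_fp16(acc + to_fp16(v * v))  # Pow and running sum in FP16
    mean = to_fp16(acc / len(x))             # ReduceMean
    rms = to_fp16(math.sqrt(to_fp16(mean + eps)))
    return [to_fp16(v / rms) for v in x]

def rmsnorm_fp32_island(x, eps=1e-5):
    """Same math with the reduction chain in FP32, cast back to FP16 at the end."""
    acc = 0.0
    for v in x:
        acc = to_fp32(acc + to_fp32(v * v))
    mean = to_fp32(acc / len(x))
    rms = to_fp32(math.sqrt(to_fp32(mean + eps)))
    return [to_fp16(to_fp32(v / rms)) for v in x]

if __name__ == "__main__":
    x = [300.0] * 4  # hypothetical activations; 300**2 = 90000 > 65504
    print(rmsnorm_all_fp16(x))     # squares overflow to inf, output collapses to 0.0
    print(rmsnorm_fp32_island(x))  # normalized values of 1.0
```

In the all-FP16 path the sum of squares becomes `inf`, so every output is divided by `inf` and collapses to zero, while the FP32 island returns the expected unit-RMS vector.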

Per-step cos-sim vs PT FP32: 0.99997 single-step, 0.998 over 8 steps.
RMS-curve correlation: 0.998. Audio is essentially indistinguishable from
the FP32 PyTorch reference at about 1/15 the wall time (43 ms vs 630 ms).
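
The commit does not say how the RMS-curve correlation was computed. A minimal sketch of one plausible definition, assuming non-overlapping frames and a Pearson correlation between the two loudness envelopes (the frame size of 1024 samples and both function names are assumptions):

```python
import math

def rms_curve(samples, frame=1024):
    """RMS level of each non-overlapping frame of a mono signal."""
    return [
        math.sqrt(sum(s * s for s in samples[i:i + frame]) / frame)
        for i in range(0, len(samples) - frame + 1, frame)
    ]

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

if __name__ == "__main__":
    # Synthetic ramped sine as a stand-in for decoded audio.
    reference = [(i / 8000.0) * math.sin(0.05 * i) for i in range(8192)]
    candidate = [0.5 * s for s in reference]  # same envelope shape, lower gain
    print(pearson(rms_curve(reference), rms_curve(candidate)))
```

Because Pearson correlation is invariant to gain, this metric compares the shape of the loudness envelope rather than absolute level, which is why it is reported alongside cos-sim rather than instead of it.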

Also adds dit_fp32.trt (1.8 GB sm-*, 5.8 GB medium) for users who want
the closest numerical match to the PyTorch FP32 reference, at the cost
of ~3x slower inference.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

tensorRT/sm_90/sa3-m/{dit_bf16.trt → dit_fp16mixed.trt} RENAMED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:382d5a1703c262a8c3c4ad8b4d12088fbec64f0d438224586d25ec2e5a843d9e
- size 2919180828
+ oid sha256:399f3fa18e21f86528322a4543fed17999f6bd95589886f2d4f8f3e2c77fc425
+ size 2914585244
tensorRT/sm_90/sa3-m/dit_fp32.trt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f6787419c134298f40bcd3a7e4b3fbc9427b880dc801e2a66a85a39fd326f964
+ size 5820343524
tensorRT/sm_90/sa3-sm-music/{dit_bf16.trt → dit_fp16mixed.trt} RENAMED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:ccaa56b55add49864765a29fea858819d635c00da6a7fbafe0b1b293446fda09
- size 935362708
+ oid sha256:bb7be6e8d74392f4acfd954098c098f0c6f82d171870497b270e8a8281cb25f3
+ size 935602284
tensorRT/sm_90/sa3-sm-music/dit_fp32.trt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:35ea9b5f039abfc0b1fd2294ab814a9e26da0567342e29a700da1ee85ab4636d
+ size 1842306180
tensorRT/sm_90/sa3-sm-sfx/{dit_bf16.trt → dit_fp16mixed.trt} RENAMED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:8b356bc539acba1ac80ac569aa02aa774820bbca94d925be578eeda734dc9e6d
- size 935367708
+ oid sha256:1646a9d73ffe75098bd29a1e51770c38f2e70115eb86c2bbbcb7f7f2a9c89b82
+ size 935536684
tensorRT/sm_90/sa3-sm-sfx/dit_fp32.trt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:e63e18d15e97689ad2afd8a2adf55b30f79568aa9114dba0bbf50931c9a2c3cf
+ size 1842314452
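
Each entry above is a Git LFS pointer (spec version, SHA-256 oid, byte size), so a downloaded engine can be checked against it before loading. A minimal sketch, assuming the engine has already been fetched locally; the helper names `parse_lfs_pointer` and `stream_matches` are illustrative and not part of any Git LFS tooling:

```python
import hashlib
import io

def parse_lfs_pointer(text):
    """Parse the 'key value' lines of a Git LFS pointer file."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {"algo": algo, "digest": digest, "size": int(fields["size"])}

def stream_matches(stream, pointer, chunk=1 << 20):
    """Hash a binary stream in chunks; compare digest and byte count."""
    hasher = hashlib.new(pointer["algo"])
    size = 0
    while True:
        block = stream.read(chunk)
        if not block:
            break
        hasher.update(block)
        size += len(block)
    return size == pointer["size"] and hasher.hexdigest() == pointer["digest"]

# Pointer for tensorRT/sm_90/sa3-sm-music/dit_fp32.trt, copied from this commit.
POINTER = parse_lfs_pointer(
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:35ea9b5f039abfc0b1fd2294ab814a9e26da0567342e29a700da1ee85ab4636d\n"
    "size 1842306180\n"
)

# In-memory stand-in so the sketch runs without the 1.8 GB engine file.
_demo = b"not a real engine"
_demo_pointer = {
    "algo": "sha256",
    "digest": hashlib.sha256(_demo).hexdigest(),
    "size": len(_demo),
}
assert stream_matches(io.BytesIO(_demo), _demo_pointer)
```

For a real file, open it in binary mode and pass the handle, for example `with open("dit_fp32.trt", "rb") as f: ok = stream_matches(f, POINTER)`. Reading in chunks keeps memory use flat even for the 5.8 GB medium engine.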