Late-interaction retrieval (ColBERT / PyLate) bottlenecks on materializing the full query-document token similarity matrix. This kernel avoids that by scoring in tiles with simdgroup_matrix (Metal) and WMMA (CUDA).
The result is a 3–5× speedup over a naive PyTorch baseline.
Benchmarks:
- SmallRerank (B=32, C=10): up to 3.2× (M3 Pro) / 2.8× (A100)
- HeavyRerank (B=32, C=100): up to 3.8× (M3 Pro) / 5.3× (A100)
- LongDocStress (Ld=1024): up to 6.2× (L4)
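
To make the tiling idea concrete, here is a minimal reference sketch in plain Python. It is not the kernel itself. It assumes the standard late-interaction MaxSim score (for each query token, take the maximum dot product over document tokens, then sum over query tokens), and the function names `maxsim_naive` and `maxsim_tiled` and the `tile` parameter are hypothetical, chosen only for illustration. The naive version builds the whole similarity matrix; the tiled version walks the document in blocks and keeps only one running maximum per query token.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_naive(query: Sequence[Vector], doc: Sequence[Vector]) -> float:
    """Materialize the full (len(query) x len(doc)) similarity matrix,
    then reduce it: max over document tokens, sum over query tokens."""
    sim: List[List[float]] = [[dot(q, d) for d in doc] for q in query]
    return sum(max(row) for row in sim)


def maxsim_tiled(query: Sequence[Vector], doc: Sequence[Vector], tile: int = 2) -> float:
    """Same score, but document tokens are visited in blocks of `tile`.
    Only a running maximum per query token is kept, so the full
    similarity matrix never exists in memory."""
    running_max = [float("-inf")] * len(query)
    for start in range(0, len(doc), tile):
        block = doc[start:start + tile]  # one tile of document tokens
        for i, q in enumerate(query):
            for d in block:
                s = dot(q, d)
                if s > running_max[i]:
                    running_max[i] = s
    return sum(running_max)


if __name__ == "__main__":
    # Tiny made-up example: 2 query tokens, 3 document tokens, dimension 2.
    query = [[1.0, 0.0], [0.0, 1.0]]
    doc = [[0.5, 0.5], [1.0, 0.0], [0.0, 2.0]]
    print(maxsim_naive(query, doc), maxsim_tiled(query, doc, tile=2))
```

Both functions assume a non-empty document. A GPU kernel would presumably compute each tile's dot products with matrix instructions rather than scalar loops, but the running-max bookkeeping is the part that removes the need for the full matrix.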