Releasing my first kernel 🔥 MaxSim
Late-interaction retrieval (ColBERT / PyLate) bottlenecks on materializing the full similarity matrix. This kernel avoids that by scoring in tiles, using simdgroup_matrix (Metal) and WMMA (CUDA).
The result is a 3–5× speedup compared to a naive PyTorch baseline 🔥
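To make the idea concrete, here is a minimal sketch in plain Python of what "avoiding the full similarity matrix" means. It is not the kernel's code: the function names (`maxsim_naive`, `maxsim_tiled`), the `tile` parameter, and the toy inputs are all hypothetical and only illustrate the scoring pattern. The naive version builds the whole Lq × Ld matrix before reducing; the tiled version walks the document tokens in blocks and keeps just one running maximum per query token.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_naive(Q, D):
    """Naive MaxSim: materialize the full Lq x Ld similarity matrix,
    then take the max over document tokens and sum over query tokens."""
    sim = [[dot(q, d) for d in D] for q in Q]  # full matrix lives in memory
    return sum(max(row) for row in sim)


def maxsim_tiled(Q, D, tile=2):
    """Tiled MaxSim (illustrative): process document tokens in blocks of
    `tile` and keep only a running max per query token."""
    best = [-math.inf] * len(Q)
    for start in range(0, len(D), tile):
        block = D[start:start + tile]  # one tile of document tokens
        for i, q in enumerate(Q):
            for d in block:
                s = dot(q, d)
                if s > best[i]:
                    best[i] = s  # update running max, nothing else is stored
    return sum(best)


# Toy example: 2 query tokens, 3 document tokens, embedding dim 2.
Q = [[1.0, 0.0], [0.0, 1.0]]
D = [[0.5, 0.5], [1.0, 0.0], [0.0, 2.0]]

print(maxsim_naive(Q, D))          # 3.0
print(maxsim_tiled(Q, D, tile=2))  # 3.0
```

Both functions return the same score; the tiled one only ever holds Lq running maxima instead of Lq × Ld similarities. On a GPU, each tile's dot products would presumably map onto the matrix-multiply primitives mentioned above.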
Benchmarks:
- SmallRerank (B=32, C=10): up to 3.2× (M3 Pro) / 2.8× (A100)
- HeavyRerank (B=32, C=100): up to 3.8× (M3 Pro) / 5.3× (A100)
- LongDocStress (Ld=1024): up to 6.2× (L4)
Try it out 👇
https://huggingface.co/kernels/erikkaum/maxsim