Late-interaction retrieval (ColBERT / PyLate) bottlenecks on materializing the full similarity matrix. This kernel avoids that with tiled scoring built on simdgroup_matrix (Metal) and WMMA (CUDA).
The result is a 3–5× speedup over a naive PyTorch baseline.
Benchmarks:
- SmallRerank (B=32, C=10): up to 3.2× (M3 Pro) / 2.8× (A100)
- HeavyRerank (B=32, C=100): up to 3.8× (M3 Pro) / 5.3× (A100)
- LongDocStress (Ld=1024): up to 6.2× (L4)
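To make the tiling idea concrete, here is a minimal NumPy sketch of late-interaction (MaxSim) scoring. It is not the kernel itself, which runs on simdgroup_matrix / WMMA hardware paths. The function names, the `tile` parameter, and the reading of B as queries, C as candidates per query, and Ld as document length are assumptions made for illustration. The naive version builds the full (B, C, Lq, Ld) similarity tensor, while the tiled version only holds one (B, C, Lq, tile) slice at a time and keeps a running max per query token.

```python
import numpy as np

def maxsim_naive(Q, D):
    """Q: (B, Lq, dim) query tokens, D: (B, C, Ld, dim) candidate doc tokens."""
    # Materializes the full similarity tensor of shape (B, C, Lq, Ld).
    sim = np.einsum("bqd,bcld->bcql", Q, D)
    # Max over document tokens, then sum over query tokens -> (B, C).
    return sim.max(axis=-1).sum(axis=-1)

def maxsim_tiled(Q, D, tile=128):
    """Same scores, but only one (B, C, Lq, tile) slice exists at a time."""
    B, C, Ld, _ = D.shape
    Lq = Q.shape[1]
    running_max = np.full((B, C, Lq), -np.inf, dtype=Q.dtype)
    for start in range(0, Ld, tile):
        block = D[:, :, start:start + tile, :]           # (B, C, <=tile, dim)
        sim = np.einsum("bqd,bcld->bcql", Q, block)      # partial similarities
        np.maximum(running_max, sim.max(axis=-1), out=running_max)
    return running_max.sum(axis=-1)                      # (B, C)
```

Because max is associative, reducing tile by tile gives the same scores as reducing over the whole document axis at once. That property is what lets a fused kernel skip the full matrix.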
Latest hf-mem release added a breakdown of Mixture-of-Experts (MoE) memory usage!
TL;DR: MoEs can be misleading to reason about from active parameters alone, since each token only activates a subset of experts, while the serving setup still needs to account for the full resident memory footprint.
- hf-mem now splits MoE memory into base model weights, routed experts, and KV cache
- Dense models usually load and use most weights every forward pass, while MoEs load many experts but only route each token to a few of them
- Active params isn't the same as memory footprint, especially for sparse architectures
- Runtime memory is about what is used per request/token, while loading memory also includes the expert weights that need to be resident
- KV cache can still dominate depending on context length, batch size, and concurrency
- Expert Parallelism (EP) helps shard experts across accelerators when expert weights dominate
- Data Parallelism (DP) + EP is often a good fit for throughput-oriented MoE serving
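The split between resident and active memory described above can be sketched with some back-of-the-envelope arithmetic. This is not hf-mem's API or its actual accounting. `MoEConfig`, its fields, and the helper functions are hypothetical names, and the KV cache formula assumes standard multi-head or grouped-query attention with one K and one V tensor per layer.

```python
from dataclasses import dataclass

@dataclass
class MoEConfig:
    # Hypothetical config for illustration only, not an hf-mem type.
    base_params: int        # non-expert params (attention, embeddings, router, ...)
    params_per_expert: int  # params in one expert of one MoE layer
    num_experts: int        # routed experts per MoE layer
    experts_per_token: int  # experts each token is routed to
    num_moe_layers: int
    num_layers: int         # layers that hold a KV cache
    num_kv_heads: int
    head_dim: int

def weight_bytes(cfg, bytes_per_param=2):
    """Resident weight memory, split into (base, routed experts)."""
    base = cfg.base_params * bytes_per_param
    experts = (cfg.params_per_expert * cfg.num_experts
               * cfg.num_moe_layers * bytes_per_param)
    return base, experts

def active_params(cfg):
    """Parameters actually touched by a single token."""
    return cfg.base_params + (cfg.params_per_expert
                              * cfg.experts_per_token * cfg.num_moe_layers)

def kv_cache_bytes(cfg, context_len, batch_size, bytes_per_elem=2):
    """K and V for every layer, KV head, position, and sequence in the batch."""
    return (2 * cfg.num_layers * cfg.num_kv_heads * cfg.head_dim
            * context_len * batch_size * bytes_per_elem)
```

With such a breakdown, comparing the expert term against the KV cache term for a target context length and batch size gives a rough idea of whether expert sharding or cache pressure is the bigger concern.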
Natural Language Autoencoders: A Window into Latent Structure
I introduced a concise mathematical formulation of the P versus NP question into the SRT-NLA-AV-v1 demonstration:
P vs NP asks whether every problem whose solution can be verified in polynomial time (NP) can also be solved in polynomial time (P). Integer factorization, given N = p·q where p and q are large primes (p < q), is in NP but widely believed not to be in P.
The resulting activation verbalization (best-of-N, reranked by AR fidelity) surfaced:
“This article originally appeared in the August 2016 edition of CACM. A new method of proving computational hardness of problems, known as multilinearization, can improve efficiency, reduce complexity and simplify proofs. In this article, I describe multilinearization and its application to several key problems, from the discrete logarithm and factoring to RSA and elliptic-curve discrete logarithms.”
What emerges is not a literal restatement, but a structured articulation of the model’s internal associations: hardness proofs, algebraic techniques, and the cryptographic implications that orbit this foundational question in computational complexity.
The demo offers a compelling interface for exploring these latent representations.
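The verify-versus-solve asymmetry in the factorization prompt above can be illustrated with a small sketch. The function names are hypothetical and this code is not part of the demo. Checking a claimed factorization takes one multiplication, while the simple search shown here (trial division) takes time exponential in the bit length of N. For brevity the checker only confirms the product and ordering, not that p and q are prime.

```python
def verify_factorization(n, p, q):
    """Check a claimed certificate (p, q) for n: one multiplication."""
    return 1 < p < q and p * q == n

def factor_by_trial_division(n):
    """Search for a nontrivial factor; about sqrt(n) steps in the worst case."""
    d = 2
    while d * d <= n:
        if n % d == 0:
            return d, n // d
        d += 1
    return None  # no nontrivial factor found, so n is prime (or n < 4)

# Small example: 3233 = 53 * 61
n = 3233
found = factor_by_trial_division(n)
print(found, verify_factorization(n, *found))
```

Trial division is only the simplest search strategy. Faster factoring algorithms exist, but none known runs in polynomial time on classical hardware.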