Fused PLE (35 norms -> 1): smaller graph, faster iPhone compile 4a61cd0 verified mlboydaisuke committed 3 days ago
Multifunction chunks: SDPA decode + N=512 prefill, 50% weight dedup d445d15 verified mlboydaisuke committed 3 days ago