MiniMax-M2.5 MLX q3/q4 uploaded
MiniMax just dropped M2.5 today (229B MoE, 10B active params, 80.2% SWE-Bench Verified) and I quantized it to MLX within a couple hours of release. First time doing this!
My quants:
Other MLX quants out there:
- mlx-community/MiniMax-M2.5-4bit -- the bot got there a few minutes before me lol
- inferencerlabs 6.5-bit and 9-bit -- higher quality quants using modified MLX
Performance on M3 Ultra 512GB:
- ~53 tokens/sec generation (4-bit)
- ~54 tokens/sec generation (3-bit)
- ~128GB peak memory (4-bit), ~100GB peak memory (3-bit)
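Those memory numbers line up with a back-of-the-envelope estimate. A minimal sketch, assuming MLX's usual group quantization (group size 64, with one 16-bit scale and one 16-bit bias per group) and pretending all 229B parameters are quantized the same way. Both are assumptions rather than settings confirmed for these uploads, and the function name is just for illustration:

```python
def estimated_weight_gb(n_params: float, bits: int,
                        group_size: int = 64,
                        scale_bias_bits: int = 32) -> float:
    """Rough size of quantized weights in decimal GB.

    Assumes every parameter is quantized with the same settings and that
    each group of `group_size` weights carries `scale_bias_bits` of extra
    metadata (a 16-bit scale plus a 16-bit bias). Ignores KV cache and
    activations, so real peak memory will differ.
    """
    effective_bits_per_weight = bits + scale_bias_bits / group_size
    return n_params * effective_bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    total_params = 229e9  # total parameter count quoted in the post
    for bits in (3, 4):
        size_gb = estimated_weight_gb(total_params, bits)
        print(f"{bits}-bit: ~{size_gb:.1f} GB of weights")
```

Under those assumptions the nominal 3-bit and 4-bit quants work out to about 3.5 and 4.5 effective bits per weight, or roughly 100 GB and 129 GB of weights.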
Quality note on the 3-bit: inferencerlabs' testing shows significant quality degradation below 4 bits for this model (43% token accuracy at q3.5 vs 91%+ at q4.5). I tested my 3-bit on coding and reasoning tasks and it produced coherent, correct output, but it's definitely not as sharp as 4-bit. Think of it as the smallest viable quant for people who can't fit the 4-bit version. 2-bit was completely unusable (infinite repetition loops).
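For anyone who wants to sanity-check a number like "token accuracy" on their own quant, here is one simple way such a metric could be defined: the fraction of positions where the quantized model picks the same token as a reference model. This is an assumed definition for illustration only; it may not match how inferencerlabs computed their figures, and the token IDs below are made up.

```python
from typing import Sequence


def token_accuracy(reference: Sequence[int], candidate: Sequence[int]) -> float:
    """Fraction of positions where `candidate` matches `reference`.

    `reference` would be the token IDs chosen by the unquantized (or
    higher-precision) model and `candidate` the token IDs chosen by the
    quantized model at the same positions, for the same prompt.
    """
    if len(reference) != len(candidate):
        raise ValueError("sequences must have the same length")
    if not reference:
        raise ValueError("sequences must not be empty")
    matches = sum(r == c for r, c in zip(reference, candidate))
    return matches / len(reference)


if __name__ == "__main__":
    # Hypothetical token IDs, purely to show the call.
    ref_tokens = [101, 2009, 318, 257, 1332]
    quant_tokens = [101, 2009, 318, 262, 1332]
    print(f"token accuracy: {token_accuracy(ref_tokens, quant_tokens):.0%}")
```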
Converted with mlx-lm v0.30.7. Happy to answer questions if anyone else wants to try running this locally - cheers!
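For reference, a conversion with mlx-lm generally goes through its `mlx_lm.convert` command. The sketch below only builds and prints the command lines rather than running them. The source repo id, the output folder names, and the group size of 64 are assumptions for illustration, not necessarily the exact settings used for these uploads.

```python
import shlex
from typing import List


def build_convert_command(hf_path: str, mlx_path: str, bits: int,
                          group_size: int = 64) -> List[str]:
    """Build an argument list for the mlx-lm conversion CLI.

    Nothing is executed here; the list can be passed to subprocess.run
    on a machine that has mlx-lm installed.
    """
    return [
        "mlx_lm.convert",
        "--hf-path", hf_path,        # source model on Hugging Face
        "--mlx-path", mlx_path,      # local output directory
        "-q",                        # enable quantization
        "--q-bits", str(bits),
        "--q-group-size", str(group_size),
    ]


if __name__ == "__main__":
    source_repo = "MiniMaxAI/MiniMax-M2.5"  # assumed repo id, check before use
    for bits in (3, 4):
        out_dir = f"MiniMax-M2.5-{bits}bit-mlx"  # hypothetical folder name
        print(shlex.join(build_convert_command(source_repo, out_dir, bits)))
```

The download plus conversion needs enough disk space and memory for the full-precision weights, so expect this to be practical only on a large-memory machine.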
I don't see your actual quants in your repo, are they in different branches or something?
Also, what is the exact recipe you used? I'm not as familiar with mlx-lm, but in my own testing with GGUFs using mainline and ik_llama.cpp, the smol-IQ3_KS quant (87.237 GiB, 3.277 BPW) seems to be working okay with opencode, at least anecdotally. Do you have a link to inferencerlabs' testing? I'm curious whether it can be reproduced for a given quant.
Cheers!