Anyone running this with an M4 Max 128 GB? How does it compare to 4-bit quantization?
Thanks for pushing this model. I've seen that the 4-bit quantization, which would be my standard go-to version, of MiniMax M2.1 is too large to fit in 128 GB, and that it still seems to have issues with the thinking templates. I'm wondering if anyone has successfully tried this with 128 GB of shared memory and whether there were any issues. Also, what tokens/s are you getting, and with which infrastructure?
Unsloth says they made some fixes to the chat template. Their Jinja template can be found at https://huggingface.co/unsloth/MiniMax-M2.1/blob/main/chat_template.jinja
Would you be able to test it with their template to see if that template solves your issue please?
This model is 100 GB in size.
You might have to run something like `sudo sysctl iogpu.wired_limit_mb=117760` in the terminal to tell macOS you want to allow 115 GB of wired GPU memory. You could even try `122880` for 120 GB.
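To make the arithmetic explicit, here's a small sketch of where those numbers come from: `iogpu.wired_limit_mb` takes a value in megabytes, so you multiply the desired GB by 1024. The variable names below are just for illustration; the setting resets on reboot, so you'd re-run it after restarting.

```shell
# iogpu.wired_limit_mb expects megabytes, so convert GB -> MB
GPU_GB=115
GPU_MB=$((GPU_GB * 1024))
echo "$GPU_MB"   # prints 117760

# Apply it (needs sudo; does not persist across reboots):
# sudo sysctl iogpu.wired_limit_mb=$GPU_MB
```

Leaving ~8-13 GB for macOS itself is the safety margin here; pushing to 120 GB (122880 MB) works for some people but risks the system swapping or killing the process under memory pressure.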