4bpw request =)

#2
opened by BahamutRU

16 GB VRAM + 128 GB RAM

or

128 GB shared RAM

Maybe 125-126 GB max size? 120-121?
Or, maybe, IQ4_NL?

A mainline quant would be great, ah-ah. ^_^'

At the moment I use the Q4_K_S from unsloth, but the quality… I wish it were better.

Yes, this is probably a good size. That extra 16GB of VRAM can make a difference with this model, as my previous mainline quant was mainline-IQ4_NL at 121.234 GiB (4.554 BPW), just too big for folks with a 128GB Strix Halo or DGX Spark.

That UD-Q4_K_S is probably around 122GiB...
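(A back-of-the-envelope sketch for translating between BPW and file size, assuming size scales linearly with BPW and ignoring GGUF metadata overhead. From the 121.234 GiB at 4.554 BPW figure above:)

$$
P \approx \frac{121.234 \times 2^{30} \times 8}{4.554} \approx 2.29\times 10^{11}
\qquad\Rightarrow\qquad
\mathrm{size}(b) \approx \frac{P\,b}{8 \times 2^{30}}\ \mathrm{GiB},\quad
\mathrm{size}(4.40) \approx 117\ \mathrm{GiB}
$$

So at this parameter count, every 0.1 BPW shaved off a recipe is roughly 2.7 GiB of file size.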

It will be difficult to shave off a little more to fit under 128GB total, but I think you'll be fine with the extra 16GB of VRAM. You probably can't pump -ub though, and might have to use stuff like -khad -ctk q8_0 -ctv q6_0 to fit enough kv-cache... -vhad is still being fixed here: https://github.com/ikawrakow/ik_llama.cpp/pull/1625
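For illustration, a hypothetical launch along those lines (model path, context size, layer split, and the -ot pattern are placeholders I'm assuming, not a tested recipe):

```bash
# Hypothetical sketch for 16GB VRAM + 128GB RAM with ik_llama.cpp.
#   -ctk q8_0 -ctv q6_0  quantized kv-cache (needs -fa), as suggested above
#   -ub 512              modest ubatch, since VRAM is tight
#   -ot exps=CPU         keep routed experts in system RAM
./build/bin/llama-server \
  -m /models/MiniMax-M2.7-smol-IQ4_KSS.gguf \
  -c 32768 -fa \
  -ctk q8_0 -ctv q6_0 \
  -ub 512 \
  -ngl 99 -ot exps=CPU
```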

Okay, I'll finish up getting KLD data this time as the PPL data was wonky on 2.5. This will help me fiddle with something in this ballpark!

This one at 117GB was my favorite M2.5 quant for 96 GB RAM + 32 GB VRAM, hoping to see it again with M2.7:

https://huggingface.co/ubergarm/MiniMax-M2.5-GGUF/tree/main/smol-IQ4_KSS

(Not much room left for context, but the model is kinda slow with this setup anyway, so long context isn't really feasible.)

Thanks for all those great quants! 🙏

I used MiniMax-M2.5-GGUF IQ4_NL 121.386 GiB (4.559 BPW) as my daily coding driver for the last few months.
So I'd be happy to see a nice ik quant in that range 😊

I would second a mainline quant too, if it's not too much trouble.

@ndroidph

Keep your eyes peeled for AesSedai's mainline quants, which use recipes similar to mine with mainline-compatible types: https://huggingface.co/AesSedai/MiniMax-M2.7-GGUF/tree/main

I'll likely upload 2 more then: something that will fit in under 128GB and something that will need a little more than 128GB.

Very nice, thanks!

> AesSedai's mainline quants

I've been waiting for 12 hours!.. ='D

FWIW, I have been using your m2.5 IQ4_XS (115G) on a Strix Halo / 128GB. I am using it headless, though, with a bunch of other optimizations (TurboQuant, etc.) Would love to see m2.7 in the same :)

What's the speed like on that machine? pp/tg
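(For context: pp = prompt-processing and tg = token-generation throughput. A minimal sketch of how these are typically measured, assuming mainline-style llama-bench flags and a placeholder model path:)

```bash
# Reports pp and tg rates in tokens/s; -p and -n set the
# prompt and generation lengths for the benchmark.
./build/bin/llama-bench -m /models/MiniMax-M2.7-smol-IQ4_KSS.gguf -p 512 -n 128
```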

@BahamutRU they're uploading now and should be available within an hour!

Okay, I added the smol-IQ4_KSS, which looks good on the KLD graph but seems wonky on the PPL graph. The perplexity of this and the previous M2.5 model was kinda wonky, with some quants scoring "better" than baseline, hence I'm showing KLD as well this time.
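KLD is the more trustworthy signal here: a quant's perplexity can dip below the baseline's by chance, but KL divergence against the baseline's own logits is nonnegative by construction, so it can't "beat" baseline. The usual two-pass workflow looks roughly like this (a sketch, assuming ik_llama.cpp mirrors mainline's llama-perplexity KLD flags; file names are placeholders):

```bash
# Pass 1: save baseline logits from the full-precision (or q8_0) model.
./build/bin/llama-perplexity -m baseline-bf16.gguf -f wiki.test.raw \
  --kl-divergence-base logits.bin
# Pass 2: score the quant against those logits instead of raw PPL alone.
./build/bin/llama-perplexity -m smol-IQ4_KSS.gguf -f wiki.test.raw \
  --kl-divergence-base logits.bin --kl-divergence
```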

I gotta take a break, not gonna release another one just yet and hopefully folks can find the mainline quants they want from Aes for now.

Also be careful, some 4ish BPW UD quants are throwing nan in the perplexity test: https://huggingface.co/ubergarm/MiniMax-M2.7-GGUF/discussions/1#69dbf578f8841cf541647480

I've tested that all of mine here look good with ik_llama.cpp's llama-server --validate-quants, and I ran a full clean perplexity/KLD on them with no nans!
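If you want to sanity-check a download yourself, that check is cheap to run before committing to a long perplexity pass (a sketch; the model path is a placeholder):

```bash
# --validate-quants verifies tensor data at load time, so corrupt or
# nan-producing blocks fail fast instead of surfacing mid-benchmark.
./build/bin/llama-server -m MiniMax-M2.7-smol-IQ4_KSS.gguf --validate-quants
```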

Holler at me later if you still can't find the exact iq4_nl quant etc.; I might do one more, but some of my own early testing suggests Qwen3.5 is still pretty strong at smaller sizes: https://huggingface.co/ubergarm/MiniMax-M2.7-GGUF/discussions/3#69dc0e13a36186081f3e4b4a

Did a little write-up on r/LocalLLaMA comparing MiniMax-M2.7 vs Qwen3.5-122B for 96GB VRAM situation: https://www.reddit.com/r/LocalLLaMA/comments/1sjsokz/minimaxm27_vs_qwen35122ba10b_for_96gb_vram_full/

> FWIW, I have been using your m2.5 IQ4_XS (115G) on a Strix Halo / 128GB. I am using it headless, though, with a bunch of other optimizations (TurboQuant, etc.) Would love to see m2.7 in the same :)

> What's the speed like on that machine? pp/tg

I posted some stuff about MiniMax M2.7 - @AesSedai's IQ4_XS: https://huggingface.co/AesSedai/MiniMax-M2.7-GGUF/discussions/1#69dc57e4bc88172c7dbbc256
