4 bpw suggestions
Another great quant, thanks! If it's not too much trouble, maybe maximize for 2x96 GB with max context. That might be a nice target.
Thanks! The existing IQ5_K at 157.771 GiB (5.926 BPW) will fit comfortably in 192 GB VRAM with plenty of context. Generally there isn't much reason to go above this quality, as it already uses iq6_k for the ffn_down_exps tensors and iq5_k for ffn_(gate|up)_exps.
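For a rough sense of the headroom (a back-of-envelope sketch, nothing more; real usage also includes CUDA context, compute buffers, etc.):

```python
# back-of-envelope VRAM headroom for the IQ5_K quant on 2x96 GB
# (treats the marketed 96 GB as decimal gigabytes; if the cards really
# expose 96 GiB, the headroom is ~34 GiB instead)
vram_gib = 2 * 96 * 1e9 / 2**30      # ~178.8 GiB usable across both cards
model_gib = 157.771                  # IQ5_K weights
headroom_gib = vram_gib - model_gib  # ~21 GiB left for KV cache and buffers
```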
I'll have some perplexity/kld data soon to see the details!
Well then request fulfilled, thank you!
How do you run a perplexity test? I'm curious about the unsloth quant I'm using.
Look at the logs folder for my exact syntax. Keep in mind different backends can give slightly offset results relative to each other (e.g. CUDA vs CPU; i run CPU-only testing), so be careful attempting to compare across model providers.
ask me if you have any questions
the wiki.test.raw is available here: https://huggingface.co/datasets/ikawrakow/validation-datasets-for-llama.cpp/blob/main/wiki.test.raw.gz (gunzip it first of course)
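For background on what the tool reports (my rough understanding of the math, not llama.cpp's exact implementation): the text is split into n_ctx-sized chunks, the model's log-probability of each actual next token is accumulated, and perplexity is exp of the negative mean log-probability. A minimal sketch:

```python
import math

def perplexity(logprobs):
    # PPL over the evaluated tokens: exp of the negative mean log-probability
    return math.exp(-sum(logprobs) / len(logprobs))

# e.g. a model that assigns every token probability 0.5 has perplexity 2
```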
i have some notes and links about this scattered around too
Does this command look correct if I test on CUDA?
llama-perplexity.exe --ctx-size 512 --batch-size 512 -f web_feed\wiki-text-raw.txt -fa 1 -ngl 999 --seed 1337 --threads 1 -m MiniMax-M2.7-UD-Q4_K_XL-00001-of-00004.gguf
perplexity: tokenizing the input ..
perplexity: tokenization took 994.582 ms
perplexity: calculating perplexity over 552 chunks, n_ctx=512, batch_size=512, n_seq=1
perplexity: 0.54 seconds per pass - ETA 4.98 minutes
[1]4.0207,[2]5.2011,[3]5.0865,[4]5.6036,[5]5.7799,[6]6.3037,[7]6.6499,[8]7.6539,[9]8.0681,[10]8.2523,[11]8.3057,[12]8.6900,[13]8.6879,[14]8.5312,[15]8.6820,[16]8.2602,[17]8.4055,[18]8.3708,[19]8.1997,[20]7.9595,[21]7.9015,[22]7.6729,[23]7.4212,[24]7.2696,[25]6.9098,[26]6.7656,[27]6.8904,[28]6.9075,[29]6.9217,[30]6.9122,[31]6.8547,[32]-nan,[33]-nan,[34]-nan, ... ,[195]-nan
Well that seems to have gone off track. No big deal, just a curiosity.
for CUDA full offload that looks about right... the numbers seem okay, what do you mean it went off track?
example for full GPU offload:
./build/bin/llama-perplexity \
-m "$model" \
-f wiki.test.raw \
-ngl 999 \
--seed 1337 \
--ctx-size 512 \
--threads 1 \
--no-mmap
yeah should be fine... the seed doesn't matter as no sampling is done here... i just like it haha
It turned into NaNs at [32]. Trying your command also turns into NaNs at [32] (you might have to scroll right in the log to see where the NaN run starts).
I don't think there's anything wrong with evaluation since I can generate long sequences of 30-50k tokens in opencode. Maybe llama-perplexity has an issue there.
ooooh ... nan is bad.. that typically indicates a numerical issue either with the backend kernels or the quant itself... i have not seen any nans on my quants running on the CPU backend...
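Worth noting why the whole tail prints -nan rather than just chunk 32: the bracketed values are cumulative over all chunks so far, so a single NaN poisons every later printed number. A toy illustration (the per-token detail is simplified away):

```python
import math

# running perplexity like llama-perplexity prints: each bracketed value
# is over all chunks evaluated so far, so one NaN chunk contaminates
# everything after it (hence the solid run of -nan from [32] onward)
chunk_logprobs = [math.log(0.1)] * 31 + [float('nan')] + [math.log(0.1)] * 5
total, printed = 0.0, []
for i, lp in enumerate(chunk_logprobs, 1):
    total += lp
    printed.append(math.exp(-total / i))

# printed[:31] are finite; printed[31:] are all NaN
```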
check here for ik_llama.cpp windows builds (which you might already be using?) https://github.com/Thireus/ik_llama.cpp/releases
It's strange. I get that after building the latest ik_llama.cpp for Windows, but I also see the same error after [32] with mainline llama.cpp. Maybe it's just an unsloth UD Q4_K_XL thing.
You can try running it with ik_llama.cpp and adding --validate-quants to the command; it should tell you if there are blocks of 0 in the quant... i run all my quants through that before releasing... hrmm.. thanks for sharing the info; if it persists you might want to let unsloth know...
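I don't know the internals of --validate-quants, but conceptually the check is simple: walk the fixed-size quantization blocks of each tensor and flag any block that is entirely zero. A hypothetical sketch (the real tool parses the actual GGUF tensor layout; 144 bytes here matches the size of a Q4_K super-block):

```python
def find_zero_blocks(data: bytes, block_size: int = 144) -> list[int]:
    # hypothetical illustration of a --validate-quants-style check:
    # scan fixed-size quantization blocks and report the indices of any
    # that are all zeros (a sign of corrupt or degenerate quantization)
    return [i // block_size
            for i in range(0, len(data), block_size)
            if not any(data[i:i + block_size])]
```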
kind of interesting find in the wild here with nans showing up on 4ish BPW mainline mixes across some models...
Looks like the reported PPL drops to 0 at around Q4 there, with the same NaN issue. https://huggingface.co/AesSedai/MiniMax-M2.7-GGUF/discussions/1#69dbfd62ab6e80fe0c444fda
@ndroidph it partially worked looking at the log: https://huggingface.co/AesSedai/MiniMax-M2.7-GGUF/blob/main/kld_data/aes_sedai/MiniMax-M2.7-Q4_K_M.md
but after the first couple of batches it started throwing NaN.
That's similar to what I see after batch 32.
For folks who haven't seen https://www.reddit.com/r/LocalLLaMA/comments/1slk4di/minimax_m27_gguf_investigation_fixes_benchmarks/ - the NaN issue has been fixed for all our quants - Aes also re-uploaded Q4_K_M with the Q6_K fix.
Bart is still investigating, but will most likely upload in the next few days.
The issue isn't isolated to us: 10/26 (38%) of Bartowski's quants also NaN, whilst 5/23 (22%) of ours do. So it's a widespread issue.
blk.61.ffn_down_exps overflows under Q4_K and Q5_K, so Q6_K must be used
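To illustrate why an overflow ends in NaN rather than merely a huge number (a toy sketch, not the actual kernel code): fp16 tops out at 65504, magnitudes beyond that become inf, and inf combined with inf (or 0) later in the computation yields NaN.

```python
import math

FP16_MAX = 65504.0  # largest finite value representable in IEEE 754 half

def fp16_like(x: float) -> float:
    # toy model of fp16 overflow: out-of-range magnitudes become inf
    return math.copysign(math.inf, x) if abs(x) > FP16_MAX else x

# once a value overflows to inf, ordinary arithmetic downstream
# (inf - inf, 0 * inf) produces NaN, which is what shows up in the logs
big = fp16_like(1e5)          # inf
assert math.isnan(big - big)  # inf - inf -> nan
```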