4 bpw suggestions

#1
by ndroidph - opened

Another great quant, thanks! If it's not too much trouble, maybe maximize for 2x96 GB with max context. That might be a nice target.

@ndroidph

Thanks! The existing IQ5_K at 157.771 GiB (5.926 BPW) will fit comfortably in 192GB VRAM with plenty of context. Generally there isn't much reason to go above this quality, as it uses iq6_k for the ffn_down_exps tensors and iq5_k for ffn_(gate|up)_exps.
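
As a rough sanity check on that figure, on-disk size is just parameter count times bits per weight. A minimal sketch (the ~228.7B parameter count is back-derived by me from the stated size and BPW, not taken from the thread):

```python
def gguf_size_gib(n_params: float, bpw: float) -> float:
    """Approximate on-disk size of a quant: params * bits-per-weight, in GiB."""
    return n_params * bpw / 8 / 2**30

# ~228.7B params is an assumption, back-derived from 157.771 GiB at 5.926 BPW
print(round(gguf_size_gib(228.7e9, 5.926), 1))  # → 157.8
```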

I'll have some perplexity/kld data soon to see the details!

Well then request fulfilled, thank you!

ndroidph changed discussion status to closed

@ndroidph

haha sweet! perplexity on that one looks great, just ~0.1% above baseline! i'm running some KLD also now to see if it shows anything more.
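
For concreteness, "X% above baseline" here just means the relative perplexity increase versus the unquantized model. A tiny sketch with made-up numbers (both PPL values below are illustrative, not from this run):

```python
# Hypothetical values for illustration only
ppl_base = 5.000   # assumed baseline (bf16) perplexity
ppl_quant = 5.005  # assumed quantized-model perplexity

rel = (ppl_quant - ppl_base) / ppl_base * 100
print(f"{rel:.2f}% above baseline")  # → 0.10% above baseline
```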

How do you run a perplexity test? I'm curious about the unsloth quant I'm using.

@ndroidph

Look at the logs folder for my exact syntax. Keep in mind that different backends can show a small offset relative to one another, e.g. CUDA vs CPU (I run CPU-only testing), so be careful attempting to compare across model providers.

ask me if you have any questions

the wiki.test.raw is available here: https://huggingface.co/datasets/ikawrakow/validation-datasets-for-llama.cpp/blob/main/wiki.test.raw.gz (gunzip it first of course)
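
If it helps, fetching and unpacking that file can look like this (sketch: I've swapped blob/ for resolve/ in the URL, which is the standard Hugging Face path for downloading the raw file):

```shell
# Download the gzipped eval set and unpack it to wiki.test.raw
curl -L -o wiki.test.raw.gz \
  "https://huggingface.co/datasets/ikawrakow/validation-datasets-for-llama.cpp/resolve/main/wiki.test.raw.gz"
gunzip wiki.test.raw.gz   # leaves wiki.test.raw in the current directory
```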

i have some notes and links about this scattered around too

Does this command look correct if I test on CUDA?

llama-perplexity.exe --ctx-size 512 --batch-size 512 -f web_feed\wiki-text-raw.txt -fa 1 -ngl 999 --seed 1337 --threads 1 -m MiniMax-M2.7-UD-Q4_K_XL-00001-of-00004.gguf

perplexity: tokenizing the input ..
perplexity: tokenization took 994.582 ms
perplexity: calculating perplexity over 552 chunks, n_ctx=512, batch_size=512, n_seq=1
perplexity: 0.54 seconds per pass - ETA 4.98 minutes
[1]4.0207,[2]5.2011,[3]5.0865,[4]5.6036,[5]5.7799,[6]6.3037,[7]6.6499,[8]7.6539,[9]8.0681,[10]8.2523,[11]8.3057,[12]8.6900,[13]8.6879,[14]8.5312,[15]8.6820,[16]8.2602,[17]8.4055,[18]8.3708,[19]8.1997,[20]7.9595,[21]7.9015,[22]7.6729,[23]7.4212,[24]7.2696,[25]6.9098,[26]6.7656,[27]6.8904,[28]6.9075,[29]6.9217,[30]6.9122,[31]6.8547,
[32]-nan,[33]-nan,[34]-nan, ... (every chunk from [32] through [195] reports -nan)

Well that seems to have gone off track. No big deal, just a curiosity.

@ndroidph

for CUDA full offload that looks about right... the numbers seem okay, what do you mean it went off track?

example for full GPU offload:

./build/bin/llama-perplexity \
    -m "$model" \
    -f wiki.test.raw \
    -ngl 999 \
    --seed 1337 \
    --ctx-size 512 \
    --threads 1 \
    --no-mmap

yeah should be fine... the seed doesn't matter as no sampling is done here... i just like it haha

It turned into nans after [32]. Trying your command also turns into nans at [32]. You might have to scroll right to see where the nan sequence begins.

I don't think there's anything wrong with evaluation since I can generate long sequences of 30-50k tokens in opencode. Maybe llama-perplexity has an issue there.

@ndroidph

ooooh ... nan is bad.. that typically indicates a numerical issue either with the backend kernels or the quant itself... i have not seen any nans on my quants running on the CPU backend...

check here for ik_llama.cpp windows builds (which you might already be using?) https://github.com/Thireus/ik_llama.cpp/releases

It's strange. I get that after building the latest ik_llama.cpp for Windows, but I also see the same error after [32] with llama.cpp. Maybe it's just an unsloth UD Q4_K_XL thing.

@ndroidph

You can try running it with ik_llama.cpp and adding --validate-quants to the command; it should tell you if there are blocks of 0 in the quant... i run all my quants through that before releasing... hrmm.. thanks for sharing some info, if it persists you might want to let unsloth know...
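
Concretely, that might look like the following (sketch only: --validate-quants is the flag named above, but the exact invocation and flag placement are my assumption):

```shell
# Sketch: same perplexity run as before, with ik_llama.cpp's quant validation
# enabled so suspect blocks in the tensors get reported up front.
# "$model" is a placeholder for your .gguf path.
./build/bin/llama-perplexity \
    -m "$model" \
    -f wiki.test.raw \
    -ngl 999 \
    --ctx-size 512 \
    --threads 1 \
    --validate-quants
```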

@AesSedai @bartowski

kind of interesting find in the wild here with nans showing up on 4ish BPW mainline mixes across some models...

Looks like ppl here is 0 around Q4 with the same nan issue. https://huggingface.co/AesSedai/MiniMax-M2.7-GGUF/discussions/1#69dbfd62ab6e80fe0c444fda

@ndroidph it partially worked looking at the log: https://huggingface.co/AesSedai/MiniMax-M2.7-GGUF/blob/main/kld_data/aes_sedai/MiniMax-M2.7-Q4_K_M.md

but after the first couple of batches it started throwing nan.

That's similar to what I see after batch 32.

For folks who haven't seen https://www.reddit.com/r/LocalLLaMA/comments/1slk4di/minimax_m27_gguf_investigation_fixes_benchmarks/ - the NaN issue has been fixed for all our quants - Aes also re-uploaded Q4_K_M with the Q6_K fix.

Bart is still investigating, but will most likely upload in the next few days.

The issue isn't isolated to us - 10/26 (38%) of Bartowski's quants also NaN whilst 5/23 (22%) of ours NaN. So it's a widespread issue.

blk.61.ffn_down_exps overflows under Q4_K and Q5_K, so Q6_K must be used
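
For intuition on how a single tensor can force NaNs: these k-quant formats store their super-block scales in fp16, whose largest finite value is 65504, so a value pushed past that range simply cannot be represented and the resulting inf/NaN propagates through every subsequent chunk. (The fp16-range framing is my illustration of one plausible mechanism, not a diagnosis from the thread.) A tiny stdlib sketch of that range limit, using Python's half-precision struct format:

```python
import struct

FP16_MAX = 65504.0  # largest finite IEEE half-precision value

def to_fp16_and_back(x: float) -> float:
    """Round-trip a value through IEEE half precision ('e' struct format)."""
    return struct.unpack('<e', struct.pack('<e', x))[0]

# A value within fp16 range survives the round trip exactly (60000 is representable):
print(to_fp16_and_back(60000.0))  # → 60000.0

# A value beyond fp16 range cannot be stored at all:
try:
    to_fp16_and_back(70000.0)
except OverflowError as e:
    print("overflow:", e)
```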
