4 bpw suggestions
Another great quant, thanks! If it's not too much trouble, maybe maximize for 2x96 GB with max context. That might be a nice target.
Thanks! The existing IQ5_K at 157.771 GiB (5.926 BPW) will fit comfortably in 192 GB VRAM with plenty of context. Generally there isn't much reason to go above this quality, as it already uses iq6_k for the ffn_down_exps tensors and iq5_k for ffn_(gate|up)_exps.
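For a rough sense of the headroom (a back-of-envelope sketch, nothing more; real usage also includes CUDA context, compute buffers, etc.):

```python
# back-of-envelope VRAM headroom for the IQ5_K quant on 2x96 GB
# (treats the marketed 96 GB as decimal gigabytes; if the cards really
# expose 96 GiB, the headroom is ~34 GiB instead)
vram_gib = 2 * 96 * 1e9 / 2**30      # ~178.8 GiB usable across both cards
model_gib = 157.771                  # IQ5_K weights
headroom_gib = vram_gib - model_gib  # ~21 GiB left for KV cache and buffers
```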
I'll have some perplexity/kld data soon to see the details!
Well then request fulfilled, thank you!
How do you run a perplexity test? I'm curious about the unsloth quant I'm using.
Look at the logs folder for my exact syntax. Keep in mind different backends can give slightly offset results relative to each other (e.g. CUDA vs CPU; i run CPU-only testing), so be careful attempting to compare across model providers.
ask me if you have any questions
the wiki.test.raw is available here: https://huggingface.co/datasets/ikawrakow/validation-datasets-for-llama.cpp/blob/main/wiki.test.raw.gz (gunzip it first of course)
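For background on what the tool reports (my rough understanding of the math, not llama.cpp's exact implementation): the text is split into n_ctx-sized chunks, the model's log-probability of each actual next token is accumulated, and perplexity is exp of the negative mean log-probability. A minimal sketch:

```python
import math

def perplexity(logprobs):
    # PPL over the evaluated tokens: exp of the negative mean log-probability
    return math.exp(-sum(logprobs) / len(logprobs))

# e.g. a model that assigns every token probability 0.5 has perplexity 2
```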
i have some notes and links about this scattered around too
Does this command look correct if I test on CUDA?
llama-perplexity.exe --ctx-size 512 --batch-size 512 -f web_feed\wiki-text-raw.txt -fa 1 -ngl 999 --seed 1337 --threads 1 -m MiniMax-M2.7-UD-Q4_K_XL-00001-of-00004.gguf
perplexity: tokenizing the input ..
perplexity: tokenization took 994.582 ms
perplexity: calculating perplexity over 552 chunks, n_ctx=512, batch_size=512, n_seq=1
perplexity: 0.54 seconds per pass - ETA 4.98 minutes
[1]4.0207,[2]5.2011,[3]5.0865,[4]5.6036,[5]5.7799,[6]6.3037,[7]6.6499,[8]7.6539,[9]8.0681,[10]8.2523,[11]8.3057,[12]8.6900,[13]8.6879,[14]8.5312,[15]8.6820,[16]8.2602,[17]8.4055,[18]8.3708,[19]8.1997,[20]7.9595,[21]7.9015,[22]7.6729,[23]7.4212,[24]7.2696,[25]6.9098,[26]6.7656,[27]6.8904,[28]6.9075,[29]6.9217,[30]6.9122,[31]6.8547,[32]-nan,[33]-nan,[34]-nan, ... ,[195]-nan
Well that seems to have gone off track. No big deal, just a curiosity.
for CUDA full offload that looks about right... the numbers seem okay, what do you mean it went off track?
example for full GPU offload:
./build/bin/llama-perplexity \
-m "$model" \
-f wiki.test.raw \
-ngl 999 \
--seed 1337 \
--ctx-size 512 \
--threads 1 \
--no-mmap
yeah should be fine... the seed doesn't matter as no sampling is done here... i just like it haha
It turned into NaNs at [32]. Trying your command also turns into NaNs at [32] (you might have to scroll right in the log to see where the NaN run starts).
I don't think there's anything wrong with evaluation since I can generate long sequences of 30-50k tokens in opencode. Maybe llama-perplexity has an issue there.
ooooh ... nan is bad.. that typically indicates a numerical issue either with the backend kernels or the quant itself... i have not seen any nans on my quants running on the CPU backend...
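Worth noting why the whole tail prints -nan rather than just chunk 32: the bracketed values are cumulative over all chunks so far, so a single NaN poisons every later printed number. A toy illustration (the per-token detail is simplified away):

```python
import math

# running perplexity like llama-perplexity prints: each bracketed value
# is over all chunks evaluated so far, so one NaN chunk contaminates
# everything after it (hence the solid run of -nan from [32] onward)
chunk_logprobs = [math.log(0.1)] * 31 + [float('nan')] + [math.log(0.1)] * 5
total, printed = 0.0, []
for i, lp in enumerate(chunk_logprobs, 1):
    total += lp
    printed.append(math.exp(-total / i))

# printed[:31] are finite; printed[31:] are all NaN
```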
check here for ik_llama.cpp windows builds (which you might already be using?) https://github.com/Thireus/ik_llama.cpp/releases
It's strange. I get that after building the latest ik_llama.cpp for Windows, but I also see the same error after [32] with mainline llama.cpp. Maybe it's just an unsloth UD Q4_K_XL thing.
You can try running it with ik_llama.cpp and adding --validate-quants to the command; it should tell you if there are blocks of 0 in the quant... i run all my quants through that before releasing... hrmm.. thanks for sharing the info; if it persists you might want to let unsloth know...
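I don't know the internals of --validate-quants, but conceptually the check is simple: walk the fixed-size quantization blocks of each tensor and flag any block that is entirely zero. A hypothetical sketch (the real tool parses the actual GGUF tensor layout; 144 bytes here matches the size of a Q4_K super-block):

```python
def find_zero_blocks(data: bytes, block_size: int = 144) -> list[int]:
    # hypothetical illustration of a --validate-quants-style check:
    # scan fixed-size quantization blocks and report the indices of any
    # that are all zeros (a sign of corrupt or degenerate quantization)
    return [i // block_size
            for i in range(0, len(data), block_size)
            if not any(data[i:i + block_size])]
```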
kind of interesting find in the wild here with nans showing up on 4ish BPW mainline mixes across some models...
Looks like the reported PPL drops to 0 at around Q4 there, with the same NaN issue. https://huggingface.co/AesSedai/MiniMax-M2.7-GGUF/discussions/1#69dbfd62ab6e80fe0c444fda
@ndroidph it partially worked looking at the log: https://huggingface.co/AesSedai/MiniMax-M2.7-GGUF/blob/main/kld_data/aes_sedai/MiniMax-M2.7-Q4_K_M.md
but after the first couple of batches it started throwing NaN.
That's similar to what I see after batch 32.
For folks who haven't seen https://www.reddit.com/r/LocalLLaMA/comments/1slk4di/minimax_m27_gguf_investigation_fixes_benchmarks/ - the NaN issue has been fixed for all our quants - Aes also re-uploaded Q4_K_M with the Q6_K fix.
Bart is still investigating, but will most likely upload in the next few days.
The issue isn't isolated to us: 10/26 (38%) of Bartowski's quants also NaN, whilst 5/23 (22%) of ours do. So it's a widespread issue.
blk.61.ffn_down_exps overflows under Q4_K and Q5_K, so Q6_K must be used
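To illustrate why an overflow ends in NaN rather than merely a huge number (a toy sketch, not the actual kernel code): fp16 tops out at 65504, magnitudes beyond that become inf, and inf combined with inf (or 0) later in the computation yields NaN.

```python
import math

FP16_MAX = 65504.0  # largest finite value representable in IEEE 754 half

def fp16_like(x: float) -> float:
    # toy model of fp16 overflow: out-of-range magnitudes become inf
    return math.copysign(math.inf, x) if abs(x) > FP16_MAX else x

# once a value overflows to inf, ordinary arithmetic downstream
# (inf - inf, 0 * inf) produces NaN, which is what shows up in the logs
big = fp16_like(1e5)          # inf
assert math.isnan(big - big)  # inf - inf -> nan
```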