Problem with the Q4 quant?
I tried this quant with 0.6 temp; it failed the hexagon test and the skywriting outputs are quite bad. The non-ik quant (anikifoss) seems way better. Is there a reason for this?
I'd need a lot more info before I could comment:
- What is your CPU, RAM, and GPU configuration?
- What is the git sha of the ik_llama.cpp you are running? e.g. ./build/bin/llama-server --version
- How did you compile it?
- What is the command you used to run it?
You could compare perplexity yourself, as I'm currently doing. I'd love to know the Final PPL of @anikifoss's version to add to my charts.
Here is how to run perplexity on a CPU-only rig; adjust as necessary if you are using the CUDA backend as well:
wget https://github.com/user-attachments/files/19090237/wiki.test.raw.gz
gunzip wiki.test.raw.gz
export model=/mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/Kimi-K2-Instruct-IQ4_KS.gguf
numactl -N 1 -m 1 \
./build/bin/llama-perplexity \
-m "$model" \
-f wiki.test.raw \
--seed 1337 \
-fa -fmoe \
-mla 3 \
--ctx-size 512 \
--numa numactl \
--threads 128 \
--threads-batch 192
Thanks for your care!
- Xeon 6230 x2, 768 GB RAM, CPU only.
- Version 3802
- Compiled with w64devkit (and some help from ChatGPT); I hit errors, but it worked great for your R1: cmake -B build -G "MinGW Makefiles" -DGGML_CUDA=OFF -DGGML_BLAS=OFF
- llama-server ^
--ctx-size 32768 ^
-ctk q8_0 ^
-mla 3 ^
-fa ^
-amb 1024 ^
-fmoe ^
--parallel 1 ^
--threads 50 ^
--host 0.0.0.0 ^
--port 8080 ^
--alias Kimi-K2-Instruct-DQ4_K ^
--model "E:\Kimi-K2-Instruct-IQ4_KS-00001-of-00014.gguf"
- Is this a Windows rig? Okay, 768 GB RAM, CPU only, that is fine. The 6230 is 20 physical cores and you have dual sockets? You want to reduce the NUMA nodes to as few as possible, but this platform might be too old for SNC=Disable in the BIOS. What does numactl --hardware show, or whatever the equivalent is on Windows? You want all the RAM bandwidth in a single NUMA node if possible.
- Can you run git rev-parse --short HEAD to get the git sha? That is the actual version number. (Both checks are collected in a short sketch near the end of this reply.)
- I'm not sure how to compile on Windows; CUDA off and BLAS off is good.
- Command for CPU only, try this. I'll assume you have two NUMA nodes, one per CPU socket, and that Windows has numactl:
numactl --interleave=all ^
llama-server ^
--numa distribute ^
--ctx-size 32768 ^
-ctk q8_0 ^
-mla 3 ^
-fa ^
-fmoe ^
--parallel 1 ^
--threads 32 ^
--threads-batch 40 ^
-rtr ^
--host 0.0.0.0 ^
--port 8080 ^
--alias Kimi-K2-Instruct-DQ4_K ^
--model "E:\Kimi-K2-Instruct-IQ4_KS-00001-of-00014.gguf"
Comments on your original command:
-amb 1024 ^ # <--- only needed for CUDA, pretty sure, can remove
--threads 50 ^ # <--- I would adjust this to:
--threads 32 ^ # <--- TG is memory-bandwidth limited, so this is likely lower than total physical cores; play with it to find the best speed
--threads-batch 40 ^ # <--- PP is CPU limited, so use the total number of physical cores, no more even if hyperthreading/SMT is enabled
-ub 4096 -b 4096 ^ # <--- larger ub/b can increase PP, or instead of this try -rtr (run-time repack) for more TG; use --no-mmap if not using -rtr for a possible THP boost
Your command is not doing anything that would affect the "smarts" of the model; q8_0 is fine for the kv-cache. I'd suggest paying attention to your system prompt and sampling settings, which are more likely to affect the quality of the output.
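Here are the two quick checks mentioned above in one place (Linux syntax; the Windows equivalents will differ):
# Actual version identifier of your ik_llama.cpp checkout:
git rev-parse --short HEAD
# or ask the built binary directly:
./build/bin/llama-server --version
# NUMA topology; ideally all the RAM bandwidth sits in as few nodes as possible:
numactl --hardware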
Good luck!
You can probably speed it up with -ub 4096 -b 4096; as long as it says n_ctx 512 and you are using the exact wiki.test.raw file that I have, the numbers should be comparable. But this kind of testing can take a while. You did not give your full command, so I can't comment on speed optimization. Also, the original question is about generation quality, not speed.
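For what it's worth, here is a sketch of the same perplexity command from above with the larger batch sizes added. It only combines flags already discussed and is untested on your rig; keep --ctx-size 512 and the same wiki.test.raw so the numbers stay comparable, and drop or adjust the numactl/--numa parts as your system requires:
./build/bin/llama-perplexity \
-m "$model" \
-f wiki.test.raw \
--seed 1337 \
-fa -fmoe \
-mla 3 \
--ctx-size 512 \
-ub 4096 -b 4096 \
--threads 128 \
--threads-batch 192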
I thought about it some more, and I must ask: what are "skywriting outputs"? Also, are you using the model in English or other languages? The main difference I would expect between mine and @anikifoss's models is that I use an imatrix calibration file, while his does not. The other main difference I'm noodling on is that I'm pretty sure aniki's uses full Q8_0 for all attn/shexp/first dense ffn layers, which are typically offloaded onto GPU... No one has measured that DQ4_K perplexity yet either, so I'll do some more experimenting here too to see if this model is more sensitive to attn/shexp/dense layer quantization, given it has fewer attn/dense layers than DeepSeek does...
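For anyone unfamiliar with the term, here is a rough sketch of what "imatrix calibration" means in practice, assuming the standard llama-imatrix tool from llama.cpp/ik_llama.cpp; the file names below are placeholders, not the actual calibration set or commands used for this quant:
# Sketch only: compute an importance matrix over a calibration text corpus.
./build/bin/llama-imatrix \
-m /models/Kimi-K2-Instruct-BF16.gguf \
-f calibration.txt \
-o imatrix.dat \
--ctx-size 512
# The resulting imatrix.dat is then passed to llama-quantize via --imatrix,
# so quantization error is weighted by the activations seen during calibration.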
I went ahead and ran perplexity on the models myself and will eventually post some graphs. But check out the details here: https://huggingface.co/anikifoss/Kimi-K2-Instruct-DQ4_K/discussions/3
I can't run the server for that long right now, but do you agree that his perplexity is lower? Skywriting was supposed to be storywriting.
but you agree that his perplexity is lower
I measured his perplexity to be lower, yes!
I'm working on some more recipes now that I am aware of how sensitive Kimi-K2-Instruct is to attn/shexp/blk.0.ffn quantization.
All that said, the perplexity of my quant is not much higher, so I'd still be surprised if you could tell a difference. But all I can do is report the numbers I find and you can try them out!
Thanks!
I've done some updated recipes and testing now. I believe I have something with the lowest known perplexity for the given size:
IQ4_KS: Final PPL = 2.9584 +/- 0.01473, 554.421 GiB, 4.638 BPW
This model should be a great combination of maximum accuracy while retaining good speed.
I'm not sure the best way to upload a different revision, but will look into it today.
Thanks! The main problem I had was in story writing: your quant always made decisions that totally opposed the context (e.g. a character dislikes fruit juice => the character orders orange juice). The other quant did not have that problem, and when I tried the hexagon test, it failed. Perhaps another upload will set things straight? Your Q4 R1 2805 works wonders though.
I've uploaded all the v0.2 quants, including some new sizes, and updated a graph showing perplexity. My new recipe IQ4_KS is benchmarking lower perplexity than even the larger DQ4_K, and is about as close to Q8_0 as one can expect to get without trading off more speed.
Would love to hear how you find it if you want to download another big model haha... Thanks!
I tested it: undoubtedly better. The story writing is not R1, but it avoids the types of errors the other quant made. It still tends to forget what characters are supposed to know. Thanks for all your work!
