Problem with the Q4 quant?
I tried this quant with 0.6 temp; it failed the hexagon test and the skywriting outputs are quite bad. The non-ik quant (anikifoss) seems way better. Is there a reason for this?
I'd need a lot more info before I could comment:
- What is your CPU, RAM, and GPU configuration?
- What is the git sha of the ik_llama.cpp you are running? e.g. ./build/bin/llama-server --version
- How did you compile it?
- What is the command you used to run it?
You could compare perplexity yourself, as I'm currently doing. I'd love to know the Final PPL of @anikifoss's version to add to my charts.
Here is how to run perplexity on a CPU-only rig; adjust as necessary if you are using the CUDA backend as well:
wget https://github.com/user-attachments/files/19090237/wiki.test.raw.gz
gunzip wiki.test.raw.gz
export model=/mnt/raid/models/ubergarm/Kimi-K2-Instruct-GGUF/Kimi-K2-Instruct-IQ4_KS.gguf
numactl -N 1 -m 1 \
./build/bin/llama-perplexity \
-m "$model" \
-f wiki.test.raw \
--seed 1337 \
-fa -fmoe \
-mla 3 \
--ctx-size 512 \
--numa numactl \
--threads 128 \
--threads-batch 192
Thanks for your care!
- Xeon 6230 x2, 768 GB RAM, CPU only.
- Version 3802
- Compiled with w64devkit (and some help from ChatGPT); I hit errors, but it worked great for your R1: cmake -B build -G "MinGW Makefiles" -DGGML_CUDA=OFF -DGGML_BLAS=OFF
- llama-server ^
--ctx-size 32768 ^
-ctk q8_0 ^
-mla 3 ^
-fa ^
-amb 1024 ^
-fmoe ^
--parallel 1 ^
--threads 50 ^
--host 0.0.0.0 ^
--port 8080 ^
--alias Kimi-K2-Instruct-DQ4_K ^
--model "E:\Kimi-K2-Instruct-IQ4_KS-00001-of-00014.gguf"
- Is this a Windows rig? Okay, 768 GB RAM, CPU only, that is fine. The 6230 is 20 physical cores and you have dual sockets? You want to reduce the NUMA nodes to as few as possible, but this platform might be too old for SNC=Disable in the BIOS. What does numactl --hardware show, or whatever the equivalent is on Windows? You want all the RAM bandwidth in a single NUMA node if possible.
- Can you run git rev-parse --short HEAD to get the git sha? That is the actual version number. (Both checks are collected in a short sketch near the end of this reply.)
- I'm not sure how to compile on Windows; CUDA off and BLAS off is good.
- Command for CPU only, try this. I'll assume you have two NUMA nodes, one per CPU socket, and that Windows has numactl:
numactl --interleave=all ^
llama-server ^
--numa distribute ^
--ctx-size 32768 ^
-ctk q8_0 ^
-mla 3 ^
-fa ^
-fmoe ^
--parallel 1 ^
--threads 32 ^
--threads-batch 40 ^
-rtr ^
--host 0.0.0.0 ^
--port 8080 ^
--alias Kimi-K2-Instruct-DQ4_K ^
--model "E:\Kimi-K2-Instruct-IQ4_KS-00001-of-00014.gguf"
Comments on your original command:
-amb 1024 ^ # <--- only needed for CUDA, pretty sure, can remove
--threads 50 ^ # <--- I would adjust this to:
--threads 32 ^ # <--- TG is memory-bandwidth limited, so this is likely lower than total physical cores; play with it to find the best speed
--threads-batch 40 ^ # <--- PP is CPU limited, so use the total number of physical cores, no more even if hyperthreading/SMT is enabled
-ub 4096 -b 4096 ^ # <--- larger ub/b can increase PP, or instead of this try -rtr (run-time repack) for more TG; use --no-mmap if not using -rtr for a possible THP boost
Your command is not doing anything that would affect the "smarts" of the model; q8_0 is fine for the kv-cache. I'd suggest paying attention to your system prompt and sampling settings, which are more likely to affect the quality of the output.
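Here are the two quick checks mentioned above in one place (Linux syntax; the Windows equivalents will differ):
# Actual version identifier of your ik_llama.cpp checkout:
git rev-parse --short HEAD
# or ask the built binary directly:
./build/bin/llama-server --version
# NUMA topology; ideally all the RAM bandwidth sits in as few nodes as possible:
numactl --hardware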
Good luck!
You can probably speed it up with -ub 4096 -b 4096; as long as it says n_ctx 512 and you are using the exact wiki.test.raw file that I have, the numbers should be comparable. But this kind of testing can take a while. You did not give your full command, so I can't comment on speed optimization. Also, the original question is about generation quality, not speed.
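For what it's worth, here is a sketch of the same perplexity command from above with the larger batch sizes added. It only combines flags already discussed and is untested on your rig; keep --ctx-size 512 and the same wiki.test.raw so the numbers stay comparable, and drop or adjust the numactl/--numa parts as your system requires:
./build/bin/llama-perplexity \
-m "$model" \
-f wiki.test.raw \
--seed 1337 \
-fa -fmoe \
-mla 3 \
--ctx-size 512 \
-ub 4096 -b 4096 \
--threads 128 \
--threads-batch 192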
I thought about it some more, and I must ask: what are "skywriting outputs"? Also, are you using the model in English or other languages? The main difference I would expect between mine and @anikifoss's models is that I use an imatrix calibration file, while his does not. The other main difference I'm noodling on is that I'm pretty sure aniki's uses full Q8_0 for all attn/shexp/first dense ffn layers, which are typically offloaded onto GPU... No one has measured that DQ4_K perplexity yet either, so I'll do some more experimenting here too to see if this model is more sensitive to attn/shexp/dense layer quantization, given it has fewer attn/dense layers than DeepSeek does...
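For anyone unfamiliar with the term, here is a rough sketch of what "imatrix calibration" means in practice, assuming the standard llama-imatrix tool from llama.cpp/ik_llama.cpp; the file names below are placeholders, not the actual calibration set or commands used for this quant:
# Sketch only: compute an importance matrix over a calibration text corpus.
./build/bin/llama-imatrix \
-m /models/Kimi-K2-Instruct-BF16.gguf \
-f calibration.txt \
-o imatrix.dat \
--ctx-size 512
# The resulting imatrix.dat is then passed to llama-quantize via --imatrix,
# so quantization error is weighted by the activations seen during calibration.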
I went ahead and ran perplexity on the models myself and will eventually post some graphs. But check out the details here: https://huggingface.co/anikifoss/Kimi-K2-Instruct-DQ4_K/discussions/3
I can't run the server for that long right now, but do you agree that his perplexity is lower? Skywriting was supposed to be storywriting.
but you agree that his perplexity is lower
I measured his perplexity to be lower, yes!
I'm working on some more recipes now that I am aware of how sensitive Kimi-K2-Instruct is to attn/shexp/blk.0.ffn quantization.
All that said, the perplexity of my quant is not much higher, so I'd still be surprised if you could tell a difference. But all I can do is report the numbers I find and you can try them out!
Thanks!
I've done some updated recipes and testing now. I believe I have something with the lowest known perplexity for the given size:
IQ4_KS: Final PPL = 2.9584 +/- 0.01473, 554.421 GiB, 4.638 BPW
This model should be a great combination of maximum accuracy while retaining good speed.
I'm not sure the best way to upload a different revision, but will look into it today.
Thanks! The main problem I had was in story writing: your quant always made decisions that totally opposed the context (e.g. a character dislikes fruit juice => the character orders orange juice). The other quant did not have that problem, and when I tried the hexagon test, it failed. Perhaps another upload will set things straight? Your Q4 R1 2805 works wonders though.
I've uploaded all the v0.2 quants, including some new sizes, and updated a graph showing perplexity. My new recipe IQ4_KS is benchmarking lower perplexity than even the larger DQ4_K, and is about as close to Q8_0 as one can expect to get without trading off more speed.
Would love to hear how you find it if you want to download another big model haha... Thanks!
I tested it: undoubtedly better. The story writing is not R1, but it avoids the types of errors the other quant made. It still tends to forget what characters are supposed to know. Thanks for all your work!
