What is your text for ppl?

#3
opened by ox-ox

Just ran PPL on my Q3_K_L (110.22 GiB). Got a Final PPL of 8.2213 (+/- 0.09) on WikiText-2. It seems that going the FP8 -> F16 Master -> Q3_K_L route really paid off compared to standard quants. It beats the IQ4_XS efficiency curve while fitting perfectly in 128GB RAM at 28.7 t/s.

Heya, be very careful with the exact command and corpus you use when attempting to compare perplexity across various versions.

I just updated my logs/perplexity* files with the actual command if you want to check, e.g.:

  model=/mnt/raid/hf/MiniMax-M2.5-GGUF/IQ5_K/MiniMax-M2.5-IQ5_K-00001-of-00005.gguf

  numactl -N "$SOCKET" -m "$SOCKET" \
  ./build/bin/llama-perplexity \
      -m "$model" \
      -f wiki.test.raw \
      --seed 1337 \
      --ctx-size 512 \
      -ub 4096 -b 4096 \
      --numa numactl \
      --threads 96 \
      --threads-batch 128 \
      --validate-quants \
      --no-mmap

The seed doesn't actually matter as sampling isn't used here for perplexity.

You can get the wiki.test.raw file like so, as described in the referenced quant cooker's guide (now somewhat outdated): https://github.com/ikawrakow/ik_llama.cpp/discussions/434

$ wget https://huggingface.co/datasets/ikawrakow/validation-datasets-for-llama.cpp/resolve/main/wiki.test.raw.gz
$ gunzip wiki.test.raw.gz
$ sha1sum wiki.test.raw
6f1fe2054a940eebfc76b284b09680763b37f5ea  wiki.test.raw

It seems that going the FP8 -> F16 Master -> Q3_K_L route really paid off compared to standard quants.

Be careful again here: what exactly do you mean by F16? There is a difference between bf16, fp16, and fp8e4m3 in terms of dynamic range and precision. I used the mainline llama.cpp convert_hf_to_gguf.py as designed to convert the original safetensors to a bf16 GGUF, then quantized from that, which is the best way to guard against clipping (clipping would be unlikely here anyway, since only a few original tensors are bf16 and most are fp8e4m3).
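
In case it helps, that two-step route looks roughly like this (paths and output names are placeholders, not my exact invocation):

$ # 1) upcast the original safetensors to a full bf16 GGUF master
$ python convert_hf_to_gguf.py /path/to/MiniMax-M2.5 --outtype bf16 --outfile MiniMax-M2.5-BF16.gguf
$ # 2) quantize from that bf16 master
$ ./build/bin/llama-quantize MiniMax-M2.5-BF16.gguf MiniMax-M2.5-IQ4_XS.gguf IQ4_XS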

It beats the IQ4_XS efficiency curve while fitting perfectly in 128GB RAM at 28.7 t/s.

Point me to your repo and I could likely test the perplexity of your quant on my same rig (there are small differences depending on backend, e.g. CPU vs GPU). Also feel free to share your workflow and steps, and if you're into cooking quants I'd suggest checking out the Beaver AI Discord where many quant cookers hang out: https://huggingface.co/BeaverAI

Cheers!

Thanks for the detailed breakdown! This is super helpful.

  1. The Repo:
    You can grab the quant here: https://huggingface.co/ox-ox/MiniMax-M2.5-GGUF/blob/main/minimax-m2.5-Q3_K_L.gguf

  2. Context mismatch:
    That explains the delta! I ran my PPL test with -c 4096 (chunks 32) on Metal (M3 Max), whereas your log shows --ctx-size 512. My lower PPL (8.22) is likely benefiting from the larger context window compared to your baseline. I will re-run with --ctx-size 512 to align with your methodology and update my numbers (rough command sketched after this list).

  3. Workflow (FP8 -> F16):
    I used llama.cpp (b8022) to convert the safetensors. I didn't explicitly force BF16, so it likely defaulted to an FP16 intermediate GGUF before quantizing to Q3_K_L. My goal was primarily to avoid the direct FP8->Quant artifacts I've seen in previous builds, and to fit it into 128GB unified memory without swap.

  4. Cross-Verification:
    I would absolutely love if you could test the Q3_K_L on your rig to see how it stacks up against the IQ4_XS efficiency curve on your backend!
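
For reference (re: point 2), the aligned re-run I have in mind looks roughly like this; the thread count and GPU offload value are placeholders for my M3 Max, not final numbers:

  ./build/bin/llama-perplexity \
      -m minimax-m2.5-Q3_K_L.gguf \
      -f wiki.test.raw \
      --ctx-size 512 \
      -ub 4096 -b 4096 \
      -ngl 99 \
      --threads 8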

I'll definitely check out the Beaver AI discord. I'm an undergrad student working on low-bandwidth LLM interactions (SNEE project), so learning from the "cookers" would be gold. Thanks for the invite!

Thanks for looking more closely into the details so we can make the comparison as fair as possible!

I'm testing your Q3_K_L right now, and I assume it isn't quite as good given the default recipe knocks down attn.*, but I'll report the numbers as soon as I have them!

Yours:

llama_model_loader: - type  f32:  373 tensors
llama_model_loader: - type q3_K:  249 tensors
llama_model_loader: - type q5_K:  186 tensors
llama_model_loader: - type q6_K:    1 tensors

Mine:

llama_model_loader: - type q8_0:  248 tensors <--- attn.*
llama_model_loader: - type q4_K:    1 tensors
llama_model_loader: - type q6_K:    1 tensors
llama_model_loader: - type iq4_xs:  186 tensors

I didn't explicitly force BF16, so it likely defaulted to an FP16 intermediate GGUF before quantizing to Q3_K_L. My goal was primarily to avoid the direct FP8->Quant artifacts I've seen in previous builds, and to fit it into 128GB unified memory without swap.

It defaults to bf16, which is what I did. You can tell by looking at the output files you created, e.g. mine look like: MiniMax-M2.5-256x4.9B-BF16-00001-of-00010.gguf etc...

I've never done direct FP8->Quant; I always go from the original, upcast to a full bf16 master (or fp16 only if the original model is explicitly fp16), to prevent any clipping.

I'm on it!

Cheers!

Aha! I see your strategy: keeping attn.* at Q8_0 while squeezing the experts into IQ4_XS. That's a clever tradeoff.

My Q3_K_L is indeed the "vanilla" recipe from llama.cpp. I suspect your mix might yield better PPL thanks to the high-precision attention heads, but I'm curious if the IQ4_XS on the experts hurts the knowledge retrieval compared to the Q3_K experts in my build.

Really appreciate the deep dive into the tensor breakdown. It confirms my "Master" was indeed BF16 (good to know convert_hf_to_gguf.py handles that safely by default). Standing by for your numbers!

@ox-ox

I'm curious if the IQ4_XS on the experts hurts the knowledge retrieval compared to the Q3_K experts in my build.

No, IQ4_XS is better than the q3_K your default mix is using, probably for your ffn_(gate|up)_exps. I didn't look at the exact recipe, but you are likely using q5_K for ffn_down_exps. Also, the perplexity of my IQ4_XS is "better" specifically for this model [caveat: perplexity is not everything, especially on instruct-tuned models; look into KLD for more details].
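
If you do want to try KLD, the rough two-pass flow with llama-perplexity looks something like this (file names are placeholders; double-check llama-perplexity --help on your build for the exact flags):

$ # 1) save baseline logits from the highest-precision model you have (e.g. the bf16 master)
$ ./build/bin/llama-perplexity -m MiniMax-M2.5-BF16.gguf -f wiki.test.raw --ctx-size 512 --kl-divergence-base logits.bin
$ # 2) score the quant against that baseline
$ ./build/bin/llama-perplexity -m minimax-m2.5-Q3_K_L.gguf -f wiki.test.raw --ctx-size 512 --kl-divergence-base logits.bin --kl-divergence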

My advice is to ignore the recipe names like Q3_K_L and instead look at the exact quantization type used for each individual tensor.

Also, watch my recent talk for more information about using ./build/bin/llama-quantize --help, as well as grepping old closed PRs/discussions on mainline and ik_llama.cpp for exact details. Here is the talk: https://blog.aifoundry.org/p/adventures-in-model-quantization
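
For example, one quick way to see the exact type of each tensor (assuming you have the gguf Python package installed, which ships a gguf-dump script) is something like:

$ pip install gguf
$ gguf-dump minimax-m2.5-Q3_K_L.gguf | grep -E 'attn|ffn'

Loading the model with llama.cpp also prints the per-type tensor counts, as in the listings above.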

Cheers and good job again. I hope I'm not coming off too rough; it's really great you cooked these quants and are in the game now, especially given you're an undergrad. Welcome, take your time, and enjoy the ride!

Thanks a lot for the encouragement and the technical pointers!

You're right, I was sticking to the standard llama.cpp recipe names without digging into the specific mix for ffn_gate vs ffn_down. That explains why the IQ4_XS pulls ahead on specific tasks despite the size difference. I’ll definitely stop treating these recipes as black boxes and start looking at the per-tensor quantization types.

I've bookmarked your talk ('Adventures in Model Quantization') and I'm watching it tonight to get up to speed before jumping into the Discord. Thanks for the warm welcome; it means a lot coming from an expert in the field.
