SHARE YOUR PERPLEXITY RESULTS

#2
by ox-ox - opened

Just ran PPL on my Q3_K_L (110.22 GiB). Got a Final PPL of 8.2213 (+/- 0.09) on WikiText-2. It seems that going the FP8 -> F16 Master -> Q3_K_L route really paid off compared to standard quants. It beats the IQ4_XS efficiency curve while fitting perfectly in 128GB RAM at 28.7 t/s.
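For context, here is roughly what that route looks like with standard llama.cpp tooling. This is just a sketch with placeholder paths, and depending on the checkpoint you may need a separate step to dequantize the FP8 weights to BF16/F16 before conversion:

# 1. build the F16 master GGUF from the HF checkpoint
python convert_hf_to_gguf.py /path/to/MiniMax-M2.5 \
  --outtype f16 \
  --outfile minimax-m2.5-f16.gguf

# 2. quantize the F16 master down to Q3_K_L
./llama-quantize minimax-m2.5-f16.gguf minimax-m2.5-Q3_K_L.gguf Q3_K_L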

The exact command and hardware backend details are important. See here for more discussion: https://huggingface.co/ubergarm/MiniMax-M2.5-GGUF/discussions/3

I'll test your quant and let you know how it fares against mine.

Thanks for jumping in!

Yes, I just replied on your thread as well. I realized I was running -c 4096 while you were on -c 512, which explains why my initial PPL (8.22) looked "too good to be true" compared to the baseline.

I am currently re-running the test with your exact parameters (-c 512, -b 2048) on my M3 Max to get an apples-to-apples comparison. ETA is ~15 mins.

Having you test the file independently on your backend would be amazing. If it holds up on your rig, that validates the Q3_K_L as a solid daily driver for 128GB users!

Happy to share results; just let me know if there is a specific command I should use.

Yes! Download this and run the command below from inside your llama.cpp bin directory: https://drive.google.com/file/d/1wXEK2rhNEeaIL94JiYtC8LdjPC9LNk8j/view?usp=sharing

./llama-perplexity \
  -m /Users/xx/llama.cpp/models/t1.gguf \
  -f wiki.test.raw \
  -c 512 \
  -b 2048 \
  --seed 1337 \
  -ngl 99

[perplexity comparison graph]

Yeah, the default mainline llama.cpp recipes can't compete with custom recipes, and especially not with ik_llama.cpp's SOTA quantization types.

Great job getting some quants out quickly though; min-maxing recipes is an exciting and unique hobby lmao... Catch you on Beaver AI, holler at me!

This graph is gold. Thanks for running the benchmarks and plotting the data points!

It clearly shows the trade-off: your IQ4_XS custom mix is indeed the quality king (better PPL), while my Q3_K_L sits as the "Mainline/Standard" option for those who need to save that extra ~5GB of VRAM or want to stick to the default llama.cpp builds without custom forks.

I'm more than happy to lose the PPL battle to learn from the best. I'll definitely take you up on the invite to Beaver AI, I have a lot to learn about ik_llama recipes. See you there!

> your IQ4_XS custom mix is indeed the quality king (better PPL), while my Q3_K_L sits as the "Mainline/Standard" option for those who need to save that extra ~5GB of VRAM or want to stick to the default llama.cpp builds without custom forks.

Thanks, and to be clear on a few points here again:

  1. The IQ4_XS is a standard llama.cpp-compatible quant and requires no special custom forks; it just works. Custom quant recipes can be built with mainline llama.cpp.
  2. Many people are going to run this model hybrid CPU + GPU, so DRAM + VRAM. They might save ~5GB of DRAM, not VRAM. Your quant does have smaller attn.* tensors relative to mine, so they might see ~10% faster TG with your quant, given TG is memory-bandwidth limited in most cases. They would only save maybe ~1GB of VRAM using the standard --cpu-moe flag (see the sketch below).
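For concreteness, a hybrid launch of the kind I mean looks roughly like this; the model path and context size are placeholders, and --cpu-moe keeps the routed expert tensors in system RAM while the rest is offloaded:

./llama-server \
  -m /path/to/MiniMax-M2.5-Q3_K_L.gguf \
  -ngl 99 \
  --cpu-moe \
  -c 8192

With the experts pinned to DRAM, the GPU mostly holds the attn.* tensors plus KV cache, which is why the VRAM difference between the two quants ends up so small.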

I know I'm kind of pedantic sorry not sorry lol.

Cheers and thanks for your attention. Keep on learning and you'll be cranking out custom quants, optimizing that trade-off of quality and speed across various target hardware platforms!

Pedantic is exactly what I need right now! No apology needed, this is how I learn.

  1. The Data (Final Run):
    I just finished the run with your exact parameters (-c 512, --seed 1337) on M3 Max:
    Final estimate: PPL = 8.7948 +/- 0.07100

So the final scoreboard is:

Your IQ4_XS Mix: ~8.57 PPL (Clear winner on Quality/Reasoning)

My Q3_K_L: ~8.79 PPL (+0.22 delta)

  2. The Speed/Bandwidth Theory:
    You nailed it on the bandwidth limitation. My attn.* tensors being smaller (Q3 vs your Q8) likely contributes to the ~28.7 t/s I'm seeing. It feels very snappy for a 230B model on a single machine.
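Side note for anyone following along: one way to check the per-tensor types is the gguf-dump script from the gguf Python package that ships with llama.cpp. The exact script name and location can differ between versions, but something like this should list each tensor with its quant type:

# pip install gguf, then filter to the attention tensors
gguf-dump /path/to/minimax-m2.5-Q3_K_L.gguf | grep attn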

Thanks for clarifying the DRAM vs VRAM distinction and the status of IQ quants in mainline. I'll update my mental model (and my repo description) accordingly.

I'm off to watch your talk and dive into the tensor-level quantization docs. Thanks for the crash course tonight!

Your value is now closer to what I measured for yours: Q3_K_L 8.8377 +/- 0.07155

The backend can have an effect, which is another reason I like to measure them all on the exact same hardware. But this is definitely more in line with what I might expect. Here are my full values:

[
  {
    "name": "BF16",
    "ppl": "8.3386 +/- 0.06651",
    "size": 426.060,
    "bpw": 16.003,
    "legend": "full quality",
    "comment": "",
    "skip": true
  },
  {
    "name": "Q8_0",
    "ppl": "8.3590 +/- 0.06673",
    "size": 226.431,
    "bpw": 8.505,
    "legend": "full quality",
    "comment": "should be full quality as original is fp8",
    "skip": true
  },
  {
    "name": "IQ5_K",
    "ppl": "8.4860 +/- 0.06815",
    "size": 157.771,
    "bpw": 5.926,
    "legend": "ubergarm",
    "comment": ""
  },
  {
    "name": "IQ4_XS\n(mainline compatible)",
    "ppl": "8.5702 +/- 0.06901",
    "size": 114.842,
    "bpw": 4.314,
    "legend": "ubergarm",
    "comment": "smol, with imatrix, mainline compat quant"
  },
  {
    "name": "Q3_K_L",
    "ppl": "8.8377 +/- 0.07155",
    "size": 110.215,
    "bpw": 4.140,
    "legend": "ox-ox",
    "comment": "https://huggingface.co/ox-ox/MiniMax-M2.5-GGUF"
  },
  {
    "name": "smol-IQ3_KS",
    "ppl": "8.7539 +/- 0.07075",
    "size": 87.237,
    "bpw": 3.277,
    "legend": "ubergarm",
    "comment": "full q8_0 attn.*"
  },
  {
    "name": "derp-smol-IQ3_KS",
    "ppl": "8.8293 +/- 0.07164",
    "size": 86.641,
    "bpw": 3.254,
    "legend": "unreleased",
    "comment": "iq6_k attn.*"
  },
  {
    "name": "IQ2_KS",
    "ppl": "9.6827 +/- 0.07972",
    "size": 69.800,
    "bpw": 2.622,
    "legend": "ubergarm",
    "comment": "full q8_0 attn.*"
  }
]
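If you want to re-plot these points yourself, the fields pull straight out of that JSON, e.g. with jq (assuming the list above is saved as results.json):

# name, bits-per-weight, size (GiB), and PPL as tab-separated rows
jq -r '.[] | [.name, .bpw, .size, .ppl] | @tsv' results.json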

Awesome to see the Q3_K_L officially in the mix! Thanks for the bench and the inclusion in the table. The delta is exactly what I expected. Back to the lab to check out your talk now. Cheers!
