8bpw KL div

#2 by UnstableLlama

I noticed that the KL div of the 8bpw quant is higher than that of the 5bpw. Is that correct? It shouldn't be.

I noticed this anomaly as well while compiling the measurement table. I've verified the numbers against my model_diff.py script logs across all quantization levels, and they are accurate as reported.

I haven't determined the root cause yet. Running the measurements on exllama v0.0.21 or later might yield different results; I'll try to investigate tomorrow if I can find the time.

P.S. It's great to see someone actually reviewing the measurement table! 😊

Thank you. I've passed this along to the Discord.

P.S. It's great to have measurement tables to read! I can't believe so many people still release unmeasured quants.

This is definitely not getting the right scores on v0.0.20. It looks badly broken, in fact. Testing on v0.0.22 gives more reasonable numbers:

| Quant | Size (GB) | KL-div (quant, orig) | KL-div (orig, quant) | Perplexity | Top-K (K=1) | Top-K (K=2) | Top-K (K=3) | Top-K (K=4) | Top-K (K=5) |
|---|---|---|---|---|---|---|---|---|---|
| 5.0bpw | 47 | 0.02707323 | 0.02719177 | 7.67866605 | 0.9371 | 0.8012 | 0.6333 | 0.4678 | 0.3284 |
| 8.0bpw | 75 | 0.00895185 | 0.00893667 | 7.70824983 | 0.9655 | 0.8826 | 0.7671 | 0.6404 | 0.5158 |
| original | 148 | - | - | 7.70801559 | - | - | - | - | - |
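For readers unfamiliar with the columns: below is a minimal sketch (not the actual model_diff.py implementation) of how metrics like these could be computed from per-token logits produced by the original and quantized models on the same evaluation text. The tensor names, the exact top-K criterion (exact match of the two models' top-k token sets), and the use of PyTorch are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def compare_models(orig_logits, quant_logits, targets, top_k=5):
    """Hypothetical comparison of two models' next-token distributions.

    orig_logits, quant_logits: [num_tokens, vocab_size] logits over the
    same evaluation tokens; targets: [num_tokens] reference token ids.
    """
    orig_logprobs = F.log_softmax(orig_logits.float(), dim=-1)
    quant_logprobs = F.log_softmax(quant_logits.float(), dim=-1)

    # Mean per-token KL(quant || orig) and KL(orig || quant).
    # F.kl_div(input, target) computes KL(target || input).
    kl_quant_orig = F.kl_div(orig_logprobs, quant_logprobs,
                             log_target=True, reduction="batchmean").item()
    kl_orig_quant = F.kl_div(quant_logprobs, orig_logprobs,
                             log_target=True, reduction="batchmean").item()

    # Perplexity of the quantized model on the reference tokens.
    perplexity = torch.exp(F.nll_loss(quant_logprobs, targets)).item()

    # Top-K agreement (assumed definition): fraction of positions where
    # the two models' top-k token sets are identical.
    agreement = {}
    for k in range(1, top_k + 1):
        orig_topk = orig_logprobs.topk(k, dim=-1).indices.sort(dim=-1).values
        quant_topk = quant_logprobs.topk(k, dim=-1).indices.sort(dim=-1).values
        agreement[k] = (orig_topk == quant_topk).all(dim=-1).float().mean().item()

    return kl_quant_orig, kl_orig_quant, perplexity, agreement
```

Under this reading, the table above is consistent: the 8bpw quant has lower KL divergence in both directions, a perplexity closer to the original model's, and higher top-K agreement than the 5bpw quant, which is the expected ordering.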
