Quantization was performed using exllamav3 v0.0.28 (commit ea87af6).
| Quant | Size (GB) | Actual bpw | PPL | KL-div (q→o) | KL-div (o→q) | Top-1 | Top-2 | Top-3 | Top-4 | Top-5 |
|---|---|---|---|---|---|---|---|---|---|---|
| 4.0bpw | 3.94 | 4.00 | 29.100 | 0.0150 | 0.0150 | 93.1% | 80.3% | 64.8% | 49.4% | 35.8% |
| 5.0bpw | 4.51 | 5.00 | 28.854 | 0.0042 | 0.0042 | 96.2% | 88.6% | 78.6% | 67.2% | 55.8% |
| 6.0bpw | 4.92 | 6.00 | 28.666 | 0.0013 | 0.0013 | 97.9% | 93.7% | 87.6% | 80.1% | 71.8% |
| 7.0bpw | 5.34 | 7.00 | 28.610 | 0.0004 | 0.0004 | 98.7% | 96.0% | 92.2% | 87.2% | 81.4% |
| 8.0bpw | 5.75 | 8.00 | 28.621 | 0.0002 | 0.0002 | 99.1% | 97.2% | 94.4% | 90.8% | 86.4% |
| original | 9.66 | 16.00 | 28.596 | — | — | — | — | — | — | — |
## Metrics
- PPL (Perplexity) — how well the model predicts the next token. Lower is better. The original model's PPL is the baseline.
- KL-div (Kullback-Leibler divergence) — measures how the quant's probability distribution differs from the original. Lower is better. Shown in both directions (quant→orig, orig→quant); asymmetry indicates where the quant over/under-estimates probabilities.
- Top-K agreement — fraction of positions where the quant's top-K predicted tokens match the original's top-K. Higher is better. Top-1 is the most important (does the quant pick the same best token?); higher K values measure agreement across progressively less likely candidates.
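The three metrics above can be sketched in a few lines of NumPy. This is a minimal illustration of the definitions, not the evaluation code that produced the table: `p_orig` / `p_quant` are assumed to be per-position probability distributions over the vocabulary, and the top-K check here assumes an ordered match, which is one plausible reading of "agreement".

```python
import numpy as np

def perplexity(p_model, target_ids):
    """PPL = exp(mean negative log-likelihood of the true next tokens).
    p_model: (positions, vocab) probabilities; target_ids: true token ids."""
    nll = -np.log(p_model[np.arange(len(target_ids)), target_ids])
    return float(np.exp(nll.mean()))

def kl_div(p, q, eps=1e-10):
    """Mean KL(p || q) over positions. Direction matters, hence the
    separate q->o and o->q columns in the table."""
    p, q = np.clip(p, eps, 1.0), np.clip(q, eps, 1.0)
    return float(np.mean(np.sum(p * np.log(p / q), axis=-1)))

def top_k_agreement(p_orig, p_quant, k):
    """Fraction of positions where the quant's k highest-probability
    tokens equal the original's, in the same order."""
    top_o = np.argsort(-p_orig, axis=-1)[:, :k]
    top_q = np.argsort(-p_quant, axis=-1)[:, :k]
    return float(np.mean(np.all(top_o == top_q, axis=-1)))
```

Sanity checks follow directly from the definitions: identical distributions give a KL divergence of 0 and top-K agreement of 1.0, and a uniform distribution over a vocabulary of size V gives PPL = V.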
## Example
4.0bpw performance example, using the gradio script from the original model's repo.
