Paper: *QuIP: 2-Bit Quantization of Large Language Models With Guarantees* (arXiv:2307.13304)
QuIP# (E8P12 lattice codebook) 2-bit quantization of speakleash/Bielik-11B-v2.3-Instruct.
| Attribute | Value |
|---|---|
| Base model | speakleash/Bielik-11B-v2.3-Instruct |
| Architecture | Mistral (50 layers, 4096 hidden, 32 heads, 8 KV heads) |
| Quantization method | QuIP# with E8P12 lattice codebook |
| Precision | 2-bit weights (FP16 base) |
| Model size | 3.26 GB (vs ~22 GB FP16, ~6.7x compression) |
| Calibration | CulturaX-PL (512 samples, 4096 tokens each) |
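As a back-of-envelope check of the size figures above (a sketch with approximate values; actual file sizes include embedding and norm weights kept at higher precision):

```python
# Approximate sizes from the table above.
fp16_gb = 22.0   # ~22 GB FP16 checkpoint
quant_gb = 3.26  # 2-bit QuIP# checkpoint

# Compression ratio.
ratio = fp16_gb / quant_gb
print(f"compression ratio: {ratio:.1f}x")  # ~6.7x

# Effective bits per weight for an ~11B-parameter model; this lands
# slightly above the nominal 2 bits because some tensors stay FP16.
params = 11e9
bits_per_weight = quant_gb * 1e9 * 8 / params
print(f"effective bits/weight: {bits_per_weight:.2f}")  # ~2.37
```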
Evaluated on 22/23 tasks from the SpeakLeash Open PL LLM Leaderboard (eq_bench excluded due to private dataset).
| Metric | Score |
|---|---|
| Normalized avg (22 tasks) | 61.10 |
| FP16 baseline | 65.71 |
| Retention | ~93% of FP16 quality |
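The retention figure follows directly from the two averages above:

```python
quant_avg = 61.10  # normalized average, 22 tasks
fp16_avg = 65.71   # FP16 baseline
retention = quant_avg / fp16_avg * 100
print(f"retention: {retention:.1f}%")  # ~93%
```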
Full per-task results: Jakubrd4/bielik-q2-sharp
| Metric | QuIP# E8P12 | IQ2_XXS | FP16 |
|---|---|---|---|
| Raw avg (22 tasks) | 71.92 | 72.07 | 75.40 |
| Tasks won (head-to-head) | 11/22 | 11/22 | — |
QuIP# achieves parity with llama.cpp IQ2_XXS across the 22 Polish benchmarks (raw average within 0.15 points).
Requires quip-sharp for inference:

```python
from lib.utils.unsafe_import import model_from_hf_path

model, tokenizer = model_from_hf_path(
    "Jakubrd4/Bielik-11B-v2.3-Instruct-QuIP-2bit"
)
```
Note: Bielik uses the Mistral architecture, while QuIP# expects a LlamaConfig. A small patch in model_from_hf_path() is therefore needed to convert MistralConfig to LlamaConfig (map sliding_window -> None and attention_dropout -> 0).
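The config mapping can be sketched as a plain-dict transformation (a hypothetical helper, not part of quip-sharp; in practice this logic would live inside model_from_hf_path() before the LlamaConfig is built):

```python
def mistral_to_llama_config_dict(cfg: dict) -> dict:
    """Convert a MistralConfig-style dict into a LlamaConfig-compatible one,
    following the mapping noted above. Hypothetical helper for illustration."""
    out = dict(cfg)                 # leave the original config untouched
    out["model_type"] = "llama"     # so the loader treats it as a Llama model
    out["sliding_window"] = None    # Llama has no sliding-window attention
    out["attention_dropout"] = 0.0  # Mistral-only field; neutralize it
    return out
    # The result can then be passed to transformers' LlamaConfig(**out).
```

Field names follow the Hugging Face `transformers` config conventions; verify them against the versions pinned by quip-sharp.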
Calibration Hessians are included in the `hessians/` directory.