soroushtabesh commited on
Commit
7a53d08
·
verified ·
1 Parent(s): b01579f

Upload README.md with huggingface_hub

Browse files
Files changed (1) hide show
  1. README.md +59 -5
README.md CHANGED
@@ -1,12 +1,66 @@
1
  ---
2
  license: other
 
 
 
 
3
  base_model: moonshotai/Kimi-K2.5
4
  base_model_relation: quantized
5
- library_name: transformers
6
  tags:
7
- - kimi
8
- - quantized
9
- - 2-bit
10
  - gsq
11
- - multimodal
 
 
 
 
 
 
 
12
  ---
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
  ---
2
  license: other
3
+ license_name: modified-mit
4
+ license_link: https://huggingface.co/moonshotai/Kimi-K2.5/blob/main/LICENSE
5
+ library_name: transformers
6
+ pipeline_tag: image-text-to-text
7
  base_model: moonshotai/Kimi-K2.5
8
  base_model_relation: quantized
 
9
  tags:
 
 
 
10
  - gsq
11
+ - gumbel-softmax
12
+ - quantization
13
+ - ptq
14
+ - moe
15
+ - kimi
16
+ - vllm
17
+ - compressed-tensors
18
+ - arxiv:2604.18556
19
  ---
20
+
21
+ # Kimi-K2.5 — 2-bit GSQ
22
+
23
+ 2-bit quantization of [`moonshotai/Kimi-K2.5`](https://huggingface.co/moonshotai/Kimi-K2.5)
24
+ (MoE, 384 experts, ~260 GB FP) produced with **GSQ**
25
+ (Gumbel-Softmax Quantization). The model is compressed from ~4.5 bpp down to
26
+ **~2.13 bpp** while preserving most of the base model's reasoning, coding,
27
+ and long-context behaviour — and slightly *exceeds* the FP base on MATH 500
28
+ and LiveCodeBench v6 under our evaluation pipeline.
29
+
30
+ - Paper: [GSQ: Highly-Accurate Low-Precision Scalar Quantization for LLMs via Gumbel-Softmax Sampling](https://arxiv.org/abs/2604.18556) (arXiv:2604.18556)
31
+ - Paper page on HF: <https://huggingface.co/papers/2604.18556>
32
+ - Code: <https://github.com/IST-DASLab/GSQ>
33
+ - Collection: <https://huggingface.co/collections/ISTA-DASLab/gsq>
34
+
35
+ ## Quantization details
36
+
37
+ - **Base model:** [`moonshotai/Kimi-K2.5`](https://huggingface.co/moonshotai/Kimi-K2.5)
38
+ - **Bits / weight (effective):** ~2.13 bpp
39
+ - **Codebook:** 2-bit symmetric scalar `{-2, -1, 0, +1} × scale`
40
+ - **Group size:** 128
41
+ - **Format:** `compressed-tensors` (auto-detected by vLLM)
42
+ - **Pipeline:** GPTQ initialization → Gumbel-Softmax refinement (Lion optimizer)
43
+ - **Attention projections:** kept in FP (only experts / MLPs quantized)
44
+
45
+ ## Serving with vLLM
46
+
47
+ Hopper (sm_90) or Ampere (sm ≥ 80) GPUs required for serving. On 8× H100/H200,
48
+ valid TP sizes are `1, 2, 4, 8` (Marlin MoE constraint with group size 128).
49
+
50
+ ```bash
51
+ vllm serve ISTA-DASLab/Kimi-K2.5-2Bit-GSQ \
52
+ --tensor-parallel-size 8 \
53
+ --trust-remote-code
54
+ ```
55
+
56
+ ## Citation
57
+
58
+ ```bibtex
59
+ @article{gsq2026,
60
+ title = {GSQ: Highly-Accurate Low-Precision Scalar Quantization for LLMs via Gumbel-Softmax Sampling},
61
+ author = {Dadgarnia, Alireza and Tabesh, Soroush and Nikdan, Mahdi and Helcig, Michael and Kurti{\'c}, Eldar and Kleinegger, Max and Alistarh, Dan},
62
+ journal= {arXiv preprint arXiv:2604.18556},
63
+ year = {2026},
64
+ url = {https://arxiv.org/abs/2604.18556}
65
+ }
66
+ ```