soroushtabesh committed · Commit 6534400 · verified · 1 Parent(s): cd86edd

Add model card with GSQ paper citation (arXiv:2604.18556)

Files changed (1): README.md (+56 -5)
@@ -1,12 +1,63 @@
 ---
 license: other
+license_name: modified-mit
+license_link: https://huggingface.co/moonshotai/Kimi-K2.6/blob/main/LICENSE
+library_name: transformers
+pipeline_tag: image-text-to-text
 base_model: moonshotai/Kimi-K2.6
 base_model_relation: quantized
-library_name: transformers
 tags:
-- kimi
-- quantized
-- 2-bit
 - gsq
-- multimodal
+- gumbel-softmax
+- quantization
+- ptq
+- moe
+- kimi
+- vllm
+- compressed-tensors
+- arxiv:2604.18556
 ---
+
+# Kimi-K2.6 — 2-bit GSQ
+
+2-bit quantization of [`moonshotai/Kimi-K2.6`](https://huggingface.co/moonshotai/Kimi-K2.6)
+produced with **GSQ** (Gumbel-Softmax Quantization), compressing the
+trillion-parameter MoE down to **≈2.13 bpp** while keeping the symmetric
+group-wise scalar format that drops into existing INT inference kernels.
+
+- Paper: [GSQ: Highly-Accurate Low-Precision Scalar Quantization for LLMs via Gumbel-Softmax Sampling](https://arxiv.org/abs/2604.18556) (arXiv:2604.18556)
+- Paper page on HF: <https://huggingface.co/papers/2604.18556>
+- Code: <https://github.com/IST-DASLab/GSQ>
+- Collection: <https://huggingface.co/collections/ISTA-DASLab/gsq>
+
+## Quantization details
+
+- **Base model:** [`moonshotai/Kimi-K2.6`](https://huggingface.co/moonshotai/Kimi-K2.6)
+- **Bits / weight (effective):** ≈2.13 bpp
+- **Codebook:** 2-bit symmetric scalar `{-2, -1, 0, +1} × scale`
+- **Group size:** 128
+- **Format:** `compressed-tensors` (auto-detected by vLLM)
+- **Pipeline:** GPTQ initialization → Gumbel-Softmax refinement (Lion optimizer)
+- **Attention projections:** kept in FP (only experts / MLPs quantized)
+
+## Serving with vLLM
+
+Hopper (sm_90) or Ampere (sm ≥ 80) GPUs required for serving.
+
+```bash
+vllm serve ISTA-DASLab/Kimi-K2.6-2Bit-GSQ \
+--tensor-parallel-size 8 \
+--trust-remote-code
+```
+
+## Citation
+
+```bibtex
+@article{gsq2026,
+title = {GSQ: Highly-Accurate Low-Precision Scalar Quantization for LLMs via Gumbel-Softmax Sampling},
+author = {Dadgarnia, Alireza and Tabesh, Soroush and Nikdan, Mahdi and Helcig, Michael and Kurti{\'c}, Eldar and Kleinegger, Max and Alistarh, Dan},
+journal= {arXiv preprint arXiv:2604.18556},
+year = {2026},
+url = {https://arxiv.org/abs/2604.18556}
+}
+```
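
Reviewer note on the "Quantization details" list in the added README: the sketch below illustrates how a group-wise scalar format with the listed codebook `{-2, -1, 0, +1} × scale` and group size 128 could be dequantized, and how the ≈2.13 bpp figure might arise. It is not the GSQ or `compressed-tensors` implementation. The function names, the mapping of code indices 0..3 to codebook levels, and the 16-bit width of each per-group scale are assumptions made for illustration; none of them are stated in the model card.

```python
from typing import List

# The four levels listed in the model card. Mapping index 0..3 to these
# levels in this order is an assumption, not the documented packing.
CODEBOOK = (-2, -1, 0, 1)


def dequantize(codes: List[int], scales: List[float], group_size: int = 128) -> List[float]:
    """Map 2-bit code indices (0..3) to weights using one scale per group."""
    if len(codes) != len(scales) * group_size:
        raise ValueError("expected exactly one scale per group of codes")
    return [CODEBOOK[c] * scales[i // group_size] for i, c in enumerate(codes)]


def effective_bits_per_weight(code_bits: int, scale_bits: int, group_size: int) -> float:
    """Code bits plus the per-group scale overhead amortized over the group."""
    return code_bits + scale_bits / group_size


# Tiny example with a group of 4 instead of 128 so the output is readable.
weights = dequantize([0, 1, 2, 3], scales=[0.5], group_size=4)
print(weights)  # [-1.0, -0.5, 0.0, 0.5]

# Assuming a 16-bit scale per group of 128 weights:
print(effective_bits_per_weight(2, 16, 128))  # 2.125
```

Under the 16-bit-scale assumption the overhead is 16/128 = 0.125 bits per weight, giving 2.125, which is consistent with the stated ≈2.13 bpp. The reported figure may be computed differently, for example by accounting for the attention projections that are kept in FP.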