ISTA-DASLab
/

Kimi-K2.6-2Bit-GSQ

@@ -1,12 +1,63 @@
 ---
 license: other
 base_model: moonshotai/Kimi-K2.6
 base_model_relation: quantized
-library_name: transformers
 tags:
-  - kimi
-  - quantized
-  - 2-bit
   - gsq
-  - multimodal
 ---

 ---
 license: other
+license_name: modified-mit
+license_link: https://huggingface.co/moonshotai/Kimi-K2.6/blob/main/LICENSE
+library_name: transformers
+pipeline_tag: image-text-to-text
 base_model: moonshotai/Kimi-K2.6
 base_model_relation: quantized
 tags:
   - gsq
+  - gumbel-softmax
+  - quantization
+  - ptq
+  - moe
+  - kimi
+  - vllm
+  - compressed-tensors
+  - arxiv:2604.18556
 ---
+# Kimi-K2.6 — 2-bit GSQ
+2-bit quantization of [`moonshotai/Kimi-K2.6`](https://huggingface.co/moonshotai/Kimi-K2.6)
+produced with **GSQ** (Gumbel-Softmax Quantization), compressing the
+trillion-parameter MoE down to **≈2.13 bpp** while keeping the symmetric
+group-wise scalar format that drops into existing INT inference kernels.
+- Paper: [GSQ: Highly-Accurate Low-Precision Scalar Quantization for LLMs via Gumbel-Softmax Sampling](https://arxiv.org/abs/2604.18556) (arXiv:2604.18556)
+- Paper page on HF: <https://huggingface.co/papers/2604.18556>
+- Code: <https://github.com/IST-DASLab/GSQ>
+- Collection: <https://huggingface.co/collections/ISTA-DASLab/gsq>
+## Quantization details
+- **Base model:** [`moonshotai/Kimi-K2.6`](https://huggingface.co/moonshotai/Kimi-K2.6)
+- **Bits / weight (effective):** ≈2.13 bpp
+- **Codebook:** 2-bit symmetric scalar `{-2, -1, 0, +1} × scale`
+- **Group size:** 128
+- **Format:** `compressed-tensors` (auto-detected by vLLM)
+- **Pipeline:** GPTQ initialization → Gumbel-Softmax refinement (Lion optimizer)
+- **Attention projections:** kept in FP (only experts / MLPs quantized)
+## Serving with vLLM
+Hopper (sm_90) or Ampere (sm ≥ 80) GPUs required for serving.
+```bash
+vllm serve ISTA-DASLab/Kimi-K2.6-2Bit-GSQ \
+  --tensor-parallel-size 8 \
+  --trust-remote-code
+```
+## Citation
+```bibtex
+@article{gsq2026,
+  title  = {GSQ: Highly-Accurate Low-Precision Scalar Quantization for LLMs via Gumbel-Softmax Sampling},
+  author = {Dadgarnia, Alireza and Tabesh, Soroush and Nikdan, Mahdi and Helcig, Michael and Kurti{\'c}, Eldar and Kleinegger, Max and Alistarh, Dan},
+  journal= {arXiv preprint arXiv:2604.18556},
+  year   = {2026},
+  url    = {https://arxiv.org/abs/2604.18556}
+}
+```