ISTA-DASLab
/

Kimi-K2.5-2Bit-GSQ

Image-Text-to-Text

Mixture of Experts

Model card Files Files and versions

soroushtabesh commited on 16 days ago

Commit

7a53d08

·

verified ·

1 Parent(s): b01579f

Upload README.md with huggingface_hub

Files changed (1) hide show

README.md +59 -5

README.md CHANGED Viewed

@@ -1,12 +1,66 @@
 ---
 license: other
 base_model: moonshotai/Kimi-K2.5
 base_model_relation: quantized
-library_name: transformers
 tags:
-  - kimi
-  - quantized
-  - 2-bit
   - gsq
-  - multimodal
 ---

 ---
 license: other
+license_name: modified-mit
+license_link: https://huggingface.co/moonshotai/Kimi-K2.5/blob/main/LICENSE
+library_name: transformers
+pipeline_tag: image-text-to-text
 base_model: moonshotai/Kimi-K2.5
 base_model_relation: quantized
 tags:
   - gsq
+  - gumbel-softmax
+  - quantization
+  - ptq
+  - moe
+  - kimi
+  - vllm
+  - compressed-tensors
+  - arxiv:2604.18556
 ---
+# Kimi-K2.5 — 2-bit GSQ
+2-bit quantization of [`moonshotai/Kimi-K2.5`](https://huggingface.co/moonshotai/Kimi-K2.5)
+(MoE, 384 experts, ~260 GB FP) produced with **GSQ**
+(Gumbel-Softmax Quantization). The model is compressed from ~4.5 bpp down to
+**~2.13 bpp** while preserving most of the base model's reasoning, coding,
+and long-context behaviour — and slightly *exceeds* the FP base on MATH 500
+and LiveCodeBench v6 under our evaluation pipeline.
+- Paper: [GSQ: Highly-Accurate Low-Precision Scalar Quantization for LLMs via Gumbel-Softmax Sampling](https://arxiv.org/abs/2604.18556) (arXiv:2604.18556)
+- Paper page on HF: <https://huggingface.co/papers/2604.18556>
+- Code: <https://github.com/IST-DASLab/GSQ>
+- Collection: <https://huggingface.co/collections/ISTA-DASLab/gsq>
+## Quantization details
+- **Base model:** [`moonshotai/Kimi-K2.5`](https://huggingface.co/moonshotai/Kimi-K2.5)
+- **Bits / weight (effective):** ~2.13 bpp
+- **Codebook:** 2-bit symmetric scalar `{-2, -1, 0, +1} × scale`
+- **Group size:** 128
+- **Format:** `compressed-tensors` (auto-detected by vLLM)
+- **Pipeline:** GPTQ initialization → Gumbel-Softmax refinement (Lion optimizer)
+- **Attention projections:** kept in FP (only experts / MLPs quantized)
+## Serving with vLLM
+Hopper (sm_90) or Ampere (sm ≥ 80) GPUs required for serving. On 8× H100/H200,
+valid TP sizes are `1, 2, 4, 8` (Marlin MoE constraint with group size 128).
+```bash
+vllm serve ISTA-DASLab/Kimi-K2.5-2Bit-GSQ \
+  --tensor-parallel-size 8 \
+  --trust-remote-code
+```
+## Citation
+```bibtex
+@article{gsq2026,
+  title  = {GSQ: Highly-Accurate Low-Precision Scalar Quantization for LLMs via Gumbel-Softmax Sampling},
+  author = {Dadgarnia, Alireza and Tabesh, Soroush and Nikdan, Mahdi and Helcig, Michael and Kurti{\'c}, Eldar and Kleinegger, Max and Alistarh, Dan},
+  journal= {arXiv preprint arXiv:2604.18556},
+  year   = {2026},
+  url    = {https://arxiv.org/abs/2604.18556}
+}
+```