Image-Text-to-Text
Transformers
Safetensors
gsq
gumbel-softmax
quantization
ptq
Mixture of Experts
kimi
vllm
humming
Instructions to use ISTA-DASLab/Kimi-K2.5-2Bit-GSQ with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use ISTA-DASLab/Kimi-K2.5-2Bit-GSQ with Transformers:
# Use a pipeline as a high-level helper from transformers import pipeline pipe = pipeline("image-text-to-text", model="ISTA-DASLab/Kimi-K2.5-2Bit-GSQ")# Load model directly from transformers import AutoModel model = AutoModel.from_pretrained("ISTA-DASLab/Kimi-K2.5-2Bit-GSQ", dtype="auto") - Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use ISTA-DASLab/Kimi-K2.5-2Bit-GSQ with vLLM:
Install from pip and serve model
# Install vLLM from pip: pip install vllm # Start the vLLM server: vllm serve "ISTA-DASLab/Kimi-K2.5-2Bit-GSQ" # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:8000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "ISTA-DASLab/Kimi-K2.5-2Bit-GSQ", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }'Use Docker
docker model run hf.co/ISTA-DASLab/Kimi-K2.5-2Bit-GSQ
- SGLang
How to use ISTA-DASLab/Kimi-K2.5-2Bit-GSQ with SGLang:
Install from pip and serve model
# Install SGLang from pip: pip install sglang # Start the SGLang server: python3 -m sglang.launch_server \ --model-path "ISTA-DASLab/Kimi-K2.5-2Bit-GSQ" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "ISTA-DASLab/Kimi-K2.5-2Bit-GSQ", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }'Use Docker images
docker run --gpus all \ --shm-size 32g \ -p 30000:30000 \ -v ~/.cache/huggingface:/root/.cache/huggingface \ --env "HF_TOKEN=<secret>" \ --ipc=host \ lmsysorg/sglang:latest \ python3 -m sglang.launch_server \ --model-path "ISTA-DASLab/Kimi-K2.5-2Bit-GSQ" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "ISTA-DASLab/Kimi-K2.5-2Bit-GSQ", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }' - Docker Model Runner
How to use ISTA-DASLab/Kimi-K2.5-2Bit-GSQ with Docker Model Runner:
docker model run hf.co/ISTA-DASLab/Kimi-K2.5-2Bit-GSQ
| license: other | |
| license_name: modified-mit | |
| license_link: https://huggingface.co/moonshotai/Kimi-K2.5/blob/main/LICENSE | |
| library_name: transformers | |
| pipeline_tag: image-text-to-text | |
| base_model: moonshotai/Kimi-K2.5 | |
| base_model_relation: quantized | |
| tags: | |
| - gsq | |
| - gumbel-softmax | |
| - quantization | |
| - ptq | |
| - moe | |
| - kimi | |
| - vllm | |
| - humming | |
| - arxiv:2604.18556 | |
| # Kimi-K2.5 β 2-bit GSQ | |
| 2-bit quantization of [`moonshotai/Kimi-K2.5`](https://huggingface.co/moonshotai/Kimi-K2.5) | |
| (MoE, 384 experts, β260 GB FP) produced with **GSQ** | |
| (Gumbel-Softmax Quantization). The model is compressed from β4.5 bpp down to | |
| **β2.13 bpp** while preserving most of the base model's reasoning, coding, | |
| and long-context behaviour β and slightly *exceeds* the FP base on MATH 500 | |
| and LiveCodeBench v6 under our evaluation pipeline. | |
| - Paper: [GSQ: Highly-Accurate Low-Precision Scalar Quantization for LLMs via Gumbel-Softmax Sampling](https://arxiv.org/abs/2604.18556) (arXiv:2604.18556) | |
| - Paper page on HF: <https://huggingface.co/papers/2604.18556> | |
| - Code: <https://github.com/IST-DASLab/GSQ> | |
| - Collection: <https://huggingface.co/collections/ISTA-DASLab/gsq> | |
| ## Quantization details | |
| - **Base model:** [`moonshotai/Kimi-K2.5`](https://huggingface.co/moonshotai/Kimi-K2.5) | |
| - **Bits / weight (effective):** β2.13 bpp | |
| - **Codebook:** 2-bit symmetric scalar `{-2, -1, 0, +1} Γ scale` | |
| - **Group size:** 128 | |
| - **Format:** [Humming](https://github.com/inclusionAI/humming) (`quant_method: "humming"`, `b_dtype: "uint2"`) | |
| - **Pipeline:** GPTQ initialization β Gumbel-Softmax refinement (Lion optimizer) | |
| - **What's quantized:** routed-expert MLPs from layer 1 onward (`gate_proj`, `up_proj`, `down_proj`). Attention (`self_attn`), layernorms, embeddings, LM head, vision tower, MM projector, MoE routing `gate`, shared experts, and the first dense MLP layer (`layers.0.mlp.*`) are kept in BF16. | |
| ### Storage layout (why the HF UI shows I32 + BF16) | |
| The Hugging Face "Tensor types" widget reports the **container dtype** of each | |
| safetensor on disk, not the effective precision of the underlying weights. | |
| This checkpoint uses the **Humming** on-disk layout (exact-width packing β no | |
| sub-byte values are padded into a wider container). For every quantized | |
| expert-MLP `Linear` with original weight shape `[out_features, in_features]`, | |
| the following tensors are stored: | |
| | Tensor | Dtype | Shape on disk | Meaning | | |
| |-----------------------------------------|-------|-------------------------------------|-------------------------------------------------------------------------------| | |
| | `<layer>.weight` | I32 | `[out_features, in_features Γ 2 / 32]` = `[out_features, in_features / 16]` | 2-bit values bit-packed along the input dim, LSB-first: 16 weights per INT32 word. | | |
| | `<layer>.weight_scale` | BF16 | `[out_features, in_features / 128]` | One symmetric scale per group of `group_size = 128` weights along the input dim. | | |
| | Attention / norms / embed / LM-head / vision / MM-projector / MoE `gate` / shared experts / `layers.0.mlp.*` | BF16 | unchanged | Not quantized; copied from the base checkpoint. | | |
| So although the UI says "I32 + BF16", the **effective storage** per quantized | |
| weight is `2 bits (packed) + 16 bits / 128 (group scale) β 2.13 bpp`. The | |
| `quantization_config` block in `config.json` is: | |
| ```json | |
| { | |
| "quant_method": "humming", | |
| "b_dtype": "uint2", | |
| "weight_scale_group_size": 128, | |
| "weight_scale_type": "group", | |
| "has_zero_point": false, | |
| "ignore": [ | |
| "lm_head", | |
| "re:.*embed_tokens.*", | |
| "re:.*self_attn.*", | |
| "re:.*input_layernorm.*", | |
| "re:.*post_attention_layernorm.*", | |
| "re:.*\\.norm$", | |
| "re:.*vision_tower.*", | |
| "re:.*mm_projector.*", | |
| "re:.*mlp\\.gate$", | |
| "re:.*shared_expert.*", | |
| "re:.*layers\\.(0)\\.mlp\\.(gate_proj|up_proj|down_proj|gate_up_proj).*" | |
| ] | |
| } | |
| ``` | |
| Loading this checkpoint requires vLLM plus the | |
| [`humming`](https://github.com/inclusionAI/humming) MoE kernels (`pip install | |
| humming-kernels`). See **Serving with vLLM** below. | |
| > Note: GSQ training first writes shards in `compressed-tensors` | |
| > `pack-quantized` format (where the 2-bit codebook is padded into a 4-bit | |
| > INT32 container). The published checkpoint here has been re-packed via | |
| > `convert_to_humming.py` into exact-width 2-bit Humming storage, hence the | |
| > `2 / 32` shape factor on `weight`. | |
| ## Serving with vLLM | |
| Install the Humming kernels (required for vLLM to load this checkpoint): | |
| ```bash | |
| pip install humming-kernels | |
| ``` | |
| Hopper (sm_90) or Ampere (sm β₯ 80) GPUs required for serving. On 8Γ H100/H200, | |
| valid TP sizes are `1, 2, 4, 8` (Marlin MoE constraint with group size 128). | |
| ```bash | |
| vllm serve ISTA-DASLab/Kimi-K2.5-2Bit-GSQ \ | |
| --tensor-parallel-size 8 \ | |
| --trust-remote-code | |
| ``` | |
| ## Citation | |
| ```bibtex | |
| @article{gsq2026, | |
| title = {GSQ: Highly-Accurate Low-Precision Scalar Quantization for LLMs via Gumbel-Softmax Sampling}, | |
| author = {Dadgarnia, Alireza and Tabesh, Soroush and Nikdan, Mahdi and Helcig, Michael and Kurti{\'c}, Eldar and Kleinegger, Max and Alistarh, Dan}, | |
| journal= {arXiv preprint arXiv:2604.18556}, | |
| year = {2026}, | |
| url = {https://arxiv.org/abs/2604.18556} | |
| } | |
| ``` | |