Add humming instructions

95b9b99 verified about 18 hours ago

5.32 kB

	---
	license: other
	license_name: modified-mit
	license_link: https://huggingface.co/moonshotai/Kimi-K2.5/blob/main/LICENSE
	library_name: transformers
	pipeline_tag: image-text-to-text
	base_model: moonshotai/Kimi-K2.5
	base_model_relation: quantized
	tags:
	- gsq
	- gumbel-softmax
	- quantization
	- ptq
	- moe
	- kimi
	- vllm
	- humming
	- arxiv:2604.18556
	---

	# Kimi-K2.5 — 2-bit GSQ

	2-bit quantization of [`moonshotai/Kimi-K2.5`](https://huggingface.co/moonshotai/Kimi-K2.5)
	(MoE, 384 experts, ≈260 GB FP) produced with GSQ
	(Gumbel-Softmax Quantization). The model is compressed from ≈4.5 bpp down to
	≈2.13 bpp while preserving most of the base model's reasoning, coding,
	and long-context behaviour — and slightly exceeds the FP base on MATH 500
	and LiveCodeBench v6 under our evaluation pipeline.

	- Paper: [GSQ: Highly-Accurate Low-Precision Scalar Quantization for LLMs via Gumbel-Softmax Sampling](https://arxiv.org/abs/2604.18556) (arXiv:2604.18556)
	- Paper page on HF: <https://huggingface.co/papers/2604.18556>
	- Code: <https://github.com/IST-DASLab/GSQ>
	- Collection: <https://huggingface.co/collections/ISTA-DASLab/gsq>

	## Quantization details

	- Base model: [`moonshotai/Kimi-K2.5`](https://huggingface.co/moonshotai/Kimi-K2.5)
	- Bits / weight (effective): ≈2.13 bpp
	- Codebook: 2-bit symmetric scalar `{-2, -1, 0, +1} × scale`
	- Group size: 128
	- Format: [Humming](https://github.com/inclusionAI/humming) (`quant_method: "humming"`, `b_dtype: "uint2"`)
	- Pipeline: GPTQ initialization → Gumbel-Softmax refinement (Lion optimizer)
	- What's quantized: routed-expert MLPs from layer 1 onward (`gate_proj`, `up_proj`, `down_proj`). Attention (`self_attn`), layernorms, embeddings, LM head, vision tower, MM projector, MoE routing `gate`, shared experts, and the first dense MLP layer (`layers.0.mlp.*`) are kept in BF16.

	### Storage layout (why the HF UI shows I32 + BF16)

	The Hugging Face "Tensor types" widget reports the container dtype of each
	safetensor on disk, not the effective precision of the underlying weights.
	This checkpoint uses the Humming on-disk layout (exact-width packing — no
	sub-byte values are padded into a wider container). For every quantized
	expert-MLP `Linear` with original weight shape `[out_features, in_features]`,
	the following tensors are stored:

	\| Tensor \| Dtype \| Shape on disk \| Meaning \|
	\|-----------------------------------------\|-------\|-------------------------------------\|-------------------------------------------------------------------------------\|
	\| `<layer>.weight` \| I32 \| `[out_features, in_features × 2 / 32]` = `[out_features, in_features / 16]` \| 2-bit values bit-packed along the input dim, LSB-first: 16 weights per INT32 word. \|
	\| `<layer>.weight_scale` \| BF16 \| `[out_features, in_features / 128]` \| One symmetric scale per group of `group_size = 128` weights along the input dim. \|
	\| Attention / norms / embed / LM-head / vision / MM-projector / MoE `gate` / shared experts / `layers.0.mlp.*` \| BF16 \| unchanged \| Not quantized; copied from the base checkpoint. \|

	So although the UI says "I32 + BF16", the effective storage per quantized
	weight is `2 bits (packed) + 16 bits / 128 (group scale) ≈ 2.13 bpp`. The
	`quantization_config` block in `config.json` is:

	```json
	{
	"quant_method": "humming",
	"b_dtype": "uint2",
	"weight_scale_group_size": 128,
	"weight_scale_type": "group",
	"has_zero_point": false,
	"ignore": [
	"lm_head",
	"re:.embed_tokens.",
	"re:.self_attn.",
	"re:.input_layernorm.",
	"re:.post_attention_layernorm.",
	"re:.*\\.norm$",
	"re:.vision_tower.",
	"re:.mm_projector.",
	"re:.*mlp\\.gate$",
	"re:.shared_expert.",
	"re:.layers\\.(0)\\.mlp\\.(gate_proj\|up_proj\|down_proj\|gate_up_proj)."
	]
	}
	```

	Loading this checkpoint requires vLLM plus the
	[`humming`](https://github.com/inclusionAI/humming) MoE kernels (`pip install
	humming-kernels`). See Serving with vLLM below.

	> Note: GSQ training first writes shards in `compressed-tensors`
	> `pack-quantized` format (where the 2-bit codebook is padded into a 4-bit
	> INT32 container). The published checkpoint here has been re-packed via
	> `convert_to_humming.py` into exact-width 2-bit Humming storage, hence the
	> `2 / 32` shape factor on `weight`.

	## Serving with vLLM

	Install the Humming kernels (required for vLLM to load this checkpoint):

	```bash
	pip install humming-kernels
	```

	Hopper (sm_90) or Ampere (sm ≥ 80) GPUs required for serving. On 8× H100/H200,
	valid TP sizes are `1, 2, 4, 8` (Marlin MoE constraint with group size 128).

	```bash
	vllm serve ISTA-DASLab/Kimi-K2.5-2Bit-GSQ \
	--tensor-parallel-size 8 \
	--trust-remote-code
	```

	## Citation

	```bibtex
	@article{gsq2026,
	title = {GSQ: Highly-Accurate Low-Precision Scalar Quantization for LLMs via Gumbel-Softmax Sampling},
	author = {Dadgarnia, Alireza and Tabesh, Soroush and Nikdan, Mahdi and Helcig, Michael and Kurti{\'c}, Eldar and Kleinegger, Max and Alistarh, Dan},
	journal= {arXiv preprint arXiv:2604.18556},
	year = {2026},
	url = {https://arxiv.org/abs/2604.18556}
	}
	```