	---
	base_model: moonshotai/Kimi-K2.5
	language:
	- en
	license: other
	license_name: kimi-k2.5-license
	license_link: https://huggingface.co/moonshotai/Kimi-K2.5/blob/main/LICENSE
	tags:
	- quantization
	- compressed-tensors
	- gsq
	- 2-bit
	- moe
	- multimodal
	---

	# Kimi-K2.5 — 2-bit GSQ Quantization

	This is a simulated 2-bit quantized version of [moonshotai/Kimi-K2.5](https://huggingface.co/moonshotai/Kimi-K2.5), produced using GSQ, a learned post-training quantization method. The model weights are stored in [compressed-tensors](https://github.com/neuralmagic/compressed-tensors) format and are compatible with vLLM for inference.

	> Note (simulated quantization): The quantization was optimized at 2-bit precision, but the resulting weights are serialized into a 4-bit packed integer format (`int32`, with eight 4-bit values per element) via compressed-tensors. At inference time, vLLM loads and dequantizes from this 4-bit container. The weight values use only 4 distinct levels (matching true 2-bit), but the on-disk and in-memory representation is 4-bit, so this checkpoint saves no memory or storage beyond INT4.
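
To make the storage note concrete, the sketch below packs eight 2-bit codes into one 32-bit word using 4-bit slots. The nibble order (lowest bits first) and the helper names `pack_nibbles` and `unpack_nibbles` are assumptions for illustration; the actual compressed-tensors layout may differ.

```python
# Illustrative only: helper names and nibble order are assumptions,
# not the compressed-tensors implementation.

def pack_nibbles(codes):
    """Pack eight 4-bit codes into one 32-bit word, lowest nibble first."""
    if len(codes) != 8 or any(not 0 <= c < 16 for c in codes):
        raise ValueError("expected eight codes in [0, 16)")
    word = 0
    for i, code in enumerate(codes):
        word |= code << (4 * i)
    return word


def unpack_nibbles(word):
    """Recover the eight 4-bit codes from a 32-bit word."""
    return [(word >> (4 * i)) & 0xF for i in range(8)]


# 2-bit codes only take values 0..3, so each 4-bit slot leaves its top two bits unused.
codes_2bit = [0, 3, 1, 2, 2, 0, 3, 1]
word = pack_nibbles(codes_2bit)

assert unpack_nibbles(word) == codes_2bit
bits_needed = 2 * len(codes_2bit)   # what true 2-bit packing would use
bits_stored = 4 * len(codes_2bit)   # what the INT4 container uses
print(f"needed={bits_needed} bits, stored={bits_stored} bits")
```

Half of the stored bits carry no information, which is why this checkpoint is no smaller than an INT4 one.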

	## Model Details

	| Property | Value |
	|---|---|
	| Base model | [moonshotai/Kimi-K2.5](https://huggingface.co/moonshotai/Kimi-K2.5) |
	| Architecture | MoE multimodal LLM (DeepSeek-V3-style MoE) |
	| Transformer layers | 61 |
	| Routed experts | 384 (8 active per token) |
	| Hidden size | 7168 |
	| Context length | 262,144 tokens (256K) |
	| Total parameters | ~547B |
	| Quantization | 2-bit GSQ (stored as INT4-packed via compressed-tensors) |
	| Quantized layers | Expert FFN weights, layers 1–60 |
	| Group size | 128 |
	| Calibration dataset | [open-thoughts/OpenThoughts-114k](https://huggingface.co/datasets/open-thoughts/OpenThoughts-114k) |
	| Weight format | compressed-tensors, `pack-quantized` |
	| Disk size | ~511 GB |

	## Results

	### Benchmark Results (lm-evaluation-harness)

	| Benchmark | Metric | Baseline (BF16) | GSQ 2-bit | Δ |
	|---|---|---|---|---|
	| GSM8K | exact_match (strict) | 94.01 | 92.57 | -1.44 |
	| ARC-Challenge | acc_norm | 70.14 | 62.97 | -7.17 |
	| ARC-Easy | acc_norm | 88.80 | 85.10 | -3.70 |
	| PIQA | acc_norm | 86.29 | 82.37 | -3.92 |
	| WinoGrande | acc | 80.82 | 76.95 | -3.87 |
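
The Δ column can be recomputed directly from the two score columns. The sketch below does that and also reports the relative change, which is not part of the original table; `summarize` is a hypothetical helper.

```python
# Scores copied from the table above: (baseline BF16, GSQ 2-bit).
scores = {
    "GSM8K": (94.01, 92.57),
    "ARC-Challenge": (70.14, 62.97),
    "ARC-Easy": (88.80, 85.10),
    "PIQA": (86.29, 82.37),
    "WinoGrande": (80.82, 76.95),
}


def summarize(baseline, quantized):
    """Return (absolute delta in points, relative change in percent)."""
    delta = quantized - baseline
    return round(delta, 2), round(100.0 * delta / baseline, 2)


summary = {name: summarize(*pair) for name, pair in scores.items()}
for name, (delta, rel) in summary.items():
    print(f"{name:14s} delta={delta:+.2f} pts  relative={rel:+.2f}%")

# Benchmark with the largest absolute drop.
worst = min(summary, key=lambda name: summary[name][0])
```

ARC-Challenge shows the largest absolute drop and GSM8K the smallest.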

	### Perplexity (WikiText-2)

	Evaluated on a 128-sample held-out split during quantization. Perplexity was measured every 6 layers as quantization progressed; selected checkpoints are shown:

	| Checkpoint | WikiText-2 PPL |
	|---|---|
	| Dense baseline | 1.734 |
	| After layer 6 | 1.734 |
	| After layer 12 | 1.733 |
	| After layer 24 | 1.733 |
	| After layer 36 | 1.735 |
	| After layer 48 | 1.741 |
	| After layer 60 (final) | 1.749 |

	The final 2-bit quantized model stays within 0.015 perplexity of the dense baseline (1.749 vs. 1.734, under 1% relative degradation).
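
For reference, perplexity is the exponential of the mean per-token negative log-likelihood. The sketch below shows that definition on made-up log-probabilities and then recomputes the relative degradation from the table; the helper names and the toy values are illustrative, not the evaluation code used for this model.

```python
import math


def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood (natural log) per token."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)


def relative_change(baseline, value):
    """Relative change of `value` versus `baseline`, in percent."""
    return 100.0 * (value - baseline) / baseline


# Toy example: a model that assigns probability 0.5 to every token has PPL 2.
toy_ppl = perplexity([math.log(0.5)] * 4)

# Values from the table above.
dense_ppl, final_ppl = 1.734, 1.749
gap = final_ppl - dense_ppl
rel = relative_change(dense_ppl, final_ppl)
print(f"toy PPL={toy_ppl:.3f}, gap={gap:.3f}, relative={rel:.2f}%")
```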

	## Quantization Details

	This model was quantized using GSQ, a learned post-training quantization method. Quantization was applied independently to each transformer layer using 4,096 calibration samples of sequence length 4,096 from the [OpenThoughts](https://huggingface.co/datasets/open-thoughts/OpenThoughts-114k) dataset, with group size 128.
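
To illustrate what a group size of 128 means at 2 bits, here is a minimal sketch of group-wise uniform quantization with round-to-nearest. This is not GSQ: GSQ is a learned method, whereas the sketch uses a plain min/max grid, and all function names are hypothetical.

```python
import random


def quantize_group(weights, bits=2):
    """Map one group to integer codes on a uniform min/max grid."""
    levels = 2 ** bits                       # 4 levels at 2 bits
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / (levels - 1) or 1.0  # avoid a zero scale for constant groups
    codes = [min(levels - 1, max(0, round((w - lo) / scale))) for w in weights]
    return codes, scale, lo


def dequantize_group(codes, scale, zero):
    """Map integer codes back to floating-point values."""
    return [zero + c * scale for c in codes]


def quantize_row(row, group_size=128, bits=2):
    """Quantize a weight row group by group; each group gets its own scale and zero."""
    return [quantize_group(row[i:i + group_size], bits)
            for i in range(0, len(row), group_size)]


random.seed(0)
row = [random.gauss(0.0, 0.02) for _ in range(256)]  # toy weights, not real ones
groups = quantize_row(row)

restored = [w for codes, scale, zero in groups
            for w in dequantize_group(codes, scale, zero)]
max_err = max(abs(a - b) for a, b in zip(row, restored))
max_step = max(scale for _, scale, _ in groups)
print(f"groups={len(groups)}, max_err={max_err:.5f}, max_step={max_step:.5f}")
```

With round-to-nearest, the reconstruction error of each weight is at most half the grid step of its group. A learned method aims to do better than this baseline at the same bit width.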

	Only the routed MoE expert feed-forward weights (`gate_proj`, `up_proj`, `down_proj`) in layers 1–60 are quantized. The following components are kept in their original precision:
	- Attention projections (`self_attn`)
	- Embeddings and the LM head
	- Layer norms
	- The shared expert
	- Layer 0's dense MLP
	- All vision tower and multimodal projector weights
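
The selection rule above can be expressed as a filter over parameter names. The sketch below assumes DeepSeek-V3-style names such as `layers.<i>.mlp.experts.<j>.up_proj.weight`; the exact names in this checkpoint are not listed in this README, so treat the pattern and the example names as hypothetical.

```python
import re

# Assumed naming scheme (hypothetical): routed experts live under
# "layers.<layer>.mlp.experts.<expert>.<proj>.weight".
ROUTED_EXPERT = re.compile(
    r"(?:^|\.)layers\.(\d+)\.mlp\.experts\.\d+\."
    r"(?:gate_proj|up_proj|down_proj)\.weight$"
)


def is_quantized(param_name, first_layer=1, last_layer=60):
    """True for routed-expert FFN weights in layers first_layer..last_layer."""
    match = ROUTED_EXPERT.search(param_name)
    return bool(match) and first_layer <= int(match.group(1)) <= last_layer


examples = [
    "model.layers.5.mlp.experts.17.up_proj.weight",      # routed expert
    "model.layers.5.mlp.shared_experts.up_proj.weight",  # shared expert
    "model.layers.0.mlp.gate_proj.weight",               # layer 0 dense MLP
    "model.layers.5.self_attn.o_proj.weight",            # attention
    "model.embed_tokens.weight",                         # embeddings
]
decisions = {name: is_quantized(name) for name in examples}
for name, flag in decisions.items():
    print(f"{'quantize' if flag else 'keep    '}  {name}")
```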

	## Usage

	This model requires vLLM for inference. Because Kimi-K2.5 uses a custom model architecture (`kimi_k25`), you must pass `--trust-remote-code`.

	Although the MoE expert weights are quantized to 2-bit, they are stored in a 4-bit container, and the attention, embedding, and norm weights remain in bfloat16. The on-disk size is therefore ~511 GB, and the model still requires substantial GPU memory. In our testing, serving required 8× NVIDIA GH200 96 GB GPUs (2 nodes, tensor parallelism 8).
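
A rough budget check shows why eight GPUs are used. The sketch below only multiplies numbers quoted in this README (8 GPUs, 96 GB each, ~511 GB of weights) together with the `--gpu-memory-utilization 0.85` value from the serving command. It treats disk size as a proxy for weight memory and ignores framework overhead, so it is an estimate under those assumptions, not a measurement.

```python
def memory_budget_gb(num_gpus, gb_per_gpu, utilization):
    """Memory vLLM may claim across all GPUs, in GB."""
    return num_gpus * gb_per_gpu * utilization


weights_gb = 511.0  # on-disk size, used as a proxy for weight memory (assumption)
budget_gb = memory_budget_gb(num_gpus=8, gb_per_gpu=96, utilization=0.85)
headroom_gb = budget_gb - weights_gb  # roughly what is left for KV cache and activations

# Same check with half the GPUs.
fits_on_four = memory_budget_gb(4, 96, 0.85) > weights_gb

print(f"budget={budget_gb:.1f} GB, headroom={headroom_gb:.1f} GB, "
      f"fits on 4 GPUs: {fits_on_four}")
```

Under these assumptions the weights do not fit on four such GPUs, and eight leave limited headroom, which is consistent with the small `--max-model-len` and `--max-num-seqs` values used during testing.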

	### Installation

	```bash
	pip install vllm
	```

	### Serving with vLLM

	```bash
	vllm serve daslab-testing/Kimi-K2.5-2bit-GSQ \
	  --trust-remote-code \
	  --tensor-parallel-size 8 \
	  --distributed-executor-backend ray \
	  --tokenizer-mode hf \
	  --mm-encoder-tp-mode data \
	  --max-model-len 4096 \
	  --gpu-memory-utilization 0.85 \
	  --max-num-seqs 4
	```

	Flag notes:
	- `--tokenizer-mode hf`: Required to prevent garbled output during extended serving sessions (vLLM issue [#35718](https://github.com/vllm-project/vllm/issues/35718)).
	- `--mm-encoder-tp-mode data`: Required for Kimi-K2.5's vision encoder. The ViT dimensions are not evenly divisible by the tensor-parallel size, which causes cuBLAS errors without this flag.
	- `--max-model-len 4096`: Adjust upward if GPU memory permits; 4096 is what was used during our testing.
	- `--distributed-executor-backend ray`: Required for multi-node serving.
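
Once the server is up, it exposes vLLM's OpenAI-compatible HTTP API. The sketch below builds a chat completion request with the standard library only. The host and port (`localhost:8000`, vLLM's default) and the helper names are assumptions, and the sampling values simply mirror the offline example in this README.

```python
import json
import urllib.request

MODEL = "daslab-testing/Kimi-K2.5-2bit-GSQ"
BASE_URL = "http://localhost:8000/v1"  # assumed host and port


def build_chat_request(prompt, max_tokens=256):
    """Build (without sending) a POST request for /v1/chat/completions."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.6,
        "top_p": 0.95,
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=600):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


request = build_chat_request("What is 2+2?")
# print(send(request))  # uncomment when the server is running
```

Keep the prompt plus `max_tokens` within the `--max-model-len` passed to the server.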

	### Offline inference with vLLM

	```python
	from vllm import LLM, SamplingParams

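	# Engine settings mirror the flags used in the serving command.
	llm = LLM(
	    model="daslab-testing/Kimi-K2.5-2bit-GSQ",
	    trust_remote_code=True,          # custom `kimi_k25` architecture
	    tensor_parallel_size=8,
	    tokenizer_mode="hf",
	    mm_encoder_tp_mode="data",
	    max_model_len=4096,              # raise if GPU memory permits
	    gpu_memory_utilization=0.85,
	)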

	sampling_params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=1024)

	outputs = llm.generate(["Explain the concept of entropy in thermodynamics."], sampling_params)
	print(outputs[0].outputs[0].text)
	```

	### Chat template

	Kimi-K2.5 uses its own tokenizer and chat template. Use the tokenizer bundled with this repository:

	```python
	from transformers import AutoTokenizer

	tokenizer = AutoTokenizer.from_pretrained(
	    "daslab-testing/Kimi-K2.5-2bit-GSQ",
	    trust_remote_code=True,
	)

	messages = [{"role": "user", "content": "What is 2+2?"}]
	prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
	```

	## Limitations

	- This is a research quantization, not a production-ready release. Expect some quality degradation relative to the full-precision model, particularly on tasks requiring precise arithmetic or complex multi-step reasoning.
	- Vision/multimodal capabilities have not been evaluated post-quantization (only the language model weights were quantized).
	- The model uses a custom architecture; some inference frameworks other than vLLM may not support it without modification.

	## License

	This model is derived from [moonshotai/Kimi-K2.5](https://huggingface.co/moonshotai/Kimi-K2.5) and is subject to the same [license terms](https://huggingface.co/moonshotai/Kimi-K2.5/blob/main/LICENSE). Please review those terms before use.