---
license: other
license_name: modified-mit
license_link: https://huggingface.co/moonshotai/Kimi-K2.5/blob/main/LICENSE
base_model: moonshotai/Kimi-K2.5
tags:
- mtp
- speculative-decoding
- kimi
- deepseek
- moe
---

# Kimi-K2.5-MTP

Kimi K2.5 with Multi-Token Prediction (MTP) speculative-decoding support for vLLM.

## What is this?

This model combines:

- **Base model**: [moonshotai/Kimi-K2.5](https://huggingface.co/moonshotai/Kimi-K2.5) (1.04T total / 32B active MoE, INT4 compressed-tensors)
- **MTP layer**: layer-61 MTP weights from [k-l-lambda/Kimi-K2.5-MTP](https://huggingface.co/k-l-lambda/Kimi-K2.5-MTP) (14.7 GB; INT4 compressed-tensors for the experts, BF16 for projections, norms, and head)

The MTP module enables speculative decoding: a lightweight prediction head drafts tokens that the main model verifies in parallel, improving throughput.
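
The draft/verify loop can be illustrated with a toy sketch (greedy decoding only; `speculative_step`, `draft_next`, and `target_next` are hypothetical stand-ins for the MTP head and the base model, which really operate on batched logits, not integers):

```python
def speculative_step(prefix, draft_next, target_next, k=1):
    """Draft k tokens with the cheap head, then verify them with the target model."""
    # 1) Draft phase: the MTP head proposes k tokens autoregressively.
    drafts = []
    ctx = list(prefix)
    for _ in range(k):
        t = draft_next(ctx)
        drafts.append(t)
        ctx.append(t)

    # 2) Verify phase: in a real engine one batched target pass scores all
    #    k positions at once; simulated here with sequential calls.
    accepted = []
    ctx = list(prefix)
    for t in drafts:
        expected = target_next(ctx)
        if expected != t:            # mismatch: reject, emit the target's token
            accepted.append(expected)
            return accepted
        accepted.append(t)           # match: the drafted token is accepted for free
        ctx.append(t)

    # 3) Bonus token: the verify pass also yields the next target token.
    accepted.append(target_next(ctx))
    return accepted

# Deterministic toy "models" that both continue an arithmetic sequence.
target = lambda ctx: ctx[-1] + 1
draft = lambda ctx: ctx[-1] + 1          # perfect draft: every token accepted
print(speculative_step([0, 1, 2], draft, target, k=1))  # [3, 4]
```

With a perfect draft, each verify pass emits k+1 tokens instead of 1; a rejected draft falls back to the target's own token, so output quality is unchanged.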

## Usage with vLLM

Requires [modularml/vllm](https://github.com/modularml/vllm) with MTP support for Kimi K2.5 (branch `kimi-k25-mtp-780ba37`).

```bash
vllm serve modularai/Kimi-K2.5-MTP \
  --tensor-parallel-size 8 \
  --trust-remote-code \
  --speculative-config '{"model":"modularai/Kimi-K2.5-MTP","method":"mtp","num_speculative_tokens":1,"use_local_argmax_reduction":true}'
```
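
Once the server is up, it speaks vLLM's OpenAI-compatible API (on port 8000 by default). A minimal chat-completion request body, built offline for illustration:

```python
import json

# Body for POST http://localhost:8000/v1/chat/completions
payload = {
    "model": "modularai/Kimi-K2.5-MTP",
    "messages": [
        {"role": "user", "content": "Explain speculative decoding in one sentence."}
    ],
    "max_tokens": 128,
}
body = json.dumps(payload)
print(body[:35])  # {"model": "modularai/Kimi-K2.5-MTP"
```

Send it with any HTTP or OpenAI-SDK client; speculative decoding is transparent to callers, since only accepted tokens are returned.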

## Benchmark Results (8x NVIDIA B200)

| Config | Output tok/s | TPOT p50 (ms) | Acceptance rate |
|--------|--------------|---------------|-----------------|
| No speculation | 947 | 14.07 | N/A |
| MTP k=1 | 869 | 15.85 | ~39% |

> **Note**: The MTP acceptance rate is low (~39%) because the MTP weights were not trained directly on this base model checkpoint. With properly matched MTP weights (trained via self-distillation on this exact checkpoint), acceptance rates of 80-90% are expected, yielding ~1.5-1.8x throughput improvement.
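
The numbers in the note follow from a simple model: if each drafted token is accepted independently with probability alpha, the expected tokens emitted per target verify pass is the geometric sum 1 + alpha + ... + alpha^k, an upper bound on speedup before drafting overhead. A back-of-envelope sketch:

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per verify pass of the target model, assuming
    each of the k drafted tokens is accepted independently with prob. alpha."""
    return sum(alpha ** i for i in range(k + 1))  # 1 + a + a^2 + ... + a^k

# For k=1 the bound is simply 1 + alpha.
print(round(expected_tokens_per_step(0.39, 1), 2))  # 1.39
print(round(expected_tokens_per_step(0.85, 1), 2))  # 1.85
```

At ~39% acceptance the ideal 1.39x gain is smaller than the drafting overhead, consistent with the measured slowdown; at 80-90% the 1.8-1.9x bound leaves room for the ~1.5-1.8x net improvement quoted above.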

## Architecture

- **Model type**: `kimi_k25` (VLM wrapper around the DeepSeek V3 architecture)
- **Parameters**: 1.04T total / 32B active (MoE with 384 routed experts)
- **MTP module**: 1 full decoder layer at index 61 (~14B parameters)
  - `enorm` + `hnorm` (RMSNorm) → concat → `eh_proj` (Linear, 2×7168 → 7168) → decoder layer (MLA attention + MoE) → `shared_head` (RMSNorm + LM head)
- **Quantization**: INT4 compressed-tensors (group_size=32) for MoE experts, BF16 for dense layers
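
The norm/concat/project front end of the MTP module can be sketched at the shape level in NumPy (`rms_norm` and `mtp_combine` are illustrative names, not the repo's module names; random weights and a toy hidden size stand in for the real 7168-dim tensors):

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    """RMSNorm without a learned scale, for illustration."""
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def mtp_combine(token_emb, prev_hidden, eh_proj_w):
    """Normalize the two inputs, concatenate, and project 2*d -> d,
    producing the hidden state fed to the MTP decoder layer."""
    fused = np.concatenate([rms_norm(token_emb), rms_norm(prev_hidden)], axis=-1)
    return fused @ eh_proj_w          # (batch, 2*d) @ (2*d, d) -> (batch, d)

d = 64                                # toy size; the real model uses d = 7168
rng = np.random.default_rng(0)
h = mtp_combine(rng.normal(size=(1, d)),      # embedding of the next input token
                rng.normal(size=(1, d)),      # last hidden state of the main model
                rng.normal(size=(2 * d, d)))  # eh_proj weight
print(h.shape)  # (1, 64)
```

The fused state then passes through the standard decoder layer and `shared_head` to produce draft-token logits.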

## License

This model uses weights from [moonshotai/Kimi-K2.5](https://huggingface.co/moonshotai/Kimi-K2.5) under the [Modified MIT License](https://huggingface.co/moonshotai/Kimi-K2.5/blob/main/LICENSE) and MTP weights from [k-l-lambda/Kimi-K2.5-MTP](https://huggingface.co/k-l-lambda/Kimi-K2.5-MTP) under MIT.

## Credits

- Base model: [Moonshot AI](https://huggingface.co/moonshotai)
- MTP weights: [k-l-lambda](https://huggingface.co/k-l-lambda)