---
license: other
license_name: modified-mit
license_link: https://huggingface.co/moonshotai/Kimi-K2.5/blob/main/LICENSE
base_model: moonshotai/Kimi-K2.5
tags:
- mtp
- speculative-decoding
- kimi
- deepseek
- moe
---
# Kimi-K2.5-MTP
Kimi K2.5 with Multi-Token Prediction (MTP) speculative decoding support for vLLM.
## What is this?
This model combines:
- Base model: moonshotai/Kimi-K2.5 (1.04T/32B active MoE, INT4 compressed-tensors)
- MTP layer: Layer 61 MTP weights from k-l-lambda/Kimi-K2.5-MTP (14.7 GB, INT4 compressed-tensors for experts, BF16 for projections/norms/head)
The MTP module enables speculative decoding: a lightweight prediction head drafts tokens that the main model verifies in parallel, improving throughput.
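The draft-then-verify loop can be sketched with toy stand-in models (greedy acceptance only; a real engine like vLLM batches the verification into a single forward pass rather than looping):

```python
# Toy sketch of greedy speculative decoding, not vLLM's implementation:
# a cheap draft model proposes k tokens; the target model checks them and
# keeps the longest agreeing prefix, then adds one token of its own.

def speculative_step(target_next, draft_next, ctx, k=1):
    """target_next/draft_next: callables mapping a context tuple to the
    token each model would emit greedily."""
    # 1. Draft k tokens autoregressively with the cheap model.
    proposed = []
    cur = tuple(ctx)
    for _ in range(k):
        t = draft_next(cur)
        proposed.append(t)
        cur = cur + (t,)
    # 2. Verify: accept proposals while the target agrees (in a real
    #    engine this is one batched forward pass, not a loop).
    accepted = []
    cur = tuple(ctx)
    for t in proposed:
        if target_next(cur) == t:
            accepted.append(t)
            cur = cur + (t,)
        else:
            break
    # 3. The target always contributes one token (correction or bonus),
    #    so each step emits len(accepted) + 1 tokens.
    accepted.append(target_next(cur))
    return accepted

# Tiny demo with hand-made toy models: the draft agrees with the target
# only on even-length contexts.
target = lambda c: len(c) % 3
draft = lambda c: len(c) % 3 if len(c) % 2 == 0 else -1
print(speculative_step(target, draft, (0,), k=1))  # prints [1]
```

Every step emits at least one token, so a mismatched draft can never corrupt the output, only waste the draft compute.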
## Usage with vLLM
Requires `modularml/vllm` with MTP support for Kimi K2.5 (branch `kimi-k25-mtp-780ba37`).
```bash
vllm serve modularai/Kimi-K2.5-MTP \
  --tensor-parallel-size 8 \
  --trust-remote-code \
  --speculative-config '{"model":"modularai/Kimi-K2.5-MTP","method":"mtp","num_speculative_tokens":1,"use_local_argmax_reduction":true}'
```
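Once the server is up, it exposes vLLM's OpenAI-compatible API. A minimal stdlib-only client sketch (the base URL assumes vLLM's default port 8000; adjust for your deployment):

```python
# Hypothetical client call against the server started above, via the
# OpenAI-compatible /v1/chat/completions endpoint that vLLM exposes.
# The "model" field must match the name passed to `vllm serve`.
import json
import urllib.request

payload = {
    "model": "modularai/Kimi-K2.5-MTP",
    "messages": [{"role": "user", "content": "Explain MTP in one sentence."}],
    "max_tokens": 128,
}

def query(base_url="http://localhost:8000"):
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# query()["choices"][0]["message"]["content"] holds the reply.
```

Speculative decoding is transparent to clients: requests and responses are identical with or without the `--speculative-config` flag.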
## Benchmark Results (8x NVIDIA B200)
| Config | Output tok/s | TPOT p50 (ms) | Acceptance Rate |
|---|---|---|---|
| No speculation | 947 | 14.07 | — |
| MTP k=1 | 869 | 15.85 | ~39% |
Note: The MTP acceptance rate is low (~39%) because the MTP weights were not trained directly on this base model checkpoint. With properly matched MTP weights (trained via self-distillation on this exact checkpoint), acceptance rates of 80-90% are expected, yielding ~1.5-1.8x throughput improvement.
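As a back-of-envelope check (an illustrative cost model, not the benchmark methodology): with k=1 and acceptance rate α, each verification step emits 1 + α tokens on average while paying a relative overhead c for the extra MTP draft forward pass.

```python
# Idealized throughput model (an assumption, not measured data): with one
# draft token per step and acceptance rate alpha, each step emits
# 1 + alpha tokens at a relative cost of 1 + c, where c is the overhead
# of running the MTP draft layer.

def relative_throughput(alpha, draft_overhead):
    return (1.0 + alpha) / (1.0 + draft_overhead)

# Observed regime: low acceptance barely pays for the overhead.
print(round(relative_throughput(0.39, 0.10), 2))  # 1.26
# Matched-weights regime cited above: ~80-90% acceptance.
print(round(relative_throughput(0.85, 0.10), 2))  # 1.68
```

With a hypothetical 10% draft overhead, α ≈ 0.39 yields only ~1.26x in this idealized model, so modest additional real-world verification and scheduling costs can erase the gain entirely, consistent with the table above; α ≈ 0.85 yields ~1.68x, in line with the ~1.5-1.8x figure.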
## Architecture
- Model type: `kimi_k25` (VLM wrapper around the DeepSeek V3 architecture)
- Parameters: 1.04T total / 32B active (MoE with 384 routed experts)
- MTP module: 1 full decoder layer at index 61 (~14B parameters): `enorm` + `hnorm` (RMSNorm) → concat → `eh_proj` (Linear, 2×7168 → 7168) → decoder layer (MLA attention + MoE) → `shared_head` (RMSNorm + LM head)
- Quantization: INT4 compressed-tensors (group_size=32) for MoE experts, BF16 for dense layers
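The MTP dataflow above can be traced at the shape level with a NumPy sketch (dimensions scaled down from the real d_model = 7168, random weights, and the decoder layer stubbed out; the vocabulary size is illustrative):

```python
# Shape-level sketch of the MTP module's dataflow; names follow the
# DeepSeek-V3-style MTP module described above. All weights are random
# and dimensions are scaled down (real d_model is 7168).
import numpy as np

d, vocab = 256, 1000  # scaled-down hidden size and illustrative vocab

def rms_norm(x, eps=1e-6):
    # RMSNorm without a learned scale, for shape illustration only
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

rng = np.random.default_rng(0)
token_emb = rng.standard_normal(d)   # embedding of the shifted input token
hidden = rng.standard_normal(d)      # final hidden state from the main model

# enorm + hnorm -> concat -> eh_proj (2*d -> d)
fused = np.concatenate([rms_norm(token_emb), rms_norm(hidden)])  # (2d,)
eh_proj = rng.standard_normal((d, 2 * d)) * 0.01
h = eh_proj @ fused                                              # (d,)

# decoder layer (MLA attention + MoE) stubbed as identity in this sketch

# shared_head: RMSNorm + LM head -> logits for the draft token
lm_head = rng.standard_normal((vocab, d)) * 0.01
logits = lm_head @ rms_norm(h)                                   # (vocab,)
draft_token = int(np.argmax(logits))
```

The `eh_proj` fusion is what lets the MTP layer condition on both the main model's hidden state and the next input token, rather than predicting from the hidden state alone.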
## License
This model uses weights from moonshotai/Kimi-K2.5 under the Modified MIT License and MTP weights from k-l-lambda/Kimi-K2.5-MTP under MIT.
## Credits
- Base model: Moonshot AI
- MTP weights: k-l-lambda