---
license: other
license_name: modified-mit
license_link: https://huggingface.co/moonshotai/Kimi-K2.5/blob/main/LICENSE
base_model: moonshotai/Kimi-K2.5
tags:
- mtp
- speculative-decoding
- kimi
- deepseek
- moe
---

# Kimi-K2.5-MTP

Kimi K2.5 with Multi-Token Prediction (MTP) speculative-decoding support for vLLM.

## What is this?

This model combines:

- **Base model**: [moonshotai/Kimi-K2.5](https://huggingface.co/moonshotai/Kimi-K2.5) (1.04T total / 32B active MoE, INT4 compressed-tensors)
- **MTP layer**: layer-61 MTP weights from [k-l-lambda/Kimi-K2.5-MTP](https://huggingface.co/k-l-lambda/Kimi-K2.5-MTP) (14.7 GB; INT4 compressed-tensors for the experts, BF16 for projections, norms, and head)

The MTP module enables speculative decoding: a lightweight prediction head drafts tokens that the main model verifies in parallel, improving throughput.
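
The draft/verify loop can be illustrated with a toy sketch (greedy decoding only; `speculative_step`, `draft_next`, and `target_next` are hypothetical stand-ins for the MTP head and the base model, which really operate on batched logits, not integers):

```python
def speculative_step(prefix, draft_next, target_next, k=1):
    """Draft k tokens with the cheap head, then verify them with the target model."""
    # 1) Draft phase: the MTP head proposes k tokens autoregressively.
    drafts = []
    ctx = list(prefix)
    for _ in range(k):
        t = draft_next(ctx)
        drafts.append(t)
        ctx.append(t)

    # 2) Verify phase: in a real engine one batched target pass scores all
    #    k positions at once; simulated here with sequential calls.
    accepted = []
    ctx = list(prefix)
    for t in drafts:
        expected = target_next(ctx)
        if expected != t:            # mismatch: reject, emit the target's token
            accepted.append(expected)
            return accepted
        accepted.append(t)           # match: the drafted token is accepted for free
        ctx.append(t)

    # 3) Bonus token: the verify pass also yields the next target token.
    accepted.append(target_next(ctx))
    return accepted

# Deterministic toy "models" that both continue an arithmetic sequence.
target = lambda ctx: ctx[-1] + 1
draft = lambda ctx: ctx[-1] + 1          # perfect draft: every token accepted
print(speculative_step([0, 1, 2], draft, target, k=1))  # [3, 4]
```

With a perfect draft, each verify pass emits k+1 tokens instead of 1; a rejected draft falls back to the target's own token, so output quality is unchanged.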

## Usage with vLLM

Requires [modularml/vllm](https://github.com/modularml/vllm) with MTP support for Kimi K2.5 (branch `kimi-k25-mtp-780ba37`).

```bash
vllm serve modularai/Kimi-K2.5-MTP \
  --tensor-parallel-size 8 \
  --trust-remote-code \
  --speculative-config '{"model":"modularai/Kimi-K2.5-MTP","method":"mtp","num_speculative_tokens":1,"use_local_argmax_reduction":true}'
```
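
Once the server is up, it speaks vLLM's OpenAI-compatible API (on port 8000 by default). A minimal chat-completion request body, built offline for illustration:

```python
import json

# Body for POST http://localhost:8000/v1/chat/completions
payload = {
    "model": "modularai/Kimi-K2.5-MTP",
    "messages": [
        {"role": "user", "content": "Explain speculative decoding in one sentence."}
    ],
    "max_tokens": 128,
}
body = json.dumps(payload)
print(body[:35])  # {"model": "modularai/Kimi-K2.5-MTP"
```

Send it with any HTTP or OpenAI-SDK client; speculative decoding is transparent to callers, since only accepted tokens are returned.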

## Benchmark Results (8x NVIDIA B200)

| Config | Output tok/s | TPOT p50 (ms) | Acceptance rate |
|--------|--------------|---------------|-----------------|
| No speculation | 947 | 14.07 | N/A |
| MTP k=1 | 869 | 15.85 | ~39% |

> **Note**: The MTP acceptance rate is low (~39%) because the MTP weights were not trained directly on this base model checkpoint. With properly matched MTP weights (trained via self-distillation on this exact checkpoint), acceptance rates of 80-90% are expected, yielding ~1.5-1.8x throughput improvement.
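
The numbers in the note follow from a simple model: if each drafted token is accepted independently with probability alpha, the expected tokens emitted per target verify pass is the geometric sum 1 + alpha + ... + alpha^k, an upper bound on speedup before drafting overhead. A back-of-envelope sketch:

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per verify pass of the target model, assuming
    each of the k drafted tokens is accepted independently with prob. alpha."""
    return sum(alpha ** i for i in range(k + 1))  # 1 + a + a^2 + ... + a^k

# For k=1 the bound is simply 1 + alpha.
print(round(expected_tokens_per_step(0.39, 1), 2))  # 1.39
print(round(expected_tokens_per_step(0.85, 1), 2))  # 1.85
```

At ~39% acceptance the ideal 1.39x gain is smaller than the drafting overhead, consistent with the measured slowdown; at 80-90% the 1.8-1.9x bound leaves room for the ~1.5-1.8x net improvement quoted above.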

## Architecture

- **Model type**: `kimi_k25` (VLM wrapper around the DeepSeek V3 architecture)
- **Parameters**: 1.04T total / 32B active (MoE with 384 routed experts)
- **MTP module**: 1 full decoder layer at index 61 (~14B parameters)
  - `enorm` + `hnorm` (RMSNorm) → concat → `eh_proj` (Linear, 2×7168 → 7168) → decoder layer (MLA attention + MoE) → `shared_head` (RMSNorm + LM head)
- **Quantization**: INT4 compressed-tensors (group_size=32) for MoE experts, BF16 for dense layers
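
The norm/concat/project front end of the MTP module can be sketched at the shape level in NumPy (`rms_norm` and `mtp_combine` are illustrative names, not the repo's module names; random weights and a toy hidden size stand in for the real 7168-dim tensors):

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    """RMSNorm without a learned scale, for illustration."""
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def mtp_combine(token_emb, prev_hidden, eh_proj_w):
    """Normalize the two inputs, concatenate, and project 2*d -> d,
    producing the hidden state fed to the MTP decoder layer."""
    fused = np.concatenate([rms_norm(token_emb), rms_norm(prev_hidden)], axis=-1)
    return fused @ eh_proj_w          # (batch, 2*d) @ (2*d, d) -> (batch, d)

d = 64                                # toy size; the real model uses d = 7168
rng = np.random.default_rng(0)
h = mtp_combine(rng.normal(size=(1, d)),      # embedding of the next input token
                rng.normal(size=(1, d)),      # last hidden state of the main model
                rng.normal(size=(2 * d, d)))  # eh_proj weight
print(h.shape)  # (1, 64)
```

The fused state then passes through the standard decoder layer and `shared_head` to produce draft-token logits.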

## License

This model uses weights from [moonshotai/Kimi-K2.5](https://huggingface.co/moonshotai/Kimi-K2.5) under the [Modified MIT License](https://huggingface.co/moonshotai/Kimi-K2.5/blob/main/LICENSE) and MTP weights from [k-l-lambda/Kimi-K2.5-MTP](https://huggingface.co/k-l-lambda/Kimi-K2.5-MTP) under MIT.

## Credits

- Base model: [Moonshot AI](https://huggingface.co/moonshotai)
- MTP weights: [k-l-lambda](https://huggingface.co/k-l-lambda)