---
license: other
license_name: modified-mit
license_link: https://huggingface.co/moonshotai/Kimi-K2.5/blob/main/LICENSE
base_model: moonshotai/Kimi-K2.5
tags:
- mtp
- speculative-decoding
- kimi
- deepseek
- moe
---

# Kimi-K2.5-MTP

Kimi K2.5 with Multi-Token Prediction (MTP) speculative decoding support for vLLM.

## What is this?

This model combines:

- **Base model**: [moonshotai/Kimi-K2.5](https://huggingface.co/moonshotai/Kimi-K2.5) (1.04T-parameter MoE with 32B active, INT4 compressed-tensors)
- **MTP layer**: layer-61 MTP weights from [k-l-lambda/Kimi-K2.5-MTP](https://huggingface.co/k-l-lambda/Kimi-K2.5-MTP) (14.7 GB; INT4 compressed-tensors for experts, BF16 for projections/norms/head)

The MTP module enables speculative decoding: a lightweight prediction head drafts tokens that the main model verifies in parallel, improving throughput.

## Usage with vLLM

Requires [modularml/vllm](https://github.com/modularml/vllm) with MTP support for Kimi K2.5 (branch `kimi-k25-mtp-780ba37`).

```bash
vllm serve modularai/Kimi-K2.5-MTP \
  --tensor-parallel-size 8 \
  --trust-remote-code \
  --speculative-config '{"model":"modularai/Kimi-K2.5-MTP","method":"mtp","num_speculative_tokens":1,"use_local_argmax_reduction":true}'
```

## Benchmark Results (8x NVIDIA B200)

| Config | Output tok/s | TPOT p50 (ms) | Acceptance rate |
|----------------|--------------|---------------|-----------------|
| No speculation | 947 | 14.07 | — |
| MTP k=1 | 869 | 15.85 | ~39% |

> **Note**: The acceptance rate is low (~39%) because the MTP weights were not trained on this exact base-model checkpoint. With properly matched MTP weights (trained via self-distillation on this checkpoint), acceptance rates of 80-90% are expected, yielding roughly 1.5-1.8x higher throughput.
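The effect of acceptance rate on throughput can be quantified with a simple model: with k=1, each verification step emits one guaranteed base-model token plus one draft token accepted with probability p, so the ideal tokens-per-step is 1 + p, ignoring the MTP layer's per-step overhead and assuming position-independent acceptance (both simplifications). A quick sketch:

```python
def ideal_tokens_per_step(acceptance_rate: float, k: int = 1) -> float:
    """Expected tokens emitted per verification step with k speculative
    tokens, assuming each draft token is accepted independently with the
    given rate (a simplification; real acceptance is position-dependent)."""
    p = acceptance_rate
    # 1 guaranteed base-model token + expected accepted draft tokens:
    # a draft at depth i only counts if all i drafts before it were accepted.
    return 1.0 + sum(p**i for i in range(1, k + 1))

low = ideal_tokens_per_step(0.39)   # observed acceptance -> 1.39 tokens/step
high = ideal_tokens_per_step(0.85)  # matched-weights estimate -> 1.85
print(f"ideal relative gain: {high / low:.2f}x")
```

At 85% acceptance the ideal ceiling is 1.85 tokens per step, consistent with the 1.5-1.8x figure above once per-step overhead is accounted for.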
## Architecture

- **Model type**: `kimi_k25` (VLM wrapper around the DeepSeek V3 architecture)
- **Parameters**: 1.04T total / 32B active (MoE with 384 routed experts)
- **MTP module**: one full decoder layer at index 61 (~14B parameters)
  - `enorm` + `hnorm` (RMSNorm) → concat → `eh_proj` (Linear, 2×7168 → 7168) → decoder layer (MLA attention + MoE) → `shared_head` (RMSNorm + LM head)
- **Quantization**: INT4 compressed-tensors (group_size=32) for MoE experts, BF16 for dense layers

## License

This model uses weights from [moonshotai/Kimi-K2.5](https://huggingface.co/moonshotai/Kimi-K2.5) under the [Modified MIT License](https://huggingface.co/moonshotai/Kimi-K2.5/blob/main/LICENSE) and MTP weights from [k-l-lambda/Kimi-K2.5-MTP](https://huggingface.co/k-l-lambda/Kimi-K2.5-MTP) under MIT.

## Credits

- Base model: [Moonshot AI](https://huggingface.co/moonshotai)
- MTP weights: [k-l-lambda](https://huggingface.co/k-l-lambda)
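## Appendix: MTP fusion sketch

The input-fusion step of the MTP module described under Architecture (normalize the next-token embedding and the previous hidden state, concatenate, project back down via `eh_proj`) can be sketched in NumPy. This is a toy illustration only: it uses a tiny hidden size instead of 7168, random weights instead of the real BF16 `eh_proj`, an unscaled RMSNorm, and it omits the decoder layer and `shared_head` entirely.

```python
import numpy as np

def rms_norm(x: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    # RMSNorm without a learned scale, for illustration only
    return x / np.sqrt(np.mean(x**2, axis=-1, keepdims=True) + eps)

d = 8  # toy hidden size; the real model uses 7168
rng = np.random.default_rng(0)
eh_proj = rng.standard_normal((2 * d, d)) / np.sqrt(2 * d)  # stand-in weights

def mtp_combine(token_embedding: np.ndarray, prev_hidden: np.ndarray) -> np.ndarray:
    # enorm / hnorm -> concat -> eh_proj (2d -> d), as in the MTP module;
    # the result would then feed the MTP decoder layer and shared_head
    e = rms_norm(token_embedding)
    h = rms_norm(prev_hidden)
    return np.concatenate([e, h], axis=-1) @ eh_proj

fused = mtp_combine(rng.standard_normal(d), rng.standard_normal(d))
assert fused.shape == (d,)
```

The down-projection keeps the MTP layer's hidden size identical to the main trunk's, so the drafted token's hidden state can reuse the same decoder-layer and head machinery.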