---
license: other
license_name: modified-mit
license_link: https://huggingface.co/moonshotai/Kimi-K2.5/blob/main/LICENSE
base_model: moonshotai/Kimi-K2.5
tags:
  - mtp
  - speculative-decoding
  - kimi
  - deepseek
  - moe
---

# Kimi-K2.5-MTP

Kimi K2.5 with Multi-Token Prediction (MTP) speculative decoding support for vLLM.

## What is this?

This model combines:

- **Base model**: moonshotai/Kimi-K2.5 (1.04T total / 32B active MoE, INT4 compressed-tensors)
- **MTP layer**: layer-61 MTP weights from k-l-lambda/Kimi-K2.5-MTP (14.7 GB; INT4 compressed-tensors for the experts, BF16 for projections/norms/head)

The MTP module enables speculative decoding: a lightweight prediction head drafts tokens that the main model verifies in parallel, improving throughput.
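The draft-and-verify loop can be sketched in a few lines. This is a toy illustration of the k=1 case, not the vLLM implementation; the "models" are stand-in functions that map a context to a next token:

```python
# Toy sketch of draft-and-verify speculative decoding with k=1.
# A real engine scores the drafted continuation in one batched
# forward pass; here the calls are sequential for clarity.

def speculative_step(context, draft_model, target_model):
    """One decode step: the draft head proposes a token; the main
    model verifies it. An accepted draft yields a bonus token, so
    up to two tokens come out of a single verification step."""
    drafted = draft_model(context)
    verified = target_model(context)
    if drafted == verified:
        # Accepted: the target's pass over context + [drafted]
        # produces the next token "for free".
        bonus = target_model(context + [drafted])
        return [drafted, bonus]
    # Rejected: fall back to the target model's own token.
    return [verified]

# Stand-in models: the target emits last token + 1; the draft
# agrees only when the last token is even.
target = lambda ctx: ctx[-1] + 1
draft = lambda ctx: ctx[-1] + 1 if ctx[-1] % 2 == 0 else 0

print(speculative_step([0], draft, target))  # accepted -> [1, 2]
print(speculative_step([1], draft, target))  # rejected -> [2]
```

The acceptance rate of the draft head directly controls how many tokens each verification step emits, which is why it dominates the benchmark results below.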

## Usage with vLLM

Requires modularml/vllm with MTP support for Kimi K2.5 (branch `kimi-k25-mtp-780ba37`).

```bash
vllm serve modularai/Kimi-K2.5-MTP \
    --tensor-parallel-size 8 \
    --trust-remote-code \
    --speculative-config '{"model":"modularai/Kimi-K2.5-MTP","method":"mtp","num_speculative_tokens":1,"use_local_argmax_reduction":true}'
```
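Once running, the server exposes vLLM's standard OpenAI-compatible REST API. A minimal request sketch (assuming vLLM's default port 8000 and chat-completions route; the payload is only constructed here, not sent):

```python
# Builds a chat-completion request for the server started above.
# Port 8000 and the /v1/chat/completions path are vLLM defaults.
import json

def make_chat_request(prompt, max_tokens=128):
    body = {
        "model": "modularai/Kimi-K2.5-MTP",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return ("http://localhost:8000/v1/chat/completions",
            {"Content-Type": "application/json"},
            json.dumps(body))

url, headers, body = make_chat_request("Summarize speculative decoding.")
print(url)
```

Any OpenAI-compatible client (e.g. the `openai` Python package pointed at this base URL) works the same way; speculative decoding is transparent to the caller.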

## Benchmark Results (8x NVIDIA B200)

| Config | Output tok/s | TPOT p50 (ms) | Acceptance rate |
|---|---|---|---|
| No speculation | 947 | 14.07 | n/a |
| MTP k=1 | 869 | 15.85 | ~39% |

**Note:** The MTP acceptance rate is low (~39%) because the MTP weights were not trained directly on this base model checkpoint. With properly matched MTP weights (trained via self-distillation on this exact checkpoint), acceptance rates of 80-90% are expected, yielding ~1.5-1.8x throughput improvement.
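A back-of-envelope model makes the numbers plausible. With k=1, each verification step emits one token plus one bonus token when the draft is accepted, so the expected tokens per step is 1 + alpha (the acceptance rate); per-step overhead from running the draft head and verification is folded into a single cost factor here:

```python
# Simplified speedup model for k=1 speculation: expected tokens per
# verification step = 1 + alpha, divided by the relative per-step cost
# (1.0 = plain decoding; > 1.0 models draft + verify overhead).

def expected_speedup(alpha, step_cost=1.0):
    """Idealized throughput gain over plain decoding."""
    return (1 + alpha) / step_cost

# At the measured ~39% acceptance the zero-overhead ceiling is only
# ~1.39x, which real per-step overhead can push below 1x -- consistent
# with the benchmark above (869 vs 947 tok/s).
print(round(expected_speedup(0.39), 2))  # 1.39
# At ~85% acceptance the ceiling is ~1.85x, in line with the expected
# ~1.5-1.8x once overhead is accounted for.
print(round(expected_speedup(0.85), 2))  # 1.85
```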

## Architecture

- **Model type**: `kimi_k25` (a VLM wrapper around the DeepSeek V3 architecture)
- **Parameters**: 1.04T total / 32B active (MoE with 384 routed experts)
- **MTP module**: one full decoder layer at index 61 (~14B parameters)
  - `enorm` + `hnorm` (RMSNorm) → concat → `eh_proj` (Linear, 2×7168 → 7168) → decoder layer (MLA attention + MoE) → `shared_head` (RMSNorm + LM head)
- **Quantization**: INT4 compressed-tensors (`group_size=32`) for MoE experts, BF16 for dense layers
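The MTP module's data flow can be traced at the shape level. The sketch below uses random NumPy stand-ins for the real (quantized) weights, and shrinks the hidden size from the model's true 7168 so the demo stays tiny; only the wiring matches the pipeline listed above:

```python
# Shape-level sketch of the MTP forward pass: enorm/hnorm -> concat
# -> eh_proj -> decoder layer -> shared_head. All weights are random
# stand-ins; D is reduced from the real hidden size of 7168.
import numpy as np

D, VOCAB = 64, 32  # demo sizes; the real model uses D = 7168

def rms_norm(x, eps=1e-6):
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

rng = np.random.default_rng(0)
eh_proj = rng.standard_normal((2 * D, D)) * 0.02  # Linear 2D -> D
lm_head = rng.standard_normal((D, VOCAB)) * 0.02  # shared_head projection
decoder_layer = lambda h: h  # stand-in for the MLA-attention + MoE layer

def mtp_forward(token_embedding, prev_hidden):
    h = np.concatenate([rms_norm(token_embedding),   # enorm
                        rms_norm(prev_hidden)], -1)  # hnorm, then concat
    h = h @ eh_proj                                   # eh_proj
    h = decoder_layer(h)                              # decoder layer
    return rms_norm(h) @ lm_head                      # shared_head -> logits

logits = mtp_forward(rng.standard_normal((1, D)), rng.standard_normal((1, D)))
print(logits.shape)  # (1, 32)
```

The key structural point is the 2×hidden → hidden `eh_proj`: the MTP head conditions on both the next token's embedding and the main model's last hidden state before running its single decoder layer.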

## License

This model uses weights from moonshotai/Kimi-K2.5 under the Modified MIT License and MTP weights from k-l-lambda/Kimi-K2.5-MTP under MIT.

## Credits