Kimi-K2.5-MTP

Kimi K2.5 with Multi-Token Prediction (MTP) speculative decoding support for vLLM.

What is this?

This model combines:

  • Base model: moonshotai/Kimi-K2.5 (1.04T/32B active MoE, INT4 compressed-tensors)
  • MTP layer: Layer 61 MTP weights from k-l-lambda/Kimi-K2.5-MTP (14.7 GB, INT4 compressed-tensors for experts, BF16 for projections/norms/head)

The MTP module enables speculative decoding where a lightweight prediction head drafts tokens that the main model can verify in parallel, improving throughput.
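
The draft-and-verify loop can be illustrated with a toy, self-contained sketch. The real vLLM implementation operates on logits and batched sequences; here both models are stand-in greedy functions, and `draft_next` deliberately disagrees with the target on odd tokens so some drafts get rejected:

```python
def target_next(tok):
    # Stand-in for the full model's greedy next token.
    return (tok * 3 + 1) % 50

def draft_next(tok):
    # Stand-in MTP head: correct on even tokens, wrong on odd ones.
    nxt = target_next(tok)
    return nxt if tok % 2 == 0 else (nxt + 7) % 50

def speculative_decode(t0, n_tokens):
    out = [t0]
    accepted = proposed = 0
    while len(out) - 1 < n_tokens:
        last = out[-1]
        d = draft_next(last)       # MTP drafts one token (k=1)
        v1 = target_next(last)     # main model verifies the draft position...
        v2 = target_next(d)        # ...and the position after it, in parallel
        proposed += 1
        if d == v1:                # draft accepted: two tokens this step
            out += [d, v2]
            accepted += 1
        else:                      # draft rejected: keep the verified token only
            out.append(v1)
    return out[: n_tokens + 1], accepted / proposed

def greedy_decode(t0, n_tokens):
    out = [t0]
    for _ in range(n_tokens):
        out.append(target_next(out[-1]))
    return out
```

Note that the speculative output is token-for-token identical to plain greedy decoding from the target model; speculation only changes how many target forward passes are needed, not the result.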

Usage with vLLM

Requires modularml/vllm with MTP support for Kimi K2.5 (branch kimi-k25-mtp-780ba37).

vllm serve modularai/Kimi-K2.5-MTP \
    --tensor-parallel-size 8 \
    --trust-remote-code \
    --speculative-config '{"model":"modularai/Kimi-K2.5-MTP","method":"mtp","num_speculative_tokens":1,"use_local_argmax_reduction":true}'
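
The `--speculative-config` value is a JSON object, so it can also be built programmatically instead of hand-escaping it in the shell. A small illustrative snippet (the keys mirror the CLI example above; nothing here is a vLLM API):

```python
import json

# Build the --speculative-config payload with the same keys as the
# CLI example; compact separators match the inline-JSON form.
spec_config = {
    "model": "modularai/Kimi-K2.5-MTP",
    "method": "mtp",
    "num_speculative_tokens": 1,
    "use_local_argmax_reduction": True,
}
spec_json = json.dumps(spec_config, separators=(",", ":"))
```

`spec_json` can then be interpolated into the serve command or a launcher script.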

Benchmark Results (8x NVIDIA B200)

Config           Output tok/s   TPOT p50 (ms)   Acceptance Rate
No speculation   947            14.07           N/A
MTP k=1          869            15.85           ~39%

Note: The MTP acceptance rate is low (~39%) because the MTP weights were not trained directly on this base model checkpoint. With properly matched MTP weights (trained via self-distillation on this exact checkpoint), acceptance rates of 80-90% are expected, yielding ~1.5-1.8x throughput improvement.
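
The expected gain can be sanity-checked with the standard speculative-decoding estimate: assuming each drafted token is accepted independently with probability alpha, the expected number of tokens emitted per target forward pass with k draft tokens is sum of alpha^i for i = 0..k. This ignores drafting overhead, which is why measured speedups land below the ideal:

```python
def expected_tokens_per_step(alpha, k):
    # E[tokens per target pass] = 1 + alpha + alpha^2 + ... + alpha^k,
    # assuming i.i.d. per-token acceptance with probability alpha.
    return sum(alpha ** i for i in range(k + 1))
```

For k=1 this is simply 1 + alpha: about 1.39 tokens/pass at the measured ~39% acceptance (not enough to offset draft overhead in the table above), versus 1.8-1.9 at 80-90% acceptance, consistent with the ~1.5-1.8x throughput expectation once overhead is subtracted.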

Architecture

  • Model type: kimi_k25 (VLM wrapper around DeepSeek V3 architecture)
  • Parameters: 1.04T total / 32B active (MoE with 384 routed experts)
  • MTP module: 1 full decoder layer at index 61 (~14B parameters)
    • enorm + hnorm (RMSNorm) → concat → eh_proj (Linear 2×7168 → 7168) → decoder layer (MLA attention + MoE) → shared_head (RMSNorm + LM head)
  • Quantization: INT4 compressed-tensors (group_size=32) for MoE experts, BF16 for dense layers
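
The MTP dataflow described above can be sketched at toy dimensions (hidden size 4 instead of 7168, vocabulary of 6, and the decoder layer stubbed out). The function names mirror the weight names listed; all shapes and values are illustrative, not the real module:

```python
def rmsnorm(x, eps=1e-6):
    # Root-mean-square normalization over a plain Python vector.
    scale = (sum(v * v for v in x) / len(x) + eps) ** 0.5
    return [v / scale for v in x]

def linear(w, x):
    # w is an out_dim x in_dim matrix; returns w @ x.
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def mtp_forward(hidden, embed, eh_proj_w, head_w, decoder_layer):
    h = rmsnorm(hidden)                # hnorm on the last hidden state
    e = rmsnorm(embed)                 # enorm on the next-token embedding
    x = linear(eh_proj_w, e + h)       # concat (2*H) -> eh_proj -> H
    x = decoder_layer(x)               # full decoder layer (MLA + MoE), stubbed here
    return linear(head_w, rmsnorm(x))  # shared_head: RMSNorm + LM head -> logits
```

For example, with H = 4 and an identity stub for the decoder layer, `mtp_forward` maps a hidden state plus token embedding to a vocabulary-sized logit vector.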

License

This model uses weights from moonshotai/Kimi-K2.5 under the Modified MIT License and MTP weights from k-l-lambda/Kimi-K2.5-MTP under MIT.

Credits
