---
license: other
license_name: modified-mit
license_link: https://huggingface.co/moonshotai/Kimi-K2.5/blob/main/LICENSE
base_model: moonshotai/Kimi-K2.5
tags:
  - mtp
  - speculative-decoding
  - kimi
  - deepseek
  - moe
---

# Kimi-K2.5-MTP

Kimi K2.5 with Multi-Token Prediction (MTP) speculative decoding support for vLLM.

## What is this?

This model combines:

- **Base model**: moonshotai/Kimi-K2.5 (1.04T total / 32B active MoE, INT4 compressed-tensors)
- **MTP layer**: layer-61 MTP weights from k-l-lambda/Kimi-K2.5-MTP (14.7 GB; INT4 compressed-tensors for the experts, BF16 for projections/norms/head)

The MTP module enables speculative decoding: a lightweight prediction head drafts tokens that the main model verifies in parallel, improving throughput.
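The draft-and-verify loop can be sketched in a few lines. This is a toy illustration of the k=1 case, not the vLLM implementation; the "models" are stand-in functions that map a context to a next token:

```python
# Toy sketch of draft-and-verify speculative decoding with k=1.
# A real engine scores the drafted continuation in one batched
# forward pass; here the calls are sequential for clarity.

def speculative_step(context, draft_model, target_model):
    """One decode step: the draft head proposes a token; the main
    model verifies it. An accepted draft yields a bonus token, so
    up to two tokens come out of a single verification step."""
    drafted = draft_model(context)
    verified = target_model(context)
    if drafted == verified:
        # Accepted: the target's pass over context + [drafted]
        # produces the next token "for free".
        bonus = target_model(context + [drafted])
        return [drafted, bonus]
    # Rejected: fall back to the target model's own token.
    return [verified]

# Stand-in models: the target emits last token + 1; the draft
# agrees only when the last token is even.
target = lambda ctx: ctx[-1] + 1
draft = lambda ctx: ctx[-1] + 1 if ctx[-1] % 2 == 0 else 0

print(speculative_step([0], draft, target))  # accepted -> [1, 2]
print(speculative_step([1], draft, target))  # rejected -> [2]
```

The acceptance rate of the draft head directly controls how many tokens each verification step emits, which is why it dominates the benchmark results below.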

## Usage with vLLM

Requires modularml/vllm with MTP support for Kimi K2.5 (branch `kimi-k25-mtp-780ba37`).

```bash
vllm serve modularai/Kimi-K2.5-MTP \
    --tensor-parallel-size 8 \
    --trust-remote-code \
    --speculative-config '{"model":"modularai/Kimi-K2.5-MTP","method":"mtp","num_speculative_tokens":1,"use_local_argmax_reduction":true}'
```
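Once running, the server exposes vLLM's standard OpenAI-compatible REST API. A minimal request sketch (assuming vLLM's default port 8000 and chat-completions route; the payload is only constructed here, not sent):

```python
# Builds a chat-completion request for the server started above.
# Port 8000 and the /v1/chat/completions path are vLLM defaults.
import json

def make_chat_request(prompt, max_tokens=128):
    body = {
        "model": "modularai/Kimi-K2.5-MTP",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return ("http://localhost:8000/v1/chat/completions",
            {"Content-Type": "application/json"},
            json.dumps(body))

url, headers, body = make_chat_request("Summarize speculative decoding.")
print(url)
```

Any OpenAI-compatible client (e.g. the `openai` Python package pointed at this base URL) works the same way; speculative decoding is transparent to the caller.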

## Benchmark Results (8x NVIDIA B200)

| Config | Output tok/s | TPOT p50 (ms) | Acceptance rate |
|---|---|---|---|
| No speculation | 947 | 14.07 | n/a |
| MTP k=1 | 869 | 15.85 | ~39% |

**Note:** The MTP acceptance rate is low (~39%) because the MTP weights were not trained directly on this base model checkpoint. With properly matched MTP weights (trained via self-distillation on this exact checkpoint), acceptance rates of 80-90% are expected, yielding ~1.5-1.8x throughput improvement.
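A back-of-envelope model makes the numbers plausible. With k=1, each verification step emits one token plus one bonus token when the draft is accepted, so the expected tokens per step is 1 + alpha (the acceptance rate); per-step overhead from running the draft head and verification is folded into a single cost factor here:

```python
# Simplified speedup model for k=1 speculation: expected tokens per
# verification step = 1 + alpha, divided by the relative per-step cost
# (1.0 = plain decoding; > 1.0 models draft + verify overhead).

def expected_speedup(alpha, step_cost=1.0):
    """Idealized throughput gain over plain decoding."""
    return (1 + alpha) / step_cost

# At the measured ~39% acceptance the zero-overhead ceiling is only
# ~1.39x, which real per-step overhead can push below 1x -- consistent
# with the benchmark above (869 vs 947 tok/s).
print(round(expected_speedup(0.39), 2))  # 1.39
# At ~85% acceptance the ceiling is ~1.85x, in line with the expected
# ~1.5-1.8x once overhead is accounted for.
print(round(expected_speedup(0.85), 2))  # 1.85
```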

## Architecture

- **Model type**: `kimi_k25` (a VLM wrapper around the DeepSeek V3 architecture)
- **Parameters**: 1.04T total / 32B active (MoE with 384 routed experts)
- **MTP module**: one full decoder layer at index 61 (~14B parameters)
  - `enorm` + `hnorm` (RMSNorm) → concat → `eh_proj` (Linear, 2×7168 → 7168) → decoder layer (MLA attention + MoE) → `shared_head` (RMSNorm + LM head)
- **Quantization**: INT4 compressed-tensors (`group_size=32`) for MoE experts, BF16 for dense layers
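The MTP module's data flow can be traced at the shape level. The sketch below uses random NumPy stand-ins for the real (quantized) weights, and shrinks the hidden size from the model's true 7168 so the demo stays tiny; only the wiring matches the pipeline listed above:

```python
# Shape-level sketch of the MTP forward pass: enorm/hnorm -> concat
# -> eh_proj -> decoder layer -> shared_head. All weights are random
# stand-ins; D is reduced from the real hidden size of 7168.
import numpy as np

D, VOCAB = 64, 32  # demo sizes; the real model uses D = 7168

def rms_norm(x, eps=1e-6):
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

rng = np.random.default_rng(0)
eh_proj = rng.standard_normal((2 * D, D)) * 0.02  # Linear 2D -> D
lm_head = rng.standard_normal((D, VOCAB)) * 0.02  # shared_head projection
decoder_layer = lambda h: h  # stand-in for the MLA-attention + MoE layer

def mtp_forward(token_embedding, prev_hidden):
    h = np.concatenate([rms_norm(token_embedding),   # enorm
                        rms_norm(prev_hidden)], -1)  # hnorm, then concat
    h = h @ eh_proj                                   # eh_proj
    h = decoder_layer(h)                              # decoder layer
    return rms_norm(h) @ lm_head                      # shared_head -> logits

logits = mtp_forward(rng.standard_normal((1, D)), rng.standard_normal((1, D)))
print(logits.shape)  # (1, 32)
```

The key structural point is the 2×hidden → hidden `eh_proj`: the MTP head conditions on both the next token's embedding and the main model's last hidden state before running its single decoder layer.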

## License

This model uses weights from moonshotai/Kimi-K2.5 under the Modified MIT License and MTP weights from k-l-lambda/Kimi-K2.5-MTP under MIT.

## Credits