---
license: other
license_name: modified-mit
license_link: https://huggingface.co/moonshotai/Kimi-K2.5/blob/main/LICENSE
base_model: moonshotai/Kimi-K2.5
tags:
  - mtp
  - speculative-decoding
  - kimi
  - deepseek
  - moe
---

# Kimi-K2.5-MTP

Kimi K2.5 with Multi-Token Prediction (MTP) speculative decoding support for vLLM.

## What is this?

This model combines:
- **Base model**: [moonshotai/Kimi-K2.5](https://huggingface.co/moonshotai/Kimi-K2.5) (1.04T/32B active MoE, INT4 compressed-tensors)
- **MTP layer**: Layer 61 MTP weights from [k-l-lambda/Kimi-K2.5-MTP](https://huggingface.co/k-l-lambda/Kimi-K2.5-MTP) (14.7 GB, INT4 compressed-tensors for experts, BF16 for projections/norms/head)

The MTP module enables speculative decoding: a lightweight prediction head drafts tokens that the main model then verifies in parallel, improving throughput when drafts are accepted.
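The draft-then-verify loop can be sketched as follows. This is a toy illustration with hypothetical `draft_next`/`main_next` callables, not the vLLM implementation; in practice verification scores all drafted positions in a single batched forward pass.

```python
def speculative_step(context, draft_next, main_next, k=1):
    """One draft/verify round; returns the tokens appended to context."""
    # 1. Draft: the lightweight head proposes k tokens autoregressively.
    drafted = []
    ctx = list(context)
    for _ in range(k):
        t = draft_next(ctx)
        drafted.append(t)
        ctx.append(t)

    # 2. Verify: the main model checks each drafted position (done in
    #    parallel in a real engine; sequential here for clarity).
    accepted = []
    ctx = list(context)
    for t in drafted:
        verified = main_next(ctx)
        if verified != t:
            accepted.append(verified)  # replace first mismatch, then stop
            break
        accepted.append(t)
        ctx.append(t)
    else:
        # All drafts accepted: the verification pass yields one bonus token.
        accepted.append(main_next(ctx))
    return accepted
```

Every step emits at least one token (the main model's own prediction), so correctness never depends on the draft head; a good draft head only adds extra accepted tokens per step.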

## Usage with vLLM

Requires [modularml/vllm](https://github.com/modularml/vllm) with MTP support for Kimi K2.5 (branch `kimi-k25-mtp-780ba37`).

```bash
vllm serve modularai/Kimi-K2.5-MTP \
    --tensor-parallel-size 8 \
    --trust-remote-code \
    --speculative-config '{"model":"modularai/Kimi-K2.5-MTP","method":"mtp","num_speculative_tokens":1,"use_local_argmax_reduction":true}'
```

## Benchmark Results (8x NVIDIA B200)

| Config | Output tok/s | TPOT p50 (ms) | Acceptance Rate |
|--------|-------------|---------------|-----------------|
| No speculation | 947 | 14.07 | — |
| MTP k=1 | 869 | 15.85 | ~39% |

> **Note**: The MTP acceptance rate is low (~39%) because the MTP weights were not trained directly on this base model checkpoint. With properly matched MTP weights (trained via self-distillation on this exact checkpoint), acceptance rates of 80-90% are expected, yielding ~1.5-1.8x throughput improvement.
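The expected gain from a higher acceptance rate can be estimated with a simple back-of-envelope model. This sketch assumes the MTP forward pass is free and each verify costs one main-model step; real scheduling and kernel overheads reduce these numbers, which is why the note above quotes ~1.5-1.8x rather than the idealized figure.

```python
def expected_tokens_per_step(acceptance_rate, k=1):
    """Idealized tokens emitted per main-model step with k drafted tokens.

    A step always emits 1 token (the verified/corrected one) plus the
    expected length of the accepted draft prefix: sum of p^i for i=1..k.
    """
    return 1 + sum(acceptance_rate ** i for i in range(1, k + 1))

# Observed acceptance (~39%) vs. a well-matched MTP head (~85%), k=1:
low = expected_tokens_per_step(0.39)   # ≈ 1.39 tokens/step
high = expected_tokens_per_step(0.85)  # ≈ 1.85 tokens/step
```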

## Architecture

- **Model type**: `kimi_k25` (VLM wrapper around DeepSeek V3 architecture)
- **Parameters**: 1.04T total / 32B active (MoE with 384 routed experts)
- **MTP module**: 1 full decoder layer at index 61 (~14B parameters)
  - `enorm` + `hnorm` (RMSNorm) → concat → `eh_proj` (Linear 2×7168 → 7168) → decoder layer (MLA attention + MoE) → `shared_head` (RMSNorm + LM head)
- **Quantization**: INT4 compressed-tensors (group_size=32) for MoE experts, BF16 for dense layers
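The MTP data flow above can be traced at the shape level with a small NumPy sketch (random weights, illustration only; the hidden size 7168 is from the card, the vocab size is shrunk for the demo, and the full decoder layer is elided):

```python
import numpy as np

H, V = 7168, 1024           # hidden size; vocab size reduced for the demo
rng = np.random.default_rng(0)

def rmsnorm(x, eps=1e-6):
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

tok_emb = rng.standard_normal(H)   # embedding of the drafted token
hidden  = rng.standard_normal(H)   # last hidden state from the main model

# enorm / hnorm -> concat -> eh_proj (2*7168 -> 7168)
x = np.concatenate([rmsnorm(tok_emb), rmsnorm(hidden)])   # (14336,)
eh_proj = rng.standard_normal((H, 2 * H)) * 0.01
h = eh_proj @ x                                           # (7168,)

# ... one full decoder layer (MLA attention + MoE) would run here ...

# shared_head: RMSNorm + LM head -> next-token logits
lm_head = rng.standard_normal((V, H)) * 0.01
logits = lm_head @ rmsnorm(h)                             # (V,)
```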

## License

This model uses weights from [moonshotai/Kimi-K2.5](https://huggingface.co/moonshotai/Kimi-K2.5) under the [Modified MIT License](https://huggingface.co/moonshotai/Kimi-K2.5/blob/main/LICENSE) and MTP weights from [k-l-lambda/Kimi-K2.5-MTP](https://huggingface.co/k-l-lambda/Kimi-K2.5-MTP) under MIT.

## Credits

- Base model: [Moonshot AI](https://huggingface.co/moonshotai)
- MTP weights: [k-l-lambda](https://huggingface.co/k-l-lambda)