---
license: other
license_name: modified-mit
license_link: LICENSE
base_model:
- moonshotai/Kimi-K2.7-Code
---
# Model Overview

- **Model Architecture:** Kimi-K2.7-Code
  - **Input:** Text, Image, Video
  - **Output:** Text
- **Supported Hardware Microarchitecture:** AMD MI350/MI355
- **ROCm:** 7.2.3
- **PyTorch:** 2.10.0
- **Transformers:** 5.12.1
- **Operating System(s):** Linux
- **Inference Engine:** [vLLM](https://docs.vllm.ai/en/latest/)
- **Model Optimizer:** [AMD-Quark](https://quark.docs.amd.com/latest/index.html) (V0.12)
  - **Weight quantization:** OCP MXFP4, static; `self_attn` layers use per-channel FP8 (E4M3), static
  - **Activation quantization:** OCP MXFP4, dynamic; `self_attn` layers use per-token FP8 (E4M3), dynamic
  - **Excluded from quantization:** MoE gates and routers, `lm_head`, vision tower, multimodal projector, and MTP layers (`mtp.*`)

This model was produced by quantizing Kimi-K2.7-Code to MXFP4 with [AMD-Quark](https://quark.docs.amd.com/latest/index.html).

# Model Quantization

The model was quantized from [moonshotai/Kimi-K2.7-Code](https://huggingface.co/moonshotai/Kimi-K2.7-Code) using [AMD-Quark](https://quark.docs.amd.com/latest/index.html). The MoE and linear-layer weights and activations are quantized to OCP MXFP4, while the attention projections use FP8 (E4M3). The vision tower and multimodal projector are kept in BF16.
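
For readers unfamiliar with the format, the following is a minimal sketch of how OCP MXFP4 represents one block of values: 32 elements share a single power-of-two scale, and each element is stored as a 4-bit E2M1 float. This is an illustration only, not AMD-Quark's implementation. The function name `mxfp4_fake_quantize_block` is hypothetical, tie-breaking is simplified to "pick the lower grid value", and the clamping of the shared exponent to the E8M0 range is omitted.

```python
import math

BLOCK_SIZE = 32                      # MX block size: elements per shared scale
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)  # FP4 E2M1 magnitudes
E2M1_EMAX = 2                        # exponent of the largest E2M1 normal (6 = 1.5 * 2**2)


def mxfp4_fake_quantize_block(block):
    """Quantize one block and return (shared_exponent, dequantized_values).

    Simplified sketch: nearest-value rounding with ties going to the lower
    grid value, and no clamping of the shared exponent.
    """
    if len(block) != BLOCK_SIZE:
        raise ValueError(f"expected {BLOCK_SIZE} values, got {len(block)}")

    max_abs = max(abs(x) for x in block)
    if max_abs == 0.0:
        return 0, [0.0] * len(block)

    # Shared power-of-two scale chosen from the largest magnitude in the block.
    shared_exp = math.floor(math.log2(max_abs)) - E2M1_EMAX
    scale = 2.0 ** shared_exp

    dequantized = []
    for x in block:
        magnitude = abs(x) / scale
        # Snap to the nearest representable magnitude; values above 6 saturate.
        nearest = min(E2M1_GRID, key=lambda v: abs(v - magnitude))
        dequantized.append(math.copysign(nearest, x) * scale)
    return shared_exp, dequantized
```

Because the scale is shared, small values in a block that also contains a large value lose precision: with a block maximum of 1.0 the scale is 0.25, so an input of 0.3 is stored as 0.25.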

**Quantization script:**

```bash
cd Quark/examples/torch/language_modeling/llm_ptq/

python3 quantize_quark.py \
    --model_dir moonshotai/Kimi-K2.7-Code \
    --output_dir Kimi-K2.7-Code-MXFP4 \
    --file2file_quantization \
    --trust_remote_code \
    --quant_scheme mxfp4 \
    --layer_quant_scheme '*self_attn*' ptpc_fp8 \
    --exclude_layers "*lm_head*" "*mlp.gate" "*mm_projector*" \
        "*vision_tower*" "mtp.*" "*shared_expert_gate*" "*router*" \
    --model_export hf_format
```
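
The `--exclude_layers` arguments in the script above are glob patterns. As a rough way to preview which layers a pattern set would skip, the sketch below applies Python's `fnmatch` to a few layer names. It assumes shell-style glob matching against the full layer name, which may differ from how AMD-Quark matches internally, and the layer names used here are hypothetical rather than taken from the checkpoint.

```python
from fnmatch import fnmatchcase

# Patterns copied from the quantization script above.
EXCLUDE_PATTERNS = [
    "*lm_head*", "*mlp.gate", "*mm_projector*",
    "*vision_tower*", "mtp.*", "*shared_expert_gate*", "*router*",
]


def is_excluded(layer_name, patterns=EXCLUDE_PATTERNS):
    """Return True if any glob pattern matches the whole layer name."""
    return any(fnmatchcase(layer_name, pattern) for pattern in patterns)


# Hypothetical layer names, for illustration only.
for name in [
    "language_model.model.layers.3.mlp.gate",
    "language_model.model.layers.3.mlp.experts.0.gate_proj",
    "language_model.lm_head",
    "vision_tower.blocks.0.attn.qkv",
]:
    print(f"{name}: {'excluded' if is_excluded(name) else 'quantized'}")
```

Note that `*mlp.gate` has no trailing wildcard, so under this matching rule it covers a module named `...mlp.gate` but not `...mlp.experts.0.gate_proj`.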

# Deployment
### Use with vLLM

This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend.

Note: This model has 64 KV heads, which the AITER MLA kernel does not support
(it handles only 16 or 128). Disable AITER MLA when serving on ROCm:

```bash
export VLLM_ROCM_USE_AITER=1
export VLLM_ROCM_USE_AITER_MLA=0
export VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS=0
export VLLM_ROCM_USE_AITER_FP4BMM=0

python3 -m vllm.entrypoints.openai.api_server \
    --model amd/Kimi-K2.7-Code-MXFP4 \
    --trust-remote-code \
    --tensor-parallel-size 4 \
    --gpu-memory-utilization 0.9 \
    --max-model-len 8192
```
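
Once the server is running, it can be queried through the OpenAI-compatible completions endpoint. The following is a minimal client sketch using only the Python standard library. It assumes the server listens on `http://localhost:8000` (vLLM's default port) and that the served model name matches the `--model` value above; the helper names `build_completion_request` and `complete` are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"          # assumption: default vLLM host/port
MODEL_NAME = "amd/Kimi-K2.7-Code-MXFP4"     # must match the served model name


def build_completion_request(prompt, max_tokens=64, temperature=0.0):
    """Build a POST request for the OpenAI-compatible /v1/completions route."""
    payload = {
        "model": MODEL_NAME,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,   # 0.0 requests greedy decoding
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )


def complete(prompt, **kwargs):
    """Send the request and return the text of the first completion choice."""
    request = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

With the server up, `complete("def fibonacci(n):", max_tokens=128)` returns the generated continuation as a string.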

# Evaluation
The model was evaluated on the GSM8K benchmark.

### Accuracy

<table>
  <tr>
   <td><strong>Benchmark</strong>
   </td>
   <td><strong>Kimi-K2.7-Code</strong>
   </td>
   <td><strong>Kimi-K2.7-Code-MXFP4 (this model)</strong>
   </td>
   <td><strong>Recovery</strong>
   </td>
  </tr>
  <tr>
   <td>GSM8K (strict-match)
   </td>
   <td>95.07
   </td>
   <td>94.80
   </td>
   <td>99.7%
   </td>
  </tr>
  <tr>
   <td>GSM8K (flexible-extract)
   </td>
   <td>95.15
   </td>
   <td>94.77
   </td>
   <td>99.6%
   </td>
  </tr>
</table>

GSM8K was run 5-shot with greedy decoding. The MXFP4 scores are the mean over
repeated stable runs (range: strict-match 94.39–95.60, flexible-extract 94.39–95.53).
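
The Recovery column is consistent with reading it as the quantized score expressed as a percentage of the baseline score. A short sketch of that arithmetic, using the numbers from the table (the function name `recovery` is only illustrative):

```python
def recovery(baseline, quantized):
    """Quantized score as a percentage of the baseline score."""
    return quantized / baseline * 100.0


# (baseline, quantized) pairs taken from the accuracy table.
scores = {
    "GSM8K (strict-match)": (95.07, 94.80),
    "GSM8K (flexible-extract)": (95.15, 94.77),
}

for name, (baseline, quantized) in scores.items():
    print(f"{name}: {recovery(baseline, quantized):.1f}%")
# GSM8K (strict-match): 99.7%
# GSM8K (flexible-extract): 99.6%
```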

### Reproduction

The GSM8K results were obtained using the `lm-evaluation-harness` framework
with the vLLM backend (`rocm/vllm-dev` nightly, vLLM `0.23.1rc1`). The model
is served first, then evaluated via the OpenAI-compatible completions API.

Important: serve with automatic prefix caching disabled
(`--no-enable-prefix-caching`) for deterministic evaluation results.

```bash
# 1) Serve
export VLLM_ROCM_USE_AITER=1 VLLM_ROCM_USE_AITER_MLA=0 \
       VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS=0 VLLM_ROCM_USE_AITER_FP4BMM=0
python3 -m vllm.entrypoints.openai.api_server \
    --model amd/Kimi-K2.7-Code-MXFP4 \
    --trust-remote-code --tensor-parallel-size 4 \
    --gpu-memory-utilization 0.9 --max-model-len 8192 \
    --seed 42 --no-enable-prefix-caching

# 2) Evaluate
lm_eval --model local-completions \
    --model_args "model=amd/Kimi-K2.7-Code-MXFP4,base_url=http://0.0.0.0:8000/v1/completions,num_concurrent=128,tokenized_requests=False,max_length=8192,add_bos_token=True,seed=42,trust_remote_code=True" \
    --tasks gsm8k --num_fewshot 5 --batch_size 1 --seed 42
```

# License
Modifications Copyright(c) 2026 Advanced Micro Devices, Inc. All rights reserved.