---
license: other
license_name: modified-mit
license_link: LICENSE
base_model:
- moonshotai/Kimi-K2.7-Code
---
# Model Overview
- **Model Architecture:** Kimi-K2.7-Code
- **Input:** Text, Image, Video
- **Output:** Text
- **Supported Hardware Microarchitecture:** AMD MI350/MI355
- **ROCm:** 7.2.3
- **PyTorch:** 2.10.0
- **Transformers:** 5.12.1
- **Operating System(s):** Linux
- **Inference Engine:** [vLLM](https://docs.vllm.ai/en/latest/)
- **Model Optimizer:** [AMD-Quark](https://quark.docs.amd.com/latest/index.html) (V0.12)
- **Weight quantization:** OCP MXFP4, static; `self_attn` per-channel FP8 (E4M3), static
- **Activation quantization:** OCP MXFP4, dynamic; `self_attn` per-token FP8 (E4M3), dynamic
- **Excluded from quantization:** MoE gates, `lm_head`, vision tower and multimodal projector

This model was built from Kimi-K2.7-Code by applying [AMD-Quark](https://quark.docs.amd.com/latest/index.html) for MXFP4 quantization.
# Model Quantization
The model was quantized from [moonshotai/Kimi-K2.7-Code](https://huggingface.co/moonshotai/Kimi-K2.7-Code) using [AMD-Quark](https://quark.docs.amd.com/latest/index.html). The MoE/Linear weights and activations are quantized to OCP MXFP4, while the attention projections use FP8 (E4M3). The vision tower and multimodal projector are kept at BF16.
**Quantization script:**
```bash
cd Quark/examples/torch/language_modeling/llm_ptq/
python3 quantize_quark.py \
--model_dir moonshotai/Kimi-K2.7-Code \
--output_dir Kimi-K2.7-Code-MXFP4 \
--file2file_quantization \
--trust_remote_code \
--quant_scheme mxfp4 \
--layer_quant_scheme '*self_attn*' ptpc_fp8 \
--exclude_layers "*lm_head*" "*mlp.gate" "*mm_projector*" \
"*vision_tower*" "mtp.*" "*shared_expert_gate*" "*router*" \
--model_export hf_format
```
# Deployment
### Use with vLLM
This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend.
Note: this model has 64 KV heads, which the AITER MLA kernel does not support
(it supports only 16 or 128). Disable AITER MLA when serving on ROCm:
```bash
export VLLM_ROCM_USE_AITER=1
export VLLM_ROCM_USE_AITER_MLA=0
export VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS=0
export VLLM_ROCM_USE_AITER_FP4BMM=0
python3 -m vllm.entrypoints.openai.api_server \
--model amd/Kimi-K2.7-Code-MXFP4 \
--trust-remote-code \
--tensor-parallel-size 4 \
--gpu-memory-utilization 0.9 \
--max-model-len 8192
```
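Once the server has finished loading, it can be queried through the OpenAI-compatible completions endpoint. The snippet below is a minimal sketch, assuming the server is reachable on localhost at vLLM's default port 8000. The `build_payload` and `query_model` helpers, the prompt, and the sampling values are illustrative and not part of the original setup.

```bash
# Hypothetical helper: build a JSON body for the /v1/completions endpoint.
# Assumes the prompt contains no characters that need JSON escaping.
build_payload() {
  local prompt="$1" max_tokens="${2:-64}"
  printf '{"model": "amd/Kimi-K2.7-Code-MXFP4", "prompt": "%s", "max_tokens": %d, "temperature": 0}' \
    "$prompt" "$max_tokens"
}

# Assumed address: localhost and vLLM's default port 8000.
BASE_URL="${BASE_URL:-http://localhost:8000}"

# Send one completion request; call this only after the server is ready.
query_model() {
  curl -sS "${BASE_URL}/v1/completions" \
    -H "Content-Type: application/json" \
    -d "$(build_payload "$1" "${2:-64}")"
}

# Usage: query_model "def fibonacci(n):" 128
```

Keeping the payload builder separate from the `curl` call makes the request body easy to inspect before anything is sent.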
# Evaluation
The model was evaluated on the GSM8K benchmark.
### Accuracy
<table>
<tr>
<td><strong>Benchmark</strong>
</td>
<td><strong>Kimi-K2.7-Code</strong>
</td>
<td><strong>Kimi-K2.7-Code-MXFP4 (this model)</strong>
</td>
<td><strong>Recovery</strong>
</td>
</tr>
<tr>
<td>GSM8K (strict-match)
</td>
<td>95.07
</td>
<td>94.80
</td>
<td>99.7%
</td>
</tr>
<tr>
<td>GSM8K (flexible-extract)
</td>
<td>95.15
</td>
<td>94.77
</td>
<td>99.6%
</td>
</tr>
</table>
GSM8K was evaluated 5-shot with greedy decoding. The MXFP4 numbers are the mean
of repeated stable runs (range: strict 94.39–95.60, flexible 94.39–95.53).
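The Recovery column appears to be the quantized score expressed as a percentage of the baseline score. The sketch below reproduces that arithmetic under this assumed definition (the card does not state the formula). The `recovery` function name is illustrative.

```bash
# Hypothetical helper: recovery = quantized / baseline * 100, to one decimal.
recovery() {
  awk -v q="$1" -v b="$2" 'BEGIN { printf "%.1f\n", q / b * 100 }'
}

recovery 94.80 95.07   # strict-match: prints 99.7
recovery 94.77 95.15   # flexible-extract: prints 99.6
```

Both values match the table, which is consistent with the assumed definition.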
### Reproduction
The GSM8K results were obtained using the `lm-evaluation-harness` framework
with the vLLM backend (`rocm/vllm-dev` nightly, vLLM `0.23.1rc1`). The model
is served first, then evaluated via the OpenAI-compatible completions API.
Important: serve with automatic prefix caching disabled
(`--no-enable-prefix-caching`) for deterministic evaluation results.
```bash
# 1) Serve (blocks; leave it running and use a second shell for step 2)
export VLLM_ROCM_USE_AITER=1 VLLM_ROCM_USE_AITER_MLA=0 \
VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS=0 VLLM_ROCM_USE_AITER_FP4BMM=0
python3 -m vllm.entrypoints.openai.api_server \
--model amd/Kimi-K2.7-Code-MXFP4 \
--trust-remote-code --tensor-parallel-size 4 \
--gpu-memory-utilization 0.9 --max-model-len 8192 \
--seed 42 --no-enable-prefix-caching
# 2) Evaluate (once the server is ready to accept requests)
lm_eval --model local-completions \
--model_args "model=amd/Kimi-K2.7-Code-MXFP4,base_url=http://0.0.0.0:8000/v1/completions,num_concurrent=128,tokenized_requests=False,max_length=8192,add_bos_token=True,seed=42,trust_remote_code=True" \
--tasks gsm8k --num_fewshot 5 --batch_size 1 --seed 42
```
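Because the server takes a while to load the weights, the evaluation step can be gated on a readiness check. The following is a minimal sketch, assuming the server exposes vLLM's `/health` endpoint at the same address used in `base_url`. The `wait_for_server` helper and its default timeout and polling interval are illustrative choices, not part of the original procedure.

```bash
# Hypothetical helper: poll the server's /health endpoint until it answers
# successfully or the timeout (in seconds) expires.
wait_for_server() {
  local base_url="$1" timeout="${2:-1800}" interval="${3:-10}" waited=0
  until curl -sf -o /dev/null "${base_url}/health"; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "server not ready after ${timeout}s" >&2
      return 1
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
}

# Usage: wait_for_server http://0.0.0.0:8000 && echo "ready, start step 2"
```

The function returns 0 as soon as the health check succeeds and 1 on timeout, so it can be chained in front of the `lm_eval` command with `&&`.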
# License
Modifications Copyright(c) 2026 Advanced Micro Devices, Inc. All rights reserved.