---
license: other
license_name: modified-mit
license_link: LICENSE
base_model:
- moonshotai/Kimi-K2.7-Code
---
# Model Overview

- **Model Architecture:** Kimi-K2.7-Code
  - **Input:** Text, Image, Video
  - **Output:** Text
- **Supported Hardware Microarchitecture:** AMD MI350/MI355
- **ROCm:** 7.2.3
- **PyTorch:** 2.10.0
- **Transformers:** 5.12.1
- **Operating System(s):** Linux
- **Inference Engine:** [vLLM](https://docs.vllm.ai/en/latest/)
- **Model Optimizer:** [AMD-Quark](https://quark.docs.amd.com/latest/index.html) (V0.12)
  - **Weight quantization:** OCP MXFP4, static; `self_attn`: per-channel FP8 (E4M3), static
  - **Activation quantization:** OCP MXFP4, dynamic; `self_attn`: per-token FP8 (E4M3), dynamic
  - **Excluded from quantization:** MoE gates, `lm_head`, vision tower and multimodal projector

This model was built by quantizing Kimi-K2.7-Code to MXFP4 with [AMD-Quark](https://quark.docs.amd.com/latest/index.html).

# Model Quantization

The model was quantized from [moonshotai/Kimi-K2.7-Code](https://huggingface.co/moonshotai/Kimi-K2.7-Code) using [AMD-Quark](https://quark.docs.amd.com/latest/index.html). The MoE/Linear weights and activations are quantized to OCP MXFP4, while the attention projections use FP8 (E4M3). The vision tower and multimodal projector are kept at BF16.

**Quantization script:**

```bash
cd Quark/examples/torch/language_modeling/llm_ptq/

python3 quantize_quark.py \
    --model_dir moonshotai/Kimi-K2.7-Code \
    --output_dir Kimi-K2.7-Code-MXFP4 \
    --file2file_quantization \
    --trust_remote_code \
    --quant_scheme mxfp4 \
    --layer_quant_scheme '*self_attn*' ptpc_fp8 \
    --exclude_layers "*lm_head*" "*mlp.gate" "*mm_projector*" \
        "*vision_tower*" "mtp.*" "*shared_expert_gate*" "*router*" \
    --model_export hf_format
```
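
To see which modules the `--exclude_layers` globs would skip, the matching can be approximated with shell patterns. This is a minimal sketch, assuming Quark compares each pattern against the full module name in fnmatch style. The module names are hypothetical and only illustrate the naming, and `is_excluded` is a helper invented for this example.

```bash
#!/usr/bin/env bash
# Sketch: approximate the --exclude_layers globs with shell pattern matching.
# Assumption: patterns are matched against full module names, fnmatch style.

is_excluded() {
    local name="$1"
    case "$name" in
        *lm_head*|*mlp.gate|*mm_projector*|*vision_tower*|mtp.*|*shared_expert_gate*|*router*)
            echo "excluded" ;;
        *)
            echo "quantized" ;;
    esac
}

# Hypothetical module names, for illustration only.
for name in \
    "language_model.model.layers.3.mlp.gate" \
    "language_model.model.layers.3.mlp.experts.0.gate_proj" \
    "language_model.lm_head" \
    "vision_tower.encoder.blocks.0.attn"; do
    printf '%-60s %s\n' "$name" "$(is_excluded "$name")"
done
```

Under that assumption, `*mlp.gate` leaves out the MoE router gate but still lets the experts' `gate_proj` weights be quantized, because the pattern has no trailing wildcard.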

# Deployment
## Use with vLLM

This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend.

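Note: this model has 64 KV heads, which the AITER MLA kernel does not support
(it handles only 16 or 128). When serving on ROCm, keep AITER enabled but
disable its MLA path. The command below also sets
`VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS` and `VLLM_ROCM_USE_AITER_FP4BMM` to 0: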

```bash
export VLLM_ROCM_USE_AITER=1
export VLLM_ROCM_USE_AITER_MLA=0
export VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS=0
export VLLM_ROCM_USE_AITER_FP4BMM=0

python3 -m vllm.entrypoints.openai.api_server \
    --model amd/Kimi-K2.7-Code-MXFP4 \
    --trust-remote-code \
    --tensor-parallel-size 4 \
    --gpu-memory-utilization 0.9 \
    --max-model-len 8192
```
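
Once the server is up, it can be queried through the OpenAI-compatible completions endpoint. The following is a minimal sketch, assuming the server listens on `localhost:8000` (the port used in the evaluation command in this card). `build_payload` and `query_model` are hypothetical helper names, and the prompt and `max_tokens` values are placeholders.

```bash
#!/usr/bin/env bash
# Sketch: query the served model via the OpenAI-compatible completions API.
# Assumption: the server is reachable at localhost:8000.

BASE_URL="${BASE_URL:-http://localhost:8000/v1}"

# Build a completions payload. The prompt is inserted verbatim, so it must not
# contain double quotes or backslashes (no JSON escaping is done here).
build_payload() {
    local prompt="$1" max_tokens="${2:-64}"
    printf '{"model": "amd/Kimi-K2.7-Code-MXFP4", "prompt": "%s", "max_tokens": %d, "temperature": 0}' \
        "$prompt" "$max_tokens"
}

# Send the payload and print the raw JSON response.
query_model() {
    curl -sS "${BASE_URL}/completions" \
        -H "Content-Type: application/json" \
        -d "$(build_payload "$@")"
}

# Usage (requires a running server):
#   query_model "def fibonacci(n):" 64
```

Setting `temperature` to 0 requests greedy decoding, which matches the decoding mode used for the evaluation in this card.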

## Evaluation
The model was evaluated on the GSM8K benchmark.

### Accuracy

<table>
  <tr>
   <td><strong>Benchmark</strong>
   </td>
   <td><strong>Kimi-K2.7-Code</strong>
   </td>
   <td><strong>Kimi-K2.7-Code-MXFP4 (this model)</strong>
   </td>
   <td><strong>Recovery</strong>
   </td>
  </tr>
  <tr>
   <td>GSM8K (strict-match)
   </td>
   <td>95.07
   </td>
   <td>94.80
   </td>
   <td>99.7%
   </td>
  </tr>
  <tr>
   <td>GSM8K (flexible-extract)
   </td>
   <td>95.15
   </td>
   <td>94.77
   </td>
   <td>99.6%
   </td>
  </tr>
</table>

GSM8K was run 5-shot with greedy decoding. The MXFP4 scores are the mean of
repeated stable runs (range: strict-match 94.39–95.60, flexible-extract 94.39–95.53).
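
The Recovery column is consistent with the quantized score divided by the baseline score, expressed as a percentage. The card does not state the formula, so treat this as an assumed definition; `recovery` is a hypothetical helper that reproduces the two table entries.

```bash
#!/usr/bin/env bash
# Sketch: recovery = 100 * quantized / baseline, rounded to one decimal place.
# Assumption: this is how the Recovery column was derived.

recovery() {
    awk -v base="$1" -v quant="$2" 'BEGIN { printf "%.1f\n", 100 * quant / base }'
}

recovery 95.07 94.80   # GSM8K strict-match
recovery 95.15 94.77   # GSM8K flexible-extract
```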

### Reproduction

The GSM8K results were obtained using the `lm-evaluation-harness` framework
with the vLLM backend (`rocm/vllm-dev` nightly, vLLM `0.23.1rc1`). The model
is served first, then evaluated via the OpenAI-compatible completions API.

Important: serve with automatic prefix caching disabled
(`--no-enable-prefix-caching`) for deterministic evaluation results.

```bash
# 1) Serve
export VLLM_ROCM_USE_AITER=1 VLLM_ROCM_USE_AITER_MLA=0 \
       VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS=0 VLLM_ROCM_USE_AITER_FP4BMM=0
python3 -m vllm.entrypoints.openai.api_server \
    --model amd/Kimi-K2.7-Code-MXFP4 \
    --trust-remote-code --tensor-parallel-size 4 \
    --gpu-memory-utilization 0.9 --max-model-len 8192 \
    --seed 42 --no-enable-prefix-caching

# 2) Evaluate
lm_eval --model local-completions \
    --model_args "model=amd/Kimi-K2.7-Code-MXFP4,base_url=http://0.0.0.0:8000/v1/completions,num_concurrent=128,tokenized_requests=False,max_length=8192,add_bos_token=True,seed=42,trust_remote_code=True" \
    --tasks gsm8k --num_fewshot 5 --batch_size 1 --seed 42
```
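
Because the server must be ready before the evaluation starts, step 2 can be gated on a readiness check when both steps run from one script (with the server started in the background or in another terminal). This is a minimal sketch, assuming the vLLM server exposes its `/health` endpoint on port 8000; `wait_for_server` is a hypothetical helper, and the timeout and polling interval are arbitrary defaults.

```bash
#!/usr/bin/env bash
# Sketch: poll a URL until it answers successfully or a timeout expires.
# Arguments: URL, timeout in seconds (default 1800), poll interval (default 5).

wait_for_server() {
    local url="$1" timeout_s="${2:-1800}" interval_s="${3:-5}" waited=0
    until curl -sf --max-time 5 -o /dev/null "$url"; do
        if [ "$waited" -ge "$timeout_s" ]; then
            echo "server not ready after ${timeout_s}s" >&2
            return 1
        fi
        sleep "$interval_s"
        waited=$((waited + interval_s))
    done
}

# Usage (requires the server from step 1):
#   wait_for_server http://localhost:8000/health && lm_eval ...
```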

# License
Modifications Copyright (c) 2026 Advanced Micro Devices, Inc. All rights reserved.