---
license: other
license_name: modified-mit
license_link: LICENSE
base_model:
- moonshotai/Kimi-K2.7-Code
---
# Model Overview
- **Model Architecture:** Kimi-K2.7-Code
- **Input:** Text, Image, Video
- **Output:** Text
- **Supported Hardware Microarchitecture:** AMD MI350/MI355
- **ROCm:** 7.2.3
- **PyTorch:** 2.10.0
- **Transformers:** 5.12.1
- **Operating System(s):** Linux
- **Inference Engine:** [vLLM](https://docs.vllm.ai/en/latest/)
- **Model Optimizer:** [AMD-Quark](https://quark.docs.amd.com/latest/index.html) (V0.12)
- **Weight quantization:** OCP MXFP4, static; `self_attn` per-channel FP8 (E4M3), static
- **Activation quantization:** OCP MXFP4, dynamic; `self_attn` per-token FP8 (E4M3), dynamic
- **Excluded from quantization:** MoE gates, `lm_head`, vision tower and multimodal projector

This model was built from Kimi-K2.7-Code by applying [AMD-Quark](https://quark.docs.amd.com/latest/index.html) MXFP4 quantization.
# Model Quantization
The model was quantized from [moonshotai/Kimi-K2.7-Code](https://huggingface.co/moonshotai/Kimi-K2.7-Code) using [AMD-Quark](https://quark.docs.amd.com/latest/index.html). The MoE/Linear weights and activations are quantized to OCP MXFP4, while the attention projections use FP8 (E4M3). The vision tower and multimodal projector are kept in BF16; the MoE gates and `lm_head` are also excluded from quantization.
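To make the MXFP4 format concrete, here is a minimal sketch of block-wise fake quantization following the OCP Microscaling layout: 32 elements share one power-of-two scale, and each element is rounded to an FP4 (E2M1) value. It is an illustration only, not AMD-Quark's implementation. The function name `quantize_block` is hypothetical, ties are broken toward the smaller magnitude for simplicity, and the shared exponent is not clamped to the E8M0 range.

```python
import math

# Magnitudes representable by an FP4 E2M1 element (the sign is a separate bit).
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
E2M1_EMAX = 2    # exponent of the largest E2M1 power of two (4.0 == 2**2)
BLOCK_SIZE = 32  # number of elements that share one scale in OCP MX formats


def quantize_block(block):
    """Return (shared_exponent, dequantized_values) for one block of floats."""
    max_abs = max(abs(v) for v in block)
    if max_abs == 0.0:
        return 0, [0.0] * len(block)
    # frexp gives max_abs = m * 2**e with 0.5 <= m < 1, so floor(log2) == e - 1.
    shared_exp = (math.frexp(max_abs)[1] - 1) - E2M1_EMAX
    scale = 2.0 ** shared_exp
    dequantized = []
    for v in block:
        # Scale into element range and saturate at the largest magnitude (6.0).
        mag = min(abs(v) / scale, E2M1_VALUES[-1])
        # Simplified rounding: nearest grid value, ties go to the smaller one.
        nearest = min(E2M1_VALUES, key=lambda q: abs(q - mag))
        dequantized.append(math.copysign(nearest * scale, v))
    return shared_exp, dequantized


if __name__ == "__main__":
    exp, values = quantize_block([12.0, 1.0, 0.3] + [0.0] * (BLOCK_SIZE - 3))
    print(exp, values[:3])
```

In this example the largest magnitude is 12.0, so the block gets shared exponent 1 (scale 2): 12.0 is stored as 6.0 × 2, 1.0 as 0.5 × 2, and 0.3 rounds to zero. Small values in a block with a large outlier lose precision, which is one reason sensitive layers are kept in higher precision.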
**Quantization script:**
```bash
cd Quark/examples/torch/language_modeling/llm_ptq/
python3 quantize_quark.py \
    --model_dir moonshotai/Kimi-K2.7-Code \
    --output_dir Kimi-K2.7-Code-MXFP4 \
    --file2file_quantization \
    --trust_remote_code \
    --quant_scheme mxfp4 \
    --layer_quant_scheme '*self_attn*' ptpc_fp8 \
    --exclude_layers "*lm_head*" "*mlp.gate" "*mm_projector*" \
        "*vision_tower*" "mtp.*" "*shared_expert_gate*" "*router*" \
    --model_export hf_format
```
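The `--exclude_layers` patterns in the script above are wildcard expressions over layer names. The sketch below shows how such patterns select layers, assuming shell-style glob semantics as implemented by Python's `fnmatch` (an assumption about the matching rule, not a statement of how AMD-Quark evaluates them). The layer names are hypothetical and only illustrate the matching.

```python
from fnmatch import fnmatchcase

# Patterns copied from the --exclude_layers argument of the quantization script.
EXCLUDE_PATTERNS = [
    "*lm_head*", "*mlp.gate", "*mm_projector*",
    "*vision_tower*", "mtp.*", "*shared_expert_gate*", "*router*",
]


def is_excluded(layer_name, patterns=EXCLUDE_PATTERNS):
    """True if any glob pattern matches the full layer name (case-sensitive)."""
    return any(fnmatchcase(layer_name, pattern) for pattern in patterns)


if __name__ == "__main__":
    # Hypothetical layer names, for illustration only.
    for name in [
        "language_model.model.layers.3.mlp.gate",
        "language_model.model.layers.3.mlp.gate_proj",
        "vision_tower.encoder.blocks.0.mlp.fc0",
        "language_model.lm_head",
    ]:
        print(f"{name}: {'excluded' if is_excluded(name) else 'quantized'}")
```

Under these assumptions, `*mlp.gate` has no trailing wildcard, so it matches a name ending in `mlp.gate` (the MoE gate) but not one ending in `mlp.gate_proj`, which would still be quantized.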
# Deployment
### Use with vLLM
This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend.
Note: this model has 64 KV heads, which the AITER MLA kernel does not support
(it supports only 16 or 128). Disable AITER MLA when serving on ROCm:
```bash
export VLLM_ROCM_USE_AITER=1
export VLLM_ROCM_USE_AITER_MLA=0
export VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS=0
export VLLM_ROCM_USE_AITER_FP4BMM=0
python3 -m vllm.entrypoints.openai.api_server \
    --model amd/Kimi-K2.7-Code-MXFP4 \
    --trust-remote-code \
    --tensor-parallel-size 4 \
    --gpu-memory-utilization 0.9 \
    --max-model-len 8192
```
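Once the server is running, it can be queried through its OpenAI-compatible completions endpoint. The following is a minimal client sketch using only the Python standard library. It assumes the server listens on `localhost` at vLLM's default port 8000; the helper names `build_completion_request` and `complete` are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumption: default vLLM host and port
MODEL = "amd/Kimi-K2.7-Code-MXFP4"     # must match the --model value used to serve


def build_completion_request(prompt, max_tokens=64, temperature=0.0):
    """Build a POST request for the OpenAI-compatible /v1/completions endpoint."""
    payload = {
        "model": MODEL,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,  # 0.0 requests greedy decoding
    }
    return urllib.request.Request(
        f"{BASE_URL}/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request and return the generated text of the first choice."""
    request = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=120) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

With the server up, `complete("def fibonacci(n):")` returns the model's continuation as a string.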
# Evaluation
The model was evaluated on the GSM8K benchmark.
### Accuracy
<table>
<tr>
<td><strong>Benchmark</strong>
</td>
<td><strong>Kimi-K2.7-Code</strong>
</td>
<td><strong>Kimi-K2.7-Code-MXFP4 (this model)</strong>
</td>
<td><strong>Recovery</strong>
</td>
</tr>
<tr>
<td>GSM8K (strict-match)
</td>
<td>95.07
</td>
<td>94.80
</td>
<td>99.7%
</td>
</tr>
<tr>
<td>GSM8K (flexible-extract)
</td>
<td>95.15
</td>
<td>94.77
</td>
<td>99.6%
</td>
</tr>
</table>
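GSM8K is evaluated 5-shot with greedy decoding. The MXFP4 scores are the mean of
repeated stable runs (range: strict-match 94.39–95.60, flexible-extract
94.39–95.53). Recovery is the MXFP4 score expressed as a percentage of the
baseline score.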
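The recovery column can be recomputed from the two score columns of the table. A short sketch, where the helper name `recovery` is hypothetical and the scores are copied from the table:

```python
def recovery(baseline, quantized):
    """Quantized score as a percentage of the baseline, rounded to one decimal."""
    return round(100.0 * quantized / baseline, 1)


# (baseline, MXFP4) GSM8K scores taken from the accuracy table.
SCORES = {
    "strict-match": (95.07, 94.80),
    "flexible-extract": (95.15, 94.77),
}

if __name__ == "__main__":
    for metric, (baseline, quantized) in SCORES.items():
        print(f"GSM8K ({metric}): {recovery(baseline, quantized)}%")
```

This gives 99.7% for strict-match and 99.6% for flexible-extract, matching the table.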
### Reproduction
The GSM8K results were obtained using the `lm-evaluation-harness` framework
with the vLLM backend (`rocm/vllm-dev` nightly, vLLM `0.23.1rc1`). The model
is served first, then evaluated via the OpenAI-compatible completions API.
Important: serve with automatic prefix caching disabled
(`--no-enable-prefix-caching`) for deterministic evaluation results.
```bash
# 1) Serve
export VLLM_ROCM_USE_AITER=1 VLLM_ROCM_USE_AITER_MLA=0 \
    VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS=0 VLLM_ROCM_USE_AITER_FP4BMM=0
python3 -m vllm.entrypoints.openai.api_server \
    --model amd/Kimi-K2.7-Code-MXFP4 \
    --trust-remote-code --tensor-parallel-size 4 \
    --gpu-memory-utilization 0.9 --max-model-len 8192 \
    --seed 42 --no-enable-prefix-caching

# 2) Evaluate
lm_eval --model local-completions \
    --model_args "model=amd/Kimi-K2.7-Code-MXFP4,base_url=http://0.0.0.0:8000/v1/completions,num_concurrent=128,tokenized_requests=False,max_length=8192,add_bos_token=True,seed=42,trust_remote_code=True" \
    --tasks gsm8k --num_fewshot 5 --batch_size 1 --seed 42
```
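Because the model is served first and evaluated afterwards, a script that automates both steps needs to wait until the server has finished loading before launching `lm_eval`. The sketch below polls the server's `/health` endpoint. The host, port, timeout values, and the helper names `http_probe` and `wait_until_ready` are assumptions for illustration.

```python
import time
import urllib.error
import urllib.request


def http_probe(url="http://localhost:8000/health"):
    """Return True if the server answers the health check with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused or timed out: the server is not ready yet.
        return False


def wait_until_ready(probe=http_probe, timeout_s=1800.0, interval_s=10.0,
                     sleep=time.sleep):
    """Poll `probe` until it succeeds or `timeout_s` elapses; return the outcome."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

A driver script could call `wait_until_ready()` after starting the server process and run the evaluation command only if it returns `True`.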
# License
Modifications Copyright (c) 2026 Advanced Micro Devices, Inc. All rights reserved.