| --- |
| license: other |
| license_name: modified-mit |
| license_link: LICENSE |
| base_model: |
| - moonshotai/Kimi-K2.7-Code |
| --- |
| # Model Overview |
|
|
| - **Model Architecture:** Kimi-K2.7-Code |
| - **Input:** Text, Image, Video |
| - **Output:** Text |
| - **Supported Hardware Microarchitecture:** AMD MI350/MI355 |
| - **ROCm:** 7.2.3 |
| - **PyTorch:** 2.10.0 |
| - **Transformers:** 5.12.1 |
| - **Operating System(s):** Linux |
| - **Inference Engine:** [vLLM](https://docs.vllm.ai/en/latest/) |
| - **Model Optimizer:** [AMD-Quark](https://quark.docs.amd.com/latest/index.html) (v0.12) |
| - **Weight quantization:** OCP MXFP4, static; `self_attn` layers: per-channel FP8 (E4M3), static |
| - **Activation quantization:** OCP MXFP4, dynamic; `self_attn` layers: per-token FP8 (E4M3), dynamic |
| - **Excluded from quantization:** MoE gates, `lm_head`, vision tower and multimodal projector |
|
|
| This model was built from Kimi-K2.7-Code by applying MXFP4 quantization with [AMD-Quark](https://quark.docs.amd.com/latest/index.html). |
|
|
| # Model Quantization |
|
|
| The model was quantized from [moonshotai/Kimi-K2.7-Code](https://huggingface.co/moonshotai/Kimi-K2.7-Code) using [AMD-Quark](https://quark.docs.amd.com/latest/index.html). The MoE/Linear weights and activations are quantized to OCP MXFP4, while the attention projections use FP8 (E4M3). The vision tower and multimodal projector are kept at BF16. |
|
|
| **Quantization script:** |
|
|
| ```bash |
| cd Quark/examples/torch/language_modeling/llm_ptq/ |
| |
| python3 quantize_quark.py \ |
| --model_dir moonshotai/Kimi-K2.7-Code \ |
| --output_dir Kimi-K2.7-Code-MXFP4 \ |
| --file2file_quantization \ |
| --trust_remote_code \ |
| --quant_scheme mxfp4 \ |
| --layer_quant_scheme '*self_attn*' ptpc_fp8 \ |
| --exclude_layers "*lm_head*" "*mlp.gate" "*mm_projector*" \ |
| "*vision_tower*" "mtp.*" "*shared_expert_gate*" "*router*" \ |
| --model_export hf_format |
| ``` |
|
|
| # Deployment |
| ### Use with vLLM |
|
|
| This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend. |
|
|
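| Note: this model has 64 KV heads, which the AITER MLA kernel does not support |
| (it handles only 16 or 128). When serving on ROCm, keep AITER enabled but disable |
| AITER MLA; the command below also turns off the fused shared-experts and FP4 BMM paths: |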
|
|
| ```bash |
| export VLLM_ROCM_USE_AITER=1 |
| export VLLM_ROCM_USE_AITER_MLA=0 |
| export VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS=0 |
| export VLLM_ROCM_USE_AITER_FP4BMM=0 |
| |
| python3 -m vllm.entrypoints.openai.api_server \ |
| --model amd/Kimi-K2.7-Code-MXFP4 \ |
| --trust-remote-code \ |
| --tensor-parallel-size 4 \ |
| --gpu-memory-utilization 0.9 \ |
| --max-model-len 8192 |
| ``` |
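Once the server is up, it can be smoke-tested through the OpenAI-compatible completions endpoint. The following is a minimal sketch, assuming the server listens on vLLM's default port 8000 on the local host. `BASE_URL`, the `build_payload` helper, the prompt, and the `max_tokens` value are illustrative choices, not part of the original setup.

```shell
# Assumption: the server started above is reachable on vLLM's default port.
BASE_URL="${BASE_URL:-http://127.0.0.1:8000}"
MODEL="amd/Kimi-K2.7-Code-MXFP4"

# Build a JSON body for /v1/completions (hypothetical helper for this sketch).
# No JSON escaping is done, so the prompt must not contain double quotes
# or backslashes.
build_payload() {
  printf '{"model": "%s", "prompt": "%s", "max_tokens": 64, "temperature": 0}\n' \
    "$MODEL" "$1"
}

# Send one greedy completion request; print a hint if the server is unreachable.
build_payload "def fibonacci(n):" \
  | curl -sS --max-time 60 "$BASE_URL/v1/completions" \
      -H "Content-Type: application/json" -d @- \
  || echo "request failed: is the vLLM server running at $BASE_URL?" >&2
```

A successful call returns a JSON object whose `choices` array holds the generated text.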
|
|
| # Evaluation |
| The model was evaluated on the GSM8K benchmark. |
|
|
| ### Accuracy |
|
|
| <table> |
| <tr> |
| <td><strong>Benchmark</strong> |
| </td> |
| <td><strong>Kimi-K2.7-Code</strong> |
| </td> |
| <td><strong>Kimi-K2.7-Code-MXFP4 (this model)</strong> |
| </td> |
| <td><strong>Recovery</strong> |
| </td> |
| </tr> |
| <tr> |
| <td>GSM8K (strict-match) |
| </td> |
| <td>95.07 |
| </td> |
| <td>94.80 |
| </td> |
| <td>99.7% |
| </td> |
| </tr> |
| <tr> |
| <td>GSM8K (flexible-extract) |
| </td> |
| <td>95.15 |
| </td> |
| <td>94.77 |
| </td> |
| <td>99.6% |
| </td> |
| </tr> |
| </table> |
|
|
| GSM8K scores use 5-shot prompting with greedy decoding. The MXFP4 numbers are the mean of |
| repeated stable runs (range: strict-match 94.39–95.60, flexible-extract 94.39–95.53). |
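The Recovery column can be checked with a few lines of shell. This is a minimal sketch, assuming recovery means the quantized score divided by the baseline score, expressed as a percentage, which matches the values in the table. The `recovery` helper name is illustrative.

```shell
# recovery <baseline> <quantized>: quantized score as a percentage of baseline.
recovery() {
  LC_ALL=C awk -v base="$1" -v quant="$2" \
    'BEGIN { printf "%.1f%%\n", 100 * quant / base }'
}

recovery 95.07 94.80   # GSM8K strict-match    -> 99.7%
recovery 95.15 94.77   # GSM8K flexible-extract -> 99.6%
```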
|
|
| ### Reproduction |
|
|
| The GSM8K results were obtained using the `lm-evaluation-harness` framework |
| with the vLLM backend (`rocm/vllm-dev` nightly, vLLM `0.23.1rc1`). The model |
| is served first, then evaluated via the OpenAI-compatible completions API. |
|
|
| Important: serve with automatic prefix caching disabled |
| (`--no-enable-prefix-caching`) for deterministic evaluation results. |
|
|
| ```bash |
| # 1) Serve |
| export VLLM_ROCM_USE_AITER=1 VLLM_ROCM_USE_AITER_MLA=0 \ |
| VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS=0 VLLM_ROCM_USE_AITER_FP4BMM=0 |
| python3 -m vllm.entrypoints.openai.api_server \ |
| --model amd/Kimi-K2.7-Code-MXFP4 \ |
| --trust-remote-code --tensor-parallel-size 4 \ |
| --gpu-memory-utilization 0.9 --max-model-len 8192 \ |
| --seed 42 --no-enable-prefix-caching |
| |
| # 2) Evaluate (run from a second shell once the server has finished loading) |
| lm_eval --model local-completions \ |
| --model_args "model=amd/Kimi-K2.7-Code-MXFP4,base_url=http://0.0.0.0:8000/v1/completions,num_concurrent=128,tokenized_requests=False,max_length=8192,add_bos_token=True,seed=42,trust_remote_code=True" \ |
| --tasks gsm8k --num_fewshot 5 --batch_size 1 --seed 42 |
| ``` |
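Because the serve command runs in the foreground, the evaluation is normally launched from a second shell after the model has finished loading. One way to avoid starting too early is to poll the server first. The sketch below assumes the default port 8000 and polls the `/v1/models` route of the OpenAI-compatible API. The `wait_for_server` helper, the retry count, and the delay are illustrative choices.

```shell
# wait_for_server <base_url> [retries] [delay_seconds]
# Returns 0 as soon as <base_url>/v1/models answers with an HTTP success code,
# and 1 if it never does within the given number of attempts.
wait_for_server() {
  url="$1"
  retries="${2:-120}"
  delay="${3:-10}"
  i=0
  while [ "$i" -lt "$retries" ]; do
    if curl -sf --max-time 5 -o /dev/null "$url/v1/models"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Usage (commented out so that pasting this block does not block the shell):
# wait_for_server http://127.0.0.1:8000 && echo "server ready, start lm_eval"
```

With the defaults, the helper waits up to roughly 20 minutes, which may need adjusting for the actual model load time.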
|
|
| # License |
| Modifications Copyright(c) 2026 Advanced Micro Devices, Inc. All rights reserved. |
|
|