---
license: other
license_name: modified-mit
license_link: LICENSE
base_model:
- moonshotai/Kimi-K2.7-Code
---

# Model Overview

- **Model Architecture:** Kimi-K2.7-Code
  - **Input:** Text, Image, Video
  - **Output:** Text
- **Supported Hardware Microarchitecture:** AMD MI350/MI355
- **ROCm:** 7.2.3
- **PyTorch:** 2.10.0
- **Transformers:** 5.12.1
- **Operating System(s):** Linux
- **Inference Engine:** [vLLM](https://docs.vllm.ai/en/latest/)
- **Model Optimizer:** [AMD-Quark](https://quark.docs.amd.com/latest/index.html) (V0.12)
  - **Weight quantization:** OCP MXFP4, Static; self_attn Perchannel, FP8E4M3, Static
  - **Activation quantization:** OCP MXFP4, Dynamic; self_attn Pertoken, FP8E4M3, Dynamic
  - **Excluded from quantization:** MoE gates, `lm_head`, vision tower and multimodal projector

This model was built with the Kimi-K2.7-Code model by applying [AMD-Quark](https://quark.docs.amd.com/latest/index.html) for MXFP4 quantization.

# Model Quantization

The model was quantized from [moonshotai/Kimi-K2.7-Code](https://huggingface.co/moonshotai/Kimi-K2.7-Code) using [AMD-Quark](https://quark.docs.amd.com/latest/index.html). The MoE/Linear weights and activations are quantized to OCP MXFP4, while the attention projections use FP8 (E4M3). The vision tower and multimodal projector are kept at BF16.

**Quantization script:**

```bash
cd Quark/examples/torch/language_modeling/llm_ptq/

python3 quantize_quark.py \
    --model_dir moonshotai/Kimi-K2.7-Code \
    --output_dir Kimi-K2.7-Code-MXFP4 \
    --file2file_quantization \
    --trust_remote_code \
    --quant_scheme mxfp4 \
    --layer_quant_scheme '*self_attn*' ptpc_fp8 \
    --exclude_layers "*lm_head*" "*mlp.gate" "*mm_projector*" \
                     "*vision_tower*" "mtp.*" "*shared_expert_gate*" "*router*" \
    --model_export hf_format
```

# Deployment

### Use with vLLM

This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend.

Note: this model has 64 KV heads, which is incompatible with the AITER MLA kernel (supports 16 or 128 only). Disable AITER MLA when serving on ROCm:

```bash
export VLLM_ROCM_USE_AITER=1
export VLLM_ROCM_USE_AITER_MLA=0
export VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS=0
export VLLM_ROCM_USE_AITER_FP4BMM=0

python3 -m vllm.entrypoints.openai.api_server \
    --model amd/Kimi-K2.7-Code-MXFP4 \
    --trust-remote-code \
    --tensor-parallel-size 4 \
    --gpu-memory-utilization 0.9 \
    --max-model-len 8192
```

## Evaluation

The model was evaluated on the GSM8K benchmark.

### Accuracy


| Benchmark | Kimi-K2.7-Code | Kimi-K2.7-Code-MXFP4 (this model) | Recovery |
|---|---|---|---|
| GSM8K (strict-match) | 95.07 | 94.80 | 99.7% |
| GSM8K (flexible-extract) | 95.15 | 94.77 | 99.6% |

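
The Recovery column appears to be the quantized score expressed as a percentage of the baseline score; the values in the table are consistent with that reading. A minimal sketch of the arithmetic, assuming that definition (`recovery` is a hypothetical helper name, not part of any tool used here):

```python
def recovery(baseline: float, quantized: float) -> float:
    """Return the quantized score as a percentage of the baseline score.

    Assumption: this is how the Recovery column is defined.
    """
    return 100.0 * quantized / baseline


# (baseline, quantized) scores copied from the accuracy table.
scores = {
    "GSM8K (strict-match)": (95.07, 94.80),
    "GSM8K (flexible-extract)": (95.15, 94.77),
}

# Round to one decimal place, matching the precision used in the table.
recoveries = {name: round(recovery(b, q), 1) for name, (b, q) in scores.items()}

for name, value in recoveries.items():
    print(f"{name}: {value}%")
# GSM8K (strict-match): 99.7%
# GSM8K (flexible-extract): 99.6%
```
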
GSM8K is 5-shot, greedy decoding. The MXFP4 numbers are the mean of repeated stable runs (range: strict 94.39–95.60, flexible 94.39–95.53).

### Reproduction

The GSM8K results were obtained using the `lm-evaluation-harness` framework with the vLLM backend (`rocm/vllm-dev` nightly, vLLM `0.23.1rc1`). The model is served first, then evaluated via the OpenAI-compatible completions API.

Important: serve with automatic prefix caching disabled (`--no-enable-prefix-caching`) for deterministic evaluation results.

```bash
# 1) Serve
export VLLM_ROCM_USE_AITER=1 VLLM_ROCM_USE_AITER_MLA=0 \
       VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS=0 VLLM_ROCM_USE_AITER_FP4BMM=0
python3 -m vllm.entrypoints.openai.api_server \
    --model amd/Kimi-K2.7-Code-MXFP4 \
    --trust-remote-code --tensor-parallel-size 4 \
    --gpu-memory-utilization 0.9 --max-model-len 8192 \
    --seed 42 --no-enable-prefix-caching

# 2) Evaluate
lm_eval --model local-completions \
    --model_args "model=amd/Kimi-K2.7-Code-MXFP4,base_url=http://0.0.0.0:8000/v1/completions,num_concurrent=128,tokenized_requests=False,max_length=8192,add_bos_token=True,seed=42,trust_remote_code=True" \
    --tasks gsm8k --num_fewshot 5 --batch_size 1 --seed 42
```

# License

Modifications Copyright(c) 2026 Advanced Micro Devices, Inc. All rights reserved.