---
license: other
license_name: modified-mit
license_link: LICENSE
base_model:
- moonshotai/Kimi-K2.7-Code
---

# Model Overview

- **Model Architecture:** Kimi-K2.7-Code
  - **Input:** Text, Image, Video
  - **Output:** Text
- **Supported Hardware Microarchitecture:** AMD MI350/MI355
- **ROCm:** 7.2.3
- **PyTorch:** 2.10.0
- **Transformers:** 5.12.1
- **Operating System(s):** Linux
- **Inference Engine:** [vLLM](https://docs.vllm.ai/en/latest/)
- **Model Optimizer:** [AMD-Quark](https://quark.docs.amd.com/latest/index.html) (V0.12)
  - **Weight quantization:** OCP MXFP4, Static; self_attn Perchannel, FP8E4M3, Static
  - **Activation quantization:** OCP MXFP4, Dynamic; self_attn Pertoken, FP8E4M3, Dynamic
  - **Excluded from quantization:** MoE gates, `lm_head`, vision tower and multimodal projector

This model was built with the Kimi-K2.7-Code model by applying [AMD-Quark](https://quark.docs.amd.com/latest/index.html) for MXFP4 quantization.

# Model Quantization

The model was quantized from [moonshotai/Kimi-K2.7-Code](https://huggingface.co/moonshotai/Kimi-K2.7-Code) using [AMD-Quark](https://quark.docs.amd.com/latest/index.html). The MoE/Linear weights and activations are quantized to OCP MXFP4, while the attention projections use FP8 (E4M3). The vision tower and multimodal projector are kept at BF16.

**Quantization script:**

```bash
cd Quark/examples/torch/language_modeling/llm_ptq/

python3 quantize_quark.py \
    --model_dir moonshotai/Kimi-K2.7-Code \
    --output_dir Kimi-K2.7-Code-MXFP4 \
    --file2file_quantization \
    --trust_remote_code \
    --quant_scheme mxfp4 \
    --layer_quant_scheme '*self_attn*' ptpc_fp8 \
    --exclude_layers "*lm_head*" "*mlp.gate" "*mm_projector*" \
        "*vision_tower*" "mtp.*" "*shared_expert_gate*" "*router*" \
    --model_export hf_format
```

# Deployment

### Use with vLLM

This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend.

Note: this model has 64 KV heads, which is incompatible with the AITER MLA kernel (supports 16 or 128 only). Disable AITER MLA when serving on ROCm:

```bash
export VLLM_ROCM_USE_AITER=1
export VLLM_ROCM_USE_AITER_MLA=0
export VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS=0
export VLLM_ROCM_USE_AITER_FP4BMM=0

python3 -m vllm.entrypoints.openai.api_server \
    --model amd/Kimi-K2.7-Code-MXFP4 \
    --trust-remote-code \
    --tensor-parallel-size 4 \
    --gpu-memory-utilization 0.9 \
    --max-model-len 8192
```

## Evaluation

The model was evaluated on the GSM8K benchmark.

### Accuracy

| Benchmark | Kimi-K2.7-Code | Kimi-K2.7-Code-MXFP4 (this model) | Recovery |
| --- | --- | --- | --- |
| GSM8K (strict-match) | 95.07 | 94.80 | 99.7% |
| GSM8K (flexible-extract) | 95.15 | 94.77 | 99.6% |
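
The card does not define the Recovery column. The reported values are consistent with reading it as the quantized score expressed as a percentage of the baseline score. The following is a minimal sketch under that assumption; the names `SCORES`, `recovery`, and `recovery_table` are hypothetical helpers introduced here for illustration, not part of any evaluation tooling.

```python
# Scores copied from the accuracy table (GSM8K, in percent):
# (baseline Kimi-K2.7-Code, quantized Kimi-K2.7-Code-MXFP4)
SCORES = {
    "GSM8K (strict-match)": (95.07, 94.80),
    "GSM8K (flexible-extract)": (95.15, 94.77),
}


def recovery(baseline: float, quantized: float) -> float:
    """Return the quantized score as a percentage of the baseline score.

    Assumption: this ratio is how the Recovery column is derived.
    """
    return quantized / baseline * 100.0


def recovery_table(scores: dict) -> dict:
    """Map each benchmark name to its recovery, rounded to one decimal place."""
    return {
        name: round(recovery(baseline, quantized), 1)
        for name, (baseline, quantized) in scores.items()
    }


if __name__ == "__main__":
    for name, value in recovery_table(SCORES).items():
        print(f"{name}: {value}%")
```

Under this reading, both rows round to the values listed in the table (99.7% and 99.6%).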