Commit eb49a2b (verified) by jiaxwang, parent 1159846: Create README.md
---
license: other
license_name: modified-mit
license_link: LICENSE
base_model:
- moonshotai/Kimi-K2-Instruct-0905
---

# Model Overview

- **Model Architecture:** Kimi-K2-Instruct
- **Input:** Text
- **Output:** Text
- **Supported Hardware Microarchitecture:** AMD MI350/MI355
- **ROCm:** 7.0
- **Operating System(s):** Linux
- **Inference Engine:** [vLLM](https://docs.vllm.ai/en/latest/)
- **Model Optimizer:** [AMD-Quark](https://quark.docs.amd.com/latest/index.html)
- **Weight quantization:** MoE-only, OCP MXFP4, Static
- **Activation quantization:** MoE-only, OCP MXFP4, Dynamic
- **Calibration Dataset:** [Pile](https://huggingface.co/datasets/mit-han-lab/pile-val-backup)

This model was built from the [moonshotai/Kimi-K2-Instruct-0905](https://huggingface.co/moonshotai/Kimi-K2-Instruct-0905) model by applying [AMD-Quark](https://quark.docs.amd.com/latest/index.html) for MXFP4 quantization.

# Model Quantization

The model was quantized from [moonshotai/Kimi-K2-Instruct-0905](https://huggingface.co/moonshotai/Kimi-K2-Instruct-0905) using [AMD-Quark](https://quark.docs.amd.com/latest/index.html). The weights and activations of the MoE layers are quantized to MXFP4.
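MXFP4 refers to the OCP Microscaling FP4 format: blocks of 32 values encoded as 4-bit E2M1 elements that share a single power-of-two (E8M0) scale. As an illustration of the numerics only (this is not the AMD-Quark implementation), a minimal fake-quantization round trip for one block might look like:

```python
import math

# Representable magnitudes of the FP4 E2M1 element format (sign stored separately).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK = 32  # MX block size: 32 elements share one power-of-two scale

def mxfp4_fake_quant(block):
    """Quantize one block of floats to MXFP4 and dequantize back."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return list(block)
    # Power-of-two scale chosen so the largest magnitude maps near 6.0,
    # the top of the E2M1 range (values beyond it are clipped).
    scale = 2.0 ** (math.floor(math.log2(amax)) - 2)
    out = []
    for v in block:
        mag = min(abs(v) / scale, 6.0)
        q = min(E2M1_GRID, key=lambda g: abs(g - mag))  # round to nearest code
        out.append(math.copysign(q * scale, v))
    return out
```

With only 8 magnitudes per block, accuracy hinges on the shared scale tracking each block's dynamic range, which is why MX formats keep blocks small (32 elements).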

# Deployment

## Use with vLLM

This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend.

## Evaluation

The model was evaluated on the GSM8K benchmark.

### Accuracy

| Benchmark | Kimi-K2-Instruct-0905 | Kimi-K2-Instruct-0905-MXFP4 (this model) | Recovery |
|---|---|---|---|
| GSM8K | 95.53 | 94.47 | 98.89% |
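Recovery is the quantized model's score expressed as a percentage of the baseline score; checking the table's value:

```python
# Recovery = quantized score / baseline score (GSM8K values from the table above).
baseline = 95.53   # Kimi-K2-Instruct-0905
quantized = 94.47  # Kimi-K2-Instruct-0905-MXFP4

recovery = quantized / baseline * 100
print(f"{recovery:.2f}%")  # 98.89%
```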

### Reproduction

The GSM8K results were obtained using the `lm-evaluation-harness` framework, based on the Docker image `rocm/vllm-private:vllm_dev_base_mxfp4_20260122`, with vLLM and lm-eval compiled and installed from source inside the image.

#### Launching the server

```bash
export VLLM_ATTENTION_BACKEND="TRITON_MLA"
export VLLM_ROCM_USE_AITER=1
export VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS=0

vllm serve amd/Kimi-K2-Instruct-0905-MXFP4 \
  --port 8000 \
  --served-model-name kimi-k2-mxfp4 \
  --trust-remote-code \
  --tensor-parallel-size 8 \
  --enable-auto-tool-choice \
  --tool-call-parser kimi_k2
```
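Once launched, the server exposes an OpenAI-compatible API on port 8000 under the served model name `kimi-k2-mxfp4`. A quick smoke test (the prompt is illustrative) can be sketched as follows; the payload construction is shown separately so it can be checked without a running server:

```python
import json

def build_chat_request(model, prompt, max_tokens=64):
    """Build the JSON body for the OpenAI-compatible /v1/chat/completions endpoint."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

body = build_chat_request("kimi-k2-mxfp4", "What is 2 + 2?")
payload = json.dumps(body).encode()

# With the server from the step above running, the request could be sent with:
# import urllib.request
# req = urllib.request.Request(
#     "http://0.0.0.0:8000/v1/chat/completions",
#     data=payload,
#     headers={"Content-Type": "application/json"},
# )
# print(urllib.request.urlopen(req).read().decode())
```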

#### Evaluating the model in a new terminal

```bash
lm_eval \
  --model local-completions \
  --model_args "model=kimi-k2-mxfp4,base_url=http://0.0.0.0:8000/v1/completions,tokenized_requests=False,tokenizer_backend=None,num_concurrent=32" \
  --tasks gsm8k \
  --num_fewshot 5 \
  --batch_size 1
```

# License

Modifications Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.