---
license: other
license_name: modified-mit
library_name: transformers
base_model:
- moonshotai/Kimi-K2-Thinking
---

# Model Overview

- **Model Architecture:** Kimi-K2-Thinking
- **Input:** Text
- **Output:** Text
- **Supported Hardware Microarchitecture:** AMD MI300/MI355
- **ROCm:** 7.0
- **PyTorch:** 2.8.0
- **Transformers:** 4.53.0
- **Operating System(s):** Linux
- **Inference Engine:** [vLLM](https://docs.vllm.ai/en/latest/)
- **Model Optimizer:** [AMD-Quark](https://quark.docs.amd.com/latest/index.html) (V0.10)
- **Weight quantization:** INT4 per-channel & FP8E4M3, static
- **Activation quantization:** FP8E4M3, dynamic

This model was built from the moonshotai Kimi-K2-Thinking model by applying [AMD-Quark](https://quark.docs.amd.com/latest/index.html) for INT4-FP8 quantization.

# Model Quantization

The model was quantized from [moonshotai/Kimi-K2-Thinking](https://huggingface.co/moonshotai/Kimi-K2-Thinking) using [AMD-Quark](https://quark.docs.amd.com/latest/index.html).
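As a rough illustration of the per-channel INT4 weight scheme named above, symmetric per-channel quantization can be sketched as follows. This is a simplified, self-contained sketch, not AMD-Quark's actual implementation:

```python
import numpy as np

# Simplified sketch of symmetric per-channel INT4 weight quantization.
# Illustrative only -- this is NOT AMD-Quark's implementation.
def quantize_int4_per_channel(w: np.ndarray):
    """Quantize each output channel (row) of w to signed INT4 ([-8, 7])."""
    max_abs = np.max(np.abs(w), axis=1, keepdims=True)
    scale = max_abs / 7.0  # one scale per channel
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 16)).astype(np.float32)
q, scale = quantize_int4_per_channel(w)
w_hat = dequantize(q, scale)
print("max abs reconstruction error:", np.max(np.abs(w - w_hat)))
```

Per-channel scales bound the rounding error of each element by half the channel's scale, which is why per-channel schemes typically recover accuracy better than a single per-tensor scale.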

# Deployment

This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend.
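Once served, the model can be queried through vLLM's OpenAI-compatible completions endpoint. A minimal client sketch, assuming the server runs on port 8001 and the model was served from the local path shown (both are assumptions matching the launch command in the Reproduction section; adjust to your deployment):

```python
import json
import urllib.request

# Assumed endpoint and model path; match these to your `vllm serve` invocation.
BASE_URL = "http://localhost:8001/v1/completions"

payload = {
    "model": "/data/amd/Kimi-K2-Thinking-W4A8",
    "prompt": "Explain INT4 weight quantization in one sentence.",
    "max_tokens": 128,
    "temperature": 0.6,
}
body = json.dumps(payload).encode("utf-8")
request = urllib.request.Request(
    BASE_URL, data=body, headers={"Content-Type": "application/json"}
)

# Uncomment to send the request against a running server:
# with urllib.request.urlopen(request) as resp:
#     print(json.loads(resp.read())["choices"][0]["text"])
```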

## Evaluation

The model was evaluated on the GSM8K benchmark using the [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) framework.

### Accuracy

| Benchmark | Kimi-K2-Thinking | Kimi-K2-Thinking-W4A8 (this model) | Recovery |
| --- | --- | --- | --- |
| GSM8K | 93.93 | 93.4 | 99.4% |
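Recovery is simply the quantized model's score expressed as a percentage of the baseline score:

```python
# Recovery = quantized-model score / baseline score, using the GSM8K
# numbers from the accuracy table.
baseline = 93.93   # GSM8K, moonshotai/Kimi-K2-Thinking
quantized = 93.4   # GSM8K, this W4A8 model
recovery = quantized / baseline * 100
print(f"Recovery: {recovery:.1f}%")  # -> Recovery: 99.4%
```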

### Reproduction

The results for GSM8K were obtained using [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) and the latest vLLM.

Launch vLLM:

```bash
MODEL_DIR=/data/amd/Kimi-K2-Thinking-W4A8
VLLM_ATTENTION_BACKEND="TRITON_MLA" VLLM_ROCM_USE_AITER=1 VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS=0 VLLM_ROCM_USE_AITER_FP4BMM=0 vllm serve $MODEL_DIR \
  --port 8001 \
  --trust-remote-code \
  --gpu-memory-utilization 0.9 \
  --tensor-parallel-size 8 \
  --load-format "fastsafetensors"
```

GSM8K evaluation:

```bash
SEED=42  # example seed; set any fixed value for reproducibility
MODEL_ARGS="model=/data/amd/Kimi-K2-Thinking-W4A8,base_url=http://localhost:8001/v1/completions,num_concurrent=999999,timeout=999999,tokenized_requests=False,max_length=38768,temperature=0.6,top_p=0.95,add_bos_token=True,seed=$SEED,trust_remote_code=True"
lm_eval \
  --model local-completions \
  --model_args $MODEL_ARGS \
  --tasks gsm8k \
  --num_fewshot 8 \
  --batch_size auto
```

# License

Modifications Copyright(c) 2025 Advanced Micro Devices, Inc. All rights reserved.