base_model:
- moonshotai/Kimi-K2-Thinking
---
# Model Overview

- **Model Architecture:** Kimi-K2-Thinking
- **Input:** Text
- **Output:** Text
- **Supported Hardware Microarchitecture:** AMD MI300/MI355
- **ROCm:** 7.0
- **PyTorch:** 2.8.0
- **Transformers:** 4.53.0
- **Operating System(s):** Linux
- **Inference Engine:** [vLLM](https://docs.vllm.ai/en/latest/)
- **Model Optimizer:** [AMD-Quark](https://quark.docs.amd.com/latest/index.html) (v0.10)
- **Weight quantization:** INT4 per-channel & FP8 (E4M3), static
- **Activation quantization:** FP8 (E4M3), dynamic

This model was built from the moonshotai Kimi-K2-Thinking model by applying [AMD-Quark](https://quark.docs.amd.com/latest/index.html) for INT4-FP8 quantization.
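For intuition, per-channel INT4 symmetric quantization maps each weight row to integers in a small range (here [-7, 7]) using one scale per row. The toy sketch below is illustrative only; it is plain `awk`, not AMD-Quark's actual algorithm or numerics, and the sample row values are made up:

```shell
# Toy per-channel INT4 symmetric quantization of one weight row.
# Illustrative only -- not AMD-Quark's actual implementation.
row="0.50 -1.20 0.75 0.10"
quantized=$(echo "$row" | awk '{
  max = 0
  for (i = 1; i <= NF; i++) { a = ($i < 0 ? -$i : $i); if (a > max) max = a }
  scale = max / 7                                   # per-channel (per-row) scale
  for (i = 1; i <= NF; i++) {
    q = int($i / scale + ($i >= 0 ? 0.5 : -0.5))    # round to nearest INT4 value
    printf "%d%s", q, (i < NF ? " " : "")
  }
}')
echo "$quantized"
```

Dequantization multiplies the integers back by the stored scale; the gap between the original and reconstructed values is the quantization error the calibration process tries to minimize.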

# Model Quantization

The model was quantized from [moonshotai/Kimi-K2-Thinking](https://huggingface.co/moonshotai/Kimi-K2-Thinking) using [AMD-Quark](https://quark.docs.amd.com/latest/index.html). Weights were quantized to INT4 (per-channel, static) and FP8 (E4M3), and activations to FP8 (E4M3) with dynamic quantization.

# Deployment

This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend.
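Once served, the model is reachable through vLLM's OpenAI-compatible API. A minimal query sketch, assuming the port and model path used in the Reproduction section (adjust both for your deployment; no test is included since this requires a running server):

```shell
# Query vLLM's OpenAI-compatible completions endpoint.
# Port 8001 and the model path match the serve command in the
# Reproduction section; adjust both for your deployment.
curl -s http://localhost:8001/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "/data/amd/Kimi-K2-Thinking-W4A8",
          "prompt": "Briefly explain INT4 weight quantization.",
          "max_tokens": 128,
          "temperature": 0.6
        }'
```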

## Evaluation

The model was evaluated on the GSM8K benchmark using the [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) framework.

### Accuracy

| Benchmark | Kimi-K2-Thinking | Kimi-K2-Thinking-W4A8 (this model) | Recovery |
|-----------|------------------|------------------------------------|----------|
| GSM8K     | 93.93            | 93.4                               | 99.4%    |
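The recovery column is simply the quantized model's score as a percentage of the baseline score:

```shell
# Recovery = quantized-model score / baseline score, as a percentage.
recovery=$(awk 'BEGIN { printf "%.1f%%", 93.4 / 93.93 * 100 }')
echo "$recovery"
```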

### Reproduction

The GSM8K results were obtained using [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) and the latest vLLM.

Launch vLLM:
```shell
MODEL_DIR=/data/amd/Kimi-K2-Thinking-W4A8
VLLM_ATTENTION_BACKEND="TRITON_MLA" VLLM_ROCM_USE_AITER=1 VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS=0 vllm serve $MODEL_DIR \
    --port 8001 \
    --trust-remote-code \
    --gpu-memory-utilization 0.9 \
    --tensor-parallel-size 8 \
    --load-format "fastsafetensors"
```

GSM8K evaluation:
```shell
SEED=42  # choose any fixed seed for reproducibility; it is referenced below
MODEL_ARGS="model=/data/amd/Kimi-K2-Thinking-W4A8,base_url=http://localhost:8001/v1/completions,num_concurrent=999999,timeout=999999,tokenized_requests=False,max_length=38768,temperature=0.6,top_p=0.95,add_bos_token=True,seed=$SEED,trust_remote_code=True"
lm_eval \
    --model local-completions \
    --model_args "$MODEL_ARGS" \
    --tasks gsm8k \
    --num_fewshot 8 \
    --batch_size auto
```

# License

Modifications Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.