Files changed (2)
  1. README.md +98 -2
  2. chat_template.jinja +97 -0
README.md CHANGED
@@ -2,8 +2,104 @@
  license: other
  license_name: modified-mit
  license_link: LICENSE
  ---

- # Disclaimer

- This model is provided for experimental purposes only. Its accuracy, stability, and suitability for deployment are not guaranteed. Users are advised to independently evaluate the model before any practical or production use.
  license: other
  license_name: modified-mit
  license_link: LICENSE
+ base_model:
+ - moonshotai/Kimi-K2-Thinking
  ---

+ # Model Overview

+ - **Model Architecture:** Kimi-K2-Thinking
+ - **Input:** Text
+ - **Output:** Text
+ - **Supported Hardware Microarchitecture:** AMD MI350/MI355
+ - **ROCm:** 7.0
+ - **Operating System(s):** Linux
+ - **Inference Engine:** [vLLM](https://docs.vllm.ai/en/latest/)
+ - **Model Optimizer:** [AMD-Quark](https://quark.docs.amd.com/latest/index.html)
+ - **Weight quantization:** OCP MXFP4, Static
+ - **Activation quantization:** OCP MXFP4, Dynamic
+ - **Calibration Dataset:** [Pile](https://huggingface.co/datasets/mit-han-lab/pile-val-backup)
+
+ This model was built from the Kimi-K2-Thinking model by applying [AMD-Quark](https://quark.docs.amd.com/latest/index.html) MXFP4 quantization.
+
+ # Model Quantization
+
+ The model was quantized from [moonshotai/Kimi-K2-Thinking](https://huggingface.co/moonshotai/Kimi-K2-Thinking) using [AMD-Quark](https://quark.docs.amd.com/latest/index.html). The weights and activations are quantized to MXFP4.
+
+ **Quantization script:**
+ ```bash
+ cd Quark/examples/torch/language_modeling/llm_ptq/
+ exclude_layers="*self_attn* *mlp.gate *lm_head *mlp.gate_proj *mlp.up_proj *mlp.down_proj *shared_experts.*"
+
+ python3 quantize_quark.py \
+     --model_dir moonshotai/Kimi-K2-Thinking \
+     --quant_scheme mxfp4 \
+     --exclude_layers $exclude_layers \
+     --num_calib_data 128 \
+     --output_dir amd/Kimi-K2-Thinking-MXFP4 \
+     --model_export hf_format \
+     --device cpu
+ ```
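MXFP4, as defined in the OCP Microscaling (MX) specification, groups tensor values into blocks that share one power-of-two scale, with each element stored as FP4 (E2M1). A minimal, framework-free sketch of that encode/decode round trip (illustrative only, not Quark's actual implementation; the value grid and shared-scale rule follow the MX spec):

```python
import math

# FP4 E2M1 representable magnitudes per the OCP MX spec
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_mxfp4_block(block):
    """Quantize one block (nominally 32 floats) to a shared power-of-two
    scale plus per-element FP4 E2M1 values."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return 1.0, [0.0] * len(block)
    # Shared scale: align amax's binade with FP4's max binade (6.0 = 1.5 * 2^2)
    scale = 2.0 ** (math.floor(math.log2(amax)) - 2)
    elems = []
    for v in block:
        mag = min(abs(v) / scale, FP4_GRID[-1])        # clamp to FP4 range
        q = min(FP4_GRID, key=lambda g: abs(g - mag))  # round to nearest grid point
        elems.append(math.copysign(q, v))
    return scale, elems

def dequantize_mxfp4_block(scale, elems):
    return [scale * e for e in elems]

scale, elems = quantize_mxfp4_block([1.0] * 32)
# scale = 0.25 and every element encodes as 4.0, so this block round-trips exactly
```

Values that fall between grid points (most of them, in practice) are rounded to the nearest representable magnitude, which is where the small accuracy loss measured below comes from.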
+
+ # Deployment
+
+ ## Use with vLLM
+
+ This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend.
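Once served (see the `vllm serve` command in the Reproduction section), the model is reachable through vLLM's OpenAI-compatible API. A sketch of a chat-completions request body; the sampling parameters are illustrative, and actually sending it requires a running server:

```python
import json

# Request body for the OpenAI-compatible /v1/chat/completions endpoint
# exposed by `vllm serve` (default port 8000).
payload = {
    "model": "amd/Kimi-K2-Thinking-MXFP4",
    "messages": [{"role": "user", "content": "Briefly explain MXFP4 quantization."}],
    "max_tokens": 512,
    "temperature": 0.6,
}
body = json.dumps(payload)
# POST with any HTTP client, e.g.:
#   requests.post("http://0.0.0.0:8000/v1/chat/completions",
#                 data=body, headers={"Content-Type": "application/json"})
```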
+
+ # Evaluation
+
+ The model was evaluated on the GSM8K benchmark.
+
+ ## Accuracy
+
+ | Benchmark | Kimi-K2-Thinking | Kimi-K2-Thinking-MXFP4 (this model) | Recovery |
+ |-----------|------------------|-------------------------------------|----------|
+ | GSM8K     | 94.16            | 93.48                               | 99.28%   |
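The recovery column is simply the quantized score expressed as a percentage of the baseline score:

```python
# Recovery = quantized accuracy / baseline accuracy, as a percentage
baseline, quantized = 94.16, 93.48
recovery = quantized / baseline * 100
print(f"{recovery:.2f}%")  # 99.28%
```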
+
+ ## Reproduction
+
+ The GSM8K results were obtained using the [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness.git) framework, based on the Docker image `rocm/vllm-dev:base`, with vLLM and lm-eval compiled and installed from source inside the container.
+
+ ### Launching the server
+ ```bash
+ export VLLM_ATTENTION_BACKEND="TRITON_MLA"
+ export VLLM_ROCM_USE_AITER=1
+ export VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS=0
+
+ vllm serve amd/Kimi-K2-Thinking-MXFP4 \
+     --tensor-parallel-size 8 \
+     --enable-auto-tool-choice \
+     --tool-call-parser kimi_k2 \
+     --reasoning-parser kimi_k2 \
+     --trust-remote-code
+ ```
+
+ ### Evaluating the model in a new terminal
+ ```bash
+ lm_eval \
+     --model local-completions \
+     --model_args "model=amd/Kimi-K2-Thinking-MXFP4,base_url=http://0.0.0.0:8000/v1/completions,tokenized_requests=False,tokenizer_backend=None,num_concurrent=32" \
+     --tasks gsm8k \
+     --num_fewshot 5 \
+     --batch_size 1
+ ```
+
+ # License
+
+ Modifications Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
chat_template.jinja ADDED
@@ -0,0 +1,97 @@
+ {%- macro render_content(msg) -%}
+ {%- set c = msg.get('content') -%}
+ {%- if c is string -%}
+ {{ c }}
+ {%- elif c is not none -%}
+ {% for content in c -%}
+ {% if content['type'] == 'image' or 'image' in content or 'image_url' in content -%}
+ <|media_start|>image<|media_content|><|media_pad|><|media_end|>
+ {% else -%}
+ {{ content['text'] }}
+ {%- endif -%}
+ {%- endfor -%}
+ {%- endif -%}
+ {%- endmacro -%}
+
+ {% macro set_roles(message) -%}
+ {%- set role_name = message.get('name') or message['role'] -%}
+ {%- if message['role'] == 'user' -%}
+ <|im_user|>{{role_name}}<|im_middle|>
+ {%- elif message['role'] == 'assistant' -%}
+ <|im_assistant|>{{role_name}}<|im_middle|>
+ {%- else -%}
+ <|im_system|>{{role_name}}<|im_middle|>
+ {%- endif -%}
+ {%- endmacro -%}
+
+
+ {%- macro render_toolcalls(message) -%}
+ <|tool_calls_section_begin|>
+ {%- for tool_call in message['tool_calls'] -%}
+ {%- set formatted_id = tool_call['id'] -%}
+ <|tool_call_begin|>{{ formatted_id }}<|tool_call_argument_begin|>{% if tool_call['function']['arguments'] is string %}{{ tool_call['function']['arguments'] }}{% else %}{{ tool_call['function']['arguments'] | tojson }}{% endif %}<|tool_call_end|>
+ {%- endfor -%}
+ <|tool_calls_section_end|>
+ {%- endmacro -%}
+
+
+ {# Find the last non-tool-call assistant message #}
+ {%- set ns = namespace(last_non_tool_call_assistant_msg=-1) -%}
+ {%- for idx in range(messages|length-1, -1, -1) -%}
+ {%- if messages[idx]['role'] == 'assistant' and not messages[idx].get('tool_calls') -%}
+ {%- set ns.last_non_tool_call_assistant_msg = idx -%}
+ {%- break -%}
+ {%- endif -%}
+ {%- endfor -%}
+
+ {# Split all messages into history & suffix; reasoning_content in the suffix should be preserved. #}
+ {%- set hist_msgs = messages[:ns.last_non_tool_call_assistant_msg+1] -%}
+ {%- set suffix_msgs = messages[ns.last_non_tool_call_assistant_msg+1:] -%}
+
+ {%- if tools -%}
+ <|im_system|>tool_declare<|im_middle|>{{ tools | tojson(separators=(',', ':')) }}<|im_end|>
+ {%- endif -%}
+
+ {%- if messages|length == 0 or messages[0]['role'] != 'system' -%}
+ <|im_system|>system<|im_middle|>You are Kimi, an AI assistant created by Moonshot AI.<|im_end|>
+ {%- endif -%}
+
+ {%- for message in hist_msgs -%}
+ {{set_roles(message)}}
+ {%- if message['role'] == 'assistant' -%}
+ <think></think>{{render_content(message)}}
+ {%- if message.get('tool_calls') -%}
+ {{render_toolcalls(message)}}
+ {%- endif -%}
+ {%- elif message['role'] == 'tool' -%}
+ {%- set tool_call_id = message.tool_call_id -%}
+ ## Return of {{ tool_call_id }}
+ {{render_content(message)}}
+ {%- elif message['content'] is not none -%}
+ {{render_content(message)}}
+ {%- endif -%}
+ <|im_end|>
+ {%- endfor -%}
+
+ {%- for message in suffix_msgs -%}
+ {{set_roles(message)}}
+ {%- if message['role'] == 'assistant' -%}
+ {%- set rc = message.get('reasoning_content', '') -%}
+ <think>{{rc}}</think>{{render_content(message)}}
+ {%- if message.get('tool_calls') -%}
+ {{render_toolcalls(message)}}
+ {%- endif -%}
+ {%- elif message['role'] == 'tool' -%}
+ {%- set tool_call_id = message.tool_call_id -%}
+ ## Return of {{ tool_call_id }}
+ {{render_content(message)}}
+ {%- elif message['content'] is not none -%}
+ {{render_content(message)}}
+ {%- endif -%}
+ <|im_end|>
+ {%- endfor -%}
+
+
+ {%- if add_generation_prompt -%}
+ <|im_assistant|>assistant<|im_middle|>
+ {%- endif -%}
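The template's history/suffix split (drop chain-of-thought for older turns, keep it from the last plain assistant turn onward) can be mirrored in plain Python to see which messages land where; the message list below is illustrative:

```python
def split_messages(messages):
    """Find the last assistant message without tool_calls; everything up to
    and including it is 'history' (reasoning dropped, rendered as an empty
    <think></think>), while everything after it keeps its reasoning_content."""
    last = -1
    for idx in range(len(messages) - 1, -1, -1):
        m = messages[idx]
        if m["role"] == "assistant" and not m.get("tool_calls"):
            last = idx
            break
    return messages[:last + 1], messages[last + 1:]

msgs = [
    {"role": "user", "content": "hi"},
    {"role": "assistant", "content": "hello", "reasoning_content": "greet the user"},
    {"role": "user", "content": "look this up"},
    {"role": "assistant", "content": "", "tool_calls": [{"id": "search.0"}],
     "reasoning_content": "need a tool"},
    {"role": "tool", "tool_call_id": "search.0", "content": "result"},
]
hist, suffix = split_messages(msgs)
# hist holds the first two turns; the in-flight tool-call exchange stays in
# suffix, so its reasoning_content survives into the rendered prompt.
```

This is why `--reasoning-parser kimi_k2` matters when serving: reasoning from the current tool-call loop is fed back to the model, while older reasoning is stripped to keep the context compact.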