Update model card with evaluation results
#6
by jingyux-nv - opened
README.md
CHANGED
@@ -102,12 +102,67 @@ This model was obtained by converting and quantizing the weights and activations
 ## Usage
 
-To serve this checkpoint with [vLLM](https://github.com/vllm-project/vllm), you can start the docker `vllm/vllm-openai:
+To serve this checkpoint with [vLLM](https://github.com/vllm-project/vllm), you can start the Docker image `vllm/vllm-openai:latest` and run the sample command below:
 
 ```sh
 python3 -m vllm.entrypoints.openai.api_server --model nvidia/Kimi-K2.5-NVFP4 --tensor-parallel-size 4 --tool-call-parser kimi_k2 --reasoning-parser kimi_k2 --trust-remote-code
 ```
 
+## Evaluation
+
+The accuracy benchmark results are presented in the table below:
+
+<table>
+  <tr>
+    <td><strong>Precision</strong></td>
+    <td><strong>MMLU Pro</strong></td>
+    <td><strong>LiveCodeBench V6</strong></td>
+    <td><strong>SciCode</strong></td>
+    <td><strong>AIME 2025</strong></td>
+  </tr>
+  <tr>
+    <td>Baseline (official)</td>
+    <td><strong>87.1</strong></td>
+    <td><strong>85.0</strong></td>
+    <td><strong>48.7</strong></td>
+    <td><strong>96.1</strong></td>
+  </tr>
+  <tr>
+    <td>Baseline (ours)</td>
+    <td><strong>86.9</strong></td>
+    <td><strong>84.7</strong></td>
+    <td><strong>47.7</strong></td>
+    <td><strong>96.5</strong></td>
+  </tr>
+  <tr>
+    <td>NVFP4</td>
+    <td><strong>87.3</strong></td>
+    <td><strong>84.0</strong></td>
+    <td><strong>48.7</strong></td>
+    <td><strong>96.3</strong></td>
+  </tr>
+</table>
+
+> Baseline (official) numbers are from the [Kimi-K2.5 model card](https://huggingface.co/moonshotai/Kimi-K2.5).
+> Evaluation settings follow the same configuration as described in the [Kimi-K2.5 model card](https://huggingface.co/moonshotai/Kimi-K2.5).
 
 ## Model Limitations:
 The base model was trained on data that contains toxic language and societal biases originally crawled from the internet. Therefore, the model may amplify those biases and return toxic responses, especially when prompted with toxic prompts. The model may also generate answers that are inaccurate, omit key information, or include irrelevant or redundant text, producing socially unacceptable or undesirable output even if the prompt itself does not include anything explicitly offensive.
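Once launched, the vLLM server started by the command above exposes an OpenAI-compatible API (on port 8000 by default). As a rough sketch, a chat-completion request body for this checkpoint might look like the following; the prompt text and sampling parameters are illustrative, not taken from the model card:

```python
import json

# Illustrative request body for vLLM's OpenAI-compatible
# /v1/chat/completions endpoint. The "model" field must match the
# --model flag passed to the server launch command.
payload = {
    "model": "nvidia/Kimi-K2.5-NVFP4",
    "messages": [
        {"role": "user", "content": "Summarize NVFP4 quantization in one sentence."}
    ],
    # Hypothetical sampling settings, shown only as an example.
    "temperature": 0.6,
    "max_tokens": 256,
}

# Serialize to JSON for the HTTP request body.
body = json.dumps(payload)
```

The serialized body can then be POSTed to `http://localhost:8000/v1/chat/completions`, for example with `curl` or any OpenAI-compatible client.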