shubhrapandit committed · Commit be95cd5 · verified · 1 Parent(s): 9681069

Update README.md

Files changed (1): README.md (+118 −1)
 
## Evaluation

The model was evaluated using [mistral-evals](https://github.com/neuralmagic/mistral-evals) for vision-related tasks and using [lm_evaluation_harness](https://github.com/neuralmagic/lm-evaluation-harness) for select text-based benchmarks. The evaluations were conducted using the following commands:

<details>
<summary>Evaluation Commands</summary>

### Vision Tasks
- vqav2
- docvqa
- mathvista
- mmmu
- chartqa

```
vllm serve neuralmagic/pixtral-12b-quantized.w4a16 --tensor_parallel_size 1 --max_model_len 25000 --trust_remote_code --max_num_seqs 8 --gpu_memory_utilization 0.9 --dtype float16 --limit_mm_per_prompt image=7

python -m eval.run eval_vllm \
  --model_name neuralmagic/pixtral-12b-quantized.w4a16 \
  --url http://0.0.0.0:8000 \
  --output_dir ~/tmp \
  --eval_name <vision_task_name>
```
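
Optionally, the served endpoint can be smoke-tested with a single multimodal request before launching the evaluations. The sketch below assumes the OpenAI-compatible API that `vllm serve` exposes on port 8000; the image URL is a placeholder, not part of any benchmark.

```
# Minimal smoke test against the vLLM server started above.
# Assumes the default OpenAI-compatible endpoint on port 8000;
# the image URL is a placeholder.
from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="neuralmagic/pixtral-12b-quantized.w4a16",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image in one sentence."},
            {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
        ],
    }],
    max_tokens=64,
)
print(response.choices[0].message.content)
```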

### Text-based Tasks
#### MMLU

```
lm_eval \
  --model vllm \
  --model_args pretrained="neuralmagic/pixtral-12b-quantized.w4a16",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=<n>,gpu_memory_utilization=0.8,enable_chunked_prefill=True,trust_remote_code=True \
  --tasks mmlu \
  --num_fewshot 5 \
  --batch_size auto \
  --output_path output_dir
```
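
The same MMLU run can also be driven from Python through the harness's `simple_evaluate` entry point. A minimal sketch, assuming a single GPU (`tensor_parallel_size=1`); adjust the model args to match your setup:

```
# Equivalent MMLU evaluation via the lm-evaluation-harness Python API.
# Assumption: single GPU (tensor_parallel_size=1).
import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args=(
        "pretrained=neuralmagic/pixtral-12b-quantized.w4a16,"
        "dtype=auto,add_bos_token=True,max_model_len=4096,"
        "tensor_parallel_size=1,gpu_memory_utilization=0.8,"
        "enable_chunked_prefill=True,trust_remote_code=True"
    ),
    tasks=["mmlu"],
    num_fewshot=5,
    batch_size="auto",
)
print(results["results"])  # per-task accuracies, including the mmlu aggregate
```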

#### HumanEval

##### Generation
```
python3 codegen/generate.py \
  --model neuralmagic/pixtral-12b-quantized.w4a16 \
  --bs 16 \
  --temperature 0.2 \
  --n_samples 50 \
  --root "." \
  --dataset humaneval
```

##### Sanitization
```
python3 evalplus/sanitize.py \
  humaneval/neuralmagic/pixtral-12b-quantized.w4a16_vllm_temp_0.2
```

##### Evaluation
```
evalplus.evaluate \
  --dataset humaneval \
  --samples humaneval/neuralmagic/pixtral-12b-quantized.w4a16_vllm_temp_0.2-sanitized
```
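
With 50 samples per problem at temperature 0.2, pass@1 is estimated with the unbiased estimator from the HumanEval paper (Chen et al., 2021). For reference, a minimal sketch of that estimator; this is illustrative only, since the `evalplus.evaluate` step above reports pass@1 itself:

```
# Unbiased pass@k estimator (Chen et al., 2021), shown for reference only.
# n = samples generated per problem, c = samples that pass, k = 1 here.
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k draws from n samples passes."""
    if n - c < k:
        return 1.0
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Hypothetical example: 50 samples for a task, 32 passing -> pass@1 = 0.64.
print(pass_at_k(n=50, c=32, k=1))
```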
</details>

### Accuracy

<table>
  <thead>
    <tr>
      <th>Category</th>
      <th>Metric</th>
      <th>mgoin/pixtral-12b</th>
      <th>neuralmagic/pixtral-12b-quantized.w4a16</th>
      <th>Recovery (%)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td rowspan="6"><b>Vision</b></td>
      <td>MMMU (val, CoT)<br><i>explicit_prompt_relaxed_correctness</i></td>
      <td>48.00</td>
      <td>44.67</td>
      <td>93.06%</td>
    </tr>
    <tr>
      <td>VQAv2 (val)<br><i>vqa_match</i></td>
      <td>78.71</td>
      <td>77.04</td>
      <td>97.88%</td>
    </tr>
    <tr>
      <td>DocVQA (val)<br><i>anls</i></td>
      <td>89.47</td>
      <td>89.02</td>
      <td>99.50%</td>
    </tr>
    <tr>
      <td>ChartQA (test, CoT)<br><i>anywhere_in_answer_relaxed_correctness</i></td>
      <td>81.68</td>
      <td>82.12</td>
      <td>100.54%</td>
    </tr>
    <tr>
      <td>MathVista (testmini, CoT)<br><i>explicit_prompt_relaxed_correctness</i></td>
      <td>56.50</td>
      <td>54.40</td>
      <td>96.28%</td>
    </tr>
    <tr>
      <td><b>Average Score</b></td>
      <td><b>70.07</b></td>
      <td><b>69.05</b></td>
      <td><b>98.54%</b></td>
    </tr>
    <tr>
      <td rowspan="2"><b>Text</b></td>
      <td>HumanEval<br><i>pass@1</i></td>
      <td>71.40</td>
      <td>63.80</td>
      <td>89.37%</td>
    </tr>
    <tr>
      <td>MMLU (5-shot)</td>
      <td>68.40</td>
      <td>65.56</td>
      <td>95.86%</td>
    </tr>
  </tbody>
</table>
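
Recovery is the quantized model's score expressed as a percentage of the unquantized baseline's score. A one-line check using the MMMU row of the table above:

```
# Recovery (%) = quantized score / baseline score * 100.
baseline, quantized = 48.00, 44.67  # MMMU (val, CoT) from the table above
print(f"{quantized / baseline * 100:.2f}%")  # 93.06%
```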

## Inference Performance