shubhrapandit committed (verified)
Commit 93ff13a · Parent(s): 0dfd32e

Update README.md

Files changed (1):
  README.md  +30 −22

README.md CHANGED
@@ -172,25 +172,28 @@ The following performance benchmarks were conducted with [vLLM](https://docs.vll
 <th></th>
 <th></th>
 <th></th>
+<th></th>
 <th style="text-align: center;" colspan="2" >Document Visual Question Answering<br>1680W x 2240H<br>64/128</th>
 <th style="text-align: center;" colspan="2" >Visual Reasoning <br>640W x 480H<br>128/128</th>
 <th style="text-align: center;" colspan="2" >Image Captioning<br>480W x 360H<br>0/128</th>
 </tr>
 <tr>
 <th>Hardware</th>
+<th>Number of GPUs</th>
 <th>Model</th>
 <th>Average Cost Reduction</th>
 <th>Latency (s)</th>
-<th>QPD</th>
+<th>Queries Per Dollar</th>
 <th>Latency (s)</th>
-<th>QPD</th>
+<th>Queries Per Dollar</th>
 <th>Latency (s)</th>
-<th>QPD</th>
+<th>Queries Per Dollar</th>
 </tr>
 </thead>
 <tbody style="text-align: center">
 <tr>
-<td>A100x4</td>
+<th rowspan="3" valign="top">A100</th>
+<td>4</td>
 <td>nm-testing/Pixtral-Large-Instruct-2411-hf</td>
 <td></td>
 <td>7.5</td>
@@ -201,7 +204,7 @@ The following performance benchmarks were conducted with [vLLM](https://docs.vll
 <td>79</td>
 </tr>
 <tr>
-<td>A100x2</td>
+<td>2</td>
 <td>nm-testing/Pixtral-Large-Instruct-2411-hf-quantized.w8a8</td>
 <td>1.86</td>
 <td>8.1</td>
@@ -212,7 +215,7 @@ The following performance benchmarks were conducted with [vLLM](https://docs.vll
 <td>148</td>
 </tr>
 <tr>
-<td>A100x2</td>
+<td>2</td>
 <td>nm-testing/Pixtral-Large-Instruct-2411-hf-quantized.w4a16</td>
 <td>2.52</td>
 <td>6.9</td>
@@ -223,7 +226,8 @@ The following performance benchmarks were conducted with [vLLM](https://docs.vll
 <td>221</td>
 </tr>
 <tr>
-<td>H100x4</td>
+<th rowspan="3" valign="top">H100</th>
+<td>4</td>
 <td>nm-testing/Pixtral-Large-Instruct-2411-hf</td>
 <td></td>
 <td>4.4</td>
@@ -234,7 +238,7 @@ The following performance benchmarks were conducted with [vLLM](https://docs.vll
 <td>79</td>
 </tr>
 <tr>
-<td>H100x2</td>
+<td>2</td>
 <td>nm-testing/Pixtral-Large-Instruct-2411-hf-FP8-Dynamic</td>
 <td>1.82</td>
 <td>4.7</td>
@@ -245,7 +249,7 @@ The following performance benchmarks were conducted with [vLLM](https://docs.vll
 <td>145</td>
 </tr>
 <tr>
-<td>H100x2</td>
+<td>2</td>
 <td>nm-testing/Pixtral-Large-Instruct-2411-hf-quantized.w4a16</td>
 <td>1.87</td>
 <td>4.7</td>
@@ -258,7 +262,9 @@ The following performance benchmarks were conducted with [vLLM](https://docs.vll
 </tbody>
 </table>
 
+**Use case profiles: Image Size (WxH) / prompt tokens / generation tokens
 
+**QPD: Queries per dollar, based on on-demand cost at [Lambda Labs](https://lambdalabs.com/service/gpu-cloud) (observed on 2/18/2025).
 
 ### Multi-stream asynchronous performance (measured with vLLM version 0.7.2)
 
@@ -277,16 +283,16 @@ The following performance benchmarks were conducted with [vLLM](https://docs.vll
 <th>Model</th>
 <th>Average Cost Reduction</th>
 <th>Maximum throughput (QPS)</th>
-<th>QPD</th>
+<th>Queries Per Dollar</th>
 <th>Maximum throughput (QPS)</th>
-<th>QPD</th>
+<th>Queries Per Dollar</th>
 <th>Maximum throughput (QPS)</th>
-<th>QPD</th>
+<th>Queries Per Dollar</th>
 </tr>
 </thead>
 <tbody style="text-align: center">
 <tr>
-<td>A100x4</td>
+<th rowspan="3" valign="top">A100x4</th>
 <td>nm-testing/Pixtral-Large-Instruct-2411-hf</td>
 <td></td>
 <td>0.4</td>
@@ -297,10 +303,9 @@ The following performance benchmarks were conducted with [vLLM](https://docs.vll
 <td>399</td>
 </tr>
 <tr>
-<td>A100x2</td>
 <td>nm-testing/Pixtral-Large-Instruct-2411-hf-quantized.w8a8</td>
 <td>1.70</td>
-<td>0.8</td>
+<td>1.6</td>
 <td>383</td>
 <td>1.1</td>
 <td>571</td>
@@ -308,10 +313,9 @@ The following performance benchmarks were conducted with [vLLM](https://docs.vll
 <td>674</td>
 </tr>
 <tr>
-<td>A100x2</td>
 <td>nm-testing/Pixtral-Large-Instruct-2411-hf-quantized.w4a16</td>
 <td>1.48</td>
-<td>0.5</td>
+<td>1.0</td>
 <td>276</td>
 <td>1.0</td>
 <td>505</td>
@@ -319,7 +323,7 @@ The following performance benchmarks were conducted with [vLLM](https://docs.vll
 <td>680</td>
 </tr>
 <tr>
-<td>H100x4</td>
+<th rowspan="3" valign="top">H100x4</th>
 <td>nm-testing/Pixtral-Large-Instruct-2411-hf</td>
 <td></td>
 <td>1.0</td>
@@ -330,10 +334,9 @@ The following performance benchmarks were conducted with [vLLM](https://docs.vll
 <td>511</td>
 </tr>
 <tr>
-<td>H100x2</td>
 <td>nm-testing/Pixtral-Large-Instruct-2411-hf-FP8-Dynamic</td>
 <td>1.61</td>
-<td>1.7</td>
+<td>3.4</td>
 <td>467</td>
 <td>2.6</td>
 <td>726</td>
@@ -341,10 +344,9 @@ The following performance benchmarks were conducted with [vLLM](https://docs.vll
 <td>908</td>
 </tr>
 <tr>
-<td>H100x2</td>
 <td>nm-testing/Pixtral-Large-Instruct-2411-hf-quantized.w4a16</td>
 <td>1.33</td>
-<td>1.4</td>
+<td>2.8</td>
 <td>393</td>
 <td>2.2</td>
 <td>634</td>
@@ -353,3 +355,9 @@ The following performance benchmarks were conducted with [vLLM](https://docs.vll
 </tr>
 </tbody>
 </table>
+
+**Use case profiles: Image Size (WxH) / prompt tokens / generation tokens
+
+**QPS: Queries per second.
+
+**QPD: Queries per dollar, based on on-demand cost at [Lambda Labs](https://lambdalabs.com/service/gpu-cloud) (observed on 2/18/2025).
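
A queries-per-dollar figure like the ones added in this commit can be derived from a measured throughput and an hourly on-demand GPU price. The sketch below shows the arithmetic; the `qps` and `cost_per_hour` values are illustrative placeholders, not the benchmark's actual measurements or Lambda Labs' actual rates:

```python
def queries_per_dollar(qps: float, cost_per_hour: float) -> int:
    """Queries served per dollar of GPU time at a sustained throughput of `qps`.

    qps           -- measured maximum throughput, queries per second
    cost_per_hour -- on-demand price of the whole GPU node, dollars per hour
    """
    queries_per_hour = qps * 3600  # 3600 seconds per hour
    return round(queries_per_hour / cost_per_hour)


# Example with assumed numbers: 0.4 QPS on a node billed at $5.16/hr.
print(queries_per_dollar(0.4, 5.16))  # -> 279
```

Note that cost reduction comes from both sides of this ratio: quantized variants raise the achievable QPS and run on fewer GPUs, which lowers `cost_per_hour` for the node.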