shubhrapandit committed on
Commit 5a8bb2f · verified · 1 Parent(s): 3bc918c

Update README.md

Files changed (1):
  1. README.md +37 -30
README.md CHANGED
@@ -554,25 +554,28 @@ The following performance benchmarks were conducted with [vLLM](https://docs.vll
554
  <th></th>
555
  <th></th>
556
  <th></th>
 
557
  <th style="text-align: center;" colspan="2" >Document Visual Question Answering<br>1680W x 2240H<br>64/128</th>
558
  <th style="text-align: center;" colspan="2" >Visual Reasoning <br>640W x 480H<br>128/128</th>
559
  <th style="text-align: center;" colspan="2" >Image Captioning<br>480W x 360H<br>0/128</th>
560
  </tr>
561
  <tr>
562
  <th>Hardware</th>
 
563
  <th>Model</th>
564
  <th>Average Cost Reduction</th>
565
  <th>Latency (s)</th>
566
- <th>QPD</th>
567
  <th>Latency (s)</th>
568
- <th>QPD</th>
569
  <th>Latency (s)</th>
570
- <th>QPD</th>
571
  </tr>
572
  </thead>
573
  <tbody style="text-align: center">
574
  <tr>
575
- <td>A100x4</td>
 
576
  <td>nm-testing/Pixtral-Large-Instruct-2411-hf</td>
577
  <td></td>
578
  <td>7.5</td>
@@ -583,7 +586,7 @@ The following performance benchmarks were conducted with [vLLM](https://docs.vll
583
  <td>79</td>
584
  </tr>
585
  <tr>
586
- <td>A100x2</td>
587
  <td>nm-testing/Pixtral-Large-Instruct-2411-hf-quantized.w8a8</td>
588
  <td>1.86</td>
589
  <td>8.1</td>
@@ -594,7 +597,7 @@ The following performance benchmarks were conducted with [vLLM](https://docs.vll
594
  <td>148</td>
595
  </tr>
596
  <tr>
597
- <td>A100x2</td>
598
  <td>nm-testing/Pixtral-Large-Instruct-2411-hf-quantized.w4a16</td>
599
  <td>2.52</td>
600
  <td>6.9</td>
@@ -605,7 +608,8 @@ The following performance benchmarks were conducted with [vLLM](https://docs.vll
605
  <td>221</td>
606
  </tr>
607
  <tr>
608
- <td>H100x4</td>
 
609
  <td>nm-testing/Pixtral-Large-Instruct-2411-hf</td>
610
  <td></td>
611
  <td>4.4</td>
@@ -616,7 +620,7 @@ The following performance benchmarks were conducted with [vLLM](https://docs.vll
616
  <td>79</td>
617
  </tr>
618
  <tr>
619
- <td>H100x2</td>
620
  <td>nm-testing/Pixtral-Large-Instruct-2411-hf-FP8-Dynamic</td>
621
  <td>1.82</td>
622
  <td>4.7</td>
@@ -627,7 +631,7 @@ The following performance benchmarks were conducted with [vLLM](https://docs.vll
627
  <td>145</td>
628
  </tr>
629
  <tr>
630
- <td>H100x2</td>
631
  <td>nm-testing/Pixtral-Large-Instruct-2411-hf-quantized.w4a16</td>
632
  <td>1.87</td>
633
  <td>4.7</td>
@@ -640,7 +644,9 @@ The following performance benchmarks were conducted with [vLLM](https://docs.vll
640
  </tbody>
641
  </table>
642
 
 
643
 
 
644
 
645
  ### Multi-stream asynchronous performance (measured with vLLM version 0.7.2)
646
 
@@ -659,16 +665,16 @@ The following performance benchmarks were conducted with [vLLM](https://docs.vll
659
  <th>Model</th>
660
  <th>Average Cost Reduction</th>
661
  <th>Maximum throughput (QPS)</th>
662
- <th>QPD</th>
663
  <th>Maximum throughput (QPS)</th>
664
- <th>QPD</th>
665
  <th>Maximum throughput (QPS)</th>
666
- <th>QPD</th>
667
  </tr>
668
  </thead>
669
  <tbody style="text-align: center">
670
  <tr>
671
- <td>A100x4</td>
672
  <td>nm-testing/Pixtral-Large-Instruct-2411-hf</td>
673
  <td></td>
674
  <td>0.4</td>
@@ -679,29 +685,27 @@ The following performance benchmarks were conducted with [vLLM](https://docs.vll
679
  <td>399</td>
680
  </tr>
681
  <tr>
682
- <td>A100x2</td>
683
  <td>nm-testing/Pixtral-Large-Instruct-2411-hf-quantized.w8a8</td>
684
  <td>1.70</td>
685
- <td>0.8</td>
686
  <td>383</td>
687
- <td>1.1</td>
688
  <td>571</td>
689
- <td>1.3</td>
690
  <td>674</td>
691
  </tr>
692
  <tr>
693
- <td>A100x2</td>
694
  <td>nm-testing/Pixtral-Large-Instruct-2411-hf-quantized.w4a16</td>
695
  <td>1.48</td>
696
- <td>0.5</td>
697
- <td>276</td>
698
  <td>1.0</td>
 
 
699
  <td>505</td>
700
- <td>1.4</td>
701
  <td>680</td>
702
  </tr>
703
  <tr>
704
- <td>H100x4</td>
705
  <td>nm-testing/Pixtral-Large-Instruct-2411-hf</td>
706
  <td></td>
707
  <td>1.0</td>
@@ -712,30 +716,33 @@ The following performance benchmarks were conducted with [vLLM](https://docs.vll
712
  <td>511</td>
713
  </tr>
714
  <tr>
715
- <td>H100x2</td>
716
  <td>nm-testing/Pixtral-Large-Instruct-2411-hf-FP8-Dynamic</td>
717
  <td>1.61</td>
718
- <td>1.7</td>
719
  <td>467</td>
720
- <td>2.6</td>
721
  <td>726</td>
722
- <td>3.2</td>
723
  <td>908</td>
724
  </tr>
725
  <tr>
726
- <td>H100x2</td>
727
  <td>nm-testing/Pixtral-Large-Instruct-2411-hf-quantized.w4a16</td>
728
  <td>1.33</td>
729
- <td>1.4</td>
730
  <td>393</td>
731
- <td>2.2</td>
732
  <td>634</td>
733
- <td>2.7</td>
734
  <td>764</td>
735
  </tr>
736
  </tbody>
737
  </table>
738
 
 
 
 
 
 
739
 
740
  ## The Mistral AI Team
741
 
 
554
  <th></th>
555
  <th></th>
556
  <th></th>
557
+ <th></th>
558
  <th style="text-align: center;" colspan="2" >Document Visual Question Answering<br>1680W x 2240H<br>64/128</th>
559
  <th style="text-align: center;" colspan="2" >Visual Reasoning <br>640W x 480H<br>128/128</th>
560
  <th style="text-align: center;" colspan="2" >Image Captioning<br>480W x 360H<br>0/128</th>
561
  </tr>
562
  <tr>
563
  <th>Hardware</th>
564
+ <th>Number of GPUs</th>
565
  <th>Model</th>
566
  <th>Average Cost Reduction</th>
567
  <th>Latency (s)</th>
568
+ <th>Queries Per Dollar</th>
569
  <th>Latency (s)</th>
570
+ <th>Queries Per Dollar</th>
571
  <th>Latency (s)</th>
572
+ <th>Queries Per Dollar</th>
573
  </tr>
574
  </thead>
575
  <tbody style="text-align: center">
576
  <tr>
577
+ <th rowspan="3" valign="top">A100</th>
578
+ <td>4</td>
579
  <td>nm-testing/Pixtral-Large-Instruct-2411-hf</td>
580
  <td></td>
581
  <td>7.5</td>
 
586
  <td>79</td>
587
  </tr>
588
  <tr>
589
+ <td>2</td>
590
  <td>nm-testing/Pixtral-Large-Instruct-2411-hf-quantized.w8a8</td>
591
  <td>1.86</td>
592
  <td>8.1</td>
 
597
  <td>148</td>
598
  </tr>
599
  <tr>
600
+ <td>2</td>
601
  <td>nm-testing/Pixtral-Large-Instruct-2411-hf-quantized.w4a16</td>
602
  <td>2.52</td>
603
  <td>6.9</td>
 
608
  <td>221</td>
609
  </tr>
610
  <tr>
611
+ <th rowspan="3" valign="top">H100</th>
612
+ <td>4</td>
613
  <td>nm-testing/Pixtral-Large-Instruct-2411-hf</td>
614
  <td></td>
615
  <td>4.4</td>
 
620
  <td>79</td>
621
  </tr>
622
  <tr>
623
+ <td>2</td>
624
  <td>nm-testing/Pixtral-Large-Instruct-2411-hf-FP8-Dynamic</td>
625
  <td>1.82</td>
626
  <td>4.7</td>
 
631
  <td>145</td>
632
  </tr>
633
  <tr>
634
+ <td>2</td>
635
  <td>nm-testing/Pixtral-Large-Instruct-2411-hf-quantized.w4a16</td>
636
  <td>1.87</td>
637
  <td>4.7</td>
 
644
  </tbody>
645
  </table>
646
 
647
+ **Use case profiles: Image Size (WxH) / prompt tokens / generation tokens
648
 
649
+ **QPD: Queries per dollar, based on on-demand cost at [Lambda Labs](https://lambdalabs.com/service/gpu-cloud) (observed on 2/18/2025).
650
 
651
  ### Multi-stream asynchronous performance (measured with vLLM version 0.7.2)
652
 
 
665
  <th>Model</th>
666
  <th>Average Cost Reduction</th>
667
  <th>Maximum throughput (QPS)</th>
668
+ <th>Queries Per Dollar</th>
669
  <th>Maximum throughput (QPS)</th>
670
+ <th>Queries Per Dollar</th>
671
  <th>Maximum throughput (QPS)</th>
672
+ <th>Queries Per Dollar</th>
673
  </tr>
674
  </thead>
675
  <tbody style="text-align: center">
676
  <tr>
677
+ <th rowspan="3" valign="top">A100x4</th>
678
  <td>nm-testing/Pixtral-Large-Instruct-2411-hf</td>
679
  <td></td>
680
  <td>0.4</td>
 
685
  <td>399</td>
686
  </tr>
687
  <tr>
 
688
  <td>nm-testing/Pixtral-Large-Instruct-2411-hf-quantized.w8a8</td>
689
  <td>1.70</td>
690
+ <td>1.6</td>
691
  <td>383</td>
692
+ <td>2.2</td>
693
  <td>571</td>
694
+ <td>2.6</td>
695
  <td>674</td>
696
  </tr>
697
  <tr>
 
698
  <td>nm-testing/Pixtral-Large-Instruct-2411-hf-quantized.w4a16</td>
699
  <td>1.48</td>
 
 
700
  <td>1.0</td>
701
+ <td>276</td>
702
+ <td>2.0</td>
703
  <td>505</td>
704
+ <td>2.8</td>
705
  <td>680</td>
706
  </tr>
707
  <tr>
708
+ <th rowspan="3" valign="top">H100x4</th>
709
  <td>nm-testing/Pixtral-Large-Instruct-2411-hf</td>
710
  <td></td>
711
  <td>1.0</td>
 
716
  <td>511</td>
717
  </tr>
718
  <tr>
 
719
  <td>nm-testing/Pixtral-Large-Instruct-2411-hf-FP8-Dynamic</td>
720
  <td>1.61</td>
721
+ <td>3.4</td>
722
  <td>467</td>
723
+ <td>5.2</td>
724
  <td>726</td>
725
+ <td>6.4</td>
726
  <td>908</td>
727
  </tr>
728
  <tr>
 
729
  <td>nm-testing/Pixtral-Large-Instruct-2411-hf-quantized.w4a16</td>
730
  <td>1.33</td>
731
+ <td>2.8</td>
732
  <td>393</td>
733
+ <td>4.4</td>
734
  <td>634</td>
735
+ <td>5.4</td>
736
  <td>764</td>
737
  </tr>
738
  </tbody>
739
  </table>
740
 
741
+ **Use case profiles: Image Size (WxH) / prompt tokens / generation tokens
742
+
743
+ **QPS: Queries per second.
744
+
745
+ **QPD: Queries per dollar, based on on-demand cost at [Lambda Labs](https://lambdalabs.com/service/gpu-cloud) (observed on 2/18/2025).
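The QPD footnotes define queries per dollar as throughput normalized by on-demand instance cost. The arithmetic can be sketched as follows; note the hourly rates used here are hypothetical placeholders, not Lambda Labs' actual pricing, so the outputs illustrate the formula rather than reproduce the table values exactly:

```python
# Sketch of the queries-per-dollar (QPD) arithmetic behind the tables.
# The hourly rate below is a hypothetical placeholder; the README's
# figures use Lambda Labs on-demand pricing observed on 2/18/2025.

def qpd_from_latency(latency_s: float, hourly_cost_usd: float) -> float:
    """Single-stream: one query at a time, so queries/hour = 3600 / latency."""
    return (3600.0 / latency_s) / hourly_cost_usd

def qpd_from_qps(qps: float, hourly_cost_usd: float) -> float:
    """Multi-stream: queries/hour = QPS * 3600."""
    return (qps * 3600.0) / hourly_cost_usd

if __name__ == "__main__":
    # Hypothetical $5.00/hr for a 4-GPU node:
    print(round(qpd_from_latency(7.5, 5.0)))  # 96
    print(round(qpd_from_qps(0.4, 5.0)))      # 288
```

Average cost reduction in the tables follows the same logic: it is the ratio of the quantized model's QPD to the baseline's, averaged across the three use-case profiles.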
746
 
747
  ## The Mistral AI Team
748