Update README.md
README.md
CHANGED
|
@@ -554,25 +554,28 @@ The following performance benchmarks were conducted with [vLLM](https://docs.vll
|
|
| 554 |
<th></th>
|
| 555 |
<th></th>
|
| 556 |
<th></th>
|
|
|
|
| 557 |
<th style="text-align: center;" colspan="2" >Document Visual Question Answering<br>1680W x 2240H<br>64/128</th>
|
| 558 |
<th style="text-align: center;" colspan="2" >Visual Reasoning <br>640W x 480H<br>128/128</th>
|
| 559 |
<th style="text-align: center;" colspan="2" >Image Captioning<br>480W x 360H<br>0/128</th>
|
| 560 |
</tr>
|
| 561 |
<tr>
|
| 562 |
<th>Hardware</th>
|
|
|
|
| 563 |
<th>Model</th>
|
| 564 |
<th>Average Cost Reduction</th>
|
| 565 |
<th>Latency (s)</th>
|
| 566 |
-
<th>
|
| 567 |
<th>Latency (s)</th>
|
| 568 |
-
<th>
|
| 569 |
<th>Latency (s)</th>
|
| 570 |
-
<th>
|
| 571 |
</tr>
|
| 572 |
</thead>
|
| 573 |
<tbody style="text-align: center">
|
| 574 |
<tr>
|
| 575 |
-
<
|
|
|
|
| 576 |
<td>nm-testing/Pixtral-Large-Instruct-2411-hf</td>
|
| 577 |
<td></td>
|
| 578 |
<td>7.5</td>
|
|
@@ -583,7 +586,7 @@ The following performance benchmarks were conducted with [vLLM](https://docs.vll
|
|
| 583 |
<td>79</td>
|
| 584 |
</tr>
|
| 585 |
<tr>
|
| 586 |
-
<td>
|
| 587 |
<td>nm-testing/Pixtral-Large-Instruct-2411-hf-quantized.w8a8</td>
|
| 588 |
<td>1.86</td>
|
| 589 |
<td>8.1</td>
|
|
@@ -594,7 +597,7 @@ The following performance benchmarks were conducted with [vLLM](https://docs.vll
|
|
| 594 |
<td>148</td>
|
| 595 |
</tr>
|
| 596 |
<tr>
|
| 597 |
-
<td>
|
| 598 |
<td>nm-testing/Pixtral-Large-Instruct-2411-hf-quantized.w4a16</td>
|
| 599 |
<td>2.52</td>
|
| 600 |
<td>6.9</td>
|
|
@@ -605,7 +608,8 @@ The following performance benchmarks were conducted with [vLLM](https://docs.vll
|
|
| 605 |
<td>221</td>
|
| 606 |
</tr>
|
| 607 |
<tr>
|
| 608 |
-
<
|
|
|
|
| 609 |
<td>nm-testing/Pixtral-Large-Instruct-2411-hf</td>
|
| 610 |
<td></td>
|
| 611 |
<td>4.4</td>
|
|
@@ -616,7 +620,7 @@ The following performance benchmarks were conducted with [vLLM](https://docs.vll
|
|
| 616 |
<td>79</td>
|
| 617 |
</tr>
|
| 618 |
<tr>
|
| 619 |
-
<td>
|
| 620 |
<td>nm-testing/Pixtral-Large-Instruct-2411-hf-FP8-Dynamic</td>
|
| 621 |
<td>1.82</td>
|
| 622 |
<td>4.7</td>
|
|
@@ -627,7 +631,7 @@ The following performance benchmarks were conducted with [vLLM](https://docs.vll
|
|
| 627 |
<td>145</td>
|
| 628 |
</tr>
|
| 629 |
<tr>
|
| 630 |
-
<td>
|
| 631 |
<td>nm-testing/Pixtral-Large-Instruct-2411-hf-quantized.w4a16</td>
|
| 632 |
<td>1.87</td>
|
| 633 |
<td>4.7</td>
|
|
@@ -640,7 +644,9 @@ The following performance benchmarks were conducted with [vLLM](https://docs.vll
|
|
| 640 |
</tbody>
|
| 641 |
</table>
|
| 642 |
|
|
|
|
| 643 |
|
|
|
|
| 644 |
|
| 645 |
### Multi-stream asynchronous performance (measured with vLLM version 0.7.2)
|
| 646 |
|
|
@@ -659,16 +665,16 @@ The following performance benchmarks were conducted with [vLLM](https://docs.vll
|
|
| 659 |
<th>Model</th>
|
| 660 |
<th>Average Cost Reduction</th>
|
| 661 |
<th>Maximum throughput (QPS)</th>
|
| 662 |
-
<th>
|
| 663 |
<th>Maximum throughput (QPS)</th>
|
| 664 |
-
<th>
|
| 665 |
<th>Maximum throughput (QPS)</th>
|
| 666 |
-
<th>
|
| 667 |
</tr>
|
| 668 |
</thead>
|
| 669 |
<tbody style="text-align: center">
|
| 670 |
<tr>
|
| 671 |
-
<
|
| 672 |
<td>nm-testing/Pixtral-Large-Instruct-2411-hf</td>
|
| 673 |
<td></td>
|
| 674 |
<td>0.4</td>
|
|
@@ -679,29 +685,27 @@ The following performance benchmarks were conducted with [vLLM](https://docs.vll
|
|
| 679 |
<td>399</td>
|
| 680 |
</tr>
|
| 681 |
<tr>
|
| 682 |
-
<td>A100x2</td>
|
| 683 |
<td>nm-testing/Pixtral-Large-Instruct-2411-hf-quantized.w8a8</td>
|
| 684 |
<td>1.70</td>
|
| 685 |
-
<td>
|
| 686 |
<td>383</td>
|
| 687 |
-
<td>
|
| 688 |
<td>571</td>
|
| 689 |
-
<td>
|
| 690 |
<td>674</td>
|
| 691 |
</tr>
|
| 692 |
<tr>
|
| 693 |
-
<td>A100x2</td>
|
| 694 |
<td>nm-testing/Pixtral-Large-Instruct-2411-hf-quantized.w4a16</td>
|
| 695 |
<td>1.48</td>
|
| 696 |
-
<td>0.5</td>
|
| 697 |
-
<td>276</td>
|
| 698 |
<td>1.0</td>
|
|
|
|
|
|
|
| 699 |
<td>505</td>
|
| 700 |
-
<td>
|
| 701 |
<td>680</td>
|
| 702 |
</tr>
|
| 703 |
<tr>
|
| 704 |
-
|
| 705 |
<td>nm-testing/Pixtral-Large-Instruct-2411-hf</td>
|
| 706 |
<td></td>
|
| 707 |
<td>1.0</td>
|
|
@@ -712,30 +716,33 @@ The following performance benchmarks were conducted with [vLLM](https://docs.vll
|
|
| 712 |
<td>511</td>
|
| 713 |
</tr>
|
| 714 |
<tr>
|
| 715 |
-
<td>H100x2</td>
|
| 716 |
<td>nm-testing/Pixtral-Large-Instruct-2411-hf-FP8-Dynamic</td>
|
| 717 |
<td>1.61</td>
|
| 718 |
-
<td>
|
| 719 |
<td>467</td>
|
| 720 |
-
<td>2
|
| 721 |
<td>726</td>
|
| 722 |
-
<td>
|
| 723 |
<td>908</td>
|
| 724 |
</tr>
|
| 725 |
<tr>
|
| 726 |
-
<td>H100x2</td>
|
| 727 |
<td>nm-testing/Pixtral-Large-Instruct-2411-hf-quantized.w4a16</td>
|
| 728 |
<td>1.33</td>
|
| 729 |
-
<td>
|
| 730 |
<td>393</td>
|
| 731 |
-
<td>
|
| 732 |
<td>634</td>
|
| 733 |
-
<td>
|
| 734 |
<td>764</td>
|
| 735 |
</tr>
|
| 736 |
</tbody>
|
| 737 |
</table>
|
| 738 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 739 |
|
| 740 |
## The Mistral AI Team
|
| 741 |
|
|
|
|
| 554 |
<th></th>
|
| 555 |
<th></th>
|
| 556 |
<th></th>
|
| 557 |
+
<th></th>
|
| 558 |
<th style="text-align: center;" colspan="2" >Document Visual Question Answering<br>1680W x 2240H<br>64/128</th>
|
| 559 |
<th style="text-align: center;" colspan="2" >Visual Reasoning <br>640W x 480H<br>128/128</th>
|
| 560 |
<th style="text-align: center;" colspan="2" >Image Captioning<br>480W x 360H<br>0/128</th>
|
| 561 |
</tr>
|
| 562 |
<tr>
|
| 563 |
<th>Hardware</th>
|
| 564 |
+
<th>Number of GPUs</th>
|
| 565 |
<th>Model</th>
|
| 566 |
<th>Average Cost Reduction</th>
|
| 567 |
<th>Latency (s)</th>
|
| 568 |
+
<th>Queries Per Dollar</th>
|
| 569 |
<th>Latency (s)</th>
|
| 570 |
+
<th>Queries Per Dollar</th>
|
| 571 |
<th>Latency (s)</th>
|
| 572 |
+
<th>Queries Per Dollar</th>
|
| 573 |
</tr>
|
| 574 |
</thead>
|
| 575 |
<tbody style="text-align: center">
|
| 576 |
<tr>
|
| 577 |
+
<th rowspan="3" valign="top">A100</th>
|
| 578 |
+
<td>4</td>
|
| 579 |
<td>nm-testing/Pixtral-Large-Instruct-2411-hf</td>
|
| 580 |
<td></td>
|
| 581 |
<td>7.5</td>
|
|
|
|
| 586 |
<td>79</td>
|
| 587 |
</tr>
|
| 588 |
<tr>
|
| 589 |
+
<td>2</td>
|
| 590 |
<td>nm-testing/Pixtral-Large-Instruct-2411-hf-quantized.w8a8</td>
|
| 591 |
<td>1.86</td>
|
| 592 |
<td>8.1</td>
|
|
|
|
| 597 |
<td>148</td>
|
| 598 |
</tr>
|
| 599 |
<tr>
|
| 600 |
+
<td>2</td>
|
| 601 |
<td>nm-testing/Pixtral-Large-Instruct-2411-hf-quantized.w4a16</td>
|
| 602 |
<td>2.52</td>
|
| 603 |
<td>6.9</td>
|
|
|
|
| 608 |
<td>221</td>
|
| 609 |
</tr>
|
| 610 |
<tr>
|
| 611 |
+
<th rowspan="3" valign="top">H100</th>
|
| 612 |
+
<td>4</td>
|
| 613 |
<td>nm-testing/Pixtral-Large-Instruct-2411-hf</td>
|
| 614 |
<td></td>
|
| 615 |
<td>4.4</td>
|
|
|
|
| 620 |
<td>79</td>
|
| 621 |
</tr>
|
| 622 |
<tr>
|
| 623 |
+
<td>2</td>
|
| 624 |
<td>nm-testing/Pixtral-Large-Instruct-2411-hf-FP8-Dynamic</td>
|
| 625 |
<td>1.82</td>
|
| 626 |
<td>4.7</td>
|
|
|
|
| 631 |
<td>145</td>
|
| 632 |
</tr>
|
| 633 |
<tr>
|
| 634 |
+
<td>2</td>
|
| 635 |
<td>nm-testing/Pixtral-Large-Instruct-2411-hf-quantized.w4a16</td>
|
| 636 |
<td>1.87</td>
|
| 637 |
<td>4.7</td>
|
|
|
|
| 644 |
</tbody>
|
| 645 |
</table>
|
| 646 |
|
| 647 |
+
**Use case profiles: Image Size (WxH) / prompt tokens / generation tokens
|
| 648 |
|
| 649 |
+
**QPD: Queries per dollar, based on on-demand cost at [Lambda Labs](https://lambdalabs.com/service/gpu-cloud) (observed on 2/18/2025).
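
As a rough illustration of how a queries-per-dollar figure can be derived from a single-stream latency, here is a minimal sketch. It assumes the deployment serves one request at a time, so queries per hour is 3600 divided by the latency, and that cost scales linearly with the number of GPUs. The function name and the hourly price are hypothetical placeholders, not the values or the method used to produce the table.

```python
def qpd_from_latency(latency_s: float, num_gpus: int, gpu_hourly_usd: float) -> float:
    """Queries per dollar for a single-stream deployment (illustrative only).

    latency_s:      seconds to answer one query
    num_gpus:       GPUs the deployment occupies
    gpu_hourly_usd: on-demand price per GPU per hour (hypothetical input)
    """
    if latency_s <= 0 or num_gpus <= 0 or gpu_hourly_usd <= 0:
        raise ValueError("latency, GPU count, and price must all be positive")
    queries_per_hour = 3600.0 / latency_s         # one query in flight at a time
    dollars_per_hour = num_gpus * gpu_hourly_usd  # assumes linear per-GPU pricing
    return queries_per_hour / dollars_per_hour


if __name__ == "__main__":
    price = 2.0  # hypothetical USD per GPU-hour, not a quoted Lambda Labs price
    baseline = qpd_from_latency(7.5, num_gpus=4, gpu_hourly_usd=price)
    quantized = qpd_from_latency(8.1, num_gpus=2, gpu_hourly_usd=price)
    print(f"baseline QPD:  {baseline:.1f}")
    print(f"quantized QPD: {quantized:.1f}")
    print(f"ratio:         {quantized / baseline:.2f}")
```

Under these assumptions the per-GPU price cancels out of the ratio, so a model that reaches a similar latency on 2 GPUs instead of 4 comes out at roughly twice the queries per dollar.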
|
| 650 |
|
| 651 |
### Multi-stream asynchronous performance (measured with vLLM version 0.7.2)
|
| 652 |
|
|
|
|
| 665 |
<th>Model</th>
|
| 666 |
<th>Average Cost Reduction</th>
|
| 667 |
<th>Maximum throughput (QPS)</th>
|
| 668 |
+
<th>Queries Per Dollar</th>
|
| 669 |
<th>Maximum throughput (QPS)</th>
|
| 670 |
+
<th>Queries Per Dollar</th>
|
| 671 |
<th>Maximum throughput (QPS)</th>
|
| 672 |
+
<th>Queries Per Dollar</th>
|
| 673 |
</tr>
|
| 674 |
</thead>
|
| 675 |
<tbody style="text-align: center">
|
| 676 |
<tr>
|
| 677 |
+
<th rowspan="3" valign="top">A100x4</th>
|
| 678 |
<td>nm-testing/Pixtral-Large-Instruct-2411-hf</td>
|
| 679 |
<td></td>
|
| 680 |
<td>0.4</td>
|
|
|
|
| 685 |
<td>399</td>
|
| 686 |
</tr>
|
| 687 |
<tr>
|
|
|
|
| 688 |
<td>nm-testing/Pixtral-Large-Instruct-2411-hf-quantized.w8a8</td>
|
| 689 |
<td>1.70</td>
|
| 690 |
+
<td>1.6</td>
|
| 691 |
<td>383</td>
|
| 692 |
+
<td>2.2</td>
|
| 693 |
<td>571</td>
|
| 694 |
+
<td>2.6</td>
|
| 695 |
<td>674</td>
|
| 696 |
</tr>
|
| 697 |
<tr>
|
|
|
|
| 698 |
<td>nm-testing/Pixtral-Large-Instruct-2411-hf-quantized.w4a16</td>
|
| 699 |
<td>1.48</td>
|
|
|
|
|
|
|
| 700 |
<td>1.0</td>
|
| 701 |
+
<td>276</td>
|
| 702 |
+
<td>2.0</td>
|
| 703 |
<td>505</td>
|
| 704 |
+
<td>2.8</td>
|
| 705 |
<td>680</td>
|
| 706 |
</tr>
|
| 707 |
<tr>
|
| 708 |
+
<th rowspan="3" valign="top">H100x4</th>
|
| 709 |
<td>nm-testing/Pixtral-Large-Instruct-2411-hf</td>
|
| 710 |
<td></td>
|
| 711 |
<td>1.0</td>
|
|
|
|
| 716 |
<td>511</td>
|
| 717 |
</tr>
|
| 718 |
<tr>
|
|
|
|
| 719 |
<td>nm-testing/Pixtral-Large-Instruct-2411-hf-FP8-Dynamic</td>
|
| 720 |
<td>1.61</td>
|
| 721 |
+
<td>3.4</td>
|
| 722 |
<td>467</td>
|
| 723 |
+
<td>5.2</td>
|
| 724 |
<td>726</td>
|
| 725 |
+
<td>6.4</td>
|
| 726 |
<td>908</td>
|
| 727 |
</tr>
|
| 728 |
<tr>
|
|
|
|
| 729 |
<td>nm-testing/Pixtral-Large-Instruct-2411-hf-quantized.w4a16</td>
|
| 730 |
<td>1.33</td>
|
| 731 |
+
<td>2.8</td>
|
| 732 |
<td>393</td>
|
| 733 |
+
<td>4.4</td>
|
| 734 |
<td>634</td>
|
| 735 |
+
<td>5.4</td>
|
| 736 |
<td>764</td>
|
| 737 |
</tr>
|
| 738 |
</tbody>
|
| 739 |
</table>
|
| 740 |
|
| 741 |
+
**Use case profiles: Image Size (WxH) / prompt tokens / generation tokens
|
| 742 |
+
|
| 743 |
+
**QPS: Queries per second.
|
| 744 |
+
|
| 745 |
+
**QPD: Queries per dollar, based on on-demand cost at [Lambda Labs](https://lambdalabs.com/service/gpu-cloud) (observed on 2/18/2025).
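
The multi-stream metrics defined above can be related with a small calculation. The sketch below converts a sustained throughput in QPS to queries per dollar, then averages the per-profile QPD ratios against a baseline. Treating "Average Cost Reduction" as the mean of those ratios is an assumption made for illustration, since the README does not state the formula. All function names, prices, and sample numbers are hypothetical.

```python
from typing import Sequence


def qpd_from_throughput(qps: float, num_gpus: int, gpu_hourly_usd: float) -> float:
    """Queries per dollar at a sustained throughput (illustrative only)."""
    if qps < 0 or num_gpus <= 0 or gpu_hourly_usd <= 0:
        raise ValueError("throughput must be non-negative; GPU count and price positive")
    queries_per_hour = qps * 3600.0
    dollars_per_hour = num_gpus * gpu_hourly_usd  # assumes linear per-GPU pricing
    return queries_per_hour / dollars_per_hour


def average_cost_reduction(baseline_qpd: Sequence[float],
                           candidate_qpd: Sequence[float]) -> float:
    """Mean of per-profile QPD ratios (assumed definition, not from the README)."""
    if len(baseline_qpd) != len(candidate_qpd) or not baseline_qpd:
        raise ValueError("need the same non-zero number of profiles on both sides")
    ratios = [c / b for b, c in zip(baseline_qpd, candidate_qpd)]
    return sum(ratios) / len(ratios)


if __name__ == "__main__":
    price = 1.5  # hypothetical USD per GPU-hour
    # Hypothetical throughputs for three use-case profiles.
    base = [qpd_from_throughput(q, 4, price) for q in (0.5, 1.0, 2.0)]
    cand = [qpd_from_throughput(q, 4, price) for q in (1.0, 1.5, 3.0)]
    print(f"average cost reduction: {average_cost_reduction(base, cand):.2f}")
```

When both deployments use the same hardware, the price cancels and the ratio reduces to a throughput ratio. The price only matters when comparing across GPU types or GPU counts.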
|
| 746 |
|
| 747 |
## The Mistral AI Team
|
| 748 |
|