|
|
--- |
|
|
tags: |
|
|
- w8a8 |
|
|
- int8 |
|
|
- vllm |
|
|
- vision |
|
|
|
|
|
license: other |
|
|
license_name: mrl |
|
|
inference: false |
|
|
license_link: https://mistral.ai/licenses/MRL-0.1.md |
|
|
extra_gated_prompt: >- |
|
|
# Mistral AI Research License |
|
|
|
|
|
If You want to use a Mistral Model, a Derivative or an Output for any purpose |
|
|
that is not expressly authorized under this Agreement, You must request a |
|
|
license from Mistral AI, which Mistral AI may grant to You in Mistral AI's |
|
|
sole discretion. To discuss such a license, please contact Mistral AI via the |
|
|
website contact form: https://mistral.ai/contact/ |
|
|
|
|
|
|
|
|
|
|
|
**1.1. Scope of the Agreement.** This Agreement applies to any use, |
|
|
modification, or Distribution of any Mistral Model by You, regardless of the |
|
|
source You obtained a copy of such Mistral Model. |
|
|
|
|
|
**1.2. Acceptance.** By accessing, using, modifying, Distributing a Mistral |
|
|
Model, or by creating, using or distributing a Derivative of the Mistral |
|
|
Model, You agree to be bound by this Agreement. |
|
|
|
|
|
**1.3. Acceptance on behalf of a third-party.** If You accept this Agreement |
|
|
on behalf of Your employer or another person or entity, You warrant and |
|
|
represent that You have the authority to act and accept this Agreement on |
|
|
their behalf. In such a case, the word "You" in this Agreement will refer to |
|
|
Your employer or such other person or entity. |
|
|
|
|
|
|
|
|
|
|
|
**2.1. Grant of rights**. Subject to Section 3 below, Mistral AI hereby |
|
|
grants You a non-exclusive, royalty-free, worldwide, non-sublicensable, |
|
|
non-transferable, limited license to use, copy, modify, and Distribute under |
|
|
the conditions provided in Section 2.2 below, the Mistral Model and any |
|
|
Derivatives made by or for Mistral AI and to create Derivatives of the Mistral |
|
|
Model. |
|
|
|
|
|
**2.2. Distribution of Mistral Model and Derivatives made by or for Mistral |
|
|
AI.** Subject to Section 3 below, You may Distribute copies of the Mistral |
|
|
Model and/or Derivatives made by or for Mistral AI, under the following |
|
|
conditions: You must make available a copy of this Agreement to third-party |
|
|
recipients of the Mistral Models and/or Derivatives made by or for Mistral AI |
|
|
you Distribute, it being specified that any rights to use the Mistral Models |
|
|
and/or Derivatives made by or for Mistral AI shall be directly granted by |
|
|
Mistral AI to said third-party recipients pursuant to the Mistral AI Research |
|
|
License agreement executed between these parties; You must retain in all |
|
|
copies of the Mistral Models the following attribution notice within a |
|
|
"Notice" text file distributed as part of such copies: "Licensed by Mistral AI |
|
|
under the Mistral AI Research License". |
|
|
|
|
|
**2.3. Distribution of Derivatives made by or for You.** Subject to Section 3 |
|
|
below, You may Distribute any Derivatives made by or for You under additional |
|
|
or different terms and conditions, provided that: In any event, the use and |
|
|
modification of Mistral Model and/or Derivatives made by or for Mistral AI |
|
|
shall remain governed by the terms and conditions of this Agreement; You |
|
|
include in any such Derivatives made by or for You prominent notices stating |
|
|
that You modified the concerned Mistral Model; and Any terms and conditions |
|
|
You impose on any third-party recipients relating to Derivatives made by or |
|
|
for You shall neither limit such third-party recipients' use of the Mistral |
|
|
Model or any Derivatives made by or for Mistral AI in accordance with the |
|
|
Mistral AI Research License nor conflict with any of its terms and conditions. |
|
|
|
|
|
|
|
|
|
|
|
**3.1. Misrepresentation.** You must not misrepresent or imply, through any |
|
|
means, that the Derivatives made by or for You and/or any modified version of |
|
|
the Mistral Model You Distribute under your name and responsibility is an |
|
|
official product of Mistral AI or has been endorsed, approved or validated by |
|
|
Mistral AI, unless You are authorized by Us to do so in writing. |
|
|
|
|
|
**3.2. Usage Limitation.** You shall only use the Mistral Models, Derivatives |
|
|
(whether or not created by Mistral AI) and Outputs for Research Purposes. |
|
|
|
|
|
|
|
|
|
|
|
**4.1. Trademarks.** No trademark licenses are granted under this Agreement, |
|
|
and in connection with the Mistral Models, You may not use any name or mark |
|
|
owned by or associated with Mistral AI or any of its affiliates, except (i) as |
|
|
required for reasonable and customary use in describing and Distributing the |
|
|
Mistral Models and Derivatives made by or for Mistral AI and (ii) for |
|
|
attribution purposes as required by this Agreement. |
|
|
|
|
|
**4.2. Outputs.** We claim no ownership rights in and to the Outputs. You are |
|
|
solely responsible for the Outputs You generate and their subsequent uses in |
|
|
accordance with this Agreement. Any Outputs shall be subject to the |
|
|
restrictions set out in Section 3 of this Agreement. |
|
|
|
|
|
**4.3. Derivatives.** By entering into this Agreement, You accept that any |
|
|
Derivatives that You may create or that may be created for You shall be |
|
|
subject to the restrictions set out in Section 3 of this Agreement. |
|
|
|
|
|
|
|
|
|
|
|
**5.1. Limitation of liability.** In no event, unless required by applicable |
|
|
law (such as deliberate and grossly negligent acts) or agreed to in writing, |
|
|
shall Mistral AI be liable to You for damages, including any direct, indirect, |
|
|
special, incidental, or consequential damages of any character arising as a |
|
|
result of this Agreement or out of the use or inability to use the Mistral |
|
|
Models and Derivatives (including but not limited to damages for loss of data, |
|
|
loss of goodwill, loss of expected profit or savings, work stoppage, computer |
|
|
failure or malfunction, or any damage caused by malware or security breaches), |
|
|
even if Mistral AI has been advised of the possibility of such damages. |
|
|
|
|
|
**5.2. Indemnification.** You agree to indemnify and hold harmless Mistral AI |
|
|
from and against any claims, damages, or losses arising out of or related to |
|
|
Your use or Distribution of the Mistral Models and Derivatives. |
|
|
|
|
|
|
|
|
|
|
|
**6.1. Disclaimer.** Unless required by applicable law or prior agreed to by |
|
|
Mistral AI in writing, Mistral AI provides the Mistral Models and Derivatives |
|
|
on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either |
|
|
express or implied, including, without limitation, any warranties or |
|
|
conditions of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A |
|
|
PARTICULAR PURPOSE. Mistral AI does not represent nor warrant that the Mistral |
|
|
Models and Derivatives will be error-free, meet Your or any third party's |
|
|
requirements, be secure or will allow You or any third party to achieve any |
|
|
kind of result or generate any kind of content. You are solely responsible for |
|
|
determining the appropriateness of using or Distributing the Mistral Models |
|
|
and Derivatives and assume any risks associated with Your exercise of rights |
|
|
under this Agreement. |
|
|
|
|
|
|
|
|
|
|
|
**7.1. Term.** This Agreement is effective as of the date of your acceptance |
|
|
of this Agreement or access to the concerned Mistral Models or Derivatives and |
|
|
will continue until terminated in accordance with the following terms. |
|
|
|
|
|
**7.2. Termination.** Mistral AI may terminate this Agreement at any time if |
|
|
You are in breach of this Agreement. Upon termination of this Agreement, You |
|
|
must cease to use all Mistral Models and Derivatives and shall permanently |
|
|
delete any copy thereof. The following provisions, in their relevant parts, |
|
|
will survive any termination or expiration of this Agreement, each for the |
|
|
duration necessary to achieve its own intended purpose (e.g. the liability |
|
|
provision will survive until the end of the applicable limitation |
|
|
period):Sections 5 (Liability), 6(Warranty), 7 (Termination) and 8 (General |
|
|
Provisions). |
|
|
|
|
|
**7.3. Litigation.** If You initiate any legal action or proceedings against |
|
|
Us or any other entity (including a cross-claim or counterclaim in a lawsuit), |
|
|
alleging that the Model or a Derivative, or any part thereof, infringe upon |
|
|
intellectual property or other rights owned or licensable by You, then any |
|
|
licenses granted to You under this Agreement will immediately terminate as of |
|
|
the date such legal action or claim is filed or initiated. |
|
|
|
|
|
|
|
|
|
|
|
**8.1. Governing laws.** This Agreement will be governed by the laws of |
|
|
France, without regard to choice of law principles, and the UN Convention on |
|
|
Contracts for the International Sale of Goods does not apply to this |
|
|
Agreement. |
|
|
|
|
|
**8.2. Competent jurisdiction.** The courts of Paris shall have exclusive |
|
|
jurisdiction of any dispute arising out of this Agreement. |
|
|
|
|
|
**8.3. Severability.** If any provision of this Agreement is held to be |
|
|
invalid, illegal or unenforceable, the remaining provisions shall be |
|
|
unaffected thereby and remain valid as if such provision had not been set |
|
|
forth herein. |
|
|
|
|
|
|
|
|
|
|
|
"Agreement": means this Mistral AI Research License agreement governing the |
|
|
access, use, and Distribution of the Mistral Models, Derivatives and Outputs. |
|
|
|
|
|
"Derivative": means any (i) modified version of the Mistral Model (including |
|
|
but not limited to any customized or fine-tuned version thereof), (ii) work |
|
|
based on the Mistral Model, or (iii) any other derivative work thereof. |
|
|
|
|
|
"Distribution", "Distributing", "Distribute" or "Distributed": means |
|
|
supplying, providing or making available, by any means, a copy of the Mistral |
|
|
Models and/or the Derivatives as the case may be, subject to Section 3 of this |
|
|
Agreement. |
|
|
|
|
|
"Mistral AI", "We" or "Us": means Mistral AI, a French société par actions |
|
|
simplifiée registered in the Paris commercial registry under the number 952 |
|
|
418 325, and having its registered seat at 15, rue des Halles, 75001 Paris. |
|
|
|
|
|
"Mistral Model": means the foundational large language model(s), and its |
|
|
elements which include algorithms, software, instructed checkpoints, |
|
|
parameters, source code (inference code, evaluation code and, if applicable, |
|
|
fine-tuning code) and any other elements associated thereto made available by |
|
|
Mistral AI under this Agreement, including, if any, the technical |
|
|
documentation, manuals and instructions for the use and operation thereof. |
|
|
|
|
|
"Research Purposes": means any use of a Mistral Model, Derivative, or Output |
|
|
that is solely for (a) personal, scientific or academic research, and (b) for |
|
|
non-profit and non-commercial purposes, and not directly or indirectly |
|
|
connected to any commercial activities or business operations. For |
|
|
illustration purposes, Research Purposes does not include (1) any usage of the |
|
|
Mistral Model, Derivative or Output by individuals or contractors employed in |
|
|
or engaged by companies in the context of (a) their daily tasks, or (b) any |
|
|
activity (including but not limited to any testing or proof-of-concept) that |
|
|
is intended to generate revenue, nor (2) any Distribution by a commercial |
|
|
entity of the Mistral Model, Derivative or Output whether in return for |
|
|
payment or free of charge, in any medium or form, including but not limited to |
|
|
through a hosted or managed service (e.g. SaaS, cloud instances, etc.), or |
|
|
behind a software layer. |
|
|
|
|
|
"Outputs": means any content generated by the operation of the Mistral Models |
|
|
or the Derivatives from a prompt (i.e., text instructions) provided by users. |
|
|
For the avoidance of doubt, Outputs do not include any components of a Mistral |
|
|
Models, such as any fine-tuned versions of the Mistral Models, the weights, or |
|
|
parameters. |
|
|
|
|
|
"You": means the individual or entity entering into this Agreement with |
|
|
Mistral AI. |
|
|
|
|
|
|
|
|
*Mistral AI processes your personal data below to provide the model and |
|
|
enforce its license. If you are affiliated with a commercial entity, we may |
|
|
also send you communications about our models. For more information on your |
|
|
rights and data handling, please see our <a |
|
|
href="https://mistral.ai/terms/">privacy policy</a>.* |
|
|
extra_gated_fields: |
|
|
First Name: text |
|
|
Last Name: text |
|
|
Country: country |
|
|
Affiliation: text |
|
|
Job title: text |
|
|
I understand that I can only use the model, any derivative versions and their outputs for non-commercial research purposes: checkbox |
|
|
I understand that if I am a commercial entity, I am not permitted to use or distribute the model internally or externally, or expose it in my own offerings without a commercial license: checkbox |
|
|
I understand that if I upload the model, or any derivative version, on any platform, I must include the Mistral Research License: checkbox |
|
|
I understand that for commercial use of the model, I can contact Mistral or use the Mistral AI API on la Plateforme or any of our cloud provider partners: checkbox |
|
|
By clicking Submit below I accept the terms of the license and acknowledge that the information I provide will be collected stored processed and shared in accordance with the Mistral Privacy Policy: checkbox |
|
|
geo: ip_location |
|
|
extra_gated_description: >- |
|
|
Mistral AI processes your personal data below to provide the model and enforce |
|
|
its license. If you are affiliated with a commercial entity, we may also send |
|
|
you communications about our models. For more information on your rights and |
|
|
data handling, please see our <a href="https://mistral.ai/terms/">privacy |
|
|
policy</a>. |
|
|
extra_gated_button_content: Submit |
|
|
library_name: vllm |
|
|
pipeline_tag: image-text-to-text |
|
|
language: |
|
|
- en |
|
|
- fr |
|
|
- de |
|
|
- es |
|
|
- it |
|
|
- pt |
|
|
- zh |
|
|
- ja |
|
|
- ru |
|
|
- ko |
|
|
base_model: neuralmagic/Pixtral-Large-Instruct-2411-hf |
|
|
--- |
|
|
|
|
|
# Pixtral-Large-Instruct-2411-hf-quantized.w8a8 |
|
|
|
|
|
## Model Overview |
|
|
- **Model Architecture:** neuralmagic/Pixtral-Large-Instruct-2411-hf |
|
|
- **Input:** Vision-Text |
|
|
- **Output:** Text |
|
|
- **Model Optimizations:** |
|
|
- **Weight quantization:** INT8 |
|
|
- **Activation quantization:** INT8 |
|
|
- **Release Date:** 2/24/2025 |
|
|
- **Version:** 1.0 |
|
|
- **Model Developers:** Neural Magic |
|
|
|
|
|
Quantized version of [neuralmagic/Pixtral-Large-Instruct-2411-hf](https://huggingface.co/neuralmagic/Pixtral-Large-Instruct-2411-hf/tree/main). |
|
|
|
|
|
### Model Optimizations |
|
|
|
|
|
This model was obtained by quantizing the weights and activations of [neuralmagic/Pixtral-Large-Instruct-2411-hf](https://huggingface.co/neuralmagic/Pixtral-Large-Instruct-2411-hf/tree/main) to the INT8 data type, ready for inference with vLLM >= 0.5.2.
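
To make the idea of INT8 quantization concrete, here is a minimal sketch of symmetric round-to-nearest quantization on a small list of values. It is an illustration only: the helper names `quantize_int8` and `dequantize_int8` are hypothetical, the single per-tensor scale is an assumption, and the released model was produced with GPTQ, which does more than plain rounding.

```python
def quantize_int8(values):
    """Symmetric round-to-nearest INT8 quantization with one shared scale."""
    max_abs = max(abs(v) for v in values)
    # Map the largest magnitude to 127; fall back to 1.0 for an all-zero input.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-128, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize_int8(quantized, scale):
    """Recover approximate float values from INT8 codes."""
    return [q * scale for q in quantized]


weights = [0.50, -1.27, 0.03, 0.90]  # toy values, not taken from the model
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

The rounding error per value is bounded by half the scale, which is why a well-chosen scale matters for accuracy recovery.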
|
|
|
|
|
## Deployment |
|
|
|
|
|
### Use with vLLM |
|
|
|
|
|
This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below. |
|
|
|
|
|
```python
from vllm.assets.image import ImageAsset
from vllm import LLM, SamplingParams

# prepare model
llm = LLM(
    model="neuralmagic/Pixtral-Large-Instruct-2411-hf-quantized.w8a8",
    trust_remote_code=True,
    max_model_len=4096,
    max_num_seqs=2,
)

# prepare inputs
question = "What is the content of this image?"
inputs = {
    # Pixtral (HF) instruct format; [IMG] marks where the image is inserted.
    "prompt": f"<s>[INST]{question}\n[IMG][/INST]",
    "multi_modal_data": {
        "image": ImageAsset("cherry_blossom").pil_image.convert("RGB")
    },
}

# generate response
print("========== SAMPLE GENERATION ==============")
outputs = llm.generate(inputs, SamplingParams(temperature=0.2, max_tokens=64))
print(f"PROMPT  : {outputs[0].prompt}")
print(f"RESPONSE: {outputs[0].outputs[0].text}")
print("==========================================")
```
|
|
|
|
|
vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details. |
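
As a rough illustration of OpenAI-compatible serving, the sketch below builds a chat-completions request body with one text part and one inline image. The helper `build_chat_request` and the placeholder image bytes are hypothetical; the assumption is that the server was started with `vllm serve` and accepts the standard `image_url` content part with a base64 data URL.

```python
import base64
import json


def build_chat_request(model, question, image_bytes, mime="image/png", max_tokens=64):
    """Build a chat-completions payload with one text part and one image part."""
    image_b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "max_tokens": max_tokens,
        "temperature": 0.2,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:{mime};base64,{image_b64}"},
                    },
                ],
            }
        ],
    }


payload = build_chat_request(
    "neuralmagic/Pixtral-Large-Instruct-2411-hf-quantized.w8a8",
    "What is the content of this image?",
    b"placeholder-image-bytes",  # stand-in; read a real image file in practice
)
body = json.dumps(payload)
print(body[:120])
```

The resulting JSON would be sent as a POST request to the server's `/v1/chat/completions` endpoint, for example `http://localhost:8000/v1/chat/completions` when using the default port.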
|
|
|
|
|
## Creation |
|
|
|
|
|
This model was created with [llm-compressor](https://github.com/vllm-project/llm-compressor) by running the code snippet below as part of a multimodal announcement blog.
|
|
|
|
|
<details> |
|
|
<summary>Model Creation Code</summary> |
|
|
|
|
|
```python
import requests
import torch
from PIL import Image
from transformers import AutoProcessor
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot
from llmcompressor.transformers.tracing import TraceableLlavaForConditionalGeneration

# Load model.
model_id = "neuralmagic/Pixtral-Large-Instruct-2411-hf"
model = TraceableLlavaForConditionalGeneration.from_pretrained(
    model_id, device_map="auto", torch_dtype="auto"
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Oneshot arguments
DATASET_ID = "flickr30k"
DATASET_SPLIT = {"calibration": "test[:512]"}
NUM_CALIBRATION_SAMPLES = 512
MAX_SEQUENCE_LENGTH = 2048


# Define a oneshot data collator for multimodal inputs.
def data_collator(batch):
    # Calibration runs one sample at a time.
    assert len(batch) == 1
    return {
        "input_ids": torch.LongTensor(batch[0]["input_ids"]),
        "attention_mask": torch.tensor(batch[0]["attention_mask"]),
        "pixel_values": torch.tensor(batch[0]["pixel_values"]),
    }


# Recipe: quantize the language model's Linear layers to W8A8,
# leaving the LM head, vision tower, and projector unquantized.
recipe = [
    GPTQModifier(
        targets="Linear",
        scheme="W8A8",
        sequential_targets=["MistralDecoderLayer"],
        ignore=["re:.*lm_head", "re:vision_tower.*", "re:multi_modal_projector.*"],
    ),
]

SAVE_DIR = f"{model_id.split('/')[1]}-quantized.w8a8"

# Perform oneshot
oneshot(
    model=model,
    tokenizer=model_id,
    dataset=DATASET_ID,
    splits=DATASET_SPLIT,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    trust_remote_code_model=True,
    data_collator=data_collator,
    output_dir=SAVE_DIR,
)
```
|
|
</details> |
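
The `ignore` list in the recipe uses `re:`-prefixed patterns to keep some modules unquantized. The sketch below shows one way such patterns could select modules. It is not llm-compressor's implementation: the helper `is_ignored`, the matching rule (regex matched from the start of the module name), and the module names are all assumptions made for illustration.

```python
import re

IGNORE = ["re:.*lm_head", "re:vision_tower.*", "re:multi_modal_projector.*"]


def is_ignored(module_name, patterns=IGNORE):
    """Return True if a module name matches any ignore pattern."""
    for pattern in patterns:
        if pattern.startswith("re:"):
            # Assumed semantics: the regex is matched from the start of the name.
            if re.match(pattern[3:], module_name):
                return True
        elif pattern == module_name:
            return True
    return False


# Hypothetical module names, not read from the actual checkpoint.
names = [
    "language_model.model.layers.0.self_attn.q_proj",
    "language_model.lm_head",
    "vision_tower.transformer.layers.0.attention.q_proj",
    "multi_modal_projector.linear_1",
]
quantized = [name for name in names if not is_ignored(name)]
print(quantized)
```

Under these assumptions only the language model's decoder projections are quantized, while the LM head, vision tower, and multimodal projector keep their original precision.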
|
|
|
|
|
## Evaluation |
|
|
|
|
|
The model was evaluated using [mistral-evals](https://github.com/neuralmagic/mistral-evals) for vision-related tasks and using [lm_evaluation_harness](https://github.com/neuralmagic/lm-evaluation-harness) for select text-based benchmarks. The evaluations were conducted using the following commands: |
|
|
|
|
|
<details> |
|
|
<summary>Evaluation Commands</summary> |
|
|
|
|
|
### Vision Tasks |
|
|
- vqav2 |
|
|
- docvqa |
|
|
- mathvista |
|
|
- mmmu |
|
|
- chartqa |
|
|
|
|
|
```
vllm serve neuralmagic/Pixtral-Large-Instruct-2411-hf-quantized.w8a8 --tensor_parallel_size 1 --max_model_len 25000 --trust_remote_code --max_num_seqs 8 --gpu_memory_utilization 0.9 --dtype float16 --limit_mm_per_prompt image=7

python -m eval.run eval_vllm \
  --model_name neuralmagic/Pixtral-Large-Instruct-2411-hf-quantized.w8a8 \
  --url http://0.0.0.0:8000 \
  --output_dir ~/tmp \
  --eval_name <vision_task_name>
```
|
|
|
|
|
### Text-based Tasks |
|
|
#### MMLU |
|
|
|
|
|
```
lm_eval \
  --model vllm \
  --model_args pretrained="<model_name>",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=<n>,gpu_memory_utilization=0.8,enable_chunked_prefill=True,trust_remote_code=True \
  --tasks mmlu \
  --num_fewshot 5 \
  --batch_size auto \
  --output_path output_dir
```
|
|
|
|
|
#### MGSM |
|
|
|
|
|
```
lm_eval \
  --model vllm \
  --model_args pretrained="<model_name>",dtype=auto,max_model_len=4096,max_gen_toks=2048,max_num_seqs=128,tensor_parallel_size=<n>,gpu_memory_utilization=0.9 \
  --tasks mgsm_cot_native \
  --apply_chat_template \
  --num_fewshot 0 \
  --batch_size auto \
  --output_path output_dir
```
|
|
</details> |
|
|
|
|
|
|
|
|
### Accuracy |
|
|
|
|
|
<table> |
|
|
<thead> |
|
|
<tr> |
|
|
<th>Category</th> |
|
|
<th>Metric</th> |
|
|
<th>neuralmagic/Pixtral-Large-Instruct-2411-hf</th> |
|
|
<th>neuralmagic/Pixtral-Large-Instruct-2411-hf-quantized.w8a8</th> |
|
|
<th>Recovery (%)</th> |
|
|
</tr> |
|
|
</thead> |
|
|
<tbody> |
|
|
<tr> |
|
|
<td rowspan="6"><b>Vision</b></td> |
|
|
<td>MMMU (val, CoT)<br><i>explicit_prompt_relaxed_correctness</i></td> |
|
|
<td>63.56</td> |
|
|
<td>63.89</td> |
|
|
<td>100.52%</td> |
|
|
</tr> |
|
|
<tr> |
|
|
<td>VQAv2 (val)<br><i>vqa_match</i></td> |
|
|
<td>79.03</td> |
|
|
<td>79.12</td> |
|
|
<td>100.11%</td> |
|
|
</tr> |
|
|
<tr> |
|
|
<td>DocVQA (val)<br><i>anls</i></td> |
|
|
<td>89.55</td> |
|
|
<td>89.80</td> |
|
|
<td>100.28%</td> |
|
|
</tr> |
|
|
<tr> |
|
|
<td>ChartQA (test, CoT)<br><i>anywhere_in_answer_relaxed_correctness</i></td> |
|
|
<td>82.24</td> |
|
|
<td>80.44</td> |
|
|
<td>97.81%</td> |
|
|
</tr> |
|
|
<tr> |
|
|
<td>Mathvista (testmini, CoT)<br><i>explicit_prompt_relaxed_correctness</i></td> |
|
|
<td>67.3</td> |
|
|
<td>66.50</td> |
|
|
<td>98.81%</td> |
|
|
</tr> |
|
|
<tr> |
|
|
<td><b>Average Score</b></td> |
|
|
<td><b>76.34</b></td> |
|
|
<td><b>75.95</b></td> |
|
|
<td><b>99.49%</b></td> |
|
|
</tr> |
|
|
<tr> |
|
|
<td rowspan="2"><b>Text</b></td> |
|
|
<td>MGSM (CoT)</td> |
|
|
<td>76.05</td> |
|
|
<td>74.76</td> |
|
|
<td>98.30%</td> |
|
|
</tr> |
|
|
<tr> |
|
|
<td>MMLU (5-shot)</td> |
|
|
<td>82.8</td> |
|
|
<td>82.9</td> |
|
|
<td>100.12%</td> |
|
|
</tr> |
|
|
</tbody> |
|
|
</table> |
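
The Recovery column can be reproduced from the two score columns. The short sketch below recomputes it for the vision benchmarks, assuming recovery is the quantized score divided by the baseline score, expressed as a percentage; the function name `recovery` is a hypothetical helper.

```python
def recovery(baseline, quantized):
    """Quantized score as a percentage of the baseline score."""
    return round(100.0 * quantized / baseline, 2)


# (baseline, quantized) scores copied from the vision rows of the table above.
vision = {
    "MMMU": (63.56, 63.89),
    "VQAv2": (79.03, 79.12),
    "DocVQA": (89.55, 89.80),
    "ChartQA": (82.24, 80.44),
    "Mathvista": (67.3, 66.50),
}

per_task = {name: recovery(b, q) for name, (b, q) in vision.items()}
avg_baseline = sum(b for b, _ in vision.values()) / len(vision)
avg_quantized = sum(q for _, q in vision.values()) / len(vision)

print(per_task)
print(round(avg_baseline, 2), round(avg_quantized, 2), recovery(avg_baseline, avg_quantized))
```

Values above 100% mean the quantized model scored slightly higher than the baseline on that benchmark, which is within the run-to-run variation one would expect for differences this small.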
|
|
|
|
|
|
|
|
## Inference Performance |
|
|
|
|
|
|
|
|
This model achieves up to 1.87x speedup in single-stream deployment and up to 2.0x speedup in multi-stream asynchronous deployment, depending on hardware and use-case scenario. |
|
|
The following performance benchmarks were conducted with [vLLM](https://docs.vllm.ai/en/latest/) version 0.7.2, and [GuideLLM](https://github.com/neuralmagic/guidellm). |
|
|
|
|
|
<details> |
|
|
<summary>Benchmarking Command</summary> |
|
|
```
guidellm --model neuralmagic/Pixtral-Large-Instruct-2411-hf-quantized.w8a8 --target "http://localhost:8000/v1" --data-type emulated --data prompt_tokens=<prompt_tokens>,generated_tokens=<generated_tokens>,images=<num_images>,width=<image_width>,height=<image_height> --max-seconds 120 --backend aiohttp_server
```
|
|
|
|
|
</details> |
|
|
|
|
|
### Single-stream performance (measured with vLLM version 0.7.2) |
|
|
|
|
|
<table border="1" class="dataframe"> |
|
|
<thead> |
|
|
<tr> |
|
|
<th></th> |
|
|
<th></th> |
|
|
<th></th> |
|
|
<th></th> |
|
|
<th style="text-align: center;" colspan="2" >Document Visual Question Answering<br>1680W x 2240H<br>64/128</th> |
|
|
<th style="text-align: center;" colspan="2" >Visual Reasoning <br>640W x 480H<br>128/128</th> |
|
|
<th style="text-align: center;" colspan="2" >Image Captioning<br>480W x 360H<br>0/128</th> |
|
|
</tr> |
|
|
<tr> |
|
|
<th>Hardware</th> |
|
|
<th>Number of GPUs</th> |
|
|
<th>Model</th> |
|
|
<th>Average Cost Reduction</th> |
|
|
<th>Latency (s)</th> |
|
|
<th>Queries Per Dollar</th> |
|
|
<th>Latency (s)</th> |
|
|
<th>Queries Per Dollar</th> |
|
|
<th>Latency (s)</th> |
|
|
<th>Queries Per Dollar</th> |
|
|
</tr> |
|
|
</thead> |
|
|
<tbody style="text-align: center"> |
|
|
<tr> |
|
|
<th rowspan="3" valign="top">A100</th> |
|
|
<td>4</td> |
|
|
<td>neuralmagic/Pixtral-Large-Instruct-2411-hf</td> |
|
|
<td></td> |
|
|
<td>7.5</td> |
|
|
<td>67</td> |
|
|
<td>6.5</td> |
|
|
<td>77</td> |
|
|
<td>6.4</td> |
|
|
<td>79</td> |
|
|
</tr> |
|
|
<tr> |
|
|
<td>2</td> |
|
|
<td>neuralmagic/Pixtral-Large-Instruct-2411-hf-quantized.w8a8</td> |
|
|
<td>1.86</td> |
|
|
<td>8.1</td> |
|
|
<td>124</td> |
|
|
<td>7.1</td> |
|
|
<td>142</td> |
|
|
<td>6.8</td> |
|
|
<td>148</td> |
|
|
</tr> |
|
|
<tr> |
|
|
<td>2</td> |
|
|
<td>neuralmagic/Pixtral-Large-Instruct-2411-hf-quantized.w4a16</td> |
|
|
<td>2.52</td> |
|
|
<td>6.9</td> |
|
|
<td>147</td> |
|
|
<td>5.1</td> |
|
|
<td>199</td> |
|
|
<td>4.5</td> |
|
|
<td>221</td> |
|
|
</tr> |
|
|
<tr> |
|
|
<th rowspan="3" valign="top">H100</th> |
|
|
<td>4</td> |
|
|
<td>neuralmagic/Pixtral-Large-Instruct-2411-hf</td> |
|
|
<td></td> |
|
|
<td>4.4</td> |
|
|
<td>67</td> |
|
|
<td>3.9</td> |
|
|
<td>74</td> |
|
|
<td>3.7</td> |
|
|
<td>79</td> |
|
|
</tr> |
|
|
<tr> |
|
|
<td>2</td> |
|
|
<td>neuralmagic/Pixtral-Large-Instruct-2411-hf-FP8-Dynamic</td> |
|
|
<td>1.82</td> |
|
|
<td>4.7</td> |
|
|
<td>120</td> |
|
|
<td>4.1</td> |
|
|
<td>137</td> |
|
|
<td>3.9</td> |
|
|
<td>145</td> |
|
|
</tr> |
|
|
<tr> |
|
|
<td>2</td> |
|
|
<td>neuralmagic/Pixtral-Large-Instruct-2411-hf-quantized.w4a16</td> |
|
|
<td>1.87</td> |
|
|
<td>4.7</td> |
|
|
<td>120</td> |
|
|
<td>3.9</td> |
|
|
<td>144</td> |
|
|
<td>3.8</td> |
|
|
<td>149</td> |
|
|
</tr> |
|
|
</tbody> |
|
|
</table> |
|
|
|
|
|
**Use case profiles:** Image Size (WxH) / prompt tokens / generation tokens

**QPD:** Queries per dollar, based on on-demand cost at [Lambda Labs](https://lambdalabs.com/service/gpu-cloud) (observed on 2/18/2025).
|
|
|
|
|
### Multi-stream asynchronous performance (measured with vLLM version 0.7.2) |
|
|
|
|
|
<table border="1" class="dataframe"> |
|
|
<thead> |
|
|
<tr> |
|
|
<th></th> |
|
|
<th></th> |
|
|
<th></th> |
|
|
<th style="text-align: center;" colspan="2" >Document Visual Question Answering<br>1680W x 2240H<br>64/128</th> |
|
|
<th style="text-align: center;" colspan="2" >Visual Reasoning <br>640W x 480H<br>128/128</th> |
|
|
<th style="text-align: center;" colspan="2" >Image Captioning<br>480W x 360H<br>0/128</th> |
|
|
</tr> |
|
|
<tr> |
|
|
<th>Hardware</th> |
|
|
<th>Model</th> |
|
|
<th>Average Cost Reduction</th> |
|
|
<th>Maximum throughput (QPS)</th> |
|
|
<th>Queries Per Dollar</th> |
|
|
<th>Maximum throughput (QPS)</th> |
|
|
<th>Queries Per Dollar</th> |
|
|
<th>Maximum throughput (QPS)</th> |
|
|
<th>Queries Per Dollar</th> |
|
|
</tr> |
|
|
</thead> |
|
|
<tbody style="text-align: center"> |
|
|
<tr> |
|
|
<th rowspan="3" valign="top">A100x4</th> |
|
|
<td>neuralmagic/Pixtral-Large-Instruct-2411-hf</td> |
|
|
<td></td> |
|
|
<td>0.4</td> |
|
|
<td>222</td> |
|
|
<td>0.7</td> |
|
|
<td>341</td> |
|
|
<td>0.8</td> |
|
|
<td>399</td> |
|
|
</tr> |
|
|
<tr> |
|
|
<td>neuralmagic/Pixtral-Large-Instruct-2411-hf-quantized.w8a8</td> |
|
|
<td>1.70</td> |
|
|
<td>0.8</td> |
|
|
<td>383</td> |
|
|
<td>1.1</td> |
|
|
<td>571</td> |
|
|
<td>1.3</td> |
|
|
<td>674</td> |
|
|
</tr> |
|
|
<tr> |
|
|
<td>neuralmagic/Pixtral-Large-Instruct-2411-hf-quantized.w4a16</td> |
|
|
<td>1.48</td> |
|
|
<td>0.5</td> |
|
|
<td>276</td> |
|
|
<td>1.0</td> |
|
|
<td>505</td> |
|
|
<td>1.4</td> |
|
|
<td>680</td> |
|
|
</tr> |
|
|
<tr> |
|
|
<th rowspan="3" valign="top">H100x4</th>
|
|
<td>neuralmagic/Pixtral-Large-Instruct-2411-hf</td> |
|
|
<td></td> |
|
|
<td>1.0</td> |
|
|
<td>284</td> |
|
|
<td>1.6</td> |
|
|
<td>465</td> |
|
|
<td>1.8</td> |
|
|
<td>511</td> |
|
|
</tr> |
|
|
<tr> |
|
|
<td>neuralmagic/Pixtral-Large-Instruct-2411-hf-FP8-Dynamic</td> |
|
|
<td>1.61</td> |
|
|
<td>1.7</td> |
|
|
<td>467</td> |
|
|
<td>2.6</td> |
|
|
<td>726</td> |
|
|
<td>3.2</td> |
|
|
<td>908</td> |
|
|
</tr> |
|
|
<tr> |
|
|
<td>neuralmagic/Pixtral-Large-Instruct-2411-hf-quantized.w4a16</td> |
|
|
<td>1.33</td> |
|
|
<td>1.4</td> |
|
|
<td>393</td> |
|
|
<td>2.2</td> |
|
|
<td>726</td> |
|
|
<td>2.7</td> |
|
|
<td>764</td> |
|
|
</tr> |
|
|
</tbody> |
|
|
</table> |
|
|
|
|
|
**Use case profiles:** Image Size (WxH) / prompt tokens / generation tokens

**QPS:** Queries per second.

**QPD:** Queries per dollar, based on on-demand cost at [Lambda Labs](https://lambdalabs.com/service/gpu-cloud) (observed on 2/18/2025).
|
|
|