Update README.md

06bc6fd verified 9 months ago

26.2 kB

	---
	tags:
	- w8a8
	- int8
	- vllm
	- vision

	license: other
	license_name: mrl
	inference: false
	license_link: https://mistral.ai/licenses/MRL-0.1.md
	extra_gated_prompt: >-
	# Mistral AI Research License

	If You want to use a Mistral Model, a Derivative or an Output for any purpose
	that is not expressly authorized under this Agreement, You must request a
	license from Mistral AI, which Mistral AI may grant to You in Mistral AI's
	sole discretion. To discuss such a license, please contact Mistral AI via the
	website contact form: https://mistral.ai/contact/

	## 1. Scope and acceptance

	1.1. Scope of the Agreement. This Agreement applies to any use,
	modification, or Distribution of any Mistral Model by You, regardless of the
	source You obtained a copy of such Mistral Model.

	1.2. Acceptance. By accessing, using, modifying, Distributing a Mistral
	Model, or by creating, using or distributing a Derivative of the Mistral
	Model, You agree to be bound by this Agreement.

	1.3. Acceptance on behalf of a third-party. If You accept this Agreement
	on behalf of Your employer or another person or entity, You warrant and
	represent that You have the authority to act and accept this Agreement on
	their behalf. In such a case, the word "You" in this Agreement will refer to
	Your employer or such other person or entity.

	## 2. License

	2.1. Grant of rights. Subject to Section 3 below, Mistral AI hereby
	grants You a non-exclusive, royalty-free, worldwide, non-sublicensable,
	non-transferable, limited license to use, copy, modify, and Distribute under
	the conditions provided in Section 2.2 below, the Mistral Model and any
	Derivatives made by or for Mistral AI and to create Derivatives of the Mistral
	Model.

	**2.2. Distribution of Mistral Model and Derivatives made by or for Mistral
	AI.** Subject to Section 3 below, You may Distribute copies of the Mistral
	Model and/or Derivatives made by or for Mistral AI, under the following
	conditions: You must make available a copy of this Agreement to third-party
	recipients of the Mistral Models and/or Derivatives made by or for Mistral AI
	you Distribute, it being specified that any rights to use the Mistral Models
	and/or Derivatives made by or for Mistral AI shall be directly granted by
	Mistral AI to said third-party recipients pursuant to the Mistral AI Research
	License agreement executed between these parties; You must retain in all
	copies of the Mistral Models the following attribution notice within a
	"Notice" text file distributed as part of such copies: "Licensed by Mistral AI
	under the Mistral AI Research License".

	2.3. Distribution of Derivatives made by or for You. Subject to Section 3
	below, You may Distribute any Derivatives made by or for You under additional
	or different terms and conditions, provided that: In any event, the use and
	modification of Mistral Model and/or Derivatives made by or for Mistral AI
	shall remain governed by the terms and conditions of this Agreement; You
	include in any such Derivatives made by or for You prominent notices stating
	that You modified the concerned Mistral Model; and Any terms and conditions
	You impose on any third-party recipients relating to Derivatives made by or
	for You shall neither limit such third-party recipients' use of the Mistral
	Model or any Derivatives made by or for Mistral AI in accordance with the
	Mistral AI Research License nor conflict with any of its terms and conditions.

	## 3. Limitations

	3.1. Misrepresentation. You must not misrepresent or imply, through any
	means, that the Derivatives made by or for You and/or any modified version of
	the Mistral Model You Distribute under your name and responsibility is an
	official product of Mistral AI or has been endorsed, approved or validated by
	Mistral AI, unless You are authorized by Us to do so in writing.

	3.2. Usage Limitation. You shall only use the Mistral Models, Derivatives
	(whether or not created by Mistral AI) and Outputs for Research Purposes.

	## 4. Intellectual Property

	4.1. Trademarks. No trademark licenses are granted under this Agreement,
	and in connection with the Mistral Models, You may not use any name or mark
	owned by or associated with Mistral AI or any of its affiliates, except (i) as
	required for reasonable and customary use in describing and Distributing the
	Mistral Models and Derivatives made by or for Mistral AI and (ii) for
	attribution purposes as required by this Agreement.

	4.2. Outputs. We claim no ownership rights in and to the Outputs. You are
	solely responsible for the Outputs You generate and their subsequent uses in
	accordance with this Agreement. Any Outputs shall be subject to the
	restrictions set out in Section 3 of this Agreement.

	4.3. Derivatives. By entering into this Agreement, You accept that any
	Derivatives that You may create or that may be created for You shall be
	subject to the restrictions set out in Section 3 of this Agreement.

	## 5. Liability

	5.1. Limitation of liability. In no event, unless required by applicable
	law (such as deliberate and grossly negligent acts) or agreed to in writing,
	shall Mistral AI be liable to You for damages, including any direct, indirect,
	special, incidental, or consequential damages of any character arising as a
	result of this Agreement or out of the use or inability to use the Mistral
	Models and Derivatives (including but not limited to damages for loss of data,
	loss of goodwill, loss of expected profit or savings, work stoppage, computer
	failure or malfunction, or any damage caused by malware or security breaches),
	even if Mistral AI has been advised of the possibility of such damages.

	5.2. Indemnification. You agree to indemnify and hold harmless Mistral AI
	from and against any claims, damages, or losses arising out of or related to
	Your use or Distribution of the Mistral Models and Derivatives.

	## 6. Warranty

	6.1. Disclaimer. Unless required by applicable law or prior agreed to by
	Mistral AI in writing, Mistral AI provides the Mistral Models and Derivatives
	on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either
	express or implied, including, without limitation, any warranties or
	conditions of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
	PARTICULAR PURPOSE. Mistral AI does not represent nor warrant that the Mistral
	Models and Derivatives will be error-free, meet Your or any third party's
	requirements, be secure or will allow You or any third party to achieve any
	kind of result or generate any kind of content. You are solely responsible for
	determining the appropriateness of using or Distributing the Mistral Models
	and Derivatives and assume any risks associated with Your exercise of rights
	under this Agreement.

	## 7. Termination

	7.1. Term. This Agreement is effective as of the date of your acceptance
	of this Agreement or access to the concerned Mistral Models or Derivatives and
	will continue until terminated in accordance with the following terms.

	7.2. Termination. Mistral AI may terminate this Agreement at any time if
	You are in breach of this Agreement. Upon termination of this Agreement, You
	must cease to use all Mistral Models and Derivatives and shall permanently
	delete any copy thereof. The following provisions, in their relevant parts,
	will survive any termination or expiration of this Agreement, each for the
	duration necessary to achieve its own intended purpose (e.g. the liability
	provision will survive until the end of the applicable limitation
	period):Sections 5 (Liability), 6(Warranty), 7 (Termination) and 8 (General
	Provisions).

	7.3. Litigation. If You initiate any legal action or proceedings against
	Us or any other entity (including a cross-claim or counterclaim in a lawsuit),
	alleging that the Model or a Derivative, or any part thereof, infringe upon
	intellectual property or other rights owned or licensable by You, then any
	licenses granted to You under this Agreement will immediately terminate as of
	the date such legal action or claim is filed or initiated.

	## 8. General provisions

	8.1. Governing laws. This Agreement will be governed by the laws of
	France, without regard to choice of law principles, and the UN Convention on
	Contracts for the International Sale of Goods does not apply to this
	Agreement.

	8.2. Competent jurisdiction. The courts of Paris shall have exclusive
	jurisdiction of any dispute arising out of this Agreement.

	8.3. Severability. If any provision of this Agreement is held to be
	invalid, illegal or unenforceable, the remaining provisions shall be
	unaffected thereby and remain valid as if such provision had not been set
	forth herein.

	## 9. Definitions

	"Agreement": means this Mistral AI Research License agreement governing the
	access, use, and Distribution of the Mistral Models, Derivatives and Outputs.

	"Derivative": means any (i) modified version of the Mistral Model (including
	but not limited to any customized or fine-tuned version thereof), (ii) work
	based on the Mistral Model, or (iii) any other derivative work thereof.

	"Distribution", "Distributing", "Distribute" or "Distributed": means
	supplying, providing or making available, by any means, a copy of the Mistral
	Models and/or the Derivatives as the case may be, subject to Section 3 of this
	Agreement.

	"Mistral AI", "We" or "Us": means Mistral AI, a French société par actions
	simplifiée registered in the Paris commercial registry under the number 952
	418 325, and having its registered seat at 15, rue des Halles, 75001 Paris.

	"Mistral Model": means the foundational large language model(s), and its
	elements which include algorithms, software, instructed checkpoints,
	parameters, source code (inference code, evaluation code and, if applicable,
	fine-tuning code) and any other elements associated thereto made available by
	Mistral AI under this Agreement, including, if any, the technical
	documentation, manuals and instructions for the use and operation thereof.

	"Research Purposes": means any use of a Mistral Model, Derivative, or Output
	that is solely for (a) personal, scientific or academic research, and (b) for
	non-profit and non-commercial purposes, and not directly or indirectly
	connected to any commercial activities or business operations. For
	illustration purposes, Research Purposes does not include (1) any usage of the
	Mistral Model, Derivative or Output by individuals or contractors employed in
	or engaged by companies in the context of (a) their daily tasks, or (b) any
	activity (including but not limited to any testing or proof-of-concept) that
	is intended to generate revenue, nor (2) any Distribution by a commercial
	entity of the Mistral Model, Derivative or Output whether in return for
	payment or free of charge, in any medium or form, including but not limited to
	through a hosted or managed service (e.g. SaaS, cloud instances, etc.), or
	behind a software layer.

	"Outputs": means any content generated by the operation of the Mistral Models
	or the Derivatives from a prompt (i.e., text instructions) provided by users.
	For the avoidance of doubt, Outputs do not include any components of a Mistral
	Models, such as any fine-tuned versions of the Mistral Models, the weights, or
	parameters.

	"You": means the individual or entity entering into this Agreement with
	Mistral AI.


	*Mistral AI processes your personal data below to provide the model and
	enforce its license. If you are affiliated with a commercial entity, we may
	also send you communications about our models. For more information on your
	rights and data handling, please see our <a
	href="https://mistral.ai/terms/">privacy policy</a>.*
	extra_gated_fields:
	First Name: text
	Last Name: text
	Country: country
	Affiliation: text
	Job title: text
	I understand that I can only use the model, any derivative versions and their outputs for non-commercial research purposes: checkbox
	I understand that if I am a commercial entity, I am not permitted to use or distribute the model internally or externally, or expose it in my own offerings without a commercial license: checkbox
	I understand that if I upload the model, or any derivative version, on any platform, I must include the Mistral Research License: checkbox
	I understand that for commercial use of the model, I can contact Mistral or use the Mistral AI API on la Plateforme or any of our cloud provider partners: checkbox
	By clicking Submit below I accept the terms of the license and acknowledge that the information I provide will be collected stored processed and shared in accordance with the Mistral Privacy Policy: checkbox
	geo: ip_location
	extra_gated_description: >-
	Mistral AI processes your personal data below to provide the model and enforce
	its license. If you are affiliated with a commercial entity, we may also send
	you communications about our models. For more information on your rights and
	data handling, please see our <a href="https://mistral.ai/terms/">privacy
	policy</a>.
	extra_gated_button_content: Submit
	library_name: vllm
	pipeline_tag: image-text-to-text
	language:
	- en
	- fr
	- de
	- es
	- it
	- pt
	- zh
	- ja
	- ru
	- ko
	base_model: neuralmagic/Pixtral-Large-Instruct-2411-hf
	---

	# Pixtral-Large-Instruct-2411-hf-quantized.w8a8

	## Model Overview
	- Model Architecture: neuralmagic/Pixtral-Large-Instruct-2411-hf
	- Input: Vision-Text
	- Output: Text
	- Model Optimizations:
	- Weight quantization: INT8
	- Activation quantization: INT8
	- Release Date: 2/24/2025
	- Version: 1.0
	- Model Developers: Neural Magic

	Quantized version of [neuralmagic/Pixtral-Large-Instruct-2411-hf](https://huggingface.co/neuralmagic/Pixtral-Large-Instruct-2411-hf/tree/main).

	### Model Optimizations

	This model was obtained by quantizing the weights of [neuralmagic/Pixtral-Large-Instruct-2411-hf](https://huggingface.co/neuralmagic/Pixtral-Large-Instruct-2411-hf/tree/main) to INT8 data type, ready for inference with vLLM >= 0.5.2.

	## Deployment

	### Use with vLLM

	This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.

	```python
	from vllm.assets.image import ImageAsset
	from vllm import LLM, SamplingParams

	# prepare model
	llm = LLM(
	model="neuralmagic/Pixtral-Large-Instruct-2411-hf-quantized.w8a8",
	trust_remote_code=True,
	max_model_len=4096,
	max_num_seqs=2,
	)

	# prepare inputs
	question = "What is the content of this image?"
	inputs = {
	"prompt": f"<\|user\|>\n<\|image_1\|>\n{question}<\|end\|>\n<\|assistant\|>\n",
	"multi_modal_data": {
	"image": ImageAsset("cherry_blossom").pil_image.convert("RGB")
	},
	}

	# generate response
	print("========== SAMPLE GENERATION ==============")
	outputs = llm.generate(inputs, SamplingParams(temperature=0.2, max_tokens=64))
	print(f"PROMPT : {outputs[0].prompt}")
	print(f"RESPONSE: {outputs[0].outputs[0].text}")
	print("==========================================")
	```

	vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.

	## Creation

	This model was created with [llm-compressor](https://github.com/vllm-project/llm-compressor) by running the code snippet below as part a multimodal announcement blog.

	<details>
	<summary>Model Creation Code</summary>

	```python
	import requests
	import torch
	from PIL import Image
	from transformers import AutoProcessor
	from llmcompressor.modifiers.quantization import GPTQModifier
	from llmcompressor.transformers import oneshot
	from llmcompressor.transformers.tracing import TraceableLlavaForConditionalGeneration

	# Load model.
	model_id = "neuralmagic/Pixtral-Large-Instruct-2411-hf"
	model = TraceableLlavaForConditionalGeneration.from_pretrained(
	model_id, device_map="auto", torch_dtype="auto"
	)
	processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

	# Oneshot arguments
	DATASET_ID = "flickr30k"
	DATASET_SPLIT = {"calibration": "test[:512]"}
	NUM_CALIBRATION_SAMPLES = 512
	MAX_SEQUENCE_LENGTH = 2048


	# Define a oneshot data collator for multimodal inputs.
	def data_collator(batch):
	assert len(batch) == 1
	return {
	"input_ids": torch.LongTensor(batch[0]["input_ids"]),
	"attention_mask": torch.tensor(batch[0]["attention_mask"]),
	"pixel_values": torch.tensor(batch[0]["pixel_values"]),
	}


	# Recipe
	recipe = [
	GPTQModifier(
	targets="Linear",
	scheme="W8A8",
	sequential_targets=["MistralDecoderLayer"],
	ignore=["re:.lm_head", "re:vision_tower.", "re:multi_modal_projector.*"],
	),
	]

	SAVE_DIR==f"{model_id.split('/')[1]}-quantized.w8a8"

	# Perform oneshot
	oneshot(
	model=model,
	tokenizer=model_id,
	dataset=DATASET_ID,
	splits=DATASET_SPLIT,
	recipe=recipe,
	max_seq_length=MAX_SEQUENCE_LENGTH,
	num_calibration_samples=NUM_CALIBRATION_SAMPLES,
	trust_remote_code_model=True,
	data_collator=data_collator,
	output_dir=SAVE_DIR
	)
	```
	</details>

	## Evaluation

	The model was evaluated using [mistral-evals](https://github.com/neuralmagic/mistral-evals) for vision-related tasks and using [lm_evaluation_harness](https://github.com/neuralmagic/lm-evaluation-harness) for select text-based benchmarks. The evaluations were conducted using the following commands:

	<details>
	<summary>Evaluation Commands</summary>

	### Vision Tasks
	- vqav2
	- docvqa
	- mathvista
	- mmmu
	- chartqa

	```
	vllm serve neuralmagic/pixtral-12b-quantized.w8a8 --tensor_parallel_size 1 --max_model_len 25000 --trust_remote_code --max_num_seqs 8 --gpu_memory_utilization 0.9 --dtype float16 --limit_mm_per_prompt image=7

	python -m eval.run eval_vllm \
	--model_name neuralmagic/pixtral-12b-quantized.w8a8 \
	--url http://0.0.0.0:8000 \
	--output_dir ~/tmp \
	--eval_name <vision_task_name>
	```

	### Text-based Tasks
	#### MMLU

	```
	lm_eval \
	--model vllm \
	--model_args pretrained="<model_name>",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=<n>,gpu_memory_utilization=0.8,enable_chunked_prefill=True,trust_remote_code=True \
	--tasks mmlu \
	--num_fewshot 5 \
	--batch_size auto \
	--output_path output_dir

	```

	#### MGSM

	```
	lm_eval \
	--model vllm \
	--model_args pretrained="<model_name>",dtype=auto,max_model_len=4096,max_gen_toks=2048,max_num_seqs=128,tensor_parallel_size=<n>,gpu_memory_utilization=0.9 \
	--tasks mgsm_cot_native \
	--apply_chat_template \
	--num_fewshot 0 \
	--batch_size auto \
	--output_path output_dir

	```
	</details>


	### Accuracy

	<table>
	<thead>
	<tr>
	<th>Category</th>
	<th>Metric</th>
	<th>neuralmagic/Pixtral-Large-Instruct-2411-hf</th>
	<th>neuralmagic/Pixtral-Large-Instruct-2411-hf-quantized.w8a8</th>
	<th>Recovery (%)</th>
	</tr>
	</thead>
	<tbody>
	<tr>
	<td rowspan="6"><b>Vision</b></td>
	<td>MMMU (val, CoT)<br><i>explicit_prompt_relaxed_correctness</i></td>
	<td>63.56</td>
	<td>63.89</td>
	<td>100.52%</td>
	</tr>
	<tr>
	<td>VQAv2 (val)<br><i>vqa_match</i></td>
	<td>79.03</td>
	<td>79.12</td>
	<td>100.11%</td>
	</tr>
	<tr>
	<td>DocVQA (val)<br><i>anls</i></td>
	<td>89.55</td>
	<td>89.80</td>
	<td>100.28%</td>
	</tr>
	<tr>
	<td>ChartQA (test, CoT)<br><i>anywhere_in_answer_relaxed_correctness</i></td>
	<td>82.24</td>
	<td>80.44</td>
	<td>97.81%</td>
	</tr>
	<tr>
	<td>Mathvista (testmini, CoT)<br><i>explicit_prompt_relaxed_correctness</i></td>
	<td>67.3</td>
	<td>66.50</td>
	<td>98.81%</td>
	</tr>
	<tr>
	<td><b>Average Score</b></td>
	<td><b>76.34</b></td>
	<td><b>75.95</b></td>
	<td><b>99.49%</b></td>
	</tr>
	<tr>
	<td rowspan="2"><b>Text</b></td>
	<td>MGSM (CoT)</td>
	<td>76.05</td>
	<td>74.76</td>
	<td>98.30%</td>
	</tr>
	<tr>
	<td>MMLU (5-shot)</td>
	<td>82.8</td>
	<td>82.9</td>
	<td>100.12%</td>
	</tr>
	</tbody>
	</table>


	## Inference Performance


	This model achieves up to 1.87x speedup in single-stream deployment and up to 2.0x speedup in multi-stream asynchronous deployment, depending on hardware and use-case scenario.
	The following performance benchmarks were conducted with [vLLM](https://docs.vllm.ai/en/latest/) version 0.7.2, and [GuideLLM](https://github.com/neuralmagic/guidellm).

	<details>
	<summary>Benchmarking Command</summary>
	```
	guidellm --model neuralmagic/Pixtral-Large-Instruct-2411-hf-quantized.w8a8 --target "http://localhost:8000/v1" --data-type emulated --data prompt_tokens=<prompt_tokens>,generated_tokens=<generated_tokens>,images=<num_images>,width=<image_width>,height=<image_height> --max seconds 120 --backend aiohttp_server
	```

	</details>

	### Single-stream performance (measured with vLLM version 0.7.2)

	<table border="1" class="dataframe">
	<thead>
	<tr>
	<th></th>
	<th></th>
	<th></th>
	<th></th>
	<th style="text-align: center;" colspan="2" >Document Visual Question Answering<br>1680W x 2240H<br>64/128</th>
	<th style="text-align: center;" colspan="2" >Visual Reasoning <br>640W x 480H<br>128/128</th>
	<th style="text-align: center;" colspan="2" >Image Captioning<br>480W x 360H<br>0/128</th>
	</tr>
	<tr>
	<th>Hardware</th>
	<th>Number of GPUs</th>
	<th>Model</th>
	<th>Average Cost Reduction</th>
	<th>Latency (s)</th>
	<th>Queries Per Dollar</th>
	<th>Latency (s)</th>
	<th>Queries Per Dollar</th>
	<th>Latency (s)</th>
	<th>Queries Per Dollar</th>
	</tr>
	</thead>
	<tbody style="text-align: center">
	<tr>
	<th rowspan="3" valign="top">A100</th>
	<td>4</td>
	<td>neuralmagic/Pixtral-Large-Instruct-2411-hf</td>
	<td></td>
	<td>7.5</td>
	<td>67</td>
	<td>6.5</td>
	<td>77</td>
	<td>6.4</td>
	<td>79</td>
	</tr>
	<tr>
	<td>2</td>
	<td>neuralmagic/Pixtral-Large-Instruct-2411-hf-quantized.w8a8</td>
	<td>1.86</td>
	<td>8.1</td>
	<td>124</td>
	<td>7.1</td>
	<td>142</td>
	<td>6.8</td>
	<td>148</td>
	</tr>
	<tr>
	<td>2</td>
	<td>neuralmagic/Pixtral-Large-Instruct-2411-hf-quantized.w4a16</td>
	<td>2.52</td>
	<td>6.9</td>
	<td>147</td>
	<td>5.1</td>
	<td>199</td>
	<td>4.5</td>
	<td>221</td>
	</tr>
	<tr>
	<th rowspan="3" valign="top">H100</th>
	<td>4</td>
	<td>neuralmagic/Pixtral-Large-Instruct-2411-hf</td>
	<td></td>
	<td>4.4</td>
	<td>67</td>
	<td>3.9</td>
	<td>74</td>
	<td>3.7</td>
	<td>79</td>
	</tr>
	<tr>
	<td>2</td>
	<td>neuralmagic/Pixtral-Large-Instruct-2411-hf-FP8-Dynamic</td>
	<td>1.82</td>
	<td>4.7</td>
	<td>120</td>
	<td>4.1</td>
	<td>137</td>
	<td>3.9</td>
	<td>145</td>
	</tr>
	<tr>
	<td>2</td>
	<td>neuralmagic/Pixtral-Large-Instruct-2411-hf-quantized.w4a16</td>
	<td>1.87</td>
	<td>4.7</td>
	<td>120</td>
	<td>3.9</td>
	<td>144</td>
	<td>3.8</td>
	<td>149</td>
	</tr>
	</tbody>
	</table>

	**Use case profiles: Image Size (WxH) / prompt tokens / generation tokens

	**QPD: Queries per dollar, based on on-demand cost at [Lambda Labs](https://lambdalabs.com/service/gpu-cloud) (observed on 2/18/2025).

	### Multi-stream asynchronous performance (measured with vLLM version 0.7.2)

	<table border="1" class="dataframe">
	<thead>
	<tr>
	<th></th>
	<th></th>
	<th></th>
	<th style="text-align: center;" colspan="2" >Document Visual Question Answering<br>1680W x 2240H<br>64/128</th>
	<th style="text-align: center;" colspan="2" >Visual Reasoning <br>640W x 480H<br>128/128</th>
	<th style="text-align: center;" colspan="2" >Image Captioning<br>480W x 360H<br>0/128</th>
	</tr>
	<tr>
	<th>Hardware</th>
	<th>Model</th>
	<th>Average Cost Reduction</th>
	<th>Maximum throughput (QPS)</th>
	<th>Queries Per Dollar</th>
	<th>Maximum throughput (QPS)</th>
	<th>Queries Per Dollar</th>
	<th>Maximum throughput (QPS)</th>
	<th>Queries Per Dollar</th>
	</tr>
	</thead>
	<tbody style="text-align: center">
	<tr>
	<th rowspan="3" valign="top">A100x4</th>
	<td>neuralmagic/Pixtral-Large-Instruct-2411-hf</td>
	<td></td>
	<td>0.4</td>
	<td>222</td>
	<td>0.7</td>
	<td>341</td>
	<td>0.8</td>
	<td>399</td>
	</tr>
	<tr>
	<td>neuralmagic/Pixtral-Large-Instruct-2411-hf-quantized.w8a8</td>
	<td>1.70</td>
	<td>0.8</td>
	<td>383</td>
	<td>1.1</td>
	<td>571</td>
	<td>1.3</td>
	<td>674</td>
	</tr>
	<tr>
	<td>neuralmagic/Pixtral-Large-Instruct-2411-hf-quantized.w4a16</td>
	<td>1.48</td>
	<td>0.5</td>
	<td>276</td>
	<td>1.0</td>
	<td>505</td>
	<td>1.4</td>
	<td>680</td>
	</tr>
	<tr>
	<<th rowspan="3" valign="top">H100x4</th>
	<td>neuralmagic/Pixtral-Large-Instruct-2411-hf</td>
	<td></td>
	<td>1.0</td>
	<td>284</td>
	<td>1.6</td>
	<td>465</td>
	<td>1.8</td>
	<td>511</td>
	</tr>
	<tr>
	<td>neuralmagic/Pixtral-Large-Instruct-2411-hf-FP8-Dynamic</td>
	<td>1.61</td>
	<td>1.7</td>
	<td>467</td>
	<td>2.6</td>
	<td>726</td>
	<td>3.2</td>
	<td>908</td>
	</tr>
	<tr>
	<td>neuralmagic/Pixtral-Large-Instruct-2411-hf-quantized.w4a16</td>
	<td>1.33</td>
	<td>1.4</td>
	<td>393</td>
	<td>2.2</td>
	<td>726</td>
	<td>2.7</td>
	<td>764</td>
	</tr>
	</tbody>
	</table>

	**Use case profiles: Image Size (WxH) / prompt tokens / generation tokens

	**QPS: Queries per second.

	**QPD: Queries per dollar, based on on-demand cost at [Lambda Labs](https://lambdalabs.com/service/gpu-cloud) (observed on 2/18/2025).