---
tags:
- w4a16
- int4
- vllm
- vision
license: apache-2.0
license_link: https://huggingface.co/datasets/choosealicense/licenses/blob/main/markdown/apache-2.0.md
language:
- en
base_model: mgoin/pixtral-12b
library_name: transformers
---

# pixtral-12b-quantized.w4a16

## Model Overview
- **Model Architecture:** mgoin/pixtral-12b
  - **Input:** Vision-Text
  - **Output:** Text
- **Model Optimizations:**
  - **Weight quantization:** INT4
  - **Activation quantization:** FP16
- **Release Date:** 2/24/2025
- **Version:** 1.0
- **Model Developers:** Neural Magic

Quantized version of [mgoin/pixtral-12b](https://huggingface.co/mgoin/pixtral-12b).

### Model Optimizations

This model was obtained by quantizing the weights of [mgoin/pixtral-12b](https://huggingface.co/mgoin/pixtral-12b) to the INT4 data type, ready for inference with vLLM >= 0.5.2.

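The quantization scheme is recorded in the checkpoint's `config.json` and can be inspected directly; the snippet below is a minimal sketch, assuming the `neuralmagic/pixtral-12b-quantized.w4a16` repository name used in the deployment example and compressed-tensors-style config fields:

```python
from transformers import AutoConfig

# Inspect the quantization metadata shipped with the checkpoint; the compressed-tensors
# config should report 4-bit, group_size=128 weight quantization with activations in FP16.
config = AutoConfig.from_pretrained(
    "neuralmagic/pixtral-12b-quantized.w4a16", trust_remote_code=True
)
print(config.quantization_config)
```
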
## Deployment

### Use with vLLM

This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.

```python
from vllm.assets.image import ImageAsset
from vllm import LLM, SamplingParams

# prepare model
llm = LLM(
    model="neuralmagic/pixtral-12b-quantized.w4a16",
    trust_remote_code=True,
    max_model_len=4096,
    max_num_seqs=2,
)

# prepare inputs
question = "What is the content of this image?"
inputs = {
    # Pixtral-style prompt: the [IMG] placeholder marks where the image is inserted
    "prompt": f"<s>[INST]{question}\n[IMG][/INST]",
    "multi_modal_data": {
        "image": ImageAsset("cherry_blossom").pil_image.convert("RGB")
    },
}

# generate response
print("========== SAMPLE GENERATION ==============")
outputs = llm.generate(inputs, SamplingParams(temperature=0.2, max_tokens=64))
print(f"PROMPT  : {outputs[0].prompt}")
print(f"RESPONSE: {outputs[0].outputs[0].text}")
print("==========================================")
```

vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.

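For example, the server can be launched with `vllm serve` and queried through the OpenAI client; the sketch below assumes the server is reachable at `http://localhost:8000/v1` and uses a publicly hosted sample image:

```python
# Launch the server first, e.g.:
#   vllm serve neuralmagic/pixtral-12b-quantized.w4a16 --trust-remote-code --max-model-len 4096
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="neuralmagic/pixtral-12b-quantized.w4a16",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is the content of this image?"},
            # Any reachable image URL works here.
            {"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg"}},
        ],
    }],
    max_tokens=64,
    temperature=0.2,
)
print(response.choices[0].message.content)
```
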
## Creation

This model was created with [llm-compressor](https://github.com/vllm-project/llm-compressor) by running the code snippet below, as part of a multimodal announcement blog.

<details>
<summary>Model Creation Code</summary>

```python
import torch
from transformers import AutoProcessor

from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot
from llmcompressor.transformers.tracing import TraceableLlavaForConditionalGeneration
from compressed_tensors.quantization import (
    QuantizationArgs,
    QuantizationType,
    QuantizationStrategy,
    ActivationOrdering,
    QuantizationScheme,
)

# Load model.
model_id = "mgoin/pixtral-12b"
model = TraceableLlavaForConditionalGeneration.from_pretrained(
    model_id, device_map="auto", torch_dtype="auto"
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Oneshot arguments
DATASET_ID = "flickr30k"
DATASET_SPLIT = {"calibration": "test[:512]"}
NUM_CALIBRATION_SAMPLES = 512
MAX_SEQUENCE_LENGTH = 2048
dampening_frac = 0.01

# Define a oneshot data collator for multimodal inputs.
def data_collator(batch):
    assert len(batch) == 1
    return {
        "input_ids": torch.LongTensor(batch[0]["input_ids"]),
        "attention_mask": torch.tensor(batch[0]["attention_mask"]),
        "pixel_values": torch.tensor(batch[0]["pixel_values"]),
    }

# GPTQ recipe: INT4, group-size-128, weight-only quantization of all Linear layers,
# skipping the lm_head, vision tower, and multimodal projector.
recipe = GPTQModifier(
    targets="Linear",
    config_groups={
        "config_group": QuantizationScheme(
            targets=["Linear"],
            weights=QuantizationArgs(
                num_bits=4,
                type=QuantizationType.INT,
                strategy=QuantizationStrategy.GROUP,
                group_size=128,
                symmetric=True,
                dynamic=False,
                actorder=ActivationOrdering.WEIGHT,
            ),
        ),
    },
    sequential_targets=["MistralDecoderLayer"],
    ignore=["re:.*lm_head", "re:vision_tower.*", "re:multi_modal_projector.*"],
    update_size=NUM_CALIBRATION_SAMPLES,
    dampening_frac=dampening_frac,
)

SAVE_DIR = f"{model_id.split('/')[1]}-quantized.w4a16"

# Perform oneshot
oneshot(
    model=model,
    tokenizer=model_id,
    dataset=DATASET_ID,
    splits=DATASET_SPLIT,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    trust_remote_code_model=True,
    data_collator=data_collator,
    output_dir=SAVE_DIR,
)
```
</details>

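As a quick sanity check after compression, a short text-only generation can be run with the model still in memory; this is a minimal sketch reusing the `model` and `processor` objects from the snippet above:

```python
# Text-only sanity generation on the freshly quantized model (weights already placed
# on GPU via device_map="auto").
sample = processor(text="Describe a cherry blossom tree.", return_tensors="pt").to(model.device)
output = model.generate(**sample, max_new_tokens=32)
print(processor.decode(output[0], skip_special_tokens=True))
```
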
## Evaluation

The model was evaluated on the OpenLLM Leaderboard [V1](https://huggingface.co/spaces/open-llm-leaderboard-old/open_llm_leaderboard), the OpenLLM Leaderboard [V2](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/), and [HumanEval](https://github.com/neuralmagic/evalplus), using the following commands:

<details>
<summary>Evaluation Commands</summary>

```
guidellm --model neuralmagic/pixtral-12b-quantized.w4a16 --target "http://localhost:8000/v1" --data-type emulated --data "prompt_tokens=<prompt_tokens>,generated_tokens=<generated_tokens>" --max-seconds 360 --backend aiohttp_server
```

</details>

### Accuracy

## Inference Performance

This model achieves up to xxx speedup in single-stream deployment and up to xxx speedup in multi-stream asynchronous deployment, depending on hardware and use-case scenario.
The following performance benchmarks were conducted with [vLLM](https://docs.vllm.ai/en/latest/) version 0.7.2 and [GuideLLM](https://github.com/neuralmagic/guidellm).

<details>
<summary>Benchmarking Command</summary>

</details>

### Single-stream performance (measured with vLLM version 0.7.2)

### Multi-stream asynchronous performance (measured with vLLM version 0.7.2)