ekurtic committed
Commit aec91b1 · verified · 1 Parent(s): 2ef8c7c

Upload folder using huggingface_hub

Files changed (1)
  1. README.md +131 -56
README.md CHANGED
@@ -16,7 +16,8 @@ tags:
16
  - neuralmagic
17
  - redhat
18
  - llmcompressor
19
- - fp8
 
20
  - quantized
21
  ---
22
 
@@ -25,21 +26,19 @@ tags:
25
  - **Input:** Text
26
  - **Output:** Text
27
  - **Model Optimizations:**
28
- - **Weight quantization:** FP8
29
- - **Activation quantization:** FP8
30
- - **Release Date:** 07/28/2025
31
  - **Version:** 1.0
32
  - **License(s):** Apache-2.0
33
  - **Model Developers:** RedHat (Neural Magic)
34
 
35
  ### Model Optimizations
36
 
37
- This model was obtained by quantizing activation and weights of [SmolLM3-3B](https://huggingface.co/HuggingFaceTB/SmolLM3-3B) to FP8 data type.
38
- This optimization reduces the number of bits used to represent weights and activations from 16 to 8, reducing GPU memory requirements (by approximately 50%) and increasing matrix-multiply compute throughput (by approximately 2x).
39
- Weight quantization also reduces disk size requirements by approximately 50%.
40
-
41
- Only weights and activations of the linear operators within transformers blocks are quantized.
42
- Weights are quantized with a symmetric static per-channel scheme, whereas activations are quantized with a symmetric dynamic per-token scheme.
43
  The [llm-compressor](https://github.com/vllm-project/llm-compressor) library is used for quantization.
44
 
45
  ## Deployment
@@ -50,7 +49,7 @@ This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/
50
  from vllm import LLM, SamplingParams
51
  from transformers import AutoTokenizer
52
 
53
- model_id = "RedHatAI/SmolLM3-3B-FP8-dynamic"
54
  number_gpus = 1
55
 
56
  sampling_params = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=256)
@@ -83,41 +82,117 @@ vLLM also supports OpenAI-compatible serving. See the [documentation](https://do
83
 
84
 
85
  ```python
86
- from transformers import AutoModelForCausalLM, AutoTokenizer
87
- from llmcompressor.modifiers.quantization import QuantizationModifier
88
- from llmcompressor.transformers import oneshot
89
-
90
- # Load model
91
- model_stub = "HuggingFaceTB/SmolLM3-3B"
92
- model_name = model_stub.split("/")[-1]
93
-
94
- tokenizer = AutoTokenizer.from_pretrained(model_stub)
95
-
96
- model = AutoModelForCausalLM.from_pretrained(
97
- model_stub,
98
- device_map="auto",
99
- torch_dtype="auto",
100
- )
101
-
102
- # Configure the quantization algorithm and scheme
103
- recipe = QuantizationModifier(
104
- targets="Linear",
105
- scheme="FP8_dynamic",
106
- ignore=["lm_head"],
107
- )
108
-
109
- # Apply quantization
110
- oneshot(
111
- model=model,
112
- recipe=recipe,
113
- )
114
-
115
- # Save to disk in compressed-tensors format
116
- save_path = model_name + "-FP8-dynamic"
117
- model.save_pretrained(save_path)
118
- tokenizer.save_pretrained(save_path)
119
- print(f"Model and tokenizer saved to: {save_path}")
120
- ```
121
  </details>
122
 
123
  ## Evaluation
@@ -131,7 +206,7 @@ In all cases, model outputs were generated with the [vLLM](https://docs.vllm.ai/
131
 
132
  ```
133
  export VLLM_WORKER_MULTIPROC_METHOD=spawn
134
- export MODEL="RedHatAI/SmolLM3-3B-FP8-dynamic"
135
  export MODEL_ARGS="model_name=$MODEL,dtype=auto,max_model_length=65536,gpu_memory_utilization=0.9,tensor_parallel_size=1,add_special_tokens=False,generation_parameters={max_new_tokens:32768,temperature:0.6,top_p:0.95}"
136
 
137
  export TASK=aime24 # {aime24, math_500, gpqa:diamond}
@@ -152,7 +227,7 @@ In all cases, model outputs were generated with the [vLLM](https://docs.vllm.ai/
152
  </th>
153
  <th>HuggingFaceTB/SmolLM3-3B
154
  </th>
155
- <th>RedHatAI/SmolLM3-3B-FP8-dynamic<br>(this model)
156
  </th>
157
  <th>Recovery
158
  </th>
@@ -164,9 +239,9 @@ In all cases, model outputs were generated with the [vLLM](https://docs.vllm.ai/
164
  </td>
165
  <td>45.31
166
  </td>
167
- <td>47.50
168
  </td>
169
- <td>104.83%
170
  </td>
171
  </tr>
172
  <tr>
@@ -174,9 +249,9 @@ In all cases, model outputs were generated with the [vLLM](https://docs.vllm.ai/
174
  </td>
175
  <td>89.30
176
  </td>
177
- <td>88.30
178
  </td>
179
- <td>98.88%
180
  </td>
181
  </tr>
182
  <tr>
@@ -184,9 +259,9 @@ In all cases, model outputs were generated with the [vLLM](https://docs.vllm.ai/
184
  </td>
185
  <td>41.22
186
  </td>
187
- <td>40.91
188
  </td>
189
- <td>99.25%
190
  </td>
191
  </tr>
192
  <tr>
@@ -194,9 +269,9 @@ In all cases, model outputs were generated with the [vLLM](https://docs.vllm.ai/
194
  </td>
195
  <td><strong>58.61</strong>
196
  </td>
197
- <td><strong>58.90</strong>
198
  </td>
199
- <td><strong>100.5%</strong>
200
  </td>
201
  </tr>
202
  <tr>
 
16
  - neuralmagic
17
  - redhat
18
  - llmcompressor
19
+ - int4
20
+ - w4a16
21
  - quantized
22
  ---
23
 
 
26
  - **Input:** Text
27
  - **Output:** Text
28
  - **Model Optimizations:**
29
+ - **Weight quantization:** INT4
30
+ - **Activation quantization:** None
31
+ - **Release Date:** 07/31/2025
32
  - **Version:** 1.0
33
  - **License(s):** Apache-2.0
34
  - **Model Developers:** RedHat (Neural Magic)
35
 
36
  ### Model Optimizations
37
 
38
+ This model was obtained by quantizing weights of [SmolLM3-3B](https://huggingface.co/HuggingFaceTB/SmolLM3-3B) to INT4 data type.
39
+ This optimization reduces the number of bits used to represent weights from 16 to 4, reducing GPU memory requirements (by approximately 75%).
40
+ Weight quantization also reduces disk size requirements by approximately 75%.
41
+ Only weights of the linear operators within transformer blocks are quantized.
 
 
42
  The [llm-compressor](https://github.com/vllm-project/llm-compressor) library is used for quantization.
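The ~75% figure follows from the bit widths alone; a quick back-of-the-envelope check, assuming all ~3B parameters are quantized and ignoring quantization scales and unquantized layers:

```python
params = 3_000_000_000          # approximate SmolLM3-3B parameter count
bf16_bytes = params * 16 // 8   # 16-bit (BF16) weights
int4_bytes = params * 4 // 8    # 4-bit (INT4) weights
reduction = 1 - int4_bytes / bf16_bytes

# Roughly 6.0 GB of weights shrink to 1.5 GB, a 75% reduction.
print(f"{bf16_bytes / 1e9:.1f} GB -> {int4_bytes / 1e9:.1f} GB ({reduction:.0%} smaller)")
```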
43
 
44
  ## Deployment
 
49
  from vllm import LLM, SamplingParams
50
  from transformers import AutoTokenizer
51
 
52
+ model_id = "RedHatAI/SmolLM3-3B-quantized.w4a16"
53
  number_gpus = 1
54
 
55
  sampling_params = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=256)
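The `SamplingParams(temperature=0.7, top_p=0.8, ...)` above correspond to standard temperature scaling followed by nucleus (top-p) filtering. A pure-Python sketch of that selection rule, with made-up logits (illustrative only, not vLLM's implementation):

```python
import math

def softmax_with_temperature(logits, temperature):
    # Scale logits by 1/temperature, then normalize (max-subtracted for stability).
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def top_p_filter(probs, top_p):
    # Keep the smallest set of most-likely tokens whose cumulative
    # probability reaches top_p, then renormalize over that set.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

logits = [2.0, 1.0, 0.5, -1.0]         # hypothetical logits for a 4-token vocabulary
probs = softmax_with_temperature(logits, temperature=0.7)
dist = top_p_filter(probs, top_p=0.8)  # the distribution actually sampled from
```

With these logits, only the two most likely tokens survive the top-p cut; the rest are never sampled.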
 
82
 
83
 
84
  ```python
85
+ import argparse
86
+ from datasets import load_dataset
87
+ from transformers import AutoTokenizer, AutoModelForCausalLM
88
+
89
+ from compressed_tensors.quantization import (
90
+ QuantizationScheme,
91
+ QuantizationArgs,
92
+ QuantizationType,
93
+ QuantizationStrategy,
94
+ )
95
+ from llmcompressor.modifiers.quantization import GPTQModifier
96
+ from llmcompressor.transformers import oneshot
97
+
98
+ # Constants
99
+ DATASET_ID = "neuralmagic/LLM_compression_calibration"
100
+ DATASET_SPLIT = "train"
101
+ MAX_SEQ_LENGTH = 8192
102
+ IGNORE_MODULES = ["lm_head"]
103
+
104
+ # Argument Parsing Utilities
105
+ def parse_actorder(value: str):
106
+ value_lower = value.lower()
107
+ if value_lower == "false":
108
+ return False
109
+ if value_lower in {"weight", "group"}:
110
+ return value_lower
111
+ raise argparse.ArgumentTypeError(f"Invalid --actorder. Choose 'group', 'weight', or 'false', got {value}")
112
+
113
+ def parse_sym(value: str):
114
+ value_lower = value.lower()
115
+ if value_lower in {"true", "false"}:
116
+ return value_lower == "true"
117
+ raise argparse.ArgumentTypeError(f"Invalid --sym. Use 'true' or 'false', got {value}")
118
+
119
+ # Argument Parser
120
+ def get_args():
121
+ parser = argparse.ArgumentParser(description="Quantize a model with GPTQModifier.")
122
+ parser.add_argument('--model_path', type=str, required=True, help="Path to the unquantized model.")
123
+ parser.add_argument('--calib_size', type=int, default=256, help="Number of samples for calibration.")
124
+ parser.add_argument('--dampening_frac', type=float, default=0.1, help="Dampening fraction for quantization.")
125
+ parser.add_argument('--observer', type=str, default="minmax", help="Observer type used for quantization.")
126
+ parser.add_argument('--sym', type=parse_sym, default=True, help="Symmetric quantization (true/false).")
127
+ parser.add_argument('--actorder', type=parse_actorder, default=False,
128
+ help="Activation order: 'group', 'weight', or 'false'.")
129
+ return parser.parse_args()
130
+
131
+ def main():
132
+ args = get_args()
133
+
134
+ model = AutoModelForCausalLM.from_pretrained(
135
+ args.model_path,
136
+ device_map="auto",
137
+ torch_dtype="auto",
138
+ use_cache=False,
139
+ trust_remote_code=True,
140
+ )
141
+ tokenizer = AutoTokenizer.from_pretrained(args.model_path)
142
+
143
+ # Load and preprocess dataset
144
+ ds = load_dataset(DATASET_ID, split=DATASET_SPLIT)
145
+ ds = ds.shuffle(seed=42).select(range(args.calib_size))
146
+ ds = ds.map(lambda x: {"text": x["text"]})
147
+ ds = ds.map(
148
+ lambda x: tokenizer(x["text"], padding=False, truncation=False, add_special_tokens=True),
149
+ remove_columns=ds.column_names
150
+ )
151
+
152
+ # Build Quantization Scheme
153
+ quant_scheme = QuantizationScheme(
154
+ targets=["Linear"],
155
+ weights=QuantizationArgs(
156
+ num_bits=4,
157
+ type=QuantizationType.INT,
158
+ symmetric=args.sym,
159
+ group_size=128,
160
+ strategy=QuantizationStrategy.GROUP,
161
+ observer=args.observer,
162
+ actorder=args.actorder
163
+ ),
164
+ input_activations=None,
165
+ output_activations=None,
166
+ )
167
+
168
+ # Define compression recipe
169
+ recipe = [
170
+ GPTQModifier(
171
+ targets=["Linear"],
172
+ ignore=IGNORE_MODULES,
173
+ dampening_frac=args.dampening_frac,
174
+ config_groups={"group_0": quant_scheme},
175
+ )
176
+ ]
177
+
178
+ # Apply quantization
179
+ oneshot(
180
+ model=model,
181
+ dataset=ds,
182
+ recipe=recipe,
183
+ num_calibration_samples=args.calib_size,
184
+ max_seq_length=MAX_SEQ_LENGTH,
185
+ )
186
+
187
+ # Save the quantized model
188
+ save_path = f"{args.model_path}-quantized.w4a16"
189
+ model.save_pretrained(save_path, save_compressed=True)
190
+ tokenizer.save_pretrained(save_path)
191
+
192
+ if __name__ == "__main__":
193
+ main()
194
+ ```
195
+
196
  </details>
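The recipe above uses a symmetric group-wise INT4 scheme (`group_size=128`, `symmetric=True`). A minimal pure-Python sketch of what symmetric quantization does to a single (tiny, made-up) weight group, illustrative rather than llm-compressor's actual implementation:

```python
def quantize_group(weights, num_bits=4):
    # Symmetric scheme: one scale per group, zero-point fixed at 0.
    qmax = 2 ** (num_bits - 1) - 1                  # 7 for INT4 (range [-8, 7])
    scale = max(abs(w) for w in weights) / qmax
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    dequant = [qi * scale for qi in q]              # what the kernel reconstructs
    return q, scale, dequant

weights = [0.42, -0.13, 0.07, -0.35]                # one made-up weight group
q, scale, dq = quantize_group(weights)
# Each dequantized weight lands within half a quantization step of the original.
assert all(abs(w - d) <= scale / 2 + 1e-12 for w, d in zip(weights, dq))
```

In the real model the groups are 128 weights wide, so a single FP scale is amortized over 128 INT4 values.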
197
 
198
  ## Evaluation
 
206
 
207
  ```
208
  export VLLM_WORKER_MULTIPROC_METHOD=spawn
209
+ export MODEL="RedHatAI/SmolLM3-3B-quantized.w4a16"
210
  export MODEL_ARGS="model_name=$MODEL,dtype=auto,max_model_length=65536,gpu_memory_utilization=0.9,tensor_parallel_size=1,add_special_tokens=False,generation_parameters={max_new_tokens:32768,temperature:0.6,top_p:0.95}"
211
 
212
  export TASK=aime24 # {aime24, math_500, gpqa:diamond}
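The Recovery column in the tables below is simply the quantized model's score as a percentage of the baseline score; for instance, using values from the INT4 table:

```python
def recovery(quantized: float, baseline: float) -> float:
    # Quantized score as a percentage of the baseline score.
    return round(quantized / baseline * 100, 2)

print(recovery(39.27, 45.31))  # 86.67
print(recovery(87.55, 89.30))  # 98.04
```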
 
227
  </th>
228
  <th>HuggingFaceTB/SmolLM3-3B
229
  </th>
230
+ <th>RedHatAI/SmolLM3-3B-quantized.w4a16<br>(this model)
231
  </th>
232
  <th>Recovery
233
  </th>
 
239
  </td>
240
  <td>45.31
241
  </td>
242
+ <td>39.27
243
  </td>
244
+ <td>86.67%
245
  </td>
246
  </tr>
247
  <tr>
 
249
  </td>
250
  <td>89.30
251
  </td>
252
+ <td>87.55
253
  </td>
254
+ <td>98.04%
255
  </td>
256
  </tr>
257
  <tr>
 
259
  </td>
260
  <td>41.22
261
  </td>
262
+ <td>41.86
263
  </td>
264
+ <td>101.55%
265
  </td>
266
  </tr>
267
  <tr>
 
269
  </td>
270
  <td><strong>58.61</strong>
271
  </td>
272
+ <td><strong>56.23</strong>
273
  </td>
274
+ <td><strong>95.94%</strong>
275
  </td>
276
  </tr>
277
  <tr>