ekurtic committed
Commit aec91b1 · verified · 1 Parent(s): 2ef8c7c

Upload folder using huggingface_hub

Files changed (1)
  1. README.md +131 -56
README.md CHANGED
@@ -16,7 +16,8 @@ tags:
16
  - neuralmagic
17
  - redhat
18
  - llmcompressor
19
- - fp8
 
20
  - quantized
21
  ---
22
 
@@ -25,21 +26,19 @@ tags:
25
  - **Input:** Text
26
  - **Output:** Text
27
  - **Model Optimizations:**
28
- - **Weight quantization:** FP8
29
- - **Activation quantization:** FP8
30
- - **Release Date:** 07/28/2025
31
  - **Version:** 1.0
32
  - **License(s):** Apache-2.0
33
  - **Model Developers:** RedHat (Neural Magic)
34
 
35
  ### Model Optimizations
36
 
37
- This model was obtained by quantizing activation and weights of [SmolLM3-3B](https://huggingface.co/HuggingFaceTB/SmolLM3-3B) to FP8 data type.
38
- This optimization reduces the number of bits used to represent weights and activations from 16 to 8, reducing GPU memory requirements (by approximately 50%) and increasing matrix-multiply compute throughput (by approximately 2x).
39
- Weight quantization also reduces disk size requirements by approximately 50%.
40
-
41
- Only weights and activations of the linear operators within transformers blocks are quantized.
42
- Weights are quantized with a symmetric static per-channel scheme, whereas activations are quantized with a symmetric dynamic per-token scheme.
43
  The [llm-compressor](https://github.com/vllm-project/llm-compressor) library is used for quantization.
44
 
45
  ## Deployment
@@ -50,7 +49,7 @@ This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/
50
  from vllm import LLM, SamplingParams
51
  from transformers import AutoTokenizer
52
 
53
- model_id = "RedHatAI/SmolLM3-3B-FP8-dynamic"
54
  number_gpus = 1
55
 
56
  sampling_params = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=256)
@@ -83,41 +82,117 @@ vLLM also supports OpenAI-compatible serving. See the [documentation](https://do
83
 
84
 
85
  ```python
86
- from transformers import AutoModelForCausalLM, AutoTokenizer
87
- from llmcompressor.modifiers.quantization import QuantizationModifier
88
- from llmcompressor.transformers import oneshot
89
-
90
- # Load model
91
- model_stub = "HuggingFaceTB/SmolLM3-3B"
92
- model_name = model_stub.split("/")[-1]
93
-
94
- tokenizer = AutoTokenizer.from_pretrained(model_stub)
95
-
96
- model = AutoModelForCausalLM.from_pretrained(
97
- model_stub,
98
- device_map="auto",
99
- torch_dtype="auto",
100
- )
101
-
102
- # Configure the quantization algorithm and scheme
103
- recipe = QuantizationModifier(
104
- targets="Linear",
105
- scheme="FP8_dynamic",
106
- ignore=["lm_head"],
107
- )
108
-
109
- # Apply quantization
110
- oneshot(
111
- model=model,
112
- recipe=recipe,
113
- )
114
-
115
- # Save to disk in compressed-tensors format
116
- save_path = model_name + "-FP8-dynamic"
117
- model.save_pretrained(save_path)
118
- tokenizer.save_pretrained(save_path)
119
- print(f"Model and tokenizer saved to: {save_path}")
120
- ```
121
  </details>
122
 
123
  ## Evaluation
@@ -131,7 +206,7 @@ In all cases, model outputs were generated with the [vLLM](https://docs.vllm.ai/
131
 
132
  ```
133
  export VLLM_WORKER_MULTIPROC_METHOD=spawn
134
- export MODEL="RedHatAI/SmolLM3-3B-FP8-dynamic"
135
  export MODEL_ARGS="model_name=$MODEL,dtype=auto,max_model_length=65536,gpu_memory_utilization=0.9,tensor_parallel_size=1,add_special_tokens=False,generation_parameters={max_new_tokens:32768,temperature:0.6,top_p:0.95}"
136
 
137
  export TASK=aime24 # {aime24, math_500, gpqa:diamond}
@@ -152,7 +227,7 @@ In all cases, model outputs were generated with the [vLLM](https://docs.vllm.ai/
152
  </th>
153
  <th>HuggingFaceTB/SmolLM3-3B
154
  </th>
155
- <th>RedHatAI/SmolLM3-3B-FP8-dynamic<br>(this model)
156
  </th>
157
  <th>Recovery
158
  </th>
@@ -164,9 +239,9 @@ In all cases, model outputs were generated with the [vLLM](https://docs.vllm.ai/
164
  </td>
165
  <td>45.31
166
  </td>
167
- <td>47.50
168
  </td>
169
- <td>104.83%
170
  </td>
171
  </tr>
172
  <tr>
@@ -174,9 +249,9 @@ In all cases, model outputs were generated with the [vLLM](https://docs.vllm.ai/
174
  </td>
175
  <td>89.30
176
  </td>
177
- <td>88.30
178
  </td>
179
- <td>98.88%
180
  </td>
181
  </tr>
182
  <tr>
@@ -184,9 +259,9 @@ In all cases, model outputs were generated with the [vLLM](https://docs.vllm.ai/
184
  </td>
185
  <td>41.22
186
  </td>
187
- <td>40.91
188
  </td>
189
- <td>99.25%
190
  </td>
191
  </tr>
192
  <tr>
@@ -194,9 +269,9 @@ In all cases, model outputs were generated with the [vLLM](https://docs.vllm.ai/
194
  </td>
195
  <td><strong>58.61</strong>
196
  </td>
197
- <td><strong>58.90</strong>
198
  </td>
199
- <td><strong>100.5%</strong>
200
  </td>
201
  </tr>
202
  <tr>
 
16
  - neuralmagic
17
  - redhat
18
  - llmcompressor
19
+ - int4
20
+ - w4a16
21
  - quantized
22
  ---
23
 
 
26
  - **Input:** Text
27
  - **Output:** Text
28
  - **Model Optimizations:**
29
+ - **Weight quantization:** INT4
30
+ - **Activation quantization:** None
31
+ - **Release Date:** 07/31/2025
32
  - **Version:** 1.0
33
  - **License(s):** Apache-2.0
34
  - **Model Developers:** RedHat (Neural Magic)
35
 
36
  ### Model Optimizations
37
 
38
+ This model was obtained by quantizing weights of [SmolLM3-3B](https://huggingface.co/HuggingFaceTB/SmolLM3-3B) to INT4 data type.
39
+ This optimization reduces the number of bits used to represent weights from 16 to 4, reducing GPU memory requirements (by approximately 75%).
40
+ Weight quantization also reduces disk size requirements by approximately 75%.
41
+ Only weights of the linear operators within transformer blocks are quantized.
 
 
42
  The [llm-compressor](https://github.com/vllm-project/llm-compressor) library is used for quantization.
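The ~75% figure follows from the bit widths alone; a quick back-of-the-envelope check, assuming all ~3B parameters are quantized and ignoring quantization scales and unquantized layers:

```python
params = 3_000_000_000          # approximate SmolLM3-3B parameter count
bf16_bytes = params * 16 // 8   # 16-bit (BF16) weights
int4_bytes = params * 4 // 8    # 4-bit (INT4) weights
reduction = 1 - int4_bytes / bf16_bytes

# Roughly 6.0 GB of weights shrink to 1.5 GB, a 75% reduction.
print(f"{bf16_bytes / 1e9:.1f} GB -> {int4_bytes / 1e9:.1f} GB ({reduction:.0%} smaller)")
```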
43
 
44
  ## Deployment
 
49
  from vllm import LLM, SamplingParams
50
  from transformers import AutoTokenizer
51
 
52
+ model_id = "RedHatAI/SmolLM3-3B-quantized.w4a16"
53
  number_gpus = 1
54
 
55
  sampling_params = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=256)
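The `SamplingParams(temperature=0.7, top_p=0.8, ...)` above correspond to standard temperature scaling followed by nucleus (top-p) filtering. A pure-Python sketch of that selection rule, with made-up logits (illustrative only, not vLLM's implementation):

```python
import math

def softmax_with_temperature(logits, temperature):
    # Scale logits by 1/temperature, then normalize (max-subtracted for stability).
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def top_p_filter(probs, top_p):
    # Keep the smallest set of most-likely tokens whose cumulative
    # probability reaches top_p, then renormalize over that set.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

logits = [2.0, 1.0, 0.5, -1.0]         # hypothetical logits for a 4-token vocabulary
probs = softmax_with_temperature(logits, temperature=0.7)
dist = top_p_filter(probs, top_p=0.8)  # the distribution actually sampled from
```

With these logits, only the two most likely tokens survive the top-p cut; the rest are never sampled.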
 
82
 
83
 
84
  ```python
85
+ import argparse
86
+ from datasets import load_dataset
87
+ from transformers import AutoTokenizer, AutoModelForCausalLM
88
+
89
+ from compressed_tensors.quantization import (
90
+ QuantizationScheme,
91
+ QuantizationArgs,
92
+ QuantizationType,
93
+ QuantizationStrategy,
94
+ )
95
+ from llmcompressor.modifiers.quantization import GPTQModifier
96
+ from llmcompressor.transformers import oneshot
97
+
98
+ # Constants
99
+ DATASET_ID = "neuralmagic/LLM_compression_calibration"
100
+ DATASET_SPLIT = "train"
101
+ MAX_SEQ_LENGTH = 8192
102
+ IGNORE_MODULES = ["lm_head"]
103
+
104
+ # Argument Parsing Utilities
105
+ def parse_actorder(value: str):
106
+ value_lower = value.lower()
107
+ if value_lower == "false":
108
+ return False
109
+ if value_lower in {"weight", "group"}:
110
+ return value_lower
111
+ raise argparse.ArgumentTypeError(f"Invalid --actorder. Choose 'group', 'weight', or 'false', got {value}")
112
+
113
+ def parse_sym(value: str):
114
+ value_lower = value.lower()
115
+ if value_lower in {"true", "false"}:
116
+ return value_lower == "true"
117
+ raise argparse.ArgumentTypeError(f"Invalid --sym. Use 'true' or 'false', got {value}")
118
+
119
+ # Argument Parser
120
+ def get_args():
121
+ parser = argparse.ArgumentParser(description="Quantize a model with GPTQModifier.")
122
+ parser.add_argument('--model_path', type=str, required=True, help="Path to the unquantized model.")
123
+ parser.add_argument('--calib_size', type=int, default=256, help="Number of samples for calibration.")
124
+ parser.add_argument('--dampening_frac', type=float, default=0.1, help="Dampening fraction for quantization.")
125
+ parser.add_argument('--observer', type=str, default="minmax", help="Observer type used for quantization.")
126
+ parser.add_argument('--sym', type=parse_sym, default=True, help="Symmetric quantization (true/false).")
127
+ parser.add_argument('--actorder', type=parse_actorder, default=False,
128
+ help="Activation order: 'group', 'weight', or 'false'.")
129
+ return parser.parse_args()
130
+
131
+ def main():
132
+ args = get_args()
133
+
134
+ model = AutoModelForCausalLM.from_pretrained(
135
+ args.model_path,
136
+ device_map="auto",
137
+ torch_dtype="auto",
138
+ use_cache=False,
139
+ trust_remote_code=True,
140
+ )
141
+ tokenizer = AutoTokenizer.from_pretrained(args.model_path)
142
+
143
+ # Load and preprocess dataset
144
+ ds = load_dataset(DATASET_ID, split=DATASET_SPLIT)
145
+ ds = ds.shuffle(seed=42).select(range(args.calib_size))
146
+ ds = ds.map(lambda x: {"text": x["text"]})
147
+ ds = ds.map(
148
+ lambda x: tokenizer(x["text"], padding=False, truncation=False, add_special_tokens=True),
149
+ remove_columns=ds.column_names
150
+ )
151
+
152
+ # Build Quantization Scheme
153
+ quant_scheme = QuantizationScheme(
154
+ targets=["Linear"],
155
+ weights=QuantizationArgs(
156
+ num_bits=4,
157
+ type=QuantizationType.INT,
158
+ symmetric=args.sym,
159
+ group_size=128,
160
+ strategy=QuantizationStrategy.GROUP,
161
+ observer=args.observer,
162
+ actorder=args.actorder
163
+ ),
164
+ input_activations=None,
165
+ output_activations=None,
166
+ )
167
+
168
+ # Define compression recipe
169
+ recipe = [
170
+ GPTQModifier(
171
+ targets=["Linear"],
172
+ ignore=IGNORE_MODULES,
173
+ dampening_frac=args.dampening_frac,
174
+ config_groups={"group_0": quant_scheme},
175
+ )
176
+ ]
177
+
178
+ # Apply quantization
179
+ oneshot(
180
+ model=model,
181
+ dataset=ds,
182
+ recipe=recipe,
183
+ num_calibration_samples=args.calib_size,
184
+ max_seq_length=MAX_SEQ_LENGTH,
185
+ )
186
+
187
+ # Save the quantized model
188
+ save_path = f"{args.model_path}-quantized.w4a16"
189
+ model.save_pretrained(save_path, save_compressed=True)
190
+ tokenizer.save_pretrained(save_path)
191
+
192
+ if __name__ == "__main__":
193
+ main()
194
+ ```
195
+
196
  </details>
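The recipe above uses a symmetric group-wise INT4 scheme (`group_size=128`, `symmetric=True`). A minimal pure-Python sketch of what symmetric quantization does to a single (tiny, made-up) weight group, illustrative rather than llm-compressor's actual implementation:

```python
def quantize_group(weights, num_bits=4):
    # Symmetric scheme: one scale per group, zero-point fixed at 0.
    qmax = 2 ** (num_bits - 1) - 1                  # 7 for INT4 (range [-8, 7])
    scale = max(abs(w) for w in weights) / qmax
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    dequant = [qi * scale for qi in q]              # what the kernel reconstructs
    return q, scale, dequant

weights = [0.42, -0.13, 0.07, -0.35]                # one made-up weight group
q, scale, dq = quantize_group(weights)
# Each dequantized weight lands within half a quantization step of the original.
assert all(abs(w - d) <= scale / 2 + 1e-12 for w, d in zip(weights, dq))
```

In the real model the groups are 128 weights wide, so a single FP scale is amortized over 128 INT4 values.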
197
 
198
  ## Evaluation
 
206
 
207
  ```
208
  export VLLM_WORKER_MULTIPROC_METHOD=spawn
209
+ export MODEL="RedHatAI/SmolLM3-3B-quantized.w4a16"
210
  export MODEL_ARGS="model_name=$MODEL,dtype=auto,max_model_length=65536,gpu_memory_utilization=0.9,tensor_parallel_size=1,add_special_tokens=False,generation_parameters={max_new_tokens:32768,temperature:0.6,top_p:0.95}"
211
 
212
  export TASK=aime24 # {aime24, math_500, gpqa:diamond}
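The Recovery column in the tables below is simply the quantized model's score as a percentage of the baseline score; for instance, using values from the INT4 table:

```python
def recovery(quantized: float, baseline: float) -> float:
    # Quantized score as a percentage of the baseline score.
    return round(quantized / baseline * 100, 2)

print(recovery(39.27, 45.31))  # 86.67
print(recovery(87.55, 89.30))  # 98.04
```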
 
227
  </th>
228
  <th>HuggingFaceTB/SmolLM3-3B
229
  </th>
230
+ <th>RedHatAI/SmolLM3-3B-quantized.w4a16<br>(this model)
231
  </th>
232
  <th>Recovery
233
  </th>
 
239
  </td>
240
  <td>45.31
241
  </td>
242
+ <td>39.27
243
  </td>
244
+ <td>86.67%
245
  </td>
246
  </tr>
247
  <tr>
 
249
  </td>
250
  <td>89.30
251
  </td>
252
+ <td>87.55
253
  </td>
254
+ <td>98.04%
255
  </td>
256
  </tr>
257
  <tr>
 
259
  </td>
260
  <td>41.22
261
  </td>
262
+ <td>41.86
263
  </td>
264
+ <td>101.55%
265
  </td>
266
  </tr>
267
  <tr>
 
269
  </td>
270
  <td><strong>58.61</strong>
271
  </td>
272
+ <td><strong>56.23</strong>
273
  </td>
274
+ <td><strong>95.94%</strong>
275
  </td>
276
  </tr>
277
  <tr>