---
library_name: vllm
license: apache-2.0
language:
  - en
  - fr
  - es
  - it
  - pt
  - zh
  - ar
  - ru
base_model:
  - HuggingFaceTB/SmolLM3-3B
tags:
- neuralmagic
- redhat
- llmcompressor
- int4
- w4a16
- quantized
---

## Model Overview
- **Model Architecture:** SmolLM3-3B
  - **Input:** Text
  - **Output:** Text
- **Model Optimizations:**
  - **Weight quantization:** INT4
  - **Activation quantization:** None
- **Release Date:** 07/31/2025
- **Version:** 1.0
- **License(s):** Apache-2.0
- **Model Developers:** Red Hat (Neural Magic)

### Model Optimizations

This model was obtained by quantizing the weights of [SmolLM3-3B](https://huggingface.co/HuggingFaceTB/SmolLM3-3B) to the INT4 data type.
This optimization reduces the number of bits used to represent each weight from 16 to 4, reducing GPU memory requirements by approximately 75%; activations are left unquantized.
Weight quantization also reduces disk size requirements by approximately 75%.
Only the weights of the linear operators within the transformer blocks are quantized.
The [llm-compressor](https://github.com/vllm-project/llm-compressor) library is used for quantization.
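
As a rough back-of-envelope illustration (the ~3B parameter count is approximate, and the small overhead of group-wise quantization scales is ignored):

```python
# Approximate weight-memory footprint before and after INT4 quantization.
num_params = 3.1e9                     # ~3B parameters (approximate)
bf16_gib = num_params * 2 / 1024**3    # 16-bit weights: ~5.8 GiB
int4_gib = num_params * 0.5 / 1024**3  # 4-bit weights:  ~1.4 GiB
print(f"memory reduction: {1 - int4_gib / bf16_gib:.0%}")  # 75%
```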

## Deployment

This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.

```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "RedHatAI/SmolLM3-3B-quantized.w4a16"
number_gpus = 1

sampling_params = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=256)

tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

# Render the chat template; add_generation_prompt starts the assistant turn
prompts = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

llm = LLM(model=model_id, tensor_parallel_size=number_gpus)

outputs = llm.generate(prompts, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)
```

vLLM also supports OpenAI-compatible serving; see the [documentation](https://docs.vllm.ai/en/latest/) for more details.
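
For example, a minimal setup (using vLLM's default host and port) might look like:

```bash
# Start an OpenAI-compatible server on http://localhost:8000
vllm serve RedHatAI/SmolLM3-3B-quantized.w4a16

# Query the chat completions endpoint
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "RedHatAI/SmolLM3-3B-quantized.w4a16",
        "messages": [{"role": "user", "content": "Who are you?"}]
    }'
```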


## Creation

<details>
  <summary>Creation details</summary>
  This model was created with [llm-compressor](https://github.com/vllm-project/llm-compressor) by running the command below:

```bash
python int4.py --model_path HuggingFaceTB/SmolLM3-3B --calib_size 1024 --dampening_frac 0.1 --observer minmax --actorder group --sym false
```
where `int4.py` is as follows:

```python
import argparse
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM

from compressed_tensors.quantization import (
    QuantizationScheme,
    QuantizationArgs,
    QuantizationType,
    QuantizationStrategy,
)
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot

# Constants
DATASET_ID = "neuralmagic/LLM_compression_calibration"
DATASET_SPLIT = "train"
MAX_SEQ_LENGTH = 8192
IGNORE_MODULES = ["lm_head"]

# Argument Parsing Utilities
def parse_actorder(value: str):
    value_lower = value.lower()
    if value_lower == "false":
        return False
    if value_lower in {"weight", "group"}:
        return value_lower
    raise argparse.ArgumentTypeError(f"Invalid --actorder. Choose 'group', 'weight', or 'false', got {value}")

def parse_sym(value: str):
    value_lower = value.lower()
    if value_lower in {"true", "false"}:
        return value_lower == "true"
    raise argparse.ArgumentTypeError(f"Invalid --sym. Use 'true' or 'false', got {value}")

# Argument Parser
def get_args():
    parser = argparse.ArgumentParser(description="Quantize a model with GPTQModifier.")
    parser.add_argument('--model_path', type=str, required=True, help="Path to the unquantized model.")
    parser.add_argument('--calib_size', type=int, default=256, help="Number of samples for calibration.")
    parser.add_argument('--dampening_frac', type=float, default=0.1, help="Dampening fraction for quantization.")
    parser.add_argument('--observer', type=str, default="minmax", help="Observer type used for quantization.")
    parser.add_argument('--sym', type=parse_sym, default=True, help="Symmetric quantization (true/false).")
    parser.add_argument('--actorder', type=parse_actorder, default=False,
                        help="Activation order: 'group', 'weight', or 'false'.")
    return parser.parse_args()

def main():
    args = get_args()

    model = AutoModelForCausalLM.from_pretrained(
        args.model_path,
        device_map="auto",
        torch_dtype="auto",
        use_cache=False,
        trust_remote_code=True,
    )
    tokenizer = AutoTokenizer.from_pretrained(args.model_path)

    # Load and preprocess dataset
    ds = load_dataset(DATASET_ID, split=DATASET_SPLIT)
    ds = ds.shuffle(seed=42).select(range(args.calib_size))
    ds = ds.map(lambda x: {"text": x["text"]})
    ds = ds.map(
        lambda x: tokenizer(x["text"], padding=False, truncation=False, add_special_tokens=True),
        remove_columns=ds.column_names
    )

    # Build Quantization Scheme
    quant_scheme = QuantizationScheme(
        targets=["Linear"],
        weights=QuantizationArgs(
            num_bits=4,
            type=QuantizationType.INT,
            symmetric=args.sym,
            group_size=128,
            strategy=QuantizationStrategy.GROUP,
            observer=args.observer,
            actorder=args.actorder
        ),
        input_activations=None,
        output_activations=None,
    )

    # Define compression recipe
    recipe = [
        GPTQModifier(
            targets=["Linear"],
            ignore=IGNORE_MODULES,
            dampening_frac=args.dampening_frac,
            config_groups={"group_0": quant_scheme},
        )
    ]

    # Apply quantization
    oneshot(
        model=model,
        dataset=ds,
        recipe=recipe,
        num_calibration_samples=args.calib_size,
        max_seq_length=MAX_SEQ_LENGTH,
    )

    # Save the quantized model
    save_path = f"{args.model_path}-quantized.w4a16"
    model.save_pretrained(save_path, save_compressed=True)
    tokenizer.save_pretrained(save_path)

if __name__ == "__main__":
    main()
```

</details>

## Evaluation

This model was evaluated on the well-known reasoning benchmarks AIME24, MATH-500, and GPQA-Diamond.
In all cases, model outputs were generated with the [vLLM](https://docs.vllm.ai/en/stable/) engine, and scores were collected with the [LightEval](https://github.com/huggingface/lighteval) library.
Recovery is reported as the quantized model's score as a percentage of the baseline model's score.


<details>
  <summary>Evaluation details</summary>

  ```bash
    export VLLM_WORKER_MULTIPROC_METHOD=spawn
    export MODEL="RedHatAI/SmolLM3-3B-quantized.w4a16"
    export MODEL_ARGS="model_name=$MODEL,dtype=auto,max_model_length=65536,gpu_memory_utilization=0.9,tensor_parallel_size=1,add_special_tokens=False,generation_parameters={max_new_tokens:32768,temperature:0.6,top_p:0.95}"

    export TASK=aime24 # {aime24, math_500, gpqa:diamond}

    lighteval vllm $MODEL_ARGS "lighteval|${TASK}|0|0" \
        --use-chat-template \
        --output-dir out_dir
  ```
</details>

### Accuracy

<table>
  <tr>
   <th>Category
   </th>
   <th>Benchmark
   </th>
   <th>HuggingFaceTB/SmolLM3-3B
   </th>
   <th>RedHatAI/SmolLM3-3B-quantized.w4a16<br>(this model)
   </th>
   <th>Recovery
   </th>
  </tr>
  <tr>
   <td rowspan="4" ><strong>Reasoning</strong>
   </td>
   <td>AIME24 (pass@1:64)
   </td>
   <td>45.31
   </td>
   <td>39.27
   </td>
   <td>86.67%
   </td>
  </tr>
  <tr>
   <td>MATH-500 (pass@1:4)
   </td>
   <td>89.30
   </td>
   <td>87.55
   </td>
   <td>98.04%
   </td>
  </tr>
  <tr>
   <td>GPQA-Diamond (pass@1:8)
   </td>
   <td>41.22
   </td>
   <td>41.86
   </td>
   <td>101.55%
   </td>
  </tr>
  <tr>
   <td><strong>Average</strong>
   </td>
   <td><strong>58.61</strong>
   </td>
   <td><strong>56.23</strong>
   </td>
   <td><strong>95.94%</strong>
   </td>
  </tr>
</table>