Qwen2.5-3B-Instruct-FP4-W4A4

Model Overview

This model is a quantized version of Qwen/Qwen2.5-3B-Instruct using NVFP4 W4A4 quantization with the compressed-tensors format.

  • Quantization Method: NVFP4 (nvfp4-pack-quantized), W4A4 (4-bit weights, 4-bit activations)
  • Quantization Strategy: Per-group (group size 16), symmetric
  • Calibration Dataset: CNN/DailyMail (512 samples, max sequence length 2048)
  • Format: compressed-tensors
  • Model Size: ~2.7 GB (vs ~6.0 GB for the original BF16 model)

Quantization Details

| Component   | Precision         | Strategy                  | Group Size | Observer | Symmetric |
|-------------|-------------------|---------------------------|------------|----------|-----------|
| Weights     | FP4 (4-bit float) | Per-group                 | 16         | MinMax   | Yes       |
| Activations | FP4 (4-bit float) | Per-group (dynamic local) | 16         | MinMax   | Yes       |
  • The lm_head layer is not quantized to preserve output quality.
  • Activations use dynamic local quantization for improved accuracy.
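
The same parameters should also be recorded in the checkpoint's config.json under quantization_config. A minimal sketch for inspecting them, assuming the compressed-tensors metadata was exported in the usual way by llmcompressor:

from transformers import AutoConfig

config = AutoConfig.from_pretrained("JongYeop/Qwen2.5-3B-Instruct-FP4-W4A4")
# quantization_config carries the compressed-tensors metadata
# (format, group size, symmetry, ignored modules) if it was exported as expected.
print(config.quantization_config)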

How to Use

With vLLM

from vllm import LLM, SamplingParams

# Load the quantized checkpoint; vLLM detects the compressed-tensors
# quantization config from the model files.
model = LLM(model="JongYeop/Qwen2.5-3B-Instruct-FP4-W4A4")

sampling_params = SamplingParams(max_tokens=512, temperature=0.7)
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello! Can you introduce yourself?"},
]
# chat() applies the model's chat template before generation.
outputs = model.chat(messages, sampling_params=sampling_params)
print(outputs[0].outputs[0].text)
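
For serving, the same checkpoint can be exposed through vLLM's OpenAI-compatible server (vllm serve JongYeop/Qwen2.5-3B-Instruct-FP4-W4A4) and queried with any OpenAI-style client. A minimal sketch, assuming the server is running locally on the default port 8000:

from openai import OpenAI

# Point the OpenAI client at the local vLLM server; the API key can be any placeholder.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="JongYeop/Qwen2.5-3B-Instruct-FP4-W4A4",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello! Can you introduce yourself?"},
    ],
    max_tokens=512,
    temperature=0.7,
)
print(response.choices[0].message.content)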

With Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "JongYeop/Qwen2.5-3B-Instruct-FP4-W4A4"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello! Can you introduce yourself?"},
]
# Build the prompt with the chat template, generate, and decode only the newly generated tokens.
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][len(inputs.input_ids[0]):], skip_special_tokens=True))
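
Continuing from the snippet above, you can sanity-check the size reduction by printing the loaded model's memory footprint; the exact number depends on whether the weights remain compressed at load time:

# get_memory_footprint() reports parameter and buffer memory in bytes.
print(f"Memory footprint: {model.get_memory_footprint() / 1e9:.2f} GB")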

Creation

This model was created using llmcompressor with the following recipe:

quant_stage:
  quant_modifiers:
    QuantizationModifier:
      targets: [Linear]
      ignore: [lm_head]
      scheme: NVFP4
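
The card does not include the exact invocation, but a typical llmcompressor oneshot run with the calibration settings listed above (CNN/DailyMail, 512 samples, max sequence length 2048) would look roughly like this sketch:

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-3B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# NVFP4 W4A4 recipe: quantize all Linear layers, keep lm_head in full precision.
recipe = QuantizationModifier(targets="Linear", scheme="NVFP4", ignore=["lm_head"])

oneshot(
    model=model,
    dataset="cnn_dailymail",
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
)

model.save_pretrained("Qwen2.5-3B-Instruct-FP4-W4A4", save_compressed=True)
tokenizer.save_pretrained("Qwen2.5-3B-Instruct-FP4-W4A4")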

Evaluation

No benchmark results are published for this quantized checkpoint; for baseline numbers, refer to the original Qwen2.5-3B-Instruct model card. FP4 W4A4 quantization offers significant compression (roughly 55% size reduction, ~2.7 GB vs ~6.0 GB) in exchange for some loss in output quality.
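
If you want numbers for this specific checkpoint, lm-evaluation-harness can be pointed at it directly; a minimal sketch using the vLLM backend (the task here is illustrative, not something reported by the author):

import lm_eval

# Run a single benchmark task against the quantized checkpoint.
results = lm_eval.simple_evaluate(
    model="vllm",
    model_args="pretrained=JongYeop/Qwen2.5-3B-Instruct-FP4-W4A4",
    tasks=["gsm8k"],
    batch_size="auto",
)
print(results["results"])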
