# Qwen2.5-3B-Instruct-FP4-W4A4

## Model Overview
This model is a quantized version of Qwen/Qwen2.5-3B-Instruct using NVFP4 W4A4 quantization with the compressed-tensors format.
- Quantization Method: NVFP4 (nvfp4-pack-quantized), W4A4 (4-bit weights, 4-bit activations)
- Quantization Strategy: Per-group (group size 16), symmetric
- Calibration Dataset: CNN/DailyMail (512 samples, max sequence length 2048)
- Format: compressed-tensors
- Model Size: ~2.7 GB (vs ~6.0 GB for the original BF16 model)
## Quantization Details
| Component | Precision | Strategy | Group Size | Observer | Symmetric |
|---|---|---|---|---|---|
| Weights | FP4 (4-bit float) | Per-group | 16 | MinMax | Yes |
| Activations | FP4 (4-bit float) | Per-group (dynamic local) | 16 | MinMax | Yes |
- The `lm_head` layer is not quantized, to preserve output quality.
- Activations use dynamic local quantization for improved accuracy.
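The full scheme is recorded in the checkpoint's `config.json` under `quantization_config`. A minimal sketch for inspecting it (assuming the model is reachable on the Hub or already downloaded):

```python
from transformers import AutoConfig

# Load only the configuration; the quantization_config entry describes the
# compressed-tensors NVFP4 scheme (group size, symmetry, ignored modules, ...).
config = AutoConfig.from_pretrained("JongYeop/Qwen2.5-3B-Instruct-FP4-W4A4")
print(config.quantization_config)
```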
## How to Use

### With vLLM
```python
from vllm import LLM, SamplingParams

model = LLM(model="JongYeop/Qwen2.5-3B-Instruct-FP4-W4A4")
sampling_params = SamplingParams(max_tokens=512, temperature=0.7)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello! Can you introduce yourself?"},
]

outputs = model.chat(messages, sampling_params=sampling_params)
print(outputs[0].outputs[0].text)
```
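The model can also be served through vLLM's OpenAI-compatible server (e.g. `vllm serve JongYeop/Qwen2.5-3B-Instruct-FP4-W4A4`) and queried with any OpenAI client. A minimal sketch, assuming the server is running locally on the default port 8000:

```python
from openai import OpenAI

# Point the client at the local vLLM server; the API key is unused but required.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="JongYeop/Qwen2.5-3B-Instruct-FP4-W4A4",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello! Can you introduce yourself?"},
    ],
    max_tokens=512,
    temperature=0.7,
)
print(response.choices[0].message.content)
```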
### With Transformers
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "JongYeop/Qwen2.5-3B-Instruct-FP4-W4A4"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello! Can you introduce yourself?"},
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][len(inputs.input_ids[0]):], skip_special_tokens=True))
```
## Creation
This model was created using `llmcompressor` with the following recipe:
```yaml
quant_stage:
  quant_modifiers:
    QuantizationModifier:
      targets: [Linear]
      ignore: [lm_head]
      scheme: NVFP4
```
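Below is a rough sketch of how this recipe and the calibration settings listed above could be combined with llmcompressor's `oneshot` entry point. The exact import paths and dataset handling depend on the llmcompressor version, so treat this as an illustration rather than the exact script used:

```python
from llmcompressor import oneshot
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-3B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Apply the NVFP4 recipe with CNN/DailyMail calibration data
# (512 samples, max sequence length 2048, as listed above).
oneshot(
    model=model,
    dataset="cnn_dailymail",
    recipe="recipe.yaml",  # the YAML recipe shown above
    max_seq_length=2048,
    num_calibration_samples=512,
)

# Save in the compressed-tensors format.
model.save_pretrained("Qwen2.5-3B-Instruct-FP4-W4A4", save_compressed=True)
tokenizer.save_pretrained("Qwen2.5-3B-Instruct-FP4-W4A4")
```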
## Evaluation
For evaluation results, please refer to the original Qwen2.5-3B-Instruct model card. FP4 W4A4 quantization offers significant compression (~55% size reduction) with some quality trade-off.
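For a quick sanity check, the quantized checkpoint can be scored with lm-evaluation-harness and compared against the original model on the same tasks. A minimal sketch (the task selection here is illustrative, not the benchmark set from the original model card):

```python
import lm_eval

# Evaluate the quantized model; the same call on Qwen/Qwen2.5-3B-Instruct
# gives the baseline to compare against.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=JongYeop/Qwen2.5-3B-Instruct-FP4-W4A4",
    tasks=["gsm8k", "hellaswag"],
    batch_size=8,
)
print(results["results"])
```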