# Qwen2.5-3B-Instruct-FP4-W4A4

## Model Overview
This model is a quantized version of Qwen/Qwen2.5-3B-Instruct using NVFP4 W4A4 quantization with the compressed-tensors format.
- Quantization Method: NVFP4 (nvfp4-pack-quantized), W4A4 (4-bit weights, 4-bit activations)
- Quantization Strategy: Per-group (group size 16), symmetric
- Calibration Dataset: CNN/DailyMail (512 samples, max sequence length 2048)
- Format: compressed-tensors
- Model Size: ~2.7 GB (vs ~6.0 GB for the original BF16 model)
## Quantization Details
| Component | Precision | Strategy | Group Size | Observer | Symmetric |
|---|---|---|---|---|---|
| Weights | FP4 (4-bit float) | Per-group | 16 | MinMax | Yes |
| Activations | FP4 (4-bit float) | Per-group (dynamic local) | 16 | MinMax | Yes |
- The `lm_head` layer is not quantized, to preserve output quality.
- Activations use dynamic local quantization for improved accuracy.
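The full scheme is recorded in the checkpoint's `config.json` under `quantization_config`. A minimal sketch for inspecting it (assuming the model is reachable on the Hub or already downloaded):

```python
from transformers import AutoConfig

# Load only the configuration; the quantization_config entry describes the
# compressed-tensors NVFP4 scheme (group size, symmetry, ignored modules, ...).
config = AutoConfig.from_pretrained("JongYeop/Qwen2.5-3B-Instruct-FP4-W4A4")
print(config.quantization_config)
```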
## How to Use

### With vLLM
```python
from vllm import LLM, SamplingParams

model = LLM(model="JongYeop/Qwen2.5-3B-Instruct-FP4-W4A4")
sampling_params = SamplingParams(max_tokens=512, temperature=0.7)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello! Can you introduce yourself?"},
]

outputs = model.chat(messages, sampling_params=sampling_params)
print(outputs[0].outputs[0].text)
```
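The model can also be served through vLLM's OpenAI-compatible server (e.g. `vllm serve JongYeop/Qwen2.5-3B-Instruct-FP4-W4A4`) and queried with any OpenAI client. A minimal sketch, assuming the server is running locally on the default port 8000:

```python
from openai import OpenAI

# Point the client at the local vLLM server; the API key is unused but required.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="JongYeop/Qwen2.5-3B-Instruct-FP4-W4A4",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello! Can you introduce yourself?"},
    ],
    max_tokens=512,
    temperature=0.7,
)
print(response.choices[0].message.content)
```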
### With Transformers
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "JongYeop/Qwen2.5-3B-Instruct-FP4-W4A4"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello! Can you introduce yourself?"},
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][len(inputs.input_ids[0]):], skip_special_tokens=True))
```
## Creation
This model was created using `llmcompressor` with the following recipe:
```yaml
quant_stage:
  quant_modifiers:
    QuantizationModifier:
      targets: [Linear]
      ignore: [lm_head]
      scheme: NVFP4
```
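Below is a rough sketch of how this recipe and the calibration settings listed above could be combined with llmcompressor's `oneshot` entry point. The exact import paths and dataset handling depend on the llmcompressor version, so treat this as an illustration rather than the exact script used:

```python
from llmcompressor import oneshot
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-3B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Apply the NVFP4 recipe with CNN/DailyMail calibration data
# (512 samples, max sequence length 2048, as listed above).
oneshot(
    model=model,
    dataset="cnn_dailymail",
    recipe="recipe.yaml",  # the YAML recipe shown above
    max_seq_length=2048,
    num_calibration_samples=512,
)

# Save in the compressed-tensors format.
model.save_pretrained("Qwen2.5-3B-Instruct-FP4-W4A4", save_compressed=True)
tokenizer.save_pretrained("Qwen2.5-3B-Instruct-FP4-W4A4")
```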
## Evaluation
For evaluation results, please refer to the original Qwen2.5-3B-Instruct model card. FP4 W4A4 quantization offers significant compression (~55% size reduction) with some quality trade-off.
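For a quick sanity check, the quantized checkpoint can be scored with lm-evaluation-harness and compared against the original model on the same tasks. A minimal sketch (the task selection here is illustrative, not the benchmark set from the original model card):

```python
import lm_eval

# Evaluate the quantized model; the same call on Qwen/Qwen2.5-3B-Instruct
# gives the baseline to compare against.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=JongYeop/Qwen2.5-3B-Instruct-FP4-W4A4",
    tasks=["gsm8k", "hellaswag"],
    batch_size=8,
)
print(results["results"])
```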