# Qwen2.5-0.5B-Instruct AWQ + FP8_DYNAMIC

This is a quantized version of Qwen/Qwen2.5-0.5B-Instruct, produced with the AWQ + FP8_DYNAMIC quantization scheme.
## Model Details

- Base Model: Qwen2.5-0.5B-Instruct
- Quantization: AWQ + FP8_DYNAMIC
- Size: 0.92 GB (~1.2× compression from the original 1.1 GB)
- Precision: FP8 (E4M3)
- Parameters: 0.6B
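The compression figure above follows directly from the two sizes; a quick sanity check (pure arithmetic, no model download needed). The ratio stays well below the ~2× you would expect from halving bytes-per-weight partly because some modules, such as `lm_head`, are left unquantized (see the recipe below):

```python
# Sanity-check the compression ratio quoted in this card.
original_gb = 1.1    # Qwen2.5-0.5B-Instruct original checkpoint, as stated above
quantized_gb = 0.92  # this FP8 checkpoint, as stated above

ratio = original_gb / quantized_gb
print(f"compression: {ratio:.1f}x")  # prints "compression: 1.2x"
```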
## Performance

Evaluated on the GSM8K benchmark:

| Metric | Score |
|---|---|
| Strict Match | 22.67% |
| Flexible Extract | 30.78% |
This outperforms the FP8_BLOCK quantization scheme (17.97% strict match) by ~4.7 percentage points while maintaining the same model size.
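The gap quoted above is in percentage points, not a relative improvement; checking it against the two self-reported strict-match scores:

```python
# Strict-match scores stated in this card (self-reported, GSM8K).
awq_fp8_dynamic = 22.67
fp8_block = 17.97

delta = awq_fp8_dynamic - fp8_block
print(f"improvement: {delta:.2f} percentage points")  # prints "improvement: 4.70 percentage points"
```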
## Usage

### Loading with Transformers
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "rtj1/Qwen2.5-0.5B-AWQ-FP8-Dynamic",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("rtj1/Qwen2.5-0.5B-AWQ-FP8-Dynamic")

prompt = "What is 25 * 4?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
### Loading with vLLM
```python
from vllm import LLM, SamplingParams

llm = LLM(model="rtj1/Qwen2.5-0.5B-AWQ-FP8-Dynamic")
sampling_params = SamplingParams(temperature=0.0, max_tokens=100)

prompts = ["What is 25 * 4?"]
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```
## Quantization Details

Created using `llm-compressor` with the `FP8_DYNAMIC` scheme:
```python
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot

recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["lm_head"],
)

oneshot(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    recipe=recipe,
    output_dir="Qwen2.5-0.5B-Instruct-awq-fp8-dynamic",
)
```
Quantization time: ~5 minutes on an L4 GPU
## Evaluation

Benchmarked using `lm-evaluation-harness`:
```shell
lm_eval \
  --model hf \
  --model_args pretrained=rtj1/Qwen2.5-0.5B-AWQ-FP8-Dynamic,dtype=auto \
  --tasks gsm8k \
  --batch_size 16
```
Evaluation time: ~71 minutes on an L4 GPU
## Related Work
- FP8_BLOCK variant - Alternative quantization scheme (17.97% GSM8K strict match)
- llm-compressor PR #2330 - Evaluation comparison and reproducible workflow
## Hardware Requirements
- GPU: Any GPU with FP8 support (L4, A100, H100)
- VRAM: ~1GB minimum for inference
- Works with Google Colab L4 (22.5GB VRAM)
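The ~1 GB VRAM figure is consistent with a back-of-the-envelope estimate from the parameter count stated above. This is a rough sketch, not a measurement: the overhead allowance for activations and KV cache is an assumption that varies with batch size and context length.

```python
# Rough VRAM estimate: FP8 (E4M3) stores ~1 byte per weight,
# plus a rough allowance for activations and KV cache.
params = 0.6e9       # parameter count stated in this card
bytes_per_param = 1  # FP8 weight storage
overhead_gb = 0.3    # rough assumption; grows with batch size and context length

weights_gb = params * bytes_per_param / 1e9
total_gb = weights_gb + overhead_gb
print(f"~{total_gb:.1f} GB")  # prints "~0.9 GB", consistent with the ~1 GB figure above
```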
## Citation

```bibtex
@misc{qwen2.5-awq-fp8-dynamic,
  author    = {Tharun Jagarlamudi},
  title     = {Qwen2.5-0.5B-Instruct AWQ + FP8_DYNAMIC},
  year      = {2026},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/rtj1/Qwen2.5-0.5B-AWQ-FP8-Dynamic}
}
```
## License

Same as the base model: Qwen License