# Qwen2.5-0.5B-Instruct AWQ + FP8_BLOCK

This is a quantized version of Qwen/Qwen2.5-0.5B-Instruct using the AWQ + FP8_BLOCK quantization scheme.
## Model Details
- Base Model: Qwen2.5-0.5B-Instruct
- Quantization: AWQ + FP8_BLOCK
- Size: 0.92 GB (1.2x compression from the original 1.1 GB)
- Precision: FP8 (E4M3)
- Parameters: 0.6B
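The compression ratio quoted above follows directly from the two checkpoint sizes; a quick sanity check:

```python
# Sanity check of the compression ratio in the list above.
original_gb = 1.1    # original checkpoint size
quantized_gb = 0.92  # AWQ + FP8_BLOCK checkpoint size

ratio = original_gb / quantized_gb
print(f"{ratio:.2f}x")  # 1.20x
```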
## Performance
Evaluated on the GSM8K benchmark:
| Metric | Score |
|---|---|
| Strict Match | 17.97% |
| Flexible Extract | 29.80% |
For better accuracy, use the FP8_DYNAMIC variant which achieves 22.67% strict match.
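The gap between the two metrics comes from how lm-evaluation-harness extracts the model's answer: strict match requires GSM8K's canonical `#### <answer>` format, while flexible extract takes the last number anywhere in the completion. A simplified sketch of the idea (the exact regexes live in the harness's gsm8k task config, so treat these patterns as illustrative):

```python
import re

def strict_match(completion):
    # Strict: the answer must appear in GSM8K's "#### <number>" form.
    m = re.search(r"#### (\-?[0-9\.\,]+)", completion)
    return m.group(1).replace(",", "") if m else None

def flexible_extract(completion):
    # Flexible: take the last number anywhere in the completion.
    nums = re.findall(r"-?[\d,]*\.?\d+", completion)
    return nums[-1].replace(",", "") if nums else None

out = "25 * 4 = 100, so the answer is 100."
print(strict_match(out))      # None  -> wrong under strict match
print(flexible_extract(out))  # '100' -> right under flexible extract
```

A correct answer phrased without the `####` marker counts only under flexible extract, which is why that score is consistently higher.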
## Usage

### Loading with Transformers
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "rtj1/Qwen2.5-0.5B-AWQ-FP8-Block",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("rtj1/Qwen2.5-0.5B-AWQ-FP8-Block")

prompt = "What is 25 * 4?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
### Loading with vLLM
```python
from vllm import LLM, SamplingParams

llm = LLM(model="rtj1/Qwen2.5-0.5B-AWQ-FP8-Block")
sampling_params = SamplingParams(temperature=0.0, max_tokens=100)

prompts = ["What is 25 * 4?"]
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```
## Quantization Details
Created using llm-compressor with the FP8_BLOCK scheme:
```python
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot

recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_BLOCK",
    ignore=["lm_head"]
)

oneshot(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    recipe=recipe,
    output_dir="Qwen2.5-0.5B-Instruct-awq-fp8-block"
)
```
Quantization time: ~4 minutes on an L4 GPU.
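FP8_BLOCK stores one scale per block of weights rather than one per tensor, which limits how far a single outlier can distort the scale for the rest of the matrix. A minimal numpy sketch of the idea — the 128x128 block size and the E4M3 max of 448 are assumptions here, and integer rounding stands in for the real E4M3 cast:

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3
BLOCK = 128       # assumed weight block size (128x128)

def quantize_block(w):
    # One scale per block: map the block's max magnitude onto the FP8 range.
    scale = np.abs(w).max() / E4M3_MAX
    q = np.round(w / scale)  # coarse stand-in for the real E4M3 cast
    return q, scale

def dequantize_block(q, scale):
    return q * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(BLOCK, BLOCK)).astype(np.float32)
q, s = quantize_block(w)
err = np.abs(dequantize_block(q, s) - w).max()
print(f"max abs error: {err:.4f}")
```

Per-block scales keep the quantization error proportional to each block's own dynamic range, which is the accuracy/size trade-off this scheme buys over per-tensor FP8.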
## Evaluation
Benchmarked using lm-evaluation-harness:
```bash
lm_eval \
  --model hf \
  --model_args pretrained=rtj1/Qwen2.5-0.5B-AWQ-FP8-Block,dtype=auto \
  --tasks gsm8k \
  --batch_size 16
```
Evaluation time: ~82 minutes on an L4 GPU.
## Related Work
- FP8_DYNAMIC variant - Recommended (22.67% GSM8K strict match)
- llm-compressor PR #2330 - Evaluation comparison and reproducible workflow
## Hardware Requirements
- GPU: Any GPU with FP8 support (L4, A100, H100)
- VRAM: ~1GB minimum for inference
- Works with Google Colab L4 (22.5GB VRAM)
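Native FP8 tensor-core support generally requires CUDA compute capability 8.9 or higher (Ada, e.g. the L4, and Hopper); on older parts such as the A100 (8.0), runtimes like vLLM can fall back to dequantizing the FP8 weights. A small helper to check, where the capability tuples below are the well-known values for the GPUs listed above:

```python
def has_native_fp8(capability):
    # Native FP8 tensor cores require compute capability >= 8.9 (Ada and newer).
    return tuple(capability) >= (8, 9)

for name, cap in [("L4", (8, 9)), ("A100", (8, 0)), ("H100", (9, 0))]:
    print(f"{name}: native FP8 = {has_native_fp8(cap)}")
```

On a live machine, `torch.cuda.get_device_capability()` returns this tuple for the current GPU.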
## Citation
```bibtex
@misc{qwen2.5-awq-fp8-block,
  author    = {Tharun Jagarlamudi},
  title     = {Qwen2.5-0.5B-Instruct AWQ + FP8_BLOCK},
  year      = {2026},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/rtj1/Qwen2.5-0.5B-AWQ-FP8-Block}
}
```
## License
Same as base model: Qwen License