Qwen2.5-0.5B-Instruct AWQ + FP8_DYNAMIC

This is a quantized version of Qwen/Qwen2.5-0.5B-Instruct using the AWQ + FP8_DYNAMIC quantization scheme.

Model Details

  • Base Model: Qwen2.5-0.5B-Instruct
  • Quantization: AWQ + FP8_DYNAMIC
  • Size: 0.92 GB (~1.2x compression from the original 1.1 GB)
  • Precision: FP8 (E4M3)
  • Parameters: 0.6B
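
The compression figure above can be sanity-checked with quick arithmetic (sizes taken from the list above):

```python
original_gb = 1.1    # original BF16 checkpoint size
quantized_gb = 0.92  # quantized checkpoint size

ratio = original_gb / quantized_gb
print(f"{ratio:.2f}x compression")  # ~1.20x
```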

Performance

Evaluated on GSM8K benchmark:

Metric             Score
------             -----
Strict Match       22.67%
Flexible Extract   30.78%

This outperforms the FP8_BLOCK quantization scheme (17.97% strict match) by roughly 4.7 percentage points while maintaining the same model size.
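
For context, the two GSM8K metrics differ only in how the numeric answer is parsed from the model's completion: strict match requires the GSM8K-style `#### <number>` marker, while flexible extract accepts the last number appearing anywhere in the output. A rough sketch of the two extraction rules (the regexes here are illustrative, not lm-evaluation-harness's exact patterns):

```python
import re

def strict_extract(completion):
    # Strict match: answer must follow the GSM8K "####" marker.
    m = re.search(r"####\s*(-?[\d,]+(?:\.\d+)?)", completion)
    return m.group(1).replace(",", "") if m else None

def flexible_extract(completion):
    # Flexible extract: take the last number anywhere in the completion.
    nums = re.findall(r"-?\d[\d,]*(?:\.\d+)?", completion)
    return nums[-1].replace(",", "") if nums else None

text = "25 * 4 = 100. The answer is 100"
print(strict_extract(text))    # None (no "####" marker)
print(flexible_extract(text))  # 100
```

This is why flexible extract typically scores higher: a model can reason to the right number without emitting the exact `####` format.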

Usage

Loading with Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "rtj1/Qwen2.5-0.5B-AWQ-FP8-Dynamic",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("rtj1/Qwen2.5-0.5B-AWQ-FP8-Dynamic")

prompt = "What is 25 * 4?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Loading with vLLM

from vllm import LLM, SamplingParams

llm = LLM(model="rtj1/Qwen2.5-0.5B-AWQ-FP8-Dynamic")
sampling_params = SamplingParams(temperature=0.0, max_tokens=100)

prompts = ["What is 25 * 4?"]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.outputs[0].text)

Quantization Details

Created using llm-compressor with the FP8_DYNAMIC scheme:

from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot

recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["lm_head"]
)

oneshot(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    recipe=recipe,
    output_dir="Qwen2.5-0.5B-Instruct-awq-fp8-dynamic"
)

Quantization time: ~5 minutes on L4 GPU
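
Under FP8_DYNAMIC, weight scales are fixed at quantization time while activation scales are computed on the fly per token at inference, which is why no calibration dataset is needed above. A minimal pure-Python sketch of the dynamic-scaling step (rounding to the E4M3 grid is omitted; 448 is E4M3's largest finite value):

```python
E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

def dynamic_fp8_scale(row):
    # Per-token dynamic quantization: the scale is derived from this
    # row's own max, so no calibration data is required.
    amax = max(abs(x) for x in row)
    scale = amax / E4M3_MAX if amax > 0 else 1.0
    # After dividing by the scale, values fit inside the E4M3 range.
    scaled = [x / scale for x in row]
    return scale, scaled

scale, scaled = dynamic_fp8_scale([0.5, -3.0, 2.25])
print(scale)                         # 3.0 / 448 ≈ 0.0067
print(max(abs(v) for v in scaled))   # ≈ 448.0
```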

Evaluation

Benchmarked using lm-evaluation-harness:

lm_eval \
  --model hf \
  --model_args pretrained=rtj1/Qwen2.5-0.5B-AWQ-FP8-Dynamic,dtype=auto \
  --tasks gsm8k \
  --batch_size 16

Evaluation time: ~71 minutes on L4 GPU

Hardware Requirements

  • GPU: Any GPU with FP8 support (L4, A100, H100)
  • VRAM: ~1GB minimum for inference
  • Works with Google Colab L4 (22.5GB VRAM)
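
Native FP8 kernels generally require CUDA compute capability 8.9 or later (L4 is 8.9, H100 is 9.0); on older GPUs such as the A100 (8.0), some runtimes fall back to FP8 weight-only emulation. A small sketch of the capability check; on a live machine, `torch.cuda.get_device_capability()` returns the same `(major, minor)` tuple:

```python
def has_native_fp8(capability):
    # Native FP8 tensor cores arrived with compute capability 8.9 (Ada)
    # and are also present on 9.0 (Hopper).
    return tuple(capability) >= (8, 9)

# Capabilities for the GPUs listed above (standard NVIDIA specs):
for name, cap in [("L4", (8, 9)), ("A100", (8, 0)), ("H100", (9, 0))]:
    print(name, has_native_fp8(cap))  # L4 True, A100 False, H100 True
```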

Citation

@misc{qwen2.5-awq-fp8-dynamic,
  author = {Tharun Jagarlamudi},
  title = {Qwen2.5-0.5B-Instruct AWQ + FP8_DYNAMIC},
  year = {2026},
  publisher = {HuggingFace},
  url = {https://huggingface.co/rtj1/Qwen2.5-0.5B-AWQ-FP8-Dynamic}
}

License

Same as base model: Qwen License
