Qwen2.5-0.5B-Instruct AWQ + FP8_BLOCK

This is a quantized version of Qwen/Qwen2.5-0.5B-Instruct, produced with the AWQ + FP8_BLOCK quantization scheme.

Model Details

  • Base Model: Qwen2.5-0.5B-Instruct
  • Quantization: AWQ + FP8_BLOCK
  • Size: 0.92 GB (1.2x compression from the original 1.1 GB)
  • Precision: FP8 (E4M3)
  • Parameters: 0.6B
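FP8 E4M3 packs each weight into a single byte: 1 sign bit, 4 exponent bits (bias 7), and 3 mantissa bits, with a largest finite value of 448. A minimal decoder sketch, assuming the FN variant of E4M3 (no infinities, a single NaN code), which is the convention behind the F8_E4M3 tensor type:

```python
def e4m3_decode(code: int) -> float:
    """Decode one FP8 E4M3FN byte: 1 sign, 4 exponent (bias 7), 3 mantissa bits."""
    sign = -1.0 if code & 0x80 else 1.0
    exp = (code >> 3) & 0xF
    man = code & 0x7
    if exp == 0xF and man == 0x7:
        return float("nan")                    # E4M3FN reserves only this code; no infinities
    if exp == 0:
        return sign * (man / 8) * 2.0 ** -6    # subnormal range
    return sign * (1 + man / 8) * 2.0 ** (exp - 7)

print(e4m3_decode(0x38))  # 1.0
print(e4m3_decode(0x7E))  # 448.0, the largest finite E4M3 value
```

With only 256 codes, every weight must land on this coarse grid, which is why the scheme pairs FP8 storage with per-block scales (see Quantization Details below).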

Performance

Evaluated on GSM8K benchmark:

| Metric           | Score  |
|------------------|--------|
| Strict Match     | 17.97% |
| Flexible Extract | 29.80% |

For better accuracy, use the FP8_DYNAMIC variant, which achieves 22.67% strict match.
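The two GSM8K scores differ only in how the answer is pulled from the model's generation: strict match requires the `#### <answer>` marker from the GSM8K answer format, while flexible extract accepts the last number found anywhere in the output. An illustrative sketch with simplified regexes (not the exact lm-evaluation-harness patterns):

```python
import re

def strict_extract(text):
    """Strict match: the answer must follow GSM8K's '#### ' marker."""
    m = re.search(r"#### (-?[0-9.,]+)", text)
    return m.group(1).replace(",", "") if m else None

def flexible_extract(text):
    """Flexible extract: take the last number appearing anywhere in the text."""
    nums = re.findall(r"-?\d+(?:,\d{3})*(?:\.\d+)?", text)
    return nums[-1].replace(",", "") if nums else None

out = "25 * 4 means four groups of 25, so the answer is 100."
print(strict_extract(out))    # None -> counted wrong under strict match
print(flexible_extract(out))  # '100' -> counted right under flexible extract
```

This is why flexible extract scores well above strict match here: small instruct models often compute the right number without reproducing the `####` answer format.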

Usage

Loading with Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "rtj1/Qwen2.5-0.5B-AWQ-FP8-Block",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("rtj1/Qwen2.5-0.5B-AWQ-FP8-Block")

prompt = "What is 25 * 4?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Loading with vLLM

from vllm import LLM, SamplingParams

llm = LLM(model="rtj1/Qwen2.5-0.5B-AWQ-FP8-Block")
sampling_params = SamplingParams(temperature=0.0, max_tokens=100)

prompts = ["What is 25 * 4?"]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.outputs[0].text)

Quantization Details

Created using llm-compressor with the FP8_BLOCK scheme:

from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot

recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_BLOCK",
    ignore=["lm_head"]
)

oneshot(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    recipe=recipe,
    output_dir="Qwen2.5-0.5B-Instruct-awq-fp8-block"
)

Quantization time: ~4 minutes on L4 GPU
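FP8_BLOCK stores one scale per tile of each weight matrix (llm-compressor uses 128×128 blocks for this scheme), so every tile's largest magnitude maps onto E4M3's maximum of 448. A pure-Python sketch of the idea with toy 2×2 tiles; real kernels store actual FP8 bytes, while here the cast is emulated by rounding to the E4M3 grid:

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite E4M3 value: 1.75 * 2**8

def e4m3_round(x):
    """Round a float to the nearest E4M3 grid point (normal range only, for brevity)."""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(abs(x))               # abs(x) == m * 2**e with 0.5 <= m < 1
    q = math.ldexp(round(m * 16) / 16, e)   # keep 3 mantissa bits after the implicit 1
    return math.copysign(min(q, FP8_E4M3_MAX), x)

def fp8_block_quantize(w, block=2):
    """Quantize-dequantize a 2-D weight matrix with one scale per block x block tile."""
    rows, cols = len(w), len(w[0])
    deq = [[0.0] * cols for _ in range(rows)]
    scales = {}
    for i in range(0, rows, block):
        for j in range(0, cols, block):
            tile = [w[r][c] for r in range(i, i + block) for c in range(j, j + block)]
            scale = max(max(abs(v) for v in tile) / FP8_E4M3_MAX, 1e-12)
            scales[i // block, j // block] = scale
            for r in range(i, i + block):
                for c in range(j, j + block):
                    # store e4m3_round(w / scale) as FP8; dequantize by multiplying back
                    deq[r][c] = e4m3_round(w[r][c] / scale) * scale
    return deq, scales

w = [[1.0, -2.0], [0.5, 4.0]]
deq, scales = fp8_block_quantize(w)
print(deq)  # close to w: each tile's max magnitude maps onto 448
```

Per-block scales are what distinguish this scheme from FP8_DYNAMIC's per-tensor/per-token scaling: an outlier only degrades precision within its own tile rather than across the whole matrix.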

Evaluation

Benchmarked using lm-evaluation-harness:

lm_eval \
  --model hf \
  --model_args pretrained=rtj1/Qwen2.5-0.5B-AWQ-FP8-Block,dtype=auto \
  --tasks gsm8k \
  --batch_size 16

Evaluation time: ~82 minutes on L4 GPU

Hardware Requirements

  • GPU: Any GPU with FP8 support (L4, A100, H100)
  • VRAM: ~1GB minimum for inference
  • Works with Google Colab L4 (22.5GB VRAM)
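The ~1 GB figure can be sanity-checked with back-of-envelope arithmetic, assuming the card's 0.6B parameter count and 1 byte per FP8 weight (block scales and any layers kept in BF16, such as lm_head, add a little more):

```python
params = 0.6e9        # parameter count from the model card
bytes_per_param = 1   # FP8 stores one byte per weight
weight_gib = params * bytes_per_param / 1024**3
print(f"weights alone: {weight_gib:.2f} GiB")  # ~0.56 GiB; runtime buffers and KV cache push usage toward ~1 GB
```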

Citation

@misc{qwen2.5-awq-fp8-block,
  author = {Tharun Jagarlamudi},
  title = {Qwen2.5-0.5B-Instruct AWQ + FP8_BLOCK},
  year = {2026},
  publisher = {HuggingFace},
  url = {https://huggingface.co/rtj1/Qwen2.5-0.5B-AWQ-FP8-Block}
}

License

Same as base model: Qwen License
