language:
  - en
license: apache-2.0
library_name: transformers
tags:
  - quantization
  - awq
  - fp8
  - llm-compressor
  - vllm
  - model-compression
  - qwen2.5
base_model: Qwen/Qwen2.5-0.5B-Instruct
datasets:
  - gsm8k
model-index:
  - name: Qwen2.5-0.5B-AWQ-FP8-Dynamic
    results:
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: GSM8K
          type: gsm8k
        metrics:
          - type: exact_match
            value: 22.67
            name: Strict Match
          - type: flexible_extract
            value: 30.78
            name: Flexible Extract

Qwen2.5-0.5B-Instruct AWQ + FP8_DYNAMIC

This is a quantized version of Qwen/Qwen2.5-0.5B-Instruct using the AWQ + FP8_DYNAMIC quantization scheme.

Model Details

  • Base Model: Qwen2.5-0.5B-Instruct
  • Quantization: AWQ + FP8_DYNAMIC
  • Size: 0.92 GB (1.2x compression from the original 1.1 GB)
  • Precision: FP8 (E4M3)
  • Parameters: 0.6B
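In the FP8_DYNAMIC scheme, weight scales are fixed at quantization time while activation scales are computed per token at inference time. A minimal NumPy sketch of the dynamic-activation side (illustrative only: it rounds to a scaled grid rather than performing true E4M3 mantissa rounding):

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

def dynamic_quantize(x):
    # One scale per row ("per token"), computed on the fly at inference time.
    scale = np.abs(x).max(axis=-1, keepdims=True) / E4M3_MAX
    q = np.round(x / scale)  # stand-in for true E4M3 rounding
    return q, scale

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8)).astype(np.float32)
q, scale = dynamic_quantize(x)
x_hat = q * scale  # dequantize
print(float(np.abs(x - x_hat).max()))  # small reconstruction error
```

Because each row gets its own scale, outlier tokens do not force a coarse grid onto the whole batch, which is the main accuracy advantage of dynamic over static activation quantization.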

Performance

Evaluated on GSM8K benchmark:

Metric             Score
Strict Match       22.67%
Flexible Extract   30.78%

This outperforms the FP8_BLOCK quantization scheme (17.97% strict match) by ~4.7 percentage points at the same model size.
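The gap versus the FP8_BLOCK baseline follows directly from the two strict-match scores:

```python
# Strict-match accuracy on GSM8K, in percent (from the table above).
awq_fp8_dynamic = 22.67
fp8_block = 17.97  # FP8_BLOCK baseline mentioned in the text
print(round(awq_fp8_dynamic - fp8_block, 2))  # gap in percentage points
```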

Usage

Loading with Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer

# Loading this checkpoint requires the compressed-tensors package
# (pip install compressed-tensors).
model = AutoModelForCausalLM.from_pretrained(
    "rtj1/Qwen2.5-0.5B-AWQ-FP8-Dynamic",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("rtj1/Qwen2.5-0.5B-AWQ-FP8-Dynamic")

prompt = "What is 25 * 4?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Loading with vLLM

from vllm import LLM, SamplingParams

llm = LLM(model="rtj1/Qwen2.5-0.5B-AWQ-FP8-Dynamic")
sampling_params = SamplingParams(temperature=0.0, max_tokens=100)

prompts = ["What is 25 * 4?"]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.outputs[0].text)

Quantization Details

Created using llm-compressor; the FP8_DYNAMIC stage of the recipe is shown below:

from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot

recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["lm_head"]
)

oneshot(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    recipe=recipe,
    output_dir="Qwen2.5-0.5B-Instruct-awq-fp8-dynamic"
)

Quantization time: ~5 minutes on L4 GPU

Evaluation

Benchmarked using lm-evaluation-harness:

lm_eval \
  --model hf \
  --model_args pretrained=rtj1/Qwen2.5-0.5B-AWQ-FP8-Dynamic,dtype=auto \
  --tasks gsm8k \
  --batch_size 16

Evaluation time: ~71 minutes on L4 GPU

Hardware Requirements

  • GPU: any NVIDIA GPU with FP8 support; native FP8 tensor cores require Ada or Hopper (L4, H100), while A100 (Ampere) relies on weight-only fallback kernels in vLLM
  • VRAM: ~1GB minimum for inference
  • Works with Google Colab L4 (22.5GB VRAM)
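A quick way to check for native FP8 support is the CUDA compute capability; this helper is a sketch (the capability thresholds are NVIDIA's, but the fallback behavior depends on your inference framework and version):

```python
def has_native_fp8(capability):
    """True if the CUDA compute capability has FP8 tensor cores (SM 8.9+).

    L4 (Ada) reports (8, 9) and H100 (Hopper) reports (9, 0); A100
    (Ampere, (8, 0)) lacks native FP8 but frameworks such as vLLM can
    still serve FP8 checkpoints there via weight-only fallback kernels.
    """
    return tuple(capability) >= (8, 9)

try:
    import torch  # optional: query the local GPU if PyTorch is installed
    if torch.cuda.is_available():
        print(has_native_fp8(torch.cuda.get_device_capability()))
except ImportError:
    pass
```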

Citation

@misc{qwen2.5-awq-fp8-dynamic,
  author = {Tharun Jagarlamudi},
  title = {Qwen2.5-0.5B-Instruct AWQ + FP8_DYNAMIC},
  year = {2026},
  publisher = {HuggingFace},
  url = {https://huggingface.co/rtj1/Qwen2.5-0.5B-AWQ-FP8-Dynamic}
}

License

Same as base model: Apache 2.0