---
license: apache-2.0
base_model: Qwen/Qwen2.5-VL-7B-Instruct
tags:
  - quantized
  - int8
  - w8a8
  - vllm
  - compressed-tensors
  - vision-language
library_name: transformers
pipeline_tag: image-text-to-text
---

# Qwen2.5-VL-7B-Instruct W8A8 Quantized

This is an INT8-quantized version of Qwen/Qwen2.5-VL-7B-Instruct using the W8A8 (8-bit weights, 8-bit activations) quantization scheme.

## Model Details

### Quantization Details

- **Weight Quantization:** Static per-channel symmetric INT8
- **Activation Quantization:** Dynamic per-token symmetric INT8
- **Quantization Strategy:** Token-wise for activations, channel-wise for weights
- **Calibration Dataset:** flickr30k (64 samples)
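The W8A8 arithmetic above can be sketched in a few lines of NumPy: weights get one static scale per output channel, activations get one scale per token computed at runtime, and the INT8 matmul accumulates in INT32 before dequantizing. This is an illustrative sketch (shapes and values are made up), not the llm-compressor implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 8)).astype(np.float32)  # [out_channels, in_channels]
X = rng.standard_normal((3, 8)).astype(np.float32)  # [tokens, in_channels]

# Static per-channel symmetric INT8 weight quantization:
# one scale per output channel, zero-point fixed at 0.
w_scale = np.abs(W).max(axis=1, keepdims=True) / 127.0
W_q = np.clip(np.round(W / w_scale), -127, 127).astype(np.int8)

# Dynamic per-token symmetric INT8 activation quantization:
# each token's scale is computed from its own range at inference time.
x_scale = np.abs(X).max(axis=1, keepdims=True) / 127.0
X_q = np.clip(np.round(X / x_scale), -127, 127).astype(np.int8)

# INT8 matmul accumulated in INT32, then dequantized with the
# outer product of the token scales and channel scales.
Y_int32 = X_q.astype(np.int32) @ W_q.T.astype(np.int32)
Y = Y_int32.astype(np.float32) * x_scale * w_scale.T

Y_ref = X @ W.T
print(np.abs(Y - Y_ref).max())  # small quantization error
```

The symmetric (zero-point-free) scheme is what keeps the dequantization a pure scale multiply, which maps cleanly onto INT8 GEMM kernels.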

## Usage

### With vLLM

```python
from vllm import LLM, SamplingParams

# Load the quantized model
llm = LLM(
    model="lsm0729/Qwen2.5-VL-7B-Instruct-quantized.w8a8",
    trust_remote_code=True,
    max_model_len=4096,
)

# Generate
prompts = ["Describe the typical use cases for a vision-language model."]
sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=512)
outputs = llm.generate(prompts, sampling_params)
print(outputs[0].outputs[0].text)
```
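Since this is a vision-language model, requests will usually include an image. A common way to do that with vLLM is its OpenAI-style `llm.chat()` interface; the snippet below only builds the message payload (the image URL is a placeholder), with the actual GPU call shown as a comment.

```python
# OpenAI-style multimodal chat message for vLLM's llm.chat() API.
# The image URL is a placeholder; replace it with a real image.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": "https://example.com/demo.jpg"}},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# With a GPU and the model loaded as above, this would run:
# outputs = llm.chat(messages, sampling_params)
# print(outputs[0].outputs[0].text)
```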

## Requirements

```bash
pip install "vllm>=0.14.0"
pip install qwen-vl-utils
```

## Performance

This quantized model provides:

- ~2x weight-memory reduction compared to FP16
- Faster inference with INT8 compute kernels
- Minimal accuracy degradation
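The ~2x figure follows directly from byte widths: FP16 stores 2 bytes per weight, INT8 stores 1. A quick back-of-envelope check (the ~7.6B parameter count is an assumption; the exact count differs slightly, and quantization scales add a small overhead, hence "~2x"):

```python
# Back-of-envelope weight-memory estimate for a ~7.6B-parameter model.
params = 7.6e9
fp16_gb = params * 2 / 1e9  # 2 bytes per FP16 weight
int8_gb = params * 1 / 1e9  # 1 byte per INT8 weight
print(f"FP16: {fp16_gb:.1f} GB, INT8: {int8_gb:.1f} GB, "
      f"ratio: {fp16_gb / int8_gb:.1f}x")
# FP16: 15.2 GB, INT8: 7.6 GB, ratio: 2.0x
```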

## Citation

If you use this model, please cite the original Qwen2.5-VL paper and model:

```bibtex
@article{qwen2.5-vl,
  title={Qwen2.5-VL Technical Report},
  author={Qwen Team},
  journal={arXiv preprint arXiv:2502.13923},
  year={2025}
}
```

## License

This quantized model inherits the license of the base model: Apache 2.0.

See the original model card for more details.

## Acknowledgements

- Original model by the Qwen Team at Alibaba Cloud
- Quantization performed using llm-compressor
- Deployed with vLLM for efficient inference