Qwen2.5-VL-7B-Instruct W8A8 Quantized
This is an INT8-quantized version of Qwen/Qwen2.5-VL-7B-Instruct using the W8A8 quantization scheme (8-bit weights, 8-bit activations).
Model Details
- Base Model: Qwen/Qwen2.5-VL-7B-Instruct
- Quantization Method: W8A8 INT8 (Weights: INT8, Activations: INT8 Dynamic)
- Quantization Library: llm-compressor
- Format: compressed-tensors
- Compatible Runtime: vLLM
Quantization Details
- Weight Quantization: Static per-channel symmetric INT8
- Activation Quantization: Dynamic per-token symmetric INT8
- Quantization Strategy: Token-wise for activations, channel-wise for weights
- Calibration Dataset: flickr30k (64 samples)
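The two schemes above differ in when the scales are computed: weight scales are fixed offline per output channel, while activation scales are computed at runtime per token. A minimal, self-contained sketch of symmetric INT8 quantization (illustrative only, not the llm-compressor implementation):

```python
# Illustrative sketch of symmetric INT8 quantization as described above.
# Not the llm-compressor implementation; plain Python for clarity.

def quantize_symmetric_int8(values):
    """Symmetric quantization: scale = max(|x|) / 127, no zero point."""
    scale = max(abs(v) for v in values) / 127 or 1.0
    q = [max(-128, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

# Per-channel weight quantization: one fixed scale per output channel (row),
# computed once offline.
weight = [[0.5, -1.0, 0.25], [2.0, 0.1, -2.0]]
w_q = [quantize_symmetric_int8(row) for row in weight]

# Per-token dynamic activation quantization: one scale per token (row),
# recomputed at inference time from that token's own activation range.
activations = [[0.3, -0.6, 0.9], [1.5, -0.2, 0.7]]
a_q = [quantize_symmetric_int8(tok) for tok in activations]
```

Because the activation scales are dynamic, no activation calibration statistics need to be stored; the calibration set is used to tune the weight quantization.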
Usage
With vLLM
```python
from vllm import LLM, SamplingParams

# Load the quantized model
llm = LLM(
    model="lsm0729/Qwen2.5-VL-7B-Instruct-quantized.w8a8",
    trust_remote_code=True,
    max_model_len=4096,
)

# Generate
prompts = ["Describe the advantages of INT8 quantization."]
sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=512)
outputs = llm.generate(prompts, sampling_params)
```
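Note that raw strings passed to `llm.generate` must already be in the model's chat-template format. Qwen2.5 models use a ChatML-style template; the sketch below builds one by hand for illustration (in practice, prefer `tokenizer.apply_chat_template`, which applies the model's own template, and the token markers below are an assumption if your serving setup overrides the template):

```python
# Hand-rolled ChatML-style prompt for Qwen2.5 models (illustrative sketch;
# prefer tokenizer.apply_chat_template in real code).
def build_prompt(user_text, system_text="You are a helpful assistant."):
    return (
        f"<|im_start|>system\n{system_text}<|im_end|>\n"
        f"<|im_start|>user\n{user_text}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )

prompts = [build_prompt("Describe the image in one sentence.")]
```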
Requirements
```bash
pip install "vllm>=0.14.0"
pip install qwen-vl-utils
```
Performance
This quantized model provides:
- ~2x reduction in weight memory compared to FP16
- Faster inference via INT8 compute kernels on supported hardware
- Minimal accuracy degradation relative to the FP16 baseline
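The ~2x figure follows directly from bytes per weight. A back-of-envelope estimate, using the nominal 7B parameter count (the model's actual total differs slightly):

```python
# Weight-memory estimate: 2 bytes per FP16 weight vs 1 byte per INT8 weight.
# Nominal 7B parameters; the model's actual count differs slightly.
params = 7e9
fp16_gb = params * 2 / 1e9
int8_gb = params * 1 / 1e9
print(fp16_gb, int8_gb, fp16_gb / int8_gb)  # 14.0 7.0 2.0
```

Activation memory is not halved in the same way, since activations are quantized dynamically at runtime; the dominant saving is in the stored weights.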
Citation
If you use this model, please cite the original Qwen2.5-VL paper and model:
```bibtex
@article{qwen2.5-vl,
  title={Qwen2.5-VL: Pushing the Limits of Visual Understanding},
  author={Qwen Team},
  year={2024}
}
```
License
This quantized model inherits the license from the base model: Apache 2.0
See the original model card for more details.
Acknowledgements
- Original model by Qwen Team at Alibaba Cloud
- Quantization performed using llm-compressor
- Deployed with vLLM for efficient inference