Qwen2.5-VL-3B-Instruct W8A8 Quantized

This is an INT8-quantized version of Qwen/Qwen2.5-VL-3B-Instruct, using the W8A8 quantization scheme (8-bit weights, 8-bit activations).

Model Details

Quantization Details

  • Weight Quantization: static, per-channel, symmetric INT8
  • Activation Quantization: dynamic, per-token, symmetric INT8 (scales computed at runtime)
  • Calibration Dataset: flickr30k (64 samples)
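The scheme above combines two granularities: weight scales are fixed ahead of time, one per output channel, while activation scales are recomputed per token at inference. A minimal NumPy sketch of the underlying math (illustrative only; the shapes and tolerances are assumptions, not the actual kernels used by vLLM or llm-compressor):

```python
import numpy as np

def quantize_weights_per_channel(w: np.ndarray):
    """Static per-channel symmetric INT8: one scale per output channel (row)."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0  # [out_ch, 1]
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def quantize_activations_per_token(x: np.ndarray):
    """Dynamic per-token symmetric INT8: one scale per token (row), found at runtime."""
    scale = np.abs(x).max(axis=1, keepdims=True) / 127.0  # [tokens, 1]
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

# Toy shapes for illustration (hypothetical, not the model's real dimensions).
rng = np.random.default_rng(0)
w = rng.normal(size=(4, 8)).astype(np.float32)   # [out_ch, in_features]
x = rng.normal(size=(2, 8)).astype(np.float32)   # [tokens, in_features]

qw, sw = quantize_weights_per_channel(w)
qx, sx = quantize_activations_per_token(x)

# INT8 matmul accumulates in int32; the two scale factors are applied afterwards.
y_int8 = (qx.astype(np.int32) @ qw.T.astype(np.int32)) * sx * sw.T
y_fp32 = x @ w.T
print(np.max(np.abs(y_int8 - y_fp32)))  # quantization error, small relative to y_fp32
```

Because the activation scales depend only on the current token's values, no activation calibration statistics need to be stored; the calibration set is used to derive the static weight scales and any smoothing transforms.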

Usage

With vLLM

from vllm import LLM, SamplingParams

# Load the quantized model
llm = LLM(
    model="lsm0729/Qwen2.5-VL-3B-Instruct-quantized.w8a8",
    trust_remote_code=True,
    max_model_len=4096,
)

# Generate (text-only example; `prompts` must be defined before calling generate)
prompts = ["Describe what makes a good photograph."]
sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=512)
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)

Requirements

pip install "vllm>=0.14.0"
pip install qwen-vl-utils

Performance

This quantized model provides:

  • ~2x memory reduction compared to FP16
  • Faster inference with INT8 compute kernels
  • Minimal accuracy degradation
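The ~2x memory figure follows directly from storage widths: each weight drops from 2 bytes (FP16/BF16) to 1 byte (INT8). A back-of-envelope check, assuming roughly 3.7 billion parameters (an approximation; the exact count is not stated here):

```python
# Rough weight-memory estimate; parameter count is an assumption for illustration.
params = 3.7e9
fp16_gb = params * 2 / 1e9   # 2 bytes per FP16/BF16 weight
int8_gb = params * 1 / 1e9   # 1 byte per INT8 weight
print(f"FP16: {fp16_gb:.1f} GB, INT8: {int8_gb:.1f} GB, "
      f"ratio: {fp16_gb / int8_gb:.0f}x")
```

Actual savings are somewhat lower than the ideal 2x because quantization scales and any layers kept in higher precision add overhead.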

Citation

If you use this model, please cite the original Qwen2.5-VL paper and model:

@article{qwen2.5-vl,
  title={Qwen2.5-VL: Pushing the Limits of Visual Understanding},
  author={Qwen Team},
  year={2024}
}

License

This quantized model inherits the license from the base model: Apache 2.0

See the original model card for more details.

Acknowledgements

  • Original model by Qwen Team at Alibaba Cloud
  • Quantization performed using llm-compressor
  • Deployed with vLLM for efficient inference