Qwen2.5-VL-3B-Instruct W8A8 Quantized

This is an INT8-quantized version of Qwen/Qwen2.5-VL-3B-Instruct, using the W8A8 quantization scheme (8-bit weights, 8-bit activations).

Model Details

Quantization Details

  • Weight Quantization: static, per-channel, symmetric INT8
  • Activation Quantization: dynamic, per-token, symmetric INT8 (scales computed at runtime)
  • Calibration Dataset: flickr30k (64 samples)
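The scheme above combines two granularities: weight scales are fixed ahead of time, one per output channel, while activation scales are recomputed per token at inference. A minimal NumPy sketch of the underlying math (illustrative only; the shapes and tolerances are assumptions, not the actual kernels used by vLLM or llm-compressor):

```python
import numpy as np

def quantize_weights_per_channel(w: np.ndarray):
    """Static per-channel symmetric INT8: one scale per output channel (row)."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0  # [out_ch, 1]
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def quantize_activations_per_token(x: np.ndarray):
    """Dynamic per-token symmetric INT8: one scale per token (row), found at runtime."""
    scale = np.abs(x).max(axis=1, keepdims=True) / 127.0  # [tokens, 1]
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

# Toy shapes for illustration (hypothetical, not the model's real dimensions).
rng = np.random.default_rng(0)
w = rng.normal(size=(4, 8)).astype(np.float32)   # [out_ch, in_features]
x = rng.normal(size=(2, 8)).astype(np.float32)   # [tokens, in_features]

qw, sw = quantize_weights_per_channel(w)
qx, sx = quantize_activations_per_token(x)

# INT8 matmul accumulates in int32; the two scale factors are applied afterwards.
y_int8 = (qx.astype(np.int32) @ qw.T.astype(np.int32)) * sx * sw.T
y_fp32 = x @ w.T
print(np.max(np.abs(y_int8 - y_fp32)))  # quantization error, small relative to y_fp32
```

Because the activation scales depend only on the current token's values, no activation calibration statistics need to be stored; the calibration set is used to derive the static weight scales and any smoothing transforms.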

Usage

With vLLM

from vllm import LLM, SamplingParams

# Load the quantized model
llm = LLM(
    model="lsm0729/Qwen2.5-VL-3B-Instruct-quantized.w8a8",
    trust_remote_code=True,
    max_model_len=4096,
)

# Generate (text-only example; `prompts` must be defined before calling generate)
prompts = ["Describe what makes a good photograph."]
sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=512)
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)

Requirements

pip install "vllm>=0.14.0"
pip install qwen-vl-utils

Performance

This quantized model provides:

  • ~2x memory reduction compared to FP16
  • Faster inference with INT8 compute kernels
  • Minimal accuracy degradation
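The ~2x memory figure follows directly from storage widths: each weight drops from 2 bytes (FP16/BF16) to 1 byte (INT8). A back-of-envelope check, assuming roughly 3.7 billion parameters (an approximation; the exact count is not stated here):

```python
# Rough weight-memory estimate; parameter count is an assumption for illustration.
params = 3.7e9
fp16_gb = params * 2 / 1e9   # 2 bytes per FP16/BF16 weight
int8_gb = params * 1 / 1e9   # 1 byte per INT8 weight
print(f"FP16: {fp16_gb:.1f} GB, INT8: {int8_gb:.1f} GB, "
      f"ratio: {fp16_gb / int8_gb:.0f}x")
```

Actual savings are somewhat lower than the ideal 2x because quantization scales and any layers kept in higher precision add overhead.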

Citation

If you use this model, please cite the original Qwen2.5-VL paper and model:

@article{qwen2.5-vl,
  title={Qwen2.5-VL: Pushing the Limits of Visual Understanding},
  author={Qwen Team},
  year={2024}
}

License

This quantized model inherits the license from the base model: Apache 2.0

See the original model card for more details.

Acknowledgements

  • Original model by Qwen Team at Alibaba Cloud
  • Quantization performed using llm-compressor
  • Deployed with vLLM for efficient inference