This model is the AWQ (Activation-aware Weight Quantization) export of Qwen/Qwen3-VL-8B-Instruct.
It combines the inference speed of the AWQ kernel format with the accuracy of AutoRound: the weights were tuned for 800 steps so that the accuracy loss from 4-bit quantization is negligible. The vision tower is kept in full precision (FP16) to preserve performance on OCR and other visual tasks.
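To make the W4A16 scheme concrete, here is a minimal NumPy sketch of group-wise 4-bit weight quantization: weights are split into groups along the input dimension, each group gets its own scale and zero point, and activations stay in 16-bit float. This is an illustration of the general idea only, not the actual AutoRound/AWQ implementation (which additionally tunes rounding and scales against calibration data); the group size of 128 is a common default, assumed here for illustration.

```python
import numpy as np

def quantize_w4(w, group_size=128):
    """Group-wise asymmetric 4-bit quantization of a weight matrix (sketch)."""
    out_f, in_f = w.shape
    w_g = w.reshape(out_f, in_f // group_size, group_size)
    w_min = w_g.min(axis=-1, keepdims=True)
    w_max = w_g.max(axis=-1, keepdims=True)
    scale = (w_max - w_min) / 15.0            # 4 bits -> 16 levels (0..15)
    zero = np.round(-w_min / scale)           # per-group zero point
    q = np.clip(np.round(w_g / scale) + zero, 0, 15)
    return q.astype(np.uint8), scale, zero

def dequantize(q, scale, zero):
    """Recover approximate FP weights from the packed representation."""
    return ((q.astype(np.float32) - zero) * scale).reshape(q.shape[0], -1)

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 256)).astype(np.float32)
q, scale, zero = quantize_w4(w, group_size=128)
w_hat = dequantize(q, scale, zero)
err = np.abs(w - w_hat).max()   # bounded by half a quantization step per group
```

Because each group stores only a scale and zero point alongside 4-bit codes, the weight memory drops to roughly a quarter of FP16 while the reconstruction error stays within half a step of the per-group scale.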
## Quickstart

The model can be served with vLLM or loaded directly with Transformers.

### vLLM

```shell
pip install vllm
```

```python
from vllm import LLM, SamplingParams

model_id = "Vishva007/Qwen3-VL-8B-Instruct-W4A16-AutoRound-AWQ"

llm = LLM(
    model=model_id,
    quantization="awq",
    trust_remote_code=True,
    max_model_len=4096,
)
# ... (standard vLLM inference code)
```
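Beyond offline `LLM(...)` usage, the same checkpoint can be exposed through vLLM's OpenAI-compatible server; the flags below mirror the Python arguments above (adjust `--max-model-len` to your hardware):

```shell
# Launch an OpenAI-compatible endpoint on the default port 8000
vllm serve Vishva007/Qwen3-VL-8B-Instruct-W4A16-AutoRound-AWQ \
    --quantization awq \
    --trust-remote-code \
    --max-model-len 4096
```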
### Transformers

```shell
pip install autoawq transformers qwen-vl-utils
```

```python
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor
from qwen_vl_utils import process_vision_info

model_id = "Vishva007/Qwen3-VL-8B-Instruct-W4A16-AutoRound-AWQ"

# Load with Flash Attention 2 for best performance
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
    attn_implementation="flash_attention_2",
)
processor = AutoProcessor.from_pretrained(model_id)

# Inference example
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"},
            {"type": "text", "text": "What does this image show?"},
        ],
    }
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=128)
# Strip the prompt tokens so only the model's reply is decoded
generated_ids_trimmed = [
    out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)
]
print(processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True)[0])
```
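For intuition about what `apply_chat_template` does with the structured `messages` list, here is a deliberately simplified stand-in: it flattens roles and content items into one prompt string with role markers and an image placeholder. The marker strings below are hypothetical, chosen for illustration; the real Qwen chat template uses its own special tokens and formatting.

```python
def render_chat(messages, add_generation_prompt=True):
    """Toy illustration of chat templating: flatten messages into a prompt string."""
    parts = []
    for msg in messages:
        # Image items become a placeholder token; text items pass through.
        body = "".join(
            "<|image|>" if item["type"] == "image" else item["text"]
            for item in msg["content"]
        )
        parts.append(f"<|{msg['role']}|>\n{body}")
    if add_generation_prompt:
        # Open the assistant turn so the model continues from here.
        parts.append("<|assistant|>\n")
    return "\n".join(parts)

messages = [
    {"role": "user",
     "content": [{"type": "image", "image": "demo.jpeg"},
                 {"type": "text", "text": "What does this image show?"}]}
]
prompt = render_chat(messages)
```

The processor later replaces each image placeholder with the vision tokens produced from the pixel inputs, which is why text and images must be passed to `processor(...)` together.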
## Citation

```bibtex
@misc{qwen3technicalreport,
  title={Qwen3 Technical Report},
  author={Qwen Team},
  year={2025}
}
```
## Base model

Qwen/Qwen3-VL-8B-Instruct