Qwen3VL-8B-Instruct-FP8
This is an FP8-quantized version of Qwen3-VL-8B-Instruct, a vision-language model for multimodal understanding and generation tasks.
Model Details
Base Model
- Base Model: Qwen/Qwen3-VL-8B-Instruct
- Architecture: Qwen3VLForConditionalGeneration
- Model Type: Vision-Language Model (VLM)
Quantization Details
- Quantization Method: FP8_DYNAMIC with SmoothQuant
- Quantization Tool: llmcompressor
- Smoothing Strength: 0.8
- Calibration Dataset: lmms-lab/flickr30k (512 samples from the test split)
- Max Sequence Length: 32,768 tokens
Quantization Configuration
- Weight Quantization: FP8 (8-bit floating point)
- Strategy: Channel-wise
- Observer: MinMax
- Symmetric: True
- Activation Quantization: FP8 Dynamic
- Strategy: Token-wise
- Dynamic scaling: Enabled
- Symmetric: True
Excluded Modules
The following modules were excluded from quantization to maintain model quality:
- lm_head (language model head)
- Visual encoder modules (model.visual.*)
- MLP gate projections (.*mlp.gate$)
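The settings above can be expressed as an llmcompressor-style recipe. The following is an illustrative sketch only: the modifier names and fields follow llmcompressor's recipe format, but the exact recipe used to produce this checkpoint is not published, so treat field names and values as assumptions.

```yaml
# Hypothetical reconstruction of the quantization recipe (not the published original)
quant_stage:
  quant_modifiers:
    SmoothQuantModifier:
      smoothing_strength: 0.8
    QuantizationModifier:
      targets: ["Linear"]
      scheme: "FP8_DYNAMIC"
      ignore: ["lm_head", "re:model.visual.*", "re:.*mlp.gate$"]
```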
Model Use
Installation
pip install transformers torch qwen-vl-utils pillow
Basic Usage
from transformers import AutoProcessor, Qwen3VLForConditionalGeneration
from PIL import Image
import requests
# Load model and processor
model = Qwen3VLForConditionalGeneration.from_pretrained(
    "JEILDLWLRMA/Qwen3VL-8B-Instruct-FP8",
    torch_dtype="auto",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("JEILDLWLRMA/Qwen3VL-8B-Instruct-FP8")
# Prepare inputs
image_url = "http://images.cocodataset.org/train2017/000000231895.jpg"
image = Image.open(requests.get(image_url, stream=True).raw)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image_url},
            {"type": "text", "text": "What does the image show?"},
        ],
    }
]
# Process and generate
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(
    text=[prompt],
    images=[image],
    padding=False,
    return_tensors="pt",
).to(model.device)
# temperature only takes effect with sampling enabled
output = model.generate(**inputs, max_new_tokens=100, do_sample=True, temperature=0.7)
# Decode only the newly generated tokens, not the echoed prompt
generated_text = processor.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(generated_text)
Using with vLLM
For faster inference, you can use this model with vLLM:
from vllm import LLM, SamplingParams
from PIL import Image
import base64
from io import BytesIO
# Initialize vLLM engine
llm = LLM(
    model="JEILDLWLRMA/Qwen3VL-8B-Instruct-FP8",
    max_model_len=8192,
    limit_mm_per_prompt={"image": 1, "video": 0},
    trust_remote_code=True,
)
# Prepare image as a base64 data URL
image = Image.open("path/to/image.jpg")
buffered = BytesIO()
image.save(buffered, format="PNG")
img_str = base64.b64encode(buffered.getvalue()).decode()
# Generate
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": f"data:image/png;base64,{img_str}"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
outputs = llm.chat(messages, SamplingParams(max_tokens=100))
print(outputs[0].outputs[0].text)
Performance
Memory Benefits
- Reduced Memory Footprint: FP8 weights roughly halve model size and memory requirements compared to the 16-bit full-precision model
- Faster Inference: Lower precision enables faster computation on modern GPUs with FP8 support
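As a back-of-envelope illustration of the weight-memory saving (weights only; activations, KV cache, and the unquantized vision tower are ignored, and 8B is a rounded parameter count):

```python
def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Approximate weight-only memory footprint in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9

PARAMS = 8e9  # rough parameter count of an 8B model

bf16_gb = weight_memory_gb(PARAMS, 2.0)  # 16-bit baseline
fp8_gb = weight_memory_gb(PARAMS, 1.0)   # FP8 weights
print(f"BF16 ~{bf16_gb:.0f} GB, FP8 ~{fp8_gb:.0f} GB ({1 - fp8_gb / bf16_gb:.0%} smaller)")
# → BF16 ~16 GB, FP8 ~8 GB (50% smaller)
```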
Quality
This quantized model maintains high quality for vision-language tasks while significantly reducing memory usage. The SmoothQuant technique helps preserve model accuracy during quantization.
Training Details
Quantization Process
- Calibration: Used 512 samples from the flickr30k test dataset
- SmoothQuant: Applied with smoothing strength of 0.8 to improve quantization quality
- Sequential Processing: Applied quantization sequentially to Qwen3VLTextDecoderLayer modules
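The SmoothQuant idea behind the smoothing strength can be illustrated with a toy, pure-Python linear layer: a per-channel scale s_j = max|X_j|^alpha / max|W_j|^(1-alpha) divides the activations and multiplies the matching weight rows, which flattens activation outliers while leaving the layer's output mathematically unchanged. The numbers below are made up for illustration, not the model's real tensors:

```python
# Toy activations X (2 tokens x 3 channels) and weights W (3 x 2)
X = [[0.5, 40.0, -0.3],
     [-0.2, -55.0, 0.8]]   # channel 1 carries large outliers
W = [[0.6, -0.4],
     [0.02, 0.03],
     [-0.5, 0.7]]

alpha = 0.8  # smoothing strength used for this model

def colmax(m, j):  # max absolute value in column j
    return max(abs(row[j]) for row in m)

def rowmax(m, i):  # max absolute value in row i
    return max(abs(v) for v in m[i])

# Per-channel scales that migrate quantization difficulty from X into W
s = [colmax(X, j) ** alpha / rowmax(W, j) ** (1 - alpha) for j in range(3)]

X_s = [[x / s[j] for j, x in enumerate(row)] for row in X]    # smoother activations
W_s = [[w * s[i] for w in row] for i, row in enumerate(W)]    # rescaled weights

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

# The product is unchanged: X @ W == (X / s) @ (diag(s) W)
orig, smoothed = matmul(X, W), matmul(X_s, W_s)
assert all(abs(o - m) < 1e-9 for ro, rm in zip(orig, smoothed) for o, m in zip(ro, rm))
```

After smoothing, the outlier channel's activation range shrinks (its scale is absorbed into the weights), which is what makes per-token FP8 activation quantization better behaved.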
Hardware
- Quantization was performed on NVIDIA GPUs with CUDA support
Limitations
- This is a quantized model, so there may be slight quality degradation compared to the full-precision base model
- Native FP8 acceleration requires compatible hardware (e.g., NVIDIA Hopper GPUs such as the H100, or Ada Lovelace GPUs such as the L40S)
- Quantization calibration used sequences up to 32,768 tokens; refer to the base model for the supported context length
Citation
If you use this model, please cite the original Qwen3-VL model:
@article{qwen3vl,
title={Qwen3-VL: A Versatile Vision-Language Model},
author={Qwen Team},
journal={arXiv preprint},
year={2024}
}
License
This model inherits the license from the base model Qwen/Qwen3-VL-8B-Instruct. Please refer to the original model's license for details.
Acknowledgments
- Base model: Qwen Team
- Quantization tool: llmcompressor by vLLM Project
- Calibration dataset: flickr30k