Qwen3VL-8B-Instruct-FP8

This is an FP8-quantized version of Qwen3-VL-8B-Instruct, a powerful vision-language model for multimodal understanding and generation tasks.

Model Details

Base Model

  • Base Model: Qwen/Qwen3-VL-8B-Instruct
  • Architecture: Qwen3VLForConditionalGeneration
  • Model Type: Vision-Language Model (VLM)

Quantization Details

  • Quantization Method: FP8_DYNAMIC with SmoothQuant
  • Quantization Tool: llmcompressor
  • Smoothing Strength: 0.8
  • Calibration Dataset: lmms-lab/flickr30k (512 samples from test split)
  • Max Sequence Length: 32,768 tokens

Quantization Configuration

  • Weight Quantization: FP8 (8-bit floating point)
    • Strategy: Channel-wise
    • Observer: MinMax
    • Symmetric: True
  • Activation Quantization: FP8 Dynamic
    • Strategy: Token-wise
    • Dynamic scaling: Enabled
    • Symmetric: True
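A minimal NumPy sketch of what these settings mean in practice (hypothetical tensors; real FP8 casting is done by the inference kernels and also rounds the mantissa, which this sketch omits — here quantization is approximated by clipping to the E4M3 maximum of 448):

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

def channelwise_weight_scales(w):
    """Symmetric MinMax observer: one scale per output channel (row)."""
    amax = np.abs(w).max(axis=1, keepdims=True)
    return amax / FP8_E4M3_MAX

def tokenwise_activation_scales(x):
    """Dynamic per-token scales, computed on the fly at inference time."""
    amax = np.abs(x).max(axis=-1, keepdims=True)
    return amax / FP8_E4M3_MAX

def fake_quant(t, scale):
    """Scale into the FP8 range, clip, and rescale (mantissa rounding omitted)."""
    q = np.clip(t / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    return q * scale

# Hypothetical weight and activation tensors for illustration
w = np.random.randn(4, 8).astype(np.float32)   # (out_channels, in_channels)
x = np.random.randn(2, 8).astype(np.float32)   # (tokens, hidden)
w_q = fake_quant(w, channelwise_weight_scales(w))
x_q = fake_quant(x, tokenwise_activation_scales(x))
```

Because the scales are symmetric and derived from the max absolute value, no value is clipped here; the real precision loss comes from the FP8 mantissa rounding that the kernels apply.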

Excluded Modules

The following modules were excluded from quantization to maintain model quality:

  • lm_head (language model head)
  • Visual encoder modules (model.visual.*)
  • MLP gate projections (.*mlp.gate$)
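The exclusion patterns above are regular expressions matched against module names. A quick sketch (with hypothetical module names, since the exact layer naming is not shown in this card) illustrates which layers they catch:

```python
import re

# Exclusion patterns as listed in the quantization config
ignore_patterns = [r"lm_head", r"model.visual.*", r".*mlp.gate$"]

def is_excluded(name):
    """Return True if a module name matches any exclusion pattern."""
    return any(re.fullmatch(p, name) for p in ignore_patterns)

# Hypothetical module names for illustration
assert is_excluded("lm_head")
assert is_excluded("model.visual.blocks.0.attn.qkv")
assert is_excluded("model.language_model.layers.3.mlp.gate")
# The trailing $ keeps ordinary MLP projections quantized
assert not is_excluded("model.language_model.layers.3.mlp.gate_proj")
assert not is_excluded("model.language_model.layers.3.self_attn.q_proj")
```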

Model Use

Installation

pip install transformers torch accelerate qwen-vl-utils pillow

Basic Usage

from transformers import AutoProcessor, Qwen3VLForConditionalGeneration
from PIL import Image
import requests

# Load model and processor
model = Qwen3VLForConditionalGeneration.from_pretrained(
    "JEILDLWLRMA/Qwen3VL-8B-Instruct-FP8",
    torch_dtype="auto",
    device_map="auto"
)
processor = AutoProcessor.from_pretrained("JEILDLWLRMA/Qwen3VL-8B-Instruct-FP8")

# Prepare inputs
image_url = "http://images.cocodataset.org/train2017/000000231895.jpg"
image = Image.open(requests.get(image_url, stream=True).raw)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image_url},
            {"type": "text", "text": "What does the image show?"},
        ],
    }
]

# Process and generate
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(
    text=[prompt],
    images=[image],
    padding=False,
    return_tensors="pt",
).to(model.device)

# Sampling parameters such as temperature require do_sample=True
output = model.generate(**inputs, max_new_tokens=100, do_sample=True, temperature=0.7)

# Decode only the newly generated tokens, skipping the prompt
generated_text = processor.decode(
    output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(generated_text)

Using with vLLM

For faster inference, you can use this model with vLLM:

from vllm import LLM, SamplingParams
from PIL import Image
import base64
from io import BytesIO

# Initialize vLLM engine
llm = LLM(
    model="JEILDLWLRMA/Qwen3VL-8B-Instruct-FP8",
    max_model_len=8192,
    limit_mm_per_prompt={"image": 1, "video": 0},
    trust_remote_code=True,
)

# Encode the image as a base64 data URL
image = Image.open("path/to/image.jpg")
buffered = BytesIO()
image.save(buffered, format="PNG")
img_str = base64.b64encode(buffered.getvalue()).decode()

# llm.chat expects OpenAI-style messages, so images are passed as image_url parts
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{img_str}"}},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Generate
outputs = llm.chat(messages, sampling_params=SamplingParams(max_tokens=100))
print(outputs[0].outputs[0].text)

Performance

Memory Benefits

  • Reduced Memory Footprint: FP8 weights take 1 byte per parameter versus 2 bytes for BF16, roughly halving weight memory relative to the full-precision model
  • Faster Inference: Lower precision enables faster computation on GPUs with native FP8 support
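A back-of-envelope estimate of the weight memory involved (illustrative only; it ignores the excluded modules kept in BF16, the KV cache, and activation memory):

```python
# Rough weight-memory estimate for an ~8B-parameter model
params = 8e9                   # ~8 billion parameters
bf16_gb = params * 2 / 1e9     # 2 bytes per parameter in BF16
fp8_gb = params * 1 / 1e9      # 1 byte per parameter in FP8
print(f"BF16: ~{bf16_gb:.0f} GB, FP8: ~{fp8_gb:.0f} GB")
```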

Quality

This quantized model maintains high quality for vision-language tasks while significantly reducing memory usage. The SmoothQuant technique helps preserve model accuracy during quantization.
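The core idea of SmoothQuant is an algebraic identity: per-channel scales migrate quantization difficulty from activations (which have outliers) into weights, without changing the layer's output. A NumPy sketch using the standard SmoothQuant scale formula with the smoothing strength of 0.8 from this card (tensor shapes are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 4))   # activations: (tokens, channels)
w = rng.normal(size=(4, 3))   # weights: (channels, out_features)

alpha = 0.8                   # smoothing strength used for this model
# Standard SmoothQuant per-channel scale: s_j = max|x_j|^a / max|w_j|^(1-a)
s = (np.abs(x).max(axis=0) ** alpha) / (np.abs(w).max(axis=1) ** (1 - alpha))

x_smooth = x / s              # activation outliers shrink
w_smooth = w * s[:, None]     # weights absorb the scale

# The matrix product is mathematically unchanged; only the value ranges move
assert np.allclose(x @ w, x_smooth @ w_smooth)
```

Because the smoothed activations have a narrower range, the dynamic token-wise FP8 quantization loses less information on them.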

Training Details

Quantization Process

  1. Calibration: Used 512 samples from the flickr30k test dataset
  2. SmoothQuant: Applied with smoothing strength of 0.8 to improve quantization quality
  3. Sequential Processing: Applied quantization sequentially to Qwen3VLTextDecoderLayer modules
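The steps above correspond to a one-shot llmcompressor run along these lines. This is a sketch, not the author's verbatim script: the modifier and `oneshot` arguments follow llmcompressor's public API, but the dataset wiring and preprocessing are assumptions.

```python
# Sketch of an llmcompressor recipe matching the steps above (assumed, not verbatim)
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier

recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),
    QuantizationModifier(
        targets="Linear",
        scheme="FP8_DYNAMIC",
        ignore=["lm_head", "re:model.visual.*", "re:.*mlp.gate$"],
    ),
]

oneshot(
    model="Qwen/Qwen3-VL-8B-Instruct",
    dataset="lmms-lab/flickr30k",   # calibration data; loading details simplified
    recipe=recipe,
    max_seq_length=32768,
    num_calibration_samples=512,
)
```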

Hardware

  • Quantization was performed on NVIDIA GPUs with CUDA support

Limitations

  • This is a quantized model, so there may be slight quality degradation compared to the full-precision base model
  • Efficient FP8 inference requires hardware with native FP8 support (e.g., NVIDIA Hopper GPUs such as the H100, or Ada Lovelace GPUs); Ampere GPUs such as the A100 lack FP8 tensor cores
  • Maximum sequence length is limited to 32,768 tokens

Citation

If you use this model, please cite the original Qwen3-VL model:

@article{qwen3vl,
  title={Qwen3-VL: A Versatile Vision-Language Model},
  author={Qwen Team},
  journal={arXiv preprint},
  year={2024}
}

License

This model inherits the license from the base model Qwen/Qwen3-VL-8B-Instruct. Please refer to the original model's license for details.
