Qwen3VL-8B-Instruct-FP8

This is an FP8-quantized version of Qwen3-VL-8B-Instruct, a powerful vision-language model for multimodal understanding and generation tasks.

Model Details

Base Model

  • Base Model: Qwen/Qwen3-VL-8B-Instruct
  • Architecture: Qwen3VLForConditionalGeneration
  • Model Type: Vision-Language Model (VLM)

Quantization Details

  • Quantization Method: FP8_DYNAMIC with SmoothQuant
  • Quantization Tool: llmcompressor
  • Smoothing Strength: 0.8
  • Calibration Dataset: lmms-lab/flickr30k (512 samples from test split)
  • Max Sequence Length: 32,768 tokens

Quantization Configuration

  • Weight Quantization: FP8 (8-bit floating point)
    • Strategy: Channel-wise
    • Observer: MinMax
    • Symmetric: True
  • Activation Quantization: FP8 Dynamic
    • Strategy: Token-wise
    • Dynamic scaling: Enabled
    • Symmetric: True
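A minimal NumPy sketch of what these settings mean in practice (hypothetical tensors; real FP8 casting is done by the inference kernels and also rounds the mantissa, which this sketch omits — here quantization is approximated by clipping to the E4M3 maximum of 448):

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

def channelwise_weight_scales(w):
    """Symmetric MinMax observer: one scale per output channel (row)."""
    amax = np.abs(w).max(axis=1, keepdims=True)
    return amax / FP8_E4M3_MAX

def tokenwise_activation_scales(x):
    """Dynamic per-token scales, computed on the fly at inference time."""
    amax = np.abs(x).max(axis=-1, keepdims=True)
    return amax / FP8_E4M3_MAX

def fake_quant(t, scale):
    """Scale into the FP8 range, clip, and rescale (mantissa rounding omitted)."""
    q = np.clip(t / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    return q * scale

# Hypothetical weight and activation tensors for illustration
w = np.random.randn(4, 8).astype(np.float32)   # (out_channels, in_channels)
x = np.random.randn(2, 8).astype(np.float32)   # (tokens, hidden)
w_q = fake_quant(w, channelwise_weight_scales(w))
x_q = fake_quant(x, tokenwise_activation_scales(x))
```

Because the scales are symmetric and derived from the max absolute value, no value is clipped here; the real precision loss comes from the FP8 mantissa rounding that the kernels apply.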

Excluded Modules

The following modules were excluded from quantization to maintain model quality:

  • lm_head (language model head)
  • Visual encoder modules (model.visual.*)
  • MLP gate projections (.*mlp.gate$)
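The exclusion patterns above are regular expressions matched against module names. A quick sketch (with hypothetical module names, since the exact layer naming is not shown in this card) illustrates which layers they catch:

```python
import re

# Exclusion patterns as listed in the quantization config
ignore_patterns = [r"lm_head", r"model.visual.*", r".*mlp.gate$"]

def is_excluded(name):
    """Return True if a module name matches any exclusion pattern."""
    return any(re.fullmatch(p, name) for p in ignore_patterns)

# Hypothetical module names for illustration
assert is_excluded("lm_head")
assert is_excluded("model.visual.blocks.0.attn.qkv")
assert is_excluded("model.language_model.layers.3.mlp.gate")
# The trailing $ keeps ordinary MLP projections quantized
assert not is_excluded("model.language_model.layers.3.mlp.gate_proj")
assert not is_excluded("model.language_model.layers.3.self_attn.q_proj")
```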

Model Use

Installation

pip install transformers torch accelerate qwen-vl-utils pillow

Basic Usage

from transformers import AutoProcessor, Qwen3VLForConditionalGeneration
from PIL import Image
import requests

# Load model and processor
model = Qwen3VLForConditionalGeneration.from_pretrained(
    "JEILDLWLRMA/Qwen3VL-8B-Instruct-FP8",
    torch_dtype="auto",
    device_map="auto"
)
processor = AutoProcessor.from_pretrained("JEILDLWLRMA/Qwen3VL-8B-Instruct-FP8")

# Prepare inputs
image_url = "http://images.cocodataset.org/train2017/000000231895.jpg"
image = Image.open(requests.get(image_url, stream=True).raw)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image_url},
            {"type": "text", "text": "What does the image show?"},
        ],
    }
]

# Process and generate
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(
    text=[prompt],
    images=[image],
    padding=False,
    return_tensors="pt",
).to(model.device)

# Sampling parameters such as temperature require do_sample=True
output = model.generate(**inputs, max_new_tokens=100, do_sample=True, temperature=0.7)

# Decode only the newly generated tokens, skipping the prompt
generated_text = processor.decode(
    output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(generated_text)

Using with vLLM

For faster inference, you can use this model with vLLM:

from vllm import LLM, SamplingParams
from PIL import Image
import base64
from io import BytesIO

# Initialize vLLM engine
llm = LLM(
    model="JEILDLWLRMA/Qwen3VL-8B-Instruct-FP8",
    max_model_len=8192,
    limit_mm_per_prompt={"image": 1, "video": 0},
    trust_remote_code=True,
)

# Encode the image as a base64 data URL
image = Image.open("path/to/image.jpg")
buffered = BytesIO()
image.save(buffered, format="PNG")
img_str = base64.b64encode(buffered.getvalue()).decode()

# llm.chat expects OpenAI-style messages, so images are passed as image_url parts
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{img_str}"}},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Generate
outputs = llm.chat(messages, sampling_params=SamplingParams(max_tokens=100))
print(outputs[0].outputs[0].text)

Performance

Memory Benefits

  • Reduced Memory Footprint: FP8 weights take 1 byte per parameter versus 2 bytes for BF16, roughly halving weight memory relative to the full-precision model
  • Faster Inference: Lower precision enables faster computation on GPUs with native FP8 support
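A back-of-envelope estimate of the weight memory involved (illustrative only; it ignores the excluded modules kept in BF16, the KV cache, and activation memory):

```python
# Rough weight-memory estimate for an ~8B-parameter model
params = 8e9                   # ~8 billion parameters
bf16_gb = params * 2 / 1e9     # 2 bytes per parameter in BF16
fp8_gb = params * 1 / 1e9      # 1 byte per parameter in FP8
print(f"BF16: ~{bf16_gb:.0f} GB, FP8: ~{fp8_gb:.0f} GB")
```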

Quality

This quantized model maintains high quality for vision-language tasks while significantly reducing memory usage. The SmoothQuant technique helps preserve model accuracy during quantization.
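The core idea of SmoothQuant is an algebraic identity: per-channel scales migrate quantization difficulty from activations (which have outliers) into weights, without changing the layer's output. A NumPy sketch using the standard SmoothQuant scale formula with the smoothing strength of 0.8 from this card (tensor shapes are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 4))   # activations: (tokens, channels)
w = rng.normal(size=(4, 3))   # weights: (channels, out_features)

alpha = 0.8                   # smoothing strength used for this model
# Standard SmoothQuant per-channel scale: s_j = max|x_j|^a / max|w_j|^(1-a)
s = (np.abs(x).max(axis=0) ** alpha) / (np.abs(w).max(axis=1) ** (1 - alpha))

x_smooth = x / s              # activation outliers shrink
w_smooth = w * s[:, None]     # weights absorb the scale

# The matrix product is mathematically unchanged; only the value ranges move
assert np.allclose(x @ w, x_smooth @ w_smooth)
```

Because the smoothed activations have a narrower range, the dynamic token-wise FP8 quantization loses less information on them.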

Training Details

Quantization Process

  1. Calibration: Used 512 samples from the flickr30k test dataset
  2. SmoothQuant: Applied with smoothing strength of 0.8 to improve quantization quality
  3. Sequential Processing: Applied quantization sequentially to Qwen3VLTextDecoderLayer modules
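The steps above correspond to a one-shot llmcompressor run along these lines. This is a sketch, not the author's verbatim script: the modifier and `oneshot` arguments follow llmcompressor's public API, but the dataset wiring and preprocessing are assumptions.

```python
# Sketch of an llmcompressor recipe matching the steps above (assumed, not verbatim)
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier

recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),
    QuantizationModifier(
        targets="Linear",
        scheme="FP8_DYNAMIC",
        ignore=["lm_head", "re:model.visual.*", "re:.*mlp.gate$"],
    ),
]

oneshot(
    model="Qwen/Qwen3-VL-8B-Instruct",
    dataset="lmms-lab/flickr30k",   # calibration data; loading details simplified
    recipe=recipe,
    max_seq_length=32768,
    num_calibration_samples=512,
)
```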

Hardware

  • Quantization was performed on NVIDIA GPUs with CUDA support

Limitations

  • This is a quantized model, so there may be slight quality degradation compared to the full-precision base model
  • Efficient FP8 inference requires hardware with native FP8 support (e.g., NVIDIA Hopper GPUs such as the H100, or Ada Lovelace GPUs); Ampere GPUs such as the A100 lack FP8 tensor cores
  • Maximum sequence length is limited to 32,768 tokens

Citation

If you use this model, please cite the original Qwen3-VL model:

@article{qwen3vl,
  title={Qwen3-VL: A Versatile Vision-Language Model},
  author={Qwen Team},
  journal={arXiv preprint},
  year={2024}
}

License

This model inherits the license from the base model Qwen/Qwen3-VL-8B-Instruct. Please refer to the original model's license for details.
