# Qwen3-VL-8B-Instruct-NVFP4-FP8-Dynamic
This is a quantized version of Qwen/Qwen3-VL-8B-Instruct using Selective Layer Quantization with NVFP4 and FP8_DYNAMIC schemes.
## Quantization Strategy
This model uses a hybrid quantization approach optimized for both performance and accuracy:
| Component | Scheme | Details |
|---|---|---|
| Attention Layers | FP8_DYNAMIC | W8A8, preserves precision for Q/K/V/O projections |
| MLP Layers | NVFP4 | W4A4, optimizes latency for gate/up/down projections |
| Vision Encoder | BF16 (unquantized) | Full precision for visual understanding |
| LM Head | BF16 (unquantized) | Full precision for output quality |
### SmoothQuant
Applied with strength 0.8 for activation smoothing before quantization:
- Q/K/V projections ← input_layernorm
- Gate/Up projections ← post_attention_layernorm
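The core idea behind SmoothQuant can be shown numerically: dividing activations by a per-channel scale and multiplying the corresponding weight rows by the same scale leaves the layer output unchanged, while shifting quantization difficulty from activations into weights. A toy sketch (random data, not llm-compressor's actual implementation):

```python
# Toy illustration of SmoothQuant's mathematical equivalence; shapes and data
# are made up, only the smoothing strength (0.8) comes from this model's recipe.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))    # activations (batch, hidden)
w = rng.normal(size=(8, 16))   # weights (hidden, out)

# Per-channel smoothing scale s_j = max|X_j|^alpha / max|W_j|^(1-alpha)
alpha = 0.8
s = np.abs(x).max(axis=0) ** alpha / np.abs(w).max(axis=1) ** (1 - alpha)

y_ref = x @ w                          # original output
y_smooth = (x / s) @ (w * s[:, None])  # smoothed activations, rescaled weights

assert np.allclose(y_ref, y_smooth)    # identical before quantization
```

Quantization is then applied to the smoothed tensors, whose activation outliers are flattened.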
## Model Details
- Base Model: Qwen/Qwen3-VL-8B-Instruct
- Quantization Method: compressed-tensors (llm-compressor)
- Model Size: ~7.7 GB (reduced from ~17 GB, ~55% compression)
- Config Groups: FP8_DYNAMIC (attention) + NVFP4 (MLP)
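The quoted compression ratio follows directly from the two size figures above; a quick back-of-envelope check:

```python
# Sanity check of the size reduction quoted above, using the card's own
# figures (~17 GB BF16 baseline, ~7.7 GB quantized).
bf16_gb = 17.0
quant_gb = 7.7

reduction = 1 - quant_gb / bf16_gb
print(f"{reduction:.0%}")  # -> 55%
```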
## Usage with vLLM

```bash
vllm serve JEILDLWLRMA/Qwen3-VL-8B-Instruct-NVFP4-FP8-Dynamic \
  --quantization compressed-tensors \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.9 \
  --enforce-eager \
  --max-model-len 8192
```
```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="JEILDLWLRMA/Qwen3-VL-8B-Instruct-NVFP4-FP8-Dynamic",
    quantization="compressed-tensors",
    trust_remote_code=True,
    kv_cache_dtype="fp8",
    max_model_len=8192,
)

sampling_params = SamplingParams(temperature=0.7, top_p=0.95)
prompts = ["Your prompt here"]
outputs = llm.generate(prompts, sampling_params)
print(outputs[0].outputs[0].text)
```
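When served with `vllm serve`, the model is also reachable through vLLM's OpenAI-compatible HTTP endpoint. A sketch of a multimodal chat request body (the image URL is a placeholder; sending it requires a running server, so only the payload is built here):

```python
# Build an OpenAI-compatible /v1/chat/completions request body for the served
# model. The image URL below is a placeholder, not a real asset.
import json

payload = {
    "model": "JEILDLWLRMA/Qwen3-VL-8B-Instruct-NVFP4-FP8-Dynamic",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/image.png"}},
                {"type": "text", "text": "Describe this image."},
            ],
        }
    ],
    "max_tokens": 256,
}

body = json.dumps(payload)  # POST this to http://localhost:8000/v1/chat/completions
```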
## Usage with Transformers

```python
import torch
from transformers import AutoProcessor, Qwen3VLForConditionalGeneration

model_id = "JEILDLWLRMA/Qwen3-VL-8B-Instruct-NVFP4-FP8-Dynamic"
processor = AutoProcessor.from_pretrained(model_id)
model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
```
## Quantization Config

### Target Patterns (vLLM fused-layer compatible)
- Attention (FP8_DYNAMIC): `self_attn.(q_proj|k_proj|v_proj|o_proj|qkv_proj|qkv)`
- MLP (NVFP4): `mlp.(gate_proj|up_proj|down_proj|gate_up_proj)`
- Ignored: Vision encoder (`visual.*`), output head (`lm_head`), MoE gates (`mlp.gate`)
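The patterns above act as regex-style selectors over module names. A sketch of how they partition typical decoder modules (the module paths are illustrative of Qwen-style naming, not read from the checkpoint):

```python
# Classify illustrative module names with the target patterns listed above.
import re

ATTN = re.compile(r"self_attn\.(q_proj|k_proj|v_proj|o_proj|qkv_proj|qkv)$")
MLP = re.compile(r"mlp\.(gate_proj|up_proj|down_proj|gate_up_proj)$")

modules = [
    "model.layers.0.self_attn.q_proj",  # matches ATTN -> FP8_DYNAMIC
    "model.layers.0.mlp.gate_proj",     # matches MLP  -> NVFP4
    "model.layers.0.mlp.gate",          # MoE gate: no suffix match -> ignored
    "lm_head",                          # ignored
]

for name in modules:
    if ATTN.search(name):
        scheme = "FP8_DYNAMIC"
    elif MLP.search(name):
        scheme = "NVFP4"
    else:
        scheme = "ignored (BF16)"
    print(f"{name}: {scheme}")
```

Note that `mlp.gate` (a bare MoE router, where present) does not match the `gate_proj` pattern, which is why it stays unquantized.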
### Recipe Structure (recipe.yaml)
- SmoothQuantModifier: Activation smoothing (strength=0.8)
- QuantizationModifier (attn): FP8_DYNAMIC for attention layers
- QuantizationModifier (mlp): NVFP4 for MLP layers
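The three modifiers above roughly correspond to a recipe laid out like the following. This is a hand-written sketch: the stage/field names follow common llm-compressor recipe conventions, but the authoritative version is the `recipe.yaml` shipped in this repository.

```yaml
# Illustrative sketch only -- consult the repository's recipe.yaml for the
# exact schema, target regexes, and scheme definitions.
quant_stage:
  quant_modifiers:
    SmoothQuantModifier:
      smoothing_strength: 0.8
    QuantizationModifier:
      ignore: ["lm_head", "re:visual.*"]
      config_groups:
        group_attn:   # FP8_DYNAMIC on attention projections
          targets: ["re:.*self_attn.*proj.*"]
        group_mlp:    # NVFP4 on MLP projections
          targets: ["re:.*mlp\\.(gate_proj|up_proj|down_proj|gate_up_proj)$"]
```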
## Performance Notes
- Model loads in ~4 seconds, uses ~8 GiB VRAM
- On a 24 GB GPU, ~10 GiB remains available for KV cache (~159K tokens)
- GPUs without native FP4/FP8 support fall back to Marlin kernels
- Requires vLLM 0.11+ with compressed-tensors support
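The KV-cache capacity can be sanity-checked with simple arithmetic. Assuming Qwen3-class attention dimensions (36 layers, 8 KV heads, head_dim 128; these are assumptions, verify against the model's `config.json`) and 1 byte per value with an fp8 KV cache, the estimate lands in the same ballpark as the figure quoted above:

```python
# Rough KV-cache capacity estimate. Layer/head counts are assumptions for a
# Qwen3-class 8B model; check config.json for the actual values.
layers, kv_heads, head_dim = 36, 8, 128
bytes_per_value = 1  # fp8 KV cache

# K and V each store kv_heads * head_dim values per layer per token.
bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
free_gib = 10
tokens = free_gib * 1024**3 / bytes_per_token
print(f"{bytes_per_token} bytes/token, ~{tokens / 1e3:.0f}K tokens in {free_gib} GiB")
```

The exact token count reported by vLLM depends on the true head configuration and on block-allocation overhead.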
## License
Apache 2.0, same as the base model.