Qwen3-VL-8B-Instruct-NVFP4-FP8-Dynamic

This is a quantized version of Qwen/Qwen3-VL-8B-Instruct using Selective Layer Quantization with NVFP4 and FP8_DYNAMIC schemes.

Quantization Strategy

This model uses a hybrid quantization approach optimized for both performance and accuracy:

| Component | Scheme | Details |
|---|---|---|
| Attention layers | FP8_DYNAMIC | W8A8; preserves precision for Q/K/V/O projections |
| MLP layers | NVFP4 | W4A4; optimizes latency for gate/up/down projections |
| Vision encoder | BF16 (unquantized) | Full precision for visual understanding |
| LM head | BF16 (unquantized) | Full precision for output quality |
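To make the NVFP4 side concrete: FP4 (E2M1) can represent only the magnitudes {0, 0.5, 1, 1.5, 2, 3, 4, 6} plus a sign, so each small block of weights shares a scale chosen from its absolute maximum. The sketch below is illustrative only; the real NVFP4 format also quantizes the per-block scales (to FP8) over fixed 16-element blocks, which this sketch skips.

```python
# Representable magnitudes of an E2M1 (FP4) float: 1 sign bit, 2 exponent bits, 1 mantissa bit.
FP4_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block_fp4(block, max_fp4=6.0):
    """Quantize one block of weights to FP4 codes with a shared scale (illustrative)."""
    scale = max(abs(w) for w in block) / max_fp4 or 1.0  # avoid zero scale for all-zero blocks
    quantized = []
    for w in block:
        target = abs(w) / scale
        q = min(FP4_VALUES, key=lambda v: abs(v - target))  # round to nearest FP4 code
        quantized.append(q if w >= 0 else -q)
    return scale, quantized

def dequantize_block(scale, quantized):
    """Recover approximate weights: every value lands on the FP4 grid times the scale."""
    return [scale * q for q in quantized]

scale, q = quantize_block_fp4([0.12, -0.03, 0.48, 0.0])
print(q)                            # FP4 codes, e.g. [1.5, -0.5, 6.0, 0.0]
print(dequantize_block(scale, q))   # approximate reconstruction of the block
```

The block maximum is always representable exactly (it maps to ±6.0 times the scale); the error shows up on small-magnitude weights, which is why a fine block granularity matters.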

SmoothQuant

Applied with strength 0.8 for activation smoothing before quantization:

  • Q/K/V projections ← input_layernorm
  • Gate/Up projections ← post_attention_layernorm
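SmoothQuant migrates quantization difficulty from activations to weights: each activation channel j is divided by a factor s_j = max|X_j|^α / max|W_j|^(1−α) (here α = 0.8) while the matching weight row is multiplied by it, leaving the matmul output unchanged. A minimal pure-Python sketch of that rescaling (list-of-lists in place of tensors, illustrative only):

```python
def smoothquant_scales(act_absmax, weight_absmax, alpha=0.8):
    """Per-channel smoothing factors s_j = max|X_j|^alpha / max|W_j|^(1-alpha)."""
    return [a ** alpha / w ** (1 - alpha) for a, w in zip(act_absmax, weight_absmax)]

def apply_smoothing(activations, weights, scales):
    """Divide activation channels and scale the matching weight rows so X @ W is unchanged."""
    smoothed_x = [[x / s for x, s in zip(row, scales)] for row in activations]
    smoothed_w = [[w * scales[i] for w in row] for i, row in enumerate(weights)]
    return smoothed_x, smoothed_w

# Channel 1 has an activation outlier (max |x| = 8) but small weights:
X = [[2.0, -8.0], [1.0, 4.0]]
W = [[0.5, 1.0], [0.25, -0.5]]
s = smoothquant_scales(act_absmax=[2.0, 8.0], weight_absmax=[1.0, 0.5], alpha=0.8)
smoothed_x, smoothed_w = apply_smoothing(X, W, s)
```

After smoothing, the activation outlier shrinks (easier to quantize) while the product X @ W is numerically identical, which is why it is applied before quantization rather than changing the model's function.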

Model Details

  • Base Model: Qwen/Qwen3-VL-8B-Instruct
  • Quantization Method: compressed-tensors (llm-compressor)
  • Model Size: ~7.7 GB (down from ~17 GB, a ~55% reduction)
  • Config Groups: FP8_DYNAMIC (attention) + NVFP4 (MLP)

Usage with vLLM

vllm serve JEILDLWLRMA/Qwen3-VL-8B-Instruct-NVFP4-FP8-Dynamic \
    --quantization compressed-tensors \
    --kv-cache-dtype fp8 \
    --gpu-memory-utilization 0.9 \
    --enforce-eager \
    --max-model-len 8192
For offline inference, the same model can be loaded through the Python API:

from vllm import LLM, SamplingParams

llm = LLM(
    model="JEILDLWLRMA/Qwen3-VL-8B-Instruct-NVFP4-FP8-Dynamic",
    quantization="compressed-tensors",
    trust_remote_code=True,
    kv_cache_dtype="fp8",
    max_model_len=8192,
)

sampling_params = SamplingParams(temperature=0.7, top_p=0.95)
prompts = ["Your prompt here"]
outputs = llm.generate(prompts, sampling_params)
print(outputs[0].outputs[0].text)

Usage with Transformers

from transformers import AutoProcessor, Qwen3VLForConditionalGeneration
import torch

model_id = "JEILDLWLRMA/Qwen3-VL-8B-Instruct-NVFP4-FP8-Dynamic"
processor = AutoProcessor.from_pretrained(model_id)
model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

Quantization Config

Target Patterns (vLLM fused-layer compatible)

  • Attention (FP8_DYNAMIC): self_attn.(q_proj|k_proj|v_proj|o_proj|qkv_proj|qkv)
  • MLP (NVFP4): mlp.(gate_proj|up_proj|down_proj|gate_up_proj)
  • Ignored: Vision encoder (visual.*), output head (lm_head), MoE gates (mlp.gate)
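The patterns above can be checked against module names with ordinary regular expressions. The sketch below paraphrases them in plain `re` syntax (not the exact config format), with example layer names following the usual Qwen module layout:

```python
import re

# Illustrative restatement of the target patterns listed above.
ATTN_RE = re.compile(r"self_attn\.(q_proj|k_proj|v_proj|o_proj|qkv_proj|qkv)$")
MLP_RE = re.compile(r"mlp\.(gate_proj|up_proj|down_proj|gate_up_proj)$")
IGNORE_RE = re.compile(r"^visual\.|(^|\.)lm_head$|mlp\.gate$")  # vision tower, output head, MoE gates

def scheme_for(layer_name):
    """Map a module name to its quantization scheme; ignore rules win first."""
    if IGNORE_RE.search(layer_name):
        return "ignored (BF16)"
    if ATTN_RE.search(layer_name):
        return "FP8_DYNAMIC"
    if MLP_RE.search(layer_name):
        return "NVFP4"
    return "unquantized"

for name in [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.mlp.down_proj",
    "visual.blocks.3.attn.qkv",
    "lm_head",
]:
    print(f"{name}: {scheme_for(name)}")
```

Note the anchoring: `mlp.gate$` matches an MoE router gate but not `mlp.gate_proj`, which is why the ignore rule and the NVFP4 rule do not collide.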

Recipe Structure (recipe.yaml)

  • SmoothQuantModifier: Activation smoothing (strength=0.8)
  • QuantizationModifier (attn): FP8_DYNAMIC for attention layers
  • QuantizationModifier (mlp): NVFP4 for MLP layers

Performance Notes

  • Model loads in ~4 seconds and uses ~8 GiB of VRAM
  • On a 24 GB GPU, ~10 GiB remains for KV cache (~159K tokens)
  • GPUs without native FP4/FP8 support fall back to Marlin kernels
  • Requires vLLM 0.11+ with compressed-tensors support

License

Apache 2.0, same as the base model.
