Qwen3-VL-8B-Instruct-NVFP4-FP8-Dynamic

This is a quantized version of Qwen/Qwen3-VL-8B-Instruct using Selective Layer Quantization with NVFP4 and FP8_DYNAMIC schemes.

Quantization Strategy

This model uses a hybrid quantization approach optimized for both performance and accuracy:

| Component | Scheme | Details |
|---|---|---|
| Attention layers | FP8_DYNAMIC | W8A8; preserves precision for Q/K/V/O projections |
| MLP layers | NVFP4 | W4A4; optimizes latency for gate/up/down projections |
| Vision encoder | BF16 (unquantized) | Full precision for visual understanding |
| LM head | BF16 (unquantized) | Full precision for output quality |
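To make the NVFP4 side concrete: FP4 (E2M1) can represent only the magnitudes {0, 0.5, 1, 1.5, 2, 3, 4, 6} plus a sign, so each small block of weights shares a scale chosen from its absolute maximum. The sketch below is illustrative only; the real NVFP4 format also quantizes the per-block scales (to FP8) over fixed 16-element blocks, which this sketch skips.

```python
# Representable magnitudes of an E2M1 (FP4) float: 1 sign bit, 2 exponent bits, 1 mantissa bit.
FP4_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block_fp4(block, max_fp4=6.0):
    """Quantize one block of weights to FP4 codes with a shared scale (illustrative)."""
    scale = max(abs(w) for w in block) / max_fp4 or 1.0  # avoid zero scale for all-zero blocks
    quantized = []
    for w in block:
        target = abs(w) / scale
        q = min(FP4_VALUES, key=lambda v: abs(v - target))  # round to nearest FP4 code
        quantized.append(q if w >= 0 else -q)
    return scale, quantized

def dequantize_block(scale, quantized):
    """Recover approximate weights: every value lands on the FP4 grid times the scale."""
    return [scale * q for q in quantized]

scale, q = quantize_block_fp4([0.12, -0.03, 0.48, 0.0])
print(q)                            # FP4 codes, e.g. [1.5, -0.5, 6.0, 0.0]
print(dequantize_block(scale, q))   # approximate reconstruction of the block
```

The block maximum is always representable exactly (it maps to ±6.0 times the scale); the error shows up on small-magnitude weights, which is why a fine block granularity matters.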

SmoothQuant

Applied with strength 0.8 for activation smoothing before quantization:

  • Q/K/V projections ← input_layernorm
  • Gate/Up projections ← post_attention_layernorm
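SmoothQuant migrates quantization difficulty from activations to weights: each activation channel j is divided by a factor s_j = max|X_j|^α / max|W_j|^(1−α) (here α = 0.8) while the matching weight row is multiplied by it, leaving the matmul output unchanged. A minimal pure-Python sketch of that rescaling (list-of-lists in place of tensors, illustrative only):

```python
def smoothquant_scales(act_absmax, weight_absmax, alpha=0.8):
    """Per-channel smoothing factors s_j = max|X_j|^alpha / max|W_j|^(1-alpha)."""
    return [a ** alpha / w ** (1 - alpha) for a, w in zip(act_absmax, weight_absmax)]

def apply_smoothing(activations, weights, scales):
    """Divide activation channels and scale the matching weight rows so X @ W is unchanged."""
    smoothed_x = [[x / s for x, s in zip(row, scales)] for row in activations]
    smoothed_w = [[w * scales[i] for w in row] for i, row in enumerate(weights)]
    return smoothed_x, smoothed_w

# Channel 1 has an activation outlier (max |x| = 8) but small weights:
X = [[2.0, -8.0], [1.0, 4.0]]
W = [[0.5, 1.0], [0.25, -0.5]]
s = smoothquant_scales(act_absmax=[2.0, 8.0], weight_absmax=[1.0, 0.5], alpha=0.8)
smoothed_x, smoothed_w = apply_smoothing(X, W, s)
```

After smoothing, the activation outlier shrinks (easier to quantize) while the product X @ W is numerically identical, which is why it is applied before quantization rather than changing the model's function.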

Model Details

  • Base Model: Qwen/Qwen3-VL-8B-Instruct
  • Quantization Method: compressed-tensors (llm-compressor)
  • Model Size: ~7.7 GB (down from ~17 GB, a ~55% reduction)
  • Config Groups: FP8_DYNAMIC (attention) + NVFP4 (MLP)

Usage with vLLM

vllm serve JEILDLWLRMA/Qwen3-VL-8B-Instruct-NVFP4-FP8-Dynamic \
    --quantization compressed-tensors \
    --kv-cache-dtype fp8 \
    --gpu-memory-utilization 0.9 \
    --enforce-eager \
    --max-model-len 8192
For offline inference, the same model can be loaded through the Python API:

from vllm import LLM, SamplingParams

llm = LLM(
    model="JEILDLWLRMA/Qwen3-VL-8B-Instruct-NVFP4-FP8-Dynamic",
    quantization="compressed-tensors",
    trust_remote_code=True,
    kv_cache_dtype="fp8",
    max_model_len=8192,
)

sampling_params = SamplingParams(temperature=0.7, top_p=0.95)
prompts = ["Your prompt here"]
outputs = llm.generate(prompts, sampling_params)
print(outputs[0].outputs[0].text)

Usage with Transformers

from transformers import AutoProcessor, Qwen3VLForConditionalGeneration
import torch

model_id = "JEILDLWLRMA/Qwen3-VL-8B-Instruct-NVFP4-FP8-Dynamic"
processor = AutoProcessor.from_pretrained(model_id)
model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

Quantization Config

Target Patterns (vLLM fused-layer compatible)

  • Attention (FP8_DYNAMIC): self_attn.(q_proj|k_proj|v_proj|o_proj|qkv_proj|qkv)
  • MLP (NVFP4): mlp.(gate_proj|up_proj|down_proj|gate_up_proj)
  • Ignored: Vision encoder (visual.*), output head (lm_head), MoE gates (mlp.gate)
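The patterns above can be checked against module names with ordinary regular expressions. The sketch below paraphrases them in plain `re` syntax (not the exact config format), with example layer names following the usual Qwen module layout:

```python
import re

# Illustrative restatement of the target patterns listed above.
ATTN_RE = re.compile(r"self_attn\.(q_proj|k_proj|v_proj|o_proj|qkv_proj|qkv)$")
MLP_RE = re.compile(r"mlp\.(gate_proj|up_proj|down_proj|gate_up_proj)$")
IGNORE_RE = re.compile(r"^visual\.|(^|\.)lm_head$|mlp\.gate$")  # vision tower, output head, MoE gates

def scheme_for(layer_name):
    """Map a module name to its quantization scheme; ignore rules win first."""
    if IGNORE_RE.search(layer_name):
        return "ignored (BF16)"
    if ATTN_RE.search(layer_name):
        return "FP8_DYNAMIC"
    if MLP_RE.search(layer_name):
        return "NVFP4"
    return "unquantized"

for name in [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.mlp.down_proj",
    "visual.blocks.3.attn.qkv",
    "lm_head",
]:
    print(f"{name}: {scheme_for(name)}")
```

Note the anchoring: `mlp.gate$` matches an MoE router gate but not `mlp.gate_proj`, which is why the ignore rule and the NVFP4 rule do not collide.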

Recipe Structure (recipe.yaml)

  • SmoothQuantModifier: Activation smoothing (strength=0.8)
  • QuantizationModifier (attn): FP8_DYNAMIC for attention layers
  • QuantizationModifier (mlp): NVFP4 for MLP layers

Performance Notes

  • Model loads in ~4 seconds and uses ~8 GiB of VRAM
  • On a 24 GB GPU, ~10 GiB remains for KV cache (~159K tokens)
  • GPUs without native FP4/FP8 support fall back to Marlin kernels
  • Requires vLLM 0.11+ with compressed-tensors support

License

Apache 2.0, same as the base model.
