Qwen3-VL-8B-Thinking NVFP4A16 — vLLM / RTX 5060 Ti Test

This repository contains a locally quantized Qwen3-VL-8B-Thinking model using NVFP4A16 / compressed-tensors, tested with vLLM nightly on a consumer NVIDIA Blackwell GPU.

This model was prepared for vLLM inference with multimodal support enabled. The language backbone was quantized to NVFP4A16, while the visual encoder was kept in BF16 to avoid Marlin tile-size incompatibilities in Qwen3-VL visual MLP layers.

This model card section documents the local runtime test performed by Murilo Vieira on RTX 5060 Ti 16GB.

Quantization Summary

Item Value
Base model Qwen3-VL-8B-Thinking
Quantization format NVFP4A16
Runtime quantization loader compressed-tensors
Runtime vLLM OpenAI server
vLLM version tested 0.21.1rc1.dev46+gb50646e5e
Model architecture resolved by vLLM Qwen3VLForConditionalGeneration
Runtime dtype torch.bfloat16
KV cache dtype auto
Tensor parallel size 1
Pipeline parallel size 1
Data parallel size 1
Multimodal support image, video, audio prompt limits configured
Language backbone NVFP4A16
Visual encoder BF16 / not quantized
lm_head BF16 / not quantized

Why the Visual Encoder Was Not Quantized

During the initial NVFP4 conversion, vLLM failed when preparing some Qwen3-VL visual MLP layers for Marlin FP4 execution. The problematic layers had output dimensions such as:

model.visual.blocks.*.mlp.linear_fc1 | out_features=4304

Marlin FP4 packing requires compatible tile dimensions, and 4304 is not divisible by 64. To make the model load successfully in vLLM, the visual encoder was excluded from NVFP4 quantization.

The final working layout is:

Language backbone: NVFP4A16
Visual encoder: BF16
lm_head: BF16
Runtime: vLLM compressed-tensors

Tested Hardware

Component Configuration
GPU NVIDIA GeForce RTX 5060 Ti
GPU VRAM 16 GB
CPU Intel Xeon E5-2680 v4
System RAM 64 GB
Runtime Docker + NVIDIA Container Runtime
Container image vllm/vllm-openai:nightly-x86_64

vLLM Runtime Configuration

model: /models/Qwen3-VL-8B-Thinking-NVFP4
dtype: auto
gpu_memory_utilization: 0.93
max_model_len: 32768
max_num_batched_tokens: 8192
max_num_seqs: 4

enable_auto_tool_choice: true
tool_call_parser: qwen3_xml
reasoning_parser: qwen3

enable_chunked_prefill: true
performance_mode: interactivity
enforce_eager: false
trust_remote_code: true
tensor_parallel_size: 1

port: 8000
served_model_name:
  - Qwen3-VL

structured_outputs_config:
  backend: xgrammar
  disable_any_whitespace: true
  enable_in_reasoning: false

enable_server_load_tracking: true
kv_cache_metrics: true
kv_cache_metrics_sample: 0.05

limit_mm_per_prompt:
  image: 3
  video: 1
  audio: 1

mm_processor_kwargs:
  min_pixels: 3136
  max_pixels: 10035200
  fps: 1.0

media_io_kwargs:
  video:
    num_frames: 32
    fps: 2

data_parallel_size: 1
video_pruning_rate: 0.75

mm_encoder_tp_mode: data
mm_processor_cache_type: shm

kv_cache_dtype: auto
enable_prefix_caching: true
prefix_caching_hash_algo: xxhash

async_scheduling: true

compilation_config:
  pass_config:
    fuse_allreduce_rms: true
    fuse_attn_quant: true
    eliminate_noops: true

generation_config: vllm
skip_mm_profiling: true

Startup and Memory Statistics

Metric Result
Checkpoint size reported by vLLM 6.62 GiB
Available system RAM reported by vLLM 40.09 GiB
Weight loading time 4.10 s
Model loading GPU memory 7.06 GiB
Model loading time 6.54 s
Available KV cache memory 6.24 GiB
GPU KV cache size 45,456 tokens
Max configured context length 32,768 tokens
Estimated max concurrency at full context 1.39x
torch.compile total time 117.10 s
Engine init time 127.30 s
Multimodal warmup time 9.966 s
Read-only multimodal warmup time 0.312 s
Server port 8000

Throughput Observed

Scenario Prompt throughput Generation throughput Running requests GPU KV usage Prefix cache hit MM cache hit
Initial request / JIT warmup 2.3 tok/s 2.9 tok/s 1 0.1% 0.0% N/A
Single request 0.0 tok/s 73.2 tok/s 1 1.7% 0.0% N/A
Prompt-heavy request 54.8 tok/s 67.3 tok/s 1 2.1% 0.0% N/A
Single request 0.0 tok/s 71.1 tok/s 1 3.6% 0.0% N/A
Multimodal / two running requests 113.0 tok/s 64.3 tok/s 2 7.6% 0.9% 0.0%
Two running requests 0.0 tok/s 132.2 tok/s 2 10.5% 0.9% 0.0%
Two running requests 0.0 tok/s 128.2 tok/s 2 13.3% 0.9% 0.0%
Two running requests 0.0 tok/s 124.8 tok/s 2 16.1% 0.9% 0.0%
Two running requests 0.0 tok/s 121.4 tok/s 2 18.7% 0.9% 0.0%
Two running requests 0.0 tok/s 118.4 tok/s 2 21.3% 0.9% 0.0%
Two running requests 0.0 tok/s 115.6 tok/s 2 23.9% 0.9% 0.0%
Single request 0.0 tok/s 79.8 tok/s 1 12.2% 0.9% 0.0%
Single request 0.0 tok/s 64.1 tok/s 1 13.6% 0.9% 0.0%

Runtime Behavior

  • vLLM successfully resolved the architecture as Qwen3VLForConditionalGeneration.
  • vLLM detected the model as quantization=compressed-tensors.
  • FlashAttention was used for the main attention backend.
  • FlashAttention was also used for ViT and multimodal encoder attention.
  • FlashInfer was used for top-k / top-p sampling.
  • torch.compile was enabled and cached the graph for compile range (1, 8192).
  • CUDA graph capture completed successfully for decode.
  • Multimodal warmup completed successfully.
  • The OpenAI-compatible API server started successfully on port 8000.

Docker Compose Used

vllm-vl:
  image: vllm/vllm-openai:nightly-x86_64
  container_name: vllm-vl
  hostname: vllm-vl
  restart: unless-stopped
  runtime: nvidia
  network_mode: host
  ipc: host
  shm_size: '16gb'
  volumes:
    - /mnt/dados/storage/cache/huggingface:/root/.cache/huggingface
    - /mnt/dados/storage/cache/vllm-cache:/root/.cache/vllm
    - /mnt/dados/storage/models/Qwen:/models:ro
    - /mnt/dados/storage/config:/configs:ro
  environment:
    - VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
    - TORCH_CUDA_ARCH_LIST=12.0
    - CUDA_DEVICE_ORDER=PCI_BUS_ID
    - NVIDIA_VISIBLE_DEVICES=all
    - NVIDIA_DRIVER_CAPABILITIES=all
    - VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1
    - PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
    - OMP_NUM_THREADS=14
    - VLLM_NO_USAGE_STATS=1
    - DO_NOT_TRACK=1
    - VLLM_TEST_FORCE_FP8_MARLIN=1
    - VLLM_MARLIN_USE_ATOMIC_ADD=1
    - TORCH_MATMUL_PRECISION=high
    - NVIDIA_FORWARD_COMPAT=1
    - NVIDIA_DISABLE_REQUIRE=1
    - VLLM_USE_FLASHINFER_SAMPLER=1
    - VLLM_NVFP4_GEMM_BACKEND=marlin
    - VLLM_VIDEO_LOADER_BACKEND=opencv
    - VLLM_TARGET_DEVICE=cuda
    - VLLM_FLOAT32_MATMUL_PRECISION=high
    - VLLM_USE_STANDALONE_COMPILE=1
    - VLLM_ENABLE_V1_MULTIPROCESSING=1
    - VLLM_V1_OUTPUT_PROC_CHUNK_SIZE=128
    - VLLM_SKIP_MODEL_NAME_VALIDATION=1
    - VLLM_LOGGING_LEVEL=INFO
    - VLLM_LOG_STATS_INTERVAL=10
    - VLLM_XGRAMMAR_CACHE_MB=1024
    - VLLM_IMAGE_FETCH_TIMEOUT=10
    - VLLM_AUDIO_FETCH_TIMEOUT=30
    - VLLM_MEDIA_CACHE=/root/.cache/vllm/media
    - VLLM_MEDIA_CACHE_MAX_SIZE_MB=10240
    - VLLM_MAX_AUDIO_CLIP_FILESIZE_MB=100
  command: >
    --config /configs/Qwen3-VL.yaml
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            device_ids: ['1']
            capabilities: [gpu]

Suggested vLLM Command

vllm serve /models/Qwen3-VL-8B-Thinking-NVFP4 \
  --trust-remote-code \
  --served-model-name Qwen3-VL \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.93 \
  --max-num-batched-tokens 8192 \
  --max-num-seqs 4 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --prefix-caching-hash-algo xxhash \
  --tool-call-parser qwen3_xml \
  --reasoning-parser qwen3 \
  --performance-mode interactivity \
  --port 8000

Known Runtime Warnings

The following warnings were observed and are not necessarily fatal:

  • VLLM_BUILD_COMMIT, VLLM_BUILD_PIPELINE, VLLM_BUILD_URL, and VLLM_IMAGE_TAG were detected as unknown vLLM environment variables.
  • Qwen2VLImageProcessorFast deprecation warnings were emitted by Transformers.
  • fuse_attn_quant was reported as incompatible with piecewise CUDA Graphs when graph partitioning was disabled.
  • AllReduce fusion was disabled because tp_size <= 1.
  • MLA attention + quant fusion was enabled, but no MLA layers were found.
  • Triton JIT compilation occurred during early inference for _compute_slot_mapping_kernel, _bilinear_pos_embed_kernel, and rotary_kernel.

Reproducibility Notes

These results are from a single local test run and should be treated as environment-specific. Throughput may vary depending on:

  • prompt length,
  • output length,
  • number and resolution of images,
  • video frame count,
  • sampling settings,
  • driver/CUDA versions,
  • vLLM nightly build,
  • whether torch.compile cache is already warm,
  • FlashAttention / FlashInfer backend behavior,
  • KV cache dtype,
  • context length,
  • multimodal preprocessing settings.

Status

✅ Loaded successfully
✅ OpenAI-compatible server started
✅ Multimodal warmup completed
✅ Qwen3 XML tool parser enabled
✅ Long context configured at 32,768 tokens
✅ NVFP4A16 compressed-tensors runtime working
✅ Tested on RTX 5060 Ti 16GB

Downloads last month
22
Safetensors
Model size
6B params
Tensor type
F32
·
BF16
·
F8_E4M3
·
U8
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for murilonwt/Qwen3-VL-8B-Thinking-NVFP4

Quantized
(32)
this model