Qwen3-VL-8B-Thinking NVFP4A16 — vLLM / RTX 5060 Ti Test

This repository contains a locally quantized Qwen3-VL-8B-Thinking model using NVFP4A16 / compressed-tensors, tested with vLLM nightly on a consumer NVIDIA Blackwell GPU.

This model was prepared for vLLM inference with multimodal support enabled. The language backbone was quantized to NVFP4A16, while the visual encoder was kept in BF16 to avoid Marlin tile-size incompatibilities in Qwen3-VL visual MLP layers.

This model card section documents the local runtime test performed by Murilo Vieira on RTX 5060 Ti 16GB.

Quantization Summary

Item	Value
Base model	Qwen3-VL-8B-Thinking
Quantization format	NVFP4A16
Runtime quantization loader	`compressed-tensors`
Runtime	vLLM OpenAI server
vLLM version tested	`0.21.1rc1.dev46+gb50646e5e`
Model architecture resolved by vLLM	`Qwen3VLForConditionalGeneration`
Runtime dtype	`torch.bfloat16`
KV cache dtype	`auto`
Tensor parallel size	1
Pipeline parallel size	1
Data parallel size	1
Multimodal support	image, video, audio prompt limits configured
Language backbone	NVFP4A16
Visual encoder	BF16 / not quantized
`lm_head`	BF16 / not quantized

Why the Visual Encoder Was Not Quantized

During the initial NVFP4 conversion, vLLM failed when preparing some Qwen3-VL visual MLP layers for Marlin FP4 execution. The problematic layers had output dimensions such as:

model.visual.blocks.*.mlp.linear_fc1 | out_features=4304

Marlin FP4 packing requires compatible tile dimensions, and 4304 is not divisible by 64. To make the model load successfully in vLLM, the visual encoder was excluded from NVFP4 quantization.

The final working layout is:

Language backbone: NVFP4A16
Visual encoder: BF16
lm_head: BF16
Runtime: vLLM compressed-tensors

Tested Hardware

Component	Configuration
GPU	NVIDIA GeForce RTX 5060 Ti
GPU VRAM	16 GB
CPU	Intel Xeon E5-2680 v4
System RAM	64 GB
Runtime	Docker + NVIDIA Container Runtime
Container image	`vllm/vllm-openai:nightly-x86_64`

vLLM Runtime Configuration

model: /models/Qwen3-VL-8B-Thinking-NVFP4
dtype: auto
gpu_memory_utilization: 0.93
max_model_len: 32768
max_num_batched_tokens: 8192
max_num_seqs: 4

enable_auto_tool_choice: true
tool_call_parser: qwen3_xml
reasoning_parser: qwen3

enable_chunked_prefill: true
performance_mode: interactivity
enforce_eager: false
trust_remote_code: true
tensor_parallel_size: 1

port: 8000
served_model_name:
  - Qwen3-VL

structured_outputs_config:
  backend: xgrammar
  disable_any_whitespace: true
  enable_in_reasoning: false

enable_server_load_tracking: true
kv_cache_metrics: true
kv_cache_metrics_sample: 0.05

limit_mm_per_prompt:
  image: 3
  video: 1
  audio: 1

mm_processor_kwargs:
  min_pixels: 3136
  max_pixels: 10035200
  fps: 1.0

media_io_kwargs:
  video:
    num_frames: 32
    fps: 2

data_parallel_size: 1
video_pruning_rate: 0.75

mm_encoder_tp_mode: data
mm_processor_cache_type: shm

kv_cache_dtype: auto
enable_prefix_caching: true
prefix_caching_hash_algo: xxhash

async_scheduling: true

compilation_config:
  pass_config:
    fuse_allreduce_rms: true
    fuse_attn_quant: true
    eliminate_noops: true

generation_config: vllm
skip_mm_profiling: true

Startup and Memory Statistics

Metric	Result
Checkpoint size reported by vLLM	6.62 GiB
Available system RAM reported by vLLM	40.09 GiB
Weight loading time	4.10 s
Model loading GPU memory	7.06 GiB
Model loading time	6.54 s
Available KV cache memory	6.24 GiB
GPU KV cache size	45,456 tokens
Max configured context length	32,768 tokens
Estimated max concurrency at full context	1.39x
`torch.compile` total time	117.10 s
Engine init time	127.30 s
Multimodal warmup time	9.966 s
Read-only multimodal warmup time	0.312 s
Server port	8000

Throughput Observed

Scenario	Prompt throughput	Generation throughput	Running requests	GPU KV usage	Prefix cache hit	MM cache hit
Initial request / JIT warmup	2.3 tok/s	2.9 tok/s	1	0.1%	0.0%	N/A
Single request	0.0 tok/s	73.2 tok/s	1	1.7%	0.0%	N/A
Prompt-heavy request	54.8 tok/s	67.3 tok/s	1	2.1%	0.0%	N/A
Single request	0.0 tok/s	71.1 tok/s	1	3.6%	0.0%	N/A
Multimodal / two running requests	113.0 tok/s	64.3 tok/s	2	7.6%	0.9%	0.0%
Two running requests	0.0 tok/s	132.2 tok/s	2	10.5%	0.9%	0.0%
Two running requests	0.0 tok/s	128.2 tok/s	2	13.3%	0.9%	0.0%
Two running requests	0.0 tok/s	124.8 tok/s	2	16.1%	0.9%	0.0%
Two running requests	0.0 tok/s	121.4 tok/s	2	18.7%	0.9%	0.0%
Two running requests	0.0 tok/s	118.4 tok/s	2	21.3%	0.9%	0.0%
Two running requests	0.0 tok/s	115.6 tok/s	2	23.9%	0.9%	0.0%
Single request	0.0 tok/s	79.8 tok/s	1	12.2%	0.9%	0.0%
Single request	0.0 tok/s	64.1 tok/s	1	13.6%	0.9%	0.0%

Runtime Behavior

vLLM successfully resolved the architecture as Qwen3VLForConditionalGeneration.
vLLM detected the model as quantization=compressed-tensors.
FlashAttention was used for the main attention backend.
FlashAttention was also used for ViT and multimodal encoder attention.
FlashInfer was used for top-k / top-p sampling.
torch.compile was enabled and cached the graph for compile range (1, 8192).
CUDA graph capture completed successfully for decode.
Multimodal warmup completed successfully.
The OpenAI-compatible API server started successfully on port 8000.

Docker Compose Used

vllm-vl:
  image: vllm/vllm-openai:nightly-x86_64
  container_name: vllm-vl
  hostname: vllm-vl
  restart: unless-stopped
  runtime: nvidia
  network_mode: host
  ipc: host
  shm_size: '16gb'
  volumes:
    - /mnt/dados/storage/cache/huggingface:/root/.cache/huggingface
    - /mnt/dados/storage/cache/vllm-cache:/root/.cache/vllm
    - /mnt/dados/storage/models/Qwen:/models:ro
    - /mnt/dados/storage/config:/configs:ro
  environment:
    - VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
    - TORCH_CUDA_ARCH_LIST=12.0
    - CUDA_DEVICE_ORDER=PCI_BUS_ID
    - NVIDIA_VISIBLE_DEVICES=all
    - NVIDIA_DRIVER_CAPABILITIES=all
    - VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1
    - PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
    - OMP_NUM_THREADS=14
    - VLLM_NO_USAGE_STATS=1
    - DO_NOT_TRACK=1
    - VLLM_TEST_FORCE_FP8_MARLIN=1
    - VLLM_MARLIN_USE_ATOMIC_ADD=1
    - TORCH_MATMUL_PRECISION=high
    - NVIDIA_FORWARD_COMPAT=1
    - NVIDIA_DISABLE_REQUIRE=1
    - VLLM_USE_FLASHINFER_SAMPLER=1
    - VLLM_NVFP4_GEMM_BACKEND=marlin
    - VLLM_VIDEO_LOADER_BACKEND=opencv
    - VLLM_TARGET_DEVICE=cuda
    - VLLM_FLOAT32_MATMUL_PRECISION=high
    - VLLM_USE_STANDALONE_COMPILE=1
    - VLLM_ENABLE_V1_MULTIPROCESSING=1
    - VLLM_V1_OUTPUT_PROC_CHUNK_SIZE=128
    - VLLM_SKIP_MODEL_NAME_VALIDATION=1
    - VLLM_LOGGING_LEVEL=INFO
    - VLLM_LOG_STATS_INTERVAL=10
    - VLLM_XGRAMMAR_CACHE_MB=1024
    - VLLM_IMAGE_FETCH_TIMEOUT=10
    - VLLM_AUDIO_FETCH_TIMEOUT=30
    - VLLM_MEDIA_CACHE=/root/.cache/vllm/media
    - VLLM_MEDIA_CACHE_MAX_SIZE_MB=10240
    - VLLM_MAX_AUDIO_CLIP_FILESIZE_MB=100
  command: >
    --config /configs/Qwen3-VL.yaml
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            device_ids: ['1']
            capabilities: [gpu]

Suggested vLLM Command

vllm serve /models/Qwen3-VL-8B-Thinking-NVFP4 \
  --trust-remote-code \
  --served-model-name Qwen3-VL \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.93 \
  --max-num-batched-tokens 8192 \
  --max-num-seqs 4 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --prefix-caching-hash-algo xxhash \
  --tool-call-parser qwen3_xml \
  --reasoning-parser qwen3 \
  --performance-mode interactivity \
  --port 8000

Known Runtime Warnings

The following warnings were observed and are not necessarily fatal:

VLLM_BUILD_COMMIT, VLLM_BUILD_PIPELINE, VLLM_BUILD_URL, and VLLM_IMAGE_TAG were detected as unknown vLLM environment variables.
Qwen2VLImageProcessorFast deprecation warnings were emitted by Transformers.
fuse_attn_quant was reported as incompatible with piecewise CUDA Graphs when graph partitioning was disabled.
AllReduce fusion was disabled because tp_size <= 1.
MLA attention + quant fusion was enabled, but no MLA layers were found.
Triton JIT compilation occurred during early inference for _compute_slot_mapping_kernel, _bilinear_pos_embed_kernel, and rotary_kernel.

Reproducibility Notes

These results are from a single local test run and should be treated as environment-specific. Throughput may vary depending on:

prompt length,
output length,
number and resolution of images,
video frame count,
sampling settings,
driver/CUDA versions,
vLLM nightly build,
whether torch.compile cache is already warm,
FlashAttention / FlashInfer backend behavior,
KV cache dtype,
context length,
multimodal preprocessing settings.

Status

✅ Loaded successfully
✅ OpenAI-compatible server started
✅ Multimodal warmup completed
✅ Qwen3 XML tool parser enabled
✅ Long context configured at 32,768 tokens
✅ NVFP4A16 compressed-tensors runtime working
✅ Tested on RTX 5060 Ti 16GB

Downloads last month: 4

Safetensors

Model size

6B params

Tensor type

F32

BF16

F8_E4M3

Model tree for murilonwt/Qwen3-VL-8B-Thinking-NVFP4

Base model

Qwen/Qwen3-VL-8B-Thinking

Quantized

(33)

this model