Qwen3-VL-8B-Thinking NVFP4A16 — vLLM / RTX 5060 Ti Test
This repository contains a locally quantized Qwen3-VL-8B-Thinking model using NVFP4A16 / compressed-tensors, tested with vLLM nightly on a consumer NVIDIA Blackwell GPU.
This model was prepared for vLLM inference with multimodal support enabled. The language backbone was quantized to NVFP4A16, while the visual encoder was kept in BF16 to avoid Marlin tile-size incompatibilities in Qwen3-VL visual MLP layers.
This model card section documents the local runtime test performed by Murilo Vieira on RTX 5060 Ti 16GB.
Quantization Summary
| Item | Value |
|---|---|
| Base model | Qwen3-VL-8B-Thinking |
| Quantization format | NVFP4A16 |
| Runtime quantization loader | compressed-tensors |
| Runtime | vLLM OpenAI server |
| vLLM version tested | 0.21.1rc1.dev46+gb50646e5e |
| Model architecture resolved by vLLM | Qwen3VLForConditionalGeneration |
| Runtime dtype | torch.bfloat16 |
| KV cache dtype | auto |
| Tensor parallel size | 1 |
| Pipeline parallel size | 1 |
| Data parallel size | 1 |
| Multimodal support | image, video, audio prompt limits configured |
| Language backbone | NVFP4A16 |
| Visual encoder | BF16 / not quantized |
lm_head |
BF16 / not quantized |
Why the Visual Encoder Was Not Quantized
During the initial NVFP4 conversion, vLLM failed when preparing some Qwen3-VL visual MLP layers for Marlin FP4 execution. The problematic layers had output dimensions such as:
model.visual.blocks.*.mlp.linear_fc1 | out_features=4304
Marlin FP4 packing requires compatible tile dimensions, and 4304 is not divisible by 64. To make the model load successfully in vLLM, the visual encoder was excluded from NVFP4 quantization.
The final working layout is:
Language backbone: NVFP4A16
Visual encoder: BF16
lm_head: BF16
Runtime: vLLM compressed-tensors
Tested Hardware
| Component | Configuration |
|---|---|
| GPU | NVIDIA GeForce RTX 5060 Ti |
| GPU VRAM | 16 GB |
| CPU | Intel Xeon E5-2680 v4 |
| System RAM | 64 GB |
| Runtime | Docker + NVIDIA Container Runtime |
| Container image | vllm/vllm-openai:nightly-x86_64 |
vLLM Runtime Configuration
model: /models/Qwen3-VL-8B-Thinking-NVFP4
dtype: auto
gpu_memory_utilization: 0.93
max_model_len: 32768
max_num_batched_tokens: 8192
max_num_seqs: 4
enable_auto_tool_choice: true
tool_call_parser: qwen3_xml
reasoning_parser: qwen3
enable_chunked_prefill: true
performance_mode: interactivity
enforce_eager: false
trust_remote_code: true
tensor_parallel_size: 1
port: 8000
served_model_name:
- Qwen3-VL
structured_outputs_config:
backend: xgrammar
disable_any_whitespace: true
enable_in_reasoning: false
enable_server_load_tracking: true
kv_cache_metrics: true
kv_cache_metrics_sample: 0.05
limit_mm_per_prompt:
image: 3
video: 1
audio: 1
mm_processor_kwargs:
min_pixels: 3136
max_pixels: 10035200
fps: 1.0
media_io_kwargs:
video:
num_frames: 32
fps: 2
data_parallel_size: 1
video_pruning_rate: 0.75
mm_encoder_tp_mode: data
mm_processor_cache_type: shm
kv_cache_dtype: auto
enable_prefix_caching: true
prefix_caching_hash_algo: xxhash
async_scheduling: true
compilation_config:
pass_config:
fuse_allreduce_rms: true
fuse_attn_quant: true
eliminate_noops: true
generation_config: vllm
skip_mm_profiling: true
Startup and Memory Statistics
| Metric | Result |
|---|---|
| Checkpoint size reported by vLLM | 6.62 GiB |
| Available system RAM reported by vLLM | 40.09 GiB |
| Weight loading time | 4.10 s |
| Model loading GPU memory | 7.06 GiB |
| Model loading time | 6.54 s |
| Available KV cache memory | 6.24 GiB |
| GPU KV cache size | 45,456 tokens |
| Max configured context length | 32,768 tokens |
| Estimated max concurrency at full context | 1.39x |
torch.compile total time |
117.10 s |
| Engine init time | 127.30 s |
| Multimodal warmup time | 9.966 s |
| Read-only multimodal warmup time | 0.312 s |
| Server port | 8000 |
Throughput Observed
| Scenario | Prompt throughput | Generation throughput | Running requests | GPU KV usage | Prefix cache hit | MM cache hit |
|---|---|---|---|---|---|---|
| Initial request / JIT warmup | 2.3 tok/s | 2.9 tok/s | 1 | 0.1% | 0.0% | N/A |
| Single request | 0.0 tok/s | 73.2 tok/s | 1 | 1.7% | 0.0% | N/A |
| Prompt-heavy request | 54.8 tok/s | 67.3 tok/s | 1 | 2.1% | 0.0% | N/A |
| Single request | 0.0 tok/s | 71.1 tok/s | 1 | 3.6% | 0.0% | N/A |
| Multimodal / two running requests | 113.0 tok/s | 64.3 tok/s | 2 | 7.6% | 0.9% | 0.0% |
| Two running requests | 0.0 tok/s | 132.2 tok/s | 2 | 10.5% | 0.9% | 0.0% |
| Two running requests | 0.0 tok/s | 128.2 tok/s | 2 | 13.3% | 0.9% | 0.0% |
| Two running requests | 0.0 tok/s | 124.8 tok/s | 2 | 16.1% | 0.9% | 0.0% |
| Two running requests | 0.0 tok/s | 121.4 tok/s | 2 | 18.7% | 0.9% | 0.0% |
| Two running requests | 0.0 tok/s | 118.4 tok/s | 2 | 21.3% | 0.9% | 0.0% |
| Two running requests | 0.0 tok/s | 115.6 tok/s | 2 | 23.9% | 0.9% | 0.0% |
| Single request | 0.0 tok/s | 79.8 tok/s | 1 | 12.2% | 0.9% | 0.0% |
| Single request | 0.0 tok/s | 64.1 tok/s | 1 | 13.6% | 0.9% | 0.0% |
Runtime Behavior
- vLLM successfully resolved the architecture as
Qwen3VLForConditionalGeneration. - vLLM detected the model as
quantization=compressed-tensors. - FlashAttention was used for the main attention backend.
- FlashAttention was also used for ViT and multimodal encoder attention.
- FlashInfer was used for top-k / top-p sampling.
torch.compilewas enabled and cached the graph for compile range(1, 8192).- CUDA graph capture completed successfully for decode.
- Multimodal warmup completed successfully.
- The OpenAI-compatible API server started successfully on port
8000.
Docker Compose Used
vllm-vl:
image: vllm/vllm-openai:nightly-x86_64
container_name: vllm-vl
hostname: vllm-vl
restart: unless-stopped
runtime: nvidia
network_mode: host
ipc: host
shm_size: '16gb'
volumes:
- /mnt/dados/storage/cache/huggingface:/root/.cache/huggingface
- /mnt/dados/storage/cache/vllm-cache:/root/.cache/vllm
- /mnt/dados/storage/models/Qwen:/models:ro
- /mnt/dados/storage/config:/configs:ro
environment:
- VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
- TORCH_CUDA_ARCH_LIST=12.0
- CUDA_DEVICE_ORDER=PCI_BUS_ID
- NVIDIA_VISIBLE_DEVICES=all
- NVIDIA_DRIVER_CAPABILITIES=all
- VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1
- PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
- OMP_NUM_THREADS=14
- VLLM_NO_USAGE_STATS=1
- DO_NOT_TRACK=1
- VLLM_TEST_FORCE_FP8_MARLIN=1
- VLLM_MARLIN_USE_ATOMIC_ADD=1
- TORCH_MATMUL_PRECISION=high
- NVIDIA_FORWARD_COMPAT=1
- NVIDIA_DISABLE_REQUIRE=1
- VLLM_USE_FLASHINFER_SAMPLER=1
- VLLM_NVFP4_GEMM_BACKEND=marlin
- VLLM_VIDEO_LOADER_BACKEND=opencv
- VLLM_TARGET_DEVICE=cuda
- VLLM_FLOAT32_MATMUL_PRECISION=high
- VLLM_USE_STANDALONE_COMPILE=1
- VLLM_ENABLE_V1_MULTIPROCESSING=1
- VLLM_V1_OUTPUT_PROC_CHUNK_SIZE=128
- VLLM_SKIP_MODEL_NAME_VALIDATION=1
- VLLM_LOGGING_LEVEL=INFO
- VLLM_LOG_STATS_INTERVAL=10
- VLLM_XGRAMMAR_CACHE_MB=1024
- VLLM_IMAGE_FETCH_TIMEOUT=10
- VLLM_AUDIO_FETCH_TIMEOUT=30
- VLLM_MEDIA_CACHE=/root/.cache/vllm/media
- VLLM_MEDIA_CACHE_MAX_SIZE_MB=10240
- VLLM_MAX_AUDIO_CLIP_FILESIZE_MB=100
command: >
--config /configs/Qwen3-VL.yaml
deploy:
resources:
reservations:
devices:
- driver: nvidia
device_ids: ['1']
capabilities: [gpu]
Suggested vLLM Command
vllm serve /models/Qwen3-VL-8B-Thinking-NVFP4 \
--trust-remote-code \
--served-model-name Qwen3-VL \
--max-model-len 32768 \
--gpu-memory-utilization 0.93 \
--max-num-batched-tokens 8192 \
--max-num-seqs 4 \
--enable-chunked-prefill \
--enable-prefix-caching \
--prefix-caching-hash-algo xxhash \
--tool-call-parser qwen3_xml \
--reasoning-parser qwen3 \
--performance-mode interactivity \
--port 8000
Known Runtime Warnings
The following warnings were observed and are not necessarily fatal:
VLLM_BUILD_COMMIT,VLLM_BUILD_PIPELINE,VLLM_BUILD_URL, andVLLM_IMAGE_TAGwere detected as unknown vLLM environment variables.Qwen2VLImageProcessorFastdeprecation warnings were emitted by Transformers.fuse_attn_quantwas reported as incompatible with piecewise CUDA Graphs when graph partitioning was disabled.AllReducefusion was disabled becausetp_size <= 1.- MLA attention + quant fusion was enabled, but no MLA layers were found.
- Triton JIT compilation occurred during early inference for
_compute_slot_mapping_kernel,_bilinear_pos_embed_kernel, androtary_kernel.
Reproducibility Notes
These results are from a single local test run and should be treated as environment-specific. Throughput may vary depending on:
- prompt length,
- output length,
- number and resolution of images,
- video frame count,
- sampling settings,
- driver/CUDA versions,
- vLLM nightly build,
- whether
torch.compilecache is already warm, - FlashAttention / FlashInfer backend behavior,
- KV cache dtype,
- context length,
- multimodal preprocessing settings.
Status
✅ Loaded successfully
✅ OpenAI-compatible server started
✅ Multimodal warmup completed
✅ Qwen3 XML tool parser enabled
✅ Long context configured at 32,768 tokens
✅ NVFP4A16 compressed-tensors runtime working
✅ Tested on RTX 5060 Ti 16GB
- Downloads last month
- 22
Model tree for murilonwt/Qwen3-VL-8B-Thinking-NVFP4
Base model
Qwen/Qwen3-VL-8B-Thinking