# Qwen3-VL-2B-Instruct for RKLLM v1.2.3 (RK3588 NPU)
Pre-converted Qwen3-VL-2B-Instruct for the Rockchip RK3588 NPU using rknn-llm runtime v1.2.3.
Runs on Orange Pi 5 Plus, Rock 5B, Radxa NX5, and other RK3588-based SBCs with 8GB+ RAM.
## Files
| File | Size | Description |
|---|---|---|
| `qwen3-vl-2b-instruct_w8a8_rk3588.rkllm` | 2.3 GB | LLM decoder (W8A8 quantized) — shared by all vision resolutions |
| `qwen3-vl-2b_vision_448_rk3588.rknn` | 812 MB | Vision encoder @ 448×448 (default, 196 tokens) |
| `qwen3-vl-2b_vision_672_rk3588.rknn` | 854 MB | Vision encoder @ 672×672 (441 tokens) — recommended |
| `qwen3-vl-2b_vision_896_rk3588.rknn` | 923 MB | Vision encoder @ 896×896 (784 tokens) |
## Choosing a Vision Encoder Resolution
The LLM decoder (`.rkllm`) is resolution-independent — only the vision encoder (`.rknn`) changes. Place exactly one `.rknn` file alongside the `.rkllm` in your model directory; rename the unused encoders to `.rknn.alt` to disable them.
| Resolution | Visual Tokens | Encoder Time* | Total Response* | Best For |
|---|---|---|---|---|
| 448×448 | 196 (14×14) | ~2 s | ~5–10 s | General scene description, fast responses |
| 672×672 (recommended) | 441 (21×21) | ~4 s | ~9–11 s | Balanced: good detail + reasonable speed |
| 896×896 | 784 (28×28) | ~12 s | ~25–28 s | Maximum detail, fine text/OCR tasks |
*Measured on Orange Pi 5 Plus (16GB) with 14MB JPEG input, single image.
### Resolution Math
Qwen3-VL uses `patch_size=16` and `merge_size=2`, so:
- Resolution must be divisible by 32 (16 × 2)
- Visual tokens = (height/32)² = 196 / 441 / 784 for 448 / 672 / 896
Higher resolution = more visual tokens = better fine detail but:
- Proportionally more NPU compute for the vision encoder
- More tokens for the LLM to process (longer prefill)
- Same decode speed (~15 tok/s) — only "time to first token" increases
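The token arithmetic above can be checked with a few lines of Python (the constants come from the Qwen3-VL config; the helper itself is illustrative, not part of any toolkit):

```python
# Each visual token covers patch_size * merge_size pixels per axis.
PATCH_SIZE = 16
MERGE_SIZE = 2
STRIDE = PATCH_SIZE * MERGE_SIZE  # 32 px per visual token, per axis

def visual_tokens(height: int, width: int) -> int:
    """Number of visual tokens for a given input resolution."""
    if height % STRIDE or width % STRIDE:
        raise ValueError(f"resolution must be divisible by {STRIDE}")
    return (height // STRIDE) * (width // STRIDE)

for res in (448, 672, 896):
    print(f"{res}x{res} -> {visual_tokens(res, res)} tokens")
# 448x448 -> 196 tokens
# 672x672 -> 441 tokens
# 896x896 -> 784 tokens
```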
## Quick Start

### Directory Structure
```
~/models/Qwen3-VL-2b/
├── qwen3-vl-2b-instruct_w8a8_rk3588.rkllm     # LLM decoder (always needed)
├── qwen3-vl-2b_vision_672_rk3588.rknn         # Active vision encoder
├── qwen3-vl-2b_vision_448_rk3588.rknn.alt     # Alternative (inactive)
└── qwen3-vl-2b_vision_896_rk3588.rknn.alt     # Alternative (inactive)
```
### Switching Resolution
To switch to a different resolution, rename the files:
```bash
cd ~/models/Qwen3-VL-2b/

# Deactivate the current encoder
mv qwen3-vl-2b_vision_672_rk3588.rknn qwen3-vl-2b_vision_672_rk3588.rknn.alt

# Activate the 896 encoder
mv qwen3-vl-2b_vision_896_rk3588.rknn.alt qwen3-vl-2b_vision_896_rk3588.rknn

# Restart your API server
sudo systemctl restart rkllm-api
```
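The renaming steps can also be scripted. Below is a hypothetical Python helper (not part of rknn-llm) that generalizes the `mv` commands: it parks any currently active encoder and activates the one for the requested resolution:

```python
import glob
import os

def use_vision_res(model_dir: str, res: int) -> None:
    """Activate the vision encoder for `res` by renaming .rknn/.rknn.alt files.
    Hypothetical helper; assumes the file naming scheme shown above."""
    # Park every currently active encoder (append .alt).
    pattern = os.path.join(model_dir, "qwen3-vl-2b_vision_*_rk3588.rknn")
    for f in glob.glob(pattern):
        os.rename(f, f + ".alt")
    # Activate the requested one (strip .alt).
    target = os.path.join(model_dir, f"qwen3-vl-2b_vision_{res}_rk3588.rknn")
    os.rename(target + ".alt", target)

# Example: use_vision_res(os.path.expanduser("~/models/Qwen3-VL-2b"), 896)
```

After switching, restart the API server (`sudo systemctl restart rkllm-api`) so it picks up the new encoder.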
## Using with RKLLM API Server
This model is designed for use with the RKLLM API Server, which provides an OpenAI-compatible API for RK3588 NPU inference. The server auto-detects the vision encoder resolution from the `.rknn` file's input tensor attributes.
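For illustration, here is how an OpenAI-style chat request with an inline image can be assembled in Python. The payload shape follows the standard OpenAI vision message format; the model name and endpoint URL are placeholders — check your server's actual configuration:

```python
import base64

def build_vl_request(image_bytes: bytes, prompt: str,
                     model: str = "qwen3-vl-2b") -> dict:
    """Assemble an OpenAI-style chat completion payload with one inline image."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
                {"type": "text", "text": prompt},
            ],
        }],
    }

# Stand-in bytes; in practice read a real JPEG from disk.
payload = build_vl_request(b"\xff\xd8\xff\xe0", "Describe this image.")
# Then POST it to the server, e.g.:
# requests.post("http://localhost:8080/v1/chat/completions", json=payload)
```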
## Export Details

### LLM Decoder
- Source: Qwen/Qwen3-VL-2B-Instruct
- Quantization: W8A8 (8-bit weights, 8-bit activations)
- Tool: rkllm-toolkit v1.2.3
- Context: 4096 tokens
### Vision Encoders
- Source: Qwen3-VL-2B-Instruct visual encoder weights
- Export pipeline: HuggingFace model → ONNX (`export_vision.py`) → RKNN (`export_vision_rknn.py`)
- Tool: rknn-toolkit2 v2.3.2
- Precision: FP32 (no quantization — vision encoder quality is critical)
- Target: rk3588
The 448 encoder was converted with default settings from rknn-llm. The 672 and 896 encoders were re-exported with custom `--height` and `--width` flags passed to `export_vision.py` and `export_vision_rknn.py` from the rknn-llm multimodal demo.
### Re-exporting at a Custom Resolution
To export the vision encoder at a different resolution (must be divisible by 32):
```bash
# Activate the export environment
source ~/rkllm-env/bin/activate
cd ~/rknn-llm/examples/multimodal_model_demo

# Step 1: Export HuggingFace model to ONNX
python3 export/export_vision.py \
    --path ~/models-hf/Qwen3-VL-2B-Instruct \
    --model_name qwen3-vl \
    --height 672 --width 672 \
    --device cpu

# Step 2: Convert ONNX to RKNN
python3 export/export_vision_rknn.py \
    --path ./onnx/qwen3-vl_vision.onnx \
    --model_name qwen3-vl \
    --target-platform rk3588 \
    --height 672 --width 672
```
Memory requirements: ~20 GB of RAM (or swap) for 672×672, ~35 GB for 896×896. CPU-only export works fine (no GPU needed).
Dependencies (in a Python 3.10 venv):

```
rknn-toolkit2 >= 2.3.2
torch == 2.4.0
transformers >= 4.57.0
onnx >= 1.18.0
```
## Performance Benchmarks
Tested on Orange Pi 5 Plus (16GB RAM), RK3588 SoC, RKNPU driver 0.9.8:
| Metric | 448×448 | 672×672 | 896×896 |
|---|---|---|---|
| Vision encode time | ~2 s | ~4 s | ~12 s |
| Total VL response | 5β10 s | 9β11 s | 25β28 s |
| Text-only decode | ~15 tok/s | ~15 tok/s | ~15 tok/s |
| Peak RAM (VL inference) | ~5.5 GB | ~6.5 GB | ~8.5 GB |
| RKNN file size | 812 MB | 854 MB | 923 MB |
## Known Limitations
- OCR accuracy: The 2B-parameter LLM is the bottleneck for OCR tasks, not the vision encoder resolution. Higher resolution helps with fine detail, but the model may still misread characters.
- Fixed resolution: Each `.rknn` file is compiled for a specific input resolution. Images are automatically resized (with aspect-ratio-preserving padding) to match. There is no dynamic resolution switching within a single model file.
- REGTASK warnings: The 672 and 896 encoders produce "bit width exceeds limit" register-field warnings during RKNN conversion. These are cosmetic in rknn-toolkit2 v2.3.2 and do not affect runtime inference on the RK3588.
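The aspect-ratio-preserving resize mentioned above can be sketched as pure arithmetic. This is a hypothetical illustration of the geometry, not the server's actual preprocessing code:

```python
def letterbox_geometry(src_w: int, src_h: int, target: int):
    """Compute the scaled size and symmetric padding needed to fit an
    image into a target x target square while preserving aspect ratio."""
    scale = target / max(src_w, src_h)          # shrink (or grow) to fit
    new_w, new_h = round(src_w * scale), round(src_h * scale)
    pad_x, pad_y = target - new_w, target - new_h
    # Split the padding evenly between the two sides.
    return (new_w, new_h), (pad_x // 2, pad_y // 2)

# A 1280x720 frame fed to the 672x672 encoder:
size, pad = letterbox_geometry(1280, 720, 672)
print(size, pad)  # (672, 378) (0, 147) -> 147 px of padding top and bottom
```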
## License
Apache 2.0, inherited from Qwen3-VL-2B-Instruct.
## Credits
- Model: Qwen Team for Qwen3-VL-2B-Instruct
- Runtime: Rockchip / airockchip for rkllm-toolkit and rknn-toolkit2
- API Server: RKLLM API Server — OpenAI-compatible server for RK3588 NPU