---
language:
- en
- zh
license: apache-2.0
library_name: rkllm
tags:
- rkllm
- rknn
- rk3588
- npu
- qwen3-vl
- vision-language
- orange-pi
- edge-ai
- ocr
base_model: Qwen/Qwen3-VL-2B-Instruct
pipeline_tag: image-text-to-text
---
# Qwen3-VL-2B-Instruct for RKLLM v1.2.3 (RK3588 NPU)
Pre-converted [Qwen3-VL-2B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-2B-Instruct) for the **Rockchip RK3588 NPU** using [rknn-llm](https://github.com/airockchip/rknn-llm) runtime v1.2.3.
Runs on **Orange Pi 5 Plus**, **Rock 5B**, **Radxa NX5**, and other RK3588-based SBCs with 8GB+ RAM.
## Files
| File | Size | Description |
|---|---|---|
| `qwen3-vl-2b-instruct_w8a8_rk3588.rkllm` | 2.3 GB | LLM decoder (W8A8 quantized), shared by all vision resolutions |
| `qwen3-vl-2b_vision_448_rk3588.rknn` | 812 MB | Vision encoder @ 448×448 (default, 196 tokens) |
| `qwen3-vl-2b_vision_672_rk3588.rknn` | 854 MB | Vision encoder @ 672×672 (441 tokens), **recommended** |
| `qwen3-vl-2b_vision_896_rk3588.rknn` | 923 MB | Vision encoder @ 896×896 (784 tokens) |
## Choosing a Vision Encoder Resolution
The LLM decoder (`.rkllm`) is resolution-independent; only the vision encoder (`.rknn`) changes. Place **one** `.rknn` file alongside the `.rkllm` in your model directory, or rename alternatives to `.rknn.alt` to disable them.
| Resolution | Visual Tokens | Encoder Time* | Total Response* | Best For |
|---|---|---|---|---|
| **448×448** | 196 (14×14) | ~2 s | ~5–10 s | General scene description, fast responses |
| **672×672** (recommended) | 441 (21×21) | ~4 s | ~9–11 s | **Balanced: good detail at reasonable speed** |
| **896×896** | 784 (28×28) | ~12 s | ~25–28 s | Maximum detail, fine text/OCR tasks |
\*Measured on an Orange Pi 5 Plus (16GB) with a single 14 MB JPEG input.
### Resolution Math
Qwen3-VL uses `patch_size=16` and `merge_size=2`, so:
- Resolution must be **divisible by 32** (16 × 2)
- Visual tokens = (height/32)² = 196 / 441 / 784 for 448 / 672 / 896
Higher resolution = more visual tokens = better fine detail but:
- Proportionally more NPU compute for the vision encoder
- More tokens for the LLM to process (longer prefill)
- Same decode speed (~15 tok/s); only time to first token increases
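The token math above can be sketched as a small helper (my own illustration; the `patch_size`/`merge_size` defaults come from the Qwen3-VL configuration described above):

```python
def visual_tokens(height: int, width: int, patch_size: int = 16, merge_size: int = 2) -> int:
    """Number of visual tokens Qwen3-VL produces for a given input resolution."""
    stride = patch_size * merge_size  # each token covers a 32x32 pixel area
    if height % stride or width % stride:
        raise ValueError(f"resolution must be divisible by {stride}")
    return (height // stride) * (width // stride)

for res in (448, 672, 896):
    print(res, visual_tokens(res, res))  # 448 196, 672 441, 896 784
```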
## Quick Start
### Directory Structure
```
~/models/Qwen3-VL-2b/
qwen3-vl-2b-instruct_w8a8_rk3588.rkllm # LLM decoder (always needed)
qwen3-vl-2b_vision_672_rk3588.rknn # Active vision encoder
qwen3-vl-2b_vision_448_rk3588.rknn.alt # Alternative (inactive)
qwen3-vl-2b_vision_896_rk3588.rknn.alt # Alternative (inactive)
```
### Switching Resolution
To switch to a different resolution, rename the files:
```bash
cd ~/models/Qwen3-VL-2b/
# Deactivate current encoder
mv qwen3-vl-2b_vision_672_rk3588.rknn qwen3-vl-2b_vision_672_rk3588.rknn.alt
# Activate the 896 encoder
mv qwen3-vl-2b_vision_896_rk3588.rknn.alt qwen3-vl-2b_vision_896_rk3588.rknn
# Restart your API server
sudo systemctl restart rkllm-api
```
### Using with RKLLM API Server
This model is designed for use with the [RKLLM API Server](https://github.com/jdacostap/rkllm-api), which provides an OpenAI-compatible API for RK3588 NPU inference. The server auto-detects the vision encoder resolution from the `.rknn` file's input tensor attributes.
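Since the server speaks the OpenAI chat-completions format, a vision request can be assembled with nothing but the standard library. This is a minimal sketch; the `http://localhost:8080` URL and `qwen3-vl-2b` model name are assumptions, so match them to your server's configuration:

```python
import base64
import json
import urllib.request

def build_vl_request(image_path: str, prompt: str, model: str = "qwen3-vl-2b") -> dict:
    """Build an OpenAI-style chat payload with a base64-encoded image."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("ascii")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
                {"type": "text", "text": prompt},
            ],
        }],
    }

def send(payload: dict, url: str = "http://localhost:8080/v1/chat/completions") -> dict:
    """POST the payload to the server and return the parsed JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Expect the first response after a restart to be slower, since the vision encoder and decoder are loaded on first use.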
## Export Details
### LLM Decoder
- **Source**: [Qwen/Qwen3-VL-2B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-2B-Instruct)
- **Quantization**: W8A8 (8-bit weights, 8-bit activations)
- **Tool**: rkllm-toolkit v1.2.3
- **Context**: 4096 tokens
### Vision Encoders
- **Source**: Qwen3-VL-2B-Instruct visual encoder weights
- **Export pipeline**: HuggingFace model → ONNX (`export_vision.py`) → RKNN (`export_vision_rknn.py`)
- **Tool**: rknn-toolkit2 v2.3.2
- **Precision**: FP32 (no quantization; vision encoder quality is critical)
- **Target**: rk3588
The 448 encoder was converted with default settings from rknn-llm. The 672 and 896 encoders were re-exported with custom `--height` and `--width` flags to `export_vision.py` and `export_vision_rknn.py` from the [rknn-llm multimodal demo](https://github.com/airockchip/rknn-llm/tree/main/examples/multimodal_model_demo/export).
### Re-exporting at a Custom Resolution
To export the vision encoder at a different resolution (must be divisible by 32):
```bash
# Activate the export environment
source ~/rkllm-env/bin/activate
cd ~/rknn-llm/examples/multimodal_model_demo
# Step 1: Export HuggingFace model to ONNX
python3 export/export_vision.py \
--path ~/models-hf/Qwen3-VL-2B-Instruct \
--model_name qwen3-vl \
--height 672 --width 672 \
--device cpu
# Step 2: Convert ONNX to RKNN
python3 export/export_vision_rknn.py \
--path ./onnx/qwen3-vl_vision.onnx \
--model_name qwen3-vl \
--target-platform rk3588 \
--height 672 --width 672
```
**Memory requirements**: ~20 GB RAM (or swap) for 672×672, ~35 GB for 896×896. CPU-only export works fine (no GPU needed).
**Dependencies** (in a Python 3.10 venv):
- `rknn-toolkit2 >= 2.3.2`
- `torch == 2.4.0`
- `transformers >= 4.57.0`
- `onnx >= 1.18.0`
## Performance Benchmarks
Tested on **Orange Pi 5 Plus (16GB RAM)**, RK3588 SoC, RKNPU driver 0.9.8:
| Metric | 448×448 | 672×672 | 896×896 |
|---|---|---|---|
| Vision encode time | ~2 s | ~4 s | ~12 s |
| Total VL response | 5–10 s | 9–11 s | 25–28 s |
| Text-only decode | ~15 tok/s | ~15 tok/s | ~15 tok/s |
| Peak RAM (VL inference) | ~5.5 GB | ~6.5 GB | ~8.5 GB |
| RKNN file size | 812 MB | 854 MB | 923 MB |
## Known Limitations
- **OCR accuracy**: The 2B-parameter LLM is the bottleneck for OCR tasks, not the vision encoder resolution. Higher resolution helps with fine detail but the model may still misread characters.
- **Fixed resolution**: Each `.rknn` file is compiled for a specific input resolution. Images are automatically resized (with aspect-ratio-preserving padding) to match. There is no dynamic resolution switching within a single model file.
- **REGTASK warnings**: The 672 and 896 encoders produce "bit width exceeds limit" register-field warnings during RKNN conversion. These are cosmetic in rknn-toolkit2 v2.3.2 and do not affect runtime inference on the RK3588.
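The fixed-resolution constraint means every input image is scaled to fit the compiled resolution and the remainder is padded. A sketch of that letterbox geometry (my own illustration of the idea, not the server's actual preprocessing code):

```python
def letterbox_geometry(img_w: int, img_h: int, target: int = 672):
    """Aspect-ratio-preserving fit into a square target.

    Scales the image so it fits inside target x target, then centers it,
    leaving padding on the shorter axis.
    Returns (scaled_w, scaled_h, pad_left, pad_top).
    """
    scale = min(target / img_w, target / img_h)
    scaled_w, scaled_h = round(img_w * scale), round(img_h * scale)
    return scaled_w, scaled_h, (target - scaled_w) // 2, (target - scaled_h) // 2

# A 4032x3024 photo fed to the 672 encoder is scaled to 672x504,
# with 84 px of padding above and below.
print(letterbox_geometry(4032, 3024, 672))  # (672, 504, 0, 84)
```

Because of this padding, very wide or very tall images effectively use fewer pixels of the encoder's input than square images do.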
## License
Apache 2.0, inherited from [Qwen3-VL-2B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-2B-Instruct).
## Credits
- **Model**: [Qwen Team](https://huggingface.co/Qwen) for Qwen3-VL-2B-Instruct
- **Runtime**: [Rockchip / airockchip](https://github.com/airockchip/rknn-llm) for rkllm-toolkit and rknn-toolkit2
- **API Server**: [RKLLM API Server](https://github.com/jdacostap/rkllm-api), an OpenAI-compatible server for RK3588 NPU