---

language:
  - en
  - zh
license: apache-2.0
library_name: rkllm
tags:
  - rkllm
  - rknn
  - rk3588
  - npu
  - qwen3-vl
  - vision-language
  - orange-pi
  - edge-ai
  - ocr
base_model: Qwen/Qwen3-VL-2B-Instruct
pipeline_tag: image-text-to-text
---


# Qwen3-VL-2B-Instruct for RKLLM v1.2.3 (RK3588 NPU)

Pre-converted [Qwen3-VL-2B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-2B-Instruct) for the **Rockchip RK3588 NPU** using [rknn-llm](https://github.com/airockchip/rknn-llm) runtime v1.2.3.

Runs on **Orange Pi 5 Plus**, **Rock 5B**, **Radxa NX5**, and other RK3588-based SBCs with 8GB+ RAM.

## Files

| File | Size | Description |
|---|---|---|
| `qwen3-vl-2b-instruct_w8a8_rk3588.rkllm` | 2.3 GB | LLM decoder (W8A8 quantized), shared by all vision resolutions |
| `qwen3-vl-2b_vision_448_rk3588.rknn` | 812 MB | Vision encoder @ 448×448 (default, 196 tokens) |
| `qwen3-vl-2b_vision_672_rk3588.rknn` | 854 MB | Vision encoder @ 672×672 (441 tokens) ⭐ **Recommended** |
| `qwen3-vl-2b_vision_896_rk3588.rknn` | 923 MB | Vision encoder @ 896×896 (784 tokens) |

## Choosing a Vision Encoder Resolution

The LLM decoder (`.rkllm`) is resolution-independent; only the vision encoder (`.rknn`) changes. Place **one** `.rknn` file alongside the `.rkllm` in your model directory, or rename alternatives to `.rknn.alt` to disable them.

| Resolution | Visual Tokens | Encoder Time* | Total Response* | Best For |
|---|---|---|---|---|
| **448×448** | 196 (14×14) | ~2 s | ~5–10 s | General scene description, fast responses |
| **672×672** ⭐ | 441 (21×21) | ~4 s | ~9–11 s | **Balanced: good detail + reasonable speed** |
| **896×896** | 784 (28×28) | ~12 s | ~25–28 s | Maximum detail, fine text/OCR tasks |

\*Measured on Orange Pi 5 Plus (16GB) with 14MB JPEG input, single image.



### Resolution Math



Qwen3-VL uses `patch_size=16` and `merge_size=2`, so:

- Resolution must be **divisible by 32** (16 × 2)
- Visual tokens = (height/32)² = 196 / 441 / 784 for 448 / 672 / 896

Higher resolution means more visual tokens and better fine detail, but also:

- Proportionally more NPU compute for the vision encoder
- More tokens for the LLM to process (longer prefill)
- The same decode speed (~15 tok/s); only time to first token increases

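The token arithmetic above can be sketched in a few lines of Python (illustrative only; the constants mirror the model config):

```python
# Token-count math for Qwen3-VL vision encoders.
# patch_size=16 and merge_size=2 come from the model config; each merged
# patch therefore covers a 32x32 pixel tile.
def visual_tokens(height: int, width: int, patch: int = 16, merge: int = 2) -> int:
    stride = patch * merge  # 32
    if height % stride or width % stride:
        raise ValueError(f"resolution must be divisible by {stride}")
    return (height // stride) * (width // stride)

for side in (448, 672, 896):
    print(f"{side}x{side} -> {visual_tokens(side, side)} tokens")
# 448x448 -> 196, 672x672 -> 441, 896x896 -> 784
```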


## Quick Start



### Directory Structure



```
~/models/Qwen3-VL-2b/
    qwen3-vl-2b-instruct_w8a8_rk3588.rkllm    # LLM decoder (always needed)
    qwen3-vl-2b_vision_672_rk3588.rknn        # Active vision encoder
    qwen3-vl-2b_vision_448_rk3588.rknn.alt    # Alternative (inactive)
    qwen3-vl-2b_vision_896_rk3588.rknn.alt    # Alternative (inactive)
```



### Switching Resolution



To switch to a different resolution, rename the files:



```bash
cd ~/models/Qwen3-VL-2b/

# Deactivate current encoder
mv qwen3-vl-2b_vision_672_rk3588.rknn qwen3-vl-2b_vision_672_rk3588.rknn.alt

# Activate the 896 encoder
mv qwen3-vl-2b_vision_896_rk3588.rknn.alt qwen3-vl-2b_vision_896_rk3588.rknn

# Restart your API server
sudo systemctl restart rkllm-api
```



### Using with RKLLM API Server



This model is designed for use with the [RKLLM API Server](https://github.com/jdacostap/rkllm-api), which provides an OpenAI-compatible API for RK3588 NPU inference. The server auto-detects the vision encoder resolution from the `.rknn` file's input tensor attributes.

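Since the server speaks the OpenAI chat-completions dialect, a request with an inline image looks roughly like this sketch. The host, port, route, and model name below are assumptions; check your rkllm-api configuration for the real values:

```python
# Hypothetical client sketch for the server's OpenAI-compatible route.
import base64
import json
import urllib.request

API_URL = "http://localhost:8080/v1/chat/completions"  # assumed default

def build_vision_request(prompt: str, image_bytes: bytes,
                         model: str = "qwen3-vl-2b") -> dict:
    """Build an OpenAI-style chat payload with one inline base64 image."""
    img_b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{img_b64}"}},
            ],
        }],
    }

def ask(prompt: str, image_path: str) -> str:
    """POST the payload and return the assistant's reply text."""
    with open(image_path, "rb") as f:
        payload = build_vision_request(prompt, f.read())
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]

# Example: print(ask("Describe this image.", "photo.jpg"))
```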


## Export Details



### LLM Decoder

- **Source**: [Qwen/Qwen3-VL-2B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-2B-Instruct)
- **Quantization**: W8A8 (8-bit weights, 8-bit activations)
- **Tool**: rkllm-toolkit v1.2.3
- **Context**: 4096 tokens



### Vision Encoders

- **Source**: Qwen3-VL-2B-Instruct visual encoder weights
- **Export pipeline**: HuggingFace model → ONNX (`export_vision.py`) → RKNN (`export_vision_rknn.py`)
- **Tool**: rknn-toolkit2 v2.3.2
- **Precision**: FP32 (no quantization; vision encoder quality is critical)
- **Target**: rk3588

The 448 encoder was converted with the default settings from rknn-llm. The 672 and 896 encoders were re-exported by passing custom `--height` and `--width` flags to `export_vision.py` and `export_vision_rknn.py` from the [rknn-llm multimodal demo](https://github.com/airockchip/rknn-llm/tree/main/examples/multimodal_model_demo/export).



### Re-exporting at a Custom Resolution



To export the vision encoder at a different resolution (must be divisible by 32):



```bash
# Activate the export environment
source ~/rkllm-env/bin/activate
cd ~/rknn-llm/examples/multimodal_model_demo

# Step 1: Export HuggingFace model to ONNX
python3 export/export_vision.py \
  --path ~/models-hf/Qwen3-VL-2B-Instruct \
  --model_name qwen3-vl \
  --height 672 --width 672 \
  --device cpu

# Step 2: Convert ONNX to RKNN
python3 export/export_vision_rknn.py \
  --path ./onnx/qwen3-vl_vision.onnx \
  --model_name qwen3-vl \
  --target-platform rk3588 \
  --height 672 --width 672
```



**Memory requirements**: ~20 GB RAM (or swap) for 672×672, ~35 GB for 896×896. CPU-only export works fine (no GPU needed).



**Dependencies** (in a Python 3.10 venv):

- `rknn-toolkit2 >= 2.3.2`
- `torch == 2.4.0`
- `transformers >= 4.57.0`
- `onnx >= 1.18.0`



## Performance Benchmarks



Tested on **Orange Pi 5 Plus (16GB RAM)**, RK3588 SoC, RKNPU driver 0.9.8:



| Metric | 448×448 | 672×672 | 896×896 |
|---|---|---|---|
| Vision encode time | ~2 s | ~4 s | ~12 s |
| Total VL response | 5–10 s | 9–11 s | 25–28 s |
| Text-only decode | ~15 tok/s | ~15 tok/s | ~15 tok/s |
| Peak RAM (VL inference) | ~5.5 GB | ~6.5 GB | ~8.5 GB |
| RKNN file size | 812 MB | 854 MB | 923 MB |



## Known Limitations



- **OCR accuracy**: The 2B-parameter LLM is the bottleneck for OCR tasks, not the vision encoder resolution. Higher resolution helps with fine detail, but the model may still misread characters.
- **Fixed resolution**: Each `.rknn` file is compiled for a specific input resolution. Images are automatically resized (with aspect-ratio-preserving padding) to match. There is no dynamic resolution switching within a single model file.
- **REGTASK warnings**: The 672 and 896 encoders produce "bit width exceeds limit" register-field warnings during RKNN conversion. These are cosmetic in rknn-toolkit2 v2.3.2 and do not affect runtime inference on the RK3588.

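The aspect-preserving "letterbox" resize mentioned above works roughly like the following geometry sketch (the server's actual preprocessing, interpolation, and pad colour may differ; this only illustrates why the image is never distorted):

```python
# Geometry of an aspect-ratio-preserving letterbox fit onto a square
# encoder input. Illustrative only.
def letterbox_geometry(width: int, height: int, size: int = 672):
    """Return ((new_w, new_h), (pad_x, pad_y)) for fitting a width x height
    image onto a size x size canvas without distorting the aspect ratio."""
    scale = size / max(width, height)
    new_w, new_h = round(width * scale), round(height * scale)
    pad_x, pad_y = (size - new_w) // 2, (size - new_h) // 2
    return (new_w, new_h), (pad_x, pad_y)

print(letterbox_geometry(1920, 1080))  # landscape photo -> ((672, 378), (0, 147))
```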


## License



Apache 2.0, inherited from [Qwen3-VL-2B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-2B-Instruct).



## Credits



- **Model**: [Qwen Team](https://huggingface.co/Qwen) for Qwen3-VL-2B-Instruct
- **Runtime**: [Rockchip / airockchip](https://github.com/airockchip/rknn-llm) for rkllm-toolkit and rknn-toolkit2
- **API Server**: [RKLLM API Server](https://github.com/jdacostap/rkllm-api), an OpenAI-compatible server for the RK3588 NPU