GatekeeperZA committed on
Commit 5c93381 · verified · 1 Parent(s): 2067bb2

Add RKLLM v1.2.3 model files: LLM decoder (W8A8) + vision encoders at 448/672/896

.gitattributes CHANGED
@@ -33,3 +33,7 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ qwen3-vl-2b-instruct_w8a8_rk3588.rkllm filter=lfs diff=lfs merge=lfs -text
+ qwen3-vl-2b_vision_448_rk3588.rknn filter=lfs diff=lfs merge=lfs -text
+ qwen3-vl-2b_vision_672_rk3588.rknn filter=lfs diff=lfs merge=lfs -text
+ qwen3-vl-2b_vision_896_rk3588.rknn filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,169 @@
+ ---
+ language:
+ - en
+ - zh
+ license: apache-2.0
+ library_name: rkllm
+ tags:
+ - rkllm
+ - rknn
+ - rk3588
+ - npu
+ - qwen3-vl
+ - vision-language
+ - orange-pi
+ - edge-ai
+ - ocr
+ base_model: Qwen/Qwen3-VL-2B-Instruct
+ pipeline_tag: image-text-to-text
+ ---
+
+ # Qwen3-VL-2B-Instruct for RKLLM v1.2.3 (RK3588 NPU)
+
+ Pre-converted [Qwen3-VL-2B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-2B-Instruct) for the **Rockchip RK3588 NPU** using the [rknn-llm](https://github.com/airockchip/rknn-llm) runtime v1.2.3.
+
+ Runs on the **Orange Pi 5 Plus**, **Rock 5B**, **Radxa NX5**, and other RK3588-based SBCs with 8 GB+ RAM.
+
+ ## Files
+
+ | File | Size | Description |
+ |---|---|---|
+ | `qwen3-vl-2b-instruct_w8a8_rk3588.rkllm` | 2.3 GB | LLM decoder (W8A8 quantized) — shared by all vision resolutions |
+ | `qwen3-vl-2b_vision_448_rk3588.rknn` | 812 MB | Vision encoder @ 448×448 (default, 196 tokens) |
+ | `qwen3-vl-2b_vision_672_rk3588.rknn` | 854 MB | Vision encoder @ 672×672 (441 tokens) ⭐ **Recommended** |
+ | `qwen3-vl-2b_vision_896_rk3588.rknn` | 923 MB | Vision encoder @ 896×896 (784 tokens) |
+
+ ## Choosing a Vision Encoder Resolution
+
+ The LLM decoder (`.rkllm`) is resolution-independent — only the vision encoder (`.rknn`) changes. Place **one** `.rknn` file alongside the `.rkllm` in your model directory, or rename alternatives to `.rknn.alt` to disable them.
+
+ | Resolution | Visual Tokens | Encoder Time* | Total Response* | Best For |
+ |---|---|---|---|---|
+ | **448×448** | 196 (14×14) | ~2 s | ~5–10 s | General scene description, fast responses |
+ | **672×672** ⭐ | 441 (21×21) | ~4 s | ~9–11 s | **Balanced: good detail + reasonable speed** |
+ | **896×896** | 784 (28×28) | ~12 s | ~25–28 s | Maximum detail, fine text/OCR tasks |
+
+ \*Measured on an Orange Pi 5 Plus (16 GB) with a 14 MB JPEG input, single image.
+
+ ### Resolution Math
+
+ Qwen3-VL uses `patch_size=16` and `merge_size=2`, so:
+ - Resolution must be **divisible by 32** (16 × 2)
+ - Visual tokens = (height/32)² = 196 / 441 / 784 for 448 / 672 / 896
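The token arithmetic above can be sketched as a small helper (a hypothetical function, not part of the toolkit; the 32-pixel stride follows from `patch_size × merge_size`):

```python
def visual_tokens(height, width, patch_size=16, merge_size=2):
    """Number of visual tokens a Qwen3-VL-style encoder emits for a
    given input resolution (one token per merged patch)."""
    stride = patch_size * merge_size  # 32: resolution must divide evenly
    if height % stride or width % stride:
        raise ValueError(f"resolution must be divisible by {stride}")
    return (height // stride) * (width // stride)

# The three shipped encoders:
for r in (448, 672, 896):
    print(r, visual_tokens(r, r))  # 196, 441, 784
```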
+
+ Higher resolution = more visual tokens = better fine detail, but:
+ - Proportionally more NPU compute for the vision encoder
+ - More tokens for the LLM to process (longer prefill)
+ - Same decode speed (~15 tok/s) — only "time to first token" increases
+
+ ## Quick Start
+
+ ### Directory Structure
+
+ ```
+ ~/models/Qwen3-VL-2b/
+   qwen3-vl-2b-instruct_w8a8_rk3588.rkllm    # LLM decoder (always needed)
+   qwen3-vl-2b_vision_672_rk3588.rknn        # Active vision encoder
+   qwen3-vl-2b_vision_448_rk3588.rknn.alt    # Alternative (inactive)
+   qwen3-vl-2b_vision_896_rk3588.rknn.alt    # Alternative (inactive)
+ ```
+
+ ### Switching Resolution
+
+ To switch to a different resolution, rename the files:
+
+ ```bash
+ cd ~/models/Qwen3-VL-2b/
+
+ # Deactivate the current encoder
+ mv qwen3-vl-2b_vision_672_rk3588.rknn qwen3-vl-2b_vision_672_rk3588.rknn.alt
+
+ # Activate the 896 encoder
+ mv qwen3-vl-2b_vision_896_rk3588.rknn.alt qwen3-vl-2b_vision_896_rk3588.rknn
+
+ # Restart your API server
+ sudo systemctl restart rkllm-api
+ ```
+
+ ### Using with the RKLLM API Server
+
+ This model is designed for use with the [RKLLM API Server](https://github.com/jdacostap/rkllm-api), which provides an OpenAI-compatible API for RK3588 NPU inference. The server auto-detects the vision encoder resolution from the `.rknn` file's input tensor attributes.
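The server reads the resolution from the RKNN tensor attributes; as a rough stand-in that needs no NPU runtime, a sketch that infers the active encoder's resolution from this repo's file-naming convention (`_vision_<res>_`). `active_encoder_resolution` is a hypothetical helper, not part of the server:

```python
import re
from pathlib import Path

def active_encoder_resolution(model_dir):
    """Locate the single active .rknn encoder in a model directory and
    parse its input resolution from the _vision_<res>_ naming scheme.
    Files renamed to .rknn.alt are ignored, per the convention above."""
    active = [p for p in Path(model_dir).iterdir() if p.suffix == ".rknn"]
    if len(active) != 1:
        raise RuntimeError(f"expected exactly one active .rknn, found {len(active)}")
    m = re.search(r"_vision_(\d+)_", active[0].name)
    if not m:
        raise ValueError(f"cannot parse a resolution from {active[0].name}")
    return int(m.group(1))
```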
+
+ ## Export Details
+
+ ### LLM Decoder
+
+ - **Source**: [Qwen/Qwen3-VL-2B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-2B-Instruct)
+ - **Quantization**: W8A8 (8-bit weights, 8-bit activations)
+ - **Tool**: rkllm-toolkit v1.2.3
+ - **Context**: 4096 tokens
+
+ ### Vision Encoders
+
+ - **Source**: Qwen3-VL-2B-Instruct visual encoder weights
+ - **Export pipeline**: HuggingFace model → ONNX (`export_vision.py`) → RKNN (`export_vision_rknn.py`)
+ - **Tool**: rknn-toolkit2 v2.3.2
+ - **Precision**: FP32 (no quantization — vision encoder quality is critical)
+ - **Target**: rk3588
+
+ The 448 encoder was converted with the default settings from rknn-llm. The 672 and 896 encoders were re-exported with custom `--height` and `--width` flags to `export_vision.py` and `export_vision_rknn.py` from the [rknn-llm multimodal demo](https://github.com/airockchip/rknn-llm/tree/main/examples/multimodal_model_demo/export).
+
+ ### Re-exporting at a Custom Resolution
+
+ To export the vision encoder at a different resolution (must be divisible by 32):
+
+ ```bash
+ # Activate the export environment
+ source ~/rkllm-env/bin/activate
+ cd ~/rknn-llm/examples/multimodal_model_demo
+
+ # Step 1: Export the HuggingFace model to ONNX
+ python3 export/export_vision.py \
+     --path ~/models-hf/Qwen3-VL-2B-Instruct \
+     --model_name qwen3-vl \
+     --height 672 --width 672 \
+     --device cpu
+
+ # Step 2: Convert ONNX to RKNN
+ python3 export/export_vision_rknn.py \
+     --path ./onnx/qwen3-vl_vision.onnx \
+     --model_name qwen3-vl \
+     --target-platform rk3588 \
+     --height 672 --width 672
+ ```
+
+ **Memory requirements**: ~20 GB RAM (or swap) for 672×672, ~35 GB for 896×896. CPU-only export works fine (no GPU needed).
+
+ **Dependencies** (in a Python 3.10 venv):
+ - `rknn-toolkit2 >= 2.3.2`
+ - `torch == 2.4.0`
+ - `transformers >= 4.57.0`
+ - `onnx >= 1.18.0`
+
+ ## Performance Benchmarks
+
+ Tested on an **Orange Pi 5 Plus (16 GB RAM)**, RK3588 SoC, RKNPU driver 0.9.8:
+
+ | Metric | 448×448 | 672×672 | 896×896 |
+ |---|---|---|---|
+ | Vision encode time | ~2 s | ~4 s | ~12 s |
+ | Total VL response | 5–10 s | 9–11 s | 25–28 s |
+ | Text-only decode | ~15 tok/s | ~15 tok/s | ~15 tok/s |
+ | Peak RAM (VL inference) | ~5.5 GB | ~6.5 GB | ~8.5 GB |
+ | RKNN file size | 812 MB | 854 MB | 923 MB |
+
+ ## Known Limitations
+
+ - **OCR accuracy**: The 2B-parameter LLM is the bottleneck for OCR tasks, not the vision encoder resolution. Higher resolution helps with fine detail, but the model may still misread characters.
+ - **Fixed resolution**: Each `.rknn` file is compiled for a specific input resolution. Images are automatically resized (with aspect-ratio-preserving padding) to match. There is no dynamic resolution switching within a single model file.
+ - **REGTASK warnings**: The 672 and 896 encoders produce "bit width exceeds limit" register-field warnings during RKNN conversion. These are cosmetic in rknn-toolkit2 v2.3.2 and do not affect runtime inference on the RK3588.
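The aspect-ratio-preserving resize mentioned above amounts to letterbox geometry; a minimal sketch of that computation (`letterbox_geometry` is a hypothetical helper for illustration, and the runtime's actual preprocessing may differ in rounding or padding placement):

```python
def letterbox_geometry(width, height, target):
    """Aspect-ratio-preserving fit into a square target canvas:
    returns the scaled (w, h) and the top-left padding offset."""
    scale = target / max(width, height)
    new_w, new_h = round(width * scale), round(height * scale)
    # Center the scaled image; the remainder becomes padding.
    return (new_w, new_h), ((target - new_w) // 2, (target - new_h) // 2)

# e.g. fitting a 4:3 photo into the 672 encoder's input:
print(letterbox_geometry(800, 600, 672))  # ((672, 504), (0, 84))
```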
+
+ ## License
+
+ Apache 2.0, inherited from [Qwen3-VL-2B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-2B-Instruct).
+
+ ## Credits
+
+ - **Model**: [Qwen Team](https://huggingface.co/Qwen) for Qwen3-VL-2B-Instruct
+ - **Runtime**: [Rockchip / airockchip](https://github.com/airockchip/rknn-llm) for rkllm-toolkit and rknn-toolkit2
+ - **API Server**: [RKLLM API Server](https://github.com/jdacostap/rkllm-api) — OpenAI-compatible server for the RK3588 NPU
qwen3-vl-2b-instruct_w8a8_rk3588.rkllm ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:d5474340221fc495c70e1ec2c7dafc4ebf88292ce466db7e771e3a20b99cf21f
+ size 2375022956
qwen3-vl-2b_vision_448_rk3588.rknn ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:3d707ef5dbf0e420ac48e57b5bf0ed6c0fd1d5d048c29d81e1b5a8d051ab7ea8
+ size 850488413
qwen3-vl-2b_vision_672_rk3588.rknn ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f4e6fb4baeb27fa4e2b88b311716b166cc000c10ec218c91e70a4bdb1db3dfe9
+ size 894505821
qwen3-vl-2b_vision_896_rk3588.rknn ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:b395dee006d26ac8e7a97b2d9e154473a8c64b8b5594888957e67864d047cc01
+ size 967465181
upload.py ADDED
@@ -0,0 +1,22 @@
+ #!/usr/bin/env python3
+ """Upload model files to the HuggingFace repo."""
+ import os
+ from huggingface_hub import HfApi
+
+ api = HfApi()
+ repo_id = "GatekeeperZA/Qwen3-VL-2B-Instruct-RKLLM-v1.2.3"
+ upload_dir = os.path.expanduser("~/hf-upload")
+
+ print(f"Uploading all files from {upload_dir} to {repo_id}...")
+ files = os.listdir(upload_dir)
+ for f in sorted(files):
+     size_mb = os.path.getsize(os.path.join(upload_dir, f)) / 1024 / 1024
+     print(f"  {f} ({size_mb:.0f} MB)")
+
+ api.upload_folder(
+     folder_path=upload_dir,
+     repo_id=repo_id,
+     repo_type="model",
+     commit_message="Add RKLLM v1.2.3 model files: LLM decoder (W8A8) + vision encoders at 448/672/896",
+ )
+ print("Upload complete!")