---
license: apache-2.0
library_name: rkllm
base_model: Qwen/Qwen3-1.7B
tags:
- rkllm
- rk3588
- npu
- rockchip
- qwen3
- thinking
- reasoning
- quantized
- edge-ai
- orange-pi
model_name: Qwen3-1.7B-RKLLM-v1.2.3
pipeline_tag: text-generation
language:
- en
- zh
---
# Qwen3-1.7B — RKLLM v1.2.3 (w8a8, RK3588)
RKLLM conversion of [Qwen/Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B) for Rockchip RK3588 NPU inference.
Converted with **RKLLM Toolkit v1.2.3**, which includes full **thinking mode support** — the model produces `<think>…</think>` reasoning blocks when used with compatible runtimes.
## Key Details
| Property | Value |
|---|---|
| **Base Model** | [Qwen/Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B) |
| **Toolkit Version** | RKLLM Toolkit v1.2.3 |
| **Runtime Version** | RKLLM Runtime ≥ v1.2.1 (v1.2.3 recommended) |
| **Quantization** | w8a8 (8-bit weights, 8-bit activations) |
| **Quantization Algorithm** | normal |
| **Target Platform** | RK3588 |
| **NPU Cores** | 3 |
| **Max Context Length** | 4096 tokens |
| **Optimization Level** | 1 |
| **Thinking Mode** | ✅ Supported |
| **Languages** | English, Chinese (+ others inherited from Qwen3) |
## Why This Conversion?
Previous Qwen3-1.7B RKLLM conversions on HuggingFace were built with **Toolkit v1.2.0**, which predates thinking mode support (added in v1.2.1). The chat template baked into those `.rkllm` files does not include the `<think>` trigger, so the model never produces reasoning output.
This conversion uses **Toolkit v1.2.3**, which correctly embeds the thinking-enabled chat template into the model file.
## Thinking Mode
Qwen3-1.7B is a hybrid thinking model. When served through an OpenAI-compatible API that parses `<think>` tags, reasoning content appears separately from the final answer — enabling UIs like Open WebUI to show a collapsible "Thinking…" section.
Example raw output:
```
<think>
The user is asking about the capital of France. This is a straightforward geography question.
</think>
The capital of France is Paris.
```
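A serving layer has to separate the reasoning block from the final answer before returning a response. The helper below is not part of the RKLLM tooling, just a minimal sketch of that split using the standard library:

```python
import re

def split_thinking(raw: str):
    """Split raw model output into (reasoning, answer).

    Returns an empty reasoning string when no <think> block is present.
    """
    match = re.search(r"<think>(.*?)</think>", raw, flags=re.DOTALL)
    if not match:
        return "", raw.strip()
    reasoning = match.group(1).strip()
    answer = raw[match.end():].strip()
    return reasoning, answer

raw = "<think>\nStraightforward geography question.\n</think>\nThe capital of France is Paris."
reasoning, answer = split_thinking(raw)
```

A UI like Open WebUI can then render `reasoning` in a collapsible section and `answer` as the chat reply.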
## Hardware Tested
- **Orange Pi 5 Plus** — RK3588, 16GB RAM, Armbian Linux
- RKNPU driver 0.9.8
- RKLLM Runtime v1.2.3
## Important: Enabling Thinking Mode
The RKLLM runtime requires **two things** for thinking mode to work:
### 1. Set `enable_thinking = true` in the C++ demo
The stock `llm_demo.cpp` uses `memset(&rkllm_input, 0, ...)` which defaults `enable_thinking` to `false`. You **must** add one line:
```cpp
rkllm_input.input_type = RKLLM_INPUT_PROMPT;
rkllm_input.enable_thinking = true; // ← ADD THIS LINE
rkllm_input.role = "user";
rkllm_input.prompt_input = (char *)input_str.c_str();
```
If using the Python ctypes API (`flask_server.py` / `gradio_server.py`), set it on the `RKLLMInput` struct:
```python
rkllm_input.enable_thinking = ctypes.c_bool(True)
```
Without this, the runtime never triggers the thinking chat template and the model won't produce `<think>` tags.
### 2. Handle the `robot: ` output prefix
The compiled `llm_demo` binary outputs `robot: ` before the model's actual response text. If your server uses a timing-based guard to discard residual stdout data, the `<think>` tag may arrive fast enough to be incorrectly discarded along with the prefix. Make sure your output parser:
- Strips the `robot: ` prefix (in addition to any `LLM: ` prefix)
- Does **not** discard data containing `<think>` even if it arrives quickly after the prompt is sent
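The two rules above can be combined in a single filter. This is a hypothetical helper (the function name and the 0.2 s guard window are illustrative, not from the RKLLM sources):

```python
from typing import Optional

def clean_chunk(chunk: str, elapsed_s: float, guard_s: float = 0.2) -> Optional[str]:
    """Filter one chunk of llm_demo stdout.

    Strips known prefixes and applies a timing guard, but never drops a
    chunk that carries the <think> tag, which can arrive almost
    immediately after the prompt is sent.
    """
    for prefix in ("robot: ", "LLM: "):
        if chunk.startswith(prefix):
            chunk = chunk[len(prefix):]
    # Timing guard: discard residual early output -- unless it contains
    # the <think> tag.
    if elapsed_s < guard_s and "<think>" not in chunk:
        return None
    return chunk
```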
### Compiling natively on aarch64
If building directly on the board (not cross-compiling), ignore `build-linux.sh` and compile natively:
```bash
cd ~/rknn-llm/examples/rkllm_api_demo/deploy
g++ -O2 -o llm_demo src/llm_demo.cpp \
-I../../../rkllm-runtime/Linux/librkllm_api/include \
-L../../../rkllm-runtime/Linux/librkllm_api/aarch64 \
-lrkllmrt -lpthread
```
## Usage
### With the official RKLLM API demo
```bash
# Clone the runtime and demo sources
git clone https://github.com/airockchip/rknn-llm.git
cd rknn-llm/examples/rkllm_api_demo
# Build (natively with the g++ command above, or via build-linux.sh when
# cross-compiling), then run:
./llm_demo /path/to/Qwen3-1.7B-w8a8-rk3588.rkllm 2048 4096
```
### With a custom OpenAI-compatible server
Any server that launches the RKLLM binary and parses `<think>` tags from the output stream will work. The model responds to standard chat completion requests.
## Conversion Script
```python
from rkllm.api import RKLLM
model_path = "Qwen/Qwen3-1.7B" # or local path
output_path = "./Qwen3-1.7B-w8a8-rk3588.rkllm"
dataset_path = "./data_quant.json" # calibration data
# Load
llm = RKLLM()
llm.load_huggingface(model=model_path, model_lora=None, device="cpu")
# Build
llm.build(
do_quantization=True,
optimization_level=1,
quantized_dtype="w8a8",
quantized_algorithm="normal",
target_platform="rk3588",
num_npu_core=3,
extra_qparams=None,
dataset=dataset_path,
max_context=4096,
)
# Export
llm.export_rkllm(output_path)
```
Calibration dataset: 21 diverse prompt/completion pairs (English + Chinese) generated with `generate_data_quant.py` from the [rknn-llm examples](https://github.com/airockchip/rknn-llm/tree/main/examples/rkllm_api_demo/export).
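A rough sketch of building such a calibration file by hand is below. The `input`/`target` pair schema is an assumption based on what `generate_data_quant.py` emits; verify it against your toolkit version before converting:

```python
import json

# Assumed schema: a JSON list of {"input": ..., "target": ...} pairs
# (hypothetical examples, not the actual 21-sample calibration set).
samples = [
    {"input": "What is the capital of France?",
     "target": "The capital of France is Paris."},
    {"input": "用一句话介绍量子计算。",
     "target": "量子计算利用量子比特的叠加和纠缠进行并行计算。"},
]
with open("data_quant.json", "w", encoding="utf-8") as f:
    json.dump(samples, f, ensure_ascii=False, indent=2)
```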
## File Listing
| File | Description |
|---|---|
| `Qwen3-1.7B-w8a8-rk3588.rkllm` | Quantized model for RK3588 NPU |
## Compatibility Notes
- **Minimum runtime**: RKLLM Runtime v1.2.1 (for thinking mode). v1.2.3 recommended.
- **RKNPU driver**: ≥ 0.9.6
- **SoCs**: RK3588 / RK3588S (3 NPU cores). Not compatible with RK3576 (2 cores) without reconversion.
- **RAM**: the model occupies roughly 2GB when loaded. Runs comfortably on 8GB+ boards.
## Acknowledgements
- [Qwen Team](https://huggingface.co/Qwen) for the base model
- [Rockchip / airockchip](https://github.com/airockchip/rknn-llm) for the RKLLM toolkit and runtime
- Converted by [GatekeeperZA](https://huggingface.co/GatekeeperZA)