---
license: apache-2.0
library_name: rkllm
base_model: Qwen/Qwen3-1.7B
tags:
  - rkllm
  - rk3588
  - npu
  - rockchip
  - qwen3
  - thinking
  - reasoning
  - quantized
  - edge-ai
  - orange-pi
model_name: Qwen3-1.7B-RKLLM-v1.2.3
pipeline_tag: text-generation
language:
  - en
  - zh
---

# Qwen3-1.7B — RKLLM v1.2.3 (w8a8, RK3588)

RKLLM conversion of [Qwen/Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B) for Rockchip RK3588 NPU inference.

Converted with **RKLLM Toolkit v1.2.3**, which includes full **thinking mode support** — the model produces `<think>…</think>` reasoning blocks when used with compatible runtimes.

## Key Details

| Property | Value |
|---|---|
| **Base Model** | [Qwen/Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B) |
| **Toolkit Version** | RKLLM Toolkit v1.2.3 |
| **Runtime Version** | RKLLM Runtime ≥ v1.2.1 (v1.2.3 recommended) |
| **Quantization** | w8a8 (8-bit weights, 8-bit activations) |
| **Quantization Algorithm** | normal |
| **Target Platform** | RK3588 |
| **NPU Cores** | 3 |
| **Max Context Length** | 4096 tokens |
| **Optimization Level** | 1 |
| **Thinking Mode** | ✅ Supported |
| **Languages** | English, Chinese (+ others inherited from Qwen3) |

## Why This Conversion?

Previous Qwen3-1.7B RKLLM conversions on HuggingFace were built with **Toolkit v1.2.0**, which predates thinking mode support (added in v1.2.1). The chat template baked into those `.rkllm` files does not include the `<think>` trigger, so the model never produces reasoning output.

This conversion uses **Toolkit v1.2.3**, which correctly embeds the thinking-enabled chat template into the model file.

## Thinking Mode

Qwen3-1.7B is a hybrid thinking model. When served through an OpenAI-compatible API that parses `<think>` tags, reasoning content appears separately from the final answer — enabling UIs like Open WebUI to show a collapsible "Thinking…" section.

Example raw output:
```
<think>
The user is asking about the capital of France. This is a straightforward geography question.
</think>
The capital of France is Paris.
```
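
A server can separate the reasoning from the final answer with a small parser on the raw stream. A minimal sketch (the function name and regex are illustrative, not part of the RKLLM API):

```python
import re

def split_thinking(raw: str) -> tuple[str, str]:
    """Separate <think>...</think> reasoning from the final answer."""
    match = re.search(r"<think>(.*?)</think>", raw, flags=re.DOTALL)
    if not match:
        return "", raw.strip()
    reasoning = match.group(1).strip()
    answer = raw[match.end():].strip()
    return reasoning, answer

raw = "<think>\nStraightforward geography question.\n</think>\nThe capital of France is Paris."
reasoning, answer = split_thinking(raw)
# answer == "The capital of France is Paris."
```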

## Hardware Tested

- **Orange Pi 5 Plus** — RK3588, 16GB RAM, Armbian Linux
- RKNPU driver 0.9.8
- RKLLM Runtime v1.2.3

## Important: Enabling Thinking Mode

The RKLLM runtime requires **two things** for thinking mode to work:

### 1. Set `enable_thinking = true` in the C++ demo

The stock `llm_demo.cpp` zero-initializes the input struct with `memset(&rkllm_input, 0, ...)`, which leaves `enable_thinking` as `false`. You **must** add one line:

```cpp
rkllm_input.input_type = RKLLM_INPUT_PROMPT;
rkllm_input.enable_thinking = true;   // ← ADD THIS LINE
rkllm_input.role = "user";
rkllm_input.prompt_input = (char *)input_str.c_str();
```

If using the Python ctypes API (`flask_server.py` / `gradio_server.py`), set it on the `RKLLMInput` struct:
```python
rkllm_input.enable_thinking = ctypes.c_bool(True)
```

Without this, the runtime never triggers the thinking chat template and the model won't produce `<think>` tags.

### 2. Handle the `robot: ` output prefix

The compiled `llm_demo` binary outputs `robot: ` before the model's actual response text. If your server uses a timing-based guard to discard residual stdout data, the `<think>` tag may arrive fast enough to be incorrectly discarded along with the prefix. Make sure your output parser:

- Strips the `robot: ` prefix (in addition to any `LLM: ` prefix)
- Does **not** discard data containing `<think>` even if it arrives quickly after the prompt is sent
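
The two rules above can be sketched as a pair of helpers; the names, guard interval, and prefix list are illustrative assumptions, not part of any RKLLM API:

```python
def clean_chunk(chunk: str) -> str:
    """Strip known demo-binary prefixes from a stdout chunk."""
    for prefix in ("robot: ", "LLM: "):
        if chunk.startswith(prefix):
            chunk = chunk[len(prefix):]
    return chunk

def should_discard(chunk: str, elapsed_s: float, guard_s: float = 0.2) -> bool:
    """Timing guard for residual stdout that never drops reasoning content."""
    if "<think>" in chunk:
        return False          # reasoning may arrive immediately; keep it
    return elapsed_s < guard_s  # otherwise discard early residual output
```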

### Compiling natively on aarch64

If building directly on the board (not cross-compiling), ignore `build-linux.sh` and compile natively:

```bash
cd ~/rknn-llm/examples/rkllm_api_demo/deploy
g++ -O2 -o llm_demo src/llm_demo.cpp \
    -I../../../rkllm-runtime/Linux/librkllm_api/include \
    -L../../../rkllm-runtime/Linux/librkllm_api/aarch64 \
    -lrkllmrt -lpthread
```

## Usage

### With the official RKLLM API demo

```bash
# Clone the runtime
git clone https://github.com/airockchip/rknn-llm.git
cd rknn-llm/examples/rkllm_api_demo

# Build (via build-linux.sh, or the native aarch64 compile above), then run:
# args: model path, max_new_tokens, max_context_len
./llm_demo /path/to/Qwen3-1.7B-w8a8-rk3588.rkllm 2048 4096
```

### With a custom OpenAI-compatible server

Any server that launches the RKLLM binary and parses `<think>` tags from the output stream will work. The model responds to standard chat completion requests.
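
For example, a client can send a standard chat completion request and extract the reply. The endpoint URL and model id below are hypothetical and depend on how your server is configured:

```python
import json

# Hypothetical payload; model id and endpoint depend on your server setup.
payload = {
    "model": "Qwen3-1.7B-w8a8-rk3588",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
    "stream": False,
}
# POST `payload` to e.g. http://<board-ip>:8080/v1/chat/completions, then:

def extract_reply(response_body: str) -> str:
    """Pull the assistant message out of an OpenAI-style response body."""
    data = json.loads(response_body)
    return data["choices"][0]["message"]["content"]
```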

## Conversion Script

```python
from rkllm.api import RKLLM

model_path = "Qwen/Qwen3-1.7B"  # or local path
output_path = "./Qwen3-1.7B-w8a8-rk3588.rkllm"
dataset_path = "./data_quant.json"  # calibration data

# Load
llm = RKLLM()
llm.load_huggingface(model=model_path, model_lora=None, device="cpu")

# Build
llm.build(
    do_quantization=True,
    optimization_level=1,
    quantized_dtype="w8a8",
    quantized_algorithm="normal",
    target_platform="rk3588",
    num_npu_core=3,
    extra_qparams=None,
    dataset=dataset_path,
    max_context=4096,
)

# Export
llm.export_rkllm(output_path)
```

Calibration dataset: 21 diverse prompt/completion pairs (English + Chinese) generated with `generate_data_quant.py` from the [rknn-llm examples](https://github.com/airockchip/rknn-llm/tree/main/examples/rkllm_api_demo/export).
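
For reference, the calibration file is a JSON list of prompt/completion pairs. The `input`/`target` key names below are an assumption based on the rknn-llm examples; verify the exact schema against `generate_data_quant.py` for your toolkit version:

```python
import json

# Assumed pair format ("input"/"target" keys) -- check generate_data_quant.py
# from the rknn-llm examples for the authoritative schema.
pairs = [
    {"input": "What is the capital of France?",
     "target": "The capital of France is Paris."},
    {"input": "请用一句话介绍长城。",
     "target": "长城是中国古代修建的大型防御工程。"},
]
with open("data_quant.json", "w", encoding="utf-8") as f:
    json.dump(pairs, f, ensure_ascii=False, indent=2)
```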

## File Listing

| File | Description |
|---|---|
| `Qwen3-1.7B-w8a8-rk3588.rkllm` | Quantized model for RK3588 NPU |

## Compatibility Notes

- **Minimum runtime**: RKLLM Runtime v1.2.1 (for thinking mode). v1.2.3 recommended.
- **RKNPU driver**: ≥ 0.9.6
- **SoCs**: RK3588 / RK3588S (3 NPU cores). Not compatible with RK3576 (2 cores) without reconversion.
- **RAM**: ~2 GB when loaded. Runs comfortably on 8 GB+ boards.

## Acknowledgements

- [Qwen Team](https://huggingface.co/Qwen) for the base model
- [Rockchip / airockchip](https://github.com/airockchip/rknn-llm) for the RKLLM toolkit and runtime
- Converted by [GatekeeperZA](https://huggingface.co/GatekeeperZA)