---
license: apache-2.0
library_name: rkllm
base_model: Qwen/Qwen3-1.7B
tags:
- rkllm
- rk3588
- npu
- rockchip
- qwen3
- thinking
- reasoning
- quantized
- edge-ai
- orange-pi
model_name: Qwen3-1.7B-RKLLM-v1.2.3
pipeline_tag: text-generation
language:
- en
- zh
---

# Qwen3-1.7B — RKLLM v1.2.3 (w8a8, RK3588)

RKLLM conversion of [Qwen/Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B) for Rockchip RK3588 NPU inference.

Converted with **RKLLM Toolkit v1.2.3**, which includes full **thinking mode support** — the model produces `<think>…</think>` reasoning blocks when used with compatible runtimes.

## Key Details

| Property | Value |
|---|---|
| **Base Model** | [Qwen/Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B) |
| **Toolkit Version** | RKLLM Toolkit v1.2.3 |
| **Runtime Version** | RKLLM Runtime ≥ v1.2.1 (v1.2.3 recommended) |
| **Quantization** | w8a8 (8-bit weights, 8-bit activations) |
| **Quantization Algorithm** | normal |
| **Target Platform** | RK3588 |
| **NPU Cores** | 3 |
| **Max Context Length** | 4096 tokens |
| **Optimization Level** | 1 |
| **Thinking Mode** | ✅ Supported |
| **Languages** | English, Chinese (+ others inherited from Qwen3) |

## Why This Conversion?

Previous Qwen3-1.7B RKLLM conversions on HuggingFace were built with **Toolkit v1.2.0**, which predates thinking mode support (added in v1.2.1). The chat template baked into those `.rkllm` files does not include the `<think>` trigger, so the model never produces reasoning output.

This conversion uses **Toolkit v1.2.3**, which correctly embeds the thinking-enabled chat template into the model file.

## Thinking Mode

Qwen3-1.7B is a hybrid thinking model. When served through an OpenAI-compatible API that parses `<think>` tags, reasoning content appears separately from the final answer — enabling UIs like Open WebUI to show a collapsible "Thinking…" section.

Example raw output:

```
<think>
The user is asking about the capital of France. This is a straightforward geography question.
</think>
The capital of France is Paris.
```

## Hardware Tested

- **Orange Pi 5 Plus** — RK3588, 16GB RAM, Armbian Linux
  - RKNPU driver 0.9.8
  - RKLLM Runtime v1.2.3

## Important: Enabling Thinking Mode

The RKLLM runtime requires **two things** for thinking mode to work:

### 1. Set `enable_thinking = true` in the C++ demo

The stock `llm_demo.cpp` zero-initializes the input struct with `memset(&rkllm_input, 0, ...)`, which leaves `enable_thinking` at `false`. You **must** add one line:

```cpp
rkllm_input.input_type = RKLLM_INPUT_PROMPT;
rkllm_input.enable_thinking = true;  // ← ADD THIS LINE
rkllm_input.role = "user";
rkllm_input.prompt_input = (char *)input_str.c_str();
```

If using the Python ctypes API (`flask_server.py` / `gradio_server.py`), set it on the `RKLLMInput` struct:

```python
rkllm_input.enable_thinking = ctypes.c_bool(True)
```

Without this, the runtime never triggers the thinking chat template and the model won't produce `<think>` tags.

### 2. Handle the `robot: ` output prefix

The compiled `llm_demo` binary outputs `robot: ` before the model's actual response text. If your server uses a timing-based guard to discard residual stdout data, the `<think>` tag may arrive fast enough to be incorrectly discarded along with the prefix. Make sure your output parser:

- Strips the `robot: ` prefix (in addition to any `LLM: ` prefix)
- Does **not** discard data containing `<think>` even if it arrives quickly after the prompt is sent
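
For illustration, a minimal prefix filter might look like the sketch below (the function name and the exact prefix set are assumptions; adapt it to however your server reads the demo's stdout):

```python
# Prefixes the demo binaries print before the actual response text.
PREFIXES = ("robot: ", "LLM: ")

def clean_chunk(chunk: str) -> str:
    """Strip known demo prefixes without discarding early <think> content."""
    for prefix in PREFIXES:
        if chunk.startswith(prefix):
            # Drop only the prefix itself: the <think> tag can arrive in the
            # very same stdout read, so never discard the whole chunk.
            return chunk[len(prefix):]
    return chunk
```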

### Compiling natively on aarch64

If building directly on the board (not cross-compiling), ignore `build-linux.sh` and compile natively:

```bash
cd ~/rknn-llm/examples/rkllm_api_demo/deploy
g++ -O2 -o llm_demo src/llm_demo.cpp \
  -I../../../rkllm-runtime/Linux/librkllm_api/include \
  -L../../../rkllm-runtime/Linux/librkllm_api/aarch64 \
  -lrkllmrt -lpthread
```
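
Note that `-L` only affects link time: if `librkllmrt.so` is not on the loader's default search path, you will likely need to run the binary with `LD_LIBRARY_PATH` pointing at the same `aarch64` directory (or copy the library into a standard location such as `/usr/lib`).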

## Usage

### With the official RKLLM API demo

```bash
# Clone the runtime
git clone https://github.com/airockchip/rknn-llm.git
cd rknn-llm/examples/rkllm_api_demo

# Run (aarch64). Arguments: model path, max_new_tokens, max_context_len
./build/rkllm_api_demo /path/to/Qwen3-1.7B-w8a8-rk3588.rkllm 2048 4096
```

### With a custom OpenAI-compatible server

Any server that launches the RKLLM binary and parses `<think>` tags from the output stream will work. The model responds to standard chat completion requests.
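
As a sketch, such a server can split the raw model output on the `<think>` tags before filling in its response fields (the `reasoning`/`answer` split below is illustrative; field names like `reasoning_content` used by some OpenAI-compatible servers are a convention, not part of RKLLM):

```python
def split_thinking(raw: str) -> tuple[str, str]:
    """Separate <think>...</think> reasoning from the final answer."""
    start, end = raw.find("<think>"), raw.find("</think>")
    if start == -1 or end == -1:
        return "", raw.strip()  # model answered without a reasoning block
    reasoning = raw[start + len("<think>"):end].strip()
    answer = raw[end + len("</think>"):].strip()
    return reasoning, answer

# With the raw output shown earlier, this returns:
# ("The user is asking about the capital of France. ...",
#  "The capital of France is Paris.")
```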

## Conversion Script

```python
from rkllm.api import RKLLM

model_path = "Qwen/Qwen3-1.7B"  # or local path
output_path = "./Qwen3-1.7B-w8a8-rk3588.rkllm"
dataset_path = "./data_quant.json"  # calibration data

# Load
llm = RKLLM()
llm.load_huggingface(model=model_path, model_lora=None, device="cpu")

# Build
llm.build(
    do_quantization=True,
    optimization_level=1,
    quantized_dtype="w8a8",
    quantized_algorithm="normal",
    target_platform="rk3588",
    num_npu_core=3,
    extra_qparams=None,
    dataset=dataset_path,
    max_context=4096,
)

# Export
llm.export_rkllm(output_path)
```

Calibration dataset: 21 diverse prompt/completion pairs (English + Chinese) generated with `generate_data_quant.py` from the [rknn-llm examples](https://github.com/airockchip/rknn-llm/tree/main/examples/rkllm_api_demo/export).
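
For reference, the calibration file is a small JSON list of prompt/completion pairs. The `input`/`target` key names below are an assumption based on the toolkit examples; treat `generate_data_quant.py` as the authoritative source of the schema:

```python
import json

# Two illustrative calibration pairs (English + Chinese).
pairs = [
    {"input": "What is the capital of France?",
     "target": "The capital of France is Paris."},
    {"input": "用一句话介绍一下长城。",
     "target": "长城是中国古代修建的大型防御工事。"},
]

with open("data_quant.json", "w", encoding="utf-8") as f:
    json.dump(pairs, f, ensure_ascii=False, indent=2)
```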

## File Listing

| File | Description |
|---|---|
| `Qwen3-1.7B-w8a8-rk3588.rkllm` | Quantized model for RK3588 NPU |

## Compatibility Notes

- **Minimum runtime**: RKLLM Runtime v1.2.1 (for thinking mode). v1.2.3 recommended.
- **RKNPU driver**: ≥ 0.9.6
- **SoCs**: RK3588 / RK3588S (3 NPU cores). Not compatible with RK3576 (2 cores) without reconversion.
- **RAM**: ~2 GB when loaded. Runs comfortably on 8GB+ boards.
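
On most Rockchip kernels you can check the installed NPU driver version with `sudo cat /sys/kernel/debug/rknpu/version` (the exact debugfs path may vary by kernel build).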

## Acknowledgements

- [Qwen Team](https://huggingface.co/Qwen) for the base model
- [Rockchip / airockchip](https://github.com/airockchip/rknn-llm) for the RKLLM toolkit and runtime
- Converted by [GatekeeperZA](https://huggingface.co/GatekeeperZA)