Update README.md

Updated the README to document the critical `enable_thinking` requirement and the `robot: ` prefix issue.

![image](https://cdn-uploads.huggingface.co/production/uploads/63a88c2032ed73936ebaacdc/8rhoWB2yMDBXeOq88MHn3.png)

Files changed (1)

README.md +176 -135

@@ -1,135 +1,176 @@
----
-license: apache-2.0
-library_name: rkllm
-base_model: Qwen/Qwen3-1.7B
-tags:
-  - rkllm
-  - rk3588
-  - npu
-  - rockchip
-  - qwen3
-  - thinking
-  - reasoning
-  - quantized
-  - edge-ai
-  - orange-pi
-model_name: Qwen3-1.7B-RKLLM-v1.2.3
-pipeline_tag: text-generation
-language:
-  - en
-  - zh
----
-# Qwen3-1.7B — RKLLM v1.2.3 (w8a8, RK3588)
-RKLLM conversion of [Qwen/Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B) for Rockchip RK3588 NPU inference.
-Converted with **RKLLM Toolkit v1.2.3**, which includes full **thinking mode support** — the model produces `<think>…</think>` reasoning blocks when used with compatible runtimes.
-## Key Details
-| Property | Value |
-|---|---|
-| **Base Model** | [Qwen/Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B) |
-| **Toolkit Version** | RKLLM Toolkit v1.2.3 |
-| **Runtime Version** | RKLLM Runtime ≥ v1.2.1 (v1.2.3 recommended) |
-| **Quantization** | w8a8 (8-bit weights, 8-bit activations) |
-| **Quantization Algorithm** | normal |
-| **Target Platform** | RK3588 |
-| **NPU Cores** | 3 |
-| **Max Context Length** | 4096 tokens |
-| **Optimization Level** | 1 |
-| **Thinking Mode** | ✅ Supported |
-| **Languages** | English, Chinese (+ others inherited from Qwen3) |
-## Why This Conversion?
-Previous Qwen3-1.7B RKLLM conversions on HuggingFace were built with **Toolkit v1.2.0**, which predates thinking mode support (added in v1.2.1). The chat template baked into those `.rkllm` files does not include the `<think>` trigger, so the model never produces reasoning output.
-This conversion uses **Toolkit v1.2.3**, which correctly embeds the thinking-enabled chat template into the model file.
-## Thinking Mode
-Qwen3-1.7B is a hybrid thinking model. When served through an OpenAI-compatible API that parses `<think>` tags, reasoning content appears separately from the final answer — enabling UIs like Open WebUI to show a collapsible "Thinking…" section.
-Example raw output:
-```
-<think>
-The user is asking about the capital of France. This is a straightforward geography question.
-</think>
-The capital of France is Paris.
-```
-## Hardware Tested
-- **Orange Pi 5 Plus** — RK3588, 16GB RAM, Armbian Linux
-- RKNPU driver 0.9.8
-- RKLLM Runtime v1.2.3
-## Usage
-### With the official RKLLM API demo
-```bash
-# Clone the runtime
-git clone https://github.com/airockchip/rknn-llm.git
-cd rknn-llm/examples/rkllm_api_demo
-# Run (aarch64)
-./build/rkllm_api_demo /path/to/Qwen3-1.7B-w8a8-rk3588.rkllm 2048 4096
-```
-### With a custom OpenAI-compatible server
-Any server that launches the RKLLM binary and parses `<think>` tags from the output stream will work. The model responds to standard chat completion requests.
-## Conversion Script
-```python
-from rkllm.api import RKLLM
-model_path = "Qwen/Qwen3-1.7B"  # or local path
-output_path = "./Qwen3-1.7B-w8a8-rk3588.rkllm"
-dataset_path = "./data_quant.json"  # calibration data
-# Load
-llm = RKLLM()
-llm.load_huggingface(model=model_path, model_lora=None, device="cpu")
-# Build
-llm.build(
-    do_quantization=True,
-    optimization_level=1,
-    quantized_dtype="w8a8",
-    quantized_algorithm="normal",
-    target_platform="rk3588",
-    num_npu_core=3,
-    extra_qparams=None,
-    dataset=dataset_path,
-    max_context=4096,
-)
-# Export
-llm.export_rkllm(output_path)
-```
-Calibration dataset: 21 diverse prompt/completion pairs (English + Chinese) generated with `generate_data_quant.py` from the [rknn-llm examples](https://github.com/airockchip/rknn-llm/tree/main/examples/rkllm_api_demo/export).
-## File Listing
-| File | Description |
-|---|---|
-| `Qwen3-1.7B-w8a8-rk3588.rkllm` | Quantized model for RK3588 NPU |
-## Compatibility Notes
-- **Minimum runtime**: RKLLM Runtime v1.2.1 (for thinking mode). v1.2.3 recommended.
-- **RKNPU driver**: ≥ 0.9.6
-- **SoCs**: RK3588 / RK3588S (3 NPU cores). Not compatible with RK3576 (2 cores) without reconversion.
-- **RAM**: ~2GB loaded. Runs comfortably on 8GB+ boards.
-## Acknowledgements
-- [Qwen Team](https://huggingface.co/Qwen) for the base model
-- [Rockchip / airockchip](https://github.com/airockchip/rknn-llm) for the RKLLM toolkit and runtime
-- Converted by [GatekeeperZA](https://huggingface.co/GatekeeperZA)

+---
+license: apache-2.0
+library_name: rkllm
+base_model: Qwen/Qwen3-1.7B
+tags:
+  - rkllm
+  - rk3588
+  - npu
+  - rockchip
+  - qwen3
+  - thinking
+  - reasoning
+  - quantized
+  - edge-ai
+  - orange-pi
+model_name: Qwen3-1.7B-RKLLM-v1.2.3
+pipeline_tag: text-generation
+language:
+  - en
+  - zh
+---
+# Qwen3-1.7B — RKLLM v1.2.3 (w8a8, RK3588)
+RKLLM conversion of [Qwen/Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B) for Rockchip RK3588 NPU inference.
+Converted with **RKLLM Toolkit v1.2.3**, which includes full **thinking mode support** — the model produces `<think>…</think>` reasoning blocks when used with compatible runtimes.
+## Key Details
+| Property | Value |
+|---|---|
+| **Base Model** | [Qwen/Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B) |
+| **Toolkit Version** | RKLLM Toolkit v1.2.3 |
+| **Runtime Version** | RKLLM Runtime ≥ v1.2.1 (v1.2.3 recommended) |
+| **Quantization** | w8a8 (8-bit weights, 8-bit activations) |
+| **Quantization Algorithm** | normal |
+| **Target Platform** | RK3588 |
+| **NPU Cores** | 3 |
+| **Max Context Length** | 4096 tokens |
+| **Optimization Level** | 1 |
+| **Thinking Mode** | ✅ Supported |
+| **Languages** | English, Chinese (+ others inherited from Qwen3) |
+## Why This Conversion?
+Previous Qwen3-1.7B RKLLM conversions on HuggingFace were built with **Toolkit v1.2.0**, which predates thinking mode support (added in v1.2.1). The chat template baked into those `.rkllm` files does not include the `<think>` trigger, so the model never produces reasoning output.
+This conversion uses **Toolkit v1.2.3**, which correctly embeds the thinking-enabled chat template into the model file.
+## Thinking Mode
+Qwen3-1.7B is a hybrid thinking model. When served through an OpenAI-compatible API that parses `<think>` tags, reasoning content appears separately from the final answer — enabling UIs like Open WebUI to show a collapsible "Thinking…" section.
+Example raw output:
+```
+<think>
+The user is asking about the capital of France. This is a straightforward geography question.
+</think>
+The capital of France is Paris.
+```
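As a rough illustration of how a server might separate the reasoning block from the final answer in raw output like the example above, here is a minimal Python sketch. The function name `split_thinking` is hypothetical and is not part of the RKLLM runtime or any official demo. It assumes the complete output is available as one string with at most one `<think>…</think>` block.

```python
import re
from typing import Tuple

# Matches the first <think>...</think> block, including newlines inside it.
_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(raw: str) -> Tuple[str, str]:
    """Return (reasoning, answer) extracted from raw model output.

    Hypothetical helper: if no <think> block is present, reasoning is "".
    """
    match = _THINK_RE.search(raw)
    if match is None:
        return "", raw.strip()
    reasoning = match.group(1).strip()
    # Everything outside the think block is treated as the answer.
    answer = (raw[:match.start()] + raw[match.end():]).strip()
    return reasoning, answer


if __name__ == "__main__":
    raw = (
        "<think>\n"
        "The user is asking about the capital of France.\n"
        "</think>\n"
        "The capital of France is Paris."
    )
    reasoning, answer = split_thinking(raw)
    print("reasoning:", reasoning)
    print("answer:", answer)
```

A streaming server would need to buffer tokens until `</think>` arrives. This sketch only covers the non-streaming case.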
+## Hardware Tested
+- **Orange Pi 5 Plus** — RK3588, 16GB RAM, Armbian Linux
+- RKNPU driver 0.9.8
+- RKLLM Runtime v1.2.3
+## Important: Enabling Thinking Mode
+The RKLLM runtime requires **two things** for thinking mode to work:
+### 1. Set `enable_thinking = true` in the C++ demo
+The stock `llm_demo.cpp` zero-initializes the input struct with `memset(&rkllm_input, 0, ...)`, which leaves `enable_thinking` set to `false`. You **must** add one line:
+```cpp
+rkllm_input.input_type = RKLLM_INPUT_PROMPT;
+rkllm_input.enable_thinking = true;   // ← ADD THIS LINE
+rkllm_input.role = "user";
+rkllm_input.prompt_input = (char *)input_str.c_str();
+```
+If using the Python ctypes API (`flask_server.py` / `gradio_server.py`), set it on the `RKLLMInput` struct:
+```python
+rkllm_input.enable_thinking = ctypes.c_bool(True)
+```
+Without this, the runtime never triggers the thinking chat template and the model won't produce `<think>` tags.
+### 2. Handle the `robot: ` output prefix
+The compiled `llm_demo` binary outputs `robot: ` before the model's actual response text. If your server uses a timing-based guard to discard residual stdout data, the `<think>` tag may arrive fast enough to be incorrectly discarded along with the prefix. Make sure your output parser:
+- Strips the `robot: ` prefix (in addition to any `LLM: ` prefix)
+- Does **not** discard data containing `<think>` even if it arrives quickly after the prompt is sent
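The two parser requirements above could be sketched in Python roughly as follows. This is an illustration under stated assumptions, not code from any official server. The helper names `strip_demo_prefix` and `should_discard` and the 0.2-second guard window are hypothetical, so adapt them to however your server reads the binary's stdout.

```python
# Prefixes the demo binary may print before the model's text (from the notes above).
_PREFIXES = ("robot: ", "LLM: ")


def strip_demo_prefix(text: str) -> str:
    """Remove one leading 'robot: ' or 'LLM: ' prefix, keeping the rest intact."""
    stripped = text.lstrip()
    for prefix in _PREFIXES:
        if stripped.startswith(prefix):
            return stripped[len(prefix):]
    return text


def should_discard(chunk: str, elapsed_s: float, guard_s: float = 0.2) -> bool:
    """Hypothetical timing guard for residual stdout data.

    Chunks arriving within guard_s seconds of sending the prompt are treated
    as leftovers, except that a chunk containing <think> is always kept.
    """
    if "<think>" in chunk:
        return False
    return elapsed_s < guard_s


if __name__ == "__main__":
    chunk = "robot: <think>\nLet me work through this."
    if not should_discard(chunk, elapsed_s=0.05):
        print(strip_demo_prefix(chunk))
```

Checking for `<think>` before applying the timing rule keeps the start of the reasoning block even when the first tokens arrive almost immediately.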
+### Compiling natively on aarch64
+If building directly on the board (not cross-compiling), ignore `build-linux.sh` and compile natively:
+```bash
+cd ~/rknn-llm/examples/rkllm_api_demo/deploy
+g++ -O2 -o llm_demo src/llm_demo.cpp \
+    -I../../../rkllm-runtime/Linux/librkllm_api/include \
+    -L../../../rkllm-runtime/Linux/librkllm_api/aarch64 \
+    -lrkllmrt -lpthread
+```
+## Usage
+### With the official RKLLM API demo
+```bash
+# Clone the runtime
+git clone https://github.com/airockchip/rknn-llm.git
+cd rknn-llm/examples/rkllm_api_demo
+# Run (aarch64)
+./build/rkllm_api_demo /path/to/Qwen3-1.7B-w8a8-rk3588.rkllm 2048 4096
+```
+### With a custom OpenAI-compatible server
+Any server that launches the RKLLM binary and parses `<think>` tags from the output stream will work. The model responds to standard chat completion requests.
+## Conversion Script
+```python
+from rkllm.api import RKLLM
+model_path = "Qwen/Qwen3-1.7B"  # or local path
+output_path = "./Qwen3-1.7B-w8a8-rk3588.rkllm"
+dataset_path = "./data_quant.json"  # calibration data
+# Load
+llm = RKLLM()
+llm.load_huggingface(model=model_path, model_lora=None, device="cpu")
+# Build
+llm.build(
+    do_quantization=True,
+    optimization_level=1,
+    quantized_dtype="w8a8",
+    quantized_algorithm="normal",
+    target_platform="rk3588",
+    num_npu_core=3,
+    extra_qparams=None,
+    dataset=dataset_path,
+    max_context=4096,
+)
+# Export
+llm.export_rkllm(output_path)
+```
+Calibration dataset: 21 diverse prompt/completion pairs (English + Chinese) generated with `generate_data_quant.py` from the [rknn-llm examples](https://github.com/airockchip/rknn-llm/tree/main/examples/rkllm_api_demo/export).
+## File Listing
+| File | Description |
+|---|---|
+| `Qwen3-1.7B-w8a8-rk3588.rkllm` | Quantized model for RK3588 NPU |
+## Compatibility Notes
+- **Minimum runtime**: RKLLM Runtime v1.2.1 (for thinking mode). v1.2.3 recommended.
+- **RKNPU driver**: ≥ 0.9.6
+- **SoCs**: RK3588 / RK3588S (3 NPU cores). Not compatible with RK3576 (2 cores) without reconversion.
+- **RAM**: ~2GB loaded. Runs comfortably on 8GB+ boards.
+## Acknowledgements
+- [Qwen Team](https://huggingface.co/Qwen) for the base model
+- [Rockchip / airockchip](https://github.com/airockchip/rknn-llm) for the RKLLM toolkit and runtime
+- Converted by [GatekeeperZA](https://huggingface.co/GatekeeperZA)