InternLM2-1.8B-w8a8-RKLLM-v1.2.3 / README.md

GatekeeperZA

Add InternLM2-Chat-1.8B w8a8 RKLLM v1.2.3 for RK3588

3d5f9a3 3 days ago

6.53 kB

	---
	license: other
	license_name: internlm-license
	license_link: https://huggingface.co/internlm/internlm2-chat-1_8b/blob/main/LICENSE
	base_model: internlm/internlm2-chat-1_8b
	tags:
	- internlm2
	- rk3588
	- npu
	- rockchip
	- quantized
	- w8a8
	- rkllm
	- edge
	language:
	- en
	- zh
	pipeline_tag: text-generation
	library_name: rkllm
	---

	# InternLM2-Chat-1.8B — RKLLM v1.2.3 (w8a8, RK3588)

	RKLLM conversion of [internlm/internlm2-chat-1_8b](https://huggingface.co/internlm/internlm2-chat-1_8b) for Rockchip RK3588 NPU inference.

	Converted with RKLLM Toolkit v1.2.3. This model provides a different architecture option alongside Qwen3 models on the RK3588, offering strong multilingual support (English + Chinese) and good general-purpose chat capability at ~15.6 tokens/sec.

	## Key Details

	\| \| \|
	\|---\|---\|
	\| Base Model \| internlm/internlm2-chat-1_8b \|
	\| Parameters \| 1.8B \|
	\| Toolkit Version \| RKLLM Toolkit v1.2.3 \|
	\| Runtime Version \| RKLLM Runtime ≥ v1.2.0 (v1.2.3 recommended) \|
	\| Quantization \| w8a8 (8-bit weights, 8-bit activations) \|
	\| Quantization Algorithm \| normal \|
	\| Target Platform \| RK3588 \|
	\| NPU Cores \| 3 \|
	\| Max Context Length \| 4,096 tokens \|
	\| Optimization Level \| 1 \|
	\| Thinking Mode \| ❌ Not supported (standard instruct model) \|
	\| Languages \| English, Chinese \|

	## Performance (RK3588 Official Benchmark)

	From the [RKLLM v1.2.3 benchmark](https://github.com/airockchip/rknn-llm/blob/main/benchmark.md) (w8a8, SeqLen=128, New_tokens=64):

	\| Metric \| Value \|
	\|--------\|-------\|
	\| Decode Speed \| 15.58 tokens/sec \|
	\| Prefill (TTFT) \| 374 ms \|
	\| Memory Usage \| ~1,766 MB \|

	## Why InternLM2-1.8B?

	InternLM2 brings architectural diversity to an RK3588 model lineup. If you already run Qwen3 models, adding InternLM2 gives you a different model family with its own strengths:

	- Strong bilingual capability — trained extensively on both English and Chinese data
	- Good instruction following — RLHF-aligned for chat applications
	- Efficient memory usage — ~1,766 MB is significantly less than 3-4B models (~3.7-4.3 GB)
	- Fast inference — 15.58 tok/s is solidly in the "responsive chat" bracket
	- 200K native context — the base model supports ultra-long contexts (RKLLM conversion caps at 4K for NPU efficiency, but the architecture handles long dependencies well)

	### Benchmarks (Base Model)

	\| Benchmark \| InternLM2-Chat-1.8B \| InternLM2-1.8B (base) \|
	\|-----------\|---------------------\|----------------------\|
	\| MMLU \| 47.1 \| 46.9 \|
	\| AGIEval \| 38.8 \| 33.4 \|
	\| BBH \| 35.2 \| 37.5 \|
	\| GSM8K \| 39.7 \| 31.2 \|
	\| MATH \| 11.8 \| 5.6 \|
	\| HumanEval \| 32.9 \| 25.0 \|
	\| MBPP (Sanitized) \| 23.2 \| 22.2 \|

	Source: [OpenCompass](https://github.com/open-compass/opencompass)

	## Hardware Tested

	- Orange Pi 5 Plus — RK3588, 16 GB RAM, Armbian Linux
	- RKNPU driver 0.9.8
	- RKLLM Runtime v1.2.3

	## Usage

	### 1. Download

	Place the `.rkllm` file in a model directory on your RK3588 board:

	```bash
	mkdir -p ~/models/InternLM2-1.8B
	cd ~/models/InternLM2-1.8B
	# Copy the .rkllm file into this directory
	```

	### 2. Run with the official RKLLM API demo

	```bash
	# Clone the runtime
	git clone https://github.com/airockchip/rknn-llm.git
	cd rknn-llm/examples/rkllm_api_demo

	# Run (aarch64)
	./build/rkllm_api_demo /path/to/InternLM2-1.8B-w8a8-rk3588.rkllm 2048 4096
	```

	### 3. Chat template

	InternLM2 uses the following chat format:

	```
	<\|im_start\|>system
	You are a helpful assistant.<\|im_end\|>
	<\|im_start\|>user
	How does photosynthesis work?<\|im_end\|>
	<\|im_start\|>assistant
	```

	The RKLLM runtime handles this automatically — no manual template needed.

	### 4. With a custom OpenAI-compatible server

	Any server that wraps the RKLLM binary/library will work. The model responds to standard chat completion requests. See the [RKLLM API Server](https://github.com/GatekeeperZA/RKLLM-API-Server) project for a full OpenAI-compatible implementation with multi-model support.

	## Conversion Script

	```python
	from rkllm.api import RKLLM

	model_path = "internlm/internlm2-chat-1_8b" # or local path
	output_path = "./InternLM2-1.8B-w8a8-rk3588.rkllm"
	dataset_path = "./data_quant.json" # calibration data

	# Load
	llm = RKLLM()
	llm.load_huggingface(model=model_path, model_lora=None, device="cpu")

	# Build
	llm.build(
	do_quantization=True,
	optimization_level=1,
	quantized_dtype="w8a8",
	quantized_algorithm="normal",
	target_platform="rk3588",
	num_npu_core=3,
	extra_qparams=None,
	dataset=dataset_path,
	max_context=4096,
	)

	# Export
	llm.export_rkllm(output_path)
	```

	Calibration dataset: 21 diverse prompt/completion pairs generated with `generate_data_quant.py` from the [rknn-llm examples](https://github.com/airockchip/rknn-llm/tree/main/examples/rkllm_api_demo/export).

	## File Listing

	\| File \| Description \|
	\|------\|-------------\|
	\| `InternLM2-1.8B-w8a8-rk3588.rkllm` \| Quantized model for RK3588 NPU \|

	## Compatibility Notes

	- Minimum runtime: RKLLM Runtime v1.2.0. v1.2.3 recommended.
	- RKNPU driver: ≥ 0.9.6
	- SoCs: RK3588 / RK3588S (3 NPU cores). Not compatible with RK3576 (2 cores) without reconversion.
	- RAM: ~1.8 GB loaded. Runs comfortably on 8 GB+ boards.
	- No thinking mode: InternLM2 is a standard instruct/chat model — it does not produce `<think>…</think>` reasoning blocks. For thinking mode, use [Qwen3-1.7B-RKLLM-v1.2.3](https://huggingface.co/GatekeeperZA/Qwen3-1.7B-RKLLM-v1.2.3).

	## Known Issues

	- The folder name containing the model must not include dots (e.g., `InternLM2-1.8B` not `InternLM2.1.8B`) due to Python module import issues during conversion.
	- InternLM2 uses a custom tokenizer (`trust_remote_code=True` required during conversion).

	## Acknowledgements

	- [InternLM Team (Shanghai AI Laboratory)](https://huggingface.co/internlm) for the base model
	- [Rockchip / airockchip](https://github.com/airockchip/rknn-llm) for the RKLLM toolkit and runtime
	- Converted by [GatekeeperZA](https://huggingface.co/GatekeeperZA)

	## Citation

	```bibtex
	@misc{cai2024internlm2,
	title={InternLM2 Technical Report},
	author={Zheng Cai and Maosong Cao and Haojiong Chen and Kai Chen and others},
	year={2024},
	eprint={2403.17297},
	archivePrefix={arXiv},
	primaryClass={cs.CL}
	}
	```