---

license: other
license_name: internlm-license
license_link: https://huggingface.co/internlm/internlm2-chat-1_8b/blob/main/LICENSE
base_model: internlm/internlm2-chat-1_8b
tags:
  - internlm2
  - rk3588
  - npu
  - rockchip
  - quantized
  - w8a8
  - rkllm
  - edge
language:
  - en
  - zh
pipeline_tag: text-generation
library_name: rkllm
---


# InternLM2-Chat-1.8B - RKLLM v1.2.3 (w8a8, RK3588)

RKLLM conversion of [internlm/internlm2-chat-1_8b](https://huggingface.co/internlm/internlm2-chat-1_8b) for Rockchip RK3588 NPU inference.

Converted with **RKLLM Toolkit v1.2.3**. This model provides an alternative architecture to the Qwen3 family on the RK3588, with strong bilingual support (English and Chinese) and solid general-purpose chat capability at ~15.6 tokens/sec.

## Key Details

| | |
|---|---|
| **Base Model** | internlm/internlm2-chat-1_8b |
| **Parameters** | 1.8B |
| **Toolkit Version** | RKLLM Toolkit v1.2.3 |
| **Runtime Version** | RKLLM Runtime ≥ v1.2.0 (v1.2.3 recommended) |
| **Quantization** | w8a8 (8-bit weights, 8-bit activations) |
| **Quantization Algorithm** | normal |
| **Target Platform** | RK3588 |
| **NPU Cores** | 3 |
| **Max Context Length** | 4,096 tokens |
| **Optimization Level** | 1 |
| **Thinking Mode** | ❌ Not supported (standard instruct model) |
| **Languages** | English, Chinese |


## Performance (RK3588 Official Benchmark)



From the [RKLLM v1.2.3 benchmark](https://github.com/airockchip/rknn-llm/blob/main/benchmark.md) (w8a8, SeqLen=128, New_tokens=64):

| Metric | Value |
|--------|-------|
| **Decode Speed** | 15.58 tokens/sec |
| **Prefill (TTFT)** | 374 ms |
| **Memory Usage** | ~1,766 MB |
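
These two numbers give a rough end-to-end latency estimate for a short reply: total time ≈ TTFT + new_tokens / decode speed. A minimal sketch (the linear model is an approximation; real prefill time grows with prompt length):

```python
# Rough latency estimate from the benchmark numbers above.
# Assumption: fixed prefill time (128-token prompt) and constant decode speed.
TTFT_S = 0.374           # prefill time in seconds
DECODE_TOK_PER_S = 15.58 # decode speed in tokens/sec

def estimate_latency_s(new_tokens: int) -> float:
    """Approximate wall-clock seconds to generate `new_tokens` tokens."""
    return TTFT_S + new_tokens / DECODE_TOK_PER_S

print(round(estimate_latency_s(64), 2))  # ~4.48 s for the benchmark's 64 new tokens
```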

## Why InternLM2-1.8B?

InternLM2 brings **architectural diversity** to an RK3588 model lineup. If you already run Qwen3 models, adding InternLM2 gives you a different model family with its own strengths:

- **Strong bilingual capability**: trained extensively on both English and Chinese data
- **Good instruction following**: RLHF-aligned for chat applications
- **Efficient memory usage**: ~1,766 MB is significantly less than 3-4B models (~3.7-4.3 GB)
- **Fast inference**: 15.58 tok/s is solidly in the "responsive chat" bracket
- **200K native context**: the base model supports ultra-long contexts (the RKLLM conversion caps at 4K for NPU efficiency, but the architecture handles long dependencies well)

### Benchmarks (Base Model)

| Benchmark | InternLM2-Chat-1.8B | InternLM2-1.8B (base) |
|-----------|---------------------|----------------------|
| MMLU | 47.1 | 46.9 |
| AGIEval | 38.8 | 33.4 |
| BBH | 35.2 | 37.5 |
| GSM8K | 39.7 | 31.2 |
| MATH | 11.8 | 5.6 |
| HumanEval | 32.9 | 25.0 |
| MBPP (Sanitized) | 23.2 | 22.2 |

Source: [OpenCompass](https://github.com/open-compass/opencompass)

## Hardware Tested

- **Orange Pi 5 Plus**: RK3588, 16 GB RAM, Armbian Linux
- RKNPU driver 0.9.8
- RKLLM Runtime v1.2.3

## Usage

### 1. Download

Place the `.rkllm` file in a model directory on your RK3588 board:

```bash
mkdir -p ~/models/InternLM2-1.8B
cd ~/models/InternLM2-1.8B
# Copy the .rkllm file into this directory
```

### 2. Run with the official RKLLM API demo

```bash
# Clone the runtime and demo
git clone https://github.com/airockchip/rknn-llm.git
cd rknn-llm/examples/rkllm_api_demo

# Build the demo per the repo's instructions (aarch64), then run:
./build/rkllm_api_demo /path/to/InternLM2-1.8B-w8a8-rk3588.rkllm 2048 4096
```

### 3. Chat template

InternLM2 uses the following chat format:

```
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
How does photosynthesis work?<|im_end|>
<|im_start|>assistant
```

The RKLLM runtime handles this automatically; no manual template is needed.
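
If you do drive the model at the raw-prompt level (e.g. from a custom binding), the template above can be reproduced with a small helper. This is an illustrative sketch, not part of the RKLLM API:

```python
def build_internlm2_prompt(messages: list[dict]) -> str:
    """Render OpenAI-style messages into the InternLM2 chat format,
    ending with an open assistant turn for the model to complete."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages]
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)

prompt = build_internlm2_prompt([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "How does photosynthesis work?"},
])
print(prompt)
```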

### 4. With a custom OpenAI-compatible server

Any server that wraps the RKLLM binary/library will work. The model responds to standard chat completion requests. See the [RKLLM API Server](https://github.com/GatekeeperZA/RKLLM-API-Server) project for a full OpenAI-compatible implementation with multi-model support.
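
As an illustration of the request shape such a server accepts, here is a standard chat-completions payload built with the Python standard library. The endpoint URL and model name are placeholders for whatever your server exposes:

```python
import json
import urllib.request

# Hypothetical local endpoint; adjust host, port, and model name to your server.
URL = "http://localhost:8080/v1/chat/completions"

payload = {
    "model": "internlm2-chat-1_8b",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "How does photosynthesis work?"},
    ],
    "max_tokens": 256,
}

request = urllib.request.Request(
    URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# Uncomment on a board with the server running:
# with urllib.request.urlopen(request) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```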

## Conversion Script

```python
from rkllm.api import RKLLM

model_path = "internlm/internlm2-chat-1_8b"  # or a local path
output_path = "./InternLM2-1.8B-w8a8-rk3588.rkllm"
dataset_path = "./data_quant.json"  # calibration data

# Load the Hugging Face model
llm = RKLLM()
llm.load_huggingface(model=model_path, model_lora=None, device="cpu")

# Quantize and build for the RK3588 NPU
llm.build(
    do_quantization=True,
    optimization_level=1,
    quantized_dtype="w8a8",
    quantized_algorithm="normal",
    target_platform="rk3588",
    num_npu_core=3,
    extra_qparams=None,
    dataset=dataset_path,
    max_context=4096,
)

# Export the .rkllm artifact
llm.export_rkllm(output_path)
```

Calibration dataset: 21 diverse prompt/completion pairs generated with `generate_data_quant.py` from the [rknn-llm examples](https://github.com/airockchip/rknn-llm/tree/main/examples/rkllm_api_demo/export).
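
For reference, a tiny calibration file can also be written by hand. This sketch assumes the JSON shape that `generate_data_quant.py` emits, a list of input/target pairs; verify against the script in your toolkit version:

```python
import json

# Assumed shape: a JSON list of {"input": prompt, "target": completion} pairs.
samples = [
    {"input": "What is the capital of France?",
     "target": "The capital of France is Paris."},
    {"input": "η”¨δΈ€ε₯θ―θ§£ι‡Šε…‰εˆδ½œη”¨γ€‚",
     "target": "光合作用是植物利用光能将二氧化碳和水转化为糖和氧气的过程。"},
]

with open("data_quant.json", "w", encoding="utf-8") as f:
    json.dump(samples, f, ensure_ascii=False, indent=2)
```

A real calibration set should cover both languages and the kinds of prompts you expect at inference time.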

## File Listing

| File | Description |
|------|-------------|
| `InternLM2-1.8B-w8a8-rk3588.rkllm` | Quantized model for RK3588 NPU |

## Compatibility Notes

- **Minimum runtime:** RKLLM Runtime v1.2.0. v1.2.3 recommended.
- **RKNPU driver:** ≥ 0.9.6
- **SoCs:** RK3588 / RK3588S (3 NPU cores). Not compatible with RK3576 (2 cores) without reconversion.
- **RAM:** ~1.8 GB loaded. Runs comfortably on 8 GB+ boards.
- **No thinking mode:** InternLM2 is a standard instruct/chat model; it does not produce `<think>…</think>` reasoning blocks. For thinking mode, use [Qwen3-1.7B-RKLLM-v1.2.3](https://huggingface.co/GatekeeperZA/Qwen3-1.7B-RKLLM-v1.2.3).

## Known Issues

- The folder name containing the model must **not** include dots (e.g., `InternLM2-1.8B` not `InternLM2.1.8B`) due to Python module import issues during conversion.
- InternLM2 uses a custom tokenizer (`trust_remote_code=True` required during conversion).

## Acknowledgements

- [InternLM Team (Shanghai AI Laboratory)](https://huggingface.co/internlm) for the base model
- [Rockchip / airockchip](https://github.com/airockchip/rknn-llm) for the RKLLM toolkit and runtime
- Converted by [GatekeeperZA](https://huggingface.co/GatekeeperZA)

## Citation

```bibtex
@misc{cai2024internlm2,
  title={InternLM2 Technical Report},
  author={Zheng Cai and Maosong Cao and Haojiong Chen and Kai Chen and others},
  year={2024},
  eprint={2403.17297},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```