---
library_name: transformers
license: bsd-3-clause
---

# DeepSeek-R1-Distill-Qwen-7B-AX650

- This version of DeepSeek-R1-Distill-Qwen-7B has been converted to run on the Axera NPU using w8a16 quantization.

- This model was converted with the Pulsar2 toolchain:

  - Compatible with Pulsar2 version: 4.2

- Due to the w8a16 quantization scheme, the CMM consumes about 7.6 GiB of memory, so a development board with 16 GiB of memory is required to run this model.
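
To check whether enough CMM is available before loading, the SDK exposes a proc entry; the path below is typical of AX650 SDK images and is an assumption if yours differs:

```
# Inspect CMM usage on the board (path assumed from common AX650 SDK images)
cat /proc/ax_proc/mem_cmm_info | head -n 5
```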

## Features

- Supports longer contexts; in this sample it is 2k
- Supports multi-turn dialogue
- Supports caching the system prompt as a KV cache

## Conversion tool links

If you are interested in model conversion, you can try exporting an axmodel from the original repos: https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B and https://huggingface.co/jakiAJK/DeepSeek-R1-Distill-Qwen-7B_GPTQ-int4

[Pulsar2 Link, How to Convert LLM from Huggingface to axmodel](https://pulsar2-docs.readthedocs.io/en/latest/appendix/build_llm.html) 

[AXera NPU AXEngine LLM Runtime](https://github.com/AXERA-TECH/ax-llm/tree/ax-context) 

[AXera NPU AXCL LLM Runtime](https://github.com/AXERA-TECH/ax-llm/tree/axcl-context) 

### Conversion script

The following shows how to convert DeepSeek-R1-Distill-Qwen-7B:

```
pulsar2 llm_build --input_path deepseek-ai/DeepSeek-R1-Distill-Qwen-7B  \
                  --output_path deepseek-ai/DeepSeek-R1-Distill-Qwen-7B-ax650 \
                  --hidden_state_type bf16 --kv_cache_len 2047 --prefill_len 128 \
                  --last_kv_cache_len 128 \
                  --last_kv_cache_len 256 \
                  --last_kv_cache_len 384 \
                  --last_kv_cache_len 512 \
                  --last_kv_cache_len 640 \
                  --last_kv_cache_len 768 \
                  --last_kv_cache_len 896 \
                  --last_kv_cache_len 1024 \
                  --last_kv_cache_len 1152 \
                  --last_kv_cache_len 1280 \
                  --last_kv_cache_len 1408 \
                  --last_kv_cache_len 1536 \
                  --chip AX650 -c 1 --parallel 8
```
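
The repeated `--last_kv_cache_len` values step from 128 to 1536 in increments of 128. If you retarget the context length, a small helper like this sketch (not part of Pulsar2) can regenerate the flag list:

```
# Sketch: emit --last_kv_cache_len flags in steps of 128 (hypothetical helper)
ARGS=""
for len in $(seq 128 128 1536); do
  ARGS="$ARGS --last_kv_cache_len $len"
done
echo "$ARGS"
```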

## Supported platforms

- AX650
  - AX650N DEMO Board
  - [M4N-Dock(爱芯派Pro)](https://wiki.sipeed.com/hardware/zh/maixIV/m4ndock/m4ndock.html)
  - [M.2 Accelerator card](https://axcl-docs.readthedocs.io/zh-cn/latest/doc_guide_hardware.html)
- AX630C
  - *TBD*
 
|Chip|w8a16 (tokens/s)|w4a16 (tokens/s)|
|--|--|--|
|AX650|2.6|4.8|

## How to use

Download all files from this repository to the device
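
One possible way to fetch everything is with `git-lfs` (the clone URL below is an assumption; substitute this repository's actual address):

```
# Sketch: clone the model repo onto the device (assumes git and git-lfs are installed)
git lfs install
git clone https://huggingface.co/AXERA-TECH/DeepSeek-R1-Distill-Qwen-7B
```

The resulting layout on the device looks like this: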

```
root@ax650:~/wangli/huggingface/DeepSeek-R1-Distill-Qwen-7B# tree -L 1
.
|-- README.md
|-- config.json
|-- deepseek-r1-7b-ax650
|-- deepseek-r1-7b-int4-ax650
|-- deepseek-r1_tokenizer
|-- deepseek-r1_tokenizer.py
|-- main_ax650
|-- main_axcl_aarch64
|-- main_axcl_x86
|-- post_config.json
|-- run_deepseek-r1_7b_ax650.sh
|-- run_deepseek-r1_7b_axcl_aarch64.sh
|-- run_deepseek-r1_7b_axcl_x86.sh
|-- run_deepseek-r1_7b_int4_ax650.sh
|-- run_deepseek-r1_7b_int4_axcl_aarch64.sh
`-- run_deepseek-r1_7b_int4_axcl_x86.sh

3 directories, 13 files

```

#### Start the Tokenizer service

```
root@ax650:~/wangli/huggingface/DeepSeek-R1-Distill-Qwen-7B# python3 deepseek-r1_tokenizer.py
Server running at http://0.0.0.0:12345
```
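
Before launching the runtime, you can confirm the service answers on its port (this only checks reachability; the service's HTTP endpoints are not documented here):

```
# Check that the tokenizer service responds on port 12345
curl -s -o /dev/null -w "HTTP %{http_code}\n" http://127.0.0.1:12345 \
  || echo "tokenizer service not reachable"
```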

#### System prompt cache

- The system prompt can be preset via the `--system_prompt` option
- The system prompt can be cached as a KV cache in the folder given by `--kvcache_path`, so it loads quickly on the next run
- That folder must be created manually before running, e.g. `mkdir kvcache`; a sketch combining both options follows the script below

```
root@ax650:~/wangli/huggingface/DeepSeek-R1-Distill-Qwen-7B# cat ./run_deepseek-r1_7b_int4_ax650.sh
./main_ax650 \
--template_filename_axmodel "deepseek-r1-7b-int4-ax650/qwen2_p128_l%d_together.axmodel" \
--axmodel_num 28 \
--url_tokenizer_model "http://127.0.0.1:12345" \
--filename_post_axmodel "deepseek-r1-7b-int4-ax650/qwen2_post.axmodel" \
--filename_tokens_embed "deepseek-r1-7b-int4-ax650/model.embed_tokens.weight.bfloat16.bin" \
--tokens_embed_num 152064 \
--tokens_embed_size 3584 \
--use_mmap_load_embed 1 \
--live_print 1
```
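
A minimal sketch of how `--system_prompt` and `--kvcache_path` could be added to that script (the prompt text and cache folder name are illustrative, not taken from this repo):

```
mkdir -p kvcache
./main_ax650 \
--template_filename_axmodel "deepseek-r1-7b-int4-ax650/qwen2_p128_l%d_together.axmodel" \
--axmodel_num 28 \
--url_tokenizer_model "http://127.0.0.1:12345" \
--filename_post_axmodel "deepseek-r1-7b-int4-ax650/qwen2_post.axmodel" \
--filename_tokens_embed "deepseek-r1-7b-int4-ax650/model.embed_tokens.weight.bfloat16.bin" \
--tokens_embed_num 152064 \
--tokens_embed_size 3584 \
--use_mmap_load_embed 1 \
--live_print 1 \
--system_prompt "You are a helpful assistant." \
--kvcache_path "./kvcache"
```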

#### Inference on an AX650 host, such as M4N-Dock(爱芯派Pro) or AX650N DEMO Board

Open another terminal and run `run_deepseek-r1_7b_int4_ax650.sh`

```
root@ax650:~/huggingface/DeepSeek-R1-Distill-Qwen-7B# ./run_deepseek-r1_7b_int4_ax650.sh
[I][                            Init][ 110]: LLM init start
[I][                            Init][  34]: connect http://127.0.0.1:12345 ok
[I][                            Init][  57]: uid: e034d25e-4fcb-4c3b-b19a-df31c278d9a8
bos_id: 151646, eos_id: 151643
  3% | ██                                |   1 /  31 [2.16s<67.02s, 0.46 count/s] tokenizer init ok[I][                            Init][  26]: LLaMaEmbedSelector use mmap
100% | ████████████████████████████████ |  31 /  31 [21.75s<21.75s, 1.43 count/s] init post axmodel ok,remain_cmm(4189 MB)[I][                            Init][ 188]: max_token_len : 2047
[I][                            Init][ 193]: kv_cache_size : 512, kv_cache_num: 2047
[I][                            Init][ 201]: prefill_token_num : 128
[I][                            Init][ 205]: grp: 1, prefill_max_token_num : 1
[I][                            Init][ 205]: grp: 2, prefill_max_token_num : 128
[I][                            Init][ 205]: grp: 3, prefill_max_token_num : 256
[I][                            Init][ 205]: grp: 4, prefill_max_token_num : 384
[I][                            Init][ 205]: grp: 5, prefill_max_token_num : 512
[I][                            Init][ 205]: grp: 6, prefill_max_token_num : 640
[I][                            Init][ 205]: grp: 7, prefill_max_token_num : 768
[I][                            Init][ 205]: grp: 8, prefill_max_token_num : 896
[I][                            Init][ 205]: grp: 9, prefill_max_token_num : 1024
[I][                            Init][ 209]: prefill_max_token_num : 1024
[I][                     load_config][ 282]: load config:
{
    "enable_repetition_penalty": false,
    "enable_temperature": true,
    "enable_top_k_sampling": true,
    "enable_top_p_sampling": false,
    "penalty_window": 20,
    "repetition_penalty": 1.2,
    "temperature": 0.9,
    "top_k": 10,
    "top_p": 0.8
}

[I][                            Init][ 218]: LLM init ok
Type "q" to exit, Ctrl+c to stop current running
[I][          GenerateKVCachePrefill][ 275]: input token num : 13, prefill_split_num : 1 prefill_grpid : 2
[I][          GenerateKVCachePrefill][ 315]: input_num_token:13
[I][                            main][ 228]: precompute_len: 13
[I][                            main][ 229]: system_prompt:
prompt >> 你是谁
[I][                      SetKVCache][ 529]: prefill_grpid:2 kv_cache_num:128 precompute_len:13 input_num_token:6
[I][                      SetKVCache][ 532]: current prefill_max_token_num:896
[I][                             Run][ 658]: input token num : 6, prefill_split_num : 1
[I][                             Run][ 684]: input_num_token:6
[I][                             Run][ 807]: ttft: 764.85 ms
Alright, the user greeted me by saying, "You are DeepSeek. You are a helpful assistant." I need to respond in a friendly and professional manner. I should acknowledge that I'm DeepSeek, an AI assistant, and offer assistance. I'll keep it concise and welcoming.
</think>

您好!我是DeepSeek,一个由深度求索公司开发的智能助手。我随时准备为您提供帮助和解答。请问有什么可以为您服务的?

[N][                             Run][ 921]: hit eos,avg 4.87 token/s

[I][                      GetKVCache][ 498]: precompute_len:110, remaining:914
prompt >> q

```
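
The sampling block printed at load time presumably comes from `post_config.json` in the repo listing above; edit it before launch to adjust decoding. A sketch, lowering the temperature as an illustrative change (keys mirror the logged config):

```
# Sketch: rewrite post_config.json with adjusted sampling settings (values illustrative)
cat > post_config.json <<'EOF'
{
    "enable_repetition_penalty": false,
    "enable_temperature": true,
    "enable_top_k_sampling": true,
    "enable_top_p_sampling": false,
    "penalty_window": 20,
    "repetition_penalty": 1.2,
    "temperature": 0.7,
    "top_k": 10,
    "top_p": 0.8
}
EOF
```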