---
library_name: transformers
license: bsd-3-clause
---
# DeepSeek-R1-Distill-Qwen-7B-AX650
- This version of DeepSeek-R1-Distill-Qwen-7B has been converted to run on the Axera NPU using w8a16 quantization.
- Compatible with Pulsar2 version: 4.2
- Due to the current w8a16 quantization scheme, the model consumes about 7.6 GiB of CMM memory, so a development board with 16 GiB of memory is required to run it.
## Features
- Supports longer contexts; in this sample it is 2k
- Supports multi-turn context dialogue
- Supports caching the system prompt as a KV cache
## Conversion tool links
If you are interested in model conversion, you can try exporting the axmodel from the original repos: https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B and https://huggingface.co/jakiAJK/DeepSeek-R1-Distill-Qwen-7B_GPTQ-int4
[Pulsar2 Link, How to Convert LLM from Huggingface to axmodel](https://pulsar2-docs.readthedocs.io/en/latest/appendix/build_llm.html)
[AXera NPU AXEngine LLM Runtime](https://github.com/AXERA-TECH/ax-llm/tree/ax-context)
[AXera NPU AXCL LLM Runtime](https://github.com/AXERA-TECH/ax-llm/tree/axcl-context)
### Conversion script
The following shows how to convert DeepSeek-R1-Distill-Qwen-7B. Each `--last_kv_cache_len` value registers a prefill length group, which corresponds to the `prefill_max_token_num` groups reported in the runtime init log below.
```
pulsar2 llm_build --input_path deepseek-ai/DeepSeek-R1-Distill-Qwen-7B \
--output_path deepseek-ai/DeepSeek-R1-Distill-Qwen-7B-ax650 \
--hidden_state_type bf16 --kv_cache_len 2047 --prefill_len 128 \
--last_kv_cache_len 128 \
--last_kv_cache_len 256 \
--last_kv_cache_len 384 \
--last_kv_cache_len 512 \
--last_kv_cache_len 640 \
--last_kv_cache_len 768 \
--last_kv_cache_len 896 \
--last_kv_cache_len 1024 \
--last_kv_cache_len 1152 \
--last_kv_cache_len 1280 \
--last_kv_cache_len 1408 \
--last_kv_cache_len 1536 \
--chip AX650 -c 1 --parallel 8
```
## Supported Platforms
- AX650
- AX650N DEMO Board
- [M4N-Dock(爱芯派Pro)](https://wiki.sipeed.com/hardware/zh/maixIV/m4ndock/m4ndock.html)
- [M.2 Accelerator card](https://axcl-docs.readthedocs.io/zh-cn/latest/doc_guide_hardware.html)
- AX630C
- *TBD*
Token generation speed by quantization scheme:

|Chip|w8a16|w4a16|
|--|--|--|
|AX650|2.6 tokens/sec|4.8 tokens/sec|
## How to use
Download all files from this repository to the device
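One way to fetch everything is with `huggingface-cli`. The repository id below is an assumption — substitute this model card's actual id:
```
huggingface-cli download AXERA-TECH/DeepSeek-R1-Distill-Qwen-7B --local-dir DeepSeek-R1-Distill-Qwen-7B
```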
```
root@ax650:~/wangli/huggingface/DeepSeek-R1-Distill-Qwen-7B# tree -L 1
.
|-- README.md
|-- config.json
|-- deepseek-r1-7b-ax650
|-- deepseek-r1-7b-int4-ax650
|-- deepseek-r1_tokenizer
|-- deepseek-r1_tokenizer.py
|-- main_ax650
|-- main_axcl_aarch64
|-- main_axcl_x86
|-- post_config.json
|-- run_deepseek-r1_7b_ax650.sh
|-- run_deepseek-r1_7b_axcl_aarch64.sh
|-- run_deepseek-r1_7b_axcl_x86.sh
|-- run_deepseek-r1_7b_int4_ax650.sh
|-- run_deepseek-r1_7b_int4_axcl_aarch64.sh
`-- run_deepseek-r1_7b_int4_axcl_x86.sh
3 directories, 13 files
```
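The `post_config.json` file holds the sampling settings. Judging from the configuration echoed in the init log further below, it presumably contains:
```
{
    "enable_repetition_penalty": false,
    "enable_temperature": true,
    "enable_top_k_sampling": true,
    "enable_top_p_sampling": false,
    "penalty_window": 20,
    "repetition_penalty": 1.2,
    "temperature": 0.9,
    "top_k": 10,
    "top_p": 0.8
}
```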
#### Start the Tokenizer service
```
root@ax650:~/wangli/huggingface/DeepSeek-R1-Distill-Qwen-7B# python3 deepseek-r1_tokenizer.py
Server running at http://0.0.0.0:12345
```
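The tokenizer runs as a small HTTP service that `main_ax650` connects to via `--url_tokenizer_model`. Below is a minimal sketch of such a service, assuming a JSON-over-HTTP protocol — the `/encode` and `/decode` endpoints and payload fields are hypothetical; the actual protocol is defined by the repository's `deepseek-r1_tokenizer.py`:
```
# Minimal sketch of an HTTP tokenizer service. The /encode and /decode
# endpoints and JSON fields are hypothetical; deepseek-r1_tokenizer.py
# defines the protocol that main_ax650 actually expects.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

from transformers import AutoTokenizer

# Load the tokenizer shipped in this repository's deepseek-r1_tokenizer folder
tokenizer = AutoTokenizer.from_pretrained("deepseek-r1_tokenizer", trust_remote_code=True)

class TokenizerHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        body = json.loads(self.rfile.read(int(self.headers["Content-Length"])))
        if self.path == "/encode":    # hypothetical endpoint
            result = {"token_ids": tokenizer.encode(body["text"])}
        elif self.path == "/decode":  # hypothetical endpoint
            result = {"text": tokenizer.decode(body["token_ids"])}
        else:
            self.send_error(404)
            return
        payload = json.dumps(result).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

print("Server running at http://0.0.0.0:12345")
HTTPServer(("0.0.0.0", 12345), TokenizerHandler).serve_forever()
```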
#### System prompt cache
- The system prompt can be preset via the `--system_prompt` option
- The system prompt can be cached as a KV cache in a specified folder via `--kvcache_path`, so it loads quickly on the next run (a usage sketch follows the script below)
- This folder must be created manually before running, e.g. `mkdir kvcache`
```
root@ax650:~/wangli/huggingface/DeepSeek-R1-Distill-Qwen-7B# cat ./run_deepseek-r1_7b_int4_ax650.sh
./main_ax650 \
--template_filename_axmodel "deepseek-r1-7b-int4-ax650/qwen2_p128_l%d_together.axmodel" \
--axmodel_num 28 \
--url_tokenizer_model "http://127.0.0.1:12345" \
--filename_post_axmodel "deepseek-r1-7b-int4-ax650/qwen2_post.axmodel" \
--filename_tokens_embed "deepseek-r1-7b-int4-ax650/model.embed_tokens.weight.bfloat16.bin" \
--tokens_embed_num 152064 \
--tokens_embed_size 3584 \
--use_mmap_load_embed 1 \
--live_print 1
```
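A sketch of how the two system-prompt-cache options from the previous section might be appended to the same invocation; the prompt string and folder name are illustrative:
```
mkdir -p kvcache   # the cache folder must exist before the first run
./main_ax650 \
--template_filename_axmodel "deepseek-r1-7b-int4-ax650/qwen2_p128_l%d_together.axmodel" \
--axmodel_num 28 \
--url_tokenizer_model "http://127.0.0.1:12345" \
--filename_post_axmodel "deepseek-r1-7b-int4-ax650/qwen2_post.axmodel" \
--filename_tokens_embed "deepseek-r1-7b-int4-ax650/model.embed_tokens.weight.bfloat16.bin" \
--tokens_embed_num 152064 \
--tokens_embed_size 3584 \
--use_mmap_load_embed 1 \
--live_print 1 \
--system_prompt "You are DeepSeek, a helpful assistant." \
--kvcache_path "./kvcache"
```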
#### Inference on an AX650 host, such as the M4N-Dock(爱芯派Pro) or the AX650N DEMO Board
Open another terminal and run `run_deepseek-r1_7b_int4_ax650.sh`
```
root@ax650:~/huggingface/DeepSeek-R1-Distill-Qwen-7B# ./run_deepseek-r1_7b_int4_ax650.sh
[I][ Init][ 110]: LLM init start
[I][ Init][ 34]: connect http://127.0.0.1:12345 ok
[I][ Init][ 57]: uid: e034d25e-4fcb-4c3b-b19a-df31c278d9a8
bos_id: 151646, eos_id: 151643
3% | ██ | 1 / 31 [2.16s<67.02s, 0.46 count/s] tokenizer init ok[I][ Init][ 26]: LLaMaEmbedSelector use mmap
100% | ████████████████████████████████ | 31 / 31 [21.75s<21.75s, 1.43 count/s] init post axmodel ok,remain_cmm(4189 MB)[I][ Init][ 188]: max_token_len : 2047
[I][ Init][ 193]: kv_cache_size : 512, kv_cache_num: 2047
[I][ Init][ 201]: prefill_token_num : 128
[I][ Init][ 205]: grp: 1, prefill_max_token_num : 1
[I][ Init][ 205]: grp: 2, prefill_max_token_num : 128
[I][ Init][ 205]: grp: 3, prefill_max_token_num : 256
[I][ Init][ 205]: grp: 4, prefill_max_token_num : 384
[I][ Init][ 205]: grp: 5, prefill_max_token_num : 512
[I][ Init][ 205]: grp: 6, prefill_max_token_num : 640
[I][ Init][ 205]: grp: 7, prefill_max_token_num : 768
[I][ Init][ 205]: grp: 8, prefill_max_token_num : 896
[I][ Init][ 205]: grp: 9, prefill_max_token_num : 1024
[I][ Init][ 209]: prefill_max_token_num : 1024
[I][ load_config][ 282]: load config:
{
"enable_repetition_penalty": false,
"enable_temperature": true,
"enable_top_k_sampling": true,
"enable_top_p_sampling": false,
"penalty_window": 20,
"repetition_penalty": 1.2,
"temperature": 0.9,
"top_k": 10,
"top_p": 0.8
}
[I][ Init][ 218]: LLM init ok
Type "q" to exit, Ctrl+c to stop current running
[I][ GenerateKVCachePrefill][ 275]: input token num : 13, prefill_split_num : 1 prefill_grpid : 2
[I][ GenerateKVCachePrefill][ 315]: input_num_token:13
[I][ main][ 228]: precompute_len: 13
[I][ main][ 229]: system_prompt:
prompt >> 你是谁
[I][ SetKVCache][ 529]: prefill_grpid:2 kv_cache_num:128 precompute_len:13 input_num_token:6
[I][ SetKVCache][ 532]: current prefill_max_token_num:896
[I][ Run][ 658]: input token num : 6, prefill_split_num : 1
[I][ Run][ 684]: input_num_token:6
[I][ Run][ 807]: ttft: 764.85 ms
Alright, the user greeted me by saying, "You are DeepSeek. You are a helpful assistant." I need to respond in a friendly and professional manner. I should acknowledge that I'm DeepSeek, an AI assistant, and offer assistance. I'll keep it concise and welcoming.
</think>
您好!我是DeepSeek,一个由深度求索公司开发的智能助手。我随时准备为您提供帮助和解答。请问有什么可以为您服务的?
[N][ Run][ 921]: hit eos,avg 4.87 token/s
[I][ GetKVCache][ 498]: precompute_len:110, remaining:914
prompt >> q
```
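For hosts using the M.2 accelerator card, the AXCL variants of the run scripts from the file listing above are used instead. A minimal sketch, assuming the scripts are preconfigured like their AX650 counterpart:
```
# On an x86 host with the AXCL M.2 accelerator card installed
# (aarch64 hosts use run_deepseek-r1_7b_int4_axcl_aarch64.sh)
./run_deepseek-r1_7b_int4_axcl_x86.sh
```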