Commit bd77190 (verified) by drizzlezyk · 1 parent: 5efdd9d

Upload inference/README_EN.md with huggingface_hub

  1. inference/README_EN.md +130 -0
inference/README_EN.md ADDED
## Deployment Guide of openPangu-R-7B-2512 Based on [vllm-ascend](https://github.com/vllm-project/vllm-ascend)

### Deployment Environment Description

The Atlas 800T A2 (64 GB) supports the deployment of openPangu-R-7B-2512.

### A2 Image Building and Launching

Pull the base image:

```
docker pull quay.io/ascend/cann:8.3.rc1.alpha003-910b-ubuntu22.04-py3.11
```

Build the image from the Dockerfile:

```
IMAGE=quay.io/ascend/cann:8.3.rc1.alpha003-910b-ubuntu22.04-py3.11-vllm0.11
docker build -t $IMAGE -f ./Dockerfile .
```

Run the following command to start the container:

```
export IMAGE=quay.io/ascend/cann:8.3.rc1.alpha003-910b-ubuntu22.04-py3.11-vllm0.11  # Use the correct image ID
export NAME=XXX  # Custom container name

# Run the container using the variables defined above.
# Note: if you run Docker with a bridge network, expose the ports needed for multi-node communication in advance.
# --privileged is added to prevent device interference from other Docker containers.
docker run -itd \
    --privileged \
    --ipc=host \
    --name $NAME \
    --network host \
    --device /dev/davinci0 \
    --device /dev/davinci1 \
    --device /dev/davinci2 \
    --device /dev/davinci3 \
    --device /dev/davinci4 \
    --device /dev/davinci5 \
    --device /dev/davinci6 \
    --device /dev/davinci7 \
    --device /dev/davinci_manager \
    --device /dev/devmm_svm \
    --device /dev/hisi_hdc \
    -v /usr/local/dcmi:/usr/local/dcmi \
    -v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
    -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
    -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
    -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
    -v /etc/ascend_install.info:/etc/ascend_install.info \
    -v /mnt/:/mnt/ \
    -v /data:/data \
    -v /home/work:/home/work \
    --entrypoint /bin/bash \
    $IMAGE
```
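
The eight `--device /dev/davinciN` flags above assume a host with eight NPUs. As a minimal sketch, assuming the NPU device nodes follow the `/dev/davinci<N>` naming used in that command, the flags could be generated from whichever nodes actually exist. `build_device_flags` is a hypothetical helper for illustration and is not part of the project.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print one "--device <path>" flag for each numbered
# davinci node found in a directory (default: /dev).
build_device_flags() {
    local dev_dir="${1:-/dev}"
    local node
    for node in "$dev_dir"/davinci[0-9]*; do
        # Skip the literal pattern when no node matches the glob.
        [ -e "$node" ] || continue
        printf -- '--device %s ' "$node"
    done
}

# Inspect the generated flags before pasting them into the docker run command.
build_device_flags /dev
echo
```

The pattern `davinci[0-9]*` deliberately excludes `/dev/davinci_manager`, which still has to be passed explicitly as in the command above.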

Ensure that the model checkpoint and the project code are accessible within the container. If you are not already inside the container, enter it as the root user, then install the dependencies and the custom operators:

```
docker exec -itu root $NAME /bin/bash
cd inference
pip install -r requirements.txt
bash ./cann910B-omni_inference_custom_ops-0.7.0-8.3.RC1-linux-aarch64.run --install-path=/usr/local/Ascend/ascend-toolkit/latest/opp
source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/omni_custom_ops/bin/set_env.bash
pip install omni_inference_ascendc_custom_ops-0.7.0+8.3.rc1.pta2.7.1-cp311-cp311-linux_aarch64.whl --force-reinstall
```

### openPangu-R-7B-2512 Inference

Startup script: `inference/launch.sh`

Command to run openPangu-R-7B-2512:

```
export LOAD_CKPT_DIR=XXX/checkpoint/  # Path to the pangu_7b bf16 weights
bash inference/launch.sh
```
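
Because the launch depends on `LOAD_CKPT_DIR` pointing at the weights, a pre-flight check can catch a wrong path before the server starts loading. The following is a minimal sketch; `check_ckpt_dir` is a hypothetical helper and not part of `inference/launch.sh`.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check: succeed only if the given path is an
# existing, non-empty directory.
check_ckpt_dir() {
    local dir="$1"
    if [ -z "$dir" ]; then
        echo "LOAD_CKPT_DIR is not set" >&2
        return 1
    fi
    if [ ! -d "$dir" ]; then
        echo "checkpoint directory not found: $dir" >&2
        return 1
    fi
    if [ -z "$(ls -A "$dir")" ]; then
        echo "checkpoint directory is empty: $dir" >&2
        return 1
    fi
}

# Usage: check_ckpt_dir "$LOAD_CKPT_DIR" && bash inference/launch.sh
```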

Script example:

```
# Specifying HOST=127.0.0.1 (localhost) means the server can only be accessed from the master device.
# Specifying HOST=0.0.0.0 allows the vLLM server to be accessed from other devices on the same network or even from the internet, provided proper network configuration (e.g., firewall rules, port forwarding) is in place.
HOST=xxx.xxx.xxx.xxx

python $SCRIPT_DIR/vllm_register.py \
    --model $LOCAL_CKPT_DIR \
    --served-model-name ${SERVED_MODEL_NAME:=pangu_7b} \
    --tensor-parallel-size ${TENSOR_PARALLEL_SIZE:=8} \
    --trust-remote-code \
    --host $HOST \
    --port ${PORT:=8000} \
    --max-num-seqs ${MAX_NUM_SEQS:=256} \
    --max-model-len ${MAX_MODEL_LEN:=40960} \
    --tokenizer-mode "slow" \
    --dtype bfloat16 \
    --enable-log-requests \
    --distributed-executor-backend mp \
    --gpu-memory-utilization 0.9 \
    --max-num-batched-tokens ${MAX_NUM_BATCHED_TOKENS:=4096} \
    --no-enable-prefix-caching \
    --enforce_eager \
    --reasoning-parser pangu
```
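
The script example relies on the shell expansion `${VAR:=default}`, which assigns the default only when the variable is unset or empty and then expands to the resulting value. The short demonstration below uses two of the variable names from the script; the value `8192` is only an illustrative override.

```shell
#!/usr/bin/env bash
# ${VAR:=default} assigns the default only when VAR is unset or empty,
# then expands to the value VAR now holds.
unset PORT
echo "port: ${PORT:=8000}"               # PORT was unset, so it becomes 8000

MAX_MODEL_LEN=8192
echo "max len: ${MAX_MODEL_LEN:=40960}"  # already set, so the default is ignored
```

Assuming `inference/launch.sh` follows the same pattern as the script example, a setting could be overridden for one run by setting the variable on the command line, for example `PORT=8080 bash inference/launch.sh`.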

### Send Testing Requests

After the server has launched, send a test request. `PORT` and `SERVED_MODEL_NAME` must be set in the current shell to the values used at launch (the defaults in the script example are `8000` and `pangu_7b`).

```
MASTER_NODE_IP=xxx.xxx.xxx.xxx  # Server node IP
curl http://${MASTER_NODE_IP}:${PORT}/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "'$SERVED_MODEL_NAME'",
        "messages": [
            {
                "role": "user",
                "content": "Who are you?"
            }
        ],
        "max_tokens": 512,
        "temperature": 0
    }'
```
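
Loading the weights can take a while, so a request sent immediately after launch may be refused. The sketch below retries a probe command until it succeeds. `wait_until_ready` is a hypothetical helper, and the commented usage assumes the server also exposes the OpenAI-compatible `/v1/models` route, which this guide does not confirm.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a probe command until it succeeds or the
# retries run out.
# Usage: wait_until_ready <retries> <delay_seconds> <command...>
wait_until_ready() {
    local retries="$1" delay="$2"
    shift 2
    local i
    for ((i = 0; i < retries; i++)); do
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        sleep "$delay"
    done
    return 1
}

# Example (assumes an OpenAI-compatible /v1/models route on the server):
# wait_until_ready 60 5 curl -sf "http://${MASTER_NODE_IP}:${PORT}/v1/models" \
#     && echo "server is ready"
```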