+## Deployment Guide of openPangu-R-7B-2512 Based on [vllm-ascend](https://github.com/vllm-project/vllm-ascend)
+### Deployment Environment Description
+The Atlas 800T A2 (64 GB) supports the deployment of openPangu-R-7B-2512.
+### A2 Image Building and Launching
+Pull the base image:
+```
+docker pull quay.io/ascend/cann:8.3.rc1.alpha003-910b-ubuntu22.04-py3.11
+```
+Build the image with the Dockerfile:
+```
+IMAGE=quay.io/ascend/cann:8.3.rc1.alpha003-910b-ubuntu22.04-py3.11-vllm0.11
+docker build -t $IMAGE -f ./Dockerfile .
+```
+Run the following command to start the container:
+```
+export IMAGE=quay.io/ascend/cann:8.3.rc1.alpha003-910b-ubuntu22.04-py3.11-vllm0.11  # Use the correct image ID
+export NAME=XXX  # Custom container name
+# Run the container using the variables defined above.
+# Note: if Docker is running with a bridge network, expose the ports needed for multi-node communication in advance.
+# To prevent device interference from other Docker containers, add the argument "--privileged".
+docker run -itd \
+--privileged \
+--ipc=host \
+--name $NAME \
+--network host \
+--device /dev/davinci0 \
+--device /dev/davinci1 \
+--device /dev/davinci2 \
+--device /dev/davinci3 \
+--device /dev/davinci4 \
+--device /dev/davinci5 \
+--device /dev/davinci6 \
+--device /dev/davinci7 \
+--device /dev/davinci_manager \
+--device /dev/devmm_svm \
+--device /dev/hisi_hdc \
+-v /usr/local/dcmi:/usr/local/dcmi \
+-v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
+-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
+-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
+-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
+-v /etc/ascend_install.info:/etc/ascend_install.info \
+-v /mnt/:/mnt/ \
+-v /data:/data \
+-v /home/work:/home/work \
+--entrypoint /bin/bash \
+$IMAGE
+```
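The `docker run` command above lists eight `/dev/davinci*` devices by hand. As a minimal sketch, assuming a host that exposes a different number of NPUs, the `--device` flags could instead be generated from whatever device nodes are present. `davinci_device_args` is a hypothetical helper name introduced here for illustration; it is not part of the project.

```bash
#!/usr/bin/env bash
# Print one "--device <path>" pair per NPU device node found in a directory.
# davinci_device_args is a hypothetical helper used only for this sketch.
davinci_device_args() {
    local dev_dir=${1:-/dev}
    local dev
    # davinci[0-9]* matches davinci0, davinci1, ... but not davinci_manager.
    for dev in "$dev_dir"/davinci[0-9]*; do
        # Skip the unexpanded pattern when no device node matches.
        [ -e "$dev" ] || continue
        printf -- '--device %s ' "$dev"
    done
}
```

The output can be spliced into the command in place of the eight hard-coded lines, for example `docker run -itd --privileged $(davinci_device_args) --device /dev/davinci_manager ...`, leaving the remaining flags unchanged.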
+Ensure that the model checkpoint and the project code are accessible within the container. If you are not already inside the container, enter it as the root user, then install the dependencies and the custom operators:
+```
+docker exec -itu root $NAME /bin/bash
+cd inference
+pip install -r requirements.txt
+bash ./cann910B-omni_inference_custom_ops-0.7.0-8.3.RC1-linux-aarch64.run --install-path=/usr/local/Ascend/ascend-toolkit/latest/opp
+source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/omni_custom_ops/bin/set_env.bash
+pip install omni_inference_ascendc_custom_ops-0.7.0+8.3.rc1.pta2.7.1-cp311-cp311-linux_aarch64.whl --force-reinstall
+```
+### openPangu-R-7B-2512 Inference
+Startup script: `inference/launch.sh`
+
+Command to run openPangu-R-7B-2512:
+```
+export LOAD_CKPT_DIR=XXX/checkpoint/   # Path to the pangu_7b bf16 weights
+bash inference/launch.sh
+```
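Before launching, it can help to confirm that the checkpoint path is actually visible inside the container, since a missing volume mount is a common cause of startup failures. The following is a minimal sketch, assuming the weights are stored in Hugging Face format (a directory containing `config.json`). `check_ckpt_dir` is a hypothetical helper and not part of `launch.sh`.

```bash
#!/usr/bin/env bash
# Return 0 if the given directory looks like a loadable checkpoint, 1 otherwise.
# check_ckpt_dir is a hypothetical helper; it is not part of launch.sh.
check_ckpt_dir() {
    local ckpt_dir=$1
    if [ ! -d "$ckpt_dir" ]; then
        echo "checkpoint directory not found: $ckpt_dir" >&2
        return 1
    fi
    # Assumption: the weights are stored in Hugging Face format, which
    # includes a config.json next to the weight files.
    if [ ! -f "$ckpt_dir/config.json" ]; then
        echo "config.json missing in: $ckpt_dir" >&2
        return 1
    fi
    return 0
}
```

A possible usage is `check_ckpt_dir "$LOAD_CKPT_DIR" && bash inference/launch.sh`, which starts the server only when the check passes.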
+Script example:
+```
+# Specifying HOST=127.0.0.1 (localhost) means the server can only be accessed from the master device.
+# Specifying HOST=0.0.0.0 allows the vLLM server to be accessed from other devices on the same network or even from the internet, provided proper network configuration (e.g., firewall rules, port forwarding) is in place.
+HOST=xxx.xxx.xxx.xxx
+python "$SCRIPT_DIR/vllm_register.py" \
+    --model "$LOCAL_CKPT_DIR" \
+    --served-model-name ${SERVED_MODEL_NAME:=pangu_7b} \
+    --tensor-parallel-size ${TENSOR_PARALLEL_SIZE:=8} \
+    --trust-remote-code \
+    --host "$HOST" \
+    --port ${PORT:=8000} \
+    --max-num-seqs ${MAX_NUM_SEQS:=256} \
+    --max-model-len ${MAX_MODEL_LEN:=40960} \
+    --tokenizer-mode "slow" \
+    --dtype bfloat16 \
+    --enable-log-requests \
+    --distributed-executor-backend mp \
+    --gpu-memory-utilization 0.9 \
+    --max-num-batched-tokens ${MAX_NUM_BATCHED_TOKENS:=4096} \
+    --no-enable-prefix-caching \
+    --enforce-eager \
+    --reasoning-parser pangu
+```
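Several options in the script example use the shell expansion `${VAR:=default}`, which assigns the default only when the variable is unset or empty. The short sketch below illustrates that behavior in isolation. `show_port` is a hypothetical function used purely for demonstration.

```bash
#!/usr/bin/env bash
# Demonstrates the ${VAR:=default} expansion used in the script example.
# show_port is a hypothetical helper for illustration only.
show_port() {
    # If PORT is unset or empty, assign 8000 to it; otherwise keep its value.
    echo "port=${PORT:=8000}"
}

unset PORT
show_port          # PORT was unset, so the default is assigned

PORT=9000
show_port          # an existing value takes precedence over the default
```

Assuming `launch.sh` matches the script example, settings such as the port or the maximum model length could therefore be overridden from the environment, for instance `PORT=9000 MAX_MODEL_LEN=32768 bash inference/launch.sh`, without editing the script.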
+### Send Testing Requests
+After the server has started, send a test request:
+```
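+MASTER_NODE_IP=xxx.xxx.xxx.xxx  # Server node IP
+PORT=${PORT:-8000}  # Same port the server listens on (script default: 8000)
+SERVED_MODEL_NAME=${SERVED_MODEL_NAME:-pangu_7b}  # Same name passed to --served-model-name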
+curl http://${MASTER_NODE_IP}:${PORT}/v1/chat/completions \
+    -H "Content-Type: application/json" \
+    -d '{
+        "model": "'$SERVED_MODEL_NAME'",
+        "messages": [
+            {
+                "role": "user",
+                "content": "Who are you?"
+            }
+        ],
+        "max_tokens": 512,
+        "temperature": 0
+    }'
+```
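Loading the model can take a while, so a request sent immediately after starting the server may be refused. As a minimal sketch, the helper below polls the `/v1/models` endpoint of the OpenAI-compatible API until it answers. `wait_for_server`, its retry count, and its delay are assumptions introduced here, not part of the project scripts.

```bash
#!/usr/bin/env bash
# Poll the OpenAI-compatible /v1/models endpoint until the server answers.
# wait_for_server is a hypothetical helper, not part of launch.sh.
wait_for_server() {
    local base_url=$1
    local retries=${2:-60}   # assumed default: 60 attempts
    local delay=${3:-5}      # assumed default: 5 seconds between attempts
    local i
    for ((i = 1; i <= retries; i++)); do
        # -s: silent, -f: treat HTTP errors as failure, -o: discard the body
        if curl -sf -o /dev/null --max-time 5 "${base_url}/v1/models"; then
            echo "server is ready after ${i} attempt(s)"
            return 0
        fi
        sleep "$delay"
    done
    echo "server did not become ready after ${retries} attempt(s)" >&2
    return 1
}
```

For example, `wait_for_server "http://${MASTER_NODE_IP}:${PORT}"` can be run before the `curl` request above, so that the request is sent only once the server responds.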