	## Deployment Guide of openPangu-R-7B-2512 Based on [vllm-ascend](https://github.com/vllm-project/vllm-ascend)

	### Deployment Environment Description

	The Atlas 800T A2 (64 GB) supports the deployment of openPangu-R-7B-2512.

	### A2 Image Building and Launching

	Pull the base image:

	```
	docker pull quay.io/ascend/cann:8.3.rc1.alpha003-910b-ubuntu22.04-py3.11
	```

	Use the Dockerfile to build the image:

	```
	IMAGE=quay.io/ascend/cann:8.3.rc1.alpha003-910b-ubuntu22.04-py3.11-vllm0.11
	docker build -t $IMAGE -f ./Dockerfile .
	```

	Run the following commands to start the container:

	```
	export IMAGE=quay.io/ascend/cann:8.3.rc1.alpha003-910b-ubuntu22.04-py3.11-vllm0.11 # Use the name or ID of the image built above
	export NAME=XXX # Custom container name

	# Run the container using the variables defined above.
	# Note: if you use Docker's bridge network instead of "--network host", expose the ports needed for multi-node communication in advance.
	# To prevent device interference from other Docker containers, add the argument "--privileged".
	docker run -itd \
	--privileged \
	--ipc=host \
	--name $NAME \
	--network host \
	--device /dev/davinci0 \
	--device /dev/davinci1 \
	--device /dev/davinci2 \
	--device /dev/davinci3 \
	--device /dev/davinci4 \
	--device /dev/davinci5 \
	--device /dev/davinci6 \
	--device /dev/davinci7 \
	--device /dev/davinci_manager \
	--device /dev/devmm_svm \
	--device /dev/hisi_hdc \
	-v /usr/local/dcmi:/usr/local/dcmi \
	-v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
	-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
	-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
	-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
	-v /etc/ascend_install.info:/etc/ascend_install.info \
	-v /mnt/:/mnt/ \
	-v /data:/data \
	-v /home/work:/home/work \
	--entrypoint /bin/bash \
	$IMAGE
	```
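The command above passes all eight NPUs to the container. When only a subset of cards is needed, the `--device` flags can be generated instead of typed by hand. The following is a minimal sketch: the helper name `davinci_device_args` is hypothetical and not part of this project, and it assumes the devices are numbered from `/dev/davinci0` upward as in the command above.

```shell
# Hypothetical helper: print "--device /dev/davinciN" flags for the first
# COUNT NPUs (default 8, matching the docker run command above).
# The body runs in a subshell so its variables do not leak into the caller.
davinci_device_args() (
    count="${1:-8}"
    i=0
    args=""
    while [ "$i" -lt "$count" ]; do
        # Append a flag, adding a separating space only after the first one.
        args="${args}${args:+ }--device /dev/davinci${i}"
        i=$((i + 1))
    done
    printf '%s\n' "$args"
)

# Example: print the flags for a two-card container.
davinci_device_args 2
```

The output can be substituted unquoted into the command, for example `docker run -itd $(davinci_device_args 2) ...`, so that the shell splits it into separate arguments.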

	Ensure that the model checkpoint and the project code are accessible within the container. If you are not already inside the container, enter it as the root user, then install the dependencies and the custom operators:

	```
	docker exec -itu root $NAME /bin/bash
	cd inference
	pip install -r requirements.txt
	bash ./cann910B-omni_inference_custom_ops-0.7.0-8.3.RC1-linux-aarch64.run --install-path=/usr/local/Ascend/ascend-toolkit/latest/opp
	source /usr/local/Ascend/ascend-toolkit/latest/opp/vendors/omni_custom_ops/bin/set_env.bash
	pip install omni_inference_ascendc_custom_ops-0.7.0+8.3.rc1.pta2.7.1-cp311-cp311-linux_aarch64.whl --force-reinstall
	```
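Before launching the server, it can help to confirm from inside the container that the checkpoint and code directories are actually visible through the mounted volumes. The sketch below is one way to do that; `require_dirs` is a hypothetical helper, and the path in the example is a placeholder to be replaced with your own checkpoint and code locations.

```shell
# Hypothetical helper: succeed only if every given path is a readable directory.
require_dirs() {
    for dir in "$@"; do
        if [ ! -d "$dir" ] || [ ! -r "$dir" ]; then
            echo "missing or unreadable: $dir" >&2
            return 1
        fi
    done
    return 0
}

# Placeholder path; substitute the checkpoint and project code directories.
require_dirs / && echo "paths OK"
```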

	### openPangu-R-7B-2512 Inference

	Startup script: `inference/launch.sh`

	Command to run openPangu-R-7B-2512:

	```
	export LOAD_CKPT_DIR=XXX/checkpoint/ # Path to the pangu_7b bf16 weights
	bash inference/launch.sh
	```

	Script example:

	```
	# Specifying HOST=127.0.0.1 (localhost) means the server can only be accessed from the master device.
	# Specifying HOST=0.0.0.0 allows the vLLM server to be accessed from other devices on the same network or even from the internet, provided proper network configuration (e.g., firewall rules, port forwarding) is in place.
	HOST=xxx.xxx.xxx.xxx

	python $SCRIPT_DIR/vllm_register.py \
	--model $LOCAL_CKPT_DIR \
	--served-model-name ${SERVED_MODEL_NAME:=pangu_7b} \
	--tensor-parallel-size ${TENSOR_PARALLEL_SIZE:=8} \
	--trust-remote-code \
	--host $HOST \
	--port ${PORT:=8000} \
	--max-num-seqs ${MAX_NUM_SEQS:=256} \
	--max-model-len ${MAX_MODEL_LEN:=40960} \
	--tokenizer-mode "slow" \
	--dtype bfloat16 \
	--enable-log-requests \
	--distributed-executor-backend mp \
	--gpu-memory-utilization 0.9 \
	--max-num-batched-tokens ${MAX_NUM_BATCHED_TOKENS:=4096} \
	--no-enable-prefix-caching \
	--enforce-eager \
	--reasoning-parser pangu
	```
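The script example fills in most settings with the `${VAR:=default}` form, so a value already present in the environment takes precedence and the default applies only when the variable is unset or empty. The sketch below illustrates just that shell behavior in isolation; it does not invoke `launch.sh`, and the values are examples.

```shell
# ${VAR:=default} assigns the default only when VAR is unset or empty.
unset PORT MAX_MODEL_LEN

PORT=9000                       # preset by the caller, so it is kept
: "${PORT:=8000}"
: "${MAX_MODEL_LEN:=40960}"     # unset, so the default is assigned

echo "PORT=$PORT MAX_MODEL_LEN=$MAX_MODEL_LEN"
```

Assuming `launch.sh` follows the script example, running `PORT=9000 bash inference/launch.sh` would therefore start the server on port 9000 while leaving the other defaults in place.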

	### Send Testing Requests

	Once the server is running, send a test request:

	```
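	MASTER_NODE_IP=xxx.xxx.xxx.xxx # IP address of the server node
	PORT=${PORT:-8000} # Must match the port the server was launched with
	SERVED_MODEL_NAME=${SERVED_MODEL_NAME:-pangu_7b} # Must match --served-model-name
	curl http://${MASTER_NODE_IP}:${PORT}/v1/chat/completions \
	  -H "Content-Type: application/json" \
	  -d '{
	    "model": "'$SERVED_MODEL_NAME'",
	    "messages": [
	      {
	        "role": "user",
	        "content": "Who are you?"
	      }
	    ],
	    "max_tokens": 512,
	    "temperature": 0
	  }'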
	```
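The request above splices `$SERVED_MODEL_NAME` into the JSON body by closing and reopening the single quotes, which breaks if the model name or prompt contains a double quote or backslash. A minimal sketch of an alternative is shown below. The helpers `json_escape` and `chat_payload` are hypothetical, and the escaping covers only backslashes and double quotes, not newlines or other control characters.

```shell
# Hypothetical helper: escape backslashes and double quotes for a JSON string.
json_escape() {
    printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Hypothetical helper: build a single-message chat completion request body.
# Usage: chat_payload MODEL_NAME PROMPT
chat_payload() (
    model="$(json_escape "$1")"
    prompt="$(json_escape "$2")"
    printf '{"model": "%s", "messages": [{"role": "user", "content": "%s"}], "max_tokens": 512, "temperature": 0}\n' \
        "$model" "$prompt"
)

# Example: print the request body for a prompt that contains quotes.
chat_payload pangu_7b 'Say "hi"'
```

The result can be passed to the same endpoint with `curl ... -d "$(chat_payload "$SERVED_MODEL_NAME" "Who are you?")"`.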