# Deployment Guide for openPangu-R-72B-2512-Int8 on Omni-Infer

## Hardware Environment and Deployment Method
PD (prefill-decode) hybrid deployment, requiring only 4 dies of a single Atlas 800T A3 machine.

## Code and Image
- Omni-Infer code version: release_v0.7.0
- Docker image: refer to the v0.7.0 images at https://gitee.com/omniai/omniinfer/releases. For example, for A3 hardware and the ARM architecture, use `docker pull swr.cn-east-4.myhuaweicloud.com/omni/omniinfer-a3-arm:release_v0.7.0-vllm`.

## Deployment
### 1. Launch the image
```bash
IMAGE=swr.cn-east-4.myhuaweicloud.com/omni/omniinfer-a3-arm:release_v0.7.0-vllm
NAME=omniinfer-v0.7.0   # custom container name
NPU_NUM=16              # mount all 16 dies of the A3 node (the deployment itself needs only 4)
# Build one "--device /dev/davinciN" flag per die
DEVICE_ARGS=$(for i in $(seq 0 $((NPU_NUM-1))); do echo -n "--device /dev/davinci${i} "; done)

# Run the container using the variables defined above.
# Note: if you run Docker with a bridge network, expose the ports needed for
# multi-node communication in advance.
# The "--privileged" flag prevents device interference from other containers.
docker run -itd \
    --name=${NAME} \
    --network host \
    --privileged \
    --ipc=host \
    $DEVICE_ARGS \
    --device=/dev/davinci_manager \
    --device=/dev/devmm_svm \
    --device=/dev/hisi_hdc \
    -v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
    -v /usr/local/Ascend/firmware:/usr/local/Ascend/firmware \
    -v /usr/local/sbin/npu-smi:/usr/local/sbin/npu-smi \
    -v /etc/ascend_install.info:/etc/ascend_install.info \
    -v /mnt/:/mnt/ \
    -v /data:/data \
    -v /home/work:/home/work \
    --entrypoint /bin/bash \
    ${IMAGE}
```
Ensure that the model checkpoint and the project code are accessible inside the container, then enter it:
```bash
docker exec -it $NAME /bin/bash
```
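Optionally, verify that the NPU devices are visible inside the container (`npu-smi` is mounted from the host by the launch command above):
```bash
# Should list every die of the A3 node
npu-smi info
```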
### 2. Download the Omni-Infer code and add the following configuration to omniinfer/omni/models/configs/best_practice_configs.json
```bash
git clone -b release_v0.7.0 https://gitee.com/omniai/omniinfer.git
```
```json
{
    "model": "pangu_pro_moe_v2",
    "hardware": "A3",
    "precision": "w8a8",
    "prefill_node_num": 1,
    "decode_node_num": 1,
    "pd_disaggregation": false,
    "prefill_config_file": "pangu_pro_moe_v2_bf16_a3_hybrid.json",
    "decode_config_file": "pangu_pro_moe_v2_bf16_a3_hybrid.json"
}
```
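If you prefer to script this edit, the sketch below appends the entry with `jq`. It is only a sketch: it assumes best_practice_configs.json holds a top-level JSON array, so check the file's actual structure before applying it.
```bash
CFG=omniinfer/omni/models/configs/best_practice_configs.json
# Assumption: the config file is a top-level JSON array of entries
jq '. += [{
  "model": "pangu_pro_moe_v2",
  "hardware": "A3",
  "precision": "w8a8",
  "prefill_node_num": 1,
  "decode_node_num": 1,
  "pd_disaggregation": false,
  "prefill_config_file": "pangu_pro_moe_v2_bf16_a3_hybrid.json",
  "decode_config_file": "pangu_pro_moe_v2_bf16_a3_hybrid.json"
}]' "$CFG" > "$CFG.tmp" && mv "$CFG.tmp" "$CFG"
```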

### 3. Copy examples/start_serving_openpangu_r_72b_2512.sh into the omniinfer/tools/scripts path and start the serving script

```bash
cd omniinfer/tools/scripts
# Modify the model path, master IP address, and PYTHONPATH in the serving script before running it.
bash start_serving_openpangu_r_72b_2512.sh
```
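Once the script reports the server as up, a quick readiness probe (assuming the OpenAI-compatible endpoint on port 8000 used by the requests below) is:
```bash
# Lists the served model names; a JSON response means the server is ready
curl http://0.0.0.0:8000/v1/models
```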

### 4. Send Testing Requests

After the service has started, you can send test requests.

```bash
curl http://0.0.0.0:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openpangu_r_72b_2512",
    "messages": [
      {
        "role": "user",
        "content": "Who are you?"
      }
    ],
    "temperature": 1.0,
    "top_p": 0.8,
    "top_k": -1,
    "vllm_xargs": {"top_n_sigma": 0.05},
    "chat_template_kwargs": {"think": true, "reasoning_effort": "low"}
  }'
```
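To inspect only the reply text rather than the full JSON response, the same request can be piped through `jq` (standard OpenAI-style response shape assumed):
```bash
# Same request as above, extracting only the assistant reply (requires jq)
curl -s http://0.0.0.0:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openpangu_r_72b_2512",
    "messages": [{"role": "user", "content": "Who are you?"}],
    "chat_template_kwargs": {"think": true, "reasoning_effort": "low"}
  }' | jq -r '.choices[0].message.content'
```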
```bash
# tool use
curl http://0.0.0.0:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openpangu_r_72b_2512",
    "messages": [
      {"role": "system", "content": "You are the Pangu model developed by Huawei.\nToday is July 30, 2025"},
      {"role": "user", "content": "What will the weather be like in Shenzhen tomorrow?"}
    ],
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "get_current_weather",
          "description": "Get the current weather for a given city, including temperature, humidity, wind speed, and other data.",
          "parameters": {
            "type": "object",
            "properties": {
              "location": {
                "type": "string",
                "description": "City name, e.g. Beijing or Shenzhen. Chinese or pinyin input is supported."
              },
              "date": {
                "type": "string",
                "description": "Query date in YYYY-MM-DD format (ISO 8601), e.g. 2023-10-01."
              }
            },
            "required": ["location", "date"],
            "additionalProperties": false
          }
        }
      }
    ],
    "temperature": 1.0,
    "top_p": 0.8,
    "top_k": -1,
    "vllm_xargs": {"top_n_sigma": 0.05},
    "chat_template_kwargs": {"think": true, "reasoning_effort": "high"}
  }'
```
The model runs in slow-thinking mode by default. In slow-thinking mode, you can trade accuracy against efficiency by setting the "reasoning_effort" parameter in "chat_template_kwargs" to "high" or "low".
openPangu-R-72B-2512-Int8 supports switching between slow-thinking and fast-thinking modes by setting {"think": true/false} in "chat_template_kwargs".
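For example, a minimal fast-thinking request (same endpoint and model name as above) looks like this:
```bash
# "think": false switches the model to fast-thinking mode for this request
curl http://0.0.0.0:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openpangu_r_72b_2512",
    "messages": [{"role": "user", "content": "Who are you?"}],
    "temperature": 1.0,
    "top_p": 0.8,
    "chat_template_kwargs": {"think": false}
  }'
```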
|