# Deployment Guide for openPangu-R-72B-2512-Int8 on Omni-Infer

## Hardware Environment and Deployment Method
PD hybrid deployment, requiring only 4 dies of one Atlas 800T A3 machine.

## Code and Image
- Omni-Infer code version: release_v0.7.0
- Docker image: refer to the v0.7.0 images at https://gitee.com/omniai/omniinfer/releases. For example, for A3 hardware on the ARM architecture, run `docker pull swr.cn-east-4.myhuaweicloud.com/omni/omniinfer-a3-arm:release_v0.7.0-vllm`.

## Deployment
### 1. Launch the image
```bash
IMAGE=swr.cn-east-4.myhuaweicloud.com/omni/omniinfer-a3-arm:release_v0.7.0-vllm
NAME=omniinfer-v0.7.0  # custom container name
NPU_NUM=16             # 16 dies of an A3 node
DEVICE_ARGS=$(for i in $(seq 0 $((NPU_NUM-1))); do echo -n "--device /dev/davinci${i} "; done)

# Run the container using the variables defined above.
# Note: if the container uses Docker's bridge network, expose the ports
# required for multi-node communication in advance.
# The "--privileged" flag prevents device interference from other containers.
docker run -itd \
    --name=${NAME} \
    --network host \
    --privileged \
    --ipc=host \
    ${DEVICE_ARGS} \
    --device=/dev/davinci_manager \
    --device=/dev/devmm_svm \
    --device=/dev/hisi_hdc \
    -v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
    -v /usr/local/Ascend/firmware:/usr/local/Ascend/firmware \
    -v /usr/local/sbin/npu-smi:/usr/local/sbin/npu-smi \
    -v /etc/ascend_install.info:/etc/ascend_install.info \
    -v /mnt/:/mnt/ \
    -v /data:/data \
    -v /home/work:/home/work \
    --entrypoint /bin/bash \
    ${IMAGE}
```
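As a sanity check, the `DEVICE_ARGS` loop can be run on its own to see the flags it generates; a minimal illustration with `NPU_NUM=4` for brevity:

```bash
# Expands to one "--device /dev/davinciN" flag per die (illustration only).
NPU_NUM=4
DEVICE_ARGS=$(for i in $(seq 0 $((NPU_NUM-1))); do echo -n "--device /dev/davinci${i} "; done)
echo "$DEVICE_ARGS"
```

With `NPU_NUM=16`, as in the launch script above, the same loop maps all 16 dies of the A3 node into the container.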
Ensure that the model checkpoint and the project code are accessible within the container, then enter it:
```bash
docker exec -it ${NAME} /bin/bash
```
### 2. Download the Omni-Infer code and add the following configuration to omniinfer/omni/models/configs/best_practice_configs.json
```bash
git clone -b release_v0.7.0 https://gitee.com/omniai/omniinfer.git
```
```json
{
    "model": "pangu_pro_moe_v2",
    "hardware": "A3",
    "precision": "w8a8",
    "prefill_node_num": 1,
    "decode_node_num": 1,
    "pd_disaggregation": false,
    "prefill_config_file": "pangu_pro_moe_v2_bf16_a3_hybrid.json",
    "decode_config_file": "pangu_pro_moe_v2_bf16_a3_hybrid.json"
}
```
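When editing best_practice_configs.json by hand, a quick way to catch paste errors is to re-parse the fragment before restarting the service. A minimal sketch (the string below mirrors the documented entry; it is not read from the actual file):

```python
import json

# Mirror of the configuration entry documented above (hand-copied for checking).
fragment = """
{
    "model": "pangu_pro_moe_v2",
    "hardware": "A3",
    "precision": "w8a8",
    "prefill_node_num": 1,
    "decode_node_num": 1,
    "pd_disaggregation": false,
    "prefill_config_file": "pangu_pro_moe_v2_bf16_a3_hybrid.json",
    "decode_config_file": "pangu_pro_moe_v2_bf16_a3_hybrid.json"
}
"""

entry = json.loads(fragment)  # raises ValueError on malformed JSON

# In PD hybrid mode (pd_disaggregation == false) prefill and decode
# share a single config file, so the two filenames should match.
assert entry["pd_disaggregation"] is False
assert entry["prefill_config_file"] == entry["decode_config_file"]
print("config fragment OK")
```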

### 3. Put examples/start_serving_openpangu_r_72b_2512.sh in the omniinfer/tools/scripts path and start the serving script

```bash
cd omniinfer/tools/scripts
# Modify the model path, master IP address, and PYTHONPATH in the serving script first.
bash start_serving_openpangu_r_72b_2512.sh
```

### 4. Send Test Requests

After the service has started, send a test request:

```bash
curl http://0.0.0.0:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openpangu_r_72b_2512",
    "messages": [
      {
        "role": "user",
        "content": "Who are you?"
      }
    ],
    "temperature": 1.0,
    "top_p": 0.8,
    "top_k": -1,
    "vllm_xargs": {"top_n_sigma": 0.05},
    "chat_template_kwargs": {"think": true, "reasoning_effort": "low"}
  }'
```
```bash
# Tool use
curl http://0.0.0.0:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openpangu_r_72b_2512",
    "messages": [
      {"role": "system", "content": "You are the Pangu model developed by Huawei.\nToday is July 30, 2025."},
      {"role": "user", "content": "What will the weather be like in Shenzhen tomorrow?"}
    ],
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "get_current_weather",
          "description": "Get the current weather for a given city, including temperature, humidity, and wind speed.",
          "parameters": {
            "type": "object",
            "properties": {
              "location": {
                "type": "string",
                "description": "City name, e.g. Beijing or Shenzhen. Chinese or pinyin input is supported."
              },
              "date": {
                "type": "string",
                "description": "Query date in YYYY-MM-DD format (ISO 8601), e.g. 2023-10-01."
              }
            },
            "required": ["location", "date"],
            "additionalProperties": false
          }
        }
      }
    ],
    "temperature": 1.0,
    "top_p": 0.8,
    "top_k": -1,
    "vllm_xargs": {"top_n_sigma": 0.05},
    "chat_template_kwargs": {"think": true, "reasoning_effort": "high"}
  }'
```
The model runs in slow-thinking mode by default. In slow-thinking mode, you can balance accuracy against efficiency by setting the "reasoning_effort" parameter in "chat_template_kwargs" to "high" or "low".
openPangu-R-72B-2512-Int8 supports switching between slow-thinking and fast-thinking modes by setting {"think": true/false} in "chat_template_kwargs".
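For programmatic access, the same request body can be built in Python. A minimal standard-library sketch (the model name, sampling parameters, and endpoint are taken from the curl examples above; actually sending the request assumes the service from step 3 is running):

```python
import json
import urllib.request

def build_payload(prompt, think=True, reasoning_effort="low"):
    """Build a chat-completions payload mirroring the curl examples above."""
    return {
        "model": "openpangu_r_72b_2512",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 1.0,
        "top_p": 0.8,
        "top_k": -1,
        "vllm_xargs": {"top_n_sigma": 0.05},
        # think=False switches to fast-thinking mode; reasoning_effort
        # only applies in slow-thinking mode (think=True).
        "chat_template_kwargs": {"think": think, "reasoning_effort": reasoning_effort},
    }

payload = build_payload("Who are you?", think=True, reasoning_effort="high")
body = json.dumps(payload).encode("utf-8")

# Uncomment to send once the service is up:
# req = urllib.request.Request(
#     "http://0.0.0.0:8000/v1/chat/completions",
#     data=body, headers={"Content-Type": "application/json"})
# print(urllib.request.urlopen(req).read().decode())
print(json.loads(body)["chat_template_kwargs"])
```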