# NPU Support
Author: [chuanzhubin](https://github.com/chuanzhubin)

## Environment Preparation

Experiment environment: 8 * Ascend 910B3 64GB (the device is provided by [@chuanzhubin](https://github.com/chuanzhubin); thanks for supporting ModelScope and SWIFT~)

```shell
# Create a new conda virtual environment (optional)
conda create -n swift-npu python=3.10 -y
conda activate swift-npu

# Set a global pip mirror (optional, to speed up downloads)
pip config set global.index-url https://mirrors.aliyun.com/pypi/simple/
pip install ms-swift -U

# Install torch-npu
pip install torch-npu decorator
# If you want to use deepspeed (reduces memory usage; training speed may decrease)
pip install deepspeed
```
22
+
23
+ Check if the test environment is installed correctly and whether the NPU can be loaded properly.
24
+ ```python
25
+ from transformers.utils import is_torch_npu_available
26
+ import torch
27
+
28
+ print(is_torch_npu_available()) # True
29
+ print(torch.npu.device_count()) # 8
30
+ print(torch.randn(10, device='npu:0'))
31
+ ```

Check the P2P connectivity of the NPUs. Here we can see that each NPU is interconnected with the other seven NPUs through 7 HCCS links.
```shell
(valle) root@valle:~/src# npu-smi info -t topo
       NPU0  NPU1  NPU2  NPU3  NPU4  NPU5  NPU6  NPU7  CPU Affinity
NPU0   X     HCCS  HCCS  HCCS  HCCS  HCCS  HCCS  HCCS  144-167
NPU1   HCCS  X     HCCS  HCCS  HCCS  HCCS  HCCS  HCCS  144-167
NPU2   HCCS  HCCS  X     HCCS  HCCS  HCCS  HCCS  HCCS  96-119
NPU3   HCCS  HCCS  HCCS  X     HCCS  HCCS  HCCS  HCCS  96-119
NPU4   HCCS  HCCS  HCCS  HCCS  X     HCCS  HCCS  HCCS  0-23
NPU5   HCCS  HCCS  HCCS  HCCS  HCCS  X     HCCS  HCCS  0-23
NPU6   HCCS  HCCS  HCCS  HCCS  HCCS  HCCS  X     HCCS  48-71
NPU7   HCCS  HCCS  HCCS  HCCS  HCCS  HCCS  HCCS  X     48-71

Legend:

  X    = Self
  SYS  = Path traversing PCIe and NUMA nodes. Nodes are connected through SMP, such as QPI, UPI.
  PHB  = Path traversing PCIe and the PCIe host bridge of a CPU.
  PIX  = Path traversing a single PCIe switch
  PXB  = Path traversing multiple PCIe switches
  HCCS = Connection traversing HCCS.
  NA   = Unknown relationship.
```
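
The matrix above lends itself to a quick scripted sanity check. Below is a minimal sketch (the parser and the shortened sample are illustrative, not part of any Ascend tooling); run on the full 8-card matrix above, each NPU would report 7 links:

```python
def hccs_links_per_npu(topo_text):
    """Count HCCS links per NPU row of `npu-smi info -t topo` output."""
    links = {}
    for line in topo_text.splitlines():
        parts = line.split()
        # Data rows start with an NPU label and contain the self marker 'X';
        # the header row has NPU labels but no 'X', so it is skipped.
        if parts and parts[0].startswith("NPU") and "X" in parts:
            links[parts[0]] = parts.count("HCCS")
    return links

# Shortened 3-NPU sample in the same layout as the real output:
sample = """\
        NPU0  NPU1  NPU2  CPU-Affinity
NPU0    X     HCCS  HCCS  144-167
NPU1    HCCS  X     HCCS  144-167
NPU2    HCCS  HCCS  X     96-119
"""
print(hccs_links_per_npu(sample))  # {'NPU0': 2, 'NPU1': 2, 'NPU2': 2}
```
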

Check the status of the NPUs. Detailed information about the `npu-smi` command can be found in the [official documentation](https://support.huawei.com/enterprise/zh/doc/EDOC1100079287/10dcd668).
```shell
(valle) root@valle:~/src# npu-smi info
+------------------------------------------------------------------------------------------------+
| npu-smi 24.1.rc1.b030                            Version: 24.1.rc1.b030                        |
+---------------------------+---------------+----------------------------------------------------+
| NPU     Name              | Health        | Power(W)     Temp(C)     Hugepages-Usage(page)     |
| Chip                      | Bus-Id        | AICore(%)    Memory-Usage(MB)    HBM-Usage(MB)     |
+===========================+===============+====================================================+
| 0       910B3             | OK            | 101.8        43          0    / 0                  |
| 0                         | 0000:C1:00.0  | 0            0    / 0            3318 / 65536      |
+===========================+===============+====================================================+
| 1       910B3             | OK            | 92.0         39          0    / 0                  |
| 0                         | 0000:C2:00.0  | 0            0    / 0            3314 / 65536      |
+===========================+===============+====================================================+
| 2       910B3             | OK            | 102.0        40          0    / 0                  |
| 0                         | 0000:81:00.0  | 0            0    / 0            3314 / 65536      |
+===========================+===============+====================================================+
| 3       910B3             | OK            | 99.8         40          0    / 0                  |
| 0                         | 0000:82:00.0  | 0            0    / 0            3314 / 65536      |
+===========================+===============+====================================================+
| 4       910B3             | OK            | 98.6         45          0    / 0                  |
| 0                         | 0000:01:00.0  | 0            0    / 0            3314 / 65536      |
+===========================+===============+====================================================+
| 5       910B3             | OK            | 99.7         44          0    / 0                  |
| 0                         | 0000:02:00.0  | 0            0    / 0            3314 / 65536      |
+===========================+===============+====================================================+
| 6       910B3             | OK            | 103.8        45          0    / 0                  |
| 0                         | 0000:41:00.0  | 0            0    / 0            3314 / 65536      |
+===========================+===============+====================================================+
| 7       910B3             | OK            | 98.2         44          0    / 0                  |
| 0                         | 0000:42:00.0  | 0            0    / 0            3315 / 65536      |
+===========================+===============+====================================================+
```

## Fine-tuning
The following describes LoRA fine-tuning. For full-parameter fine-tuning, simply set `--train_type full`.

| Model Size | Number of NPUs | Deepspeed Type | Max Memory Usage |
|------------|----------------|----------------|------------------|
| 7B         | 1              | None           | 1 * 28 GB        |
| 7B         | 4              | None           | 4 * 22 GB        |
| 7B         | 4              | zero2          | 4 * 28 GB        |
| 7B         | 4              | zero3          | 4 * 22 GB        |
| 7B         | 8              | None           | 8 * 22 GB        |
| 14B        | 1              | None           | 1 * 45 GB        |
| 14B        | 8              | None           | 8 * 51 GB        |
| 14B        | 8              | zero2          | 8 * 49 GB        |
| 14B        | 8              | zero3          | 8 * 31 GB        |

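The table can also be queried programmatically when planning a run. A small sketch (the values are transcribed from the table above; the helper name is ours, not part of ms-swift):

```python
# Per-card peak memory (GB), keyed by (model size, number of NPUs, deepspeed type),
# transcribed from the measurements in the table above.
PEAK_GB = {
    ("7B", 1, None): 28, ("7B", 4, None): 22, ("7B", 4, "zero2"): 28,
    ("7B", 4, "zero3"): 22, ("7B", 8, None): 22, ("14B", 1, None): 45,
    ("14B", 8, None): 51, ("14B", 8, "zero2"): 49, ("14B", 8, "zero3"): 31,
}

def fits_per_card_budget(model, budget_gb):
    """Return the measured configurations whose per-card peak fits the budget."""
    return [cfg for cfg, gb in PEAK_GB.items() if cfg[0] == model and gb <= budget_gb]

# With only 32 GB free per card, 14B LoRA needs ZeRO3 across 8 NPUs:
print(fits_per_card_budget("14B", 32))  # [('14B', 8, 'zero3')]
```
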
### Single Card Training

Start single-card fine-tuning with the following command. (Note: if NaN values occur during fine-tuning, set `--torch_dtype float32`.)

```shell
# Experiment environment: Ascend 910B3
# Memory requirement: 28 GB
# Runtime: 8 hours
ASCEND_RT_VISIBLE_DEVICES=0 \
swift sft \
    --model Qwen/Qwen2-7B-Instruct \
    --dataset AI-ModelScope/blossom-math-v2 \
    --num_train_epochs 5 \
    --train_type lora \
    --output_dir output \
    --learning_rate 1e-4 \
    --gradient_accumulation_steps 16 \
    --save_steps 100 \
    --eval_steps 100
```

### Data Parallel Training
We use 4 cards for DDP training.

```shell
# Experiment environment: 4 * Ascend 910B3
# Memory requirement: 4 * 22 GB
# Runtime: 2 hours
NPROC_PER_NODE=4 \
ASCEND_RT_VISIBLE_DEVICES=0,1,2,3 \
swift sft \
    --model Qwen/Qwen2-7B-Instruct \
    --dataset AI-ModelScope/blossom-math-v2 \
    --num_train_epochs 5 \
    --train_type lora \
    --output_dir output \
    ...
```
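
Note that moving from one card to 4-card DDP multiplies the effective batch size seen by the optimizer: it is the per-device batch size times the number of processes times the gradient accumulation steps. A quick arithmetic sketch (assuming a per-device batch size of 1, which we believe is the `--per_device_train_batch_size` default; verify against your swift version):

```python
def global_batch_size(per_device, nproc, grad_accum):
    """Effective optimizer batch size under data-parallel training."""
    return per_device * nproc * grad_accum

# Single card with --gradient_accumulation_steps 16:
print(global_batch_size(1, 1, 16))  # 16
# 4-card DDP with the same flags quadruples the effective batch,
# unless gradient accumulation is reduced to compensate:
print(global_batch_size(1, 4, 16))  # 64
```
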

### Deepspeed Training

ZeRO2:
```shell
# Experiment environment: 4 * Ascend 910B3
# Memory requirement: 4 * 28 GB
# Runtime: 3.5 hours
NPROC_PER_NODE=4 \
ASCEND_RT_VISIBLE_DEVICES=0,1,2,3 \
swift sft \
    --model Qwen/Qwen2-7B-Instruct \
    --dataset AI-ModelScope/blossom-math-v2 \
    --num_train_epochs 5 \
    --train_type lora \
    --output_dir output \
    --deepspeed zero2 \
    ...
```

ZeRO3:
```shell
# Experiment environment: 4 * Ascend 910B3
# Memory requirement: 4 * 22 GB
# Runtime: 8.5 hours
NPROC_PER_NODE=4 \
ASCEND_RT_VISIBLE_DEVICES=0,1,2,3 \
swift sft \
    --model Qwen/Qwen2-7B-Instruct \
    --dataset AI-ModelScope/blossom-math-v2 \
    --num_train_epochs 5 \
    --train_type lora \
    --output_dir output \
    --deepspeed zero3 \
    ...
```

## Inference

Original Model:
```shell
ASCEND_RT_VISIBLE_DEVICES=0 swift infer \
    --model Qwen/Qwen2-7B-Instruct \
    --stream true --max_new_tokens 2048
```

After LoRA Fine-tuning:
```shell
ASCEND_RT_VISIBLE_DEVICES=0 swift infer \
    --adapters xxx/checkpoint-xxx --load_data_args true \
    --stream true --max_new_tokens 2048

# Merge LoRA weights and infer
ASCEND_RT_VISIBLE_DEVICES=0 swift export --adapters xx/checkpoint-xxx --merge_lora true

ASCEND_RT_VISIBLE_DEVICES=0 swift infer \
    --model xxx/checkpoint-xxx-merged --load_data_args true \
    --stream true --max_new_tokens 2048
```

## Deployment
NPUs do not support vLLM for inference acceleration during deployment, but models can be deployed using native PyTorch.

Original Model:
```shell
ASCEND_RT_VISIBLE_DEVICES=0 swift deploy --model Qwen/Qwen2-7B-Instruct --max_new_tokens 2048
```

After LoRA Fine-tuning:
```shell
ASCEND_RT_VISIBLE_DEVICES=0 swift deploy --adapters xxx/checkpoint-xxx --max_new_tokens 2048

# Merge LoRA weights and deploy
ASCEND_RT_VISIBLE_DEVICES=0 swift export --adapters xx/checkpoint-xxx --merge_lora true
ASCEND_RT_VISIBLE_DEVICES=0 swift deploy --model xxx/checkpoint-xxx-merged --max_new_tokens 2048
```
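
`swift deploy` serves an OpenAI-compatible API, so the deployed model can be queried with a plain HTTP request. A stdlib-only sketch (the host, port, and served model name are assumptions; check your server's startup log, e.g. via `GET /v1/models`, for the actual values):

```python
import json
from urllib import request  # only needed for the commented-out call below

def build_chat_payload(model, prompt, max_tokens=2048):
    """Assemble an OpenAI-style chat-completions payload for the deployed model."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

payload = build_chat_payload("Qwen2-7B-Instruct", "What is 1 + 1?")
# Assuming the server listens at the address below (adjust to your deployment):
# req = request.Request("http://127.0.0.1:8000/v1/chat/completions",
#                       data=json.dumps(payload).encode(),
#                       headers={"Content-Type": "application/json"})
# print(request.urlopen(req).read().decode())
```
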