# NPU Support
Author: [chuanzhubin](https://github.com/chuanzhubin)

## Environment Preparation

Experiment environment: 8 * Ascend 910B3 64GB (the device is provided by [@chuanzhubin](https://github.com/chuanzhubin); thanks for supporting ModelScope and SWIFT~)

```shell
# Create a new conda virtual environment (optional)
conda create -n swift-npu python=3.10 -y
conda activate swift-npu

# Set a global pip mirror (optional, to speed up downloads)
pip config set global.index-url https://mirrors.aliyun.com/pypi/simple/
pip install ms-swift -U

# Install torch-npu
pip install torch-npu decorator
# If you want to use deepspeed (reduces memory usage; training speed may decrease)
pip install deepspeed
```
22
+
23
+ Check if the test environment is installed correctly and whether the NPU can be loaded properly.
24
+ ```python
25
+ from transformers.utils import is_torch_npu_available
26
+ import torch
27
+
28
+ print(is_torch_npu_available()) # True
29
+ print(torch.npu.device_count()) # 8
30
+ print(torch.randn(10, device='npu:0'))
31
+ ```

Check the P2P connectivity of the NPUs. Here we can see that each NPU is interconnected with the other seven NPUs through 7 HCCS links.
```shell
(valle) root@valle:~/src# npu-smi info -t topo
       NPU0  NPU1  NPU2  NPU3  NPU4  NPU5  NPU6  NPU7  CPU Affinity
NPU0   X     HCCS  HCCS  HCCS  HCCS  HCCS  HCCS  HCCS  144-167
NPU1   HCCS  X     HCCS  HCCS  HCCS  HCCS  HCCS  HCCS  144-167
NPU2   HCCS  HCCS  X     HCCS  HCCS  HCCS  HCCS  HCCS  96-119
NPU3   HCCS  HCCS  HCCS  X     HCCS  HCCS  HCCS  HCCS  96-119
NPU4   HCCS  HCCS  HCCS  HCCS  X     HCCS  HCCS  HCCS  0-23
NPU5   HCCS  HCCS  HCCS  HCCS  HCCS  X     HCCS  HCCS  0-23
NPU6   HCCS  HCCS  HCCS  HCCS  HCCS  HCCS  X     HCCS  48-71
NPU7   HCCS  HCCS  HCCS  HCCS  HCCS  HCCS  HCCS  X     48-71

Legend:

  X    = Self
  SYS  = Path traversing PCIe and NUMA nodes. Nodes are connected through SMP, such as QPI, UPI.
  PHB  = Path traversing PCIe and the PCIe host bridge of a CPU.
  PIX  = Path traversing a single PCIe switch
  PXB  = Path traversing multiple PCIe switches
  HCCS = Connection traversing HCCS.
  NA   = Unknown relationship.
```
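
The matrix above lends itself to a quick scripted sanity check. Below is a minimal sketch (the parser and the shortened sample are illustrative, not part of any Ascend tooling); run on the full 8-card matrix above, each NPU would report 7 links:

```python
def hccs_links_per_npu(topo_text):
    """Count HCCS links per NPU row of `npu-smi info -t topo` output."""
    links = {}
    for line in topo_text.splitlines():
        parts = line.split()
        # Data rows start with an NPU label and contain the self marker 'X';
        # the header row has NPU labels but no 'X', so it is skipped.
        if parts and parts[0].startswith("NPU") and "X" in parts:
            links[parts[0]] = parts.count("HCCS")
    return links

# Shortened 3-NPU sample in the same layout as the real output:
sample = """\
        NPU0  NPU1  NPU2  CPU-Affinity
NPU0    X     HCCS  HCCS  144-167
NPU1    HCCS  X     HCCS  144-167
NPU2    HCCS  HCCS  X     96-119
"""
print(hccs_links_per_npu(sample))  # {'NPU0': 2, 'NPU1': 2, 'NPU2': 2}
```
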

Check the status of the NPUs. Detailed information about the `npu-smi` command can be found in the [official documentation](https://support.huawei.com/enterprise/zh/doc/EDOC1100079287/10dcd668).
```shell
(valle) root@valle:~/src# npu-smi info
+------------------------------------------------------------------------------------------------+
| npu-smi 24.1.rc1.b030                            Version: 24.1.rc1.b030                        |
+---------------------------+---------------+----------------------------------------------------+
| NPU     Name              | Health        | Power(W)     Temp(C)     Hugepages-Usage(page)     |
| Chip                      | Bus-Id        | AICore(%)    Memory-Usage(MB)    HBM-Usage(MB)     |
+===========================+===============+====================================================+
| 0       910B3             | OK            | 101.8        43          0    / 0                  |
| 0                         | 0000:C1:00.0  | 0            0    / 0            3318 / 65536      |
+===========================+===============+====================================================+
| 1       910B3             | OK            | 92.0         39          0    / 0                  |
| 0                         | 0000:C2:00.0  | 0            0    / 0            3314 / 65536      |
+===========================+===============+====================================================+
| 2       910B3             | OK            | 102.0        40          0    / 0                  |
| 0                         | 0000:81:00.0  | 0            0    / 0            3314 / 65536      |
+===========================+===============+====================================================+
| 3       910B3             | OK            | 99.8         40          0    / 0                  |
| 0                         | 0000:82:00.0  | 0            0    / 0            3314 / 65536      |
+===========================+===============+====================================================+
| 4       910B3             | OK            | 98.6         45          0    / 0                  |
| 0                         | 0000:01:00.0  | 0            0    / 0            3314 / 65536      |
+===========================+===============+====================================================+
| 5       910B3             | OK            | 99.7         44          0    / 0                  |
| 0                         | 0000:02:00.0  | 0            0    / 0            3314 / 65536      |
+===========================+===============+====================================================+
| 6       910B3             | OK            | 103.8        45          0    / 0                  |
| 0                         | 0000:41:00.0  | 0            0    / 0            3314 / 65536      |
+===========================+===============+====================================================+
| 7       910B3             | OK            | 98.2         44          0    / 0                  |
| 0                         | 0000:42:00.0  | 0            0    / 0            3315 / 65536      |
+===========================+===============+====================================================+
```

## Fine-tuning
The following describes LoRA fine-tuning. For full-parameter fine-tuning, simply set `--train_type full`.

| Model Size | Number of NPUs | Deepspeed Type | Max Memory Usage |
|------------|----------------|----------------|------------------|
| 7B         | 1              | None           | 1 * 28 GB        |
| 7B         | 4              | None           | 4 * 22 GB        |
| 7B         | 4              | zero2          | 4 * 28 GB        |
| 7B         | 4              | zero3          | 4 * 22 GB        |
| 7B         | 8              | None           | 8 * 22 GB        |
| 14B        | 1              | None           | 1 * 45 GB        |
| 14B        | 8              | None           | 8 * 51 GB        |
| 14B        | 8              | zero2          | 8 * 49 GB        |
| 14B        | 8              | zero3          | 8 * 31 GB        |

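The table can also be queried programmatically when planning a run. A small sketch (the values are transcribed from the table above; the helper name is ours, not part of ms-swift):

```python
# Per-card peak memory (GB), keyed by (model size, number of NPUs, deepspeed type),
# transcribed from the measurements in the table above.
PEAK_GB = {
    ("7B", 1, None): 28, ("7B", 4, None): 22, ("7B", 4, "zero2"): 28,
    ("7B", 4, "zero3"): 22, ("7B", 8, None): 22, ("14B", 1, None): 45,
    ("14B", 8, None): 51, ("14B", 8, "zero2"): 49, ("14B", 8, "zero3"): 31,
}

def fits_per_card_budget(model, budget_gb):
    """Return the measured configurations whose per-card peak fits the budget."""
    return [cfg for cfg, gb in PEAK_GB.items() if cfg[0] == model and gb <= budget_gb]

# With only 32 GB free per card, 14B LoRA needs ZeRO3 across 8 NPUs:
print(fits_per_card_budget("14B", 32))  # [('14B', 8, 'zero3')]
```
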
### Single Card Training

Start single-card fine-tuning with the following command. (Note: if NaN values occur during fine-tuning, set `--torch_dtype float32`.)

```shell
# Experiment environment: Ascend 910B3
# Memory requirement: 28 GB
# Runtime: 8 hours
ASCEND_RT_VISIBLE_DEVICES=0 \
swift sft \
    --model Qwen/Qwen2-7B-Instruct \
    --dataset AI-ModelScope/blossom-math-v2 \
    --num_train_epochs 5 \
    --train_type lora \
    --output_dir output \
    --learning_rate 1e-4 \
    --gradient_accumulation_steps 16 \
    --save_steps 100 \
    --eval_steps 100
```

### Data Parallel Training
We use 4 cards for DDP training.

```shell
# Experiment environment: 4 * Ascend 910B3
# Memory requirement: 4 * 22 GB
# Runtime: 2 hours
NPROC_PER_NODE=4 \
ASCEND_RT_VISIBLE_DEVICES=0,1,2,3 \
swift sft \
    --model Qwen/Qwen2-7B-Instruct \
    --dataset AI-ModelScope/blossom-math-v2 \
    --num_train_epochs 5 \
    --train_type lora \
    --output_dir output \
    ...
```
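
Note that moving from one card to 4-card DDP multiplies the effective batch size seen by the optimizer: it is the per-device batch size times the number of processes times the gradient accumulation steps. A quick arithmetic sketch (assuming a per-device batch size of 1, which we believe is the `--per_device_train_batch_size` default; verify against your swift version):

```python
def global_batch_size(per_device, nproc, grad_accum):
    """Effective optimizer batch size under data-parallel training."""
    return per_device * nproc * grad_accum

# Single card with --gradient_accumulation_steps 16:
print(global_batch_size(1, 1, 16))  # 16
# 4-card DDP with the same flags quadruples the effective batch,
# unless gradient accumulation is reduced to compensate:
print(global_batch_size(1, 4, 16))  # 64
```
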

### Deepspeed Training

ZeRO2:
```shell
# Experiment environment: 4 * Ascend 910B3
# Memory requirement: 4 * 28 GB
# Runtime: 3.5 hours
NPROC_PER_NODE=4 \
ASCEND_RT_VISIBLE_DEVICES=0,1,2,3 \
swift sft \
    --model Qwen/Qwen2-7B-Instruct \
    --dataset AI-ModelScope/blossom-math-v2 \
    --num_train_epochs 5 \
    --train_type lora \
    --output_dir output \
    --deepspeed zero2 \
    ...
```

ZeRO3:
```shell
# Experiment environment: 4 * Ascend 910B3
# Memory requirement: 4 * 22 GB
# Runtime: 8.5 hours
NPROC_PER_NODE=4 \
ASCEND_RT_VISIBLE_DEVICES=0,1,2,3 \
swift sft \
    --model Qwen/Qwen2-7B-Instruct \
    --dataset AI-ModelScope/blossom-math-v2 \
    --num_train_epochs 5 \
    --train_type lora \
    --output_dir output \
    --deepspeed zero3 \
    ...
```

## Inference

Original Model:
```shell
ASCEND_RT_VISIBLE_DEVICES=0 swift infer \
    --model Qwen/Qwen2-7B-Instruct \
    --stream true --max_new_tokens 2048
```

After LoRA Fine-tuning:
```shell
ASCEND_RT_VISIBLE_DEVICES=0 swift infer \
    --adapters xxx/checkpoint-xxx --load_data_args true \
    --stream true --max_new_tokens 2048

# Merge LoRA weights and infer
ASCEND_RT_VISIBLE_DEVICES=0 swift export --adapters xx/checkpoint-xxx --merge_lora true

ASCEND_RT_VISIBLE_DEVICES=0 swift infer \
    --model xxx/checkpoint-xxx-merged --load_data_args true \
    --stream true --max_new_tokens 2048
```

## Deployment
NPUs do not support vLLM for inference acceleration during deployment, but models can be deployed using native PyTorch.

Original Model:
```shell
ASCEND_RT_VISIBLE_DEVICES=0 swift deploy --model Qwen/Qwen2-7B-Instruct --max_new_tokens 2048
```

After LoRA Fine-tuning:
```shell
ASCEND_RT_VISIBLE_DEVICES=0 swift deploy --adapters xxx/checkpoint-xxx --max_new_tokens 2048

# Merge LoRA weights and deploy
ASCEND_RT_VISIBLE_DEVICES=0 swift export --adapters xx/checkpoint-xxx --merge_lora true
ASCEND_RT_VISIBLE_DEVICES=0 swift deploy --model xxx/checkpoint-xxx-merged --max_new_tokens 2048
```
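
`swift deploy` serves an OpenAI-compatible API, so the deployed model can be queried with a plain HTTP request. A stdlib-only sketch (the host, port, and served model name are assumptions; check your server's startup log, e.g. via `GET /v1/models`, for the actual values):

```python
import json
from urllib import request  # only needed for the commented-out call below

def build_chat_payload(model, prompt, max_tokens=2048):
    """Assemble an OpenAI-style chat-completions payload for the deployed model."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

payload = build_chat_payload("Qwen2-7B-Instruct", "What is 1 + 1?")
# Assuming the server listens at the address below (adjust to your deployment):
# req = request.Request("http://127.0.0.1:8000/v1/chat/completions",
#                       data=json.dumps(payload).encode(),
#                       headers={"Content-Type": "application/json"})
# print(request.urlopen(req).read().decode())
```
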