# NPU Support

Author: [chuanzhubin](https://github.com/chuanzhubin)

## Environment Preparation

Experiment environment: 8 * Ascend 910B3 64GB (devices provided by [@chuanzhubin](https://github.com/chuanzhubin), with thanks for supporting ModelScope and SWIFT!)
```shell
# Create a new conda virtual environment (optional)
conda create -n swift-npu python=3.10 -y
conda activate swift-npu
# Set the pip global mirror (optional, speeds up downloads)
pip config set global.index-url https://mirrors.aliyun.com/pypi/simple/
pip install ms-swift -U
# Install torch-npu
pip install torch-npu decorator
# If you want to use DeepSpeed (reduces memory usage, but training may be slower)
pip install deepspeed
```
Verify that the environment is set up correctly and that the NPUs can be loaded:
```python
from transformers.utils import is_torch_npu_available
import torch

print(is_torch_npu_available())  # True
print(torch.npu.device_count())  # 8
print(torch.randn(10, device='npu:0'))
```
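
As a further smoke test, you can run a small computation on every visible NPU. This is a minimal sketch; it assumes only the `torch_npu` package installed above (importing `torch_npu` registers the `npu` device type and the `torch.npu` namespace, which mirrors `torch.cuda`).

```python
import torch
import torch_npu  # registers the 'npu' device type with PyTorch

# Run a tiny matmul on each NPU and make sure the result comes back.
for i in range(torch.npu.device_count()):
    device = f'npu:{i}'
    a = torch.randn(128, 128, device=device)
    b = torch.randn(128, 128, device=device)
    c = (a @ b).sum()
    torch.npu.synchronize()  # wait for the kernel to finish
    print(f'{device}: ok, checksum={c.item():.4f}')
```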
Check the P2P connections between the NPUs. Here we can see that each NPU is connected to the other 7 NPUs through HCCS links:
```shell
(valle) root@valle:~/src# npu-smi info -t topo
       NPU0    NPU1    NPU2    NPU3    NPU4    NPU5    NPU6    NPU7    CPU Affinity
NPU0   X       HCCS    HCCS    HCCS    HCCS    HCCS    HCCS    HCCS    144-167
NPU1   HCCS    X       HCCS    HCCS    HCCS    HCCS    HCCS    HCCS    144-167
NPU2   HCCS    HCCS    X       HCCS    HCCS    HCCS    HCCS    HCCS    96-119
NPU3   HCCS    HCCS    HCCS    X       HCCS    HCCS    HCCS    HCCS    96-119
NPU4   HCCS    HCCS    HCCS    HCCS    X       HCCS    HCCS    HCCS    0-23
NPU5   HCCS    HCCS    HCCS    HCCS    HCCS    X       HCCS    HCCS    0-23
NPU6   HCCS    HCCS    HCCS    HCCS    HCCS    HCCS    X       HCCS    48-71
NPU7   HCCS    HCCS    HCCS    HCCS    HCCS    HCCS    HCCS    X       48-71

Legend:
  X    = Self
  SYS  = Path traversing PCIe and NUMA nodes. Nodes are connected through SMP, such as QPI, UPI.
  PHB  = Path traversing PCIe and the PCIe host bridge of a CPU.
  PIX  = Path traversing a single PCIe switch
  PXB  = Path traversing multiple PCIe switches
  HCCS = Connection traversing HCCS.
  NA   = Unknown relationship.
```
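
To confirm that these links are actually usable for collective communication, you can run a quick `all_reduce` over the HCCL backend. A minimal sketch, assuming `torch_npu` provides the `hccl` process-group backend (the standard distributed backend on Ascend); set `world_size` to your card count.

```python
import os

import torch
import torch_npu  # registers the 'npu' device and the HCCL distributed backend
import torch.distributed as dist
import torch.multiprocessing as mp


def worker(rank: int, world_size: int):
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '29500'
    torch.npu.set_device(rank)
    dist.init_process_group('hccl', rank=rank, world_size=world_size)
    # Each rank contributes its rank id; after all_reduce every rank
    # should hold 0 + 1 + ... + (world_size - 1).
    t = torch.tensor([float(rank)], device=f'npu:{rank}')
    dist.all_reduce(t)
    print(f'rank {rank}: all_reduce sum = {t.item()}')
    dist.destroy_process_group()


if __name__ == '__main__':
    world_size = 8  # number of NPUs to test
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
```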
Check the status of the NPU. Detailed information about the `npu-smi` command can be found in the [official documentation](https://support.huawei.com/enterprise/zh/doc/EDOC1100079287/10dcd668).
```shell
(valle) root@valle:~/src# npu-smi info
+------------------------------------------------------------------------------------------------+
| npu-smi 24.1.rc1.b030                            Version: 24.1.rc1.b030                         |
+---------------------------+---------------+----------------------------------------------------+
| NPU   Name                | Health        | Power(W)   Temp(C)           Hugepages-Usage(page) |
| Chip                      | Bus-Id        | AICore(%)  Memory-Usage(MB)  HBM-Usage(MB)         |
+===========================+===============+====================================================+
| 0     910B3               | OK            | 101.8      43                0    / 0              |
| 0                         | 0000:C1:00.0  | 0          0    / 0          3318 / 65536          |
+===========================+===============+====================================================+
| 1     910B3               | OK            | 92.0       39                0    / 0              |
| 0                         | 0000:C2:00.0  | 0          0    / 0          3314 / 65536          |
+===========================+===============+====================================================+
| 2     910B3               | OK            | 102.0      40                0    / 0              |
| 0                         | 0000:81:00.0  | 0          0    / 0          3314 / 65536          |
+===========================+===============+====================================================+
| 3     910B3               | OK            | 99.8       40                0    / 0              |
| 0                         | 0000:82:00.0  | 0          0    / 0          3314 / 65536          |
+===========================+===============+====================================================+
| 4     910B3               | OK            | 98.6       45                0    / 0              |
| 0                         | 0000:01:00.0  | 0          0    / 0          3314 / 65536          |
+===========================+===============+====================================================+
| 5     910B3               | OK            | 99.7       44                0    / 0              |
| 0                         | 0000:02:00.0  | 0          0    / 0          3314 / 65536          |
+===========================+===============+====================================================+
| 6     910B3               | OK            | 103.8      45                0    / 0              |
| 0                         | 0000:41:00.0  | 0          0    / 0          3314 / 65536          |
+===========================+===============+====================================================+
| 7     910B3               | OK            | 98.2       44                0    / 0              |
| 0                         | 0000:42:00.0  | 0          0    / 0          3315 / 65536          |
+===========================+===============+====================================================+
```
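
For long training runs it can be useful to log HBM usage over time. The sketch below shells out to `npu-smi info` and scrapes the final `used / total` pair on each chip row; the parsing is keyed to the output format shown above and may need adjusting for other driver versions.

```python
import re
import subprocess
import time

# Matches the trailing "used / total" pair on a row, e.g. "3318 / 65536 |"
HBM = re.compile(r'(\d+)\s*/\s*(\d+)\s*\|\s*$')

def hbm_usage_mb():
    out = subprocess.run(['npu-smi', 'info'], capture_output=True, text=True).stdout
    # Keep only pairs with a non-zero total: that is the HBM column, not the
    # "0 / 0" hugepages column that ends the first row of each chip block.
    return [(int(m.group(1)), int(m.group(2)))
            for line in out.splitlines()
            if (m := HBM.search(line)) and int(m.group(2)) > 0]

while True:  # Ctrl-C to stop
    print(time.strftime('%H:%M:%S'), hbm_usage_mb())
    time.sleep(30)
```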
## Fine-tuning

The following demonstrates LoRA fine-tuning. For full-parameter fine-tuning, simply set `--train_type full`.

| Model Size | Number of NPUs | DeepSpeed | Max Memory Usage |
|------------|----------------|-----------|------------------|
| 7B         | 1              | None      | 1 * 28 GB        |
| 7B         | 4              | None      | 4 * 22 GB        |
| 7B         | 4              | zero2     | 4 * 28 GB        |
| 7B         | 4              | zero3     | 4 * 22 GB        |
| 7B         | 8              | None      | 8 * 22 GB        |
| 14B        | 1              | None      | 1 * 45 GB        |
| 14B        | 8              | None      | 8 * 51 GB        |
| 14B        | 8              | zero2     | 8 * 49 GB        |
| 14B        | 8              | zero3     | 8 * 31 GB        |
### Single Card Training

Start single-card fine-tuning with the following command (note: if NaN occurs during fine-tuning, set `--torch_dtype float32`):
```shell
# Experiment environment: Ascend 910B3
# Memory requirement: 28 GB
# Runtime: 8 hours
ASCEND_RT_VISIBLE_DEVICES=0 \
swift sft \
    --model Qwen/Qwen2-7B-Instruct \
    --dataset AI-ModelScope/blossom-math-v2 \
    --num_train_epochs 5 \
    --train_type lora \
    --output_dir output \
    --learning_rate 1e-4 \
    --gradient_accumulation_steps 16 \
    --save_steps 100 \
    --eval_steps 100
```
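
Once training finishes, you can sanity-check the saved LoRA adapter from Python before moving on to `swift infer`. A minimal sketch using `transformers` and `peft`; the checkpoint path is a hypothetical placeholder (substitute your own `output/.../checkpoint-xxx` directory), and the base model is assumed to be fetchable from the Hugging Face Hub or a local path.

```python
import torch
import torch_npu  # registers the 'npu' device
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

ckpt = 'output/checkpoint-xxx'  # hypothetical path: replace with your LoRA checkpoint

tok = AutoTokenizer.from_pretrained('Qwen/Qwen2-7B-Instruct')
base = AutoModelForCausalLM.from_pretrained(
    'Qwen/Qwen2-7B-Instruct', torch_dtype=torch.float16).to('npu:0')
model = PeftModel.from_pretrained(base, ckpt)

# One quick generation to confirm the adapter loads and runs on the NPU.
inputs = tok.apply_chat_template(
    [{'role': 'user', 'content': 'What is 17 * 24?'}],
    add_generation_prompt=True, return_tensors='pt').to('npu:0')
out = model.generate(inputs, max_new_tokens=64)
print(tok.decode(out[0][inputs.shape[1]:], skip_special_tokens=True))
```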
### Data Parallel Training

We use 4 cards for DDP training:
```shell
# Experiment environment: 4 * Ascend 910B3
# Memory requirement: 4 * 22 GB
# Runtime: 2 hours
NPROC_PER_NODE=4 \
ASCEND_RT_VISIBLE_DEVICES=0,1,2,3 \
swift sft \
    --model Qwen/Qwen2-7B-Instruct \
    --dataset AI-ModelScope/blossom-math-v2 \
    --num_train_epochs 5 \
    --train_type lora \
    --output_dir output \
    ...
```
### DeepSpeed Training

ZeRO2:
```shell
# Experiment environment: 4 * Ascend 910B3
# Memory requirement: 4 * 28 GB
# Runtime: 3.5 hours
NPROC_PER_NODE=4 \
ASCEND_RT_VISIBLE_DEVICES=0,1,2,3 \
swift sft \
    --model Qwen/Qwen2-7B-Instruct \
    --dataset AI-ModelScope/blossom-math-v2 \
    --num_train_epochs 5 \
    --train_type lora \
    --output_dir output \
    --deepspeed zero2 \
    ...
```
ZeRO3:
```shell
# Experiment environment: 4 * Ascend 910B3
# Memory requirement: 4 * 22 GB
# Runtime: 8.5 hours
NPROC_PER_NODE=4 \
ASCEND_RT_VISIBLE_DEVICES=0,1,2,3 \
swift sft \
    --model Qwen/Qwen2-7B-Instruct \
    --dataset AI-ModelScope/blossom-math-v2 \
    --num_train_epochs 5 \
    --train_type lora \
    --output_dir output \
    --deepspeed zero3 \
    ...
```
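
Beyond the built-in `zero2`/`zero3` presets, the transformers DeepSpeed integration that swift builds on also works with custom JSON configs, which is useful if you want optimizer CPU offload, for example. A minimal sketch below; the file name and offload settings are illustrative, and passing a config path to `--deepspeed` assumes your swift version supports it.

```python
import json

# A minimal ZeRO-2 config with optimizer CPU offload. "auto" values are
# filled in by the transformers DeepSpeed integration from the trainer args.
config = {
    'train_micro_batch_size_per_gpu': 'auto',
    'gradient_accumulation_steps': 'auto',
    'bf16': {'enabled': 'auto'},
    'zero_optimization': {
        'stage': 2,
        'offload_optimizer': {'device': 'cpu', 'pin_memory': True},
    },
}
with open('zero2_offload.json', 'w') as f:
    json.dump(config, f, indent=2)

# Then point training at it (assumes --deepspeed accepts a config path):
#   swift sft ... --deepspeed zero2_offload.json
```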
## Inference

Original Model:
```shell
ASCEND_RT_VISIBLE_DEVICES=0 swift infer \
    --model Qwen/Qwen2-7B-Instruct \
    --stream true --max_new_tokens 2048
```
After LoRA Fine-tuning:
```shell
ASCEND_RT_VISIBLE_DEVICES=0 swift infer \
    --adapters xxx/checkpoint-xxx --load_data_args true \
    --stream true --max_new_tokens 2048

# Merge LoRA and infer
ASCEND_RT_VISIBLE_DEVICES=0 swift export --adapters xxx/checkpoint-xxx --merge_lora true
ASCEND_RT_VISIBLE_DEVICES=0 swift infer \
    --model xxx/checkpoint-xxx-merged --load_data_args true \
    --stream true --max_new_tokens 2048
```
## Deployment

NPUs do not support vLLM for accelerated inference during deployment, but models can be deployed with native PyTorch.

Original Model:
```shell
ASCEND_RT_VISIBLE_DEVICES=0 swift deploy --model Qwen/Qwen2-7B-Instruct --max_new_tokens 2048
```
After LoRA Fine-tuning:
```shell
ASCEND_RT_VISIBLE_DEVICES=0 swift deploy --adapters xxx/checkpoint-xxx --max_new_tokens 2048

# Merge LoRA and deploy
ASCEND_RT_VISIBLE_DEVICES=0 swift export --adapters xxx/checkpoint-xxx --merge_lora true
ASCEND_RT_VISIBLE_DEVICES=0 swift deploy --model xxx/checkpoint-xxx-merged --max_new_tokens 2048
```
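
Once the server is up, it can be queried like any OpenAI-compatible endpoint. A minimal sketch assuming the default address `http://127.0.0.1:8000/v1`; adjust the host, port, and model name to match your deployment.

```python
import requests

resp = requests.post(
    'http://127.0.0.1:8000/v1/chat/completions',  # assumed default swift deploy endpoint
    json={
        'model': 'Qwen2-7B-Instruct',  # must match the name the server reports under /v1/models
        'messages': [{'role': 'user', 'content': 'Hello, who are you?'}],
        'max_tokens': 256,
    },
    timeout=60,
)
print(resp.json()['choices'][0]['message']['content'])
```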