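	# Quick Start

	ms-swift incorporates Megatron's parallelism techniques to accelerate large-model training, including data parallelism, tensor parallelism, pipeline parallelism, sequence parallelism, context parallelism, and expert parallelism. It supports CPT/SFT/DPO/GRPO for models such as Qwen3, [Qwen3-MoE](https://github.com/modelscope/ms-swift/blob/main/examples/megatron/mcore_bridge/full/moe.sh), Qwen2.5, Llama3, Deepseek-R1, and GLM4.5. For the full list of supported models, see the [Supported Models and Datasets documentation](../Instruction/Supported-models-and-datasets.md). We recommend Megatron-SWIFT for MoE training, where it can typically deliver around a 10x training speedup (see the Benchmark section).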


	| Method | Full-parameter | LoRA | MoE | Multimodal | FP8 |
	| ------ | ------ | ---- | ----- | ----- | ----- |
	| Pre-training | ✅ | ✅ | ✅ | ✅ | ✅ |
	| [Supervised fine-tuning](https://github.com/modelscope/ms-swift/tree/main/examples/megatron) | ✅ | ✅ | ✅ | ✅ | ✅ |
	| [GRPO](https://github.com/modelscope/ms-swift/tree/main/examples/megatron/grpo) | ✅ | ✅ | ✅ | ✅ | ✅ |
	| [GKD](https://github.com/modelscope/ms-swift/tree/main/examples/megatron/rlhf/gkd) | ✅ | ✅ | ✅ | ✅ | ✅ |
	| [DPO](https://github.com/modelscope/ms-swift/tree/main/examples/megatron/rlhf/dpo) | ✅ | ✅ | ✅ | ✅ | ✅ |
	| [KTO](https://github.com/modelscope/ms-swift/tree/main/examples/megatron/rlhf/kto) | ✅ | ✅ | ✅ | ✅ | ✅ |
	| [RM](https://github.com/modelscope/ms-swift/tree/main/examples/megatron/rlhf/rm) | ✅ | ✅ | ✅ | ✅ | ✅ |
	| [Embedding](https://github.com/modelscope/ms-swift/tree/main/examples/megatron/embedding) | ✅ | ✅ | ✅ | ✅ | ✅ |
	| [Reranker](https://github.com/modelscope/ms-swift/tree/main/examples/megatron/reranker) | ✅ | ✅ | ✅ | ✅ | ✅ |
	| [Sequence classification](https://github.com/modelscope/ms-swift/tree/main/examples/megatron/seq_cls) | ✅ | ✅ | ✅ | ✅ | ✅ |


	## Environment Setup
	To use Megatron-SWIFT, in addition to installing the swift dependencies, you also need to install the following:

	```shell
	pip install pybind11

	# transformer_engine
	# If you hit installation errors, see this issue for a fix: https://github.com/modelscope/ms-swift/issues/3793
	pip install --no-build-isolation transformer_engine[pytorch]

	# apex
	# Note: Megatron-SWIFT can run in an environment without apex; just additionally set `--no_gradient_accumulation_fusion true`.
	git clone https://github.com/NVIDIA/apex
	cd apex
	pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./
	# Return to the parent directory so later commands do not run inside the apex checkout.
	cd ..

	# megatron-core
	pip install git+https://github.com/NVIDIA/Megatron-LM.git@core_r0.15.0

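	# For multi-node training, additionally set the `MODELSCOPE_CACHE` environment variable to a shared storage path.
	# This ensures the dataset cache is shared, which speeds up preprocessing.
	# Note: this step is critical; otherwise multi-node training may hang because randomness makes the data inconsistent across nodes.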
	export MODELSCOPE_CACHE='/xxx/shared'

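	# Megatron-LM
	# The training modules of the Megatron-LM dependency are git-cloned and installed by swift. Alternatively, point the `MEGATRON_LM_PATH` environment variable to an already-downloaded repo (for offline environments; use the [core_r0.15.0 branch](https://github.com/NVIDIA/Megatron-LM/tree/core_r0.15.0)).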
	git clone --branch core_r0.15.0 https://github.com/NVIDIA/Megatron-LM.git
	export MEGATRON_LM_PATH='/xxx/Megatron-LM'

	# flash_attn
	# Choose a suitable version to install: https://github.com/Dao-AILab/flash-attention/releases/tag/v2.8.3
	# Note: do not install a version higher than the maximum allowed by transformer_engine: https://github.com/NVIDIA/TransformerEngine/blob/release_v2.10/transformer_engine/pytorch/attention/dot_product_attention/utils.py#L118
	MAX_JOBS=8 pip install "flash-attn==2.8.3" --no-build-isolation
	```

	Alternatively, you can use one of the following images (for historical images, see [here](../GetStarted/SWIFT-installation.md#镜像)):
	```
	modelscope-registry.cn-hangzhou.cr.aliyuncs.com/modelscope-repo/modelscope:ubuntu22.04-cuda12.8.1-py311-torch2.9.0-vllm0.13.0-modelscope1.33.0-swift3.12.3
	modelscope-registry.cn-beijing.cr.aliyuncs.com/modelscope-repo/modelscope:ubuntu22.04-cuda12.8.1-py311-torch2.9.0-vllm0.13.0-modelscope1.33.0-swift3.12.3
	modelscope-registry.us-west-1.cr.aliyuncs.com/modelscope-repo/modelscope:ubuntu22.04-cuda12.8.1-py311-torch2.9.0-vllm0.13.0-modelscope1.33.0-swift3.12.3

	# cu129 (fp8 training)
	modelscope-registry.cn-hangzhou.cr.aliyuncs.com/modelscope-repo/modelscope:ubuntu22.04-cuda12.9.1-py311-torch2.8.0-vllm0.11.0-modelscope1.32.0-swift3.11.3
	modelscope-registry.cn-beijing.cr.aliyuncs.com/modelscope-repo/modelscope:ubuntu22.04-cuda12.9.1-py311-torch2.8.0-vllm0.11.0-modelscope1.32.0-swift3.11.3
	modelscope-registry.us-west-1.cr.aliyuncs.com/modelscope-repo/modelscope:ubuntu22.04-cuda12.9.1-py311-torch2.8.0-vllm0.11.0-modelscope1.32.0-swift3.11.3
	```

	Recommended runtime environment:

	| | Range | Recommended | Notes |
	|--------------|--------------|-------------|--------------------|
	| python | >=3.9 | 3.10/3.11 | |
	| cuda | | cuda12 | |
	| torch | >=2.0 | 2.8.0 | |
	| transformer_engine | >=2.3 | 2.10.0 | |
	| apex | | 0.1 | |
	| megatron_core | >=0.12,<0.16 | 0.15 | |
	| flash_attn | | 2.8.3/3.0.0b1 | |
	| transformers | >=4.33 | 4.57.6 | |
	| modelscope | >=1.23 | | |
	| peft | >=0.11,<0.19 | | LoRA |
	| trl | >=0.15,<0.25 | | RLHF |
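
The ranges in the table above can be checked mechanically. The following is a minimal sketch, assuming GNU `sort -V` is available; the helper names `version_ge`, `version_lt`, and `in_range` are hypothetical and not part of ms-swift. It tests whether a version string falls inside a half-open range such as `>=0.12,<0.16`.

```shell
# Hypothetical helpers (not part of ms-swift); they rely on GNU `sort -V`.

# version_ge A B: succeeds when version A >= version B.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}

# version_lt A B: succeeds when version A < version B.
version_lt() {
    [ "$1" != "$2" ] && version_ge "$2" "$1"
}

# in_range VERSION MIN MAX_EXCLUSIVE: succeeds when MIN <= VERSION < MAX_EXCLUSIVE.
in_range() {
    version_ge "$1" "$2" && version_lt "$1" "$3"
}

# Example: the recommended megatron_core 0.15.0 against the range >=0.12,<0.16.
if in_range 0.15.0 0.12 0.16; then
    echo "megatron_core 0.15.0 is within the supported range"
fi
```

In practice the installed version could be obtained with something like `python -c "import importlib.metadata as m; print(m.version('megatron-core'))"` and passed as the first argument.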


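	## Quick Start Example

	This section walks through a quick-start example that fine-tunes the Qwen2.5-7B-Instruct model for self-cognition on two 80GiB A100 GPUs. The following best practice can be completed in about 10 minutes.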

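	### Traditional Approach

	First, convert the HF-format weights to Megatron format:
	- Multi-GPU weight conversion: remove `CUDA_VISIBLE_DEVICES=0` to convert the weights with multiple GPUs.
	- Conversion precision test: `--test_convert_precision true` tests the conversion precision. When converting large MoE models, this option takes a long time and consumes more memory, so remove it if appropriate.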

	```shell
	CUDA_VISIBLE_DEVICES=0 \
	swift export \
	--model Qwen/Qwen2.5-7B-Instruct \
	--to_mcore true \
	--torch_dtype bfloat16 \
	--output_dir Qwen2.5-7B-Instruct-mcore \
	--test_convert_precision true
	```

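	Then train with the following script. Training requires 2*80GiB of GPU memory:
	- For multi-node training, we recommend using a shared disk and setting `--save` to the same path on every node.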
	```shell
	PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \
	NPROC_PER_NODE=2 \
	CUDA_VISIBLE_DEVICES=0,1 \
	megatron sft \
	--load Qwen2.5-7B-Instruct-mcore \
	--save_safetensors false \
	--dataset 'AI-ModelScope/alpaca-gpt4-data-zh#500' \
	'AI-ModelScope/alpaca-gpt4-data-en#500' \
	'swift/self-cognition#500' \
	--tensor_model_parallel_size 2 \
	--sequence_parallel true \
	--micro_batch_size 16 \
	--global_batch_size 16 \
	--recompute_granularity full \
	--recompute_method uniform \
	--recompute_num_layers 1 \
	--finetune true \
	--cross_entropy_loss_fusion true \
	--lr 1e-5 \
	--lr_warmup_fraction 0.05 \
	--min_lr 1e-6 \
	--max_epochs 1 \
	--save megatron_output/Qwen2.5-7B-Instruct \
	--save_interval 100 \
	--max_length 2048 \
	--system 'You are a helpful assistant.' \
	--num_workers 4 \
	--no_save_optim true \
	--no_save_rng true \
	--dataset_num_proc 4 \
	--model_author swift \
	--model_name swift-robot
	```
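
The batch-size flags in the script above interact with the parallel layout. As a rough sketch, the helper below derives the data-parallel size and the number of micro-batches (gradient-accumulation steps) per iteration. It assumes Megatron-LM's usual rule that the global batch size equals micro batch size × data-parallel size × number of micro-batches; the function name `num_microbatches` is hypothetical and nothing here is taken from the ms-swift source.

```shell
# Hypothetical helper: micro-batches (gradient-accumulation steps) per iteration.
# Assumption: global_batch_size = micro_batch_size * data_parallel_size * num_microbatches.
# Arguments: world_size tp pp cp global_batch_size micro_batch_size
num_microbatches() {
    local world=$1 tp=$2 pp=$3 cp=$4 global=$5 micro=$6
    local model_parallel=$(( tp * pp * cp ))
    if (( world % model_parallel != 0 )); then
        echo "world size $world is not divisible by tp*pp*cp=$model_parallel" >&2
        return 1
    fi
    # GPUs left over after model parallelism form the data-parallel group.
    local dp=$(( world / model_parallel ))
    if (( global % (micro * dp) != 0 )); then
        echo "global batch $global is not divisible by micro*dp=$(( micro * dp ))" >&2
        return 1
    fi
    echo $(( global / (micro * dp) ))
}

# The script above: 2 GPUs, TP=2, PP=1, CP=1, global batch 16, micro batch 16.
num_microbatches 2 2 1 1 16 16   # prints 1
```

With TP=2 on two GPUs the data-parallel size is 1, so each iteration processes the whole global batch of 16 in a single micro-batch.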

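	Finally, convert the Megatron-format weights to HF format:
	- Note: `--mcore_model` should point to the parent directory of `iter_xxx`. By default, the checkpoint listed in `latest_checkpointed_iteration.txt` is used.
	- If an OOM error occurs, remove `CUDA_VISIBLE_DEVICES=0` to convert with multiple GPUs. If you run out of memory, remove `--test_convert_precision true`.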
	```shell
	CUDA_VISIBLE_DEVICES=0 \
	swift export \
	--mcore_model megatron_output/Qwen2.5-7B-Instruct/vx-xxx/checkpoint-xxx \
	--to_hf true \
	--torch_dtype bfloat16 \
	--output_dir megatron_output/Qwen2.5-7B-Instruct/vx-xxx/checkpoint-xxx-hf \
	--test_convert_precision true
	```
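
To see which checkpoint would be picked up by default, the tracker file mentioned in the note above can be read directly. This is a minimal sketch with a hypothetical helper name, `latest_iter_dir`; it assumes the Megatron-LM convention of zero-padded `iter_XXXXXXX` directory names (or `release`), which the saved output is expected, but not confirmed here, to follow.

```shell
# Hypothetical helper: print the iteration directory named by the tracker file.
# Assumption: directories follow Megatron-LM's `iter_%07d` naming convention.
latest_iter_dir() {
    local ckpt_dir=$1
    local tracker="$ckpt_dir/latest_checkpointed_iteration.txt"
    if [ ! -f "$tracker" ]; then
        echo "tracker file not found: $tracker" >&2
        return 1
    fi
    local iter
    iter=$(tr -d '[:space:]' < "$tracker")
    if [ "$iter" = "release" ]; then
        echo "$ckpt_dir/release"
    else
        # 10# forces base-10 so a value with leading zeros is not read as octal.
        printf '%s/iter_%07d\n' "$ckpt_dir" "$((10#$iter))"
    fi
}

# Self-contained demo with a temporary directory standing in for the --save path.
demo_dir=$(mktemp -d)
echo 100 > "$demo_dir/latest_checkpointed_iteration.txt"
latest_iter_dir "$demo_dir"   # prints <demo_dir>/iter_0000100
rm -rf "$demo_dir"
```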

	Run inference on the generated HF-format weights:
	```shell
	CUDA_VISIBLE_DEVICES=0 \
	swift infer \
	--model megatron_output/Qwen2.5-7B-Instruct/vx-xxx/checkpoint-xxx-hf \
	--stream true \
	--temperature 0 \
	--max_new_tokens 2048
	```

	The inference result is as follows:
	```
	<<< who are you?
	I am a language model developed by swift, you can call me swift-robot. How can I assist you?
	```


	### Mcore-Bridge (Recommended)

	Mcore-Bridge is supported in ms-swift>=3.10 and removes the tedious model-conversion step. See the [Mcore-Bridge documentation](./Mcore-Bridge.md) for details.

	Training script:
	```bash
	PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \
	NPROC_PER_NODE=2 \
	CUDA_VISIBLE_DEVICES=0,1 \
	megatron sft \
	--model Qwen/Qwen2.5-7B-Instruct \
	--load_safetensors true \
	--save_safetensors true \
	--dataset 'AI-ModelScope/alpaca-gpt4-data-zh#500' \
	'AI-ModelScope/alpaca-gpt4-data-en#500' \
	'swift/self-cognition#500' \
	--tensor_model_parallel_size 2 \
	--sequence_parallel true \
	--micro_batch_size 16 \
	--global_batch_size 16 \
	--recompute_granularity full \
	--recompute_method uniform \
	--recompute_num_layers 1 \
	--finetune true \
	--cross_entropy_loss_fusion true \
	--lr 1e-5 \
	--lr_warmup_fraction 0.05 \
	--min_lr 1e-6 \
	--max_epochs 1 \
	--save megatron_output/Qwen2.5-7B-Instruct \
	--save_interval 100 \
	--max_length 2048 \
	--system 'You are a helpful assistant.' \
	--num_workers 4 \
	--no_save_optim true \
	--no_save_rng true \
	--dataset_num_proc 4 \
	--model_author swift \
	--model_name swift-robot
	```

	Run inference on the generated safetensors-format weights:
	```shell
	CUDA_VISIBLE_DEVICES=0 \
	swift infer \
	--model megatron_output/Qwen2.5-7B-Instruct/vx-xxx/checkpoint-xxx \
	--stream true \
	--temperature 0 \
	--max_new_tokens 2048
	```

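	- For pre-training, use `megatron pt` instead of `megatron sft`; this trains with a generation-style template.
	- Megatron-SWIFT uses the same dataset and template processing modules as ms-swift, so it likewise supports techniques such as packing, loss_scale, and agent training. For custom dataset formats, see the [custom dataset documentation](../Customization/Custom-dataset.md).
	- More examples, including packing, multi-node training, 32K context, DPO, MoE models, and pre-training, are available [here](https://github.com/modelscope/ms-swift/tree/main/examples/megatron).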


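	## Training Tips
	- Ways to increase training throughput: use packing (do not enable streaming), increase DP, reduce recomputation, and increase computation-communication overlap. MoE models can additionally be sped up by dropping tokens.
	- Choosing parallelism techniques:
	  - Megatron-SWIFT combines ZeRO-1 (use_distributed_optimizer is enabled by default) with the various parallelism techniques.
	  - DP is the fastest but uses the most GPU memory; use the other parallelism techniques to reduce memory usage.
	  - TP/EP have high communication volume, so keep them within a node (inside the NVLink domain) where possible; across nodes, PP/DP are recommended. For expert layers, prefer EP over ETP: ETP saves more GPU memory but is slower.
	  - MoE parallel folding: the MoE parallel groups are separate from the dense groups. Attention uses the tp-cp-dp-pp groups, and MoE uses the etp-ep-dp-pp groups.
	- Choosing parallel sizes for weight conversion: Megatron-SWIFT uses the torch_dist storage format on the mcore side, so parallel sizes can be adjusted at training time and do not need to be specified during weight conversion.
	- About log printing: Megatron-SWIFT prints logs on the last rank, because under PP only the last pp_rank holds the complete information.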
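
To make the MoE parallel folding point above concrete, here is a small sketch of how the two group decompositions yield different data-parallel sizes on the same set of GPUs. It assumes the usual Megatron-Core arithmetic (dense DP = world size / (tp × cp × pp), expert DP = world size / (etp × ep × pp)); the helper name `fold_sizes` and the example layout are hypothetical.

```shell
# Hypothetical helper: data-parallel sizes of the dense and expert groups.
# Assumption: dense layers use tp-cp-dp-pp, expert layers use etp-ep-dp-pp.
# Arguments: world_size tp cp pp etp ep
fold_sizes() {
    local world=$1 tp=$2 cp=$3 pp=$4 etp=$5 ep=$6
    local dense=$(( tp * cp * pp )) expert=$(( etp * ep * pp ))
    if (( world % dense != 0 || world % expert != 0 )); then
        echo "world size $world must be divisible by $dense and by $expert" >&2
        return 1
    fi
    echo "dense_dp=$(( world / dense )) expert_dp=$(( world / expert ))"
}

# Example layout: 16 GPUs, TP=2, CP=1, PP=2 for attention; ETP=1, EP=8 for experts.
fold_sizes 16 2 1 2 1 8   # prints dense_dp=4 expert_dp=1
```

In this example the attention layers see four data-parallel replicas, while the expert layers spread eight-way expert parallelism across the same GPUs and have a single expert data-parallel replica.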

	## Benchmark

	The training-speed comparison between `megatron sft` and `swift sft` for full-parameter training of a dense model with 8K context on a single node with eight A800 GPUs is as follows:

	Dense Qwen2.5-14B:

	| | Megatron-LM | DeepSpeed-ZeRO2 | DeepSpeed-ZeRO3 |
	| -------- | ----------- | --------------- | --------------- |
	| Training speed | 9.04s/it | 10.32s/it | 10.56s/it |
	| GPU memory usage | 8 * 64GB | 8 * 80GB | 8 * 58GB |

	The training-speed comparison between `megatron sft` and `swift sft` for full-parameter training of an MoE model with 8K context on two nodes with 16 A800 GPUs is as follows:

	MoE Qwen3-30B-A3B:

	| | Megatron-LM | DeepSpeed-ZeRO2 | DeepSpeed-ZeRO3 |
	| -------- | ----------- | --------------- | --------------- |
	| Training speed | 9.6s/it | - | 91.2s/it |
	| GPU memory usage | 16 * 60GiB | OOM | 16 * 80GiB |
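
The speedups implied by the tables above follow from dividing the seconds-per-iteration figures. A minimal sketch of that arithmetic, using `awk` for the floating-point division (the helper name `speedup` is hypothetical):

```shell
# Hypothetical helper: ratio of baseline s/it to Megatron s/it, to two decimals.
speedup() {
    awk -v baseline="$1" -v megatron="$2" 'BEGIN { printf "%.2f\n", baseline / megatron }'
}

speedup 10.32 9.04   # dense, vs DeepSpeed-ZeRO2: prints 1.14
speedup 10.56 9.04   # dense, vs DeepSpeed-ZeRO3: prints 1.17
speedup 91.2 9.6     # MoE,   vs DeepSpeed-ZeRO3: prints 9.50
```

The dense model gains roughly 14 to 17 percent, while the MoE model trains about 9.5 times faster per iteration.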


	## Megatron-SWIFT WeChat Group

	<img src="https://raw.githubusercontent.com/modelscope/ms-swift/main/docs/resources/wechat/megatron.png" width="250">