# Elastic Training

## Installing Dependencies

Deploy a Kubernetes cluster and deploy [DLRover](https://github.com/intelligent-machine-learning/dlrover) in it, then install the Python dependencies:

`pip install dlrover && pip install tornado && pip install kubernetes && pip install ms-swift`

Other dependencies (and versions) in the training image that have been repeatedly tested and verified:

- deepspeed 0.16.5 (apply the fix from https://github.com/deepspeedai/DeepSpeed/pull/7585/files for a universal-checkpoint issue)
- pytorch 2.6.0

## How to Launch

Enable elastic training by adding `deepspeed_elastic` (and optionally `graceful_exit`) to `--callbacks`, and configure the DeepSpeed elasticity parameters.

The command is composed as: `dlrover-run` + dlrover arguments + swift launch command + swift arguments. Apart from its custom arguments, `dlrover-run` accepts the same arguments as `torchrun`.

The `dlrover-run` arguments are:
```
usage: dlrover-run [-h] [--nnodes NNODES] [--nproc-per-node NPROC_PER_NODE]
                   [--rdzv-backend RDZV_BACKEND] [--rdzv-endpoint RDZV_ENDPOINT] [--rdzv-id RDZV_ID]
                   [--rdzv-conf RDZV_CONF] [--standalone] [--max-restarts MAX_RESTARTS]
                   [--monitor-interval MONITOR_INTERVAL] [--start-method {spawn,fork,forkserver}]
                   [--role ROLE] [-m] [--no-python] [--run-path] [--log-dir LOG_DIR] [-r REDIRECTS]
                   [-t TEE] [--local-ranks-filter LOCAL_RANKS_FILTER] [--node-rank NODE_RANK]
                   [--master-addr MASTER_ADDR] [--master-port MASTER_PORT] [--local-addr LOCAL_ADDR]
                   [--logs-specs LOGS_SPECS] [--precheck {0,1,2}] [--node_unit NODE_UNIT]
                   [--auto_config] [--auto_tunning] [--exclude-straggler] [--save_at_breakpoint]
                   [--accelerator {nvidia.com/gpu,ascend-npu}] [--training_port TRAINING_PORT]
                   [--switchbox-check] [--box-pairs PAIR [PAIR ...]] [--min-bandwidth MIN_BANDWIDTH]
                   [--min-channels MIN_CHANNELS] [--numa-affinity] [--network-check]
                   [--comm-perf-test] [--ucp_device_type UCP_DEVICE_TYPE]
                   training_script
```
The parameters to pay attention to for elastic training are:

- `--nnodes NNODES`: Number of nodes, or the range of nodes in the form `<minimum_nodes>:<maximum_nodes>`.
- `--nproc-per-node NPROC_PER_NODE`: Number of processes per node.
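For reference, `--nnodes` accepts either a fixed node count (e.g. `4`) or an elastic range `<minimum_nodes>:<maximum_nodes>` (e.g. `1:4`). A tiny sketch of how such a value decomposes; the `parse_nnodes` helper is illustrative, not dlrover's actual parser:

```python
# Illustrative parser for the two forms --nnodes accepts:
# a fixed count ("4") or an elastic range ("<min>:<max>", e.g. "2:4").
def parse_nnodes(value):
    if ":" in value:
        lo, hi = value.split(":")
        return int(lo), int(hi)
    n = int(value)
    return n, n  # a fixed count means min == max

print(parse_nnodes("2:4"))  # (2, 4)
print(parse_nnodes("4"))    # (4, 4)
```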
Example:
```bash
model=your_model_path
dataset=your_dataset
output=your_output_dir
export CUDA_VISIBLE_DEVICES=0  # set according to the GPUs actually in use
# DeepSpeed type or path to a config file, e.g. zero1 or /xxx/ms-swift/swift/llm/ds_config/zero1.json
deepspeed_config_or_type=zero1
NODE_NUM=1  # maximum number of nodes
dlrover-run --nnodes 1:$NODE_NUM --nproc_per_node=1 \
    /opt/conda/lib/python3.10/site-packages/swift/cli/sft.py --model $model \
    --model_type qwen3 \
    --train_type lora \
    --torch_dtype bfloat16 \
    --dataset $dataset \
    --num_train_epochs 4 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --learning_rate 5e-7 \
    --gradient_accumulation_steps 8 \
    --eval_steps 500 \
    --save_steps 10 \
    --save_total_limit 20 \
    --logging_steps 1 \
    --output_dir $output \
    --warmup_ratio 0.01 \
    --dataloader_num_workers 4 \
    --temperature 1.0 \
    --system You\ are\ a\ helpful\ assistant. \
    --lora_rank 8 \
    --lora_alpha 32 \
    --target_modules all-linear \
    --dataset_num_proc 1 \
    --use_flash_ckpt true \
    --callbacks deepspeed_elastic graceful_exit \
    --deepspeed $deepspeed_config_or_type
```
## Example Configuration File
By default, `zero1` corresponds to the following configuration:
```json
{
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "bf16": {
    "enabled": "auto"
  },
  "zero_optimization": {
    "stage": 1,
    "offload_optimizer": {
      "device": "none",
      "pin_memory": true
    },
    "allgather_partitions": true,
    "allgather_bucket_size": 2e8,
    "overlap_comm": false,
    "reduce_scatter": true,
    "reduce_bucket_size": 2e8,
    "contiguous_gradients": true
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "steps_per_print": 2000,
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "wall_clock_breakdown": false,
  "elasticity": {
    "ignore_non_elastic_batch_info": true,
    "enabled": true,
    "max_train_batch_size": 8,
    "micro_batch_sizes": [
      4,
      2
    ],
    "min_gpus": 1,
    "max_gpus": 4,
    "min_time": 20,
    "version": 0.1
  }
}
```
If you need to customize it, point `deepspeed_config_or_type` in the launch command to the path of your custom `zero1.json`. The elasticity-related configuration is:
```json
...
"elasticity": {
  "ignore_non_elastic_batch_info": true,
  "enabled": true,
  "max_train_batch_size": 8,
  "micro_batch_sizes": [
    4,
    2
  ],
  "min_gpus": 1,
  "max_gpus": 4,
  "min_time": 20,
  "version": 0.1
}
```
- `ignore_non_elastic_batch_info`: the settings inside `elasticity` override the batch-size-related settings outside it; during training, `batch_size` and related parameters are adjusted on the fly according to the actual number of training processes. The computation principle is:
  `global-training-batch-size = micro-batch-size * gradient-accumulation-steps * world-size`
- `max_train_batch_size`: maximum global batch size
- `micro_batch_sizes`: the list of per-GPU micro-batch sizes allowed under elasticity, i.e. the candidate values for `train_micro_batch_size_per_gpu`
- `min_gpus`: minimum number of GPUs
- `max_gpus`: maximum number of GPUs

For more details, see [DeepSpeed](https://www.deepspeed.ai/docs/config-json/#elastic-training-config-v01-and-v02).
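As an illustration of the computation principle above, the following sketch picks a micro-batch size and gradient-accumulation step count for a given number of GPUs under the example config. The function `best_global_batch_size` is hypothetical and is not DeepSpeed's actual elasticity algorithm, only a minimal demonstration of the formula:

```python
# Sketch of how an elastic global batch size can be chosen, following
#   global_batch_size = micro_batch_size * gradient_accumulation_steps * world_size
# This is an illustration, NOT DeepSpeed's real elasticity logic.

def best_global_batch_size(world_size, micro_batch_sizes, max_train_batch_size):
    """Return (micro_batch, grad_accum_steps, global_batch) maximizing the
    global batch size without exceeding max_train_batch_size, or None if
    no candidate micro-batch size fits this many GPUs."""
    best = None
    for mb in micro_batch_sizes:
        # largest accumulation-step count that stays under the cap
        gas = max_train_batch_size // (mb * world_size)
        if gas < 1:
            continue  # this micro-batch size does not fit at this world size
        global_bs = mb * gas * world_size
        if best is None or global_bs > best[2]:
            best = (mb, gas, global_bs)
    return best

# With the example config (max_train_batch_size=8, micro_batch_sizes=[4, 2]):
print(best_global_batch_size(1, [4, 2], 8))  # (4, 2, 8): 4 * 2 * 1 = 8
print(best_global_batch_size(4, [4, 2], 8))  # (2, 1, 8): 2 * 1 * 4 = 8
```

When the number of workers changes during training, the same kind of recomputation keeps the global batch size consistent.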
## Start Training
```yaml
---
apiVersion: elastic.iml.github.io/v1alpha1
kind: ElasticJob
metadata:
  name: deepspeed-elastic-swift
  namespace: dlrover
spec:
  distributionStrategy: AllreduceStrategy
  optimizeMode: single-job
  replicaSpecs:
    worker:
      replicas: 1  # must match the maximum value of --nnodes NNODES in the launch command
      template:
        spec:
          containers:
            - name: main
              image:  # training image; deepspeed, dlrover and swift must be installed
              imagePullPolicy: IfNotPresent
              command:
                - /bin/bash
                - -c
                - sh start.sh  # launch script
              resources:
                limits:
                  cpu: '8'
                  memory: 16Gi
                  nvidia.com/gpu: '1'
              volumeMounts:
                - mountPath: /model
                  name: volume-model
                - mountPath: /dev/shm
                  name: volume-shm
          restartPolicy: Never
          volumes:
            - hostPath:
                path: /model
                type: Directory
              name: volume-model
            - emptyDir:
                medium: Memory
                sizeLimit: 200Gi
              name: volume-shm
```
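The container `command` above invokes a `start.sh` launch script. A minimal sketch of such a script, reusing the `dlrover-run` command from the earlier example: all paths and values are placeholders to adapt, and the command is echoed here for illustration rather than executed.

```shell
#!/bin/bash
# Sketch of a start.sh for the ElasticJob above.
# All paths and values are placeholders; adapt them to your environment.
set -e

model=/model/your_model    # available via the volume-model hostPath mount
dataset=/model/your_dataset
output=/model/output
NODE_NUM=1                 # must match replicas in the ElasticJob spec
export CUDA_VISIBLE_DEVICES=0

CMD="dlrover-run --nnodes 1:$NODE_NUM --nproc_per_node=1 \
  /opt/conda/lib/python3.10/site-packages/swift/cli/sft.py \
  --model $model --dataset $dataset --output_dir $output \
  --callbacks deepspeed_elastic graceful_exit \
  --use_flash_ckpt true --deepspeed zero1"

echo "$CMD"  # replace with: eval "$CMD" to actually launch training
```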