# Elastic

## Installing Dependencies

Deploy a Kubernetes cluster and deploy DLRover in it ([DLRover](https://github.com/intelligent-machine-learning/dlrover)):

`pip install dlrover && pip install tornado && pip install kubernetes && pip install ms-swift`

Other dependencies and versions in the training image that have been repeatedly tested and verified:

- deepspeed 0.16.5 (apply https://github.com/deepspeedai/DeepSpeed/pull/7585/files to fix the universal-checkpoint issue)
- pytorch 2.6.0

## How to Launch

Enable elastic training by adding `deepspeed_elastic` (optionally also `graceful_exit`) to `--callbacks`, and configure the DeepSpeed elasticity parameters.

The command is composed as: `dlrover-run` + dlrover arguments + the swift launch command + swift arguments. Apart from its own custom arguments, `dlrover-run` accepts the same arguments as `torchrun`.

The `dlrover-run` arguments are:

```
usage: dlrover-run [-h] [--nnodes NNODES] [--nproc-per-node NPROC_PER_NODE] [--rdzv-backend RDZV_BACKEND]
                   [--rdzv-endpoint RDZV_ENDPOINT] [--rdzv-id RDZV_ID] [--rdzv-conf RDZV_CONF] [--standalone]
                   [--max-restarts MAX_RESTARTS] [--monitor-interval MONITOR_INTERVAL]
                   [--start-method {spawn,fork,forkserver}] [--role ROLE] [-m] [--no-python] [--run-path]
                   [--log-dir LOG_DIR] [-r REDIRECTS] [-t TEE] [--local-ranks-filter LOCAL_RANKS_FILTER]
                   [--node-rank NODE_RANK] [--master-addr MASTER_ADDR] [--master-port MASTER_PORT]
                   [--local-addr LOCAL_ADDR] [--logs-specs LOGS_SPECS] [--precheck {0,1,2}]
                   [--node_unit NODE_UNIT] [--auto_config] [--auto_tunning] [--exclude-straggler]
                   [--save_at_breakpoint] [--accelerator {nvidia.com/gpu,ascend-npu}]
                   [--training_port TRAINING_PORT] [--switchbox-check] [--box-pairs PAIR [PAIR ...]]
                   [--min-bandwidth MIN_BANDWIDTH] [--min-channels MIN_CHANNELS] [--numa-affinity]
                   [--network-check] [--comm-perf-test] [--ucp_device_type UCP_DEVICE_TYPE]
                   training_script
```

For elastic training, the parameters to pay attention to are:

- `--nnodes NNODES`: number of nodes, or a node range in the form `<min_nodes>:<max_nodes>`.
- `--nproc-per-node NPROC_PER_NODE`: number of processes per node.

Example:

```bash
model=/path/to/model                # your model path
dataset=/path/to/dataset            # your dataset
output=/path/to/output              # your output dir
export CUDA_VISIBLE_DEVICES=0       # set according to the GPUs actually in use
# DeepSpeed type or path to a config file, e.g. zero1 or /xxx/ms-swift/swift/llm/ds_config/zero1.json
deepspeed_config_or_type=zero1
NODE_NUM=1                          # maximum number of nodes (matches the ElasticJob worker replicas)

dlrover-run --nnodes 1:$NODE_NUM --nproc_per_node=1 \
    /opt/conda/lib/python3.10/site-packages/swift/cli/sft.py \
    --model $model \
    --model_type qwen3 \
    --train_type lora \
    --torch_dtype bfloat16 \
    --dataset $dataset \
    --num_train_epochs 4 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --learning_rate 5e-7 \
    --gradient_accumulation_steps 8 \
    --eval_steps 500 \
    --save_steps 10 \
    --save_total_limit 20 \
    --logging_steps 1 \
    --output_dir $output \
    --warmup_ratio 0.01 \
    --dataloader_num_workers 4 \
    --temperature 1.0 \
    --system You\ are\ a\ helpful\ assistant. \
    --lora_rank 8 \
    --lora_alpha 32 \
    --target_modules all-linear \
    --dataset_num_proc 1 \
    --use_flash_ckpt true \
    --callbacks deepspeed_elastic graceful_exit \
    --deepspeed $deepspeed_config_or_type
```

## Configuration File Example

By default, `zero1` corresponds to the following example configuration:

```json
{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "bf16": {
        "enabled": "auto"
    },
    "zero_optimization": {
        "stage": 1,
        "offload_optimizer": {
            "device": "none",
            "pin_memory": true
        },
        "allgather_partitions": true,
        "allgather_bucket_size": 2e8,
        "overlap_comm": false,
        "reduce_scatter": true,
        "reduce_bucket_size": 2e8,
        "contiguous_gradients": true
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 2000,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false,
    "elasticity": {
        "ignore_non_elastic_batch_info": true,
        "enabled": true,
        "max_train_batch_size": 8,
        "micro_batch_sizes": [4, 2],
        "min_gpus": 1,
        "max_gpus": 4,
        "min_time": 20,
        "version": 0.1
    }
}
```

If you need to customize the configuration, point `deepspeed_config_or_type` in the launch command to the path of your own zero1.json. The elasticity-related part of the configuration is:

```json
...
"elasticity": {
    "ignore_non_elastic_batch_info": true,
    "enabled": true,
    "max_train_batch_size": 8,
    "micro_batch_sizes": [4, 2],
    "min_gpus": 1,
    "max_gpus": 4,
    "min_time": 20,
    "version": 0.1
}
```

- `ignore_non_elastic_batch_info`: the settings inside `elasticity` take precedence over (ignore) the outer batch-size-related settings; during training, the batch-size-related parameters are adjusted on the fly according to the actual number of training processes (see the worked example below). The computation rule is:

  global-training-batch-size = micro-batch-size * gradient-accumulation-steps * world-size

- `max_train_batch_size`: the maximum global training batch size
- `micro_batch_sizes`: the per-GPU micro-batch sizes allowed under elasticity, i.e. the candidate values for `train_micro_batch_size_per_gpu`
- `min_gpus`: the minimum number of GPUs
- `max_gpus`: the maximum number of GPUs

For more details, see: [DeepSpeed](https://www.deepspeed.ai/docs/config-json/#elastic-training-config-v01-and-v02)
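As a rough worked example of the rule above, using the elasticity settings shown (`max_train_batch_size: 8`, `micro_batch_sizes: [4, 2]`, 1 to 4 GPUs). The exact micro-batch / accumulation choice is made by DeepSpeed's elasticity logic; the numbers below only illustrate how the formula keeps the global batch size stable as the number of workers changes:

```
world_size = 1: micro-batch 4, gradient_accumulation_steps 2  ->  4 * 2 * 1 = 8
world_size = 2: micro-batch 4, gradient_accumulation_steps 1  ->  4 * 1 * 2 = 8
world_size = 4: micro-batch 2, gradient_accumulation_steps 1  ->  2 * 1 * 4 = 8
```

In each case the global training batch size stays at `max_train_batch_size` while the number of training processes varies.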
## Launch Training

```yaml
---
apiVersion: elastic.iml.github.io/v1alpha1
kind: ElasticJob
metadata:
  name: deepspeed-elastic-swift
  namespace: dlrover
spec:
  distributionStrategy: AllreduceStrategy
  optimizeMode: single-job
  replicaSpecs:
    worker:
      replicas: 1  # must match the maximum value of --nnodes NNODES in the launch command
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: main
              image:   # training image; deepspeed, dlrover and swift must be installed
              imagePullPolicy: IfNotPresent
              command:
                - /bin/bash
                - -c
                - sh start.sh  # launch script
              resources:
                limits:
                  cpu: '8'
                  memory: 16Gi
                  nvidia.com/gpu: '1'
              volumeMounts:
                - mountPath: /model
                  name: volume-model
                - mountPath: /dev/shm
                  name: volume-shm
          volumes:
            - hostPath:
                path: /model
                type: Directory
              name: volume-model
            - emptyDir:
                medium: Memory
                sizeLimit: 200Gi
              name: volume-shm
```
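The ElasticJob above runs `start.sh` inside the container; the script itself is not part of this document. A minimal sketch, assuming it simply wraps the `dlrover-run` command from the earlier example (paths, variable values, and the abbreviated argument list are placeholders to adapt to your image):

```bash
#!/bin/bash
# start.sh -- minimal launch script for the ElasticJob container (a sketch; adjust paths for your image).
set -e

model=/model/your-model              # model path, mounted from the volume-model hostPath above
dataset=/path/to/dataset             # your dataset
output=/path/to/output               # your output dir
deepspeed_config_or_type=zero1       # DeepSpeed type or path to a config containing an "elasticity" section
NODE_NUM=1                           # must match the worker replicas of the ElasticJob

dlrover-run --nnodes 1:$NODE_NUM --nproc_per_node=1 \
    /opt/conda/lib/python3.10/site-packages/swift/cli/sft.py \
    --model $model \
    --train_type lora \
    --torch_dtype bfloat16 \
    --dataset $dataset \
    --output_dir $output \
    --use_flash_ckpt true \
    --callbacks deepspeed_elastic graceful_exit \
    --deepspeed $deepspeed_config_or_type
```

The remaining swift arguments from the earlier example can be appended as needed. The job is then submitted to the cluster in the usual way, e.g. `kubectl apply -f <your-elasticjob>.yaml`.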