Elastic
Installing dependencies
Set up a Kubernetes cluster and deploy DLRover in it, then install the Python dependencies:
pip install dlrover && pip install tornado && pip install kubernetes && pip install ms-swift
Other dependencies and versions in the training image that have been repeatedly tested and verified: deepspeed 0.16.5 (apply https://github.com/deepspeedai/DeepSpeed/pull/7585/files to fix a universal-checkpoint issue) and pytorch 2.6.0.
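Before launching, it can be worth confirming that the image actually carries the tested versions; a minimal check (standard pip/python commands, with the version pins listed above):
pip show dlrover deepspeed ms-swift | grep -E '^(Name|Version)'
python -c "import torch; print('torch', torch.__version__)"             # expect 2.6.0
python -c "import deepspeed; print('deepspeed', deepspeed.__version__)" # expect 0.16.5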
How to launch
Enable elastic training by adding deepspeed_elastic (optionally also graceful_exit) to --callbacks, and configure the DeepSpeed elasticity parameters.
The command is composed as: dlrover-run + dlrover arguments + the swift entry script + swift arguments. Apart from its own custom arguments, dlrover-run accepts the same arguments as torchrun.
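Schematically, with placeholder values (a complete example follows below; the site-packages path is the one used in that example, adjust it to your installation):
# dlrover-run <dlrover/torchrun arguments> <swift entry script> <swift arguments>
dlrover-run --nnodes 1:2 --nproc-per-node=1 \
    /opt/conda/lib/python3.10/site-packages/swift/cli/sft.py \
    --model my_model --dataset my_dataset --callbacks deepspeed_elastic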
The dlrover-run arguments are as follows:
usage: dlrover-run [-h] [--nnodes NNODES] [--nproc-per-node NPROC_PER_NODE]
[--rdzv-backend RDZV_BACKEND] [--rdzv-endpoint RDZV_ENDPOINT] [--rdzv-id RDZV_ID]
[--rdzv-conf RDZV_CONF] [--standalone] [--max-restarts MAX_RESTARTS]
[--monitor-interval MONITOR_INTERVAL] [--start-method {spawn,fork,forkserver}]
[--role ROLE] [-m] [--no-python] [--run-path] [--log-dir LOG_DIR] [-r REDIRECTS]
[-t TEE] [--local-ranks-filter LOCAL_RANKS_FILTER] [--node-rank NODE_RANK]
[--master-addr MASTER_ADDR] [--master-port MASTER_PORT] [--local-addr LOCAL_ADDR]
[--logs-specs LOGS_SPECS] [--precheck {0,1,2}] [--node_unit NODE_UNIT]
[--auto_config] [--auto_tunning] [--exclude-straggler] [--save_at_breakpoint]
[--accelerator {nvidia.com/gpu,ascend-npu}] [--training_port TRAINING_PORT]
[--switchbox-check] [--box-pairs PAIR [PAIR ...]] [--min-bandwidth MIN_BANDWIDTH]
[--min-channels MIN_CHANNELS] [--numa-affinity] [--network-check]
[--comm-perf-test] [--ucp_device_type UCP_DEVICE_TYPE]
training_script
For elastic training, the arguments to pay attention to are:
--nnodes NNODES Number of nodes, or the range of nodes in form <minimum_nodes>:<maximum_nodes>.
--nproc-per-node NPROC_PER_NODE Number of processes per node.
Example:
model=your_model_path                 # path to the model
dataset=your_dataset                  # dataset to train on
output=your_output_dir                # output directory
export CUDA_VISIBLE_DEVICES=0         # set according to the GPUs actually in use
deepspeed_config_or_type=zero1        # DeepSpeed type or config file path, e.g. zero1 or /xxx/ms-swift/swift/llm/ds_config/zero1.json
dlrover-run --nnodes 1:$NODE_NUM --nproc_per_node=1 \
/opt/conda/lib/python3.10/site-packages/swift/cli/sft.py --model $model \
--model_type qwen3 \
--train_type lora \
--torch_dtype bfloat16 \
--dataset $dataset \
--num_train_epochs 4 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--learning_rate 5e-7 \
--gradient_accumulation_steps 8 \
--eval_steps 500 \
--save_steps 10 \
--save_total_limit 20 \
--logging_steps 1 \
--output_dir $output \
--warmup_ratio 0.01 \
--dataloader_num_workers 4 \
--temperature 1.0 \
--system You\ are\ a\ helpful\ assistant. \
--lora_rank 8 \
--lora_alpha 32 \
--target_modules all-linear \
--dataset_num_proc 1 \
--use_flash_ckpt true \
--callbacks deepspeed_elastic graceful_exit \
--deepspeed $deepspeed_config_or_type
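The example above runs a single GPU per node; for an actual elastic multi-worker run, only the process-topology flags change. A sketch assuming 8 GPUs per worker, with NODE_NUM matching the worker replica count of the ElasticJob manifest shown further below:
export NODE_NUM=2   # must match the worker replicas in the ElasticJob manifest
dlrover-run --nnodes 1:$NODE_NUM --nproc-per-node=8 \
    /opt/conda/lib/python3.10/site-packages/swift/cli/sft.py \
    --model $model --dataset $dataset --output_dir $output \
    --train_type lora --torch_dtype bfloat16 \
    --use_flash_ckpt true \
    --callbacks deepspeed_elastic graceful_exit \
    --deepspeed $deepspeed_config_or_type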
Example configuration file
By default, zero1 corresponds to the following configuration:
{
"fp16": {
"enabled": "auto",
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 16,
"hysteresis": 2,
"min_loss_scale": 1
},
"bf16": {
"enabled": "auto"
},
"zero_optimization": {
"stage": 1,
"offload_optimizer": {
"device": "none",
"pin_memory": true
},
"allgather_partitions": true,
"allgather_bucket_size": 2e8,
"overlap_comm": false,
"reduce_scatter": true,
"reduce_bucket_size": 2e8,
"contiguous_gradients": true
},
"gradient_accumulation_steps": "auto",
"gradient_clipping": "auto",
"steps_per_print": 2000,
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto",
"wall_clock_breakdown": false,
"elasticity": {
"ignore_non_elastic_batch_info": true,
"enabled": true,
"max_train_batch_size": 8,
"micro_batch_sizes": [
4,
2
],
"min_gpus": 1,
"max_gpus": 4,
"min_time": 20,
"version": 0.1
}
}
If you need a custom configuration, set deepspeed_config_or_type in the launch command to the path of your own zero1.json. The elasticity-related part of the configuration is:
...
"elasticity": {
"ignore_non_elastic_batch_info": true,
"enabled": true,
"max_train_batch_size": 8,
"micro_batch_sizes": [
4,
2
],
"min_gpus": 1,
"max_gpus": 4,
"min_time": 20,
"version": 0.1
}
- ignore_non_elastic_batch_info: when true, the settings inside elasticity take precedence over the batch-size settings outside it; during training, batch_size and related parameters are updated on the fly according to the actual number of training processes. The rule is: global-training-batch-size = micro-batch-size * gradient-accumulation-steps * world-size (see the worked example after this list)
- max_train_batch_size: the maximum global batch size
- micro_batch_sizes: the per-GPU micro-batch sizes allowed under elasticity, i.e. the candidate values for train_micro_batch_size_per_gpu
- min_gpus: the minimum number of GPUs
- max_gpus: the maximum number of GPUs

For more details, see the DeepSpeed documentation.
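As a worked example of the rule above with the sample config (micro_batch_sizes [4, 2], max_train_batch_size 8): if the job shrinks to 2 training processes, a micro-batch of 2 with 2 gradient-accumulation steps hits the maximum global batch exactly. The arithmetic as a quick shell check (variable names here are illustrative, not DeepSpeed options):
micro_batch_size=2; gradient_accumulation_steps=2; world_size=2
echo $((micro_batch_size * gradient_accumulation_steps * world_size))   # 8 == max_train_batch_size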
Launching the training job
---
apiVersion: elastic.iml.github.io/v1alpha1
kind: ElasticJob
metadata:
name: deepspeed-elastic-swift
namespace: dlrover
spec:
distributionStrategy: AllreduceStrategy
optimizeMode: single-job
replicaSpecs:
worker:
replicas: 1 # must match the maximum value of --nnodes in the launch command
template:
spec:
restartPolicy: Never
containers:
- name: main
image: # training image; deepspeed, dlrover, and swift must be installed in it
imagePullPolicy: IfNotPresent
command:
- /bin/bash
- -c
- sh start.sh # launch script (contains the dlrover-run command above)
resources:
limits:
cpu: '8'
memory: 16Gi
nvidia.com/gpu: '1'
volumeMounts:
- mountPath: /model
name: volume-model
- mountPath: /dev/shm
name: volume-shm
restartPolicy: Never
volumes:
- hostPath:
path: /model
type: Directory
name: volume-model
- emptyDir:
medium: Memory
sizeLimit: 200Gi
name: volume-shm
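Assuming the manifest above is saved as elastic_job.yaml (the file name is arbitrary) and the DLRover job controller with its CRDs is already running in the cluster, the job can be submitted and watched with standard kubectl commands:
kubectl apply -f elastic_job.yaml
kubectl -n dlrover get elasticjob deepspeed-elastic-swift   # job status
kubectl -n dlrover get pods                                 # worker pods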