## Train a model
MMSegmentation supports both distributed and non-distributed training,
which use `MMDistributedDataParallel` and `MMDataParallel` respectively.
All outputs (log files and checkpoints) are saved to the working directory,
which is specified by `work_dir` in the config file.
By default, the model is evaluated on the validation set at a fixed interval during training; you can change the interval by adding the `interval` argument to the training config.
```python
evaluation = dict(interval=4000)  # Evaluate the model every 4000 iterations.
```
**\*Important\***: The default learning rate in config files is for 4 GPUs and 2 img/gpu (batch size = 4x2 = 8).
Equivalently, you may also use 8 GPUs and 1 img/gpu, since all models use cross-GPU SyncBN.
To trade training speed for GPU memory, you may pass `--options model.backbone.with_cp=True` to enable gradient checkpointing in the backbone.
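If you train with a different total batch size, a common convention (the linear scaling rule, not stated explicitly here) is to scale the learning rate proportionally. A minimal sketch; `BASE_LR` and `scaled_lr` are illustrative names, not part of the MMSegmentation API:

```python
# Linear scaling rule sketch: scale lr with total batch size.
BASE_BATCH = 4 * 2   # default setup: 4 GPUs x 2 img/gpu = 8
BASE_LR = 0.01       # example base learning rate (assumed value)

def scaled_lr(gpus, imgs_per_gpu, base_lr=BASE_LR, base_batch=BASE_BATCH):
    """Scale the learning rate linearly with the total batch size."""
    return base_lr * (gpus * imgs_per_gpu) / base_batch

print(scaled_lr(8, 1))   # batch size 8, same as default -> 0.01
print(scaled_lr(16, 2))  # batch size 32 -> 0.04
```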
### Train with a single GPU
```shell
python tools/train.py ${CONFIG_FILE} [optional arguments]
```
If you want to specify the working directory in the command, add the argument `--work-dir ${YOUR_WORK_DIR}`.
### Train with multiple GPUs
```shell
./tools/dist_train.sh ${CONFIG_FILE} ${GPU_NUM} [optional arguments]
```
Optional arguments are:
- `--no-validate` (**not suggested**): By default, the codebase performs evaluation every k iterations during training. To disable this behavior, use `--no-validate`.
- `--work-dir ${WORK_DIR}`: Override the working directory specified in the config file.
- `--resume-from ${CHECKPOINT_FILE}`: Resume from a previous checkpoint file (to continue the training process).
- `--load-from ${CHECKPOINT_FILE}`: Load weights from a checkpoint file (to start fine-tuning for another task).

Difference between `resume-from` and `load-from`:
- `resume-from` loads both the model weights and the optimizer state, including the iteration number.
- `load-from` loads only the model weights and starts training from iteration 0.
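For example (the checkpoint paths below are placeholders, not files shipped with the repository):

```shell
# Continue an interrupted run: restores weights, optimizer state,
# and the iteration counter from the checkpoint.
python tools/train.py ${CONFIG_FILE} --resume-from work_dirs/my_exp/latest.pth

# Fine-tune from pretrained weights: loads weights only and starts
# training from iteration 0.
python tools/train.py ${CONFIG_FILE} --load-from checkpoints/pretrained.pth
```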
### Train with multiple machines
If you run MMSegmentation on a cluster managed with [slurm](https://slurm.schedmd.com/), you can use the script `slurm_train.sh`. (This script also supports single-machine training.)
```shell
[GPUS=${GPUS}] ./tools/slurm_train.sh ${PARTITION} ${JOB_NAME} ${CONFIG_FILE} --work-dir ${WORK_DIR}
```
Here is an example of using 16 GPUs to train PSPNet on the dev partition.
```shell
GPUS=16 ./tools/slurm_train.sh dev pspr50 configs/pspnet/pspnet_r50-d8_512x1024_40k_cityscapes.py /nfs/xxxx/psp_r50_512x1024_40ki_cityscapes
```
You can check [slurm_train.sh](../tools/slurm_train.sh) for full arguments and environment variables.
If you have multiple machines connected only via Ethernet, you can refer to the
PyTorch [launch utility](https://pytorch.org/docs/stable/distributed_deprecated.html#launch-utility).
Training is usually slow without high-speed networking such as InfiniBand.
### Launch multiple jobs on a single machine
If you launch multiple jobs on a single machine, e.g., 2 jobs of 4-GPU training on a machine with 8 GPUs,
you need to specify a different port (29500 by default) for each job to avoid communication conflicts. Otherwise, you will get the error `RuntimeError: Address already in use`.
If you use `dist_train.sh` to launch training jobs, you can set the port with the environment variable `PORT`.
```shell
CUDA_VISIBLE_DEVICES=0,1,2,3 PORT=29500 ./tools/dist_train.sh ${CONFIG_FILE} 4
CUDA_VISIBLE_DEVICES=4,5,6,7 PORT=29501 ./tools/dist_train.sh ${CONFIG_FILE} 4
```
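When launching many jobs, picking ports by hand gets error-prone; a small sketch of deriving a unique port from a job index (the `job_index` variable is hypothetical, not something `dist_train.sh` reads):

```shell
# Derive a unique port per job so concurrent jobs on one machine
# never collide on the default port 29500.
job_index=1                    # 0 for the first job, 1 for the second, ...
PORT=$((29500 + job_index))
echo "$PORT"                   # -> 29501
```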
If you use `slurm_train.sh` to launch training jobs, you can set the port with the environment variable `MASTER_PORT`.
```shell
MASTER_PORT=29500 ./tools/slurm_train.sh ${PARTITION} ${JOB_NAME} ${CONFIG_FILE}
MASTER_PORT=29501 ./tools/slurm_train.sh ${PARTITION} ${JOB_NAME} ${CONFIG_FILE}
```