# 🔥 Training Guidelines for DVD

This document provides a comprehensive guide to training **DVD (Deterministic Video Depth)**.

## 1. 📂 Key Files Overview

Before starting, it is helpful to understand the core scripts involved in the training process:

* `train_script/train_video_new.sh`, `examples/wanvideo/model_training/train_with_accelerate_video.py`: The main entry point for the training loop.
* `examples/wanvideo/model_training/WanTrainingModule.py`: Handles the training and validation logic. Note that only a single window is validated during training; to save time, consider using the inference script if you need fuller validation.
* `examples/dataset`: Dataset handling for both training and validation.
* `train_config/normal_config/video_config_new.yaml`: Contains all hyperparameters, including the learning rate, batch size, dataset config, and so on.
* `diffsynth/pipelines/wan_video_new_determine.py`: The core model architecture.

---

## 2. 🗄️ Dataset Preparation

As mentioned in our paper, DVD requires only **367K frames** to unlock generative priors. We mainly use [Hypersim](https://github.com/apple/ml-hypersim) (image), [TartanAir](https://theairlab.org/tartanair-dataset/) (video), and [Virtual KITTI](https://europe.naverlabs.com/proxy-virtual-worlds-vkitti-2/) (video, image) for training.

Please download the raw datasets from their official websites and organize them as follows:

```
vkitti
├── Scene01
├── Scene02
├── ...
hypersim/
├── test
├── train
└── val
ttr
├── abandonedfactory
├── abandonedfactory_night
├── amusement
├── ...
```

---

## 3. ⚙️ Configuration

All training hyperparameters are centralized in `train_config/normal_config/video_config_new.yaml`. Key parameters you might want to adjust based on your hardware:

* `batch_size`: Reduce this if you encounter Out-Of-Memory (OOM) errors.
* `gradient_accumulation_steps`: Increase this to maintain the effective batch size if you reduce `batch_size`.
* `use_gradient_checkpointing`: Set this to `True` if you are facing OOM errors.
* `learning_rate`: Default is `1e-4`.
* `{test/train}_{min/max}_num_frame`: The minimum and maximum number of frames processed in one clip (default: 45-45, i.e., fixed-length 45-frame clips).
* `denoise_step`: The $\tau$ condition in our paper.
* `grad_loss`, `grad_co`: The LMR and its weight $\lambda_{LMR}$ in our paper.
* `lora_rank`: Set to 512 following [Lotus-2](https://lotus-2.github.io/).
* `init_validate`: Whether to perform an initial validation pass before training.
* `log_step`: Interval (in steps) for logging the training state.
* `prob`: The sampling ratios for Hypersim (img), Virtual KITTI (img), TartanAir (vid), and Virtual KITTI (vid).
* `batch_size`: The batch size for **image** data. The batch size for video is fixed at 1.
* Dataset settings: Please refer to `examples/dataset` for more details.

---

## 4. 🚀 Launching the Training

Make sure you have downloaded the base weights (e.g., Wan2.1) before starting (the weights are downloaded automatically if you use the provided training script).

### Multi-GPU / Distributed Training (Recommended)

We use `accelerate` for multi-GPU training. To train on 4 GPUs:

```bash
bash train_script/train_video_new.sh
```

You can also edit the files under `train_config/accelerate_config` to change the GPU configuration (e.g., for single-GPU or DeepSpeed training).

---

## 5. 📊 Checkpoints

### Resuming from a Checkpoint

Checkpoints are saved automatically in the `output_path` directory every `validate_step` steps. If your training is interrupted, you can resume it by specifying `training_state_dir` and setting `resume` and `load_optimizer` to `True`. If possible, also set `global_step`. Then simply rerun the training script:

```bash
bash train_script/train_video_new.sh
```

Please refer to [this line in the training script](../examples/wanvideo/model_training/train_with_accelerate_video.py#L474) for more details.
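As an illustration, the resume-related fields described above might be set as in the sketch below. The field names are taken from this guide, but the exact nesting and the example path/step values are assumptions; check `train_config/normal_config/video_config_new.yaml` for the real layout.

```yaml
# Hypothetical sketch of resume settings; actual YAML nesting may differ.
resume: True                    # resume from a saved training state
load_optimizer: True            # also restore the optimizer state
training_state_dir: output/step_5000   # illustrative checkpoint directory
global_step: 5000               # set if known, so the step counter continues correctly
```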