<img src="./assets/dreamRelation-logo.png" alt="header" width="25%" style="display: block; margin: 0 auto;">

# [ICCV 2025] DreamRelation: Relation-Centric Video Customization

## Installation

### 1. Prepare Environment and Pretrained Models

Follow the official installation instructions for Mochi 1 to set up your environment and download the necessary pretrained models. You can find the guide here: [Mochi 1 Official Repository](https://github.com/genmoai/mochi?tab=readme-ov-file#installation).
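
A minimal setup sketch, assuming a plain `pip` editable install; the official guide linked above is authoritative and may use `uv` or additional build flags:

```bash
# Clone the Mochi 1 repository and install it into the current environment
git clone https://github.com/genmoai/mochi.git
cd mochi
pip install setuptools wheel
pip install -e . --no-build-isolation
```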

### 2. Place Mochi 1 Checkpoints

Move the downloaded Mochi 1 pretrained checkpoints into the `pretrained_models/mochi` directory.

### 3. Place T5 Checkpoints

Move the downloaded T5 pretrained checkpoints into the `pretrained_models/t5-v1_1-xxl` directory.
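
After these two steps, the checkpoints should sit in the directories above. The source paths below are placeholders; keep whatever filenames the official downloads produce:

```bash
mkdir -p pretrained_models/mochi pretrained_models/t5-v1_1-xxl
# Move the downloaded Mochi 1 weights and the T5 text-encoder files into place
mv /path/to/downloaded/mochi-1-preview/* pretrained_models/mochi/
mv /path/to/downloaded/t5-v1_1-xxl/* pretrained_models/t5-v1_1-xxl/
```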

## Examples

### 1. Download Pre-trained Models

We provide pre-trained LoRA models for the `high-five` and `shaking hands` relations to help you get started. You can download them from either ModelScope or Hugging Face Hub.

**Download from ModelScope**

```bash
pip install modelscope==1.23.0
modelscope download --model weilllllls/DreamRelation --local_dir checkpoints
```

**Download from Hugging Face**

```bash
pip install -U huggingface_hub
huggingface-cli download --resume-download weilllllls/DreamRelation --local-dir checkpoints
```

### 2. Run Example Command

```bash
# shaking hands
CUDA_VISIBLE_DEVICES=0 python demos/cli.py \
    --model_dir pretrain_models_ckpt/mochi-1-preview/ \
    --lora_path checkpoints/examples/shaking_hands/model_2000.lora.pt \
    --num_frames 61 \
    --cpu_offload \
    --prompt "A bear is shaking hands with a raccoon in a meadow." \
    --train_cfg_path demos/fine_tuner/configs/shaking_hands/example.yaml \
    --test_lora_names relation

# high-five
CUDA_VISIBLE_DEVICES=0 python demos/cli.py \
    --model_dir pretrain_models_ckpt/mochi-1-preview/ \
    --lora_path checkpoints/examples/high-five/model_1800.lora.pt \
    --num_frames 61 \
    --cpu_offload \
    --prompt "A real-world bear is high-fiving with a real-world raccoon in a serene forest clearing."
```

## Training

### 1. Prepare Dataset

The training videos are sourced from the [NTU RGB+D dataset](https://rose1.ntu.edu.sg/dataset/actionRecognition/). **Due to the dataset's usage policy, we cannot directly provide the NTU RGB+D relational videos.** Please request access to the NTU RGB+D dataset yourself and download the relevant relational videos. The names of all 26 relation types and their corresponding captions are provided in Table 8 of our paper.

Once the training videos are prepared, randomly select 20–30 videos for each relation and place them under `videos/NTU_RGB_D`.
For each video, also create a text file with the same name that contains the video's caption.
For example, for the `A58: shaking hands` video `S001C001P001R001A058_rgb.mp4`, place it under `videos/NTU_RGB_D/A58-shaking_hands` and create `S001C001P001R001A058_rgb.txt` containing the caption "A person is shaking hands with a person."
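
A minimal sketch of this layout for one clip (the source path is a placeholder; repeat for every selected video):

```bash
mkdir -p videos/NTU_RGB_D/A58-shaking_hands
# Copy a selected NTU RGB+D clip and write its caption file with the same basename
cp /path/to/NTU_RGB_D/S001C001P001R001A058_rgb.mp4 videos/NTU_RGB_D/A58-shaking_hands/
echo "A person is shaking hands with a person." > videos/NTU_RGB_D/A58-shaking_hands/S001C001P001R001A058_rgb.txt
```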

### 2. Preprocess Videos

First, install `bc` if you don't have it:
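
On Debian/Ubuntu, for example (other distributions have equivalent packages):

```bash
sudo apt-get update && sudo apt-get install -y bc
```

Then run the preprocessing script on each relation folder. The output directory below is a placeholder and the exact flags may differ; check `demos/fine_tuner/preprocess.bash` for the full list:

```bash
bash demos/fine_tuner/preprocess.bash -v videos/NTU_RGB_D/A58-shaking_hands/ -o <output_dir> --num_frames 61
```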

You can adjust the `--num_frames` flag to specify the number of frames to extract for each video.

### 3. Prepare Video Masks

Generate masks for your videos using either [Grounded-SAM](https://github.com/IDEA-Research/Grounded-Segment-Anything) or [SAM 2](https://github.com/facebookresearch/segment-anything).
We provide example masks for one video in `videos/NTU_RGB_D/A58-shaking_hands_masks`.

### 4. Start Training

```bash
CUDA_VISIBLE_DEVICES=0 COMPILE_DIT=1 python demos/fine_tuner/train.py --config-path demos/fine_tuner/configs/shaking_hands/example.yaml
```

**Training Tips:**
* Generally, 1800–2400 training steps are sufficient. However, we recommend adjusting the number of training steps per relation to achieve optimal results.
* Default hyperparameters are in `demos/fine_tuner/configs/high-five/example.yaml`. For different relations, adjust `total_positive_nums` and `total_negative_nums` in relational contrastive learning for better results.
* A GPU with 80 GB of VRAM is recommended for training. If you have limited VRAM, you can reduce the number of frames in the `Preprocess Videos` step. The default number of frames for training is 61.
* The current code primarily supports training with a batch size of 1. If you intend to use a batch size greater than 1, careful code review and modifications will be necessary.

If you find this code useful for your research, please cite our paper:

```bibtex
  journal={arXiv preprint arXiv:2503.07602},
  year={2025}
}
```