<img src="./assets/dreamRelation-logo.png" alt="header" width="25%" style="display: block; margin: 0 auto;">

# [ICCV 2025] DreamRelation: Relation-Centric Video Customization

## Installation

### 1. Prepare Environment and Pretrained Models

Follow the official installation instructions for Mochi 1 to set up your environment and download the necessary pretrained models. You can find the guide here: [Mochi 1 Official Repository](https://github.com/genmoai/mochi?tab=readme-ov-file#installation).

### 2. Place Mochi 1 Checkpoints

Move the downloaded Mochi 1 pretrained checkpoints into the `pretrained_models/mochi` directory.

### 3. Place T5 Checkpoints

Move the downloaded T5 pretrained checkpoints into the `pretrained_models/t5-v1_1-xxl` directory.
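
If the downloaded weights sit in a local folder, the two placement steps above amount to a couple of moves. A minimal sketch, assuming a `downloads/` folder and placeholder filenames (only the two target directories come from this README):

```shell
# Create the checkpoint directories the repo expects.
mkdir -p pretrained_models/mochi pretrained_models/t5-v1_1-xxl

# Empty stand-ins for the downloaded checkpoints (real filenames will differ).
mkdir -p downloads
touch downloads/mochi.ckpt downloads/t5.ckpt

# Move each set of weights into its expected directory.
mv downloads/mochi.ckpt pretrained_models/mochi/
mv downloads/t5.ckpt pretrained_models/t5-v1_1-xxl/
```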

## Examples

### 1. Download Pre-trained Models

We provide pre-trained LoRA models for the `high-five` and `shaking hands` relations to help you get started. You can download them from either ModelScope or Hugging Face Hub.

**Download from ModelScope**

```bash
pip install modelscope==1.23.0
modelscope download --model weilllllls/DreamRelation --local_dir checkpoints
```

**Download from Hugging Face**

```bash
pip install -U huggingface_hub
huggingface-cli download --resume-download weilllllls/DreamRelation --local-dir checkpoints
```

### 2. Run Example Command

```bash
# shaking hands
CUDA_VISIBLE_DEVICES=0 python demos/cli.py \
  --model_dir pretrain_models_ckpt/mochi-1-preview/ \
  --lora_path checkpoints/examples/shaking_hands/model_2000.lora.pt \
  --num_frames 61 \
  --cpu_offload \
  --prompt "A bear is shaking hands with a raccoon in a meadow." \
  --train_cfg_path demos/fine_tuner/configs/shaking_hands/example.yaml \
  --test_lora_names relation

# high-five
CUDA_VISIBLE_DEVICES=0 python demos/cli.py \
  --model_dir pretrain_models_ckpt/mochi-1-preview/ \
  --lora_path checkpoints/examples/high-five/model_1800.lora.pt \
  --num_frames 61 \
  --cpu_offload \
  --prompt "A real-world bear is high-fiving with a real-world raccoon in a serene forest clearing."
```

## Training

### 1. Prepare Dataset

The training videos are sourced from the [NTU RGB+D dataset](https://rose1.ntu.edu.sg/dataset/actionRecognition/). **Due to the dataset's usage policy, we cannot directly provide the NTU RGB+D relational videos.** Please apply for access to the NTU RGB+D dataset yourself and download the relevant relational videos. The names of all 26 relation types and their corresponding captions are provided in Table 8 of our paper.

Once the training videos are prepared, randomly select 20–30 videos for each relation and place them under `videos/NTU_RGB_D`. For each video, also create a text file with the same name that contains the video's caption. For example, for one `A58: shaking hands` video `S001C001P001R001A058_rgb.mp4`, place it under `videos/NTU_RGB_D/A58-shaking_hands` and create `S001C001P001R001A058_rgb.txt`, writing the caption "A person is shaking hands with a person." into the file.
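
The caption-file convention above is easy to script. A minimal sketch, in which the `touch` line merely creates an empty stand-in for a clip you have downloaded yourself:

```shell
mkdir -p videos/NTU_RGB_D/A58-shaking_hands
# Empty stand-in for a downloaded NTU RGB+D clip (filename from the example above).
touch videos/NTU_RGB_D/A58-shaking_hands/S001C001P001R001A058_rgb.mp4

# Write a same-named .txt caption next to every video in the relation folder.
for v in videos/NTU_RGB_D/A58-shaking_hands/*.mp4; do
  echo "A person is shaking hands with a person." > "${v%.mp4}.txt"
done
```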

### 2. Preprocess Videos

First, install `bc` if you don't have it. Then run the preprocessing script `demos/fine_tuner/preprocess.bash`, passing each relation's video directory to `-v` (e.g. `videos/NTU_RGB_D/A58-shaking_hands/`) and your chosen output directory to `-o`.

You can adjust the `--num_frames` flag to specify the number of frames to extract for each video.

### 3. Prepare Video Masks

Generate masks for your videos using either [Grounded-SAM](https://github.com/IDEA-Research/Grounded-Segment-Anything) or [SAM 2](https://github.com/facebookresearch/segment-anything).
We have provided example masks for one video in `videos/NTU_RGB_D/A58-shaking_hands_masks`.

### 4. Start Training

```bash
CUDA_VISIBLE_DEVICES=0 COMPILE_DIT=1 python demos/fine_tuner/train.py --config-path demos/fine_tuner/configs/shaking_hands/example.yaml
```
  **Training Tips:**
117
  * Generally, `1800 to 2400` training steps are sufficient. However, we recommend adjusting the training steps based on different relations to achieve optimal results.
118
  * Default hyperparameters are in `demos/fine_tuner/configs/high-five/example.yaml`. For various relations, adjust `total_positive_nums` and `total_negative_nums` in relational contrastive learning for better results.
119
+ * A GPU with 80 GB VRAM is recommended for training. If you have limited VRAM, you can reduce the number of frames in the `Preprocess Videos` step. The default number of frames for training is 61.
120
  * The current code primarily supports training with a batch size of 1. If you intend to use a batch size greater than 1, careful code review and modifications will be necessary.
121
 
122
 
 
151
  journal={arXiv preprint arXiv:2503.07602},
152
  year={2025}
153
  }
154
+ ```