weilllllls committed
Commit 2bee80e · verified · 1 Parent(s): 0ead5c6

Upload 4 files
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+assets/dreamRelation-logo.png filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -1,3 +1,136 @@
----
-license: cc-by-nc-nd-4.0
----
<img src="./assets/dreamRelation-logo.png" alt="header" width="25%" style="display: block; margin: 0 auto;">

# [ICCV 2025] DreamRelation: Relation-Centric Video Customization

[![arXiv](https://img.shields.io/badge/arXiv-2503.07602-b31b1b.svg)](https://arxiv.org/abs/2503.07602) [![Project Page](https://img.shields.io/badge/Project%20Page-DreamRelation-green.svg)](https://dreamrelation.github.io/)

[Yujie Wei](https://weilllllls.github.io), [Shiwei Zhang](https://scholar.google.com.hk/citations?user=ZO3OQ-8AAAAJ), [Hangjie Yuan](https://jacobyuan7.github.io), [Biao Gong](https://scholar.google.com/citations?user=BwdpTiQAAAAJ), [Longxiang Tang](https://scholar.google.com/citations?user=3oMQsq8AAAAJ), [Xiang Wang](https://scholar.google.com/citations?user=cQbXvkcAAAAJ), [Haonan Qiu](http://haonanqiu.com), [Hengjia Li](https://echopluto.github.io/HomePage/), Shuai Tan, [Yingya Zhang](https://scholar.google.com/citations?user=16RDSEUAAAAJ), [Hongming Shan](http://hmshan.io)

Relational video customization refers to the creation of personalized videos that depict user-specified relations between two subjects, a crucial task for comprehending real-world visual content. While existing methods can personalize subject appearances and motions, they still struggle with complex relational video customization, where precise relational modeling and strong generalization across subject categories are essential. The primary challenge arises from the intricate spatial arrangements, layout variations, and nuanced temporal dynamics inherent in relations; consequently, current models tend to overemphasize irrelevant visual details rather than capturing meaningful interactions.

![method](https://img.alicdn.com/imgextra/i3/O1CN01dh5jW21NEuaKht7w8_!!6000000001539-0-tps-2708-820.jpg "Method")

To address these challenges, we propose **DreamRelation**, a novel approach that personalizes relations through a small set of exemplar videos, leveraging two key components: Relational Decoupling Learning and Relational Dynamics Enhancement. First, in Relational Decoupling Learning, we disentangle relations from subject appearances using a relation LoRA triplet and a hybrid mask training strategy, ensuring better generalization across diverse relationships. Furthermore, we determine the optimal design of the relation LoRA triplet by analyzing the distinct roles of the query, key, and value features within MM-DiT's attention mechanism, making DreamRelation the first relational video generation framework with explainable components. Second, in Relational Dynamics Enhancement, we introduce a space-time relational contrastive loss, which prioritizes relational dynamics while minimizing reliance on detailed subject appearances. Extensive experiments demonstrate that DreamRelation outperforms state-of-the-art methods in relational video customization.
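As a rough illustration of the relation LoRA triplet idea (separate low-rank adapters on the query, key, and value projections of an attention layer), the sketch below uses a hypothetical `LoRALinear` helper; it is not the repository's MM-DiT implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update (illustrative helper)."""
    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)  # zero-init: the LoRA starts as a no-op

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.up(self.down(x))

# One LoRA each for the query, key, and value projections: the "triplet".
dim = 64
q_proj = LoRALinear(nn.Linear(dim, dim))
k_proj = LoRALinear(nn.Linear(dim, dim))
v_proj = LoRALinear(nn.Linear(dim, dim))

x = torch.randn(2, 16, dim)  # (batch, tokens, dim)
q, k, v = q_proj(x), k_proj(x), v_proj(x)
```

Only the low-rank `down`/`up` weights are trainable, which is what lets a small set of exemplar videos specialize the attention pathways without touching the base model.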

## Installation

1. **Prepare Environment and Pretrained Models**

   Follow the official installation instructions for Mochi 1 to set up your environment and download the necessary pretrained models. You can find the guide here: [Mochi 1 Official Repository](https://github.com/genmoai/mochi?tab=readme-ov-file#installation).

2. **Place Mochi 1 Checkpoints**

   Move the downloaded Mochi 1 pretrained checkpoints into the `pretrained_models/mochi` directory.

3. **Place T5 Checkpoints**

   Move the downloaded T5 pretrained checkpoints into the `pretrained_models/t5-v1_1-xxl` directory.
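Once the checkpoints are in place, a small sanity check can confirm the directories exist where the steps above put them. This is a hypothetical helper, not part of the repository; note that the example commands later in this README reference `pretrain_models_ckpt/mochi-1-preview/`, so adjust the paths to match your setup.

```python
from pathlib import Path

# Expected locations from the installation steps above; adjust if your layout differs.
EXPECTED = ["pretrained_models/mochi", "pretrained_models/t5-v1_1-xxl"]

def missing_checkpoint_dirs(root: str = ".") -> list[str]:
    """Return the expected checkpoint directories that do not exist under `root`."""
    return [d for d in EXPECTED if not (Path(root) / d).is_dir()]

if __name__ == "__main__":
    for d in missing_checkpoint_dirs():
        print(f"missing: {d}")
```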

## Examples

1. **Download Trained LoRAs of Examples**

   We provide the trained LoRAs of the examples in `examples/high-five` and `examples/shaking_hands`.

2. **Run Example Command**

   ```bash
   CUDA_VISIBLE_DEVICES=0 python demos/cli.py \
       --model_dir pretrain_models_ckpt/mochi-1-preview/ \
       --lora_path examples/shaking_hands/model_2000.lora.pt \
       --num_frames 61 \
       --cpu_offload \
       --prompt "A bear is shaking hands with a raccoon in a meadow." \
       --seed 522 \
       --train_cfg_path demos/fine_tuner/configs/shaking_hands/example.yaml \
       --test_lora_names relation

   CUDA_VISIBLE_DEVICES=0 python demos/cli.py \
       --model_dir pretrain_models_ckpt/mochi-1-preview/ \
       --lora_path examples/high-five/model_1800.lora.pt \
       --num_frames 61 \
       --cpu_offload \
       --prompt "A real-world bear is high-fiving with a real-world raccoon in a serene forest clearing." \
       --seed 1832 \
       --train_cfg_path demos/fine_tuner/configs/high-five/example.yaml \
       --test_lora_names relation
   ```

## Training

1. **Prepare Dataset**

   For DreamRelation, the training videos are sourced from the [NTU RGB+D dataset](https://rose1.ntu.edu.sg/dataset/actionRecognition/). You can also use your own relational videos; however, they should follow the structure of our provided examples, such as the "shaking hands" action videos and their captions located in `videos/NTU_RGB_D/A58-shaking_hands`.
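Before preprocessing, it can help to verify that every video has a caption. The helper below is a hypothetical sketch that assumes each video is paired with a same-stem `.txt` caption file, as in the provided example folder; it lists videos whose caption is missing.

```python
from pathlib import Path

def unpaired_videos(folder: str, video_ext: str = ".mp4") -> list[str]:
    """Names of videos in `folder` that lack a same-stem caption .txt file."""
    root = Path(folder)
    return [v.name for v in sorted(root.glob(f"*{video_ext}"))
            if not v.with_suffix(".txt").exists()]
```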

2. **Preprocess Videos**

   First, install `bc` if you don't have it:

   ```bash
   sudo apt-get update
   sudo apt-get install bc
   ```

   Then, preprocess your videos using the script:

   ```bash
   bash demos/fine_tuner/preprocess.bash \
       -v videos/NTU_RGB_D/A58-shaking_hands/ \
       -o videos_prepared/NTU_RGB_D/A58-shaking_hands/ \
       -w pretrain_models_ckpt/mochi-1-preview \
       --num_frames 61
   ```

   You can adjust the `--num_frames` flag to specify the number of frames to extract from each video.

3. **Prepare Video Masks**

   Generate masks for your videos using either [Grounded-SAM](https://github.com/IDEA-Research/Grounded-Segment-Anything) or [SAM 2](https://github.com/facebookresearch/sam2).

   We provide example masks for the videos in `videos/NTU_RGB_D/A58-shaking_hands_masks`.
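Conceptually, the masks let training focus on relational regions rather than full frames. The snippet below is a generic mask-weighted loss shown for illustration only; it is not the repository's hybrid mask training objective.

```python
import torch

def masked_mse(pred: torch.Tensor, target: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Mean squared error counted only where mask == 1."""
    se = (pred - target) ** 2 * mask
    return se.sum() / mask.sum().clamp(min=1)
```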

4. **Start Training**

   ```bash
   CUDA_VISIBLE_DEVICES=0 COMPILE_DIT=1 python demos/fine_tuner/train.py --config-path demos/fine_tuner/configs/shaking_hands/example.yaml
   ```

**Training Tips:**
* Generally, 1,800 to 2,400 training steps are sufficient; however, we recommend adjusting the number of steps per relation to achieve optimal results.
* Default hyperparameters are in `demos/fine_tuner/configs/high-five/example.yaml`. For different relations, adjust `total_positive_nums` and `total_negative_nums` in relational contrastive learning for better results.
* A GPU with 80 GB of VRAM is recommended for training.
* The current code primarily supports training with a batch size of 1. If you intend to use a larger batch size, careful code review and modification will be necessary.
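In relational contrastive learning, `total_positive_nums` and `total_negative_nums` control how many positive and negative samples enter the loss. As a generic, heavily simplified illustration (not the paper's exact space-time relational contrastive loss), an InfoNCE-style objective over pooled features can be sketched as:

```python
import torch
import torch.nn.functional as F

def relational_contrastive(anchor, positives, negatives, temperature: float = 0.1):
    """InfoNCE-style sketch: each positive is contrasted against all negatives.

    anchor: (d,), positives: (P, d), negatives: (N, d).
    """
    a = F.normalize(anchor, dim=-1)
    pos_sim = F.normalize(positives, dim=-1) @ a / temperature  # (P,)
    neg_sim = F.normalize(negatives, dim=-1) @ a / temperature  # (N,)
    logits = torch.cat(
        [pos_sim.unsqueeze(1), neg_sim.unsqueeze(0).expand(pos_sim.shape[0], -1)],
        dim=1,
    )
    labels = torch.zeros(pos_sim.shape[0], dtype=torch.long)  # class 0 = the positive
    return F.cross_entropy(logits, labels)
```

Increasing the number of negatives makes the objective stricter about what counts as the same relation, which is one intuition for why these counts are worth tuning per relation.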

## Inference

To run inference using a trained LoRA model:

```bash
CUDA_VISIBLE_DEVICES=0 python demos/cli.py \
    --model_dir pretrain_models_ckpt/mochi-1-preview/ \
    --lora_path finetunes/example/NTU_RGB_D/A58-shaking_hands/model_2000.lora.pt \
    --num_frames 61 \
    --cpu_offload \
    --prompt "A bear is shaking hands with a raccoon in a meadow." \
    --seed 522 \
    --train_cfg_path demos/fine_tuner/configs/shaking_hands/example.yaml \
    --test_lora_names relation
```

## Acknowledgement

This code is built on top of [Mochi 1](https://github.com/genmoai/mochi). We thank the authors for their great work.

## Citation

If you find this code useful for your research, please cite our paper:

```bibtex
@article{wei2025dreamrelation,
  title={{DreamRelation}: Relation-centric video customization},
  author={Wei, Yujie and Zhang, Shiwei and Yuan, Hangjie and Gong, Biao and Tang, Longxiang and Wang, Xiang and Qiu, Haonan and Li, Hengjia and Tan, Shuai and Zhang, Yingya and others},
  journal={arXiv preprint arXiv:2503.07602},
  year={2025}
}
```
assets/dreamRelation-logo.png ADDED

Git LFS Details

  • SHA256: 081083e6a73c7be1bc402678d91456354b363c232c562e16a93ad5c043dc3a89
  • Pointer size: 131 Bytes
  • Size of remote file: 862 kB
examples/high-five/model_1800.lora.pt ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:44cec2be85a0cec35b0042e6aa50993b7c86e7444e7e2d960c9180616a466181
+size 82749830
examples/shaking_hands/model_2000.lora.pt ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:27aa938aafcc42d6fe8957e91c70828d603e22d72700d42c6ec3374e03d2d399
+size 82749830