<img src="./assets/dreamRelation-logo.png" alt="header" width="25%" style="display: block; margin: 0 auto;">

# [ICCV 2025] DreamRelation: Relation-Centric Video Customization

## Installation

### 1. Prepare Environment and Pretrained Models

Follow the official installation instructions for Mochi 1 to set up your environment and download the necessary pretrained models. You can find the guide here: [Mochi 1 Official Repository](https://github.com/genmoai/mochi?tab=readme-ov-file#installation).
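
A minimal setup sketch, assuming a plain `pip` editable install; the official guide linked above is authoritative and may use `uv` or additional build flags:

```bash
# Clone the Mochi 1 repository and install it into the current environment
git clone https://github.com/genmoai/mochi.git
cd mochi
pip install setuptools wheel
pip install -e . --no-build-isolation
```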

### 2. Place Mochi 1 Checkpoints

Move the downloaded Mochi 1 pretrained checkpoints into the `pretrained_models/mochi` directory.

### 3. Place T5 Checkpoints

Move the downloaded T5 pretrained checkpoints into the `pretrained_models/t5-v1_1-xxl` directory.
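
After these two steps, the checkpoints should sit in the directories above. The source paths below are placeholders; keep whatever filenames the official downloads produce:

```bash
mkdir -p pretrained_models/mochi pretrained_models/t5-v1_1-xxl
# Move the downloaded Mochi 1 weights and the T5 text-encoder files into place
mv /path/to/downloaded/mochi-1-preview/* pretrained_models/mochi/
mv /path/to/downloaded/t5-v1_1-xxl/* pretrained_models/t5-v1_1-xxl/
```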

## Examples

### 1. Download Pre-trained Models

We provide pre-trained LoRA models for the `high-five` and `shaking hands` relations to help you get started. You can download them from either ModelScope or Hugging Face Hub.

**Download from ModelScope**

```bash
pip install modelscope==1.23.0
modelscope download --model weilllllls/DreamRelation --local_dir checkpoints
```

**Download from Hugging Face**

```bash
pip install -U huggingface_hub
huggingface-cli download --resume-download weilllllls/DreamRelation --local-dir checkpoints
```

### 2. Run Example Command

```bash
# shaking hands
CUDA_VISIBLE_DEVICES=0 python demos/cli.py \
    --model_dir pretrain_models_ckpt/mochi-1-preview/ \
    --lora_path checkpoints/examples/shaking_hands/model_2000.lora.pt \
    --num_frames 61 \
    --cpu_offload \
    --prompt "A bear is shaking hands with a raccoon in a meadow." \
    --train_cfg_path demos/fine_tuner/configs/shaking_hands/example.yaml \
    --test_lora_names relation

# high-five
CUDA_VISIBLE_DEVICES=0 python demos/cli.py \
    --model_dir pretrain_models_ckpt/mochi-1-preview/ \
    --lora_path checkpoints/examples/high-five/model_1800.lora.pt \
    --num_frames 61 \
    --cpu_offload \
    --prompt "A real-world bear is high-fiving with a real-world raccoon in a serene forest clearing."
```

## Training

### 1. Prepare Dataset

The training videos are sourced from the [NTU RGB+D dataset](https://rose1.ntu.edu.sg/dataset/actionRecognition/). **Due to the dataset's usage policy, we cannot directly provide the NTU RGB+D relational videos.** Please request access to the NTU RGB+D dataset yourself and download the relevant relational videos. The names of all 26 relation types and their corresponding captions are provided in Table 8 of our paper.

Once the training videos are prepared, randomly select 20–30 videos for each relation and place them under `videos/NTU_RGB_D`.
For each video, also create a text file with the same name that contains the video's caption.
For example, for the `A58: shaking hands` video `S001C001P001R001A058_rgb.mp4`, place it under `videos/NTU_RGB_D/A58-shaking_hands` and create `S001C001P001R001A058_rgb.txt` containing the caption "A person is shaking hands with a person."
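
A minimal sketch of this layout for one clip (the source path is a placeholder; repeat for every selected video):

```bash
mkdir -p videos/NTU_RGB_D/A58-shaking_hands
# Copy a selected NTU RGB+D clip and write its caption file with the same basename
cp /path/to/NTU_RGB_D/S001C001P001R001A058_rgb.mp4 videos/NTU_RGB_D/A58-shaking_hands/
echo "A person is shaking hands with a person." > videos/NTU_RGB_D/A58-shaking_hands/S001C001P001R001A058_rgb.txt
```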

### 2. Preprocess Videos

First, install `bc` if you don't have it:
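
On Debian/Ubuntu, for example (other distributions have equivalent packages):

```bash
sudo apt-get update && sudo apt-get install -y bc
```

Then run the preprocessing script on each relation folder. The output directory below is a placeholder and the exact flags may differ; check `demos/fine_tuner/preprocess.bash` for the full list:

```bash
bash demos/fine_tuner/preprocess.bash -v videos/NTU_RGB_D/A58-shaking_hands/ -o <output_dir> --num_frames 61
```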

You can adjust the `--num_frames` flag to specify the number of frames to extract for each video.

### 3. Prepare Video Masks

Generate masks for your videos using either [Grounded-SAM](https://github.com/IDEA-Research/Grounded-Segment-Anything) or [SAM 2](https://github.com/facebookresearch/segment-anything).
We provide example masks for one video in `videos/NTU_RGB_D/A58-shaking_hands_masks`.

### 4. Start Training

```bash
CUDA_VISIBLE_DEVICES=0 COMPILE_DIT=1 python demos/fine_tuner/train.py --config-path demos/fine_tuner/configs/shaking_hands/example.yaml
```

**Training Tips:**
* Generally, 1800–2400 training steps are sufficient. However, we recommend adjusting the number of training steps per relation to achieve optimal results.
* Default hyperparameters are in `demos/fine_tuner/configs/high-five/example.yaml`. For different relations, adjust `total_positive_nums` and `total_negative_nums` in relational contrastive learning for better results.
* A GPU with 80 GB of VRAM is recommended for training. If you have limited VRAM, you can reduce the number of frames in the `Preprocess Videos` step. The default number of frames for training is 61.
* The current code primarily supports training with a batch size of 1. If you intend to use a batch size greater than 1, careful code review and modifications will be necessary.

If you find this code useful for your research, please cite our paper:

```bibtex
  journal={arXiv preprint arXiv:2503.07602},
  year={2025}
}
```