Commit 869a3ef (verified) by SsharvienKumar · 1 parent: 547acdb

Update README.md

  1. README.md +117 -1
README.md CHANGED
@@ -1,3 +1,119 @@
  ---
  license: cc-by-4.0
- ---
+ ---
+ <div id="top" align="center">
+
+ # SG2VID: Scene Graphs Enable Fine-Grained Control for Video Synthesis (MICCAI 2025 - ORAL)
+ Ssharvien Kumar Sivakumar, Yannik Frisch, Ghazal Ghazaei, Anirban Mukhopadhyay
+
+ [![arXiv](https://img.shields.io/badge/arXiv-2506.03082-b31b1b.svg)](https://arxiv.org/abs/2506.03082)
+ [![Homepage](https://img.shields.io/badge/Homepage-Visit-blue)](https://ssharvienkumar.github.io/SG2VID/)
+ [![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/SsharvienKumar/SG2VID)
+
+ </div>
+
+ ## 💡 Key Features
+ - First diffusion-based video model that leverages Scene Graphs for both precise video synthesis and fine-grained human control.
+ - Outperforms previous methods both qualitatively and quantitatively. It also enables precise synthesis, with accurate control over the size and movement of tools and anatomy, the entrance of new tools, and the overall scene layout.
+ - We qualitatively motivate how SG2VID can be used for generative augmentation and present an experiment demonstrating its ability to improve a downstream phase detection task.
+ - We showcase SG2VID's ability to retain human control by interacting with the Scene Graphs to generate new video samples depicting major yet rare intra-operative irregularities.
+
+ ***This framework provides training scripts for the video diffusion model, supporting both unconditional and conditional training using signals such as the initial frame, scene graph, and text. Feel free to use our work for comparisons and to cite it!***
+
+ ## 🛠 Setup
+ ```bash
+ git clone https://github.com/MECLabTUDA/SG2VID.git
+ cd SG2VID
+ conda env create -f environment.yaml
+ conda activate sg2vid
+ ```
+
+ ## 🏁 Model Checkpoints and Dataset
+ Download the checkpoints of all the necessary models from the provided sources and place them in [`checkpoints`](./checkpoints). We also provide the processed CATARACTS, Cataract-1K, and Cholec80 datasets, containing images, segmentation masks, and their scene graphs. Update the dataset paths in [`configs`](./configs).
+ - `Checkpoints`: [VAEs, Graph Encoders, Video Diffusion Models](https://huggingface.co/SsharvienKumar/SG2VID/tree/main/checkpoints)
+ - `Processed Dataset`: [Frames, Segmentation Masks, Scene Graphs](https://huggingface.co/SsharvienKumar/SG2VID/tree/main/datasets)
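
For scripted setups, the two folders linked above can also be fetched programmatically. The following is a minimal sketch, assuming the `huggingface_hub` package is installed. The helper name `fetch_sg2vid_assets` and its arguments are ours and not part of the SG2VID repository; it relies only on `snapshot_download` with `allow_patterns` to restrict the download to one top-level folder of the Hub repo.

```python
from pathlib import Path


def fetch_sg2vid_assets(subfolder: str = "checkpoints", target_dir: str = ".") -> Path:
    """Download one top-level folder of the SG2VID Hub repo into target_dir.

    Hypothetical helper: only "checkpoints" and "datasets" are accepted,
    because those are the two folders linked in the README.
    """
    if subfolder not in ("checkpoints", "datasets"):
        raise ValueError(f"unexpected subfolder: {subfolder!r}")

    # Imported lazily so this module still loads without huggingface_hub.
    from huggingface_hub import snapshot_download

    snapshot_download(
        repo_id="SsharvienKumar/SG2VID",
        allow_patterns=[f"{subfolder}/*"],  # skip everything outside this folder
        local_dir=target_dir,
    )
    return Path(target_dir) / subfolder
```

Calling `fetch_sg2vid_assets("checkpoints")` from the repository root would place the files under `./checkpoints`. Whether the downloaded layout matches what the configs expect should still be checked against the paths in `configs`.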
+
+
+ ## 💥 Sampling Videos with SG2VID
+ Conditioned on the initial frame and the scene graph:
+ ```bash
+ python sample.py --inference_config ./configs/inference/inference_img_graph_<dataset_name>.yaml
+ ```
+
+ Conditioned on the scene graph only:
+ ```bash
+ python sample.py --inference_config ./configs/inference/inference_ximg_graph_<dataset_name>.yaml
+ ```
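
The two commands above differ only in the config file name (`img_graph` versus `ximg_graph`). As a small sketch of that naming pattern, the helper below builds the `sample.py` invocation for either mode. The function names are hypothetical, and `mydataset` in the usage note is a placeholder, not one of the dataset names actually used in the repository's config files.

```python
from typing import List


def inference_config(dataset_name: str, use_initial_frame: bool = True,
                     config_dir: str = "./configs/inference") -> str:
    """Return the inference config path following the README's naming pattern."""
    # "img_graph": initial frame + scene graph; "ximg_graph": scene graph only.
    cond = "img_graph" if use_initial_frame else "ximg_graph"
    return f"{config_dir}/inference_{cond}_{dataset_name}.yaml"


def sample_command(dataset_name: str, use_initial_frame: bool = True) -> List[str]:
    """Build the argument list for the sampling command shown in the README."""
    return ["python", "sample.py", "--inference_config",
            inference_config(dataset_name, use_initial_frame)]
```

The resulting list can be passed to `subprocess.run(sample_command("mydataset"), check=True)` from the repository root, once `mydataset` is replaced by a dataset name for which a config file exists.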
+
+
+ ## ⏳ Training SG2VID
+ **Step 1:** Train Image VQGAN and Segmentation VQGAN (For Graph Encoders)
+ ```bash
+ python sg2vid/taming/main.py --base configs/vae/config_image_autoencoder_vqgan_<dataset_name>.yaml -t --gpus 0, --logdir checkpoints/<dataset_name>
+ python sg2vid/taming/main.py --base configs/vae/config_segmentation_autoencoder_vqgan_<dataset_name>.yaml -t --gpus 0, --logdir checkpoints/<dataset_name>
+ ```
+
+ **Step 2:** Train Another VAE (For Video Diffusion Model)
+ ```bash
+ python sg2vid/ldm/main.py --base configs/vae/config_autoencoderkl_<dataset_name>.yaml -t --gpus 0, --logdir checkpoints/<dataset_name>
+
+ # Converting a CompVis VAE to Diffusers VAE Format
+ # IMPORTANT: First update Diffusers to version 0.31.0, then downgrade back to 0.21.2
+ python scripts/ae_compvis_to_diffuser.py \
+     --vae_pt_path /path/to/checkpoints/last.ckpt \
+     --dump_path /path/to/save/vae_vid_diffusion
+ ```
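
The version note in the comment above is easy to miss. As a small sketch, the check below compares a Diffusers version string against the two versions quoted there. The stage names, and the mapping of 0.31.0 to the conversion step and 0.21.2 to everything else, are our reading of that comment rather than something the repository states explicitly.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional

# Versions quoted in the README comment; the stage labels are hypothetical.
EXPECTED_DIFFUSERS = {
    "convert_vae": "0.31.0",      # assumed: needed while running the conversion script
    "train_or_sample": "0.21.2",  # assumed: needed for the rest of the pipeline
}


def installed_diffusers() -> Optional[str]:
    """Return the installed Diffusers version, or None if it is not installed."""
    try:
        return version("diffusers")
    except PackageNotFoundError:
        return None


def diffusers_matches(stage: str, installed: Optional[str]) -> bool:
    """True if the given version string is the one expected for this stage."""
    return installed == EXPECTED_DIFFUSERS[stage]
```

Running `diffusers_matches("convert_vae", installed_diffusers())` before the conversion script, and the `"train_or_sample"` variant afterwards, gives an early warning if the environment was left on the wrong version.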
+
+ **Step 3:** Train Both Graph Encoders
+ ```bash
+ python train_graph.py --name masked --config configs/graph/graph_<dataset_name>.yaml
+ python train_graph.py --name segclip --config configs/graph/graph_<dataset_name>.yaml
+ ```
+
+ **Step 4:** Train Video Diffusion Model
+
+ Single-GPU Setup
+ ```bash
+ python train.py --config configs/training/training_<cond_type>_<dataset_name>.yaml -n sg2vid_training
+ ```
+
+ Multi-GPU Setup (Single Node)
+ ```bash
+ python -m torch.distributed.run \
+     --nproc_per_node=${GPU_PER_NODE} \
+     --master_addr=127.0.0.1 \
+     --master_port=29501 \
+     --nnodes=1 \
+     --node_rank=0 \
+     train.py \
+     --config configs/training/training_<cond_type>_<dataset_name>.yaml \
+     -n sg2vid_training
+ ```
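
The multi-GPU command above leaves `GPU_PER_NODE`, `<cond_type>`, and `<dataset_name>` to be filled in. The sketch below assembles the same single-node launch as an argument list so that these values are set in one place. The function name is hypothetical, and `mycond` and `mydataset` in the usage note are placeholders rather than actual config names from the repository.

```python
import sys
from typing import List


def multi_gpu_train_command(cond_type: str, dataset_name: str, gpus_per_node: int,
                            master_port: int = 29501,
                            run_name: str = "sg2vid_training") -> List[str]:
    """Build the single-node torch.distributed.run launch shown in the README."""
    if gpus_per_node < 1:
        raise ValueError("gpus_per_node must be at least 1")
    config = f"configs/training/training_{cond_type}_{dataset_name}.yaml"
    return [
        sys.executable, "-m", "torch.distributed.run",
        f"--nproc_per_node={gpus_per_node}",  # one worker process per GPU
        "--master_addr=127.0.0.1",            # single node, so rendezvous on localhost
        f"--master_port={master_port}",
        "--nnodes=1",
        "--node_rank=0",
        "train.py",
        "--config", config,
        "-n", run_name,
    ]
```

For example, `subprocess.run(multi_gpu_train_command("mycond", "mydataset", 4), check=True)` would start four workers on one machine, provided a matching config file exists. Changing `master_port` helps when two runs share a host.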
+
+
+ ## ⏳ Training Unconditional Video Diffusion Model
+ Single-GPU Setup
+ ```bash
+ python train.py --config configs/training/training_unconditional_<dataset_name>.yaml -n sg2vid_training
+ ```
+
+
+ ## 📜 Citations
+ If you use SG2VID in your paper, please cite the following:
+ ```
+ @article{sivakumar2025sg2vid,
+   title={SG2VID: Scene Graphs Enable Fine-Grained Control for Video Synthesis},
+   author={Sivakumar, Ssharvien Kumar and Frisch, Yannik and Ghazaei, Ghazal and Mukhopadhyay, Anirban},
+   journal={arXiv preprint arXiv:2506.03082},
+   year={2025}
+ }
+ ```
+
+ ## ⭐ Acknowledgement
+ Thanks to the following projects and theoretical works that we have either used or drawn inspiration from:
+ - [SurGrID](https://github.com/SsharvienKumar/SurGrID)
+ - [ConsistI2V](https://github.com/TIGER-AI-Lab/ConsistI2V)
+ - [VQGAN](https://github.com/CompVis/taming-transformers)
+ - [LDM](https://github.com/CompVis/latent-diffusion)
+ - [SGDiff](https://github.com/YangLing0818/SGDiff)
+ - [Endora's README](https://github.com/CUHK-AIM-Group/Endora)