fateforward committed · verified · 1 Parent(s): 768a7ae
Commit e97f66c

Update README.md

Files changed (1):
  1. README.md +62 -133

README.md CHANGED
@@ -1,163 +1,92 @@
- ---
- base_model:
- - THUDM/CogVideoX-5b
- - THUDM/CogVideoX1.5-5B-I2V
- datasets:
- - BestWishYsh/ConsisID-preview-Data
- language:
- - en
- library_name: diffusers
- license: apache-2.0
- pipeline_tag: text-to-video
- tags:
- - IPT2V
- base_model_relation: finetune
- ---
-
- <div align="center">
- <img src="https://github.com/PKU-YuanGroup/ConsisID/blob/main/asserts/ConsisID_logo.png?raw=true" width="150px">
  </div>
 
- <h1 align="center"> <a href="https://pku-yuangroup.github.io/ConsisID">[CVPR 2025 Highlight] Identity-Preserving Text-to-Video Generation by Frequency Decomposition</a></h1>
-
- <p style="text-align: center;">
- <a href="https://huggingface.co/spaces/BestWishYsh/ConsisID-preview-Space">🤗 Huggingface Space</a> |
- <a href="https://pku-yuangroup.github.io/ConsisID">📄 Page</a> |
- <a href="https://github.com/PKU-YuanGroup/ConsisID">🌐 Github</a> |
- <a href="https://arxiv.org/abs/2411.17440">📜 arXiv</a> |
- <a href="https://huggingface.co/datasets/BestWishYsh/ConsisID-preview-Data">🐳 Dataset</a>
- </p>
- <h5 align="center"> If you like our project, please give us a star ⭐ on GitHub for the latest updates. </h5>

- ## 😍 Gallery

- Identity-Preserving Text-to-Video Generation. (Some of the best prompts are available [here](https://github.com/PKU-YuanGroup/ConsisID/blob/main/asserts/prompt.xlsx).)
- [![Demo Video of ConsisID](https://github.com/user-attachments/assets/634248f6-1b54-4963-88d6-34fa7263750b)](https://www.youtube.com/watch?v=PhlgC-bI5SQ)
- Or click <a href="https://github.com/SHYuanBest/shyuanbest_media/raw/refs/heads/main/ConsisID/showcase_videos.mp4">here</a> to watch the video.

- ## 🤗 Quick Start
 
 
- This model can be deployed with the Hugging Face diffusers library by following the steps below.

- **We recommend visiting our [GitHub](https://github.com/PKU-YuanGroup/ConsisID) and checking out the relevant prompt optimizations and conversions for a better experience.**

- 1. Install the required dependencies

- ```shell
- # ConsisID will be merged into diffusers in the next release, so for now install diffusers from source.
- pip install --upgrade consisid_eva_clip pyfacer insightface facexlib transformers accelerate imageio-ffmpeg
- pip install git+https://github.com/huggingface/diffusers.git
- ```
 
- 2. Run the code

- ```python
- import torch
- from diffusers import ConsisIDPipeline
- from diffusers.pipelines.consisid.consisid_utils import prepare_face_models, process_face_embeddings_infer
- from diffusers.utils import export_to_video
- from huggingface_hub import snapshot_download
-
- snapshot_download(repo_id="BestWishYsh/ConsisID-preview", local_dir="BestWishYsh/ConsisID-preview")
- face_helper_1, face_helper_2, face_clip_model, face_main_model, eva_transform_mean, eva_transform_std = (
-     prepare_face_models("BestWishYsh/ConsisID-preview", device="cuda", dtype=torch.bfloat16)
- )
- pipe = ConsisIDPipeline.from_pretrained("BestWishYsh/ConsisID-preview", torch_dtype=torch.bfloat16)
- pipe.to("cuda")
-
- # ConsisID works well with long and well-described prompts. Make sure the face in the image is clearly visible (e.g., preferably half-body or full-body).
- prompt = "The video captures a boy walking along a city street, filmed in black and white on a classic 35mm camera. His expression is thoughtful, his brow slightly furrowed as if he's lost in contemplation. The film grain adds a textured, timeless quality to the image, evoking a sense of nostalgia. Around him, the cityscape is filled with vintage buildings, cobblestone sidewalks, and softly blurred figures passing by, their outlines faint and indistinct. Streetlights cast a gentle glow, while shadows play across the boy's path, adding depth to the scene. The lighting highlights the boy's subtle smile, hinting at a fleeting moment of curiosity. The overall cinematic atmosphere, complete with classic film still aesthetics and dramatic contrasts, gives the scene an evocative and introspective feel."
- image = "https://github.com/PKU-YuanGroup/ConsisID/blob/main/asserts/example_images/2.png?raw=true"
-
- id_cond, id_vit_hidden, image, face_kps = process_face_embeddings_infer(
-     face_helper_1,
-     face_clip_model,
-     face_helper_2,
-     eva_transform_mean,
-     eva_transform_std,
-     face_main_model,
-     "cuda",
-     torch.bfloat16,
-     image,
-     is_align_face=True,
- )
-
- video = pipe(
-     image=image,
-     prompt=prompt,
-     num_inference_steps=50,
-     guidance_scale=6.0,
-     use_dynamic_cfg=False,
-     id_vit_hidden=id_vit_hidden,
-     id_cond=id_cond,
-     kps_cond=face_kps,
-     generator=torch.Generator("cuda").manual_seed(42),
- )
- export_to_video(video.frames[0], "output.mp4", fps=8)
  ```
- ## 🛠️ Prompt Refiner

- ConsisID is sensitive to prompt quality. You can refine the input text prompt with [GPT-4o](https://chatgpt.com/). For example, starting from the original prompt "a man is playing guitar.":
  ```bash
- a man is playing guitar.
-
- Change the sentence above to something like this (add some facial changes, even if they are minor. Don't make the sentence too long):
-
- The video features a man standing next to an airplane, engaged in a conversation on his cell phone. He is wearing sunglasses and a black top, and he appears to be talking seriously. The airplane has a green stripe running along its side, and there is a large engine visible behind him. The man seems to be standing near the entrance of the airplane, possibly preparing to board or just having disembarked. The setting suggests that he might be at an airport or a private airfield. The overall atmosphere of the video is professional and focused, with the man's attire and the presence of the airplane indicating a business or travel context.
  ```

- Some sample prompts are available [here](https://github.com/PKU-YuanGroup/ConsisID/blob/main/asserts/prompt.xlsx).
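The refinement request above can be wrapped in a small helper that builds the text to paste into an LLM. This is only a sketch, not part of the ConsisID codebase; the function name `refinement_request` is hypothetical, and the instruction string is copied from the example above.

```python
# Rewrite instruction taken verbatim from the example above; any chat LLM
# (e.g., GPT-4o) receives the short prompt followed by this request.
REFINE_INSTRUCTION = (
    "Change the sentence above to something like this (add some facial "
    "changes, even if they are minor. Don't make the sentence too long):"
)

def refinement_request(short_prompt: str) -> str:
    """Build the text to paste into the LLM, as in the example above."""
    return f"{short_prompt}\n\n{REFINE_INSTRUCTION}"
```

The LLM's reply then replaces the short prompt as the input to the pipeline.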
- ### 💡 GPU Memory Optimization

- ConsisID requires about 44 GB of GPU memory to decode 49 frames (6 seconds of video at 8 FPS) at an output resolution of 720x480 (W x H), which makes it impossible to run on consumer GPUs or the free-tier T4 Colab. The following memory optimizations can be used to reduce the memory footprint; for replication, you can refer to [this](https://gist.github.com/SHYuanBest/bc4207c36f454f9e969adbb50eaf8258) script.

- | Feature (applied cumulatively) | Max Memory Allocated | Max Memory Reserved |
- | :----------------------------- | :------------------- | :------------------ |
- | -                              | 37 GB                | 44 GB               |
- | enable_model_cpu_offload       | 22 GB                | 25 GB               |
- | enable_sequential_cpu_offload  | 16 GB                | 22 GB               |
- | vae.enable_slicing             | 16 GB                | 22 GB               |
- | vae.enable_tiling              | 5 GB                 | 7 GB                |

  ```bash
- # Enable these if you don't have multiple GPUs or a large-memory GPU (such as an H100)
- pipe.enable_model_cpu_offload()
- pipe.enable_sequential_cpu_offload()
- pipe.vae.enable_slicing()
- pipe.vae.enable_tiling()
  ```
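As a rough guide, the "Max Memory Reserved" figures in the table above can drive a simple helper that decides which optimizations to enable for a given memory budget. This is a sketch, not part of the released code; the thresholds are copied from the table, not measured on your hardware.

```python
# Cumulative optimizations and the approximate "Max Memory Reserved" (GB)
# after each one is enabled on top of the previous, from the table above.
OPTIMIZATIONS = [
    ("pipe.enable_model_cpu_offload()", 25),
    ("pipe.enable_sequential_cpu_offload()", 22),
    ("pipe.vae.enable_slicing()", 22),
    ("pipe.vae.enable_tiling()", 7),
]

BASELINE_RESERVED_GB = 44  # reserved memory with no optimizations

def optimizations_to_enable(free_gb: float) -> list[str]:
    """Return the cumulative list of calls likely needed to fit in free_gb."""
    if free_gb >= BASELINE_RESERVED_GB:
        return []
    enabled = []
    for call, reserved_gb in OPTIMIZATIONS:
        enabled.append(call)
        if reserved_gb <= free_gb:
            return enabled
    return enabled  # even all optimizations may not be enough
```

For example, with roughly 30 GB free only `enable_model_cpu_offload` should be needed, while a 10 GB budget calls for all four.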
- Warning: these optimizations increase inference time and may also reduce output quality.

- ## 🙌 Description

- - **Repository:** [Code](https://github.com/PKU-YuanGroup/ConsisID), [Page](https://pku-yuangroup.github.io/ConsisID/), [Data](https://huggingface.co/datasets/BestWishYsh/ConsisID-preview-Data)
- - **Paper:** [https://huggingface.co/papers/2411.17440](https://huggingface.co/papers/2411.17440)
- - **Point of Contact:** [Shenghai Yuan](shyuan-cs@hotmail.com)

- ## ✏️ Citation
- If you find our paper and code useful in your research, please consider giving a star and a citation.

- ```BibTeX
- @inproceedings{yuan2025identity,
-   title={Identity-preserving text-to-video generation by frequency decomposition},
-   author={Yuan, Shenghai and Huang, Jinfa and He, Xianyi and Ge, Yunyang and Shi, Yujun and Chen, Liuhan and Luo, Jiebo and Yuan, Li},
-   booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
-   pages={12978--12988},
    year={2025}
  }
  ```

- ## 🤝 Contributors

- <a href="https://github.com/PKU-YuanGroup/ConsisID/graphs/contributors">
-   <img src="https://contrib.rocks/image?repo=PKU-YuanGroup/ConsisID&anon=true" />
- </a>
+ <div align="center">
+ <h1> Proteus-ID </h1>
+ <h3> Proteus-ID: ID-Consistent and Motion-Coherent Video Customization </h3>
+ <div align="center">
  </div>

+ [![Project Website](https://img.shields.io/badge/Project-Website-blue)](https://grenoble-zhang.github.io/Proteus-ID/)&nbsp;
+ [![arXiv](https://img.shields.io/badge/arXiv-2506.23729-b31b1b.svg)](https://arxiv.org/abs/2506.23729)&nbsp;
+ </div>

+ Authors: [Guiyu Zhang](https://grenoble-zhang.github.io/)<sup>1</sup>, [Chen Shi](https://scholar.google.com.hk/citations?user=o-K_AoYAAAAJ&hl=en)<sup>1</sup>, Zijian Jiang<sup>1</sup>, Xunzhi Xiang<sup>2</sup>, Jingjing Qian<sup>1</sup>, [Shaoshuai Shi](https://shishaoshuai.com/)<sup>3</sup>, [Li Jiang†](https://llijiang.github.io/)<sup>1</sup>

+ <sup>1</sup> The Chinese University of Hong Kong, Shenzhen&emsp;<sup>2</sup> Nanjing University&emsp;
+ <sup>3</sup> Voyager Research, Didi Chuxing
+ ## TODO

+ - [x] Release arXiv technical report
+ - [x] Release full code
+ - [ ] Release dataset (coming soon)
 
+ ## 🛠️ Requirements and Installation
+ ### Environment

+ ```bash
+ # 1. Clone the repo
+ git clone --depth=1 https://github.com/grenoble-zhang/Proteus-ID.git
+ cd Proteus-ID

+ # 2. Create a conda environment
+ conda create -n proteusid python=3.11.0
+ conda activate proteusid

+ # 3. Install PyTorch (CUDA 12.6)
+ pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu126

+ # 4. Install the remaining pip dependencies
+ pip install -r requirements.txt
  ```
+ ### Download Model

  ```bash
+ cd util
+ python download_weights.py
+ python down_raft.py
  ```

+ Once the downloads finish, the weights are organized as follows:
+ ```
+ 🔦 ckpts/
+ ├── 📂 face_encoder/
+ ├── 📂 scheduler/
+ ├── 📂 text_encoder/
+ ├── 📂 tokenizer/
+ ├── 📂 transformer/
+ ├── 📂 vae/
+ ├── 📄 configuration.json
+ └── 📄 model_index.json
+ ```
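Before training or inference, it can help to verify the downloaded weights against the layout above. Below is a minimal sketch, not part of the repository; the directory and file names are taken verbatim from the tree above, and `missing_weights` is a hypothetical helper.

```python
from pathlib import Path

# Expected checkpoint layout, copied from the tree above.
EXPECTED_DIRS = ["face_encoder", "scheduler", "text_encoder",
                 "tokenizer", "transformer", "vae"]
EXPECTED_FILES = ["configuration.json", "model_index.json"]

def missing_weights(root: str = "ckpts") -> list:
    """Return expected entries absent under root; an empty list means complete."""
    base = Path(root)
    missing = [d for d in EXPECTED_DIRS if not (base / d).is_dir()]
    missing += [f for f in EXPECTED_FILES if not (base / f).is_file()]
    return missing
```

Running `missing_weights()` after `download_weights.py` should return an empty list when everything downloaded correctly.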
+ ## 🏋️ Training

  ```bash
+ # Single-rank training
+ bash train_single_rank.sh
+ # Multi-rank training
+ bash train_multi_rank.sh
  ```
+ ## 🏄️ Inference

+ ```bash
+ python inference.py --img_file_path assets/example_images/1.png --json_file_path assets/example_images/1.json
+ ```
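To run inference over several examples, the `assets/example_images` convention above (a `.png` image paired with a `.json` of the same stem) suggests a small helper for collecting pairs. This is a sketch under that assumption; `paired_examples` is not a function in the repository.

```python
from pathlib import Path

def paired_examples(asset_dir: str = "assets/example_images") -> list:
    """Collect (image, json) pairs sharing a stem, e.g. 1.png with 1.json."""
    pairs = []
    for img in sorted(Path(asset_dir).glob("*.png")):
        meta = img.with_suffix(".json")  # same stem, .json extension
        if meta.is_file():
            pairs.append((img, meta))
    return pairs
```

Each pair can then be passed to `inference.py` via `--img_file_path` and `--json_file_path`, for example in a shell loop or `subprocess.run` call.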
+ ## BibTeX
+ If you find our work useful in your research, please consider citing our paper:
+ ```bibtex
+ @article{zhang2025proteus,
+   title={Proteus-ID: ID-Consistent and Motion-Coherent Video Customization},
+   author={Zhang, Guiyu and Shi, Chen and Jiang, Zijian and Xiang, Xunzhi and Qian, Jingjing and Shi, Shaoshuai and Jiang, Li},
+   journal={arXiv preprint arXiv:2506.23729},
    year={2025}
  }
  ```
+ ## Acknowledgement

+ Thanks to these excellent open-source works and models: [CogVideoX](https://github.com/THUDM/CogVideo); [ConsisID](https://github.com/PKU-YuanGroup/ConsisID); [diffusers](https://github.com/huggingface/diffusers).