<div align="center">

<h1>
<img src="assets/Stand-In.png" width="85" alt="Logo" valign="middle">
Stand-In
</h1>

<h3>A Lightweight and Plug-and-Play Identity Control for Video Generation</h3>

[![arXiv](https://img.shields.io/badge/arXiv-2508.07901-b31b1b)](https://arxiv.org/abs/2508.07901)
[![Project Page](https://img.shields.io/badge/Project_Page-Link-green)](https://www.stand-in.tech)
[![🤗 HuggingFace](https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-Model-orange)](https://huggingface.co/BowenXue/Stand-In)

</div>

<img width="5333" height="2983" alt="Image" src="https://github.com/user-attachments/assets/2fe1e505-bcf7-4eb6-8628-f23e70020966" />

> **Stand-In** is a lightweight, plug-and-play framework for identity-preserving video generation. By training only **1%** additional parameters compared to the base video generation model, we achieve state-of-the-art results in both Face Similarity and Naturalness, outperforming various full-parameter training methods. Moreover, **Stand-In** can be seamlessly integrated into other tasks such as subject-driven video generation, pose-controlled video generation, video stylization, and face swapping.

---
## 🔥 News
* **[2025.08.09]** Released Stand-In v1.0 (153M); the weights adapted to Wan2.1-14B-T2V and the inference code are now open-source.

---

## 🌟 Showcase

### Identity-Preserving Text-to-Video Generation

| Reference Image | Prompt | Generated Video |
| :---: | :---: | :---: |
|![Image](https://github.com/user-attachments/assets/86ce50d7-8ccb-45bf-9538-aea7f167a541)| "In a corridor where the walls ripple like water, a woman reaches out to touch the flowing surface, causing circles of ripples to spread. The camera moves from a medium shot to a close-up, capturing her curious expression as she sees her distorted reflection." |![Image](https://github.com/user-attachments/assets/c3c80bbf-a1cc-46a1-b47b-1b28bcad34a3)|
|![Image](https://github.com/user-attachments/assets/de10285e-7983-42bb-8534-80ac02210172)| "A young man dressed in traditional attire draws the long sword from his waist and begins to wield it. The blade flashes with light as he moves—his eyes sharp, his actions swift and powerful, with his flowing robes dancing in the wind." |![Image](https://github.com/user-attachments/assets/1532c701-ef01-47be-86da-d33c8c6894ab)|

---

### Non-Human Subject-Preserving Video Generation

| Reference Image | Prompt | Generated Video |
| :---: | :---: | :---: |
|<img width="415" height="415" alt="Image" src="https://github.com/user-attachments/assets/b929444d-d724-4cf9-b422-be82b380ff78" />|"A chibi-style boy speeding on a skateboard, holding a detective novel in one hand. The background features city streets, with trees, streetlights, and billboards along the roads."|![Image](https://github.com/user-attachments/assets/a7239232-77bc-478b-a0d9-ecc77db97aa5)|

---

### Identity-Preserving Stylized Video Generation

| Reference Image | LoRA | Generated Video |
| :---: | :---: | :---: |
|![Image](https://github.com/user-attachments/assets/9c0687f9-e465-4bc5-bc62-8ac46d5f38b1)|Ghibli LoRA|![Image](https://github.com/user-attachments/assets/c6ca1858-de39-4fff-825a-26e6d04e695f)|

---

### Video Face Swapping

| Reference Video | Identity | Generated Video |
| :---: | :---: | :---: |
|![Image](https://github.com/user-attachments/assets/33370ac7-364a-4f97-8ba9-14e1009cd701)|<img width="415" height="415" alt="Image" src="https://github.com/user-attachments/assets/d2cd8da0-7aa0-4ee4-a61d-b52718c33756" />|![Image](https://github.com/user-attachments/assets/0db8aedd-411f-414a-9227-88f4e4050b50)|

---

### Pose-Guided Video Generation (with VACE)

| Reference Pose | First Frame | Generated Video |
| :---: | :---: | :---: |
|![Image](https://github.com/user-attachments/assets/5df5eec8-b71c-4270-8a78-906a488f9a94)|<img width="719" height="415" alt="Image" src="https://github.com/user-attachments/assets/1c2a69e1-e530-4164-848b-e7ea85a99763" />|![Image](https://github.com/user-attachments/assets/1c8a54da-01d6-43c1-a5fd-cab0c9e32c44)|

---

### For more results, please visit the [project page](https://www.stand-in.tech)

## 📖 Key Features
- Efficient Training: Only 1% of the base model parameters need to be trained.
- High Fidelity: Outstanding identity consistency without sacrificing video generation quality.
- Plug-and-Play: Easily integrates into existing T2V (Text-to-Video) models.
- Highly Extensible: Compatible with community models such as LoRA, and supports various downstream video tasks.

---

## ✅ Todo List
- [x] Release IP2V inference script (compatible with community LoRA).
- [x] Open-source model weights compatible with Wan2.1-14B-T2V: `Stand-In_Wan2.1-T2V-14B_153M_v1.0`.
- [ ] Open-source model weights compatible with Wan2.2-T2V-A14B.
- [ ] Release training dataset, data preprocessing scripts, and training code.

---

## 🚀 Quick Start

### 1. Environment Setup
```bash
# Clone the project repository
git clone https://github.com/WeChatCV/Stand-In.git
cd Stand-In

# Create and activate a Conda environment
conda create -n Stand-In python=3.11 -y
conda activate Stand-In

# Install dependencies
pip install -r requirements.txt

# (Optional) Install Flash Attention for faster inference
# Note: make sure your GPU and CUDA version are compatible with Flash Attention
pip install flash-attn --no-build-isolation
```
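
Before downloading weights, two optional one-liners (not part of the repository's scripts) can confirm that PyTorch sees your GPU and that Flash Attention, if installed, imports cleanly:

```bash
# Optional sanity check: confirm PyTorch was built with CUDA support and sees a GPU.
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"

# If you installed flash-attn, confirm it imports without errors.
python -c "import flash_attn; print(flash_attn.__version__)"
```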

### 2. Model Download
We provide an automatic download script that fetches all required model weights into the `checkpoints` directory.
```bash
python download_models.py
```
This script downloads the following models:
* `wan2.1-T2V-14B` (base text-to-video model)
* `antelopev2` (face recognition model)
* `Stand-In` (our Stand-In model)

> Note: If you already have the `wan2.1-T2V-14B` model locally, you can edit the `download_models.py` script to comment out the relevant download code and place the model in the `checkpoints/wan2.1-T2V-14B` directory.
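
For example, if the base model already lives elsewhere on disk, linking it into place avoids a second multi-gigabyte download (the source path below is illustrative):

```bash
# Reuse an existing local copy of Wan2.1-T2V-14B instead of re-downloading it.
# "/data/models/Wan2.1-T2V-14B" is an example path; substitute your own.
mkdir -p checkpoints
ln -s /data/models/Wan2.1-T2V-14B checkpoints/wan2.1-T2V-14B
```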

---

## 🧪 Usage

### Standard Inference

Use the `infer.py` script for standard identity-preserving text-to-video generation.

```bash
python infer.py \
    --prompt "A man sits comfortably at a desk, facing the camera as if talking to a friend or family member on the screen. His gaze is focused and gentle, with a natural smile. The background is his carefully decorated personal space, with photos and a world map on the wall, conveying a sense of intimate and modern communication." \
    --ip_image "test/input/lecun.jpg" \
    --output "test/output/lecun.mp4"
```
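
To run the same prompt over several reference faces, a plain shell loop over the flags documented above is enough; the glob pattern and output naming below are just one way to do it:

```bash
# Generate one video per reference image in test/input/ (example paths).
# Each output file is named after its input image.
for face in test/input/*.jpg; do
    python infer.py \
        --prompt "A man sits comfortably at a desk, facing the camera with a natural smile." \
        --ip_image "$face" \
        --output "test/output/$(basename "$face" .jpg).mp4"
done
```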

### Inference with Community LoRA

Use the `infer_with_lora.py` script to load one or more community LoRA models alongside Stand-In.

```bash
python infer_with_lora.py \
    --prompt "A man sits comfortably at a desk, facing the camera as if talking to a friend or family member on the screen. His gaze is focused and gentle, with a natural smile. The background is his carefully decorated personal space, with photos and a world map on the wall, conveying a sense of intimate and modern communication." \
    --ip_image "test/input/lecun.jpg" \
    --output "test/output/lecun.mp4" \
    --lora_path "path/to/your/lora.safetensors" \
    --lora_scale 1.0
```

We recommend this stylization LoRA: [https://civitai.com/models/1404755/studio-ghibli-wan21-t2v-14b](https://civitai.com/models/1404755/studio-ghibli-wan21-t2v-14b)

---

## 🤝 Acknowledgements

This project is built upon the following excellent open-source projects:
* [DiffSynth-Studio](https://github.com/modelscope/DiffSynth-Studio) (training/inference framework)
* [Wan2.1](https://github.com/Wan-Video/Wan2.1) (base video generation model)

We sincerely thank the authors and contributors of these projects.

---

## ✏ Citation

If you find our work helpful for your research, please consider citing our paper:

```bibtex
@article{xue2025standin,
  title={Stand-In: A Lightweight and Plug-and-Play Identity Control for Video Generation},
  author={Xue, Bowen and Yan, Qixin and Wang, Wenjing and Liu, Hao and Li, Chen},
  journal={arXiv preprint arXiv:2508.07901},
  year={2025},
}
```

---

## 📬 Contact Us

If you have any questions or suggestions, feel free to reach out via [GitHub Issues](https://github.com/WeChatCV/Stand-In/issues). We look forward to your feedback!