Add library name, pipeline tag and license

#1
by nielsr (HF Staff) - opened

Files changed (1): README.md (+8 -27)
--- a/README.md
+++ b/README.md
@@ -1,11 +1,14 @@
 ---
-language:
-- en
 base_model:
 - THUDM/CogVideoX-5b
 - THUDM/CogVideoX-5b-I2V
 - THUDM/CogVideoX1.5-5B
 - THUDM/CogVideoX1.5-5B-I2V
+library_name: diffusers
+license: mit
+pipeline_tag: image-to-video
+language:
+- en
 tags:
 - video
 - video inpainting
@@ -21,8 +24,6 @@ Keywords: Video Inpainting, Video Editing, Video Generation
 > [Yuxuan Bian](https://yxbian23.github.io/)<sup>12</sup>, [Zhaoyang Zhang](https://zzyfd.github.io/#/)<sup>1‑</sup>, [Xuan Ju](https://juxuan27.github.io/)<sup>2</sup>, [Mingdeng Cao](https://openreview.net/profile?id=~Mingdeng_Cao1)<sup>3</sup>, [Liangbin Xie](https://liangbinxie.github.io/)<sup>4</sup>, [Ying Shan](https://www.linkedin.com/in/YingShanProfile/)<sup>1</sup>, [Qiang Xu](https://cure-lab.github.io/)<sup>2βœ‰</sup><br>
 > <sup>1</sup>ARC Lab, Tencent PCG <sup>2</sup>The Chinese University of Hong Kong <sup>3</sup>The University of Tokyo <sup>4</sup>University of Macau <sup>‑</sup>Project Lead <sup>βœ‰</sup>Corresponding Author
 
-
-
 <p align="center">
 <a href="https://yxbian23.github.io/project/video-painter">🌐Project Page</a> |
 <a href="https://arxiv.org/abs/2503.05639">πŸ“œArxiv</a> |
@@ -31,10 +32,8 @@ Keywords: Video Inpainting, Video Editing, Video Generation
 <a href="https://huggingface.co/TencentARC/VideoPainter">πŸ€—Hugging Face Model</a> |
 </p>
 
-
 **πŸ“– Table of Contents**
 
-
 - [VideoPainter](#videopainter)
 - [πŸ”₯ Update Log](#-update-log)
 - [πŸ“Œ TODO](#todo)
@@ -49,8 +48,6 @@ Keywords: Video Inpainting, Video Editing, Video Generation
 - [🀝🏼 Cite Us](#-cite-us)
 - [πŸ’– Acknowledgement](#-acknowledgement)
 
-
-
 ## πŸ”₯ Update Log
 - [2025/3/09] πŸ“’ πŸ“’ [VideoPainter](https://huggingface.co/TencentARC/VideoPainter) are released, an efficient, any-length video inpainting & editing framework with plug-and-play context control.
 - [2025/3/09] πŸ“’ πŸ“’ [VPData](https://huggingface.co/datasets/TencentARC/VPData) and [VPBench](https://huggingface.co/datasets/TencentARC/VPBench) are released, the largest video inpainting dataset with precise segmentation masks and dense video captions (>390K clips).
@@ -68,13 +65,10 @@ Keywords: Video Inpainting, Video Editing, Video Generation
 We propose a novel dual-stream paradigm VideoPainter that incorporates an efficient context encoder (comprising only 6\% of the backbone parameters) to process masked videos and inject backbone-aware background contextual cues to any pre-trained video DiT, producing semantically consistent content in a plug-and-play manner. This architectural separation significantly reduces the model's learning complexity while enabling nuanced integration of crucial background context. We also introduce a novel target region ID resampling technique that enables any-length video inpainting, greatly enhancing our practical applicability. Additionally, we establish a scalable dataset pipeline leveraging current vision understanding models, contributing VPData and VPBench to facilitate segmentation-based inpainting training and assessment, the largest video inpainting dataset and benchmark to date with over 390K diverse clips. Using inpainting as a pipeline basis, we also explore downstream applications including video editing and video editing pair data generation, demonstrating competitive performance and significant practical potential.
 ![](assets/method.jpg)
 
-
-
 ## πŸš€ Getting Started
 
 ### Environment Requirement 🌍
 
-
 Clone the repo:
 
 ```
@@ -83,7 +77,6 @@ git clone https://github.com/TencentARC/VideoPainter.git
 
 We recommend you first use `conda` to create virtual environment, and install needed libraries. For example:
 
-
 ```
 conda create -n videopainter python=3.10 -y
 conda activate videopainter
@@ -112,7 +105,6 @@ pip install -e .
 
 ### Data Download ⬇️
 
-
 **VPBench and VPData**
 
 You can download the VPBench [here](https://huggingface.co/datasets/TencentARC/VPBench), and the VPData [here](https://huggingface.co/datasets/TencentARC/VPData) (as well as the Davis we re-processed), which are used for training and testing the BrushNet. By downloading the data, you are agreeing to the terms and conditions of the license. The data structure should be like:
@@ -186,7 +178,6 @@ cd data_utils
 python VPData_download.py
 ```
 
-
 **Checkpoints**
 
 Checkpoints of VideoPainter can be downloaded from [here](https://huggingface.co/TencentARC/VideoPainter). The ckpt folder contains
@@ -217,7 +208,6 @@ git clone https://huggingface.co/black-forest-labs/FLUX.1-Fill-dev
 mv ckpt/FLUX.1-Fill-dev ckpt/flux_inp
 ```
 
-
 The ckpt structure should be like:
 
 ```
@@ -240,10 +230,8 @@ The ckpt structure should be like:
 |-- ...
 ```
 
-
 ## πŸƒπŸΌ Running Scripts
 
-
 ### Training 🀯
 
 You can train the VideoPainter using the script:
@@ -262,7 +250,7 @@ export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
 export TOKENIZERS_PARALLELISM=false
 export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
 
-accelerate launch --config_file accelerate_config_machine_single_ds.yaml --machine_rank 0 \
+accelerate launch --config_file accelerate_config_machine_single_ds.yaml --machine-rank 0 \
 train_cogvideox_inpainting_i2v_video.py \
 --pretrained_model_name_or_path $MODEL_PATH \
 --cache_dir $CACHE_PATH \
@@ -329,7 +317,7 @@ export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
 export TOKENIZERS_PARALLELISM=false
 export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
 
-accelerate launch --config_file accelerate_config_machine_single_ds_wo_cpu.yaml --machine_rank 0 \
+accelerate launch --config_file accelerate_config_machine_single_ds_wo_cpu.yaml --machine-rank 0 \
 train_cogvideox_inpainting_i2v_video_resample.py \
 --pretrained_model_name_or_path $MODEL_PATH \
 --cogvideox_branch_name_or_path $BRANCH_MODEL_PATH \
@@ -388,9 +376,6 @@ accelerate launch --config_file accelerate_config_machine_single_ds_wo_cpu.yaml
 --id_pool_resample_learnable
 ```
 
-
-
-
 ### Inference πŸ“œ
 
 You can inference for the video inpainting or editing with the script:
@@ -412,7 +397,6 @@ bash edit_bench.sh
 
 Since VideoPainter is trained on public Internet videos, it primarily performs well on general scenarios. For high-quality industrial applications (e.g., product exhibitions, virtual try-on), we recommend training the model on your domain-specific data. We welcome and appreciate any contributions of trained models from the community!
 
-
 You can also inference through gradio demo:
 
 ```
@@ -424,7 +408,6 @@ CUDA_VISIBLE_DEVICES=0 python app.py \
 --img_inpainting_model ../ckpt/flux_inp
 ```
 
-
 ### Evaluation πŸ“
 
 You can evaluate using the script:
@@ -441,7 +424,6 @@ bash eval_edit.sh
 bash eval_editing_id_resample.sh
 ```
 
-
 ## 🀝🏼 Cite Us
 
 ```
@@ -456,8 +438,7 @@ bash eval_editing_id_resample.sh
 }
 ```
 
-
 ## πŸ’– Acknowledgement
 <span id="acknowledgement"></span>
 
-Our code is modified based on [diffusers](https://github.com/huggingface/diffusers) and [CogVideoX](https://github.com/THUDM/CogVideo), thanks to all the contributors!
+Our code is modified based on [diffusers](https://github.com/huggingface/diffusers) and [CogVideoX](https://github.com/THUDM/CogVideo), thanks to all the contributors!
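The YAML front matter this PR touches can be sanity-checked locally before pushing. A minimal sketch, assuming PyYAML (`pip install pyyaml`) is available; the `readme` string below just inlines the metadata block as it stands after this diff:

```python
import yaml  # PyYAML, assumed installed; any YAML parser works

# The README front matter as it reads after this PR is applied.
readme = """\
---
base_model:
- THUDM/CogVideoX-5b
- THUDM/CogVideoX-5b-I2V
- THUDM/CogVideoX1.5-5B
- THUDM/CogVideoX1.5-5B-I2V
library_name: diffusers
license: mit
pipeline_tag: image-to-video
language:
- en
tags:
- video
- video inpainting
---
"""

# Extract the YAML between the first pair of '---' fences and parse it.
front_matter = readme.split("---")[1]
meta = yaml.safe_load(front_matter)

# The three fields this PR adds:
print(meta["library_name"], meta["license"], meta["pipeline_tag"])
```

The Hub reads these fields from the model card to populate the library badge, the task filter, and the license display, so a parse error or a misspelled key here silently degrades the model page rather than failing loudly.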