nielsr (HF Staff) committed
Commit 4e04f5e (verified) · Parent: 1a7670f

Add library name, pipeline tag and license


This PR adds the library name, pipeline tag and license for the VideoPainter model.
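In other words, the PR leaves the `base_model` list untouched and inserts three new metadata fields into the README's YAML front matter, moving the existing `language` field below them. The net addition (values taken directly from the diff) is:

```yaml
library_name: diffusers
license: mit
pipeline_tag: image-to-video
language:
- en
```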

Files changed (1)
  1. README.md (+8 −27)
README.md CHANGED
@@ -1,11 +1,14 @@
 ---
- language:
- - en
 base_model:
 - THUDM/CogVideoX-5b
 - THUDM/CogVideoX-5b-I2V
 - THUDM/CogVideoX1.5-5B
 - THUDM/CogVideoX1.5-5B-I2V
+ library_name: diffusers
+ license: mit
+ pipeline_tag: image-to-video
+ language:
+ - en
 tags:
 - video
 - video inpainting
@@ -21,8 +24,6 @@ Keywords: Video Inpainting, Video Editing, Video Generation
 > [Yuxuan Bian](https://yxbian23.github.io/)<sup>12</sup>, [Zhaoyang Zhang](https://zzyfd.github.io/#/)<sup>1‑</sup>, [Xuan Ju](https://juxuan27.github.io/)<sup>2</sup>, [Mingdeng Cao](https://openreview.net/profile?id=~Mingdeng_Cao1)<sup>3</sup>, [Liangbin Xie](https://liangbinxie.github.io/)<sup>4</sup>, [Ying Shan](https://www.linkedin.com/in/YingShanProfile/)<sup>1</sup>, [Qiang Xu](https://cure-lab.github.io/)<sup>2βœ‰</sup><br>
 > <sup>1</sup>ARC Lab, Tencent PCG <sup>2</sup>The Chinese University of Hong Kong <sup>3</sup>The University of Tokyo <sup>4</sup>University of Macau <sup>‑</sup>Project Lead <sup>βœ‰</sup>Corresponding Author
 
-
-
 <p align="center">
 <a href="https://yxbian23.github.io/project/video-painter">🌐Project Page</a> |
 <a href="https://arxiv.org/abs/2503.05639">πŸ“œArxiv</a> |
@@ -31,10 +32,8 @@ Keywords: Video Inpainting, Video Editing, Video Generation
 <a href="https://huggingface.co/TencentARC/VideoPainter">πŸ€—Hugging Face Model</a> |
 </p>
 
-
 **πŸ“– Table of Contents**
 
-
 - [VideoPainter](#videopainter)
 - [πŸ”₯ Update Log](#-update-log)
 - [πŸ“Œ TODO](#todo)
@@ -49,8 +48,6 @@ Keywords: Video Inpainting, Video Editing, Video Generation
 - [🀝🏼 Cite Us](#-cite-us)
 - [πŸ’– Acknowledgement](#-acknowledgement)
 
-
-
 ## πŸ”₯ Update Log
 - [2025/3/09] πŸ“’ πŸ“’ [VideoPainter](https://huggingface.co/TencentARC/VideoPainter) are released, an efficient, any-length video inpainting & editing framework with plug-and-play context control.
 - [2025/3/09] πŸ“’ πŸ“’ [VPData](https://huggingface.co/datasets/TencentARC/VPData) and [VPBench](https://huggingface.co/datasets/TencentARC/VPBench) are released, the largest video inpainting dataset with precise segmentation masks and dense video captions (>390K clips).
@@ -68,13 +65,10 @@ Keywords: Video Inpainting, Video Editing, Video Generation
 We propose a novel dual-stream paradigm VideoPainter that incorporates an efficient context encoder (comprising only 6\% of the backbone parameters) to process masked videos and inject backbone-aware background contextual cues to any pre-trained video DiT, producing semantically consistent content in a plug-and-play manner. This architectural separation significantly reduces the model's learning complexity while enabling nuanced integration of crucial background context. We also introduce a novel target region ID resampling technique that enables any-length video inpainting, greatly enhancing our practical applicability. Additionally, we establish a scalable dataset pipeline leveraging current vision understanding models, contributing VPData and VPBench to facilitate segmentation-based inpainting training and assessment, the largest video inpainting dataset and benchmark to date with over 390K diverse clips. Using inpainting as a pipeline basis, we also explore downstream applications including video editing and video editing pair data generation, demonstrating competitive performance and significant practical potential.
 ![](assets/method.jpg)
 
-
-
 ## πŸš€ Getting Started
 
 ### Environment Requirement 🌍
 
-
 Clone the repo:
 
 ```
@@ -83,7 +77,6 @@ git clone https://github.com/TencentARC/VideoPainter.git
 
 We recommend you first use `conda` to create virtual environment, and install needed libraries. For example:
 
-
 ```
 conda create -n videopainter python=3.10 -y
 conda activate videopainter
@@ -112,7 +105,6 @@ pip install -e .
 
 ### Data Download ⬇️
 
-
 **VPBench and VPData**
 
 You can download the VPBench [here](https://huggingface.co/datasets/TencentARC/VPBench), and the VPData [here](https://huggingface.co/datasets/TencentARC/VPData) (as well as the Davis we re-processed), which are used for training and testing the BrushNet. By downloading the data, you are agreeing to the terms and conditions of the license. The data structure should be like:
@@ -186,7 +178,6 @@ cd data_utils
 python VPData_download.py
 ```
 
-
 **Checkpoints**
 
 Checkpoints of VideoPainter can be downloaded from [here](https://huggingface.co/TencentARC/VideoPainter). The ckpt folder contains
@@ -217,7 +208,6 @@ git clone https://huggingface.co/black-forest-labs/FLUX.1-Fill-dev
 mv ckpt/FLUX.1-Fill-dev ckpt/flux_inp
 ```
 
-
 The ckpt structure should be like:
 
 ```
@@ -240,10 +230,8 @@ The ckpt structure should be like:
 |-- ...
 ```
 
-
 ## πŸƒπŸΌ Running Scripts
 
-
 ### Training 🀯
 
 You can train the VideoPainter using the script:
@@ -262,7 +250,7 @@ export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
 export TOKENIZERS_PARALLELISM=false
 export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
 
- accelerate launch --config_file accelerate_config_machine_single_ds.yaml --machine_rank 0 \
+ accelerate launch --config_file accelerate_config_machine_single_ds.yaml --machine-rank 0 \
 train_cogvideox_inpainting_i2v_video.py \
 --pretrained_model_name_or_path $MODEL_PATH \
 --cache_dir $CACHE_PATH \
@@ -329,7 +317,7 @@ export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
 export TOKENIZERS_PARALLELISM=false
 export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
 
- accelerate launch --config_file accelerate_config_machine_single_ds_wo_cpu.yaml --machine_rank 0 \
+ accelerate launch --config_file accelerate_config_machine_single_ds_wo_cpu.yaml --machine-rank 0 \
 train_cogvideox_inpainting_i2v_video_resample.py \
 --pretrained_model_name_or_path $MODEL_PATH \
 --cogvideox_branch_name_or_path $BRANCH_MODEL_PATH \
@@ -388,9 +376,6 @@ accelerate launch --config_file accelerate_config_machine_single_ds_wo_cpu.yaml
 --id_pool_resample_learnable
 ```
 
-
-
-
 ### Inference πŸ“œ
 
 You can inference for the video inpainting or editing with the script:
@@ -412,7 +397,6 @@ bash edit_bench.sh
 
 Since VideoPainter is trained on public Internet videos, it primarily performs well on general scenarios. For high-quality industrial applications (e.g., product exhibitions, virtual try-on), we recommend training the model on your domain-specific data. We welcome and appreciate any contributions of trained models from the community!
 
-
 You can also inference through gradio demo:
 
 ```
@@ -424,7 +408,6 @@ CUDA_VISIBLE_DEVICES=0 python app.py \
 --img_inpainting_model ../ckpt/flux_inp
 ```
 
-
 ### Evaluation πŸ“
 
 You can evaluate using the script:
@@ -441,7 +424,6 @@ bash eval_edit.sh
 bash eval_editing_id_resample.sh
 ```
 
-
 ## 🀝🏼 Cite Us
 
 ```
@@ -456,8 +438,7 @@ bash eval_editing_id_resample.sh
 }
 ```
 
-
 ## πŸ’– Acknowledgement
 <span id="acknowledgement"></span>
 
- Our code is modified based on [diffusers](https://github.com/huggingface/diffusers) and [CogVideoX](https://github.com/THUDM/CogVideo), thanks to all the contributors!
+ Our code is modified based on [diffusers](https://github.com/huggingface/diffusers) and [CogVideoX](https://github.com/THUDM/CogVideo), thanks to all the contributors!