Add library name, pipeline tag and license
#1 · opened by nielsr (HF Staff)

README.md CHANGED
@@ -1,11 +1,14 @@
 ---
-language:
-- en
 base_model:
 - THUDM/CogVideoX-5b
 - THUDM/CogVideoX-5b-I2V
 - THUDM/CogVideoX1.5-5B
 - THUDM/CogVideoX1.5-5B-I2V
+library_name: diffusers
+license: mit
+pipeline_tag: image-to-video
+language:
+- en
 tags:
 - video
 - video inpainting
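
The added metadata drives Hub behavior: `library_name: diffusers` enables the diffusers code snippet on the model page, and `pipeline_tag: image-to-video` lists the repo under the image-to-video task filter. A quick way to confirm the tag is live after merging (a sketch against the public Hub API; the query parameters are standard, the search term is just an example):

```
# Should return the model once the PR is merged and indexed
curl -s "https://huggingface.co/api/models?pipeline_tag=image-to-video&search=VideoPainter"
```
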
@@ -21,8 +24,6 @@ Keywords: Video Inpainting, Video Editing, Video Generation
 > [Yuxuan Bian](https://yxbian23.github.io/)<sup>12</sup>, [Zhaoyang Zhang](https://zzyfd.github.io/#/)<sup>1‡</sup>, [Xuan Ju](https://juxuan27.github.io/)<sup>2</sup>, [Mingdeng Cao](https://openreview.net/profile?id=~Mingdeng_Cao1)<sup>3</sup>, [Liangbin Xie](https://liangbinxie.github.io/)<sup>4</sup>, [Ying Shan](https://www.linkedin.com/in/YingShanProfile/)<sup>1</sup>, [Qiang Xu](https://cure-lab.github.io/)<sup>2†</sup><br>
 > <sup>1</sup>ARC Lab, Tencent PCG <sup>2</sup>The Chinese University of Hong Kong <sup>3</sup>The University of Tokyo <sup>4</sup>University of Macau <sup>‡</sup>Project Lead <sup>†</sup>Corresponding Author
 
-
-
 <p align="center">
 <a href="https://yxbian23.github.io/project/video-painter">🌍Project Page</a> |
 <a href="https://arxiv.org/abs/2503.05639">📃Arxiv</a> |
@@ -31,10 +32,8 @@ Keywords: Video Inpainting, Video Editing, Video Generation
 <a href="https://huggingface.co/TencentARC/VideoPainter">🤗Hugging Face Model</a> |
 </p>
 
-
 **📖 Table of Contents**
 
-
 - [VideoPainter](#videopainter)
 - [🔥 Update Log](#-update-log)
 - [📋 TODO](#todo)
@@ -49,8 +48,6 @@ Keywords: Video Inpainting, Video Editing, Video Generation
 - [🤝🏼 Cite Us](#-cite-us)
 - [💖 Acknowledgement](#-acknowledgement)
 
-
-
 ## 🔥 Update Log
 - [2025/3/09] 📢 📢 [VideoPainter](https://huggingface.co/TencentARC/VideoPainter) are released, an efficient, any-length video inpainting & editing framework with plug-and-play context control.
 - [2025/3/09] 📢 📢 [VPData](https://huggingface.co/datasets/TencentARC/VPData) and [VPBench](https://huggingface.co/datasets/TencentARC/VPBench) are released, the largest video inpainting dataset with precise segmentation masks and dense video captions (>390K clips).
@@ -68,13 +65,10 @@ Keywords: Video Inpainting, Video Editing, Video Generation
 We propose a novel dual-stream paradigm VideoPainter that incorporates an efficient context encoder (comprising only 6\% of the backbone parameters) to process masked videos and inject backbone-aware background contextual cues to any pre-trained video DiT, producing semantically consistent content in a plug-and-play manner. This architectural separation significantly reduces the model's learning complexity while enabling nuanced integration of crucial background context. We also introduce a novel target region ID resampling technique that enables any-length video inpainting, greatly enhancing our practical applicability. Additionally, we establish a scalable dataset pipeline leveraging current vision understanding models, contributing VPData and VPBench to facilitate segmentation-based inpainting training and assessment, the largest video inpainting dataset and benchmark to date with over 390K diverse clips. Using inpainting as a pipeline basis, we also explore downstream applications including video editing and video editing pair data generation, demonstrating competitive performance and significant practical potential.
 
 
-
-
 ## 🚀 Getting Started
 
 ### Environment Requirement 🌍
 
-
 Clone the repo:
 
 ```
@@ -83,7 +77,6 @@ git clone https://github.com/TencentARC/VideoPainter.git
 
 We recommend you first use `conda` to create virtual environment, and install needed libraries. For example:
 
-
 ```
 conda create -n videopainter python=3.10 -y
 conda activate videopainter
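
After installing the repo's requirements into this environment, it may be worth a quick sanity check that a CUDA-enabled PyTorch build was resolved (a minimal sketch, assuming the requirements install PyTorch):

```
# Print the torch version and whether CUDA is visible
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```
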
@@ -112,7 +105,6 @@ pip install -e .
 
 ### Data Download ⬇️
 
-
 **VPBench and VPData**
 
 You can download the VPBench [here](https://huggingface.co/datasets/TencentARC/VPBench), and the VPData [here](https://huggingface.co/datasets/TencentARC/VPData) (as well as the Davis we re-processed), which are used for training and testing the BrushNet. By downloading the data, you are agreeing to the terms and conditions of the license. The data structure should be like:
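
Both repos are standard Hugging Face dataset repositories, so as an alternative to the repo's download script they can be pulled directly with `huggingface-cli` (a sketch; the `--local-dir` targets are assumptions, align them with the data structure above):

```
huggingface-cli download TencentARC/VPBench --repo-type dataset --local-dir data/VPBench
huggingface-cli download TencentARC/VPData --repo-type dataset --local-dir data/VPData
```
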
@@ -186,7 +178,6 @@ cd data_utils
 python VPData_download.py
 ```
 
-
 **Checkpoints**
 
 Checkpoints of VideoPainter can be downloaded from [here](https://huggingface.co/TencentARC/VideoPainter). The ckpt folder contains
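
The model repo can likewise be fetched in one step (a sketch; the target directory is an assumption, match it to the ckpt layout shown below):

```
# Hypothetical one-step pull of the VideoPainter weights
huggingface-cli download TencentARC/VideoPainter --local-dir ckpt/VideoPainter
```
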
@@ -217,7 +208,6 @@ git clone https://huggingface.co/black-forest-labs/FLUX.1-Fill-dev
 mv ckpt/FLUX.1-Fill-dev ckpt/flux_inp
 ```
 
-
 The ckpt structure should be like:
 
 ```
@@ -240,10 +230,8 @@ The ckpt structure should be like:
 |-- ...
 ```
 
-
 ## 🏃🏼 Running Scripts
 
-
 ### Training 🤯
 
 You can train the VideoPainter using the script:
@@ -262,7 +250,7 @@ export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
 export TOKENIZERS_PARALLELISM=false
 export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
 
-accelerate launch --config_file accelerate_config_machine_single_ds.yaml --
+accelerate launch --config_file accelerate_config_machine_single_ds.yaml --machine-rank 0 \
 train_cogvideox_inpainting_i2v_video.py \
 --pretrained_model_name_or_path $MODEL_PATH \
 --cache_dir $CACHE_PATH \
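
The launch line assumes a single node (machine rank 0) with the eight GPUs exported above. For a quick smoke test on fewer devices, the process count can be overridden from the CLI (a sketch; only the flags that appear in the diff are taken from the repo, `--num_processes` is standard `accelerate launch` usage and may conflict with settings pinned in the DeepSpeed config):

```
# Hypothetical single-GPU debug launch; CLI flags take precedence over the config file
export CUDA_VISIBLE_DEVICES=0
accelerate launch --num_processes 1 \
  --config_file accelerate_config_machine_single_ds.yaml --machine-rank 0 \
  train_cogvideox_inpainting_i2v_video.py \
  --pretrained_model_name_or_path $MODEL_PATH \
  --cache_dir $CACHE_PATH
```
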
@@ -329,7 +317,7 @@ export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
 export TOKENIZERS_PARALLELISM=false
 export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
 
-accelerate launch --config_file accelerate_config_machine_single_ds_wo_cpu.yaml --
+accelerate launch --config_file accelerate_config_machine_single_ds_wo_cpu.yaml --machine-rank 0 \
 train_cogvideox_inpainting_i2v_video_resample.py \
 --pretrained_model_name_or_path $MODEL_PATH \
 --cogvideox_branch_name_or_path $BRANCH_MODEL_PATH \
@@ -388,9 +376,6 @@ accelerate launch --config_file accelerate_config_machine_single_ds_wo_cpu.yaml
 --id_pool_resample_learnable
 ```
 
-
-
-
 ### Inference 📜
 
 You can inference for the video inpainting or editing with the script:
@@ -412,7 +397,6 @@ bash edit_bench.sh
 
 Since VideoPainter is trained on public Internet videos, it primarily performs well on general scenarios. For high-quality industrial applications (e.g., product exhibitions, virtual try-on), we recommend training the model on your domain-specific data. We welcome and appreciate any contributions of trained models from the community!
 
-
 You can also inference through gradio demo:
 
 ```
@@ -424,7 +408,6 @@ CUDA_VISIBLE_DEVICES=0 python app.py \
 --img_inpainting_model ../ckpt/flux_inp
 ```
 
-
 ### Evaluation 📏
 
 You can evaluate using the script:
@@ -441,7 +424,6 @@ bash eval_edit.sh
 bash eval_editing_id_resample.sh
 ```
 
-
 ## 🤝🏼 Cite Us
 
 ```
@@ -456,8 +438,7 @@
 }
 ```
 
-
 ## 💖 Acknowledgement
 <span id="acknowledgement"></span>
 
-Our code is modified based on [diffusers](https://github.com/huggingface/diffusers) and [CogVideoX](https://github.com/THUDM/CogVideo), thanks to all the contributors!
+Our code is modified based on [diffusers](https://github.com/huggingface/diffusers) and [CogVideoX](https://github.com/THUDM/CogVideo), thanks to all the contributors!