Add library name, pipeline tag and license

#1
by nielsr (HF Staff) - opened

Files changed (1): README.md (+8 -27)
--- a/README.md
+++ b/README.md
@@ -1,11 +1,14 @@
 ---
-language:
-- en
 base_model:
 - THUDM/CogVideoX-5b
 - THUDM/CogVideoX-5b-I2V
 - THUDM/CogVideoX1.5-5B
 - THUDM/CogVideoX1.5-5B-I2V
+library_name: diffusers
+license: mit
+pipeline_tag: image-to-video
+language:
+- en
 tags:
 - video
 - video inpainting
@@ -21,8 +24,6 @@ Keywords: Video Inpainting, Video Editing, Video Generation
 > [Yuxuan Bian](https://yxbian23.github.io/)<sup>12</sup>, [Zhaoyang Zhang](https://zzyfd.github.io/#/)<sup>1‑</sup>, [Xuan Ju](https://juxuan27.github.io/)<sup>2</sup>, [Mingdeng Cao](https://openreview.net/profile?id=~Mingdeng_Cao1)<sup>3</sup>, [Liangbin Xie](https://liangbinxie.github.io/)<sup>4</sup>, [Ying Shan](https://www.linkedin.com/in/YingShanProfile/)<sup>1</sup>, [Qiang Xu](https://cure-lab.github.io/)<sup>2βœ‰</sup><br>
 > <sup>1</sup>ARC Lab, Tencent PCG <sup>2</sup>The Chinese University of Hong Kong <sup>3</sup>The University of Tokyo <sup>4</sup>University of Macau <sup>‑</sup>Project Lead <sup>βœ‰</sup>Corresponding Author
 
-
-
 <p align="center">
 <a href="https://yxbian23.github.io/project/video-painter">🌐Project Page</a> |
 <a href="https://arxiv.org/abs/2503.05639">πŸ“œArxiv</a> |
@@ -31,10 +32,8 @@ Keywords: Video Inpainting, Video Editing, Video Generation
 <a href="https://huggingface.co/TencentARC/VideoPainter">πŸ€—Hugging Face Model</a> |
 </p>
 
-
 **πŸ“– Table of Contents**
 
-
 - [VideoPainter](#videopainter)
 - [πŸ”₯ Update Log](#-update-log)
 - [πŸ“Œ TODO](#todo)
@@ -49,8 +48,6 @@ Keywords: Video Inpainting, Video Editing, Video Generation
 - [🀝🏼 Cite Us](#-cite-us)
 - [πŸ’– Acknowledgement](#-acknowledgement)
 
-
-
 ## πŸ”₯ Update Log
 - [2025/3/09] πŸ“’ πŸ“’ [VideoPainter](https://huggingface.co/TencentARC/VideoPainter) are released, an efficient, any-length video inpainting & editing framework with plug-and-play context control.
 - [2025/3/09] πŸ“’ πŸ“’ [VPData](https://huggingface.co/datasets/TencentARC/VPData) and [VPBench](https://huggingface.co/datasets/TencentARC/VPBench) are released, the largest video inpainting dataset with precise segmentation masks and dense video captions (>390K clips).
@@ -68,13 +65,10 @@ Keywords: Video Inpainting, Video Editing, Video Generation
 We propose a novel dual-stream paradigm VideoPainter that incorporates an efficient context encoder (comprising only 6\% of the backbone parameters) to process masked videos and inject backbone-aware background contextual cues to any pre-trained video DiT, producing semantically consistent content in a plug-and-play manner. This architectural separation significantly reduces the model's learning complexity while enabling nuanced integration of crucial background context. We also introduce a novel target region ID resampling technique that enables any-length video inpainting, greatly enhancing our practical applicability. Additionally, we establish a scalable dataset pipeline leveraging current vision understanding models, contributing VPData and VPBench to facilitate segmentation-based inpainting training and assessment, the largest video inpainting dataset and benchmark to date with over 390K diverse clips. Using inpainting as a pipeline basis, we also explore downstream applications including video editing and video editing pair data generation, demonstrating competitive performance and significant practical potential.
 ![](assets/method.jpg)
 
-
-
 ## πŸš€ Getting Started
 
 ### Environment Requirement 🌍
 
-
 Clone the repo:
 
 ```
@@ -83,7 +77,6 @@ git clone https://github.com/TencentARC/VideoPainter.git
 
 We recommend you first use `conda` to create virtual environment, and install needed libraries. For example:
 
-
 ```
 conda create -n videopainter python=3.10 -y
 conda activate videopainter
@@ -112,7 +105,6 @@ pip install -e .
 
 ### Data Download ⬇️
 
-
 **VPBench and VPData**
 
 You can download the VPBench [here](https://huggingface.co/datasets/TencentARC/VPBench), and the VPData [here](https://huggingface.co/datasets/TencentARC/VPData) (as well as the Davis we re-processed), which are used for training and testing the BrushNet. By downloading the data, you are agreeing to the terms and conditions of the license. The data structure should be like:
@@ -186,7 +178,6 @@ cd data_utils
 python VPData_download.py
 ```
 
-
 **Checkpoints**
 
 Checkpoints of VideoPainter can be downloaded from [here](https://huggingface.co/TencentARC/VideoPainter). The ckpt folder contains
@@ -217,7 +208,6 @@ git clone https://huggingface.co/black-forest-labs/FLUX.1-Fill-dev
 mv ckpt/FLUX.1-Fill-dev ckpt/flux_inp
 ```
 
-
 The ckpt structure should be like:
 
 ```
@@ -240,10 +230,8 @@ The ckpt structure should be like:
 |-- ...
 ```
 
-
 ## πŸƒπŸΌ Running Scripts
 
-
 ### Training 🀯
 
 You can train the VideoPainter using the script:
@@ -262,7 +250,7 @@ export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
 export TOKENIZERS_PARALLELISM=false
 export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
 
-accelerate launch --config_file accelerate_config_machine_single_ds.yaml --machine_rank 0 \
+accelerate launch --config_file accelerate_config_machine_single_ds.yaml --machine-rank 0 \
 train_cogvideox_inpainting_i2v_video.py \
 --pretrained_model_name_or_path $MODEL_PATH \
 --cache_dir $CACHE_PATH \
@@ -329,7 +317,7 @@ export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
 export TOKENIZERS_PARALLELISM=false
 export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
 
-accelerate launch --config_file accelerate_config_machine_single_ds_wo_cpu.yaml --machine_rank 0 \
+accelerate launch --config_file accelerate_config_machine_single_ds_wo_cpu.yaml --machine-rank 0 \
 train_cogvideox_inpainting_i2v_video_resample.py \
 --pretrained_model_name_or_path $MODEL_PATH \
 --cogvideox_branch_name_or_path $BRANCH_MODEL_PATH \
@@ -388,9 +376,6 @@ accelerate launch --config_file accelerate_config_machine_single_ds_wo_cpu.yaml
 --id_pool_resample_learnable
 ```
 
-
-
-
 ### Inference πŸ“œ
 
 You can inference for the video inpainting or editing with the script:
@@ -412,7 +397,6 @@ bash edit_bench.sh
 
 Since VideoPainter is trained on public Internet videos, it primarily performs well on general scenarios. For high-quality industrial applications (e.g., product exhibitions, virtual try-on), we recommend training the model on your domain-specific data. We welcome and appreciate any contributions of trained models from the community!
 
-
 You can also inference through gradio demo:
 
 ```
@@ -424,7 +408,6 @@ CUDA_VISIBLE_DEVICES=0 python app.py \
 --img_inpainting_model ../ckpt/flux_inp
 ```
 
-
 ### Evaluation πŸ“
 
 You can evaluate using the script:
@@ -441,7 +424,6 @@ bash eval_edit.sh
 bash eval_editing_id_resample.sh
 ```
 
-
 ## 🀝🏼 Cite Us
 
 ```
@@ -456,8 +438,7 @@ bash eval_editing_id_resample.sh
 }
 ```
 
-
 ## πŸ’– Acknowledgement
 <span id="acknowledgement"></span>
 
-Our code is modified based on [diffusers](https://github.com/huggingface/diffusers) and [CogVideoX](https://github.com/THUDM/CogVideo), thanks to all the contributors!
+Our code is modified based on [diffusers](https://github.com/huggingface/diffusers) and [CogVideoX](https://github.com/THUDM/CogVideo), thanks to all the contributors!
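The YAML front matter this PR touches can be sanity-checked locally before pushing. A minimal sketch, assuming PyYAML (`pip install pyyaml`) is available; the `readme` string below just inlines the metadata block as it stands after this diff:

```python
import yaml  # PyYAML, assumed installed; any YAML parser works

# The README front matter as it reads after this PR is applied.
readme = """\
---
base_model:
- THUDM/CogVideoX-5b
- THUDM/CogVideoX-5b-I2V
- THUDM/CogVideoX1.5-5B
- THUDM/CogVideoX1.5-5B-I2V
library_name: diffusers
license: mit
pipeline_tag: image-to-video
language:
- en
tags:
- video
- video inpainting
---
"""

# Extract the YAML between the first pair of '---' fences and parse it.
front_matter = readme.split("---")[1]
meta = yaml.safe_load(front_matter)

# The three fields this PR adds:
print(meta["library_name"], meta["license"], meta["pipeline_tag"])
```

The Hub reads these fields from the model card to populate the library badge, the task filter, and the license display, so a parse error or a misspelled key here silently degrades the model page rather than failing loudly.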