| --- |
| language: |
| - en |
| base_model: |
| - THUDM/CogVideoX-5b |
| - THUDM/CogVideoX-5b-I2V |
| - THUDM/CogVideoX1.5-5B |
| - THUDM/CogVideoX1.5-5B-I2V |
| tags: |
| - video |
| - video inpainting |
| - video editing |
| --- |
| |
|
|
| # VideoPainter |
|
|
| This repository contains the implementation of the paper "VideoPainter: Any-length Video Inpainting and Editing with Plug-and-Play Context Control". |
|
|
| Keywords: Video Inpainting, Video Editing, Video Generation |
|
|
| > [Yuxuan Bian](https://yxbian23.github.io/)<sup>12</sup>, [Zhaoyang Zhang](https://zzyfd.github.io/#/)<sup>1‡</sup>, [Xuan Ju](https://juxuan27.github.io/)<sup>2</sup>, [Mingdeng Cao](https://openreview.net/profile?id=~Mingdeng_Cao1)<sup>3</sup>, [Liangbin Xie](https://liangbinxie.github.io/)<sup>4</sup>, [Ying Shan](https://www.linkedin.com/in/YingShanProfile/)<sup>1</sup>, [Qiang Xu](https://cure-lab.github.io/)<sup>2✉</sup><br> |
| > <sup>1</sup>ARC Lab, Tencent PCG <sup>2</sup>The Chinese University of Hong Kong <sup>3</sup>The University of Tokyo <sup>4</sup>University of Macau <sup>‡</sup>Project Lead <sup>✉</sup>Corresponding Author |
|
|
|
|
|
|
| <p align="center"> |
| <a href='https://yxbian23.github.io/project/video-painter'><img src='https://img.shields.io/badge/Project-Page-Green'></a> |
| <a href="https://arxiv.org/abs/2503.05639"><img src="https://img.shields.io/badge/arXiv-2503.05639-b31b1b.svg"></a> |
| <a href="https://github.com/TencentARC/VideoPainter"><img src="https://img.shields.io/badge/GitHub-Code-black?logo=github"></a> |
| <a href="https://youtu.be/HYzNfsD3A0s"><img src="https://img.shields.io/badge/YouTube-Video-red?logo=youtube"></a> |
| <a href='https://huggingface.co/datasets/TencentARC/VPData'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Dataset-blue'></a> |
| <a href='https://huggingface.co/datasets/TencentARC/VPBench'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Benchmark-blue'></a> |
| <a href="https://huggingface.co/TencentARC/VideoPainter"><img src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Model-blue"></a> |
| </p> |
|
|
| **Your star means a lot to us in developing this project!** ⭐⭐⭐ |
|
|
| **VPData and VPBench have been fully uploaded (containing 390K mask sequences and video captions). Welcome to use VPData, our largest video segmentation dataset with video captions!** 🔥🔥🔥 |
|
|
|
|
| **📖 Table of Contents** |
|
|
|
|
| - [VideoPainter](#videopainter) |
| - [🔥 Update Log](#-update-log) |
| - [📌 TODO](#todo) |
| - [🛠️ Method Overview](#️-method-overview) |
| - [🚀 Getting Started](#-getting-started) |
| - [Environment Requirement 🌍](#environment-requirement-) |
| - [Data Download ⬇️](#data-download-️) |
| - [🏃🏼 Running Scripts](#-running-scripts) |
| - [Training 🤯](#training-) |
| - [Inference 📜](#inference-) |
| - [Evaluation 📏](#evaluation-) |
| - [🤝🏼 Cite Us](#-cite-us) |
| - [💖 Acknowledgement](#-acknowledgement) |
|
|
|
|
|
|
| ## 🔥 Update Log |
| - [2025/3/09] 📢 📢 [VideoPainter](https://huggingface.co/TencentARC/VideoPainter) is released: an efficient, any-length video inpainting & editing framework with plug-and-play context control. |
| - [2025/3/09] 📢 📢 [VPData](https://huggingface.co/datasets/TencentARC/VPData) and [VPBench](https://huggingface.co/datasets/TencentARC/VPBench) are released: the largest video inpainting dataset with precise segmentation masks and dense video captions (>390K clips). |
| - [2025/3/25] 📢 📢 The 390K+ high-quality video segmentation masks of [VPData](https://huggingface.co/datasets/TencentARC/VPData) have been fully released. |
| - [2025/3/25] 📢 📢 The raw videos of the Videovo subset have been uploaded to [VPData](https://huggingface.co/datasets/TencentARC/VPData) to resolve the raw video link expiration issue. |
|
|
| ## TODO |
|
|
| - [x] Release training and inference code |
| - [x] Release evaluation code |
| - [x] Release [VideoPainter checkpoints](https://huggingface.co/TencentARC/VideoPainter) (based on CogVideoX-5B) |
| - [x] Release [VPData and VPBench](https://huggingface.co/collections/TencentARC/videopainter-67cc49c6146a48a2ba93d159) for large-scale training and evaluation. |
| - [x] Release gradio demo |
| - [ ] Data preprocessing code |
|
| ## 🛠️ Method Overview |
|
|
| We propose VideoPainter, a novel dual-stream paradigm that incorporates an efficient context encoder (comprising only 6% of the backbone parameters) to process masked videos and inject backbone-aware background contextual cues into any pre-trained video DiT, producing semantically consistent content in a plug-and-play manner. This architectural separation significantly reduces the model's learning complexity while enabling nuanced integration of crucial background context. We also introduce a novel target region ID resampling technique that enables any-length video inpainting, greatly enhancing practical applicability. Additionally, we establish a scalable dataset pipeline leveraging current vision understanding models, contributing VPData and VPBench, the largest video inpainting dataset and benchmark to date with over 390K diverse clips, to facilitate segmentation-based inpainting training and assessment. Using inpainting as a pipeline basis, we also explore downstream applications including video editing and video editing pair data generation, demonstrating competitive performance and significant practical potential. |
|  |
|
|
|
|
|
|
| ## 🚀 Getting Started |
|
|
| <details> |
| <summary><b>Environment Requirement 🌍</b></summary> |
|
|
|
|
| Clone the repo: |
|
|
| ``` |
| git clone https://github.com/TencentARC/VideoPainter.git |
| ``` |
|
|
| We recommend first using `conda` to create a virtual environment and install the needed libraries. For example: |
|
|
|
|
| ``` |
| conda create -n videopainter python=3.10 -y |
| conda activate videopainter |
| pip install -r requirements.txt |
| ``` |
|
|
| Then, you can install the modified `diffusers` bundled in this repo with: |
|
|
| ``` |
| cd ./diffusers |
| pip install -e . |
| ``` |
|
|
| After that, you can install the required `ffmpeg` through: |
|
|
| ``` |
| conda install -c conda-forge ffmpeg -y |
| ``` |
|
|
| Optionally, you can install SAM2 for the Gradio demo through: |
|
|
| ``` |
| cd ./app |
| pip install -e . |
| ``` |
| </details> |
|
|
| <details> |
| <summary><b>VPBench and VPData Download ⬇️</b></summary> |
|
|
| You can download VPBench [here](https://huggingface.co/datasets/TencentARC/VPBench) and VPData [here](https://huggingface.co/datasets/TencentARC/VPData) (as well as our re-processed Davis), which are used for training and testing VideoPainter. By downloading the data, you agree to the terms and conditions of the license. The data structure should be as follows: |
|
|
| ``` |
| |-- data |
| |-- davis |
| |-- JPEGImages_432_240 |
| |-- test_masks |
| |-- davis_caption |
| |-- test.json |
| |-- train.json |
| |-- videovo/raw_video |
| |-- 000005000 |
| |-- 000005000000.0.mp4 |
| |-- 000005000001.0.mp4 |
| |-- ... |
| |-- 000005001 |
| |-- ... |
| |-- pexels/pexels/raw_video |
| |-- 000000000 |
| |-- 000000000000_852038.mp4 |
| |-- 000000000001_852057.mp4 |
| |-- ... |
| |-- 000000001 |
| |-- ... |
| |-- video_inpainting |
| |-- videovo |
| |-- 000005000000/all_masks.npz |
| |-- 000005000001/all_masks.npz |
| |-- ... |
| |-- pexels |
| |-- ... |
| |-- pexels_videovo_train_dataset.csv |
| |-- pexels_videovo_val_dataset.csv |
| |-- pexels_videovo_test_dataset.csv |
| |-- our_video_inpaint.csv |
| |-- our_video_inpaint_long.csv |
| |-- our_video_edit.csv |
| |-- our_video_edit_long.csv |
| |-- pexels.csv |
| |-- videovo.csv |
| |
| ``` |
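|
|
| As a quick sanity check of this layout, the sketch below pairs one Videovo clip with its mask sequence. The CSV column names and the `.npz` array keys are assumptions here, so inspect the printed output to confirm them against the actual files. |
|
| ``` |
| import numpy as np |
| import pandas as pd |
| |
| meta = pd.read_csv("data/pexels_videovo_train_dataset.csv") |
| print(meta.columns.tolist())  # confirm the actual column names |
| |
| clip_id = "000005000000"  # taken from the tree above |
| video_path = f"data/videovo/raw_video/{clip_id[:9]}/{clip_id}.0.mp4" |
| |
| # Each clip ships its masks as one compressed npz; the keys may differ. |
| masks = np.load(f"data/video_inpainting/videovo/{clip_id}/all_masks.npz") |
| for key in masks.files: |
|     print(key, masks[key].shape)  # expect (num_frames, H, W) per sequence |
| ``` |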
|
|
| You can download VPBench and put the benchmark into the `data` folder by: |
| ``` |
| git lfs install |
| git clone https://huggingface.co/datasets/TencentARC/VPBench |
| mv VPBench data |
| cd data |
| unzip pexels.zip |
| unzip videovo.zip |
| unzip davis.zip |
| unzip video_inpainting.zip |
| ``` |
|
|
| You can download VPData (only the mask and text annotations, due to space limits) and put the dataset into the `data` folder by: |
| ``` |
| git lfs install |
| git clone https://huggingface.co/datasets/TencentARC/VPData |
| mv VPData data |
| |
| # 1. unzip the masks in VPData |
| python data_utils/unzip_folder.py --source_dir ./data/videovo_masks --target_dir ./data/video_inpainting/videovo |
| python data_utils/unzip_folder.py --source_dir ./data/pexels_masks --target_dir ./data/video_inpainting/pexels |
| |
| # 2. unzip the raw videos in Videovo subset in VPData |
| python data_utils/unzip_folder.py --source_dir ./data/videovo_raw_videos --target_dir ./data/videovo/raw_video |
| ``` |
|
|
| Note: *Due to space limits, you need to run the following script to download the raw videos of the Pexels subset in VPData. The format is consistent with VPData/VPBench above (after downloading VPData/VPBench, the script automatically places the raw videos into the corresponding dataset directories created by VPBench).* |
|
|
| ``` |
| cd data_utils |
| python VPData_download.py |
| ``` |
|
|
| </details> |
|
|
| <details> |
| <summary><b>Checkpoints</b></summary> |
|
|
| Checkpoints of VideoPainter can be downloaded from [here](https://huggingface.co/TencentARC/VideoPainter). The `ckpt` folder contains: |
|
|
| - VideoPainter pretrained checkpoints for CogVideoX-5b-I2V |
| - VideoPainter IP Adapter pretrained checkpoints for CogVideoX-5b-I2V |
| - the pretrained CogVideoX-5b-I2V checkpoint from [HuggingFace](https://huggingface.co/THUDM/CogVideoX-5b-I2V). |
|
|
| You can download the checkpoints and put them into the `ckpt` folder by: |
| ``` |
| git lfs install |
| git clone https://huggingface.co/TencentARC/VideoPainter |
| mv VideoPainter ckpt |
| ``` |
|
|
| You also need to download the base model [CogVideoX-5B-I2V](https://huggingface.co/THUDM/CogVideoX-5b-I2V) by: |
| ``` |
| git lfs install |
| cd ckpt |
| git clone https://huggingface.co/THUDM/CogVideoX-5b-I2V |
| ``` |
|
|
| [Optional] You need to download [FLUX.1-Fill-dev](https://huggingface.co/black-forest-labs/FLUX.1-Fill-dev/) for first-frame inpainting: |
| ``` |
| git lfs install |
| cd ckpt |
| git clone https://huggingface.co/black-forest-labs/FLUX.1-Fill-dev |
| mv FLUX.1-Fill-dev flux_inp |
| ``` |
|
|
| [Optional] You need to download [SAM2](https://huggingface.co/facebook/sam2-hiera-large) for video segmentation in the Gradio demo: |
| ``` |
| git lfs install |
| cd ckpt |
| wget https://huggingface.co/facebook/sam2-hiera-large/resolve/main/sam2_hiera_large.pt |
| ``` |
| You can also choose a segmentation checkpoint of another size to balance efficiency and performance, such as [SAM2-Tiny](https://huggingface.co/facebook/sam2-hiera-tiny). |
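|
| To smoke-test the SAM2 checkpoint outside the demo, a minimal sketch like the one below can help; note that the config name depends on the `sam2` version installed from `./app` (older releases use `"sam2_hiera_l.yaml"`, newer ones prefix it with `configs/`): |
|
| ``` |
| import torch |
| from sam2.build_sam import build_sam2_video_predictor |
| |
| # Build the video predictor from the checkpoint downloaded above |
| # (path relative to the repo root). |
| predictor = build_sam2_video_predictor( |
|     "sam2_hiera_l.yaml",           # match this to your installed sam2 version |
|     "ckpt/sam2_hiera_large.pt", |
|     device="cuda" if torch.cuda.is_available() else "cpu", |
| ) |
| print(type(predictor).__name__) |
| ``` |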
|
|
| The `ckpt` structure should be as follows: |
|
|
| ``` |
| |-- ckpt |
| |-- VideoPainter/checkpoints |
| |-- branch |
| |-- config.json |
| |-- diffusion_pytorch_model.safetensors |
| |-- VideoPainterID/checkpoints |
| |-- pytorch_lora_weights.safetensors |
| |-- CogVideoX-5b-I2V |
| |-- scheduler |
| |-- transformer |
| |-- vae |
| |-- ... |
| |-- flux_inp |
| |-- scheduler |
| |-- transformer |
| |-- vae |
| |-- ... |
| |-- sam2_hiera_large.pt |
| ``` |
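|
| Before launching training, inference, or the demo, a small check like the following (paths taken from the tree above; the last two entries are only needed for the optional components) can catch misplaced downloads early: |
|
| ``` |
| from pathlib import Path |
| |
| ckpt = Path("ckpt") |
| required = [ |
|     "VideoPainter/checkpoints/branch/diffusion_pytorch_model.safetensors", |
|     "VideoPainterID/checkpoints/pytorch_lora_weights.safetensors", |
|     "CogVideoX-5b-I2V/transformer", |
|     "flux_inp/transformer",   # optional: first-frame inpainting |
|     "sam2_hiera_large.pt",    # optional: Gradio demo |
| ] |
| for rel in required: |
|     print("ok     " if (ckpt / rel).exists() else "MISSING", rel) |
| ``` |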
| </details> |
|
|
| ## 🏃🏼 Running Scripts |
|
|
| <details> |
| <summary><b>Training 🤯</b></summary> |
|
|
| You can train VideoPainter with the scripts below: the first (`train/VideoPainter.sh`) trains the inpainting context branch, and the second (`train/VideoPainterID.sh`) trains the ID-resampling adapter on top of the pretrained branch: |
|
|
| ``` |
| # cd train |
| # bash VideoPainter.sh |
| |
| export MODEL_PATH="../ckpt/CogVideoX-5b-I2V" |
| export CACHE_PATH="~/.cache" |
| export DATASET_PATH="../data/videovo/raw_video" |
| export PROJECT_NAME="pexels_videovo-inpainting" |
| export RUNS_NAME="VideoPainter" |
| export OUTPUT_PATH="./${PROJECT_NAME}/${RUNS_NAME}" |
| export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True |
| export TOKENIZERS_PARALLELISM=false |
| export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 |
| |
| accelerate launch --config_file accelerate_config_machine_single_ds.yaml --machine_rank 0 \ |
| train_cogvideox_inpainting_i2v_video.py \ |
| --pretrained_model_name_or_path $MODEL_PATH \ |
| --cache_dir $CACHE_PATH \ |
| --meta_file_path ../data/pexels_videovo_train_dataset.csv \ |
| --val_meta_file_path ../data/pexels_videovo_val_dataset.csv \ |
| --instance_data_root $DATASET_PATH \ |
| --dataloader_num_workers 1 \ |
| --num_validation_videos 1 \ |
| --validation_epochs 1 \ |
| --seed 42 \ |
| --mixed_precision bf16 \ |
| --output_dir $OUTPUT_PATH \ |
| --height 480 \ |
| --width 720 \ |
| --fps 8 \ |
| --max_num_frames 49 \ |
| --video_reshape_mode "resize" \ |
| --skip_frames_start 0 \ |
| --skip_frames_end 0 \ |
| --max_text_seq_length 226 \ |
| --branch_layer_num 2 \ |
| --train_batch_size 1 \ |
| --num_train_epochs 10 \ |
| --checkpointing_steps 1024 \ |
| --validating_steps 256 \ |
| --gradient_accumulation_steps 1 \ |
| --learning_rate 1e-5 \ |
| --lr_scheduler cosine_with_restarts \ |
| --lr_warmup_steps 1000 \ |
| --lr_num_cycles 1 \ |
| --enable_slicing \ |
| --enable_tiling \ |
| --noised_image_dropout 0.05 \ |
| --gradient_checkpointing \ |
| --optimizer AdamW \ |
| --adam_beta1 0.9 \ |
| --adam_beta2 0.95 \ |
| --max_grad_norm 1.0 \ |
| --allow_tf32 \ |
| --report_to wandb \ |
| --tracker_name $PROJECT_NAME \ |
| --runs_name $RUNS_NAME \ |
| --inpainting_loss_weight 1.0 \ |
| --mix_train_ratio 0 \ |
| --first_frame_gt \ |
| --mask_add \ |
| --mask_transform_prob 0.3 \ |
| --p_brush 0.4 \ |
| --p_rect 0.1 \ |
| --p_ellipse 0.1 \ |
| --p_circle 0.1 \ |
| --p_random_brush 0.3 |
| |
| # cd train |
| # bash VideoPainterID.sh |
| export MODEL_PATH="../ckpt/CogVideoX-5b-I2V" |
| export BRANCH_MODEL_PATH="../ckpt/VideoPainter/checkpoints/branch" |
| export CACHE_PATH="~/.cache" |
| export DATASET_PATH="../data/videovo/raw_video" |
| export PROJECT_NAME="pexels_videovo-inpainting" |
| export RUNS_NAME="VideoPainterID" |
| export OUTPUT_PATH="./${PROJECT_NAME}/${RUNS_NAME}" |
| export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True |
| export TOKENIZERS_PARALLELISM=false |
| export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 |
| |
| accelerate launch --config_file accelerate_config_machine_single_ds_wo_cpu.yaml --machine_rank 0 \ |
| train_cogvideox_inpainting_i2v_video_resample.py \ |
| --pretrained_model_name_or_path $MODEL_PATH \ |
| --cogvideox_branch_name_or_path $BRANCH_MODEL_PATH \ |
| --cache_dir $CACHE_PATH \ |
| --meta_file_path ../data/pexels_videovo_train_dataset.csv \ |
| --val_meta_file_path ../data/pexels_videovo_val_dataset.csv \ |
| --instance_data_root $DATASET_PATH \ |
| --dataloader_num_workers 1 \ |
| --num_validation_videos 1 \ |
| --validation_epochs 1 \ |
| --seed 42 \ |
| --rank 256 \ |
| --lora_alpha 128 \ |
| --mixed_precision bf16 \ |
| --output_dir $OUTPUT_PATH \ |
| --height 480 \ |
| --width 720 \ |
| --fps 8 \ |
| --max_num_frames 49 \ |
| --video_reshape_mode "resize" \ |
| --skip_frames_start 0 \ |
| --skip_frames_end 0 \ |
| --max_text_seq_length 226 \ |
| --branch_layer_num 2 \ |
| --train_batch_size 1 \ |
| --num_train_epochs 10 \ |
| --checkpointing_steps 256 \ |
| --validating_steps 128 \ |
| --gradient_accumulation_steps 1 \ |
| --learning_rate 5e-5 \ |
| --lr_scheduler cosine_with_restarts \ |
| --lr_warmup_steps 200 \ |
| --lr_num_cycles 1 \ |
| --enable_slicing \ |
| --enable_tiling \ |
| --noised_image_dropout 0.05 \ |
| --gradient_checkpointing \ |
| --optimizer AdamW \ |
| --adam_beta1 0.9 \ |
| --adam_beta2 0.95 \ |
| --max_grad_norm 1.0 \ |
| --allow_tf32 \ |
| --report_to wandb \ |
| --tracker_name $PROJECT_NAME \ |
| --runs_name $RUNS_NAME \ |
| --inpainting_loss_weight 1.0 \ |
| --mix_train_ratio 0 \ |
| --first_frame_gt \ |
| --mask_add \ |
| --mask_transform_prob 0.3 \ |
| --p_brush 0.4 \ |
| --p_rect 0.1 \ |
| --p_ellipse 0.1 \ |
| --p_circle 0.1 \ |
| --p_random_brush 0.3 \ |
| --id_pool_resample_learnable |
| ``` |
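|
| In both scripts, `--mask_transform_prob` gates whether a clip's segmentation mask is replaced by a synthetic one, and the `--p_*` weights (which sum to 1.0) choose the synthetic mask shape. Below is a hypothetical sketch of that sampling logic, under our reading of the flags rather than the repo's actual implementation: |
|
| ``` |
| import random |
| |
| def sample_mask_type(mask_transform_prob=0.3, weights=None): |
|     # Return a synthetic mask shape, or None to keep the original mask. |
|     if weights is None: |
|         weights = {"brush": 0.4, "rect": 0.1, "ellipse": 0.1, |
|                    "circle": 0.1, "random_brush": 0.3}  # mirrors the --p_* flags |
|     if random.random() >= mask_transform_prob: |
|         return None  # keep the segmentation mask as-is |
|     return random.choices(list(weights), weights=list(weights.values()))[0] |
| |
| print(sample_mask_type()) |
| ``` |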
| </details> |
|
|
|
|
| <details> |
| <summary><b>Inference 📜</b></summary> |
|
|
| You can run inference for video inpainting or editing with the scripts: |
|
|
| ``` |
| cd infer |
| # video inpainting |
| bash inpaint.sh |
| # video inpainting with ID resampling |
| bash inpaint_id_resample.sh |
| # video editing |
| bash edit.sh |
| ``` |
|
|
| Our VideoPainter can also function as a video editing pair data generator; you can run it with the script: |
| ``` |
| bash edit_bench.sh |
| ``` |
|
|
| Since VideoPainter is trained on public Internet videos, it performs best on general scenarios. For high-quality industrial applications (e.g., product exhibitions, virtual try-on), we recommend training the model on your domain-specific data. We welcome and appreciate any contributions of trained models from the community! |
| </details> |
|
|
| <details> |
| <summary><b>Gradio Demo 🖌️</b></summary> |
|
|
| You can also run inference through the Gradio demo: |
|
|
| ``` |
| # cd app |
| CUDA_VISIBLE_DEVICES=0 python app.py \ |
| --model_path ../ckpt/CogVideoX-5b-I2V \ |
| --inpainting_branch ../ckpt/VideoPainter/checkpoints/branch \ |
| --id_adapter ../ckpt/VideoPainterID/checkpoints \ |
| --img_inpainting_model ../ckpt/flux_inp |
| ``` |
| </details> |
|
|
|
|
| <details> |
| <summary><b>Evaluation 📏</b></summary> |
|
|
| You can run evaluation using the scripts: |
|
|
| ``` |
| cd evaluate |
| # video inpainting |
| bash eval_inpainting.sh |
| # video inpainting with ID resampling |
| bash eval_inpainting_id_resample.sh |
| # video editing |
| bash eval_edit.sh |
| # video editing with ID resampling |
| bash eval_editing_id_resample.sh |
| ``` |
| </details> |
|
|
| ## 🤝🏼 Cite Us |
|
|
| ``` |
| @article{bian2025videopainter, |
| title={VideoPainter: Any-length Video Inpainting and Editing with Plug-and-Play Context Control}, |
| author={Bian, Yuxuan and Zhang, Zhaoyang and Ju, Xuan and Cao, Mingdeng and Xie, Liangbin and Shan, Ying and Xu, Qiang}, |
| journal={arXiv preprint arXiv:2503.05639}, |
| year={2025} |
| } |
| ``` |
|
|
|
|
| ## 💖 Acknowledgement |
| <span id="acknowledgement"></span> |
|
|
| Our code builds on [diffusers](https://github.com/huggingface/diffusers) and [CogVideoX](https://github.com/THUDM/CogVideo); thanks to all the contributors! |
|
|