	---
	pipeline_tag: video-to-video
	library_name: diffusers
	license: apache-2.0
	---

	# Vivid-VR: Distilling Concepts from Text-to-Video Diffusion Transformer for Photorealistic Video Restoration

	[📚 Paper](https://huggingface.co/papers/2508.14483) \| [🌐 Project Page](https://csbhr.github.io/projects/vivid-vr/) \| [💻 Code](https://github.com/csbhr/Vivid-VR)

	<div align="center">
	<img style="width:100%" src="assets/teaser.png">
	</div>

	For more quantitative and visual results, check out our [project page](https://csbhr.github.io/projects/vivid-vr/).

	---

	## 🎬 Overview
	![overall_structure](assets/framework.png)

	## 🔧 Dependencies and Installation
	1. Clone Repo
	```bash
	git clone https://github.com/csbhr/Vivid-VR.git
	cd Vivid-VR
	```

	2. Create Conda Environment and Install Dependencies
	```bash
	# create new conda env
	conda create -n Vivid-VR python=3.10
	conda activate Vivid-VR

	# install pytorch
	pip install torch==2.2.1 torchvision==0.17.1 torchaudio==2.2.1 --index-url https://download.pytorch.org/whl/cu121

	# install python dependencies
	pip install -r requirements.txt

	# install easyocr [Optional, for text fix]
	pip install easyocr
	pip install numpy==1.26.4 # installing easyocr may pull in numpy 2.x, which causes conflicts, so pin it back
	```
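After installation, a quick check can confirm that the pins above are still in effect, in particular that NumPy was not bumped to 2.x. The following is a minimal sketch, not part of the repository: the helper names `find_version_problems` and `installed_versions` are hypothetical, and it only covers the packages named in the commands above.

```python
from importlib.metadata import PackageNotFoundError, version

# Version prefixes taken from the install commands above.
# NumPy is checked for "1." because the README warns that 2.x causes conflicts.
EXPECTED_PREFIXES = {
    "torch": "2.2.1",
    "torchvision": "0.17.1",
    "torchaudio": "2.2.1",
    "numpy": "1.",
}


def find_version_problems(installed):
    """Return one message per package that is missing or off its pin.

    `installed` maps a package name to its version string, or to None
    when the package is absent.
    """
    problems = []
    for name, prefix in EXPECTED_PREFIXES.items():
        found = installed.get(name)
        if found is None:
            problems.append(f"{name}: not installed")
        elif not found.startswith(prefix):
            problems.append(f"{name}: found {found}, expected {prefix}*")
    return problems


def installed_versions():
    """Look up the versions present in the active environment."""
    versions = {}
    for name in EXPECTED_PREFIXES:
        try:
            versions[name] = version(name)
        except PackageNotFoundError:
            versions[name] = None
    return versions
```

Running `print(find_version_problems(installed_versions()))` inside the `Vivid-VR` environment should print an empty list when everything matches.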

	3. Download Models

	- [Required] Download CogVideoX1.5-5B checkpoints from [[huggingface]](https://huggingface.co/zai-org/CogVideoX1.5-5B).
	- [Required] Download cogvlm2-llama3-caption checkpoints from [[huggingface]](https://huggingface.co/zai-org/cogvlm2-llama3-caption).
	- After downloading, replace `modeling_cogvlm.py` in the cogvlm2-llama3-caption checkpoint with `./VRDiT/cogvlm2-llama3-caption/modeling_cogvlm.py` to remove the dependency on [pytorchvideo](https://github.com/facebookresearch/pytorchvideo).
	- [Required] Download Vivid-VR checkpoints from [[huggingface]](https://huggingface.co/csbhr/Vivid-VR).
	- [Optional, for text fix] Download easyocr checkpoints [[english_g2]](https://github.com/JaidedAI/EasyOCR/releases/download/v1.3/english_g2.zip) [[zh_sim_g2]](https://github.com/JaidedAI/EasyOCR/releases/download/v1.3/zh_sim_g2.zip) [[craft_mlt_25k]](https://github.com/JaidedAI/EasyOCR/releases/download/pre-v1.1.6/craft_mlt_25k.zip).
	- [Optional, for text fix] Download Real-ESRGAN checkpoints [[RealESRGAN_x2plus]](https://github.com/xinntao/Real-ESRGAN/releases/download/v0.2.1/RealESRGAN_x2plus.pth).
	- Put them under the `./ckpts` folder.
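
The three required downloads can also be scripted. This is a minimal sketch that assumes the `huggingface_hub` package is installed and that each repository's root matches the folder contents expected under `./ckpts`; the helper names `download_plan` and `download_required` are hypothetical and not part of this repo.

```python
from pathlib import Path

# Repo ids come from the links above; folder names follow the `ckpts` layout.
REQUIRED_REPOS = {
    "zai-org/CogVideoX1.5-5B": "CogVideoX1.5-5B",
    "zai-org/cogvlm2-llama3-caption": "cogvlm2-llama3-caption",
    "csbhr/Vivid-VR": "Vivid-VR",
}


def download_plan(ckpt_dir="./ckpts"):
    """Map each required repo id to the local directory it should land in."""
    root = Path(ckpt_dir)
    return {repo_id: root / folder for repo_id, folder in REQUIRED_REPOS.items()}


def download_required(ckpt_dir="./ckpts"):
    """Fetch every required checkpoint (needs network access and disk space)."""
    # Imported lazily so the plan can be inspected without huggingface_hub.
    from huggingface_hub import snapshot_download

    for repo_id, local_dir in download_plan(ckpt_dir).items():
        local_dir.mkdir(parents=True, exist_ok=True)
        snapshot_download(repo_id=repo_id, local_dir=str(local_dir))
```

Calling `download_required()` starts the downloads. The `modeling_cogvlm.py` replacement and the optional easyocr and Real-ESRGAN files still have to be handled separately.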

	The `ckpts` directory structure should be arranged as:

	```
	├── ckpts
	│   ├── CogVideoX1.5-5B
	│   │   ├── ...
	│   ├── cogvlm2-llama3-caption
	│   │   ├── ...
	│   ├── Vivid-VR
	│   │   ├── controlnet
	│   │   │   ├── config.json
	│   │   │   ├── diffusion_pytorch_model.safetensors
	│   │   ├── connectors.pt
	│   │   ├── control_feat_proj.pt
	│   │   ├── control_patch_embed.pt
	│   ├── easyocr
	│   │   ├── craft_mlt_25k.pth
	│   │   ├── english_g2.pth
	│   │   ├── zh_sim_g2.pth
	│   ├── RealESRGAN
	│   │   ├── RealESRGAN_x2plus.pth
	```
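
Before running inference, it can help to confirm that the layout above is in place. The sketch below is an illustration only: `missing_checkpoints` is a hypothetical helper, and it checks just the top-level entries named in the tree rather than the contents of each model folder.

```python
from pathlib import Path

# Paths relative to the checkpoint root, taken from the tree above.
REQUIRED = [
    "CogVideoX1.5-5B",
    "cogvlm2-llama3-caption",
    "Vivid-VR/controlnet",
    "Vivid-VR/connectors.pt",
    "Vivid-VR/control_feat_proj.pt",
    "Vivid-VR/control_patch_embed.pt",
]
# Only needed for the optional text fix.
TEXTFIX = [
    "easyocr/craft_mlt_25k.pth",
    "easyocr/english_g2.pth",
    "easyocr/zh_sim_g2.pth",
    "RealESRGAN/RealESRGAN_x2plus.pth",
]


def missing_checkpoints(ckpt_dir, textfix=False):
    """Return the expected entries that do not exist under `ckpt_dir`."""
    root = Path(ckpt_dir)
    wanted = REQUIRED + (TEXTFIX if textfix else [])
    return [rel for rel in wanted if not (root / rel).exists()]
```

For example, `missing_checkpoints("./ckpts", textfix=True)` returns an empty list once every required and optional file is present.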


	## ☕️ Quick Inference

	Run the following commands to try it out:

	```shell
	# --num_temporal_process_frames: for long videos; if a video has more frames than this value, aggregate sampling is enabled along the temporal dimension
	# --upscale: optional; if set to 0, the short side of the output videos will be 1024
	# --textfix: optional; if given, text regions are replaced by the output of Real-ESRGAN
	# --save_images: optional; if given, the video frames are also saved
	python VRDiT/inference.py \
	    --ckpt_dir=./ckpts \
	    --cogvideox_ckpt_path=./ckpts/CogVideoX1.5-5B \
	    --cogvlm2_ckpt_path=./ckpts/cogvlm2-llama3-caption \
	    --input_dir=/dir/to/input/videos \
	    --output_dir=/dir/to/output/videos \
	    --num_temporal_process_frames=121 \
	    --upscale=0 \
	    --textfix \
	    --save_images
	```
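
The comment on `--num_temporal_process_frames` says that longer videos are handled by aggregate sampling along the temporal dimension. One generic way to picture this is to split the frames into overlapping windows and blend the overlaps afterwards. The sketch below only computes such windows; it is not taken from `VRDiT/inference.py`, and both the function name `temporal_windows` and the `overlap` parameter are hypothetical.

```python
def temporal_windows(num_frames, window=121, overlap=8):
    """Return (start, end) frame ranges that cover `num_frames` frames.

    Consecutive windows share `overlap` frames; the last window is shifted
    back so that it ends exactly at the final frame.
    """
    if not 0 <= overlap < window:
        raise ValueError("overlap must be in [0, window)")
    if num_frames <= window:
        return [(0, num_frames)]  # short clip: single pass, no aggregation

    stride = window - overlap
    starts = list(range(0, num_frames - window, stride))
    starts.append(num_frames - window)  # final window, aligned to the end
    return [(s, s + window) for s in starts]
```

With these assumed defaults, a 300-frame clip is covered by three windows of 121 frames each, while a clip of 121 frames or fewer is processed in one pass.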
	GPU memory usage:
	- A 121-frame video requires approximately 43GB of GPU memory.
	- To reduce GPU memory usage, replace `pipe.enable_model_cpu_offload` with `pipe.enable_sequential_cpu_offload` in [`./VRDiT/inference.py`](https://github.com/csbhr/Vivid-VR/blob/50421718473396922c27e460088a140a74887dfe/VRDiT/inference.py#L407). This reduces GPU memory usage to 25GB, at the cost of longer inference time.
	- For the [`--num_temporal_process_frames`](https://github.com/csbhr/Vivid-VR/blob/50421718473396922c27e460088a140a74887dfe/VRDiT/inference.py#L319) argument, smaller values require less GPU memory but increase inference time.
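
The offload switch described above could be wrapped in a small helper instead of being edited by hand. This is a sketch under the assumption that `pipe` is the diffusers pipeline object created in `VRDiT/inference.py`; `enable_offload` and its `low_memory` flag are hypothetical names, while the two `enable_*_cpu_offload` methods are the diffusers calls the list refers to.

```python
def enable_offload(pipe, low_memory=False):
    """Enable one of the two diffusers CPU-offload modes on `pipe`.

    Returns the name of the mode that was selected.
    """
    if low_memory:
        # Offloads at submodule granularity: lowest GPU memory, slowest.
        pipe.enable_sequential_cpu_offload()
        return "sequential"
    # Moves whole model components to the GPU only while they are in use.
    pipe.enable_model_cpu_offload()
    return "model"
```

Calling `enable_offload(pipe, low_memory=True)` corresponds to the 25GB setting above, and the default corresponds to the 43GB setting.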


	## 📧 Citation

	If you find our repo useful for your research, please consider citing it:

	```bibtex
	@article{bai2025vividvr,
	title={Vivid-VR: Distilling Concepts from Text-to-Video Diffusion Transformer for Photorealistic Video Restoration},
	author={Bai, Haoran and Chen, Xiaoxu and Yang, Canqian and He, Zongyao and Deng, Sibin and Chen, Ying},
	journal={arXiv preprint arXiv:2508.14483},
	year={2025},
	url={https://arxiv.org/abs/2508.14483}
	}
	```


	## 📄 License
	- This repo is built on [diffusers v0.31.0](https://github.com/huggingface/diffusers/tree/v0.31.0), which is distributed under the terms of the [Apache License 2.0](https://github.com/huggingface/diffusers/blob/main/LICENSE).
	- CogVideoX1.5-5B models are distributed under the terms of the [CogVideoX License](https://huggingface.co/zai-org/CogVideoX1.5-5B/blob/main/LICENSE).
	- cogvlm2-llama3-caption models are distributed under the terms of the [CogVLM2 License](https://modelscope.cn/models/ZhipuAI/cogvlm2-video-llama3-base/file/view/master?fileName=LICENSE&status=0) and [LLAMA3 License](https://modelscope.cn/models/ZhipuAI/cogvlm2-video-llama3-base/file/view/master?fileName=LLAMA3_LICENSE&status=0).
	- Real-ESRGAN models are distributed under the terms of the [BSD 3-Clause License](https://github.com/xinntao/Real-ESRGAN/blob/master/LICENSE).
	- easyocr models are distributed under the terms of the [JAIDED.AI Terms and Conditions](https://www.jaided.ai/terms/).