---
pipeline_tag: video-to-video
library_name: diffusers
license: apache-2.0
---

# Vivid-VR: Distilling Concepts from Text-to-Video Diffusion Transformer for Photorealistic Video Restoration

[πŸ“š Paper](https://huggingface.co/papers/2508.14483) | [🌐 Project Page](https://csbhr.github.io/projects/vivid-vr/) | [πŸ’» Code](https://github.com/csbhr/Vivid-VR)
For more quantitative and visual results, please check out our [project page](https://csbhr.github.io/projects/vivid-vr/).

---

## 🎬 Overview

![overall_structure](assets/framework.png)

## πŸ”§ Dependencies and Installation

1. Clone Repo

   ```bash
   git clone https://github.com/csbhr/Vivid-VR.git
   cd Vivid-VR
   ```

2. Create Conda Environment and Install Dependencies

   ```bash
   # create new conda env
   conda create -n Vivid-VR python=3.10
   conda activate Vivid-VR

   # install pytorch
   pip install torch==2.2.1 torchvision==0.17.1 torchaudio==2.2.1 --index-url https://download.pytorch.org/whl/cu121

   # install python dependencies
   pip install -r requirements.txt

   # install easyocr [optional, for text fix]
   pip install easyocr
   pip install numpy==1.26.4  # easyocr may pull in numpy 2.x, which causes conflicts
   ```

3. Download Models

   - [**Required**] Download the CogVideoX1.5-5B checkpoints from [[huggingface]](https://huggingface.co/zai-org/CogVideoX1.5-5B).
   - [**Required**] Download the cogvlm2-llama3-caption checkpoints from [[huggingface]](https://huggingface.co/zai-org/cogvlm2-llama3-caption).
     - Please replace its `modeling_cogvlm.py` with `./VRDiT/cogvlm2-llama3-caption/modeling_cogvlm.py` to remove the dependency on [pytorchvideo](https://github.com/facebookresearch/pytorchvideo).
   - [**Required**] Download the Vivid-VR checkpoints from [[huggingface]](https://huggingface.co/csbhr/Vivid-VR).
   - [**Optional, for text fix**] Download the easyocr checkpoints: [[english_g2]](https://github.com/JaidedAI/EasyOCR/releases/download/v1.3/english_g2.zip) [[zh_sim_g2]](https://github.com/JaidedAI/EasyOCR/releases/download/v1.3/zh_sim_g2.zip) [[craft_mlt_25k]](https://github.com/JaidedAI/EasyOCR/releases/download/pre-v1.1.6/craft_mlt_25k.zip).
   - [**Optional, for text fix**] Download the Real-ESRGAN checkpoint [[RealESRGAN_x2plus]](https://github.com/xinntao/Real-ESRGAN/releases/download/v0.2.1/RealESRGAN_x2plus.pth).
   - Put them under the `./ckpts` folder.
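The `modeling_cogvlm.py` replacement mentioned above can be done with a single copy. This is a sketch that assumes the caption model was downloaded to `./ckpts/cogvlm2-llama3-caption`; adjust the destination if you placed it elsewhere:

```shell
# Overwrite the downloaded modeling file with the repo's patched version
# (removes the pytorchvideo dependency).
cp ./VRDiT/cogvlm2-llama3-caption/modeling_cogvlm.py \
   ./ckpts/cogvlm2-llama3-caption/modeling_cogvlm.py
```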
The `ckpts` directory structure should be arranged as:

```
β”œβ”€β”€ ckpts
β”‚   β”œβ”€β”€ CogVideoX1.5-5B
β”‚   β”‚   β”œβ”€β”€ ...
β”‚   β”œβ”€β”€ cogvlm2-llama3-caption
β”‚   β”‚   β”œβ”€β”€ ...
β”‚   β”œβ”€β”€ Vivid-VR
β”‚   β”‚   β”œβ”€β”€ controlnet
β”‚   β”‚   β”‚   β”œβ”€β”€ config.json
β”‚   β”‚   β”‚   β”œβ”€β”€ diffusion_pytorch_model.safetensors
β”‚   β”‚   β”œβ”€β”€ connectors.pt
β”‚   β”‚   β”œβ”€β”€ control_feat_proj.pt
β”‚   β”‚   β”œβ”€β”€ control_patch_embed.pt
β”‚   β”œβ”€β”€ easyocr
β”‚   β”‚   β”œβ”€β”€ craft_mlt_25k.pth
β”‚   β”‚   β”œβ”€β”€ english_g2.pth
β”‚   β”‚   β”œβ”€β”€ zh_sim_g2.pth
β”‚   β”œβ”€β”€ RealESRGAN
β”‚   β”‚   β”œβ”€β”€ RealESRGAN_x2plus.pth
```

## β˜•οΈ Quick Inference

Run the following command to try it out:

```shell
python VRDiT/inference.py \
    --ckpt_dir=./ckpts \
    --cogvideox_ckpt_path=./ckpts/CogVideoX1.5-5B \
    --cogvlm2_ckpt_path=./ckpts/cogvlm2-llama3-caption \
    --input_dir=/dir/to/input/videos \
    --output_dir=/dir/to/output/videos \
    --num_temporal_process_frames=121 \
    --upscale=0 \
    --textfix \
    --save_images
```

Arguments:
- `--num_temporal_process_frames`: for long video inference — if the input video has more frames than this value, aggregate sampling is enabled along the temporal dimension.
- `--upscale`: optional; if set to 0, the short side of the output videos will be 1024.
- `--textfix`: optional; if given, text regions are replaced by the output of Real-ESRGAN.
- `--save_images`: optional; if given, the video frames are saved.

GPU memory usage:
- For a 121-frame video, inference requires approximately **43GB** of GPU memory.
- To reduce GPU memory usage, replace `pipe.enable_model_cpu_offload` with `pipe.enable_sequential_cpu_offload` in [`./VRDiT/inference.py`](https://github.com/csbhr/Vivid-VR/blob/50421718473396922c27e460088a140a74887dfe/VRDiT/inference.py#L407). This reduces memory usage to **25GB**, at the cost of longer inference time.
- For the argument [`--num_temporal_process_frames`](https://github.com/csbhr/Vivid-VR/blob/50421718473396922c27e460088a140a74887dfe/VRDiT/inference.py#L319), smaller values require less GPU memory but increase inference time.
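Before launching inference, it can help to verify that the checkpoints are laid out as expected. Below is a minimal, hedged sketch of such a check: the paths follow the `ckpts` directory tree above, and the exact placement of files inside `Vivid-VR/controlnet` is an assumption based on that listing.

```python
from pathlib import Path

# Required checkpoint paths, following the directory tree above.
REQUIRED = [
    "CogVideoX1.5-5B",
    "cogvlm2-llama3-caption",
    "Vivid-VR/controlnet/config.json",
    "Vivid-VR/controlnet/diffusion_pytorch_model.safetensors",
    "Vivid-VR/connectors.pt",
    "Vivid-VR/control_feat_proj.pt",
    "Vivid-VR/control_patch_embed.pt",
]

# Only needed when running with --textfix.
OPTIONAL = [
    "easyocr/craft_mlt_25k.pth",
    "easyocr/english_g2.pth",
    "easyocr/zh_sim_g2.pth",
    "RealESRGAN/RealESRGAN_x2plus.pth",
]

def missing_checkpoints(ckpt_root, paths=REQUIRED):
    """Return the entries in `paths` that do not exist under `ckpt_root`."""
    root = Path(ckpt_root)
    return [p for p in paths if not (root / p).exists()]

if __name__ == "__main__":
    missing = missing_checkpoints("./ckpts")
    if missing:
        print("Missing required checkpoints:")
        for p in missing:
            print(f"  ckpts/{p}")
    else:
        print("All required checkpoints found.")
```

Run it from the repo root; it reports which required files are absent so you can re-download only what is missing.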
## πŸ“§ Citation

If you find our repo useful for your research, please consider citing it:

```bibtex
@article{bai2025vividvr,
  title={Vivid-VR: Distilling Concepts from Text-to-Video Diffusion Transformer for Photorealistic Video Restoration},
  author={Bai, Haoran and Chen, Xiaoxu and Yang, Canqian and He, Zongyao and Deng, Sibin and Chen, Ying},
  journal={arXiv preprint arXiv:2508.14483},
  year={2025},
  url={https://arxiv.org/abs/2508.14483}
}
```

## πŸ“„ License

- This repo is built on [diffusers v0.31.0](https://github.com/huggingface/diffusers/tree/v0.31.0), which is distributed under the terms of the [Apache License 2.0](https://github.com/huggingface/diffusers/blob/main/LICENSE).
- The CogVideoX1.5-5B models are distributed under the terms of the [CogVideoX License](https://huggingface.co/zai-org/CogVideoX1.5-5B/blob/main/LICENSE).
- The cogvlm2-llama3-caption models are distributed under the terms of the [CogVLM2 License](https://modelscope.cn/models/ZhipuAI/cogvlm2-video-llama3-base/file/view/master?fileName=LICENSE&status=0) and the [LLAMA3 License](https://modelscope.cn/models/ZhipuAI/cogvlm2-video-llama3-base/file/view/master?fileName=LLAMA3_LICENSE&status=0).
- The Real-ESRGAN models are distributed under the terms of the [BSD 3-Clause License](https://github.com/xinntao/Real-ESRGAN/blob/master/LICENSE).
- The easyocr models are distributed under the terms of the [JAIDED.AI Terms and Conditions](https://www.jaided.ai/terms/).