---
pipeline_tag: video-to-video
library_name: diffusers
license: apache-2.0
---

# Vivid-VR: Distilling Concepts from Text-to-Video Diffusion Transformer for Photorealistic Video Restoration

[Paper](https://huggingface.co/papers/2508.14483) | [Project Page](https://csbhr.github.io/projects/vivid-vr/) | [Code](https://github.com/csbhr/Vivid-VR)

<div align="center">
<img style="width:100%" src="assets/teaser.png">
</div>

For more quantitative and visual results, see our [project page](https://csbhr.github.io/projects/vivid-vr/).

---

## Overview

## Dependencies and Installation
1. Clone Repo
    ```bash
    git clone https://github.com/csbhr/Vivid-VR.git
    cd Vivid-VR
    ```

2. Create Conda Environment and Install Dependencies
    ```bash
    # create a new conda env
    conda create -n Vivid-VR python=3.10
    conda activate Vivid-VR

    # install pytorch
    pip install torch==2.2.1 torchvision==0.17.1 torchaudio==2.2.1 --index-url https://download.pytorch.org/whl/cu121

    # install python dependencies
    pip install -r requirements.txt

    # install easyocr [Optional, for text fix]
    pip install easyocr
    pip install numpy==1.26.4  # installing easyocr may pull in NumPy 2.x, which causes conflicts
    ```

3. Download Models

    - [**Required**] Download CogVideoX1.5-5B checkpoints from [[huggingface]](https://huggingface.co/zai-org/CogVideoX1.5-5B).
    - [**Required**] Download cogvlm2-llama3-caption checkpoints from [[huggingface]](https://huggingface.co/zai-org/cogvlm2-llama3-caption).
        - Replace the downloaded `modeling_cogvlm.py` with `./VRDiT/cogvlm2-llama3-caption/modeling_cogvlm.py` to remove the dependency on [pytorchvideo](https://github.com/facebookresearch/pytorchvideo).
    - [**Required**] Download Vivid-VR checkpoints from [[huggingface]](https://huggingface.co/csbhr/Vivid-VR).
    - [**Optional, for text fix**] Download easyocr checkpoints [[english_g2]](https://github.com/JaidedAI/EasyOCR/releases/download/v1.3/english_g2.zip) [[zh_sim_g2]](https://github.com/JaidedAI/EasyOCR/releases/download/v1.3/zh_sim_g2.zip) [[craft_mlt_25k]](https://github.com/JaidedAI/EasyOCR/releases/download/pre-v1.1.6/craft_mlt_25k.zip).
    - [**Optional, for text fix**] Download Real-ESRGAN checkpoints [[RealESRGAN_x2plus]](https://github.com/xinntao/Real-ESRGAN/releases/download/v0.2.1/RealESRGAN_x2plus.pth).
    - Put all downloaded checkpoints under the `./ckpts` folder.

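The three required Hugging Face downloads could also be scripted. The following is a minimal sketch, not part of the repository: `HF_CHECKPOINTS`, `plan_downloads`, and `download_all` are illustrative names, and it assumes that `huggingface_hub` is installed and that each repository snapshot can be placed directly in the matching `ckpts` subfolder. The optional easyocr and Real-ESRGAN files, and the `modeling_cogvlm.py` replacement, are not covered.

```python
from pathlib import Path

# Repo IDs come from the links in the list above; the local folder names
# are an assumption based on the `ckpts` layout this README describes.
HF_CHECKPOINTS = {
    "zai-org/CogVideoX1.5-5B": "CogVideoX1.5-5B",
    "zai-org/cogvlm2-llama3-caption": "cogvlm2-llama3-caption",
    "csbhr/Vivid-VR": "Vivid-VR",
}


def plan_downloads(ckpt_root="./ckpts"):
    """Return (repo_id, local_dir) pairs without touching the network."""
    root = Path(ckpt_root)
    return [(repo_id, root / folder) for repo_id, folder in HF_CHECKPOINTS.items()]


def download_all(ckpt_root="./ckpts"):
    """Download every required checkpoint (hypothetical helper)."""
    # Imported lazily so that plan_downloads() works without huggingface_hub.
    from huggingface_hub import snapshot_download

    for repo_id, local_dir in plan_downloads(ckpt_root):
        snapshot_download(repo_id=repo_id, local_dir=str(local_dir))
```

Calling `download_all()` from the repository root would fetch all three snapshots; `plan_downloads()` can be used first to review the target folders.
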
The `ckpts` directory structure should be arranged as:

```
├── ckpts
│   ├── CogVideoX1.5-5B
│   │   ├── ...
│   ├── cogvlm2-llama3-caption
│   │   ├── ...
│   ├── Vivid-VR
│   │   ├── controlnet
│   │       ├── config.json
│   │       ├── diffusion_pytorch_model.safetensors
│   │   ├── connectors.pt
│   │   ├── control_feat_proj.pt
│   │   ├── control_patch_embed.pt
│   ├── easyocr
│   │   ├── craft_mlt_25k.pth
│   │   ├── english_g2.pth
│   │   ├── zh_sim_g2.pth
│   ├── RealESRGAN
│   │   ├── RealESRGAN_x2plus.pth
```

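Before running inference, it may help to confirm that the layout above is in place. The sketch below is an assumption-laden helper rather than something shipped with the repo: `REQUIRED`, `TEXTFIX`, and `missing_checkpoints` are hypothetical names, and the file lists are copied from the tree above (the contents of the two `...` folders are not checked).

```python
from pathlib import Path

# Paths relative to the ckpts folder, copied from the directory tree above.
REQUIRED = [
    "CogVideoX1.5-5B",
    "cogvlm2-llama3-caption",
    "Vivid-VR/controlnet/config.json",
    "Vivid-VR/controlnet/diffusion_pytorch_model.safetensors",
    "Vivid-VR/connectors.pt",
    "Vivid-VR/control_feat_proj.pt",
    "Vivid-VR/control_patch_embed.pt",
]
# Only needed when the optional text fix is used.
TEXTFIX = [
    "easyocr/craft_mlt_25k.pth",
    "easyocr/english_g2.pth",
    "easyocr/zh_sim_g2.pth",
    "RealESRGAN/RealESRGAN_x2plus.pth",
]


def missing_checkpoints(ckpt_root="./ckpts", textfix=False):
    """Return the expected paths that do not exist under ckpt_root."""
    root = Path(ckpt_root)
    wanted = REQUIRED + (TEXTFIX if textfix else [])
    return [rel for rel in wanted if not (root / rel).exists()]
```

An empty list from `missing_checkpoints("./ckpts", textfix=True)` suggests that every file named in the tree is present.
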
## Quick Inference

Run the following command to try it out:

```shell
# --num_temporal_process_frames: for long videos; if a video has more frames than this value, aggregate sampling is enabled along the temporal dimension
# --upscale: optional; if set to 0, the short side of the output videos will be 1024
# --textfix: optional; if given, text regions are replaced by the output of Real-ESRGAN
# --save_images: optional; if given, the video frames are also saved
python VRDiT/inference.py \
    --ckpt_dir=./ckpts \
    --cogvideox_ckpt_path=./ckpts/CogVideoX1.5-5B \
    --cogvlm2_ckpt_path=./ckpts/cogvlm2-llama3-caption \
    --input_dir=/dir/to/input/videos \
    --output_dir=/dir/to/output/videos \
    --num_temporal_process_frames=121 \
    --upscale=0 \
    --textfix \
    --save_images
```
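
When the script is launched from Python, for example to process several input folders in a row, the same command can be assembled as an argument list. This is a rough sketch: `build_inference_command` and its keyword names are hypothetical, while the flags themselves are the ones shown in the command above.

```python
import sys


def build_inference_command(input_dir, output_dir, ckpt_dir="./ckpts",
                            num_temporal_process_frames=121, upscale=0,
                            textfix=False, save_images=False):
    """Assemble the argument list for VRDiT/inference.py (illustrative helper)."""
    cmd = [
        sys.executable, "VRDiT/inference.py",
        f"--ckpt_dir={ckpt_dir}",
        f"--cogvideox_ckpt_path={ckpt_dir}/CogVideoX1.5-5B",
        f"--cogvlm2_ckpt_path={ckpt_dir}/cogvlm2-llama3-caption",
        f"--input_dir={input_dir}",
        f"--output_dir={output_dir}",
        f"--num_temporal_process_frames={num_temporal_process_frames}",
        f"--upscale={upscale}",
    ]
    # The two optional switches take no value, so they are appended only on request.
    if textfix:
        cmd.append("--textfix")
    if save_images:
        cmd.append("--save_images")
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root.
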
GPU memory usage:
- A 121-frame video requires approximately **43GB** of GPU memory.
- To reduce GPU memory usage, replace `pipe.enable_model_cpu_offload` with `pipe.enable_sequential_cpu_offload` in [`./VRDiT/inference.py`](https://github.com/csbhr/Vivid-VR/blob/50421718473396922c27e460088a140a74887dfe/VRDiT/inference.py#L407). This reduces GPU memory usage to **25GB**, but inference takes longer.
- For the [`--num_temporal_process_frames`](https://github.com/csbhr/Vivid-VR/blob/50421718473396922c27e460088a140a74887dfe/VRDiT/inference.py#L319) argument, smaller values require less GPU memory but increase inference time.

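The offload swap described above can be expressed as a small switch instead of a manual edit. The sketch below assumes a diffusers-style pipeline object named `pipe`; `apply_cpu_offload` and its `low_memory` flag are hypothetical and do not exist in `inference.py`, whereas `enable_model_cpu_offload` and `enable_sequential_cpu_offload` are the two pipeline methods named in the list.

```python
def apply_cpu_offload(pipe, low_memory=False):
    """Pick one of the two offload modes on a diffusers-style pipeline.

    low_memory=False: model-level offload (about 43GB for 121 frames, per this README).
    low_memory=True: sequential offload (about 25GB, but slower inference).
    """
    if low_memory:
        pipe.enable_sequential_cpu_offload()
        return "sequential"
    pipe.enable_model_cpu_offload()
    return "model"
```

Only one of the two modes should be enabled on a given pipeline, which is why the helper calls exactly one method.
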
## Citation

If you find our repo useful for your research, please consider citing it:

```bibtex
@article{bai2025vividvr,
  title={Vivid-VR: Distilling Concepts from Text-to-Video Diffusion Transformer for Photorealistic Video Restoration},
  author={Bai, Haoran and Chen, Xiaoxu and Yang, Canqian and He, Zongyao and Deng, Sibin and Chen, Ying},
  journal={arXiv preprint arXiv:2508.14483},
  year={2025},
  url={https://arxiv.org/abs/2508.14483}
}
```

## License
- This repo is built on [diffusers v0.31.0](https://github.com/huggingface/diffusers/tree/v0.31.0), which is distributed under the terms of the [Apache License 2.0](https://github.com/huggingface/diffusers/blob/main/LICENSE).
- CogVideoX1.5-5B models are distributed under the terms of the [CogVideoX License](https://huggingface.co/zai-org/CogVideoX1.5-5B/blob/main/LICENSE).
- cogvlm2-llama3-caption models are distributed under the terms of the [CogVLM2 License](https://modelscope.cn/models/ZhipuAI/cogvlm2-video-llama3-base/file/view/master?fileName=LICENSE&status=0) and [LLAMA3 License](https://modelscope.cn/models/ZhipuAI/cogvlm2-video-llama3-base/file/view/master?fileName=LLAMA3_LICENSE&status=0).
- Real-ESRGAN models are distributed under the terms of the [BSD 3-Clause License](https://github.com/xinntao/Real-ESRGAN/blob/master/LICENSE).
- easyocr models are distributed under the terms of the [JAIDED.AI Terms and Conditions](https://www.jaided.ai/terms/).