---
pipeline_tag: video-to-video
library_name: diffusers
license: apache-2.0
---

# Vivid-VR: Distilling Concepts from Text-to-Video Diffusion Transformer for Photorealistic Video Restoration

[📚 Paper](https://huggingface.co/papers/2508.14483) | [🌐 Project Page](https://csbhr.github.io/projects/vivid-vr/) | [💻 Code](https://github.com/csbhr/Vivid-VR)

<div align="center">
    <img style="width:100%" src="assets/teaser.png">
</div>

For more quantitative and visual results, please check out our [project page](https://csbhr.github.io/projects/vivid-vr/).

---

## 🎬 Overview
![overall_structure](assets/framework.png)

## 🔧 Dependencies and Installation
1. Clone Repo
    ```bash
    git clone https://github.com/csbhr/Vivid-VR.git
    cd Vivid-VR
    ```

2. Create Conda Environment and Install Dependencies
    ```bash
    # create new conda env
    conda create -n Vivid-VR python=3.10
    conda activate Vivid-VR

    # install pytorch
    pip install torch==2.2.1 torchvision==0.17.1 torchaudio==2.2.1 --index-url https://download.pytorch.org/whl/cu121

    # install python dependencies
    pip install -r requirements.txt

    # install easyocr [Optional, for text fix]
    pip install easyocr
    pip install numpy==1.26.4  # easyocr may pull in numpy 2.x, which conflicts with other dependencies
    ```

3. Download Models

   - [**Required**] Download CogVideoX1.5-5B checkpoints from [[huggingface]](https://huggingface.co/zai-org/CogVideoX1.5-5B).
   - [**Required**] Download cogvlm2-llama3-caption checkpoints from [[huggingface]](https://huggingface.co/zai-org/cogvlm2-llama3-caption).
       - Please replace `modeling_cogvlm.py` with `./VRDiT/cogvlm2-llama3-caption/modeling_cogvlm.py` to remove the dependency on [pytorchvideo](https://github.com/facebookresearch/pytorchvideo).
   - [**Required**] Download Vivid-VR checkpoints from [[huggingface]](https://huggingface.co/csbhr/Vivid-VR).
   - [**Optional, for text fix**] Download easyocr checkpoints [[english_g2]](https://github.com/JaidedAI/EasyOCR/releases/download/v1.3/english_g2.zip) [[zh_sim_g2]](https://github.com/JaidedAI/EasyOCR/releases/download/v1.3/zh_sim_g2.zip) [[craft_mlt_25k]](https://github.com/JaidedAI/EasyOCR/releases/download/pre-v1.1.6/craft_mlt_25k.zip).
   - [**Optional, for text fix**] Download Real-ESRGAN checkpoints [[RealESRGAN_x2plus]](https://github.com/xinntao/Real-ESRGAN/releases/download/v0.2.1/RealESRGAN_x2plus.pth).
   - Put them under the `./ckpts` folder.

   The `ckpts` directory structure should be arranged as:

    ```
    ├── ckpts
    │   ├── CogVideoX1.5-5B
    │   │   └── ...
    │   ├── cogvlm2-llama3-caption
    │   │   └── ...
    │   ├── Vivid-VR
    │   │   ├── controlnet
    │   │   │   ├── config.json
    │   │   │   └── diffusion_pytorch_model.safetensors
    │   │   ├── connectors.pt
    │   │   ├── control_feat_proj.pt
    │   │   └── control_patch_embed.pt
    │   ├── easyocr
    │   │   ├── craft_mlt_25k.pth
    │   │   ├── english_g2.pth
    │   │   └── zh_sim_g2.pth
    │   └── RealESRGAN
    │       └── RealESRGAN_x2plus.pth
    ```
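
Before running inference, it can help to confirm that the required files are actually in place. A minimal sanity-check sketch (this helper is not part of the repo; the file list mirrors the required entries of the tree above):

```python
from pathlib import Path

# Required checkpoint paths under ./ckpts, per the directory tree above.
REQUIRED = [
    "CogVideoX1.5-5B",
    "cogvlm2-llama3-caption",
    "Vivid-VR/controlnet/config.json",
    "Vivid-VR/controlnet/diffusion_pytorch_model.safetensors",
    "Vivid-VR/connectors.pt",
    "Vivid-VR/control_feat_proj.pt",
    "Vivid-VR/control_patch_embed.pt",
]

def missing_checkpoints(ckpt_dir: str = "./ckpts") -> list:
    """Return the required checkpoint paths that do not exist under ckpt_dir."""
    root = Path(ckpt_dir)
    return [p for p in REQUIRED if not (root / p).exists()]
```

If `missing_checkpoints()` returns a non-empty list, re-check the downloads before proceeding (the optional easyocr and Real-ESRGAN weights are omitted here on purpose).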


## β˜•οΈ Quick Inference

Run the following commands to try it out:

```shell
python VRDiT/inference.py \
    --ckpt_dir=./ckpts \
    --cogvideox_ckpt_path=./ckpts/CogVideoX1.5-5B \
    --cogvlm2_ckpt_path=./ckpts/cogvlm2-llama3-caption \
    --input_dir=/dir/to/input/videos \
    --output_dir=/dir/to/output/videos \
    --num_temporal_process_frames=121 \
    --upscale=0 \
    --textfix \
    --save_images
```
- `--num_temporal_process_frames`: for long video inference; if the video is longer than this, aggregate sampling is enabled along the temporal dimension.
- `--upscale`: optional; if set to 0, the short side of the output videos will be 1024.
- `--textfix`: optional; if given, text regions are replaced by the output of Real-ESRGAN.
- `--save_images`: optional; if given, the video frames are saved as images.
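Aggregate temporal sampling can be pictured as splitting a long video into overlapping windows of at most `num_temporal_process_frames` frames and restoring each window in turn. A minimal sketch of such windowing (the actual scheme in `VRDiT/inference.py` may differ, and the overlap size here is an assumption):

```python
def temporal_windows(num_frames: int, window: int = 121, overlap: int = 8):
    """Split num_frames into (start, end) windows of at most `window` frames.

    Consecutive windows overlap by `overlap` frames so chunk boundaries can
    be blended; the final window is flushed to the end of the video.
    """
    if num_frames <= window:
        return [(0, num_frames)]
    stride = window - overlap
    windows = []
    start = 0
    while start + window < num_frames:
        windows.append((start, start + window))
        start += stride
    windows.append((num_frames - window, num_frames))
    return windows
```

For example, a 300-frame video with the default 121-frame window yields three overlapping windows, while anything up to 121 frames is processed in one pass.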
GPU memory usage:
- For a 121-frame video, inference requires approximately **43GB** of GPU memory.
- To reduce GPU memory usage, replace `pipe.enable_model_cpu_offload` with `pipe.enable_sequential_cpu_offload` in [`./VRDiT/inference.py`](https://github.com/csbhr/Vivid-VR/blob/50421718473396922c27e460088a140a74887dfe/VRDiT/inference.py#L407). This reduces GPU memory usage to **25GB**, at the cost of longer inference time.
- Smaller values of [`--num_temporal_process_frames`](https://github.com/csbhr/Vivid-VR/blob/50421718473396922c27e460088a140a74887dfe/VRDiT/inference.py#L319) require less GPU memory but increase inference time.
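
The offload swap above comes down to a single call on the diffusers pipeline. A hedged sketch (the helper name `configure_offload` is ours, not the repo's; both offload methods are standard diffusers pipeline APIs):

```python
def configure_offload(pipe, low_memory: bool = False):
    """Pick a CPU-offload strategy for a diffusers pipeline.

    enable_model_cpu_offload keeps whole sub-models on the GPU while in use
    (~43GB here, faster); enable_sequential_cpu_offload streams weights to
    the GPU layer by layer (~25GB here, slower).
    """
    if low_memory:
        pipe.enable_sequential_cpu_offload()
    else:
        pipe.enable_model_cpu_offload()
    return pipe
```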


## 📧 Citation

   If you find our repo useful for your research, please consider citing it:

   ```bibtex
   @article{bai2025vividvr,
      title={Vivid-VR: Distilling Concepts from Text-to-Video Diffusion Transformer for Photorealistic Video Restoration}, 
      author={Bai, Haoran and Chen, Xiaoxu and Yang, Canqian and He, Zongyao and Deng, Sibin and Chen, Ying},
      journal={arXiv preprint arXiv:2508.14483},
      year={2025},
      url={https://arxiv.org/abs/2508.14483}
    }
   ```


## 📄 License
- This repo is built on [diffusers v0.31.0](https://github.com/huggingface/diffusers/tree/v0.31.0), which is distributed under the terms of the [Apache License 2.0](https://github.com/huggingface/diffusers/blob/main/LICENSE).
- CogVideoX1.5-5B models are distributed under the terms of the [CogVideoX License](https://huggingface.co/zai-org/CogVideoX1.5-5B/blob/main/LICENSE).
- cogvlm2-llama3-caption models are distributed under the terms of the [CogVLM2 License](https://modelscope.cn/models/ZhipuAI/cogvlm2-video-llama3-base/file/view/master?fileName=LICENSE&status=0) and [LLAMA3 License](https://modelscope.cn/models/ZhipuAI/cogvlm2-video-llama3-base/file/view/master?fileName=LLAMA3_LICENSE&status=0).
- Real-ESRGAN models are distributed under the terms of the [BSD 3-Clause License](https://github.com/xinntao/Real-ESRGAN/blob/master/LICENSE).
- easyocr models are distributed under the terms of the [JAIDED.AI Terms and Conditions](https://www.jaided.ai/terms/).