---
pipeline_tag: video-to-video
library_name: diffusers
license: apache-2.0
---

# Vivid-VR: Distilling Concepts from Text-to-Video Diffusion Transformer for Photorealistic Video Restoration

[📚 Paper](https://huggingface.co/papers/2508.14483) | [🌐 Project Page](https://csbhr.github.io/projects/vivid-vr/) | [💻 Code](https://github.com/csbhr/Vivid-VR)

<div align="center">
    <img style="width:100%" src="assets/teaser.png">
</div>

For more quantitative and visual results, please check out our [project page](https://csbhr.github.io/projects/vivid-vr/).

---

## 🎬 Overview
![overall_structure](assets/framework.png)

## 🔧 Dependencies and Installation
1. Clone Repo
    ```bash
    git clone https://github.com/csbhr/Vivid-VR.git
    cd Vivid-VR
    ```

2. Create Conda Environment and Install Dependencies
    ```bash
    # create new conda env
    conda create -n Vivid-VR python=3.10
    conda activate Vivid-VR

    # install pytorch
    pip install torch==2.2.1 torchvision==0.17.1 torchaudio==2.2.1 --index-url https://download.pytorch.org/whl/cu121

    # install python dependencies
    pip install -r requirements.txt

    # install easyocr [Optional, for text fix]
    pip install easyocr
    pip install numpy==1.26.4  # easyocr may pull in numpy 2.x, which conflicts with other dependencies
    ```

3. Download Models

   - [**Required**] Download CogVideoX1.5-5B checkpoints from [[huggingface]](https://huggingface.co/zai-org/CogVideoX1.5-5B).
   - [**Required**] Download cogvlm2-llama3-caption checkpoints from [[huggingface]](https://huggingface.co/zai-org/cogvlm2-llama3-caption).
       - Please replace `modeling_cogvlm.py` with `./VRDiT/cogvlm2-llama3-caption/modeling_cogvlm.py` to remove the dependency on [pytorchvideo](https://github.com/facebookresearch/pytorchvideo).
   - [**Required**] Download Vivid-VR checkpoints from [[huggingface]](https://huggingface.co/csbhr/Vivid-VR).
   - [**Optional, for text fix**] Download easyocr checkpoints [[english_g2]](https://github.com/JaidedAI/EasyOCR/releases/download/v1.3/english_g2.zip) [[zh_sim_g2]](https://github.com/JaidedAI/EasyOCR/releases/download/v1.3/zh_sim_g2.zip) [[craft_mlt_25k]](https://github.com/JaidedAI/EasyOCR/releases/download/pre-v1.1.6/craft_mlt_25k.zip).
   - [**Optional, for text fix**] Download Real-ESRGAN checkpoints [[RealESRGAN_x2plus]](https://github.com/xinntao/Real-ESRGAN/releases/download/v0.2.1/RealESRGAN_x2plus.pth).
   - Put them under the `./ckpts` folder.

   The `ckpts` directory structure should be arranged as:

    ```
    ├── ckpts
    │   ├── CogVideoX1.5-5B
    │   │   └── ...
    │   ├── cogvlm2-llama3-caption
    │   │   └── ...
    │   ├── Vivid-VR
    │   │   ├── controlnet
    │   │   │   ├── config.json
    │   │   │   └── diffusion_pytorch_model.safetensors
    │   │   ├── connectors.pt
    │   │   ├── control_feat_proj.pt
    │   │   └── control_patch_embed.pt
    │   ├── easyocr
    │   │   ├── craft_mlt_25k.pth
    │   │   ├── english_g2.pth
    │   │   └── zh_sim_g2.pth
    │   └── RealESRGAN
    │       └── RealESRGAN_x2plus.pth
    ```
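
Before running inference, it can help to confirm that the required files are actually in place. A minimal sanity-check sketch (this helper is not part of the repo; the file list mirrors the required entries of the tree above):

```python
from pathlib import Path

# Required checkpoint paths under ./ckpts, per the directory tree above.
REQUIRED = [
    "CogVideoX1.5-5B",
    "cogvlm2-llama3-caption",
    "Vivid-VR/controlnet/config.json",
    "Vivid-VR/controlnet/diffusion_pytorch_model.safetensors",
    "Vivid-VR/connectors.pt",
    "Vivid-VR/control_feat_proj.pt",
    "Vivid-VR/control_patch_embed.pt",
]

def missing_checkpoints(ckpt_dir: str = "./ckpts") -> list:
    """Return the required checkpoint paths that do not exist under ckpt_dir."""
    root = Path(ckpt_dir)
    return [p for p in REQUIRED if not (root / p).exists()]
```

If `missing_checkpoints()` returns a non-empty list, re-check the downloads before proceeding (the optional easyocr and Real-ESRGAN weights are omitted here on purpose).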


## β˜•οΈ Quick Inference

Run the following commands to try it out:

```shell
python VRDiT/inference.py \
    --ckpt_dir=./ckpts \
    --cogvideox_ckpt_path=./ckpts/CogVideoX1.5-5B \
    --cogvlm2_ckpt_path=./ckpts/cogvlm2-llama3-caption \
    --input_dir=/dir/to/input/videos \
    --output_dir=/dir/to/output/videos \
    --num_temporal_process_frames=121 \
    --upscale=0 \
    --textfix \
    --save_images
```
- `--num_temporal_process_frames`: for long video inference; if the video is longer than this, aggregate sampling is enabled along the temporal dimension.
- `--upscale`: optional; if set to 0, the short side of the output videos will be 1024.
- `--textfix`: optional; if given, text regions are replaced by the output of Real-ESRGAN.
- `--save_images`: optional; if given, the video frames are saved as images.
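Aggregate temporal sampling can be pictured as splitting a long video into overlapping windows of at most `num_temporal_process_frames` frames and restoring each window in turn. A minimal sketch of such windowing (the actual scheme in `VRDiT/inference.py` may differ, and the overlap size here is an assumption):

```python
def temporal_windows(num_frames: int, window: int = 121, overlap: int = 8):
    """Split num_frames into (start, end) windows of at most `window` frames.

    Consecutive windows overlap by `overlap` frames so chunk boundaries can
    be blended; the final window is flushed to the end of the video.
    """
    if num_frames <= window:
        return [(0, num_frames)]
    stride = window - overlap
    windows = []
    start = 0
    while start + window < num_frames:
        windows.append((start, start + window))
        start += stride
    windows.append((num_frames - window, num_frames))
    return windows
```

For example, a 300-frame video with the default 121-frame window yields three overlapping windows, while anything up to 121 frames is processed in one pass.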
GPU memory usage:
- For a 121-frame video, inference requires approximately **43GB** of GPU memory.
- To reduce GPU memory usage, replace `pipe.enable_model_cpu_offload` with `pipe.enable_sequential_cpu_offload` in [`./VRDiT/inference.py`](https://github.com/csbhr/Vivid-VR/blob/50421718473396922c27e460088a140a74887dfe/VRDiT/inference.py#L407). This reduces GPU memory usage to **25GB**, at the cost of longer inference time.
- Smaller values of [`--num_temporal_process_frames`](https://github.com/csbhr/Vivid-VR/blob/50421718473396922c27e460088a140a74887dfe/VRDiT/inference.py#L319) require less GPU memory but increase inference time.
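
The offload swap above comes down to a single call on the diffusers pipeline. A hedged sketch (the helper name `configure_offload` is ours, not the repo's; both offload methods are standard diffusers pipeline APIs):

```python
def configure_offload(pipe, low_memory: bool = False):
    """Pick a CPU-offload strategy for a diffusers pipeline.

    enable_model_cpu_offload keeps whole sub-models on the GPU while in use
    (~43GB here, faster); enable_sequential_cpu_offload streams weights to
    the GPU layer by layer (~25GB here, slower).
    """
    if low_memory:
        pipe.enable_sequential_cpu_offload()
    else:
        pipe.enable_model_cpu_offload()
    return pipe
```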


## 📧 Citation

   If you find our repo useful for your research, please consider citing it:

   ```bibtex
   @article{bai2025vividvr,
      title={Vivid-VR: Distilling Concepts from Text-to-Video Diffusion Transformer for Photorealistic Video Restoration}, 
      author={Bai, Haoran and Chen, Xiaoxu and Yang, Canqian and He, Zongyao and Deng, Sibin and Chen, Ying},
      journal={arXiv preprint arXiv:2508.14483},
      year={2025},
      url={https://arxiv.org/abs/2508.14483}
    }
   ```


## 📄 License
- This repo is built on [diffusers v0.31.0](https://github.com/huggingface/diffusers/tree/v0.31.0), which is distributed under the terms of the [Apache License 2.0](https://github.com/huggingface/diffusers/blob/main/LICENSE).
- CogVideoX1.5-5B models are distributed under the terms of the [CogVideoX License](https://huggingface.co/zai-org/CogVideoX1.5-5B/blob/main/LICENSE).
- cogvlm2-llama3-caption models are distributed under the terms of the [CogVLM2 License](https://modelscope.cn/models/ZhipuAI/cogvlm2-video-llama3-base/file/view/master?fileName=LICENSE&status=0) and [LLAMA3 License](https://modelscope.cn/models/ZhipuAI/cogvlm2-video-llama3-base/file/view/master?fileName=LLAMA3_LICENSE&status=0).
- Real-ESRGAN models are distributed under the terms of the [BSD 3-Clause License](https://github.com/xinntao/Real-ESRGAN/blob/master/LICENSE).
- easyocr models are distributed under the terms of the [JAIDED.AI Terms and Conditions](https://www.jaided.ai/terms/).