---
pipeline_tag: video-to-video
library_name: diffusers
license: apache-2.0
---
# Vivid-VR: Distilling Concepts from Text-to-Video Diffusion Transformer for Photorealistic Video Restoration
[Paper](https://huggingface.co/papers/2508.14483) | [Project Page](https://csbhr.github.io/projects/vivid-vr/) | [Code](https://github.com/csbhr/Vivid-VR)
For more quantitative and visual results, see our [project page](https://csbhr.github.io/projects/vivid-vr/).
---
## Overview

## Dependencies and Installation
1. Clone Repo
```bash
git clone https://github.com/csbhr/Vivid-VR.git
cd Vivid-VR
```
2. Create Conda Environment and Install Dependencies
```bash
# create new conda env
conda create -n Vivid-VR python=3.10
conda activate Vivid-VR
# install pytorch
pip install torch==2.2.1 torchvision==0.17.1 torchaudio==2.2.1 --index-url https://download.pytorch.org/whl/cu121
# install python dependencies
pip install -r requirements.txt
# install easyocr [Optional, for text fix]
pip install easyocr
pip install numpy==1.26.4 # numpy 2.x may be pulled in when installing easyocr, which causes conflicts
```
3. Download Models
- [**Required**] Download CogVideoX1.5-5B checkpoints from [[huggingface]](https://huggingface.co/zai-org/CogVideoX1.5-5B).
- [**Required**] Download cogvlm2-llama3-caption checkpoints from [[huggingface]](https://huggingface.co/zai-org/cogvlm2-llama3-caption).
  - In the downloaded cogvlm2-llama3-caption checkpoint, replace `modeling_cogvlm.py` with `./VRDiT/cogvlm2-llama3-caption/modeling_cogvlm.py` to remove the dependency on [pytorchvideo](https://github.com/facebookresearch/pytorchvideo).
- [**Required**] Download Vivid-VR checkpoints from [[huggingface]](https://huggingface.co/csbhr/Vivid-VR).
- [**Optional, for text fix**] Download easyocr checkpoints [[english_g2]](https://github.com/JaidedAI/EasyOCR/releases/download/v1.3/english_g2.zip) [[zh_sim_g2]](https://github.com/JaidedAI/EasyOCR/releases/download/v1.3/zh_sim_g2.zip) [[craft_mlt_25k]](https://github.com/JaidedAI/EasyOCR/releases/download/pre-v1.1.6/craft_mlt_25k.zip).
- [**Optional, for text fix**] Download Real-ESRGAN checkpoints [[RealESRGAN_x2plus]](https://github.com/xinntao/Real-ESRGAN/releases/download/v0.2.1/RealESRGAN_x2plus.pth).
- Put them under the `./ckpts` folder.
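
The three required downloads can also be scripted. The following is a minimal sketch, assuming the `huggingface_hub` package is installed; the repo IDs come from the links above, while `download_plan` and `download_required` are hypothetical helper names, not part of this repo. It also assumes that a plain snapshot of each repo matches the expected `ckpts` layout, which should be checked after downloading.

```python
from pathlib import Path

# Hugging Face repo IDs are taken from the links above; the local folder
# names follow the `ckpts` layout this README describes.
REQUIRED_REPOS = {
    "zai-org/CogVideoX1.5-5B": "CogVideoX1.5-5B",
    "zai-org/cogvlm2-llama3-caption": "cogvlm2-llama3-caption",
    "csbhr/Vivid-VR": "Vivid-VR",
}


def download_plan(ckpt_dir="./ckpts"):
    """Map each required repo ID to the local directory it should land in."""
    root = Path(ckpt_dir)
    return {repo_id: root / name for repo_id, name in REQUIRED_REPOS.items()}


def download_required(ckpt_dir="./ckpts"):
    """Download the three required checkpoints (needs network access)."""
    # Imported lazily so the plan can be inspected without huggingface_hub.
    from huggingface_hub import snapshot_download

    for repo_id, local_dir in download_plan(ckpt_dir).items():
        snapshot_download(repo_id=repo_id, local_dir=str(local_dir))
```

Calling `download_required()` performs the downloads. The `modeling_cogvlm.py` replacement described above and the optional easyocr and Real-ESRGAN files still have to be handled separately.
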
The `ckpts` directory structure should be arranged as:
```
├── ckpts
│   ├── CogVideoX1.5-5B
│   │   ├── ...
│   ├── cogvlm2-llama3-caption
│   │   ├── ...
│   ├── Vivid-VR
│   │   ├── controlnet
│   │   │   ├── config.json
│   │   │   ├── diffusion_pytorch_model.safetensors
│   │   ├── connectors.pt
│   │   ├── control_feat_proj.pt
│   │   ├── control_patch_embed.pt
│   ├── easyocr
│   │   ├── craft_mlt_25k.pth
│   │   ├── english_g2.pth
│   │   ├── zh_sim_g2.pth
│   ├── RealESRGAN
│   │   ├── RealESRGAN_x2plus.pth
```
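
Before running inference, it can help to confirm that the required entries are in place. The sketch below is one way to do that; `missing_checkpoints` is a hypothetical helper, and it assumes `config.json` and `diffusion_pytorch_model.safetensors` sit inside `Vivid-VR/controlnet`. The optional easyocr and RealESRGAN files are not checked.

```python
from pathlib import Path

# Required entries relative to the checkpoint root. Placing config.json and
# the safetensors file inside `controlnet` is an assumption of this sketch.
REQUIRED_PATHS = [
    "CogVideoX1.5-5B",
    "cogvlm2-llama3-caption",
    "Vivid-VR/controlnet/config.json",
    "Vivid-VR/controlnet/diffusion_pytorch_model.safetensors",
    "Vivid-VR/connectors.pt",
    "Vivid-VR/control_feat_proj.pt",
    "Vivid-VR/control_patch_embed.pt",
]


def missing_checkpoints(ckpt_dir="./ckpts"):
    """Return the required paths that do not exist under ckpt_dir."""
    root = Path(ckpt_dir)
    return [p for p in REQUIRED_PATHS if not (root / p).exists()]
```

An empty list from `missing_checkpoints("./ckpts")` means every required entry was found; otherwise the returned paths show what still needs to be downloaded or moved.
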
## Quick Inference
Run the following commands to try it out:
```shell
# Notes on the flags below:
#   --num_temporal_process_frames: for long video inference; if a video is longer than this value, aggregate sampling is enabled in the temporal dimension
#   --upscale: optional; if set to 0, the short side of the output videos will be 1024
#   --textfix: optional; if given, text regions are replaced by the output of Real-ESRGAN
#   --save_images: optional; if given, the video frames are also saved
python VRDiT/inference.py \
    --ckpt_dir=./ckpts \
    --cogvideox_ckpt_path=./ckpts/CogVideoX1.5-5B \
    --cogvlm2_ckpt_path=./ckpts/cogvlm2-llama3-caption \
    --input_dir=/dir/to/input/videos \
    --output_dir=/dir/to/output/videos \
    --num_temporal_process_frames=121 \
    --upscale=0 \
    --textfix \
    --save_images
```
GPU memory usage:
- For a 121-frame video, it requires approximately **43GB** GPU memory.
- To reduce GPU memory usage, replace `pipe.enable_model_cpu_offload` with `pipe.enable_sequential_cpu_offload` in [`./VRDiT/inference.py`](https://github.com/csbhr/Vivid-VR/blob/50421718473396922c27e460088a140a74887dfe/VRDiT/inference.py#L407). GPU memory usage drops to **25GB**, but inference takes longer.
- For the arg [`--num_temporal_process_frames`](https://github.com/csbhr/Vivid-VR/blob/50421718473396922c27e460088a140a74887dfe/VRDiT/inference.py#L319), smaller values require less GPU memory but increase inference time.
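
The trade-off between the two offload modes can be expressed as a small selection rule. This is a sketch only: `choose_offload_method` is a hypothetical helper, and its thresholds are the approximate figures quoted above for a 121-frame video, so they may not hold for other inputs or settings.

```python
# Approximate figures quoted above for a 121-frame video.
MODEL_OFFLOAD_GB = 43
SEQUENTIAL_OFFLOAD_GB = 25


def choose_offload_method(free_gpu_gb):
    """Pick the diffusers offload method name that fits the free GPU memory."""
    if free_gpu_gb >= MODEL_OFFLOAD_GB:
        # Faster option; needs roughly 43GB.
        return "enable_model_cpu_offload"
    if free_gpu_gb >= SEQUENTIAL_OFFLOAD_GB:
        # Slower option; needs roughly 25GB.
        return "enable_sequential_cpu_offload"
    raise ValueError(
        "Less than 25GB free; consider a smaller --num_temporal_process_frames."
    )
```

With a pipeline object `pipe`, the result could be applied as `getattr(pipe, choose_offload_method(free_gb))()`. The free memory in bytes is available from `torch.cuda.mem_get_info()`, which returns a `(free, total)` tuple.
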
## Citation
If you find our repo useful for your research, please consider citing it:
```bibtex
@article{bai2025vividvr,
title={Vivid-VR: Distilling Concepts from Text-to-Video Diffusion Transformer for Photorealistic Video Restoration},
author={Bai, Haoran and Chen, Xiaoxu and Yang, Canqian and He, Zongyao and Deng, Sibin and Chen, Ying},
journal={arXiv preprint arXiv:2508.14483},
year={2025},
url={https://arxiv.org/abs/2508.14483}
}
```
## License
- This repo is built on [diffusers v0.31.0](https://github.com/huggingface/diffusers/tree/v0.31.0), which is distributed under the terms of the [Apache License 2.0](https://github.com/huggingface/diffusers/blob/main/LICENSE).
- CogVideoX1.5-5B models are distributed under the terms of the [CogVideoX License](https://huggingface.co/zai-org/CogVideoX1.5-5B/blob/main/LICENSE).
- cogvlm2-llama3-caption models are distributed under the terms of the [CogVLM2 License](https://modelscope.cn/models/ZhipuAI/cogvlm2-video-llama3-base/file/view/master?fileName=LICENSE&status=0) and [LLAMA3 License](https://modelscope.cn/models/ZhipuAI/cogvlm2-video-llama3-base/file/view/master?fileName=LLAMA3_LICENSE&status=0).
- Real-ESRGAN models are distributed under the terms of the [BSD 3-Clause License](https://github.com/xinntao/Real-ESRGAN/blob/master/LICENSE).
- easyocr models are distributed under the terms of the [JAIDED.AI Terms and Conditions](https://www.jaided.ai/terms/).