---
license: apache-2.0
pipeline_tag: image-to-video
---

# ReImagine: Rethinking Controllable High-Quality Human Video Generation via Image-First Synthesis

[**Project Page**](https://keruzheng.github.io/ReImagine-Project/) | [**Paper (arXiv)**](https://arxiv.org/abs/2604.19720) | [**Code**](https://github.com/Taited/ReImagine) | [**Demo**](https://taited-reimagine.hf.space/)

**ReImagine** is a framework for controllable, high-quality human video generation. It approaches the problem image-first: high-quality human appearance is learned through image generation and then used as a prior for video synthesis, which decouples appearance modeling from temporal consistency.

The system uses a pose- and viewpoint-controllable pipeline that combines a pretrained image backbone with SMPL-X-based motion guidance, followed by a training-free temporal refinement stage based on a pretrained video diffusion model.

## Getting Started

### Installation

```bash
conda create -n reimagine python=3.10
conda activate reimagine
pip install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 --index-url https://download.pytorch.org/whl/cu124
pip install -e .
```
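
After installation, a quick check can confirm that the environment picked up the intended PyTorch build. The snippet below is a minimal sketch and not part of the repository; the helper name `check_environment` is hypothetical. It only reports what it finds and does not fail if `torch` is missing.

```python
import importlib.util
import sys


def check_environment():
    """Report the Python version, the torch version, and CUDA visibility.

    Hypothetical helper, not shipped with ReImagine. Values stay None
    when torch is not importable in the active environment.
    """
    report = {
        "python": "%d.%d" % sys.version_info[:2],
        "torch": None,
        "cuda": None,
    }
    if importlib.util.find_spec("torch") is None:
        return report

    import torch  # imported lazily so the check also runs without torch

    report["torch"] = torch.__version__
    report["cuda"] = torch.cuda.is_available()
    return report


if __name__ == "__main__":
    print(check_environment())
```

With the install commands above, the report would be expected to show Python 3.10 and a torch version starting with `2.4.1`; a `cuda` value of `False` usually points to a driver or wheel mismatch.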

### Pretrained Weights

ReImagine builds on pretrained base models plus its own LoRA weights. You can download all of them with the Hugging Face CLI:

```bash
# Download the base FLUX.1 Kontext model
hf download black-forest-labs/FLUX.1-Kontext-dev \
  --local-dir ./models/FLUX.1-Kontext-dev \
  --exclude "flux1-kontext-dev.safetensors" \
  --exclude "vae/**"

# Download the surface-normals ControlNet
hf download jasperai/Flux.1-dev-Controlnet-Surface-Normals \
  --local-dir ./models/Flux.1-dev-Controlnet-Surface-Normals

# Download the ReImagine LoRA weights
hf download taited/ReImagine-Pretrained --local-dir ./models/ReImagine-Pretrained
```
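
Before running inference, it can help to confirm that the three downloads landed where the commands above put them. The following is a minimal sketch, assuming the `./models` layout used above; the helper `missing_model_dirs` is hypothetical and only checks that each directory exists and is non-empty, not that the files inside are complete.

```python
from pathlib import Path

# Directory names taken from the --local-dir arguments above.
EXPECTED_MODEL_DIRS = (
    "FLUX.1-Kontext-dev",
    "Flux.1-dev-Controlnet-Surface-Normals",
    "ReImagine-Pretrained",
)


def missing_model_dirs(root="./models"):
    """Return the expected model directories that are absent or empty."""
    root = Path(root)
    missing = []
    for name in EXPECTED_MODEL_DIRS:
        path = root / name
        # A directory that exists but holds no entries counts as missing.
        if not path.is_dir() or not any(path.iterdir()):
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = missing_model_dirs()
    print("missing:", missing if missing else "none")
```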

## Inference

To run image-first synthesis, use the provided inference script:

```bash
python inference_img.py
```

The script expects a wide reference image containing the front and back views of the subject, plus a normal map generated from SMPL-X. For video synthesis, the training-free temporal refinement stage then enforces consistency across frames.
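
If the front and back views are stored as separate files, they can be combined into one wide image before running the script. The sketch below is an illustration only: it assumes that "wide" means the two views placed side by side (front on the left, back on the right) at a common height, which this card does not confirm, so check the repository for the exact layout and resolution. The function names are hypothetical, and the image step needs Pillow.

```python
def side_by_side_layout(front_size, back_size):
    """Compute the canvas size and the resized size of each view.

    Both views are scaled to the smaller of the two heights while
    keeping their aspect ratios. Sizes are (width, height) tuples.
    """
    height = min(front_size[1], back_size[1])

    def scaled(size):
        width, original_height = size
        return (max(1, round(width * height / original_height)), height)

    front, back = scaled(front_size), scaled(back_size)
    canvas = (front[0] + back[0], height)
    return canvas, front, back


def make_wide_reference(front_path, back_path, out_path):
    """Paste the front view left and the back view right (assumed layout)."""
    from PIL import Image  # Pillow, imported lazily

    front = Image.open(front_path).convert("RGB")
    back = Image.open(back_path).convert("RGB")
    canvas_size, front_size, back_size = side_by_side_layout(front.size, back.size)

    canvas = Image.new("RGB", canvas_size, "white")
    canvas.paste(front.resize(front_size), (0, 0))
    canvas.paste(back.resize(back_size), (front_size[0], 0))
    canvas.save(out_path)
    return canvas_size
```

A call such as `make_wide_reference("front.png", "back.png", "reference_wide.png")` would write the combined image; the file names here are placeholders.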

## Citation

If you find this project useful, please consider citing the paper:

```bibtex
@article{sun2025rethinking,
  title={ReImagine: Rethinking Controllable High-Quality Human Video Generation via Image-First Synthesis},
  author={Sun, Zhengwentai and Zheng, Keru and Li, Chenghong and Liao, Hongjie and Yang, Xihe and Li, Heyuan and Zhi, Yihao and Ning, Shuliang and Cui, Shuguang and Han, Xiaoguang},
  journal={arXiv preprint arXiv:2604.19720},
  year={2026},
  url={https://arxiv.org/abs/2604.19720v1}
}
```