Add model card for ReImagine
#1
by nielsr HF Staff - opened
README.md
ADDED
|
@@ -0,0 +1,65 @@
| 1 |
+
---
|
| 2 |
+
license: apache-2.0
|
| 3 |
+
pipeline_tag: image-to-video
|
| 4 |
+
---
|
| 5 |
+
|
| 6 |
+
# ReImagine: Rethinking Controllable High-Quality Human Video Generation via Image-First Synthesis
|
| 7 |
+
|
| 8 |
+
[**Project Page**](https://keruzheng.github.io/ReImagine-Project/) | [**Paper (arXiv)**](https://arxiv.org/abs/2604.19720) | [**Code**](https://github.com/Taited/ReImagine) | [**Demo**](https://taited-reimagine.hf.space/)
|
| 9 |
+
|
| 10 |
+
**ReImagine** is a framework for controllable high-quality human video generation. It revisits the problem from an image-first perspective, where high-quality human appearance is learned via image generation and used as a prior for video synthesis. This approach decouples appearance modeling from temporal consistency.
|
| 11 |
+
|
| 12 |
+
The system uses a pose- and viewpoint-controllable pipeline that combines a pretrained image backbone with SMPL-X-based motion guidance, followed by a training-free temporal refinement stage built on a pretrained video diffusion model.
|
| 13 |
+
|
| 14 |
+
## Getting Started
|
| 15 |
+
|
| 16 |
+
### Installation
|
| 17 |
+
|
| 18 |
+
```bash
|
| 19 |
+
conda create -n reimagine python=3.10
|
| 20 |
+
conda activate reimagine
|
| 21 |
+
pip install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 --index-url https://download.pytorch.org/whl/cu124
|
| 22 |
+
pip install -e .
|
| 23 |
+
```
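
After installing, it can help to confirm that the pinned packages actually resolved. The sketch below is not part of the ReImagine repository: `check_pins` and `PINS` are hypothetical names, and the pins are simply copied from the install command above. It drops any local version suffix such as `+cu124` before comparing, since wheels from the PyTorch CUDA index typically report their version in that form.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the install command above.
PINS = {"torch": "2.4.1", "torchvision": "0.19.1", "torchaudio": "2.4.1"}


def check_pins(pins, get_version=version):
    """Return {package: problem} for packages that are missing or mismatched."""
    problems = {}
    for name, expected in pins.items():
        try:
            installed = get_version(name)
        except PackageNotFoundError:
            problems[name] = "not installed"
            continue
        # Drop a local version suffix such as "+cu124" before comparing.
        if installed.split("+")[0] != expected:
            problems[name] = f"found {installed}, expected {expected}"
    return problems


if __name__ == "__main__":
    for name, problem in check_pins(PINS).items():
        print(f"{name}: {problem}")
```

An empty result means every pinned package was found at the expected version. The `get_version` parameter exists only so the check can be exercised without the packages installed.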
|
| 24 |
+
|
| 25 |
+
### Pretrained Weights
|
| 26 |
+
|
| 27 |
+
ReImagine builds on pretrained base models (FLUX.1 Kontext and a surface-normal ControlNet) together with ReImagine-specific LoRA weights. You can download all of them with the Hugging Face CLI:
|
| 28 |
+
|
| 29 |
+
```bash
|
| 30 |
+
# Download the base FLUX.1 Kontext model (skipping the single-file checkpoint and the vae/ folder)
|
| 31 |
+
hf download black-forest-labs/FLUX.1-Kontext-dev \
|
| 32 |
+
--local-dir ./models/FLUX.1-Kontext-dev \
|
| 33 |
+
--exclude "flux1-kontext-dev.safetensors" \
|
| 34 |
+
--exclude "vae/**"
|
| 35 |
+
|
| 36 |
+
# Download ControlNet
|
| 37 |
+
hf download jasperai/Flux.1-dev-Controlnet-Surface-Normals \
|
| 38 |
+
--local-dir ./models/Flux.1-dev-Controlnet-Surface-Normals
|
| 39 |
+
|
| 40 |
+
# Download ReImagine LoRA Weights
|
| 41 |
+
hf download taited/ReImagine-Pretrained --local-dir ./models/ReImagine-Pretrained
|
| 42 |
+
```
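
Before running inference, one might check that the three download targets exist and are non-empty. This is a minimal sketch, assuming the `--local-dir` paths used in the commands above. `missing_model_dirs` and `EXPECTED_DIRS` are hypothetical names, not something the repository provides, and the check does not validate file contents.

```python
from pathlib import Path

# Directory names taken from the --local-dir arguments above.
EXPECTED_DIRS = (
    "FLUX.1-Kontext-dev",
    "Flux.1-dev-Controlnet-Surface-Normals",
    "ReImagine-Pretrained",
)


def missing_model_dirs(root="./models", expected=EXPECTED_DIRS):
    """Return the expected sub-directories of `root` that are absent or empty."""
    root = Path(root)
    missing = []
    for name in expected:
        path = root / name
        if not path.is_dir() or not any(path.iterdir()):
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = missing_model_dirs()
    print("all weights present" if not missing else f"missing: {missing}")
```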
|
| 43 |
+
|
| 44 |
+
## Inference
|
| 45 |
+
|
| 46 |
+
To perform image-first synthesis, use the provided inference script:
|
| 47 |
+
|
| 48 |
+
```bash
|
| 49 |
+
python inference_img.py
|
| 50 |
+
```
|
| 51 |
+
This script expects two inputs: a wide reference image showing the subject's front and back views, and a normal map rendered from SMPL-X. For video synthesis, the training-free temporal refinement stage is then applied to keep the generated frames consistent.
|
| 52 |
+
|
| 53 |
+
## Citation
|
| 54 |
+
|
| 55 |
+
If you find this project useful, please consider citing the paper:
|
| 56 |
+
|
| 57 |
+
```bibtex
|
| 58 |
+
@article{sun2025rethinking,
|
| 59 |
+
title={ReImagine: Rethinking Controllable High-Quality Human Video Generation via Image-First Synthesis},
|
| 60 |
+
author={Sun, Zhengwentai and Zheng, Keru and Li, Chenghong and Liao, Hongjie and Yang, Xihe and Li, Heyuan and Zhi, Yihao and Ning, Shuliang and Cui, Shuguang and Han, Xiaoguang},
|
| 61 |
+
journal={arXiv preprint arXiv:2604.19720},
|
| 62 |
+
year={2026},
|
| 63 |
+
url={https://arxiv.org/abs/2604.19720v1}
|
| 64 |
+
}
|
| 65 |
+
```
|