---
license: apache-2.0
library_name: diffusers
pipeline_tag: image-to-video
tags:
- video-frame-interpolation
- vfi
- diffusion-transformer
---

# LDF-VFI: Towards Holistic Modeling for Video Frame Interpolation with Auto-regressive Diffusion Transformers

This repository contains the weights for **LDF-VFI** (Local Diffusion Forcing for Video Frame Interpolation), as introduced in the paper [Towards Holistic Modeling for Video Frame Interpolation with Auto-regressive Diffusion Transformers](https://huggingface.co/papers/2601.14959).

[[Paper](https://arxiv.org/abs/2601.14959)] [[Project Page](https://xypeng9903.github.io/ldf-vfi-web/)] [[GitHub](https://github.com/xypeng9903/LDF-VFI)]

## Introduction

Existing video frame interpolation (VFI) methods often adopt a frame-centric approach, processing videos as independent short segments (e.g., triplets), which leads to temporal inconsistencies and motion artifacts. To overcome this, we propose a holistic, video-centric paradigm named **L**ocal **D**iffusion **F**orcing for **V**ideo **F**rame **I**nterpolation (LDF-VFI).

Our framework is built upon an auto-regressive diffusion transformer that models the entire video sequence to ensure long-range temporal coherence. LDF-VFI incorporates sparse, local attention and tiled VAE encoding, enabling efficient processing of long sequences and generalization to arbitrary spatial resolutions (e.g., 4K) at inference without retraining.
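
To make the idea of sparse, local attention in an auto-regressive model concrete, the sketch below builds a causal sliding-window mask over frames. It is an illustration only: the function name `local_causal_mask`, the frame-level granularity, and the fixed `window` size are assumptions made for this example, and the attention pattern actually used by LDF-VFI may differ.

```python
from __future__ import annotations


def local_causal_mask(num_frames: int, window: int) -> list[list[bool]]:
    """Return mask[q][k], True when query frame q may attend to key frame k.

    Hypothetical helper: a frame attends to itself and to at most
    `window - 1` preceding frames, never to future frames.
    """
    if num_frames < 0:
        raise ValueError("num_frames must be non-negative")
    if window < 1:
        raise ValueError("window must be at least 1")
    return [
        [0 <= q - k < window for k in range(num_frames)]
        for q in range(num_frames)
    ]


if __name__ == "__main__":
    # Print the mask for a 6-frame clip with a window of 3 frames.
    for row in local_causal_mask(num_frames=6, window=3):
        print("".join("x" if allowed else "." for allowed in row))
```

With a fixed window, the number of attended frame pairs grows linearly with the sequence length rather than quadratically, which is the general reason local attention helps on long videos.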

## Key Features

- **Auto-regressive Diffusion Transformer**: Models the entire video sequence for long-range temporal coherence.
- **Skip-concatenate Sampling**: A novel strategy to maintain temporal stability and mitigate error accumulation.
- **Resolution Generalization**: Supports arbitrary spatial resolutions (including 4K) at inference time.
- **Enhanced Conditional VAE**: Leverages multi-scale features from input videos to improve reconstruction fidelity.
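
The introduction attributes resolution generalization partly to tiled VAE encoding. As a rough illustration of how a frame of arbitrary size can be covered by fixed-size tiles, the sketch below computes overlapping tile boxes. The helper names `tile_starts` and `tile_grid`, the 512-pixel tile size, and the 64-pixel overlap are hypothetical choices for this example, not values taken from the LDF-VFI code, and blending of the overlapping regions is not shown.

```python
from __future__ import annotations


def tile_starts(length: int, tile: int, overlap: int) -> list[int]:
    """Start offsets of tiles of size `tile` covering [0, length)."""
    if tile <= 0 or not 0 <= overlap < tile:
        raise ValueError("require tile > 0 and 0 <= overlap < tile")
    if length <= tile:
        return [0]  # a single (possibly smaller) tile covers everything
    stride = tile - overlap
    starts = list(range(0, length - tile, stride))
    starts.append(length - tile)  # last tile sits flush with the border
    return starts


def tile_grid(
    height: int, width: int, tile: int = 512, overlap: int = 64
) -> list[tuple[int, int, int, int]]:
    """Return (top, left, bottom, right) boxes covering the whole frame."""
    return [
        (top, left, min(top + tile, height), min(left + tile, width))
        for top in tile_starts(height, tile, overlap)
        for left in tile_starts(width, tile, overlap)
    ]


if __name__ == "__main__":
    boxes = tile_grid(height=2160, width=3840)  # a 4K frame
    print(len(boxes), "tiles; first:", boxes[0], "last:", boxes[-1])
```

Because every tile has the same size regardless of the frame size, the encoder only ever sees inputs of a shape it can handle, which is the usual motivation for tiling at high resolutions.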

## Usage

For installation and usage instructions, please refer to the [official GitHub repository](https://github.com/xypeng9903/LDF-VFI).

## Citation

If you find this work helpful, please cite:

```bibtex
@misc{peng2026holisticmodelingvideoframe,
      title={Towards Holistic Modeling for Video Frame Interpolation with Auto-regressive Diffusion Transformers},
      author={Xinyu Peng and Han Li and Yuyang Huang and Ziyang Zheng and Yaoming Wang and Xin Chen and Wenrui Dai and Chenglin Li and Junni Zou and Hongkai Xiong},
      year={2026},
      eprint={2601.14959},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2601.14959},
}
```