---
license: apache-2.0
library_name: diffusers
pipeline_tag: image-to-video
tags:
- video-frame-interpolation
- vfi
- diffusion-transformer
---

# LDF-VFI: Towards Holistic Modeling for Video Frame Interpolation with Auto-regressive Diffusion Transformers

This repository contains the weights for **LDF-VFI** (Local Diffusion Forcing for Video Frame Interpolation), as introduced in the paper [Towards Holistic Modeling for Video Frame Interpolation with Auto-regressive Diffusion Transformers](https://huggingface.co/papers/2601.14959).

[[Paper](https://arxiv.org/abs/2601.14959)] [[Project Page](https://xypeng9903.github.io/ldf-vfi-web/)] [[GitHub](https://github.com/xypeng9903/LDF-VFI)]

## Introduction

Existing video frame interpolation (VFI) methods often adopt a frame-centric approach, processing videos as independent short segments (e.g., triplets), which leads to temporal inconsistencies and motion artifacts. To overcome this, we propose a holistic, video-centric paradigm named **L**ocal **D**iffusion **F**orcing for **V**ideo **F**rame **I**nterpolation (LDF-VFI).

Our framework is built upon an auto-regressive diffusion transformer that models the entire video sequence to ensure long-range temporal coherence. LDF-VFI incorporates sparse, local attention and tiled VAE encoding, enabling efficient processing of long sequences and generalization to arbitrary spatial resolutions (e.g., 4K) at inference without retraining.

## Key Features

- **Auto-regressive Diffusion Transformer**: Models the entire video sequence for long-range temporal coherence.
- **Skip-concatenate Sampling**: A novel strategy to maintain temporal stability and mitigate error accumulation.
- **Resolution Generalization**: Supports arbitrary spatial resolutions (including 4K) at inference time.
- **Enhanced Conditional VAE**: Leverages multi-scale features from input videos to improve reconstruction fidelity.
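To give a rough sense of how tiled encoding lets a fixed-size encoder handle a 4K frame, the sketch below computes overlapping tile boxes that cover a frame of any size. It is an illustration only, not the authors' implementation: the function names `tile_spans` and `tile_grid`, the 512-pixel tile size, and the 64-pixel overlap are all assumptions chosen for the example, and the actual tiling scheme in LDF-VFI may differ.

```python
from typing import List, Tuple

Span = Tuple[int, int]
Box = Tuple[int, int, int, int]


def tile_spans(length: int, tile: int, overlap: int) -> List[Span]:
    """Return [start, end) spans of size `tile` that together cover range(length).

    Hypothetical helper: neighbouring spans share at least `overlap` pixels.
    """
    if overlap < 0 or tile <= overlap:
        raise ValueError("need 0 <= overlap < tile")
    if length <= tile:
        return [(0, length)]  # a single tile already covers the axis
    stride = tile - overlap
    starts = list(range(0, length - tile, stride))
    starts.append(length - tile)  # last tile sits flush with the border
    return [(s, s + tile) for s in starts]


def tile_grid(height: int, width: int, tile: int = 512, overlap: int = 64) -> List[Box]:
    """Return (top, bottom, left, right) boxes covering a height x width frame."""
    return [
        (top, bottom, left, right)
        for top, bottom in tile_spans(height, tile, overlap)
        for left, right in tile_spans(width, tile, overlap)
    ]


if __name__ == "__main__":
    boxes = tile_grid(2160, 3840)  # a 4K UHD frame
    print(len(boxes), "tiles of at most 512x512 pixels")
```

In a tiled pipeline, each box would be encoded separately and the overlapping regions blended when the results are stitched back together; the blending step is omitted here.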
## Usage

For installation and usage instructions, please refer to the [official GitHub repository](https://github.com/xypeng9903/LDF-VFI).

## Citation

If you find this work helpful, please cite:

```bibtex
@misc{peng2026holisticmodelingvideoframe,
      title={Towards Holistic Modeling for Video Frame Interpolation with Auto-regressive Diffusion Transformers},
      author={Xinyu Peng and Han Li and Yuyang Huang and Ziyang Zheng and Yaoming Wang and Xin Chen and Wenrui Dai and Chenglin Li and Junni Zou and Hongkai Xiong},
      year={2026},
      eprint={2601.14959},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2601.14959},
}
```