---
license: apache-2.0
library_name: diffusers
pipeline_tag: image-to-video
tags:
- video-frame-interpolation
- vfi
- diffusion-transformer
---

# LDF-VFI: Towards Holistic Modeling for Video Frame Interpolation with Auto-regressive Diffusion Transformers

This repository contains the weights for **LDF-VFI** (Local Diffusion Forcing for Video Frame Interpolation), as introduced in the paper [Towards Holistic Modeling for Video Frame Interpolation with Auto-regressive Diffusion Transformers](https://huggingface.co/papers/2601.14959).

[[Paper](https://arxiv.org/abs/2601.14959)] [[Project Page](https://xypeng9903.github.io/ldf-vfi-web/)] [[GitHub](https://github.com/xypeng9903/LDF-VFI)]

## Introduction

Existing video frame interpolation (VFI) methods often adopt a frame-centric approach, processing videos as independent short segments (e.g., triplets), which leads to temporal inconsistencies and motion artifacts. To address this, we propose a holistic, video-centric paradigm named **L**ocal **D**iffusion **F**orcing for **V**ideo **F**rame **I**nterpolation (LDF-VFI).

Our framework is built upon an auto-regressive diffusion transformer that models the entire video sequence to ensure long-range temporal coherence. LDF-VFI incorporates sparse, local attention and tiled VAE encoding, enabling efficient processing of long sequences and generalization to arbitrary spatial resolutions (e.g., 4K) at inference without retraining.
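
To give a rough intuition for why local attention keeps long sequences tractable, the sketch below builds a generic causal sliding-window mask over frames. It is an illustration only: the function name `local_causal_mask`, the `window` parameter, and the choice of a strictly causal window are assumptions made for this example, and the actual attention pattern used by LDF-VFI is defined in the official repository.

```python
def local_causal_mask(num_frames: int, window: int) -> list[list[bool]]:
    """Return mask[i][j] == True when frame i may attend to frame j.

    Assumption (hypothetical, for illustration): frame i sees itself and
    the `window - 1` frames immediately before it, and nothing else.
    """
    if num_frames < 0 or window < 1:
        raise ValueError("num_frames must be >= 0 and window must be >= 1")
    return [
        [i - window < j <= i for j in range(num_frames)]
        for i in range(num_frames)
    ]


# Small demo: 6 frames, each attending to at most 3 frames.
mask = local_causal_mask(num_frames=6, window=3)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

With a fixed window, the number of allowed frame pairs grows linearly with the number of frames rather than quadratically, which is the general reason a local pattern scales to long videos.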

## Key Features

- **Auto-regressive Diffusion Transformer**: Models the entire video sequence for long-range temporal coherence.
- **Skip-concatenate Sampling**: A novel strategy to maintain temporal stability and mitigate error accumulation.
- **Resolution Generalization**: Supports arbitrary spatial resolutions (including 4K) at inference time.
- **Enhanced Conditional VAE**: Leverages multi-scale features from input videos to improve reconstruction fidelity.
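
As a minimal sketch of how tiled processing can cover a frame of arbitrary size, the example below computes overlapping tile boxes for a 4K frame. The helper names `tile_starts` and `tile_boxes`, the tile size of 512, and the overlap of 64 are hypothetical values chosen for illustration, not the settings used by LDF-VFI.

```python
def tile_starts(length: int, tile: int, overlap: int) -> list[int]:
    """Start offsets of tiles of size `tile` that together cover [0, length)."""
    if tile <= 0 or not 0 <= overlap < tile:
        raise ValueError("need tile > 0 and 0 <= overlap < tile")
    if length <= tile:
        return [0]  # a single (clipped) tile is enough
    stride = tile - overlap
    starts = list(range(0, length - tile, stride))
    starts.append(length - tile)  # last tile sits flush with the far edge
    return starts


def tile_boxes(height: int, width: int, tile: int = 512, overlap: int = 64):
    """Return (top, left, bottom, right) boxes covering a height x width frame."""
    return [
        (top, left, min(top + tile, height), min(left + tile, width))
        for top in tile_starts(height, tile, overlap)
        for left in tile_starts(width, tile, overlap)
    ]


# Demo on a 3840x2160 (4K) frame with the assumed tile settings.
boxes = tile_boxes(height=2160, width=3840)
print(len(boxes), boxes[0], boxes[-1])
```

Because every tile has a bounded size, peak memory in a scheme like this depends on the tile settings rather than on the full frame resolution; how overlapping regions are blended is left out of this sketch.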

## Usage

For installation and usage instructions, please refer to the [official GitHub repository](https://github.com/xypeng9903/LDF-VFI).

## Citation

If you find this work helpful, please cite:
```bibtex
@misc{peng2026holisticmodelingvideoframe,
      title={Towards Holistic Modeling for Video Frame Interpolation with Auto-regressive Diffusion Transformers}, 
      author={Xinyu Peng and Han Li and Yuyang Huang and Ziyang Zheng and Yaoming Wang and Xin Chen and Wenrui Dai and Chenglin Li and Junni Zou and Hongkai Xiong},
      year={2026},
      eprint={2601.14959},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2601.14959}, 
}
```