---
license: apache-2.0
library_name: diffusers
pipeline_tag: image-to-video
tags:
- video-frame-interpolation
- vfi
- diffusion-transformer
---

# LDF-VFI: Towards Holistic Modeling for Video Frame Interpolation with Auto-regressive Diffusion Transformers

This repository contains the weights for LDF-VFI (Local Diffusion Forcing for Video Frame Interpolation), as introduced in the paper [Towards Holistic Modeling for Video Frame Interpolation with Auto-regressive Diffusion Transformers](https://huggingface.co/papers/2601.14959).

[[Paper](https://arxiv.org/abs/2601.14959)] [[Project Page](https://xypeng9903.github.io/ldf-vfi-web/)] [[GitHub](https://github.com/xypeng9903/LDF-VFI)]

## Introduction

Existing video frame interpolation (VFI) methods often adopt a frame-centric approach, processing videos as independent short segments (e.g., triplets), which leads to temporal inconsistencies and motion artifacts. To overcome this, we propose a holistic, video-centric paradigm named Local Diffusion Forcing for Video Frame Interpolation (LDF-VFI).

Our framework is built upon an auto-regressive diffusion transformer that models the entire video sequence to ensure long-range temporal coherence. LDF-VFI incorporates sparse, local attention and tiled VAE encoding, enabling efficient processing of long sequences and generalization to arbitrary spatial resolutions (e.g., 4K) at inference without retraining.
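
As a rough illustration of what "sparse, local attention" can mean along the time axis, the sketch below builds a sliding-window mask over frame indices. It is an assumption-laden toy, not the LDF-VFI implementation: the function name `local_attention_mask`, the `window` parameter, and the symmetric band shape are all hypothetical, and the sparsity pattern actually used by the model may differ (see the paper for the real design).

```python
def local_attention_mask(num_frames: int, window: int) -> list[list[bool]]:
    """Return a num_frames x num_frames mask for sliding-window attention.

    mask[i][j] is True when frame i may attend to frame j, i.e. when the two
    frames are at most `window` positions apart. Hypothetical helper for
    illustration only; not taken from the LDF-VFI code base.
    """
    if num_frames < 0 or window < 0:
        raise ValueError("num_frames and window must be non-negative")
    return [
        [abs(i - j) <= window for j in range(num_frames)]
        for i in range(num_frames)
    ]


if __name__ == "__main__":
    # Print the band structure for a short clip: '#' = attend, '.' = masked.
    for row in local_attention_mask(num_frames=8, window=2):
        print("".join("#" if allowed else "." for allowed in row))
```

With a fixed window, the number of allowed frame pairs grows linearly with the number of frames rather than quadratically, which is the property that keeps long sequences tractable.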

## Key Features

- Auto-regressive Diffusion Transformer: Models the entire video sequence for long-range temporal coherence.
- Skip-concatenate Sampling: A novel strategy to maintain temporal stability and mitigate error accumulation.
- Resolution Generalization: Supports arbitrary spatial resolutions (including 4K) at inference time.
- Enhanced Conditional VAE: Leverages multi-scale features from input videos to improve reconstruction fidelity.
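
To give a feel for how tiling lets a fixed-size encoder cover an arbitrary resolution, the following sketch computes overlapping tile boxes for a frame of any height and width. It covers only the geometry of tiling and is a sketch under stated assumptions: the names `tile_starts` and `tile_boxes` and the default `tile=512`, `overlap=64` values are hypothetical and are not taken from the LDF-VFI code, which may tile and blend differently.

```python
def tile_starts(size: int, tile: int, overlap: int) -> list[int]:
    """Start offsets of tiles along one axis so that [0, size) is fully covered."""
    if tile <= 0 or not 0 <= overlap < tile:
        raise ValueError("need tile > 0 and 0 <= overlap < tile")
    if size <= tile:
        return [0]  # a single (possibly smaller) tile covers the whole axis
    stride = tile - overlap
    starts = list(range(0, size - tile, stride))
    starts.append(size - tile)  # last tile sits flush with the far border
    return starts


def tile_boxes(height: int, width: int, tile: int = 512, overlap: int = 64):
    """Return (top, left, bottom, right) boxes covering a height x width frame."""
    return [
        (top, left, min(top + tile, height), min(left + tile, width))
        for top in tile_starts(height, tile, overlap)
        for left in tile_starts(width, tile, overlap)
    ]


if __name__ == "__main__":
    # A 4K (3840x2160) frame split into tiles under the assumed defaults.
    boxes = tile_boxes(height=2160, width=3840)
    print(len(boxes), "tiles; first:", boxes[0], "last:", boxes[-1])
```

Each box could then be encoded independently and the overlapping regions blended, so peak memory depends on the tile size rather than on the full frame.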

## Usage

For installation and usage instructions, please refer to the [official GitHub repository](https://github.com/xypeng9903/LDF-VFI).
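
If you only need the weight files locally, a minimal sketch using `huggingface_hub.snapshot_download` is shown below. It assumes the `huggingface_hub` package is installed and that the repository id is `onecat-ai/LDF-VFI`, as shown on this model page; the wrapper name `download_weights` is hypothetical. Loading and running the model still requires the code from the GitHub repository.

```python
from typing import Optional

# Assumed repository id, taken from this model page.
REPO_ID = "onecat-ai/LDF-VFI"


def download_weights(local_dir: Optional[str] = None) -> str:
    """Download all files of the model repository and return the local path.

    Hypothetical convenience wrapper around huggingface_hub.snapshot_download.
    The import is deferred so this module can be imported without the package.
    """
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=REPO_ID, local_dir=local_dir)


if __name__ == "__main__":
    path = download_weights()
    print("Weights downloaded to:", path)
```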

## Citation

If you find this work helpful, please cite:

```bibtex
@misc{peng2026holisticmodelingvideoframe,
  title={Towards Holistic Modeling for Video Frame Interpolation with Auto-regressive Diffusion Transformers},
  author={Xinyu Peng and Han Li and Yuyang Huang and Ziyang Zheng and Yaoming Wang and Xin Chen and Wenrui Dai and Chenglin Li and Junni Zou and Hongkai Xiong},
  year={2026},
  eprint={2601.14959},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2601.14959},
}
```