---
title: VideoCoF
emoji: 🔥
colorFrom: blue
colorTo: indigo
sdk: gradio
sdk_version: 5.44.1
app_file: app.py
pinned: false
license: apache-2.0
short_description: Unified Video Editing with Temporal Reasoner
---
<div align="center">
<h1 style="margin: 0; font-size: 2.4em;">
Unified Video Editing with Temporal Reasoner
</h1>
<h4 style="margin: 15px 0; color: #2c3e50;">
👁️ See → 🧠 Reason → ✂️ Edit
</h4>
<h4 style="margin: 15px 0; color: #2c3e50;">
🚀 A Chain-of-Frames video editing method enabling temporal reasoning and 4× video length extrapolation with just 50k training pairs!
</h4>

[Paper (Hugging Face)](https://huggingface.co/papers/2512.07469)
[arXiv](https://arxiv.org/abs/2512.07469)
[Project Page](https://videocof.github.io)
[Model (Hugging Face)](https://huggingface.co/XiangpengYang/VideoCoF)

</div>
<div align="center">
<b>
<a href="https://scholar.google.com/citations?user=reiIeYMAAAAJ">Xiangpeng Yang</a><sup>1</sup>,
<a href="https://horizonwind2004.github.io/">Ji Xie</a><sup>2</sup>,
<a href="https://scholar.google.com/citations?user=OvfI_HMAAAAJ">Yiyuan Yang</a><sup>1</sup>,
<a href="https://scholar.google.com/citations?user=zfeWd6gAAAAJ">Yan Huang</a><sup>1</sup>,
<a href="https://scholar.google.com/citations?user=sCuACdkAAAAJ">Min Xu</a><sup>1</sup>,
<a href="https://scholar.google.com/citations?user=sCuACdkAAAAJ">Qiang Wu</a><sup>1</sup>
</b>
<br>
<span style="font-size: 1em; color: #555;"><sup>1</sup>University of Technology Sydney, <sup>2</sup>Zhejiang University</span>
</div>

<br>
## 🍿 Introduction
| https://github.com/user-attachments/assets/26f7d347-3d6c-43cf-9645-6eb5906f6ad6 | |
## 🔥 News

- **2025.12.09**: Paper available on arXiv.
- **2025.12.08**: Released the inference code and the videocof-50k weights.
- **2025.12.06**: 🔥 Project Page and README updated!
## 📋 Table of Contents

- [🔧 Quick Start](#-quick-start)
- [🐘 Model Zoo](#-model-zoo)
- [🎭 Results](#-results)
- [🎨 Edit Comparison](#-edit-comparison)
- [🚧 TODO](#-todo)
- [🙏 Acknowledgments](#-acknowledgments)
- [📄 License](#-license)
- [📮 Contact](#-contact)
- [📖 Citation](#-citation)
## 🔧 Quick Start

1. **Clone the repository:**

   ```bash
   git clone https://github.com/videocof/VideoCoF.git
   cd VideoCoF
   ```

2. **Install dependencies:**

   ```bash
   # 1. Create and activate a conda environment
   conda create -n videocof python=3.10
   conda activate videocof

   # 2. Install PyTorch (choose the version compatible with your CUDA)
   # For standard GPUs (CUDA 12.1):
   pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121
   # For Hopper GPUs (e.g., H100/H800) requiring fast inference:
   # pip install torch==2.8.0 torchvision==0.23.0 torchaudio==2.8.0 --index-url https://download.pytorch.org/whl/cu128

   # 3. Install the remaining dependencies
   pip install -r requirements.txt
   ```
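   Optionally, a quick sanity check (a minimal sketch; it assumes only the environment created above) confirms that PyTorch sees your GPU:

   ```bash
   # Print the installed PyTorch version and whether CUDA is available
   # (expected output looks like "2.5.1 True")
   python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
   ```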
   **Note on Flash Attention:**
   We recommend **FlashAttention-3** (currently in beta) for optimal performance, especially on NVIDIA H100/H800 GPUs.
   If you are using these GPUs, please follow the [official FlashAttention-3 installation guide](https://github.com/Dao-AILab/flash-attention?tab=readme-ov-file#flashattention-3-beta-release) after installing the compatible PyTorch version (e.g., PyTorch 2.8 + CUDA 12.8).
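   As a rough sketch of what that guide describes at the time of writing (the beta install steps may change, so treat the linked guide as authoritative):

   ```bash
   # Build FlashAttention-3 from source (Hopper GPUs only, per the beta release notes)
   git clone https://github.com/Dao-AILab/flash-attention.git
   cd flash-attention/hopper
   python setup.py install
   ```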
3. **Download the models:**

   **Wan2.1-T2V-14B pretrained weights:**

   ```bash
   git lfs install
   git clone https://huggingface.co/Wan-AI/Wan2.1-T2V-14B
   # Or using huggingface-cli:
   # hf download Wan-AI/Wan2.1-T2V-14B --local-dir Wan2.1-T2V-14B
   ```

   **VideoCoF checkpoint:**

   ```bash
   git lfs install
   git clone https://huggingface.co/XiangpengYang/VideoCoF videocof_weight
   # Or using huggingface-cli:
   # hf download XiangpengYang/VideoCoF --local-dir videocof_weight
   ```
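   Optionally, a quick check (a hedged sketch; the exact file names inside the checkpoints may differ) to confirm that Git LFS fetched real weights rather than leaving pointer stubs:

   ```bash
   # Real weight files are GB-scale; unfetched LFS pointer stubs are ~130 bytes
   du -sh Wan2.1-T2V-14B videocof_weight
   # Flag any suspiciously small .safetensors files (likely unfetched LFS pointers)
   find Wan2.1-T2V-14B videocof_weight -name '*.safetensors' -size -1000c -print
   ```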
4. **Run inference:**

   For single inference tasks:

   ```bash
   # Object Removal
   sh scripts/obj_rem.sh
   # Object Addition
   sh scripts/obj_add.sh
   # Local Style Transfer
   sh scripts/local_style.sh
   ```

   For parallel inference:

   ```bash
   sh scripts/parallel_infer.sh
   ```
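   To run the three single-task scripts back to back, a minimal loop over the script names listed above works:

   ```bash
   # Run each single-task editing script in sequence
   for task in obj_rem obj_add local_style; do
     sh "scripts/${task}.sh"
   done
   ```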
## 🐘 Model Zoo

Our models are available on Hugging Face:

| Model Name | Description | Link |
|------------|-------------|------|
| VideoCoF-Base | Base model trained on 50k video pairs | [Hugging Face](https://huggingface.co/XiangpengYang/VideoCoF) |
## 🎭 Results

### Why Do We Need Reasoning Before Editing?
Current video editing methods typically follow two paths:

1. **Expert models**: Rely on external masks for precision but sacrifice unification.
2. **Unified in-context learning models**: Mask-free, but often struggle with spatial accuracy due to the lack of explicit cues.

**VideoCoF** bridges this gap by predicting reasoning tokens before generating the target video tokens.

### Key Capabilities

1. **Seeing, Reasoning, Editing**: VideoCoF adopts a "seeing, reasoning, editing" approach, ensuring edits are applied accurately to the intended targets.
2. **Length Extrapolation**: Trained on only **50k** video pairs (33 frames each), VideoCoF demonstrates robust multi-shot editing and length generalization (e.g., 4× length extrapolation).
3. **Diverse Editing Tasks**: Supports fine-grained (instance- and part-level, spatially aware) Object Removal, Object Addition, Object Swap, and Local Style Transfer.

### Gallery Highlights

> Please refer to our [Project Page](https://videocof.github.io) for the full gallery.

* **Object Removal**: Remove people or objects based on text prompts.
* **Object Addition**: Add elements like animals, objects, or people.
* **Object Swap**: Change specific attributes or objects.
* **Local Style Transfer**: Modify textures, materials, or colors.
## 🚧 TODO

- [x] Release paper.
- [x] Release inference code and weights.
- [ ] Release training code.
- [ ] Release training data.
- [ ] Add Hugging Face demo.
## 🙏 Acknowledgments

We thank the authors of related works and the open-source communities behind [VideoX-Fun](https://github.com/aigc-apps/VideoX-Fun) and [Wan](https://github.com/Wan-Video/Wan2.1) for their contributions.
## 📄 License

This project is licensed under the [Apache License 2.0](LICENSE).
## 📮 Contact

For any questions, feel free to reach out to the author, Xiangpeng Yang [@knightyxp](https://github.com/knightyxp), via email: knightyxp@gmail.com or Xiangpeng.Yang@student.uts.edu.au.
## 📖 Citation

If you find this work useful for your research, please consider citing:

```bibtex
@article{yang2025videocof,
  title={Unified Video Editing with Temporal Reasoner},
  author={Yang, Xiangpeng and Xie, Ji and Yang, Yiyuan and Huang, Yan and Xu, Min and Wu, Qiang},
  journal={arXiv preprint arXiv:2512.07469},
  year={2025}
}
```
<div align="center">

⭐ **If you find this project helpful, please consider giving it a star!** ⭐

</div>
## ⭐️ Star History

[Star History Chart](https://star-history.com/#knightyxp/VideoCoF&Date)