---
license: apache-2.0
tags:
- video-generation
- video-editing
- in-context-learning
- pytorch
pipeline_tag: video-to-video
library_name: transformers
authors:
- XiangpengYang
- horizonwind2004
---

<div align="center">

<h1 style="margin: 0; font-size: 1.8em;">
Unified Video Editing with Temporal Reasoner
</h1>

<h4 style="margin: 15px 0; color: #2c3e50;">
👁️ See → 🧠 Reason → ✂️ Edit
</h4>

<h4 style="margin: 15px 0; color: #2c3e50;">
🚀 A Chain-of-Frames editing method that enables temporal reasoning and 4× video-length generalization with just 50k training pairs!
</h4>

<a href="https://huggingface.co/papers/2512.07469"><img src="https://img.shields.io/badge/HuggingFace-Daily_Paper-ffd21e.svg" alt="Daily Paper"></a>
<a href="https://arxiv.org/abs/2512.07469"><img src="https://img.shields.io/badge/arXiv-2512.07469-b31b1b.svg" alt="arXiv"></a>
<a href="https://videocof.github.io"><img src="https://img.shields.io/badge/Project-Page-green" alt="Project Page"></a>
<a href="https://github.com/knightyxp/VideoCoF"><img src="https://img.shields.io/badge/GitHub-Repo-blue?logo=github" alt="GitHub"></a>

</div>

<div align="center">
<b>
<a href="https://scholar.google.com/citations?user=reiIeYMAAAAJ">Xiangpeng Yang</a><sup>1</sup>,
<a href="https://horizonwind2004.github.io/">Ji Xie</a><sup>2</sup>,
<a href="https://scholar.google.com/citations?user=OvfI_HMAAAAJ">Yiyuan Yang</a><sup>1</sup>,
<a href="https://scholar.google.com/citations?user=zfeWd6gAAAAJ">Yan Huang</a><sup>1</sup>,
<a href="https://scholar.google.com/citations?user=sCuACdkAAAAJ">Min Xu</a><sup>1</sup>,
<a href="https://scholar.google.com/citations?user=sCuACdkAAAAJ">Qiang Wu</a><sup>1</sup>
</b>
<br>
<span style="font-size: 1em; color: #555;"><sup>1</sup>University of Technology Sydney, <sup>2</sup>Zhejiang University</span>
</div>

<br>

|
| | # VideoCoF: Unified Video Editing with Temporal Reasoner |
| |
|
| |
|
| | **VideoCoF** is a unified video editing model that bridges the gap between expert models (precise but restricted) and unified in-context models (flexible but spatially inaccurate). By introducing a **"See → Reason → Edit"**, a Chain-of-Frames paradigm, VideoCoF predicts reasoning tokens before generating the target video tokens, thereby removing the need for user-provided masks while achieving precise instruction to-region alignment. |
| |
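The "See → Reason → Edit" flow can be pictured as an ordering constraint on the generated sequence. The sketch below is purely illustrative and is not the repository's implementation; the frame counts mirror the `--source_frames 33` / `--reasoning_frames 4` defaults used by the inference command in this card:

```python
# Illustrative sketch only: a Chain-of-Frames sequence lays out source-video
# tokens first, then reasoning tokens, then target-video tokens, so the model
# "sees" and "reasons" before it "edits".
def chain_of_frames_layout(source_frames, reasoning_frames, target_frames):
    """Return the per-frame role, in generation order."""
    return (["source"] * source_frames
            + ["reasoning"] * reasoning_frames
            + ["target"] * target_frames)

layout = chain_of_frames_layout(source_frames=33, reasoning_frames=4, target_frames=33)
```

Because the reasoning frames precede every target frame, the edit region is resolved before any edited pixels are produced.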
|
| | <div align="center"> |
| | <a href="https://www.youtube.com/watch?v=XrYj0Qmc49w" target="_blank"> |
| | <img src="https://img.youtube.com/vi/XrYj0Qmc49w/maxresdefault.jpg" |
| | alt="Video Demo" |
| | width="80%" |
| | style="max-width:900px; border-radius:10px; box-shadow:0 0 10px rgba(0,0,0,0.15);"> |
| | </a> |
| | <br> |
| | <em>Click the image above to watch the full video on YouTube ๐ฌ</em> |
| | </div> |
| | |
| | ## ๐ Key Capabilities |
| |  |
| |
|
| | 1. **Temporal Reasoning**: Adopts a unique approach where the model first identifies *where* and *how* to edit (Reasoning) before predicting the target video tokens. |
| | 2. **Data Efficiency**: Achieves SOTA performance with only **50k training pairs** (33 frames each). |
| | 3. **Length Extrapolation**: Demonstrates robust multi-shot editing and can generalize to videos **4× longer** than training samples. |
| | 4. **Versatile Editing**: Supports: |
| | * Object Removal |
| | * Object Addition |
| | * Object Swap |
| | * Local Style Transfer |
| |
|
| | ## ๐ง Quick Start |
| |
|
| | To use these weights, please refer to the official [GitHub Repository](https://github.com/knightyxp/VideoCoF) for inference code and environment setup. |
| |
|
| | ### Installation |
| |
|
| | ```bash |
| | git clone https://github.com/knightyxp/VideoCoF |
| | cd VideoCoF |
| | |
| | # 1. Create and activate a conda environment |
| | conda create -n videocof python=3.10 |
| | conda activate videocof |
| | |
| | # 2. Install PyTorch (Choose version compatible with your CUDA) |
| | # For standard GPUs (CUDA 12.1): |
| | pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121 |
| | |
| | # For Hopper GPUs (e.g., H100/H800) requiring fast inference: |
| | # pip install torch==2.8.0 torchvision==0.23.0 torchaudio==2.8.0 --index-url https://download.pytorch.org/whl/cu128 |
| | |
| | # 3. Install other dependencies |
| | pip install -r requirements.txt |
| | ``` |
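Before downloading the (large) weights, it can help to confirm that the installed PyTorch build actually sees a CUDA device. `cuda_ready` is a hypothetical helper, not part of the repository:

```python
import importlib.util

def cuda_ready() -> bool:
    # Hypothetical sanity check: True only if torch imports cleanly
    # and reports at least one visible CUDA device.
    if importlib.util.find_spec("torch") is None:
        return False
    import torch
    return torch.cuda.is_available()

print("CUDA ready:", cuda_ready())
```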

**Note on FlashAttention:**
We recommend **FlashAttention-3** (currently in beta) for optimal performance, especially on NVIDIA H100/H800 GPUs.
If you are using these GPUs, follow the [official FlashAttention-3 installation guide](https://github.com/Dao-AILab/flash-attention?tab=readme-ov-file#flashattention-3-beta-release) after installing a compatible PyTorch version (e.g., PyTorch 2.8 + CUDA 12.8).

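Since FlashAttention is optional, inference scripts typically guard for its absence. The fallback logic below is illustrative only (it is not the repository's actual dispatch code); PyTorch's built-in SDPA kernels are the usual default:

```python
import importlib.util

def pick_attention_backend() -> str:
    # Illustrative fallback: use FlashAttention when the wheel is installed,
    # otherwise rely on PyTorch's scaled_dot_product_attention kernels.
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attn"
    return "sdpa"  # torch.nn.functional.scaled_dot_product_attention
```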
|
| | ### Download Models |
| |
|
| | * **Wan-2.1-T2V-14B Pretrained Weights:** |
| | |
| | ```bash |
| | git lfs install |
| | git clone https://huggingface.co/Wan-AI/Wan2.1-T2V-14B |
| | |
| | # Or using huggingface-cli: |
| | # hf download Wan-AI/Wan2.1-T2V-14B --local-dir Wan2.1-T2V-14B |
| | ``` |
| | |
| | * **VideoCoF Checkpoint:** |
| | |
| | ```bash |
| | git lfs install |
| | git clone https://huggingface.co/XiangpengYang/VideoCoF videocof_weight |
| | |
| | # Or using huggingface-cli: |
| | # hf download XiangpengYang/VideoCoF --local-dir videocof_weight |
| | ``` |
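The same downloads can also be scripted with `huggingface_hub.snapshot_download`. `fetch_weights` is a hypothetical wrapper, and the Wan2.1-T2V-14B snapshot is tens of gigabytes, so call it deliberately:

```python
def fetch_weights(local_root: str = ".") -> None:
    # Lazy import keeps huggingface_hub optional until you actually call this.
    from huggingface_hub import snapshot_download
    snapshot_download(repo_id="Wan-AI/Wan2.1-T2V-14B",
                      local_dir=f"{local_root}/Wan2.1-T2V-14B")
    snapshot_download(repo_id="XiangpengYang/VideoCoF",
                      local_dir=f"{local_root}/videocof_weight")
```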
| | |
### Inference

```bash
export CUDA_VISIBLE_DEVICES=0
torchrun --nproc_per_node=1 inference.py \
    --video_path assets/two_man.mp4 \
    --prompt "Remove the young man with short black hair wearing black shirt on the left." \
    --output_dir results/obj_rem \
    --model_name ./Wan2.1-T2V-14B \
    --seed 0 \
    --num_frames 33 \
    --source_frames 33 \
    --reasoning_frames 4 \
    --repeat_rope \
    --videocof_path videocof_weight/videocof.safetensors
```

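To run several edits with the same settings, the invocation can be templated from Python. `build_inference_cmd` is a hypothetical convenience wrapper, not part of the repository; it simply reproduces the flags of the single-video command above:

```python
import shlex

def build_inference_cmd(video: str, prompt: str, output_dir: str,
                        model_name: str = "Wan2.1-T2V-14B",
                        videocof_path: str = "videocof_weight/videocof.safetensors") -> str:
    # Hypothetical helper: assemble the torchrun command for one edit,
    # shell-quoting the prompt so spaces survive.
    args = ["torchrun", "--nproc_per_node=1", "inference.py",
            "--video_path", video,
            "--prompt", prompt,
            "--output_dir", output_dir,
            "--model_name", model_name,
            "--seed", "0",
            "--num_frames", "33",
            "--source_frames", "33",
            "--reasoning_frames", "4",
            "--repeat_rope",
            "--videocof_path", videocof_path]
    return " ".join(shlex.quote(a) for a in args)

cmd = build_inference_cmd("assets/two_man.mp4",
                          "Remove the young man on the left.",
                          "results/obj_rem")
```

Pass the resulting string to `subprocess.run(cmd, shell=True)` or write it into a batch script.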
|
| | For parallel inference: |
| |
|
| | ```bash |
| | sh scripts/parallel_infer.sh |
| | ``` |
| |
|
## 🙏 Acknowledgments

We thank the authors of related works and the open-source projects [VideoX-Fun](https://github.com/aigc-apps/VideoX-Fun) and [Wan](https://github.com/Wan-Video/Wan2.1) for their contributions.

## 📄 License

This project is licensed under the [Apache License 2.0](LICENSE).

## 📮 Contact

For any questions, please feel free to reach out to the author Xiangpeng Yang [@knightyxp](https://github.com/knightyxp) via email at knightyxp@gmail.com or Xiangpeng.Yang@student.uts.edu.au.

## 📚 Citation

If you find this work useful for your research, please consider citing:

```bibtex
@article{yang2025videocof,
  title={Unified Video Editing with Temporal Reasoner},
  author={Yang, Xiangpeng and Xie, Ji and Yang, Yiyuan and Huang, Yan and Xu, Min and Wu, Qiang},
  journal={arXiv preprint arXiv:2512.07469},
  year={2025}
}
```

<div align="center">
❤️ <b>If you find this project helpful, please consider giving it a like!</b> ❤️
</div>