# Unified Video Editing with Temporal Reasoner
See → Reason → Edit
A Chain-of-Frames video editing method that enables temporal reasoning and 4× video length extrapolation with just 50k training pairs!

<sup>1</sup>University of Technology Sydney, <sup>2</sup>Zhejiang University
## Introduction
https://github.com/user-attachments/assets/26f7d347-3d6c-43cf-9645-6eb5906f6ad6
## News

- 2025.12.09: Paper available on arXiv.
- 2025.12.08: Released the inference code and the VideoCoF-50k weights.
- 2025.12.06: Project page and README updated!
## Table of Contents

- Quick Start
- Model Zoo
- Results
- Edit Comparison
- TODO
- Acknowledgments
- License
- Contact
- Citation
## Quick Start
Clone the repository:
```bash
git clone https://github.com/videocof/VideoCoF.git
cd VideoCoF
```

Install dependencies:

```bash
# 1. Create and activate a conda environment
conda create -n videocof python=3.10
conda activate videocof

# 2. Install PyTorch (choose the version compatible with your CUDA)
# For standard GPUs (CUDA 12.1):
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121
# For Hopper GPUs (e.g., H100/H800) requiring fast inference:
# pip install torch==2.8.0 torchvision==0.23.0 torchaudio==2.8.0 --index-url https://download.pytorch.org/whl/cu128

# 3. Install other dependencies
pip install -r requirements.txt
```

> **Note on Flash Attention:** We recommend FlashAttention-3 (currently in beta) for optimal performance, especially on NVIDIA H100/H800 GPUs. If you use these GPUs, please follow the official FlashAttention-3 installation guide after installing the compatible PyTorch version (e.g., PyTorch 2.8 + CUDA 12.8).
Download Models:
Wan-2.1-T2V-14B Pretrained Weights:
```bash
git lfs install
git clone https://huggingface.co/Wan-AI/Wan2.1-T2V-14B
# Or using huggingface-cli:
# hf download Wan-AI/Wan2.1-T2V-14B --local-dir Wan2.1-T2V-14B
```

VideoCoF Checkpoint:

```bash
git lfs install
git clone https://huggingface.co/XiangpengYang/VideoCoF videocof_weight
# Or using huggingface-cli:
# hf download XiangpengYang/VideoCoF --local-dir videocof_weight
```

Inference:
For single inference tasks:
```bash
# Object Removal
sh scripts/obj_rem.sh
# Object Addition
sh scripts/obj_add.sh
# Local Style Transfer
sh scripts/local_style.sh
```

For parallel inference:

```bash
sh scripts/parallel_infer.sh
```
## Model Zoo
Our models are available on Hugging Face:
| Model Name | Description | Link |
|---|---|---|
| VideoCoF-Base | Base model trained on 50k video pairs | Hugging Face |
## Results

### Why Do We Need Reasoning Before Editing?
Current video editing methods typically follow two paths:
- Expert models: Rely on external masks for precision but sacrifice unification.
- Unified in-context learning models: Mask-free but often struggle with spatial accuracy due to the lack of explicit cues.
VideoCoF bridges this gap by predicting reasoning tokens before generating the target video tokens.
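The decoding order can be illustrated with a toy sketch: the model first emits reasoning tokens that localize the edit target, then generates the edited video tokens conditioned on the source and that localization. Everything below (function name, token representation, instruction format) is a hypothetical illustration, not VideoCoF's actual interface.

```python
# Toy sketch of the "reason before edit" decoding order.
# All names here are hypothetical illustrations, not VideoCoF's real API.

def edit_with_reasoning(source_tokens, instruction):
    # Stage 1: reasoning tokens -- here faked as the indices of the
    # source tokens that match the edit target named in the instruction.
    reasoning = [i for i, tok in enumerate(source_tokens)
                 if tok == instruction["target"]]
    # Stage 2: target video tokens, generated conditioned on the source
    # sequence plus the reasoning tokens that localize the edit.
    edited = [instruction["replacement"] if i in reasoning else tok
              for i, tok in enumerate(source_tokens)]
    return reasoning, edited

reasoning, edited = edit_with_reasoning(
    ["sky", "cat", "road"],
    {"target": "cat", "replacement": "dog"},
)
# reasoning localizes the edit target; edited changes only those positions.
```

The point of the two-stage order is that the explicit localization step substitutes for the external masks that expert models require.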
### Key Capabilities
- Seeing, Reasoning, Editing: VideoCoF adopts a "seeing, reasoning, editing" approach, ensuring edits are applied accurately to the intended targets.
- Length Extrapolation: Trained on only 50k video pairs (33 frames each), VideoCoF demonstrates robust multi-shot editing and length generalization (e.g., 4× length extrapolation).
- Diverse Editing Tasks: Supports fine-grained (instance- and part-level, spatially aware) Object Removal, Object Addition, Object Swap, and Local Style Transfer.
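One common mechanism behind this kind of length generalization in DiT-based video models is a positional encoding, such as rotary embeddings, that is defined for arbitrary frame indices: positions seen at training time keep exactly the same encoding when the sequence grows. The sketch below illustrates that property only; it is an assumption for exposition, not VideoCoF's actual positional scheme.

```python
# Rotary-style per-frame angles, defined for any number of frames.
# Illustrative only; the dims and base here are arbitrary choices.

def rope_angles(num_frames, dim=8, base=10000.0):
    # One frequency per rotated pair of channels.
    freqs = [base ** (-2.0 * i / dim) for i in range(dim // 2)]
    # Angle for frame t is t * freq: defined for any t, trained or not.
    return [[t * f for f in freqs] for t in range(num_frames)]

train = rope_angles(33)    # training length: 33 frames
extra = rope_angles(132)   # 4x extrapolation: 132 frames
# The first 33 positions are encoded identically at both lengths,
# so behavior learned on short clips transfers to longer ones.
```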
### Gallery Highlights
Please refer to our Project Page for the full gallery.
- Object Removal: Remove people or objects based on text prompts.
- Object Addition: Add elements like animals, objects, or people.
- Object Swap: Change specific attributes or objects.
- Local Style Transfer: Modify texture, materials or colors.
## TODO
- [x] Release paper.
- [x] Release inference code and weights.
- [ ] Release training code.
- [ ] Release training data.
- [ ] Add Hugging Face demo.
## Acknowledgments

We thank the authors of related works and the open-source communities behind VideoX-Fun and Wan for their contributions.
## License
This project is licensed under the Apache License 2.0.
## Contact

For any questions, please feel free to reach out to the author Xiangpeng Yang (@knightyxp) at knightyxp@gmail.com or Xiangpeng.Yang@student.uts.edu.au.
## Citation
If you find this work useful for your research, please consider citing:
```bibtex
@article{yang2025videocof,
  title={Unified Video Editing with Temporal Reasoner},
  author={Yang, Xiangpeng and Xie, Ji and Yang, Yiyuan and Huang, Yan and Xu, Min and Wu, Qiang},
  journal={arXiv preprint arXiv:2512.07469},
  year={2025}
}
```
