---
title: VideoCoF
emoji: 🎥
colorFrom: blue
colorTo: indigo
sdk: gradio
sdk_version: 5.44.1
app_file: app.py
pinned: false
license: apache-2.0
short_description: Unified Video Editing with Temporal Reasoner
---
<div align="center">
<h1 style="margin: 0; font-size: 2.4em;">
Unified Video Editing with Temporal Reasoner
</h1>
<h4 style="margin: 15px 0; color: #2c3e50;">
👁️ See → 🧠 Reason → ✂️ Edit
</h4>
<h4 style="margin: 15px 0; color: #2c3e50;">
🚀 A Chain-of-Frames video editing method that enables temporal reasoning and 4× video-length extrapolation with just 50k training pairs!
</h4>
[📄 Paper (Hugging Face)](https://huggingface.co/papers/2512.07469)
[📄 arXiv](https://arxiv.org/abs/2512.07469)
[🌐 Project Page](https://videocof.github.io)
[🤗 Model Weights](https://huggingface.co/XiangpengYang/VideoCoF)
</div>
<div align="center">
<b>
<a href="https://scholar.google.com/citations?user=reiIeYMAAAAJ">Xiangpeng Yang</a><sup>1</sup>,
<a href="https://horizonwind2004.github.io/">Ji Xie</a><sup>2</sup>,
<a href="https://scholar.google.com/citations?user=OvfI_HMAAAAJ">Yiyuan Yang</a><sup>1</sup>,
<a href="https://scholar.google.com/citations?user=zfeWd6gAAAAJ">Yan Huang</a><sup>1</sup>,
<a href="https://scholar.google.com/citations?user=sCuACdkAAAAJ">Min Xu</a><sup>1</sup>,
<a href="https://scholar.google.com/citations?user=sCuACdkAAAAJ">Qiang Wu</a><sup>1</sup>
</b>
<br>
<span style="font-size: 1em; color: #555;"><sup>1</sup>University of Technology Sydney, <sup>2</sup>Zhejiang University</span>
</div>
<br>
## 🍿 Introduction
https://github.com/user-attachments/assets/26f7d347-3d6c-43cf-9645-6eb5906f6ad6
## 🔥 News
- **2025.12.09**: Paper released on arXiv.
- **2025.12.08**: Released the inference code and the videocof-50k weights.
- **2025.12.06**: 🔥 Project page and README updated!
## 📑 Table of Contents
- [🔧 Quick Start](#-quick-start)
- [🏆 Model Zoo](#-model-zoo)
- [🎭 Results](#-results)
- [🎨 Edit Comparison](#-edit-comparison)
- [🚧 TODO](#-todo)
- [🙏 Acknowledgments](#-acknowledgments)
- [📄 License](#-license)
- [📮 Contact](#-contact)
- [📚 Citation](#-citation)
## 🔧 Quick Start
1. **Clone the repository:**
```bash
git clone https://github.com/videocof/VideoCoF.git
cd VideoCoF
```
2. **Install dependencies:**
```bash
# 1. Create and activate a conda environment
conda create -n videocof python=3.10
conda activate videocof
# 2. Install PyTorch (Choose version compatible with your CUDA)
# For standard GPUs (CUDA 12.1):
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121
# For Hopper GPUs (e.g., H100/H800) requiring fast inference:
# pip install torch==2.8.0 torchvision==0.23.0 torchaudio==2.8.0 --index-url https://download.pytorch.org/whl/cu128
# 3. Install other dependencies
pip install -r requirements.txt
```
**Note on Flash Attention:**
We recommend using **FlashAttention-3** (currently beta) for optimal performance, especially on NVIDIA H100/H800 GPUs.
If you are using these GPUs, please follow the [official FlashAttention-3 installation guide](https://github.com/Dao-AILab/flash-attention?tab=readme-ov-file#flashattention-3-beta-release) after installing the compatible PyTorch version (e.g., PyTorch 2.8 + CUDA 12.8).
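Before downloading the models, you can optionally verify that the installed PyTorch build sees your GPU, and whether it is a Hopper-class device (compute capability 9.0) that benefits from FlashAttention-3. A minimal check:
```bash
# Confirm the PyTorch version, CUDA build, and GPU visibility
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"
# Hopper GPUs (H100/H800) report compute capability (9, 0)
python -c "import torch; print(torch.cuda.get_device_capability(0))"
```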
3. **Download Models:**
**Wan2.1-T2V-14B Pretrained Weights:**
```bash
git lfs install
git clone https://huggingface.co/Wan-AI/Wan2.1-T2V-14B
# Or using huggingface-cli:
# hf download Wan-AI/Wan2.1-T2V-14B --local-dir Wan2.1-T2V-14B
```
**VideoCoF Checkpoint:**
```bash
git lfs install
git clone https://huggingface.co/XiangpengYang/VideoCoF videocof_weight
# Or using huggingface-cli:
# hf download XiangpengYang/VideoCoF --local-dir videocof_weight
```
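Assuming you kept the default directory names from the commands above (`Wan2.1-T2V-14B` and `videocof_weight`), a quick sanity check that both downloads completed:
```bash
# Both weight directories should exist and be non-empty before inference
for d in Wan2.1-T2V-14B videocof_weight; do
  [ -d "$d" ] && echo "$d: $(du -sh "$d" | cut -f1)" || echo "MISSING: $d"
done
```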
4. **Inference:**
For single inference tasks:
```bash
# Object Removal
sh scripts/obj_rem.sh
# Object Addition
sh scripts/obj_add.sh
# Local Style Transfer
sh scripts/local_style.sh
```
For parallel inference:
```bash
sh scripts/parallel_infer.sh
```
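On a multi-GPU machine you can also run the single-task scripts side by side by pinning each one to its own device. This is a sketch, assuming the scripts respect `CUDA_VISIBLE_DEVICES` (standard for PyTorch programs); adjust the device indices to your hardware:
```bash
# Run each editing task on a separate GPU and wait for all to finish
CUDA_VISIBLE_DEVICES=0 sh scripts/obj_rem.sh &
CUDA_VISIBLE_DEVICES=1 sh scripts/obj_add.sh &
CUDA_VISIBLE_DEVICES=2 sh scripts/local_style.sh &
wait
```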
## 🏆 Model Zoo
Our models are available on Hugging Face:
| Model Name | Description | Link |
|------------|-------------|------|
| VideoCoF-Base | Base model trained on 50k video pairs | [Hugging Face](https://huggingface.co/XiangpengYang/VideoCoF) |
## 🎭 Results
### Why Do We Need Reasoning Before Editing?

Current video editing methods typically follow one of two paths:
1. **Expert models**: rely on external masks for spatial precision, but give up a unified interface across tasks.
2. **Unified in-context learning models**: mask-free, but often struggle with spatial accuracy because they lack explicit localization cues.
**VideoCoF** bridges this gap by predicting reasoning tokens before generating the target video tokens.
### Key Capabilities
1. **Seeing, Reasoning, Editing**: VideoCoF adopts a "seeing, reasoning, editing" approach, ensuring edits are applied accurately to the intended targets.
2. **Length Extrapolation**: Trained on only **50k** video pairs (33 frames each), VideoCoF demonstrates robust multi-shot editing and length generalization (e.g., 4× length extrapolation).
3. **Diverse Editing Tasks**: Supports fine-grained (instance- and part-level, spatially aware) Object Removal, Object Addition, Object Swap, and Local Style Transfer.
### Gallery Highlights
> Please refer to our [Project Page](https://videocof.github.io) for the full gallery.
* **Object Removal**: Remove people or objects based on text prompts.
* **Object Addition**: Add elements like animals, objects, or people.
* **Object Swap**: Change specific attributes or objects.
* **Local Style Transfer**: Modify textures, materials, or colors.
## 🚧 TODO
- [x] Release paper.
- [x] Release inference code and weights.
- [ ] Release training code.
- [ ] Release training data.
- [ ] Add Hugging Face demo.
## 🙏 Acknowledgments
We thank the authors of related works and the open-source projects [VideoX-Fun](https://github.com/aigc-apps/VideoX-Fun) and [Wan](https://github.com/Wan-Video/Wan2.1) for their contributions.
## 📄 License
This project is licensed under the [Apache License 2.0](LICENSE).
## 📮 Contact
For any questions, feel free to reach out to the author, Xiangpeng Yang ([@knightyxp](https://github.com/knightyxp)), by email: knightyxp@gmail.com or Xiangpeng.Yang@student.uts.edu.au.
## 📚 Citation
If you find this work useful for your research, please consider citing:
```bibtex
@article{yang2025videocof,
  title   = {Unified Video Editing with Temporal Reasoner},
  author  = {Yang, Xiangpeng and Xie, Ji and Yang, Yiyuan and Huang, Yan and Xu, Min and Wu, Qiang},
  journal = {arXiv preprint arXiv:2512.07469},
  year    = {2025}
}
```
<div align="center">
⭐ **If you find this project helpful, please consider giving it a star!** ⭐
</div>
## ⭐️ Star History
[Star History Chart](https://star-history.com/#knightyxp/VideoCoF&Date)