Spaces: Running on Zero

Commit · d3fc679
Parent(s): beb2ec7
clear readme

README.md CHANGED
@@ -9,214 +9,4 @@ app_file: app.py
pinned: false
license: apache-2.0
short_description: Unified Video Editing with Temporal Reasoner
---
<div align="center">

<h1 style="margin: 0; font-size: 2.4em;">
Unified Video Editing with Temporal Reasoner
</h1>

<h4 style="margin: 15px 0; color: #2c3e50;">
👁️ See → 🧠 Reason → ✏️ Edit
</h4>

<h4 style="margin: 15px 0; color: #2c3e50;">
🚀 A Chain-of-Frames video editing method that enables temporal reasoning and 4× video length extrapolation with just 50k training pairs!
</h4>

[Hugging Face Paper](https://huggingface.co/papers/2512.07469)
[arXiv](https://arxiv.org/abs/2512.07469)
[Project Page](https://videocof.github.io)
[Model Weights](https://huggingface.co/XiangpengYang/VideoCoF)

</div>

<div align="center">
<b>
<a href="https://scholar.google.com/citations?user=reiIeYMAAAAJ">Xiangpeng Yang</a><sup>1</sup>,
<a href="https://horizonwind2004.github.io/">Ji Xie</a><sup>2</sup>,
<a href="https://scholar.google.com/citations?user=OvfI_HMAAAAJ">Yiyuan Yang</a><sup>1</sup>,
<a href="https://scholar.google.com/citations?user=zfeWd6gAAAAJ">Yan Huang</a><sup>1</sup>,
<a href="https://scholar.google.com/citations?user=sCuACdkAAAAJ">Min Xu</a><sup>1</sup>,
<a href="https://scholar.google.com/citations?user=sCuACdkAAAAJ">Qiang Wu</a><sup>1</sup>
</b>
<br>
<span style="font-size: 1em; color: #555;"><sup>1</sup>University of Technology Sydney, <sup>2</sup>Zhejiang University</span>
</div>

<br>
## 💿 Introduction

https://github.com/user-attachments/assets/26f7d347-3d6c-43cf-9645-6eb5906f6ad6
## 🔥 News

- **2025.12.09**: Paper available on arXiv.
- **2025.12.08**: Released the inference code and the videocof-50k weights.
- **2025.12.06**: 🔥 Project Page and README updated!
## 📑 Table of Contents

- [🔧 Quick Start](#-quick-start)
- [🏆 Model Zoo](#-model-zoo)
- [🍭 Results](#-results)
- [🎨 Edit Comparison](#-edit-comparison)
- [🚧 TODO](#-todo)
- [🙏 Acknowledgments](#-acknowledgments)
- [📜 License](#-license)
- [📮 Contact](#-contact)
- [📄 Citation](#-citation)
## 🔧 Quick Start

1. **Clone the repository:**

```bash
git clone https://github.com/videocof/VideoCoF.git
cd VideoCoF
```

2. **Install dependencies:**

```bash
# 1. Create and activate a conda environment
conda create -n videocof python=3.10
conda activate videocof

# 2. Install PyTorch (choose the version compatible with your CUDA toolkit)
# For standard GPUs (CUDA 12.1):
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121

# For Hopper GPUs (e.g., H100/H800) requiring fast inference:
# pip install torch==2.8.0 torchvision==0.23.0 torchaudio==2.8.0 --index-url https://download.pytorch.org/whl/cu128

# 3. Install the remaining dependencies
pip install -r requirements.txt
```
**Note on Flash Attention:**
We recommend **FlashAttention-3** (currently in beta) for optimal performance, especially on NVIDIA H100/H800 GPUs.
If you are using these GPUs, please follow the [official FlashAttention-3 installation guide](https://github.com/Dao-AILab/flash-attention?tab=readme-ov-file#flashattention-3-beta-release) after installing a compatible PyTorch version (e.g., PyTorch 2.8 + CUDA 12.8).
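Before following that guide, it can help to confirm that your GPU is actually a Hopper-class device. The quick check below uses only plain PyTorch calls and is not part of the VideoCoF codebase:

```python
import torch

# FlashAttention-3 targets Hopper GPUs (compute capability 9.0, e.g., H100/H800).
# This check reports what the current CUDA device supports.
if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability()
    name = torch.cuda.get_device_name()
    print(f"{name}: compute capability {major}.{minor}")
    if major >= 9:
        print("Hopper-class GPU detected: FlashAttention-3 should apply.")
    else:
        print("Not a Hopper GPU: consider FlashAttention-2 instead.")
else:
    print("No CUDA device available.")
```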
3. **Download Models:**

**Wan2.1-T2V-14B Pretrained Weights:**

```bash
git lfs install
git clone https://huggingface.co/Wan-AI/Wan2.1-T2V-14B

# Or using the Hugging Face CLI:
# hf download Wan-AI/Wan2.1-T2V-14B --local-dir Wan2.1-T2V-14B
```
**VideoCoF Checkpoint:**

```bash
git lfs install
git clone https://huggingface.co/XiangpengYang/VideoCoF videocof_weight

# Or using the Hugging Face CLI:
# hf download XiangpengYang/VideoCoF --local-dir videocof_weight
```
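If you prefer a programmatic download, the same checkpoints can be fetched with the `huggingface_hub` Python library; a minimal sketch, mirroring the target directories used by the shell commands above:

```python
from huggingface_hub import snapshot_download

# Download the Wan2.1-T2V-14B pretrained weights and the VideoCoF checkpoint
# into the same local directories the shell commands above would create.
snapshot_download("Wan-AI/Wan2.1-T2V-14B", local_dir="Wan2.1-T2V-14B")
snapshot_download("XiangpengYang/VideoCoF", local_dir="videocof_weight")
```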
4. **Inference:**

For individual editing tasks:

```bash
# Object Removal
sh scripts/obj_rem.sh

# Object Addition
sh scripts/obj_add.sh

# Local Style Transfer
sh scripts/local_style.sh
```

For parallel inference:

```bash
sh scripts/parallel_infer.sh
```
## 🏆 Model Zoo

Our models are available on Hugging Face:

| Model Name | Description | Link |
|------------|-------------|------|
| VideoCoF-Base | Base model trained on 50k video pairs | [Hugging Face](https://huggingface.co/XiangpengYang/VideoCoF) |
## 🍭 Results

### Why Do We Need Reasoning Before Editing?

Current video editing methods typically follow two paths:

1. **Expert models**: Rely on external masks for precision but sacrifice unification.
2. **Unified in-context learning models**: Mask-free, but often struggle with spatial accuracy due to the lack of explicit cues.

**VideoCoF** bridges this gap by predicting reasoning tokens before generating the target video tokens, as sketched below.
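To make the decoding order concrete, here is a minimal, illustrative sketch. All names (`ChainOfFramesEditor` and its methods) are hypothetical placeholders, not the actual VideoCoF implementation; the point is only the ordering of the three stages:

```python
# Illustrative "see -> reason -> edit" decoding order (hypothetical placeholders,
# NOT the VideoCoF API): reasoning tokens are produced before any target video.

class ChainOfFramesEditor:
    def encode(self, source_frames, instruction):
        # 1. See: build a joint representation of the source video and instruction.
        return {"frames": source_frames, "instruction": instruction}

    def generate_reasoning(self, context):
        # 2. Reason: emit reasoning tokens that localize the edit
        #    (which region/frames to change) before any video is generated.
        return f"reason({context['instruction']})"

    def generate_video(self, context, reasoning_tokens):
        # 3. Edit: generate target video tokens conditioned on the source
        #    context AND the predicted reasoning tokens.
        return [f"edited({frame}, {reasoning_tokens})" for frame in context["frames"]]

editor = ChainOfFramesEditor()
context = editor.encode(["frame0", "frame1"], "remove the red car")
reasoning = editor.generate_reasoning(context)
print(editor.generate_video(context, reasoning))
```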
### Key Capabilities

1. **Seeing, Reasoning, Editing**: VideoCoF adopts a "seeing, reasoning, editing" approach, ensuring edits are applied accurately to the intended targets.
2. **Length Extrapolation**: Trained on only **50k** video pairs (33 frames each), VideoCoF demonstrates robust multi-shot editing and length generalization (e.g., 4× length extrapolation).
3. **Diverse Editing Tasks**: Supports fine-grained (instance- and part-level, spatially aware) Object Removal, Object Addition, Object Swap, and Local Style Transfer.
### Gallery Highlights

> Please refer to our [Project Page](https://videocof.github.io) for the full gallery.

* **Object Removal**: Remove people or objects based on text prompts.
* **Object Addition**: Add elements like animals, objects, or people.
* **Object Swap**: Change specific attributes or objects.
* **Local Style Transfer**: Modify textures, materials, or colors.
## 🚧 TODO

- [x] Release paper.
- [x] Release inference code and weights.
- [ ] Release training code.
- [ ] Release training data.
- [ ] Add Hugging Face demo.
## 🙏 Acknowledgments

We thank the authors of related works and the open-source projects [VideoX-Fun](https://github.com/aigc-apps/VideoX-Fun) and [Wan](https://github.com/Wan-Video/Wan2.1) for their contributions.
## 📜 License

This project is licensed under the [Apache License 2.0](LICENSE).
## 📮 Contact

For any questions, please feel free to reach out to the author, Xiangpeng Yang ([@knightyxp](https://github.com/knightyxp)), via email: knightyxp@gmail.com or Xiangpeng.Yang@student.uts.edu.au.
## 📄 Citation

If you find this work useful for your research, please consider citing:

```bibtex
@article{yang2025videocof,
  title={Unified Video Editing with Temporal Reasoner},
  author={Yang, Xiangpeng and Xie, Ji and Yang, Yiyuan and Huang, Yan and Xu, Min and Wu, Qiang},
  journal={arXiv preprint arXiv:2512.07469},
  year={2025}
}
```
<div align="center">
⭐ **If you find this project helpful, please consider giving it a star!** ⭐
</div>

## ⭐️ Star History

[Star History Chart](https://star-history.com/#knightyxp/VideoCoF&Date)