---
title: VideoCoF
emoji: 🎥
colorFrom: blue
colorTo: indigo
sdk: gradio
sdk_version: 5.44.1
app_file: app.py
pinned: false
license: apache-2.0
short_description: Unified Video Editing with Temporal Reasoner
---

<div align="center">

  <h1 style="margin: 0; font-size: 2.4em;">
    Unified Video Editing with Temporal Reasoner
  </h1>

  <h4 style="margin: 15px 0; color: #2c3e50;">
    ๐Ÿ‘๏ธ See &rarr; ๐Ÿง  Reason &rarr; โœ๏ธ Edit
  </h4>

  <h4 style="margin: 15px 0; color: #2c3e50;">
    🚀 A Chain-of-Frames video editing method enabling temporal reasoning and 4&times; video length extrapolation with just 50k training pairs!
  </h4>

  [![Hugging Face Daily Paper](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Daily%20Paper-yellow)](https://huggingface.co/papers/2512.07469)
  [![arXiv](https://img.shields.io/badge/arXiv-2512.07469-b31b1b.svg)](https://arxiv.org/abs/2512.07469)
  [![Project Page](https://img.shields.io/badge/Project-Page-green)](https://videocof.github.io)
  [![Hugging Face Model](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Model-yellow)](https://huggingface.co/XiangpengYang/VideoCoF)
  ![visitors](https://visitor-badge.laobi.icu/badge?page_id=videocof.VideoCoF&left_color=green&right_color=red)

</div>

<div align="center">
  <b>
    <a href="https://scholar.google.com/citations?user=reiIeYMAAAAJ">Xiangpeng Yang</a><sup>1</sup>,
    <a href="https://horizonwind2004.github.io/">Ji Xie</a><sup>2</sup>,
    <a href="https://scholar.google.com/citations?user=OvfI_HMAAAAJ">Yiyuan Yang</a><sup>1</sup>,
    <a href="https://scholar.google.com/citations?user=zfeWd6gAAAAJ">Yan Huang</a><sup>1</sup>,
    <a href="https://scholar.google.com/citations?user=sCuACdkAAAAJ">Min Xu</a><sup>1</sup>,
    <a href="https://scholar.google.com/citations?user=sCuACdkAAAAJ">Qiang Wu</a><sup>1</sup>
  </b>
  <br>
  <span style="font-size: 1em; color: #555;"><sup>1</sup>University of Technology Sydney, <sup>2</sup>Zhejiang University</span>
</div>

<br>

## 💿 Introduction

https://github.com/user-attachments/assets/26f7d347-3d6c-43cf-9645-6eb5906f6ad6

## 🔥 News

- **2025.12.09**: Paper available on arXiv.
- **2025.12.08**: Released the inference code and the videocof-50k weights.
- **2025.12.06**: 🔥 Project Page and README updated!


## 📑 Table of Contents

- [🔧 Quick Start](#-quick-start)
- [🏆 Model Zoo](#-model-zoo)
- [🏭 Results](#-results)
- [🚧 TODO](#-todo)
- [🙏 Acknowledgments](#-acknowledgments)
- [📜 License](#-license)
- [📮 Contact](#-contact)
- [📄 Citation](#-citation)

## 🔧 Quick Start

1.  **Clone the repository:**

    ```bash
    git clone https://github.com/videocof/VideoCoF.git
    cd VideoCoF
    ```

2.  **Install dependencies:**

    ```bash
    # 1. Create and activate a conda environment
    conda create -n videocof python=3.10
    conda activate videocof

    # 2. Install PyTorch (Choose version compatible with your CUDA)
    # For standard GPUs (CUDA 12.1):
    pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121
    
    # For Hopper GPUs (e.g., H100/H800), if you want fast inference with FlashAttention-3:
    # pip install torch==2.8.0 torchvision==0.23.0 torchaudio==2.8.0 --index-url https://download.pytorch.org/whl/cu128
    
    # 3. Install other dependencies
    pip install -r requirements.txt
    ```

    **Note on Flash Attention:**
    We recommend using **FlashAttention-3** (currently beta) for optimal performance, especially on NVIDIA H100/H800 GPUs. 
    If you are using these GPUs, please follow the [official FlashAttention-3 installation guide](https://github.com/Dao-AILab/flash-attention?tab=readme-ov-file#flashattention-3-beta-release) after installing the compatible PyTorch version (e.g., PyTorch 2.8 + CUDA 12.8).
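
    After installing, a quick sanity check can confirm the environment. This is a minimal sketch that assumes only `torch` and, optionally, the `flash_attn` package:

    ```python
    import torch

    # Confirm the PyTorch / CUDA combination installed above is visible.
    print("torch:", torch.__version__,
          "| CUDA:", torch.version.cuda,
          "| GPU available:", torch.cuda.is_available())

    # FlashAttention is optional; report its version if present.
    try:
        import flash_attn
        print("flash-attn:", flash_attn.__version__)
    except ImportError:
        print("flash-attn not installed (optional)")
    ```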


3.  **Download Models:**

    **Wan-2.1-T2V-14B Pretrained Weights:**

    ```bash
    git lfs install
    git clone https://huggingface.co/Wan-AI/Wan2.1-T2V-14B

    # Or using huggingface-cli:
    # hf download Wan-AI/Wan2.1-T2V-14B --local-dir Wan2.1-T2V-14B
    ```

    **VideoCoF Checkpoint:**

    ```bash
    git lfs install
    git clone https://huggingface.co/XiangpengYang/VideoCoF videocof_weight

    # Or using huggingface-cli:
    # hf download XiangpengYang/VideoCoF --local-dir videocof_weight
    ```
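
    If you prefer Python, the `huggingface_hub` library offers an equivalent download path (the target directories below match the CLI commands above):

    ```python
    from huggingface_hub import snapshot_download

    # Fetch both repos into the same local directories used by the git commands.
    snapshot_download(repo_id="Wan-AI/Wan2.1-T2V-14B", local_dir="Wan2.1-T2V-14B")
    snapshot_download(repo_id="XiangpengYang/VideoCoF", local_dir="videocof_weight")
    ```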

4.  **Inference:**

    For single inference tasks:

    ```bash
    # Object Removal
    sh scripts/obj_rem.sh

    # Object Addition
    sh scripts/obj_add.sh

    # Local Style Transfer
    sh scripts/local_style.sh
    ```

    For parallel inference:

    ```bash
    sh scripts/parallel_infer.sh
    ```

## ๐Ÿ† Model Zoo

Our models are available on Hugging Face:

| Model Name | Description | Link |
|------------|-------------|------|
| VideoCoF-Base | Base model trained on 50k video pairs | [Hugging Face](https://huggingface.co/XiangpengYang/VideoCoF) |

## ๐Ÿญ Results

### Why Do We Need Reasoning Before Editing?
![](assets/motivation_v2.gif)

Current video editing methods typically follow two paths:
1.  **Expert models**: Rely on external masks for precision but sacrifice unification.
2.  **Unified in-context learning models**: Mask-free but often struggle with spatial accuracy due to the lack of explicit cues.

**VideoCoF** bridges this gap by predicting reasoning tokens before generating the target video tokens.
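
As a rough conceptual sketch (toy code, not the actual VideoCoF implementation; the shapes and mean-pooling stand-ins are purely illustrative), the essential ordering is that reasoning tokens come first and the edited video tokens are conditioned on them:

```python
import torch

# Toy illustration of the "see -> reason -> edit" ordering (NOT the real model):
# the generator attends to explicit reasoning tokens rather than having to
# localize the edit implicitly from the instruction alone.
d_model, n_frames = 16, 33
source_tokens = torch.randn(n_frames, d_model)           # "see": encoded source clip

# "reason": a stand-in reasoner emitting a few tokens that localize the edit.
reasoning_tokens = source_tokens.mean(dim=0, keepdim=True).repeat(4, 1)

# "edit": target tokens are generated conditioned on source + reasoning tokens.
condition = torch.cat([source_tokens, reasoning_tokens], dim=0)
target_tokens = condition.mean(dim=0, keepdim=True).repeat(n_frames, 1)

print(condition.shape, target_tokens.shape)  # torch.Size([37, 16]) torch.Size([33, 16])
```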

### Key Capabilities

1.  **Seeing, Reasoning, Editing**: VideoCoF adopts a "seeing, reasoning, editing" approach, ensuring edits are applied accurately to the intended targets.
2.  **Length Extrapolation**: Trained on only **50k** video pairs (33 frames each), VideoCoF demonstrates robust multi-shot editing and length generalization (e.g., 4&times; length extrapolation).
3.  **Diverse Editing Tasks**: Supports fine-grained (instance- and part-level, spatially aware) Object Removal, Object Addition, Object Swap, and Local Style Transfer.

### Gallery Highlights

> Please refer to our [Project Page](https://videocof.github.io) for the full gallery.

*   **Object Removal**: Remove people or objects based on text prompts.
*   **Object Addition**: Add elements like animals, objects, or people.
*   **Object Swap**: Change specific attributes or objects.
*   **Local Style Transfer**: Modify textures, materials, or colors.

## 🚧 TODO

- [x] Release paper.
- [x] Release inference code and weights.
- [ ] Release training code.
- [ ] Release training data.
- [ ] Add Hugging Face demo.

## ๐Ÿ™ Acknowledgments

We thank the authors of related works and the open-source communities behind [VideoX-Fun](https://github.com/aigc-apps/VideoX-Fun) and [Wan](https://github.com/Wan-Video/Wan2.1) for their contributions.

## 📜 License

This project is licensed under the [Apache License 2.0](LICENSE).

## 📮 Contact

For any questions, feel free to reach out to the author Xiangpeng Yang ([@knightyxp](https://github.com/knightyxp)) by email: knightyxp@gmail.com or Xiangpeng.Yang@student.uts.edu.au.

## 📄 Citation

If you find this work useful for your research, please consider citing:

```bibtex
@article{yang2025videocof,
  title={Unified Video Editing with Temporal Reasoner},
  author={Yang, Xiangpeng and Xie, Ji and Yang, Yiyuan and Huang, Yan and Xu, Min and Wu, Qiang},
  journal={arXiv preprint arXiv:2512.07469},
  year={2025}
}
```

<div align="center">
  โญ **If you find this project helpful, please consider giving it a star!** โญ
</div>

## ⭐️ Star History

[![Star History Chart](https://api.star-history.com/svg?repos=knightyxp/VideoCoF&type=Date&legend=top-left)](https://star-history.com/#knightyxp/VideoCoF&Date)