---
license: apache-2.0
tags:
- video-generation
- video-editing
- in-context-learning
- pytorch
pipeline_tag: video-to-video
library_name: transformers
authors:
- XiangpengYang
- horizonwind2004
---

<div align="center">

<h1 style="margin: 0; font-size: 1.8em;">
Unified Video Editing with Temporal Reasoner
</h1>

<h4 style="margin: 15px 0; color: #2c3e50;">
👁️ See → 🧠 Reason → ✂️ Edit
</h4>

<h4 style="margin: 15px 0; color: #2c3e50;">
🚀 A Chain-of-Frames editing method that enables temporal reasoning and 4× video-length generalization with just 50k training pairs!
</h4>

<a href="https://huggingface.co/papers/2512.07469"><img src="https://img.shields.io/badge/HuggingFace-Daily_Paper-ffd21e.svg" alt="Daily Paper"></a>
<a href="https://arxiv.org/abs/2512.07469"><img src="https://img.shields.io/badge/arXiv-2512.07469-b31b1b.svg" alt="arXiv"></a>
<a href="https://videocof.github.io"><img src="https://img.shields.io/badge/Project-Page-green" alt="Project Page"></a>
<a href="https://github.com/knightyxp/VideoCoF"><img src="https://img.shields.io/badge/GitHub-Repo-blue?logo=github" alt="GitHub"></a>

</div>

<div align="center">
<b>
<a href="https://scholar.google.com/citations?user=reiIeYMAAAAJ">Xiangpeng Yang</a><sup>1</sup>,
<a href="https://horizonwind2004.github.io/">Ji Xie</a><sup>2</sup>,
<a href="https://scholar.google.com/citations?user=OvfI_HMAAAAJ">Yiyuan Yang</a><sup>1</sup>,
<a href="https://scholar.google.com/citations?user=zfeWd6gAAAAJ">Yan Huang</a><sup>1</sup>,
<a href="https://scholar.google.com/citations?user=sCuACdkAAAAJ">Min Xu</a><sup>1</sup>,
<a href="https://scholar.google.com/citations?user=sCuACdkAAAAJ">Qiang Wu</a><sup>1</sup>
</b>
<br>
<span style="font-size: 1em; color: #555;"><sup>1</sup>University of Technology Sydney, <sup>2</sup>Zhejiang University</span>
</div>

<br>

|
| | # VideoCoF: Unified Video Editing with Temporal Reasoner |
| |
|
| |
|
| | **VideoCoF** is a unified video editing model that bridges the gap between expert models (precise but restricted) and unified in-context models (flexible but spatially inaccurate). By introducing a **"See → Reason → Edit"**, a Chain-of-Frames paradigm, VideoCoF predicts reasoning tokens before generating the target video tokens, thereby removing the need for user-provided masks while achieving precise instruction to-region alignment. |
| |
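The "See → Reason → Edit" flow can be pictured as an ordering constraint on the generated sequence. The sketch below is purely illustrative and is not the repository's implementation; the frame counts mirror the `--source_frames 33` / `--reasoning_frames 4` defaults used by the inference command in this card:

```python
# Illustrative sketch only: a Chain-of-Frames sequence lays out source-video
# tokens first, then reasoning tokens, then target-video tokens, so the model
# "sees" and "reasons" before it "edits".
def chain_of_frames_layout(source_frames, reasoning_frames, target_frames):
    """Return the per-frame role, in generation order."""
    return (["source"] * source_frames
            + ["reasoning"] * reasoning_frames
            + ["target"] * target_frames)

layout = chain_of_frames_layout(source_frames=33, reasoning_frames=4, target_frames=33)
```

Because the reasoning frames precede every target frame, the edit region is resolved before any edited pixels are produced.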
|
| | <div align="center"> |
| | <a href="https://www.youtube.com/watch?v=XrYj0Qmc49w" target="_blank"> |
| | <img src="https://img.youtube.com/vi/XrYj0Qmc49w/maxresdefault.jpg" |
| | alt="Video Demo" |
| | width="80%" |
| | style="max-width:900px; border-radius:10px; box-shadow:0 0 10px rgba(0,0,0,0.15);"> |
| | </a> |
| | <br> |
| | <em>Click the image above to watch the full video on YouTube ๐ฌ</em> |
| | </div> |
| | |
| | ## ๐ Key Capabilities |
| |  |
| |
|
| | 1. **Temporal Reasoning**: Adopts a unique approach where the model first identifies *where* and *how* to edit (Reasoning) before predicting the target video tokens. |
| | 2. **Data Efficiency**: Achieves SOTA performance with only **50k training pairs** (33 frames each). |
| | 3. **Length Extrapolation**: Demonstrates robust multi-shot editing and can generalize to videos **4× longer** than training samples. |
| | 4. **Versatile Editing**: Supports: |
| | * Object Removal |
| | * Object Addition |
| | * Object Swap |
| | * Local Style Transfer |
| |
|
| | ## ๐ง Quick Start |
| |
|
| | To use these weights, please refer to the official [GitHub Repository](https://github.com/knightyxp/VideoCoF) for inference code and environment setup. |
| |
|
| | ### Installation |
| |
|
| | ```bash |
| | git clone https://github.com/knightyxp/VideoCoF |
| | cd VideoCoF |
| | |
| | # 1. Create and activate a conda environment |
| | conda create -n videocof python=3.10 |
| | conda activate videocof |
| | |
| | # 2. Install PyTorch (Choose version compatible with your CUDA) |
| | # For standard GPUs (CUDA 12.1): |
| | pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121 |
| | |
| | # For Hopper GPUs (e.g., H100/H800) requiring fast inference: |
| | # pip install torch==2.8.0 torchvision==0.23.0 torchaudio==2.8.0 --index-url https://download.pytorch.org/whl/cu128 |
| | |
| | # 3. Install other dependencies |
| | pip install -r requirements.txt |
| | ``` |
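Before downloading the (large) weights, it can help to confirm that the installed PyTorch build actually sees a CUDA device. `cuda_ready` is a hypothetical helper, not part of the repository:

```python
import importlib.util

def cuda_ready() -> bool:
    # Hypothetical sanity check: True only if torch imports cleanly
    # and reports at least one visible CUDA device.
    if importlib.util.find_spec("torch") is None:
        return False
    import torch
    return torch.cuda.is_available()

print("CUDA ready:", cuda_ready())
```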

**Note on FlashAttention:**
We recommend **FlashAttention-3** (currently in beta) for optimal performance, especially on NVIDIA H100/H800 GPUs.
If you are using these GPUs, follow the [official FlashAttention-3 installation guide](https://github.com/Dao-AILab/flash-attention?tab=readme-ov-file#flashattention-3-beta-release) after installing a compatible PyTorch version (e.g., PyTorch 2.8 + CUDA 12.8).

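Since FlashAttention is optional, inference scripts typically guard for its absence. The fallback logic below is illustrative only (it is not the repository's actual dispatch code); PyTorch's built-in SDPA kernels are the usual default:

```python
import importlib.util

def pick_attention_backend() -> str:
    # Illustrative fallback: use FlashAttention when the wheel is installed,
    # otherwise rely on PyTorch's scaled_dot_product_attention kernels.
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attn"
    return "sdpa"  # torch.nn.functional.scaled_dot_product_attention
```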
|
| | ### Download Models |
| |
|
| | * **Wan-2.1-T2V-14B Pretrained Weights:** |
| | |
| | ```bash |
| | git lfs install |
| | git clone https://huggingface.co/Wan-AI/Wan2.1-T2V-14B |
| | |
| | # Or using huggingface-cli: |
| | # hf download Wan-AI/Wan2.1-T2V-14B --local-dir Wan2.1-T2V-14B |
| | ``` |
| | |
| | * **VideoCoF Checkpoint:** |
| | |
| | ```bash |
| | git lfs install |
| | git clone https://huggingface.co/XiangpengYang/VideoCoF videocof_weight |
| | |
| | # Or using huggingface-cli: |
| | # hf download XiangpengYang/VideoCoF --local-dir videocof_weight |
| | ``` |
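The same downloads can also be scripted with `huggingface_hub.snapshot_download`. `fetch_weights` is a hypothetical wrapper, and the Wan2.1-T2V-14B snapshot is tens of gigabytes, so call it deliberately:

```python
def fetch_weights(local_root: str = ".") -> None:
    # Lazy import keeps huggingface_hub optional until you actually call this.
    from huggingface_hub import snapshot_download
    snapshot_download(repo_id="Wan-AI/Wan2.1-T2V-14B",
                      local_dir=f"{local_root}/Wan2.1-T2V-14B")
    snapshot_download(repo_id="XiangpengYang/VideoCoF",
                      local_dir=f"{local_root}/videocof_weight")
```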
| | |
### Inference

```bash
export CUDA_VISIBLE_DEVICES=0
torchrun --nproc_per_node=1 inference.py \
    --video_path assets/two_man.mp4 \
    --prompt "Remove the young man with short black hair wearing black shirt on the left." \
    --output_dir results/obj_rem \
    --model_name ./Wan2.1-T2V-14B \
    --seed 0 \
    --num_frames 33 \
    --source_frames 33 \
    --reasoning_frames 4 \
    --repeat_rope \
    --videocof_path videocof_weight/videocof.safetensors
```

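To run several edits with the same settings, the invocation can be templated from Python. `build_inference_cmd` is a hypothetical convenience wrapper, not part of the repository; it simply reproduces the flags of the single-video command above:

```python
import shlex

def build_inference_cmd(video: str, prompt: str, output_dir: str,
                        model_name: str = "Wan2.1-T2V-14B",
                        videocof_path: str = "videocof_weight/videocof.safetensors") -> str:
    # Hypothetical helper: assemble the torchrun command for one edit,
    # shell-quoting the prompt so spaces survive.
    args = ["torchrun", "--nproc_per_node=1", "inference.py",
            "--video_path", video,
            "--prompt", prompt,
            "--output_dir", output_dir,
            "--model_name", model_name,
            "--seed", "0",
            "--num_frames", "33",
            "--source_frames", "33",
            "--reasoning_frames", "4",
            "--repeat_rope",
            "--videocof_path", videocof_path]
    return " ".join(shlex.quote(a) for a in args)

cmd = build_inference_cmd("assets/two_man.mp4",
                          "Remove the young man on the left.",
                          "results/obj_rem")
```

Pass the resulting string to `subprocess.run(cmd, shell=True)` or write it into a batch script.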
|
| | For parallel inference: |
| |
|
| | ```bash |
| | sh scripts/parallel_infer.sh |
| | ``` |
| |
|
## 🙏 Acknowledgments

We thank the authors of related works and the open-source projects [VideoX-Fun](https://github.com/aigc-apps/VideoX-Fun) and [Wan](https://github.com/Wan-Video/Wan2.1) for their contributions.

## 📄 License

This project is licensed under the [Apache License 2.0](LICENSE).

## 📮 Contact

For any questions, please feel free to reach out to the author Xiangpeng Yang [@knightyxp](https://github.com/knightyxp) via email at knightyxp@gmail.com or Xiangpeng.Yang@student.uts.edu.au.

## 📚 Citation

If you find this work useful for your research, please consider citing:

```bibtex
@article{yang2025videocof,
  title={Unified Video Editing with Temporal Reasoner},
  author={Yang, Xiangpeng and Xie, Ji and Yang, Yiyuan and Huang, Yan and Xu, Min and Wu, Qiang},
  journal={arXiv preprint arXiv:2512.07469},
  year={2025}
}
```

<div align="center">
❤️ <b>If you find this project helpful, please consider giving it a like!</b> ❤️
</div>