Add comprehensive model card for Generative Video Matting

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +112 -0
README.md ADDED
@@ -0,0 +1,112 @@
+ ---
+ license: bsd-2-clause
+ library_name: diffusers
+ pipeline_tag: image-segmentation
+ ---
+
+ # Generative Video Matting
+
+ This repository contains the Generative Video Matting model, presented in the paper [Generative Video Matting](https://huggingface.co/papers/2508.07905). The approach targets the poor real-world generalization of earlier video matting methods in two ways: large-scale pre-training on diverse synthetic and pseudo-labeled segmentation datasets, and a matting model that leverages the rich priors of pre-trained video diffusion models. Because the architecture is designed for video rather than single frames, it provides strong temporal consistency and helps bridge the domain gap between synthetic and real-world scenes.
+
+ <div align="center">
+ <p align="center">
+ <a href='https://yongtaoge.github.io/project/gvm'><img src='https://img.shields.io/badge/Project-Page-Green'></a> &nbsp;
+ <a href="https://arxiv.org/abs/2508.07905"><img src="https://img.shields.io/badge/arXiv-2508.07905-b31b1b.svg"></a> &nbsp;
+ <a href="https://github.com/aim-uofa/GVM"><img src="https://img.shields.io/badge/GitHub-Code-black?logo=github"></a> &nbsp;
+ <a href='https://huggingface.co/datasets/geyongtao/SynHairMan'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Dataset-blue'></a> &nbsp;
+ <a href="https://huggingface.co/geyongtao/gvm"><img src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Model-blue"></a>
+ </p>
+ </div>
+
+ ## Abstract
+
+ Video matting has traditionally been limited by the lack of high-quality ground-truth data. Most existing video matting datasets provide only human-annotated imperfect alpha and foreground annotations, which must be composited to background images or videos during the training stage. Thus, the generalization capability of previous methods in real-world scenarios is typically poor. In this work, we propose to solve the problem from two perspectives. First, we emphasize the importance of large-scale pre-training by pursuing diverse synthetic and pseudo-labeled segmentation datasets. We also develop a scalable synthetic data generation pipeline that can render diverse human bodies and fine-grained hairs, yielding around 200 video clips with a 3-second duration for fine-tuning. Second, we introduce a novel video matting approach that can effectively leverage the rich priors from pre-trained video diffusion models. This architecture offers two key advantages. First, strong priors play a critical role in bridging the domain gap between synthetic and real-world scenes. Second, unlike most existing methods that process video matting frame-by-frame and use an independent decoder to aggregate temporal information, our model is inherently designed for video, ensuring strong temporal consistency. We provide a comprehensive quantitative evaluation across three benchmark datasets, demonstrating our approach's superior performance, and present comprehensive qualitative results in diverse real-world scenes, illustrating the strong generalization capability of our method. The code is available at https://github.com/aim-uofa/GVM.
+
+ ## 🚀 Getting Started
+
+ ### Environment Requirement 🌍
+
+ First, clone the repository:
+
+ ```bash
+ git clone https://github.com/aim-uofa/GVM.git
+ cd GVM
+ ```
+
+ Then create a virtual environment (we recommend `conda`) and install the required libraries:
+
+ ```bash
+ conda create -n gvm python=3.10 -y
+ conda activate gvm
+ pip install -r requirements.txt
+ python setup.py develop
+ ```
+
+ ### Download Model Weights ⬇️
+
+ Download the model weights with:
+
+ ```bash
+ huggingface-cli download geyongtao/gvm --local-dir data/weights
+ ```
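
The same download can also be scripted from Python. The following is a minimal sketch, assuming the `huggingface_hub` package is installed in the environment; `download_gvm_weights` is a hypothetical helper name and is not part of the GVM repository.

```python
from pathlib import Path


def download_gvm_weights(local_dir: str = "data/weights",
                         repo_id: str = "geyongtao/gvm") -> Path:
    """Download the model repository snapshot into local_dir and return its path."""
    # Imported lazily so this module can be imported without huggingface_hub.
    from huggingface_hub import snapshot_download

    # snapshot_download fetches every file in the repo and returns the local folder.
    path = snapshot_download(repo_id=repo_id, local_dir=local_dir)
    return Path(path)


if __name__ == "__main__":
    print(download_gvm_weights())
```

Run the script from the root of the cloned `GVM` directory so that the relative `data/weights` path matches the layout the demo expects.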
+
+ After downloading, the checkpoint structure should look like this:
+
+ ```
+ |-- GVM
+     |-- data
+         |-- weights
+             |-- vae
+                 |-- config.json
+                 |-- diffusion_pytorch_model.safetensors
+             |-- unet
+                 |-- config.json
+                 |-- diffusion_pytorch_model.safetensors
+             |-- scheduler
+                 |-- scheduler_config.json
+         |-- datasets
+         |-- demo_videos
+ ```
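
A quick way to confirm the download is complete is to check for the files named in the tree. The sketch below assumes that `vae`, `unet`, and `scheduler` sit directly under `data/weights`, as the inference command's `--unet_base data/weights/unet` suggests; `missing_weight_files` is a hypothetical helper, not something shipped with GVM.

```python
from pathlib import Path

# Files named in the checkpoint tree, relative to the weights directory.
EXPECTED_WEIGHT_FILES = (
    "vae/config.json",
    "vae/diffusion_pytorch_model.safetensors",
    "unet/config.json",
    "unet/diffusion_pytorch_model.safetensors",
    "scheduler/scheduler_config.json",
)


def missing_weight_files(weights_dir="data/weights"):
    """Return the expected files that are absent under weights_dir."""
    root = Path(weights_dir)
    return [rel for rel in EXPECTED_WEIGHT_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_weight_files()
    if missing:
        print("Missing files:")
        for rel in missing:
            print(f"  {rel}")
    else:
        print("All expected weight files are present.")
```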
+
+ ## 🏃🏼 Run
+
+ ### Inference 📜
+
+ You can run generative video matting with:
+
+ ```bash
+ python demo.py \
+     --model_base 'data/weights/' \
+     --unet_base data/weights/unet \
+     --lora_base data/weights/unet \
+     --mode 'matte' \
+     --num_frames_per_batch 8 \
+     --num_interp_frames 1 \
+     --num_overlap_frames 1 \
+     --denoise_steps 1 \
+     --decode_chunk_size 8 \
+     --max_resolution 960 \
+     --pretrain_type 'svd' \
+     --data_dir 'data/demo_videos/xxx.mp4' \
+     --output_dir 'output_path'
+ ```
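
To process several clips, the command can be wrapped in a small driver. This is only a sketch: it reuses the flags and values shown in the command, invokes `demo.py` once per `.mp4` file, and writes each result to its own subdirectory. The helper names `build_demo_command` and `run_on_directory` and the per-video output layout are assumptions made for illustration, not behavior documented by the repository.

```python
import subprocess
import sys
from pathlib import Path


def build_demo_command(video_path, output_dir, weights_dir="data/weights"):
    """Build the argv list for demo.py with the flags from the README command."""
    weights = Path(weights_dir).as_posix()
    return [
        sys.executable, "demo.py",
        "--model_base", f"{weights}/",
        "--unet_base", f"{weights}/unet",
        "--lora_base", f"{weights}/unet",
        "--mode", "matte",
        "--num_frames_per_batch", "8",
        "--num_interp_frames", "1",
        "--num_overlap_frames", "1",
        "--denoise_steps", "1",
        "--decode_chunk_size", "8",
        "--max_resolution", "960",
        "--pretrain_type", "svd",
        "--data_dir", Path(video_path).as_posix(),
        "--output_dir", Path(output_dir).as_posix(),
    ]


def run_on_directory(video_dir="data/demo_videos", output_root="output_path"):
    """Run demo.py once per .mp4 file; one output folder per video (assumed layout)."""
    for video in sorted(Path(video_dir).glob("*.mp4")):
        command = build_demo_command(video, Path(output_root) / video.stem)
        # check=True stops the batch at the first failing video.
        subprocess.run(command, check=True)


if __name__ == "__main__":
    run_on_directory()
```

Passing the arguments as a list avoids shell quoting issues with file names that contain spaces.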
+
+ ## 🎫 License
+
+ For academic usage, this project is licensed under [the 2-clause BSD License](https://github.com/aim-uofa/GVM/blob/main/LICENSE). For commercial inquiries, please contact [Chunhua Shen](mailto:chhshen@gmail.com).
+
+ ## 🤝 Cite Us
+
+ If you find this work helpful for your research, please cite:
+
+ ```bibtex
+ @inproceedings{ge2025gvm,
+   author    = {Ge, Yongtao and Xie, Kangyang and Xu, Guangkai and Ke, Li and Liu, Mingyu and Huang, Longtao and Xue, Hui and Chen, Hao and Shen, Chunhua},
+   title     = {Generative Video Matting},
+   publisher = {Association for Computing Machinery},
+   url       = {https://doi.org/10.1145/3721238.3730642},
+   doi       = {10.1145/3721238.3730642},
+   booktitle = {Proceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Papers},
+   series    = {SIGGRAPH Conference Papers '25}
+ }
+ ```