+---
+license: bsd-2-clause
+library_name: diffusers
+pipeline_tag: image-segmentation
+---
+# Generative Video Matting
+This repository contains the Generative Video Matting model, presented in the paper [Generative Video Matting](https://huggingface.co/papers/2508.07905). This novel approach addresses the limitations of traditional video matting by leveraging large-scale pre-training on diverse synthetic and pseudo-labeled segmentation datasets, and by introducing a method that effectively utilizes rich priors from pre-trained video diffusion models. This architecture is designed for video, ensuring strong temporal consistency and bridging the domain gap between synthetic and real-world scenes.
+<div align="center">
+<p align="center">
+<a href='https://yongtaoge.github.io/project/gvm'><img src='https://img.shields.io/badge/Project-Page-Green'></a> &nbsp;
+<a href="https://arxiv.org/abs/2508.07905"><img src="https://img.shields.io/badge/arXiv-2508.07905-b31b1b.svg"></a> &nbsp;
+<a href="https://github.com/aim-uofa/GVM"><img src="https://img.shields.io/badge/GitHub-Code-black?logo=github"></a> &nbsp;
+<a href='https://huggingface.co/datasets/geyongtao/SynHairMan'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Dataset-blue'></a> &nbsp;
+<a href="https://huggingface.co/geyongtao/gvm"><img src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Model-blue"></a>
+</p>
+</div>
+## Abstract
+Video matting has traditionally been limited by the lack of high-quality ground-truth data. Most existing video matting datasets provide only human-annotated imperfect alpha and foreground annotations, which must be composited to background images or videos during the training stage. Thus, the generalization capability of previous methods in real-world scenarios is typically poor. In this work, we propose to solve the problem from two perspectives. First, we emphasize the importance of large-scale pre-training by pursuing diverse synthetic and pseudo-labeled segmentation datasets. We also develop a scalable synthetic data generation pipeline that can render diverse human bodies and fine-grained hairs, yielding around 200 video clips with a 3-second duration for fine-tuning. Second, we introduce a novel video matting approach that can effectively leverage the rich priors from pre-trained video diffusion models. This architecture offers two key advantages. First, strong priors play a critical role in bridging the domain gap between synthetic and real-world scenes. Second, unlike most existing methods that process video matting frame-by-frame and use an independent decoder to aggregate temporal information, our model is inherently designed for video, ensuring strong temporal consistency. We provide a comprehensive quantitative evaluation across three benchmark datasets, demonstrating our approach's superior performance, and present comprehensive qualitative results in diverse real-world scenes, illustrating the strong generalization capability of our method. The code is available at https://github.com/aim-uofa/GVM.
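For readers new to matting, the compositing step mentioned in the abstract is usually written with the standard alpha compositing equation. The symbols below are the conventional ones and are not taken from the text above: $I$ is the composited frame, $F$ the foreground, $B$ the background, and $\alpha \in [0, 1]$ the per-pixel alpha matte.

$$
I = \alpha F + (1 - \alpha) B
$$

Training data built this way pairs an annotated $(\alpha, F)$ with an unrelated $B$, which is one reason composited samples can differ from real footage.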
+## 🚀 Getting Started
+### Environment Requirements 🌍
+First, clone the repository:
+```bash
+git clone https://github.com/aim-uofa/GVM.git
+cd GVM
+```
+Then create a virtual environment (we recommend `conda`) and install the required libraries:
+```bash
+conda create -n gvm python=3.10 -y
+conda activate gvm
+pip install -r requirements.txt
+python setup.py develop
+```
+### Download Model Weights ⬇️
+Download the model weights with:
+```bash
+huggingface-cli download geyongtao/gvm --local-dir data/weights
+```
+After downloading, the directory structure should look like this:
+```
+|-- GVM
+    |-- data
+        |-- weights
+            |-- vae
+                |-- config.json
+                |-- diffusion_pytorch_model.safetensors
+            |-- unet
+                |-- config.json
+                |-- diffusion_pytorch_model.safetensors
+            |-- scheduler
+                |-- scheduler_config.json
+        |-- datasets
+        |-- demo_videos
+```
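A quick way to confirm that the download produced the layout above is a small check script. The sketch below is not part of the GVM repository: the helper name `missing_weight_files` and the constant `EXPECTED_FILES` are hypothetical, and the file list is simply copied from the tree shown above.

```python
from pathlib import Path

# Relative paths copied from the directory tree above.
EXPECTED_FILES = (
    "vae/config.json",
    "vae/diffusion_pytorch_model.safetensors",
    "unet/config.json",
    "unet/diffusion_pytorch_model.safetensors",
    "scheduler/scheduler_config.json",
)


def missing_weight_files(weights_dir="data/weights"):
    """Return the expected files that are absent under weights_dir."""
    root = Path(weights_dir)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_weight_files()
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("All expected weight files are present.")
```

Run it from the repository root (the `GVM` directory) so that the default `data/weights` path resolves correctly.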
+## 🏃🏼 Run
+### Inference 📜
+You can run generative video matting with:
+```bash
+python demo.py \
+--model_base 'data/weights/' \
+--unet_base data/weights/unet \
+--lora_base data/weights/unet \
+--mode 'matte' \
+--num_frames_per_batch 8 \
+--num_interp_frames 1 \
+--num_overlap_frames 1 \
+--denoise_steps 1 \
+--decode_chunk_size 8 \
+--max_resolution 960 \
+--pretrain_type 'svd' \
+--data_dir 'data/demo_videos/xxx.mp4' \
+--output_dir 'output_path'
+```
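To process several videos with the same settings, the command above can be assembled from Python. This is a minimal sketch under stated assumptions: it is run from the repository root, `demo.py` accepts exactly the flags shown above, and `--data_dir` takes a single video file as in the example. The helpers `build_demo_command` and `run_demo` are hypothetical and are not provided by the repository.

```python
import shlex
import subprocess
import sys

# Flag values copied from the demo command above.
DEFAULT_FLAGS = {
    "model_base": "data/weights/",
    "unet_base": "data/weights/unet",
    "lora_base": "data/weights/unet",
    "mode": "matte",
    "num_frames_per_batch": 8,
    "num_interp_frames": 1,
    "num_overlap_frames": 1,
    "denoise_steps": 1,
    "decode_chunk_size": 8,
    "max_resolution": 960,
    "pretrain_type": "svd",
}


def build_demo_command(video_path, output_dir, **overrides):
    """Assemble the argument list for demo.py without running it."""
    unknown = set(overrides) - set(DEFAULT_FLAGS)
    if unknown:
        raise ValueError(f"Unknown flags: {sorted(unknown)}")
    flags = {**DEFAULT_FLAGS, **overrides}
    flags["data_dir"] = video_path
    flags["output_dir"] = output_dir
    command = [sys.executable, "demo.py"]
    for name, value in flags.items():
        command += [f"--{name}", str(value)]
    return command


def run_demo(video_path, output_dir, **overrides):
    """Run demo.py from the repository root; raises if it exits non-zero."""
    command = build_demo_command(video_path, output_dir, **overrides)
    print("Running:", shlex.join(command))
    subprocess.run(command, check=True)
```

For example, `run_demo(path, "output_path", max_resolution=512)` would reuse every default except the resolution, where `path` is one of your own video files. Building the argument list separately from running it makes the command easy to inspect before launching a long job.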
+## 🎫 License
+For academic usage, this project is licensed under [the 2-clause BSD License](https://github.com/aim-uofa/GVM/blob/main/LICENSE). For commercial inquiries, please contact [Chunhua Shen](mailto:chhshen@gmail.com).
+## 🤝 Cite Us
+If you find this work helpful for your research, please cite:
+```bibtex
+@inproceedings{ge2025gvm,
+author = {Ge, Yongtao and Xie, Kangyang and Xu, Guangkai and Ke, Li and Liu, Mingyu and Huang, Longtao and Xue, Hui and Chen, Hao and Shen, Chunhua},
+title = {Generative Video Matting},
+publisher = {Association for Computing Machinery},
+url = {https://doi.org/10.1145/3721238.3730642},
+doi = {10.1145/3721238.3730642},
+booktitle = {Proceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Papers},
+series = {SIGGRAPH Conference Papers '25}
+}
+```