Add comprehensive model card for Generative Video Matting
#1
by nielsr HF Staff - opened
README.md
ADDED
@@ -0,0 +1,112 @@
---
license: bsd-2-clause
library_name: diffusers
pipeline_tag: image-segmentation
---

# Generative Video Matting

This repository contains the Generative Video Matting model, presented in the paper [Generative Video Matting](https://huggingface.co/papers/2508.07905). This novel approach addresses the limitations of traditional video matting by leveraging large-scale pre-training on diverse synthetic and pseudo-labeled segmentation datasets, and by introducing a method that effectively utilizes rich priors from pre-trained video diffusion models. This architecture is designed for video, ensuring strong temporal consistency and bridging the domain gap between synthetic and real-world scenes.

<div align="center">
<p align="center">
<a href='https://yongtaoge.github.io/project/gvm'><img src='https://img.shields.io/badge/Project-Page-Green'></a>
<a href="https://arxiv.org/abs/2508.07905"><img src="https://img.shields.io/badge/arXiv-2508.07905-b31b1b.svg"></a>
<a href="https://github.com/aim-uofa/GVM"><img src="https://img.shields.io/badge/GitHub-Code-black?logo=github"></a>
<a href='https://huggingface.co/datasets/geyongtao/SynHairMan'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Dataset-blue'></a>
<a href="https://huggingface.co/geyongtao/gvm"><img src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Model-blue"></a>
</p>
</div>

## Abstract

Video matting has traditionally been limited by the lack of high-quality ground-truth data. Most existing video matting datasets provide only human-annotated imperfect alpha and foreground annotations, which must be composited to background images or videos during the training stage. Thus, the generalization capability of previous methods in real-world scenarios is typically poor. In this work, we propose to solve the problem from two perspectives. First, we emphasize the importance of large-scale pre-training by pursuing diverse synthetic and pseudo-labeled segmentation datasets. We also develop a scalable synthetic data generation pipeline that can render diverse human bodies and fine-grained hairs, yielding around 200 video clips with a 3-second duration for fine-tuning. Second, we introduce a novel video matting approach that can effectively leverage the rich priors from pre-trained video diffusion models. This architecture offers two key advantages. First, strong priors play a critical role in bridging the domain gap between synthetic and real-world scenes. Second, unlike most existing methods that process video matting frame-by-frame and use an independent decoder to aggregate temporal information, our model is inherently designed for video, ensuring strong temporal consistency. We provide a comprehensive quantitative evaluation across three benchmark datasets, demonstrating our approach's superior performance, and present comprehensive qualitative results in diverse real-world scenes, illustrating the strong generalization capability of our method. The code is available at https://github.com/aim-uofa/GVM.
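For readers new to matting, the compositing step mentioned in the abstract follows the standard alpha-compositing equation, I = αF + (1 − α)B. The snippet below is a minimal per-pixel sketch in plain Python for illustration only; the function name `composite_pixel` is hypothetical and is not part of the GVM codebase.

```python
from typing import Tuple

RGB = Tuple[float, float, float]


def composite_pixel(alpha: float, fg: RGB, bg: RGB) -> RGB:
    """Blend one foreground pixel over one background pixel.

    Standard alpha compositing: I = alpha * F + (1 - alpha) * B,
    with alpha and all colour channels expressed in [0, 1].
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return tuple(alpha * f + (1.0 - alpha) * b for f, b in zip(fg, bg))


# A half-transparent white hair strand over a black background.
pixel = composite_pixel(0.5, (1.0, 1.0, 1.0), (0.0, 0.0, 0.0))
```

A matting model estimates `alpha` (and sometimes the foreground) for every pixel of every frame, so that the subject can be re-composited over a new background this way.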

## Getting Started

### Environment Requirement

First, clone the repository:

```bash
git clone https://github.com/aim-uofa/GVM.git
cd GVM
```

Next, we recommend using `conda` to create a virtual environment and install the required libraries:

```bash
conda create -n gvm python=3.10 -y
conda activate gvm
pip install -r requirements.txt
python setup.py develop
```

### Download Model Weights ⬇️

Download the model weights with:

```bash
huggingface-cli download geyongtao/gvm --local-dir data/weights
```
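If you prefer to fetch the weights from Python instead of the command line, the `huggingface_hub` package offers `snapshot_download`. The sketch below assumes `huggingface_hub` is installed in the environment; the wrapper name `download_gvm_weights` is hypothetical and not part of the GVM repository.

```python
from pathlib import Path


def download_gvm_weights(local_dir: str = "data/weights") -> Path:
    """Download the geyongtao/gvm snapshot into local_dir and return its path."""
    # Imported lazily so this module can be imported without huggingface_hub.
    from huggingface_hub import snapshot_download

    path = snapshot_download(repo_id="geyongtao/gvm", local_dir=local_dir)
    return Path(path)
```

Calling `download_gvm_weights()` from the repository root should place the files under `data/weights`, the same location used by the CLI command above.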

After downloading, the directory layout should look like this:

```
|-- GVM
    |-- data
        |-- weights
            |-- vae
                |-- config.json
                |-- diffusion_pytorch_model.safetensors
            |-- unet
                |-- config.json
                |-- diffusion_pytorch_model.safetensors
            |-- scheduler
                |-- scheduler_config.json
        |-- datasets
        |-- demo_videos
```
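A quick way to confirm that the download produced the files listed above is to check for them before running inference. This is a small sketch using only the standard library; `EXPECTED_FILES` and `missing_weight_files` are illustrative names, and the check covers only the files named in the tree above.

```python
from pathlib import Path
from typing import List

# Files named in the checkpoint tree, relative to the weights directory.
EXPECTED_FILES = [
    "vae/config.json",
    "vae/diffusion_pytorch_model.safetensors",
    "unet/config.json",
    "unet/diffusion_pytorch_model.safetensors",
    "scheduler/scheduler_config.json",
]


def missing_weight_files(weights_dir: str = "data/weights") -> List[str]:
    """Return the expected files that are absent under weights_dir."""
    root = Path(weights_dir)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]
```

An empty list from `missing_weight_files()` means every listed file is present; otherwise the returned paths show what still needs to be downloaded.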

## Run

### Inference

You can run generative video matting with:

```bash
python demo.py \
  --model_base 'data/weights/' \
  --unet_base data/weights/unet \
  --lora_base data/weights/unet \
  --mode 'matte' \
  --num_frames_per_batch 8 \
  --num_interp_frames 1 \
  --num_overlap_frames 1 \
  --denoise_steps 1 \
  --decode_chunk_size 8 \
  --max_resolution 960 \
  --pretrain_type 'svd' \
  --data_dir 'data/demo_videos/xxx.mp4' \
  --output_dir 'output_path'
```
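To process several clips in one go, the command above can be wrapped in a short Python script. This is only a sketch: the helper names `build_demo_command` and `run_on_folder` are hypothetical, the flag values are copied from the example command, and it assumes `demo.py` is invoked once per video from the GVM repository root. How `demo.py` names its outputs when several runs share one `--output_dir` is not covered here, so check that before relying on it.

```python
import subprocess
from pathlib import Path
from typing import List


def build_demo_command(video: str, output_dir: str,
                       weights_dir: str = "data/weights") -> List[str]:
    """Assemble the demo.py invocation shown above for one input video."""
    return [
        "python", "demo.py",
        "--model_base", f"{weights_dir}/",
        "--unet_base", f"{weights_dir}/unet",
        "--lora_base", f"{weights_dir}/unet",
        "--mode", "matte",
        "--num_frames_per_batch", "8",
        "--num_interp_frames", "1",
        "--num_overlap_frames", "1",
        "--denoise_steps", "1",
        "--decode_chunk_size", "8",
        "--max_resolution", "960",
        "--pretrain_type", "svd",
        "--data_dir", video,
        "--output_dir", output_dir,
    ]


def run_on_folder(video_dir: str, output_dir: str) -> None:
    """Run the demo once per .mp4 file found directly in video_dir."""
    for video in sorted(Path(video_dir).glob("*.mp4")):
        # check=True stops the loop as soon as one run fails.
        subprocess.run(build_demo_command(str(video), output_dir), check=True)
```

For example, `run_on_folder("data/demo_videos", "output_path")` would launch one `demo.py` run per clip in the demo folder.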

## License

For academic usage, this project is licensed under [the 2-clause BSD License](https://github.com/aim-uofa/GVM/blob/main/LICENSE). For commercial inquiries, please contact [Chunhua Shen](mailto:chhshen@gmail.com).

## Cite Us

If you find this work helpful for your research, please cite:

```bibtex
@inproceedings{ge2025gvm,
  author = {Ge, Yongtao and Xie, Kangyang and Xu, Guangkai and Ke, Li and Liu, Mingyu and Huang, Longtao and Xue, Hui and Chen, Hao and Shen, Chunhua},
  title = {Generative Video Matting},
  publisher = {Association for Computing Machinery},
  url = {https://doi.org/10.1145/3721238.3730642},
  doi = {10.1145/3721238.3730642},
  booktitle = {Proceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Papers},
  series = {SIGGRAPH Conference Papers '25}
}
```