---
license: apache-2.0
tags:
- video-inpainting
- video-editing
- object-removal
- cogvideox
- diffusion
- video-generation
pipeline_tag: video-to-video
---

# VOID: Video Object and Interaction Deletion

<video src="https://github.com/user-attachments/assets/ad174ca0-2feb-45f9-9405-83167037d9be" width="100%" controls autoplay loop muted></video>

VOID removes objects from videos along with all interactions they induce on the scene — not just secondary effects like shadows and reflections, but **physical interactions** like objects falling when a person is removed.

**[Project Page](https://void-model.github.io/)** | **[Paper](https://arxiv.org/abs/XXXX.XXXXX)** | **[GitHub](https://github.com/netflix/void-model)** | **[Demo](https://huggingface.co/spaces/sam-motamed/VOID)**

## Quick Start

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/netflix/void-model/blob/main/notebook.ipynb)

The included notebook handles setup, downloads models, runs inference on a sample video, and displays the result. Requires a GPU with **40GB+ VRAM** (e.g., A100).

## Model Details

VOID is built on [CogVideoX-Fun-V1.5-5b-InP](https://huggingface.co/alibaba-pai/CogVideoX-Fun-V1.5-5b-InP) and fine-tuned for video inpainting with interaction-aware **quadmask** conditioning — a 4-value mask that encodes the primary object (remove), overlap regions, affected regions (falling objects, displaced items), and background (keep).
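
In practice the quadmask is a single-channel video whose pixel values select one of these four regions, using the 0/63/127/255 convention shown under Input Format below. As a rough, repo-independent illustration of that encoding, one mask frame could be composed from binary segmentation masks like this (the helper and mask names are hypothetical, not part of the VOID codebase):

```python
# Illustrative sketch of the quadmask encoding, not code from the VOID repo.
# object_mask / overlap_mask / affected_mask are assumed boolean HxW arrays.
import numpy as np

def make_quadmask_frame(object_mask, overlap_mask, affected_mask):
    frame = np.full(object_mask.shape, 255, dtype=np.uint8)  # background: keep
    frame[affected_mask] = 127  # regions affected by the removal (falling / displaced items)
    frame[overlap_mask] = 63    # overlap regions
    frame[object_mask] = 0      # primary object to remove
    return frame
```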

### Checkpoints

| File | Description | Required? |
|------|-------------|-----------|
| `void_pass1.safetensors` | Base inpainting model | Yes |
| `void_pass2.safetensors` | Warped-noise refinement for temporal consistency | Optional |

Pass 1 is sufficient for most videos. Pass 2 adds optical flow-warped latent initialization for improved temporal consistency on longer clips.
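
To sanity-check a downloaded checkpoint before pointing the pipeline at it, the weights can be inspected as ordinary safetensors files — a generic sketch assuming only the standard `safetensors` package:

```python
# Peek inside a downloaded VOID checkpoint; generic safetensors inspection only.
from safetensors import safe_open

with safe_open("void_pass1.safetensors", framework="pt", device="cpu") as f:
    keys = list(f.keys())
    print(f"{len(keys)} tensors, first few: {keys[:3]}")
```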

### Architecture

- **Base:** CogVideoX 3D Transformer (5B parameters)
- **Input:** Video + quadmask + text prompt describing the scene after removal
- **Resolution:** 384x672 (default)
- **Max frames:** 197
- **Scheduler:** DDIM
- **Precision:** BF16 with FP8 quantization for memory efficiency
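
A quick pre-flight check of a clip against these defaults can save a failed run. The snippet below is an illustrative OpenCV check, not a utility shipped with the repo (it assumes 384 is the height and 672 the width):

```python
# Illustrative pre-flight check against the documented defaults
# (384x672 default resolution, at most 197 frames).
import cv2

cap = cv2.VideoCapture("my-video/input_video.mp4")
n_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
cap.release()

print(f"{width}x{height}, {n_frames} frames")
if n_frames > 197:
    print("Warning: clip exceeds the 197-frame limit and may be truncated.")
if (height, width) != (384, 672):
    print("Warning: clip does not match the default 384x672 resolution.")
```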

## Usage

### From the Notebook

The easiest way is to clone the repo and run [`notebook.ipynb`](https://github.com/netflix/void-model/blob/main/notebook.ipynb):

```bash
git clone https://github.com/netflix/void-model.git
cd void-model
```

### From the CLI

```bash
# Install dependencies
pip install -r requirements.txt

# Download the base model
huggingface-cli download alibaba-pai/CogVideoX-Fun-V1.5-5b-InP \
  --local-dir ./CogVideoX-Fun-V1.5-5b-InP

# Download VOID checkpoints
huggingface-cli download netflix/void-model \
  --local-dir .

# Run Pass 1 inference on a sample
python inference/cogvideox_fun/predict_v2v.py \
  --config config/quadmask_cogvideox.py \
  --config.data.data_rootdir="./sample" \
  --config.experiment.run_seqs="lime" \
  --config.experiment.save_path="./outputs" \
  --config.video_model.transformer_path="./void_pass1.safetensors"
```
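
The downloads can also be scripted with `huggingface_hub` instead of the CLI; the calls below are a straightforward equivalent of the two `huggingface-cli download` commands above (adjust `local_dir` to match your setup):

```python
# Programmatic equivalent of the huggingface-cli downloads above.
from huggingface_hub import snapshot_download

snapshot_download("alibaba-pai/CogVideoX-Fun-V1.5-5b-InP",
                  local_dir="./CogVideoX-Fun-V1.5-5b-InP")
snapshot_download("netflix/void-model", local_dir=".")
```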

### Input Format

Each video needs three files in a folder:

```
my-video/
  input_video.mp4   # source video
  quadmask_0.mp4    # 4-value mask (0=remove, 63=overlap, 127=affected, 255=keep)
  prompt.json       # {"bg": "description of scene after removal"}
```
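
Assembling such a folder for a new clip is mostly file bookkeeping; a minimal sketch follows, where the source file names and the example prompt text are placeholders rather than repo conventions:

```python
# Lay out an input folder in the structure shown above. Illustrative only.
import json
import shutil
from pathlib import Path

folder = Path("my-video")
folder.mkdir(exist_ok=True)

shutil.copy("source_clip.mp4", folder / "input_video.mp4")      # your source video
shutil.copy("source_quadmask.mp4", folder / "quadmask_0.mp4")   # your 4-value mask video

# "bg" describes the scene as it should look after the object is removed.
prompt = {"bg": "an empty park bench with trees in the background"}
(folder / "prompt.json").write_text(json.dumps(prompt, indent=2))
```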

The repo includes a mask generation pipeline (`VLM-MASK-REASONER/`) that creates quadmasks from raw videos using SAM2 + Gemini.

## Training

VOID was trained on paired counterfactual videos generated from two sources:

- **HUMOTO** — human-object interactions rendered in Blender with physics simulation
- **Kubric** — object-only interactions using Google Scanned Objects

Training was run on **8x A100 80GB GPUs** using DeepSpeed ZeRO Stage 2. See the [GitHub repo](https://github.com/netflix/void-model#%EF%B8%8F-training) for full training instructions and data generation code.
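
For orientation, a DeepSpeed ZeRO Stage 2 run is driven by a config dictionary along the lines below; this is a generic illustration of the setting, not the configuration shipped with the repo:

```python
# Generic DeepSpeed ZeRO Stage 2 config sketch (values are illustrative).
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 8,
    "bf16": {"enabled": True},  # matches the BF16 precision noted above
    "zero_optimization": {
        "stage": 2,             # shard optimizer states and gradients across GPUs
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
}
```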

## Citation

```bibtex
@article{motamed2026void,
  author  = {Motamed, Saman and Harvey, William and Klein, Benjamin and Van Gool, Luc and Yuan, Zhuoning and Cheng, Ta-ying},
  title   = {VOID: Video Object and Interaction Deletion},
  journal = {arXiv preprint},
  year    = {2026},
}
```