---
base_model:
- Wan-AI/Wan2.1-VACE-1.3B
license: apache-2.0
pipeline_tag: video-to-video
library_name: diffusers
---
## Overview

Removing objects from videos remains difficult in the presence of real-world imperfections such as shadows, abrupt motion, and defective masks. Existing diffusion-based video inpainting models often struggle to maintain temporal stability and visual consistency under these challenges. We propose **Stable Video Object Removal (SVOR)**, a robust framework that achieves shadow-free, flicker-free, and mask-defect-tolerant removal through three key designs: (1) **Mask Union for Stable Erasure (MUSE)**, a windowed union strategy applied during temporal mask downsampling to preserve all target regions observed within each window, effectively handling abrupt motion and reducing missed removals; (2) **Denoising-Aware Segmentation (DA-Seg)**, a lightweight segmentation head on a decoupled side branch, equipped with Denoising-Aware AdaLN and trained with mask degradation to provide an internal diffusion-aware localization prior without affecting content generation; and (3) **Curriculum Two-Stage Training**, in which Stage I performs self-supervised pretraining on unpaired real-background videos with online random masks to learn realistic background and temporal priors, and Stage II refines on synthetic pairs using mask degradation and side-effect-weighted losses, jointly removing objects and their associated shadows and reflections while improving cross-domain robustness. Extensive experiments show that SVOR attains new state-of-the-art results across multiple datasets and degraded-mask benchmarks, advancing video object removal from ideal settings toward real-world applications.
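
The windowed-union idea behind MUSE can be illustrated with a small NumPy sketch. This is not the SVOR implementation: the function names, the window size of 4, and the zero-padding of the last window are assumptions made for illustration only. It compares a union (per-pixel max) over each temporal window with naive frame subsampling on a toy mask where the object moves one pixel per frame.

```python
import numpy as np


def downsample_mask_union(masks: np.ndarray, window: int = 4) -> np.ndarray:
    """Temporally downsample binary masks of shape (T, H, W) by taking the
    per-pixel union (max) over each non-overlapping window of frames.

    `window` is a hypothetical parameter; the value used by SVOR is not
    stated in this card.
    """
    t, h, w = masks.shape
    pad = (-t) % window  # frames needed to complete the last window
    if pad:
        # Zero padding never adds foreground, so the union is unaffected.
        filler = np.zeros((pad, h, w), dtype=masks.dtype)
        masks = np.concatenate([masks, filler], axis=0)
    return masks.reshape(-1, window, h, w).max(axis=1)


def downsample_mask_stride(masks: np.ndarray, window: int = 4) -> np.ndarray:
    """Baseline for comparison: keep only the first frame of each window."""
    return masks[::window]


# Toy example: a 1-pixel object moving one pixel per frame along a 1x8 strip.
T, H, W = 8, 1, 8
masks = np.zeros((T, H, W), dtype=np.uint8)
for i in range(T):
    masks[i, 0, i] = 1

union = downsample_mask_union(masks)   # covers every position the object visited
naive = downsample_mask_stride(masks)  # covers only positions at frames 0 and 4

print(union[:, 0])
print(naive[:, 0])
```

In this toy case the union mask covers all eight positions the object passes through, whereas subsampling keeps only two of them, which is the kind of missed region that a windowed union is meant to avoid under abrupt motion.
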
## Results
For more visual results, check out our