---
title: VideoMaMa - Video Matting with Mask Guidance
emoji: 🎬
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 4.0.0
app_file: app.py
pinned: false
license: apache-2.0
---

# 🎬 VideoMaMa: Video Matting with Mask Guidance

An interactive demo for high-quality video matting using sparse mask guidance. This demo combines SAM2 for automatic object tracking with our VideoMaMa model for generating alpha mattes.

## 🌟 Features

- **Single-Click Object Selection**: Simply click on the object you want to extract in the first frame
- **Automatic Tracking**: SAM2 automatically tracks your selected object through all frames
- **High-Quality Matting**: VideoMaMa generates smooth, temporally consistent alpha mattes
- **Flexible Input**: Upload your own video or try our provided samples
- **Customizable**: Adjust augmentation settings for different scenarios

## 🚀 How to Use

1. **Upload a video** or **select from samples**
2. **Click on the object** you want to extract in the first frame (displayed in the interface)
3. Optionally adjust **augmentation settings** in the advanced options
4. Click **"Generate Matting"** and wait for processing
5. View your results: output video, comparison images, and mask track

## 🔧 Installation (Local Setup)

If you want to run this demo locally:

```bash
# Install dependencies
pip install -r requirements.txt

# Add sample videos to samples/ directory (optional)

# Run the demo
python app.py
```

## 🎯 Tips for Best Results

- **Click Precisely**: Click on the center of the object you want to extract
- **Clear Objects**: Works best with distinct foreground objects
- **Video Length**: For faster processing, use shorter videos (< 5 seconds)
- **Augmentations**:
  - Use "polygon" for cleaner geometric masks
  - Enable temporal augmentation for challenging videos
  - Try "bounding box" for very simple selections

## 📚 Technical Details

### Model Architecture

- **Base Model**: Stable Video Diffusion (SVD-XT)
- **Conditioning**: RGB frames + VAE-encoded masks (see the appendix sketch below)
- **UNet**: Fine-tuned with additional mask conditioning channels
- **Processing**: Chunked inference, 16 frames per chunk (sketched in the appendix)

### SAM2 Integration

- Uses the SAM2 video predictor for mask tracking
- Propagates the mask from a single click point through the entire video (see the appendix sketch)
- Generates temporally consistent segmentation masks

## 🤝 Contributing

If you encounter issues or have suggestions:

1. Check that all model checkpoints are correctly placed
2. Ensure your GPU has sufficient VRAM
3. Try reducing video length or resolution for testing

## 🙏 Acknowledgments

- **SAM2**: Meta AI's Segment Anything 2
- **Stable Video Diffusion**: Stability AI's video generation model
- **Gradio**: For the amazing UI framework

## 📧 Contact

For questions or issues, please open an issue on our GitHub repository.

---

**Note**: This demo is for research purposes. Processing times may vary based on video length and available compute resources.
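
## 🧪 Appendix: Illustrative Code Sketches

The snippets below are minimal, hypothetical sketches of the mechanisms described under Technical Details; they are not the demo's actual implementation.

The first sketch shows one plausible shape for the mask-conditioning path, assuming diffusers' `AutoencoderKLTemporalDecoder` as the SVD VAE. The tensor names, the `mode()` sampling choice, and the 4 + 4 + 4 = 12 input-channel layout are assumptions, not confirmed details of VideoMaMa.

```python
import torch
from diffusers import AutoencoderKLTemporalDecoder

# SVD's VAE (hypothetical checkpoint/subfolder choice).
vae = AutoencoderKLTemporalDecoder.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt", subfolder="vae"
)

@torch.no_grad()
def encode_masks(masks: torch.Tensor) -> torch.Tensor:
    """Encode masks (B*F, 3, H, W, values in [-1, 1]) into 4-channel latents."""
    latents = vae.encode(masks).latent_dist.mode()
    return latents * vae.config.scaling_factor

def build_unet_input(noisy_latents, rgb_cond_latents, mask_latents):
    # SVD's UNet normally sees noisy latents plus RGB conditioning latents
    # (4 + 4 channels); if masks add another 4-channel latent as the README
    # suggests, the fine-tuned UNet would take 12 input channels.
    return torch.cat([noisy_latents, rgb_cond_latents, mask_latents], dim=1)
```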
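
Chunked inference ("16 frames per chunk") can be as simple as slicing the frame and mask arrays and stitching the per-chunk alpha mattes back together. Here `run_matting` is a hypothetical stand-in for the VideoMaMa pipeline call, and any overlap/blending between chunks is omitted for brevity.

```python
import numpy as np

CHUNK_SIZE = 16  # frames per chunk, per the Technical Details above

def matte_video(frames: np.ndarray, masks: np.ndarray, run_matting) -> np.ndarray:
    """Run matting chunk by chunk over (F, H, W, C) frames and (F, H, W) masks."""
    alphas = []
    for start in range(0, len(frames), CHUNK_SIZE):
        # Each chunk is conditioned on its own RGB frames and tracked masks.
        alphas.append(run_matting(frames[start:start + CHUNK_SIZE],
                                  masks[start:start + CHUNK_SIZE]))
    return np.concatenate(alphas, axis=0)
```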
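
Finally, single-click tracking roughly follows the `sam2` package's video predictor API: register one positive click on the first frame, then propagate through the video. The config/checkpoint paths, frame directory, and click coordinates below are placeholders.

```python
import numpy as np
import torch
from sam2.build_sam import build_sam2_video_predictor

predictor = build_sam2_video_predictor(
    "configs/sam2.1/sam2.1_hiera_l.yaml",  # placeholder config path
    "checkpoints/sam2.1_hiera_large.pt",   # placeholder checkpoint path
)

with torch.inference_mode():
    # init_state caches per-frame features for a directory of JPEG frames.
    state = predictor.init_state(video_path="frames/")

    # One positive click (label 1) on the target object in the first frame.
    predictor.add_new_points_or_box(
        state, frame_idx=0, obj_id=1,
        points=np.array([[480, 270]], dtype=np.float32),  # placeholder (x, y)
        labels=np.array([1], dtype=np.int32),
    )

    # Propagate the mask through the rest of the video.
    masks = {}
    for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
        masks[frame_idx] = (mask_logits[0] > 0.0).cpu().numpy()  # obj_id 1
```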