Spaces: RAM2118/VideoMaMa-Custom

Build error


RAM2118 committed on Feb 21

Commit 8d777b1 (verified) · 1 Parent(s): 2255092

Upload folder using huggingface_hub


Files changed (17)

.gitignore +65 -0
.gradio/certificate.pem +31 -0
README.md +85 -6
app.py +561 -0
download_checkpoints.sh +78 -0
enhanced_ui.py +72 -0
pipeline_svd_mask.py +1038 -0
requirements.txt +31 -0
sam2_hiera_l.yaml +124 -0
sam2_wrapper.py +172 -0
sam2_wrapper_hf.py +196 -0
tools/__init__.py +1 -0
tools/base_segmenter.py +68 -0
tools/interact_tools.py +121 -0
tools/painter.py +126 -0
videomama_wrapper.py +88 -0
videomama_wrapper_hf.py +110 -0

.gitignore ADDED

	@@ -0,0 +1,65 @@

+# Python
+__pycache__/
+*.py[cod]
+*$py.class
+*.so
+.Python
+build/
+develop-eggs/
+dist/
+downloads/
+eggs/
+.eggs/
+lib/
+lib64/
+parts/
+sdist/
+var/
+wheels/
+*.egg-info/
+.installed.cfg
+*.egg
+# Virtual environments
+venv/
+env/
+ENV/
+.venv
+# IDE
+.vscode/
+.idea/
+*.swp
+*.swo
+*~
+# Gradio
+flagged/
+# Temporary files
+*.tmp
+temp/
+temp_*/
+*.log
+# Model checkpoints (download separately)
+checkpoints/*.pt
+checkpoints/*.pth
+checkpoints/*.safetensors
+checkpoints/*.bin
+# Videos
+samples/*.mp4
+samples/*.avi
+samples/*.mov
+*.mp4
+*.avi
+*.mov
+# OS
+.DS_Store
+Thumbs.db
+*.bak
+# Jupyter
+.ipynb_checkpoints/

.gradio/certificate.pem ADDED

	@@ -0,0 +1,31 @@

+-----BEGIN CERTIFICATE-----
+MIIFazCCA1OgAwIBAgIRAIIQz7DSQONZRGPgu2OCiwAwDQYJKoZIhvcNAQELBQAw
+TzELMAkGA1UEBhMCVVMxKTAnBgNVBAoTIEludGVybmV0IFNlY3VyaXR5IFJlc2Vh
+cmNoIEdyb3VwMRUwEwYDVQQDEwxJU1JHIFJvb3QgWDEwHhcNMTUwNjA0MTEwNDM4
+WhcNMzUwNjA0MTEwNDM4WjBPMQswCQYDVQQGEwJVUzEpMCcGA1UEChMgSW50ZXJu
+ZXQgU2VjdXJpdHkgUmVzZWFyY2ggR3JvdXAxFTATBgNVBAMTDElTUkcgUm9vdCBY
+MTCCAiIwDQYJKoZIhvcNAQEBBQADggIPADCCAgoCggIBAK3oJHP0FDfzm54rVygc
+h77ct984kIxuPOZXoHj3dcKi/vVqbvYATyjb3miGbESTtrFj/RQSa78f0uoxmyF+
+0TM8ukj13Xnfs7j/EvEhmkvBioZxaUpmZmyPfjxwv60pIgbz5MDmgK7iS4+3mX6U
+A5/TR5d8mUgjU+g4rk8Kb4Mu0UlXjIB0ttov0DiNewNwIRt18jA8+o+u3dpjq+sW
+T8KOEUt+zwvo/7V3LvSye0rgTBIlDHCNAymg4VMk7BPZ7hm/ELNKjD+Jo2FR3qyH
+B5T0Y3HsLuJvW5iB4YlcNHlsdu87kGJ55tukmi8mxdAQ4Q7e2RCOFvu396j3x+UC
+B5iPNgiV5+I3lg02dZ77DnKxHZu8A/lJBdiB3QW0KtZB6awBdpUKD9jf1b0SHzUv
+KBds0pjBqAlkd25HN7rOrFleaJ1/ctaJxQZBKT5ZPt0m9STJEadao0xAH0ahmbWn
+OlFuhjuefXKnEgV4We0+UXgVCwOPjdAvBbI+e0ocS3MFEvzG6uBQE3xDk3SzynTn
+jh8BCNAw1FtxNrQHusEwMFxIt4I7mKZ9YIqioymCzLq9gwQbooMDQaHWBfEbwrbw
+qHyGO0aoSCqI3Haadr8faqU9GY/rOPNk3sgrDQoo//fb4hVC1CLQJ13hef4Y53CI
+rU7m2Ys6xt0nUW7/vGT1M0NPAgMBAAGjQjBAMA4GA1UdDwEB/wQEAwIBBjAPBgNV
+HRMBAf8EBTADAQH/MB0GA1UdDgQWBBR5tFnme7bl5AFzgAiIyBpY9umbbjANBgkq
+hkiG9w0BAQsFAAOCAgEAVR9YqbyyqFDQDLHYGmkgJykIrGF1XIpu+ILlaS/V9lZL
+ubhzEFnTIZd+50xx+7LSYK05qAvqFyFWhfFQDlnrzuBZ6brJFe+GnY+EgPbk6ZGQ
+3BebYhtF8GaV0nxvwuo77x/Py9auJ/GpsMiu/X1+mvoiBOv/2X/qkSsisRcOj/KK
+NFtY2PwByVS5uCbMiogziUwthDyC3+6WVwW6LLv3xLfHTjuCvjHIInNzktHCgKQ5
+ORAzI4JMPJ+GslWYHb4phowim57iaztXOoJwTdwJx4nLCgdNbOhdjsnvzqvHu7Ur
+TkXWStAmzOVyyghqpZXjFaH3pO3JLF+l+/+sKAIuvtd7u+Nxe5AW0wdeRlN8NwdC
+jNPElpzVmbUq4JUagEiuTDkHzsxHpFKVK7q4+63SM1N95R1NbdWhscdCb+ZAJzVc
+oyi3B43njTOQ5yOf+1CceWxG1bQVs5ZufpsMljq4Ui0/1lvh+wjChP4kqKOJ2qxq
+4RgqsahDYVvTH9w7jXbyLeiNdd8XM2w9U/t7y0Ff/9yi0GE44Za4rF2LN9d11TPA
+mRGunUHBcnWEvgJBQl9nJEiU0Zsnvgc/ubhPgXRR4Xq37Z0j4r7g1SgEEzwxA57d
+emyPxgcYxn/eR44/KJ4EBs+lVDR3veyJm+kXQ99b21/+jh5Xos1AnX5iItreGCc=
+-----END CERTIFICATE-----

README.md CHANGED

@@ -1,12 +1,91 @@
 ---
-title: VideoMaMa Custom
-emoji: 💻
-colorFrom: red
-colorTo: pink
 sdk: gradio
-sdk_version: 6.6.0
 app_file: app.py
 pinned: false
 ---
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

 ---
+title: VideoMaMa - Video Matting with Mask Guidance
+emoji: 🎬
+colorFrom: blue
+colorTo: purple
 sdk: gradio
+sdk_version: 4.0.0
 app_file: app.py
 pinned: false
+license: apache-2.0
 ---
+# 🎬 VideoMaMa: Video Matting with Mask Guidance
+An interactive demo for high-quality video matting using sparse mask guidance. This demo combines SAM2 for automatic object tracking with our VideoMaMa model for generating alpha mattes.
+## 🌟 Features
+- **Single-Click Object Selection**: Simply click on the object you want to extract in the first frame
+- **Automatic Tracking**: SAM2 automatically tracks your selected object through all frames
+- **High-Quality Matting**: VideoMaMa generates smooth, temporally-consistent alpha mattes
+- **Flexible Input**: Upload your own video or try our provided samples
+- **Customizable**: Adjust augmentation settings for different scenarios
+## 🚀 How to Use
+1. **Upload a video** or **select from samples**
+2. **Click on the object** you want to extract in the first frame (displayed in the interface)
+3. Optionally adjust **augmentation settings** in the advanced options
+4. Click **"Generate Matting"** and wait for processing
+5. View your results: output video, comparison images, and mask track
+## 🔧 Installation (Local Setup)
+If you want to run this demo locally:
+```bash
+# Install dependencies
+pip install -r requirements.txt
+# Add sample videos to samples/ directory (optional)
+# Run the demo
+python app.py
+```
+## 🎯 Tips for Best Results
+- **Click Precisely**: Click on the center of the object you want to extract
+- **Clear Objects**: Works best with distinct foreground objects
+- **Video Length**: For faster processing, use shorter videos (< 5 seconds)
+- **Augmentations**:
+  - Use "polygon" for cleaner geometric masks
+  - Enable temporal augmentation for challenging videos
+  - Try "bounding box" for very simple selections
+## 📚 Technical Details
+### Model Architecture
+- **Base Model**: Stable Video Diffusion (SVD-XT)
+- **Conditioning**: RGB frames + VAE-encoded masks
+- **UNet**: Fine-tuned with additional mask conditioning channels
+- **Processing**: Chunked inference (16 frames per chunk)
+### SAM2 Integration
+- Uses SAM2 video predictor for mask tracking
+- Propagates mask from single click point through entire video
+- Generates temporally consistent segmentation masks
+## 🤝 Contributing
+If you encounter issues or have suggestions:
+1. Check that all model checkpoints are correctly placed
+2. Ensure your GPU has sufficient VRAM
+3. Try reducing video length or resolution for testing
+## 🙏 Acknowledgments
+- **SAM2**: Meta AI's Segment Anything 2
+- **Stable Video Diffusion**: Stability AI's video generation model
+- **Gradio**: For the amazing UI framework
+## 📧 Contact
+For questions or issues, please open an issue on our GitHub repository.
+---
+**Note**: This demo is for research purposes. Processing times may vary based on video length and available compute resources.
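
Review note on the README's "Chunked inference (16 frames per chunk)" line: the sketch below shows one way a frame sequence could be split into chunks of 16. It assumes plain non-overlapping chunks with a shorter final chunk; the diff shown here does not confirm whether the actual pipeline overlaps or pads chunks, and `chunk_ranges` is a hypothetical helper name, not something from this commit.

```python
def chunk_ranges(num_frames, chunk_size=16):
    """Return (start, end) index pairs that cover num_frames in order.

    Assumption: chunks do not overlap and the last chunk may be shorter.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        (start, min(start + chunk_size, num_frames))
        for start in range(0, num_frames, chunk_size)
    ]


# Example: 40 frames split into chunks of at most 16 frames
for start, end in chunk_ranges(40):
    print(f"frames[{start}:{end}] -> {end - start} frames")
```

With the default 24-frame setting in `app.py`, this scheme would yield one full chunk of 16 frames and one chunk of 8.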

app.py ADDED

	@@ -0,0 +1,561 @@

+"""
+VideoMaMa Gradio Demo
+Interactive video matting with SAM2 mask tracking
+"""
+import sys
+sys.path.append("../")
+sys.path.append("../../")
+import os
+import json
+import time
+import cv2
+import torch
+import numpy as np
+import gradio as gr
+from PIL import Image
+from pathlib import Path
+from sam2_wrapper import load_sam2_tracker
+from videomama_wrapper import load_videomama_pipeline, videomama
+from tools.painter import mask_painter, point_painter
+import warnings
+warnings.filterwarnings("ignore")
+# Global models
+sam2_tracker = None
+videomama_pipeline = None
+# Constants
+MASK_COLOR = 3
+MASK_ALPHA = 0.7
+CONTOUR_COLOR = 1
+CONTOUR_WIDTH = 5
+POINT_COLOR_POS = 8   # Positive points - orange
+POINT_COLOR_NEG = 1   # Negative points - red
+POINT_ALPHA = 0.9
+POINT_RADIUS = 15
+def initialize_models():
+    """Initialize SAM2 and VideoMaMa models"""
+    global sam2_tracker, videomama_pipeline
+    device = "cuda" if torch.cuda.is_available() else "cpu"
+    print(f"Using device: {device}")
+    # Load SAM2
+    sam2_tracker = load_sam2_tracker(device=device)
+    # Load VideoMaMa
+    videomama_pipeline = load_videomama_pipeline(device=device)
+    print("All models initialized successfully!")
+def extract_frames_from_video(video_path, max_frames=24):
+    """
+    Extract frames from video file
+    Args:
+        video_path: Path to video file
+        max_frames: Maximum number of frames to extract (default: 24)
+    Returns:
+        frames: List of numpy arrays (H,W,3), uint8 RGB
+        adjusted_fps: Adjusted FPS for output video to maintain normal playback speed
+    """
+    cap = cv2.VideoCapture(video_path)
+    original_fps = cap.get(cv2.CAP_PROP_FPS)
+    total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
+    # Read all frames first
+    all_frames = []
+    while cap.isOpened():
+        ret, frame = cap.read()
+        if not ret:
+            break
+        # Convert BGR to RGB
+        frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
+        all_frames.append(frame_rgb)
+    cap.release()
+    # If video has more frames than max_frames, randomly sample
+    if len(all_frames) > max_frames:
+        print(f"Video has {len(all_frames)} frames, randomly sampling {max_frames} frames...")
+        # Sort indices to maintain temporal order
+        sampled_indices = sorted(np.random.choice(len(all_frames), max_frames, replace=False))
+        frames = [all_frames[i] for i in sampled_indices]
+        print(f"Sampled frame indices: {sampled_indices}")
+        # Adjust FPS to maintain normal playback speed
+        # If we sampled N frames from M total frames, adjust FPS proportionally
+        adjusted_fps = original_fps * (len(frames) / len(all_frames))
+    else:
+        frames = all_frames
+        adjusted_fps = original_fps
+        print(f"Video has {len(frames)} frames (≤ {max_frames}), using all frames")
+    print(f"Using {len(frames)} frames from video (Original FPS: {original_fps:.2f}, Adjusted FPS: {adjusted_fps:.2f})")
+    return frames, adjusted_fps
+def get_prompt(click_state, click_input):
+    """
+    Convert click input to prompt format
+    Args:
+        click_state: [[points], [labels]]
+        click_input: JSON string "[[x, y, label]]"
+    Returns:
+        Updated click_state
+    """
+    inputs = json.loads(click_input)
+    points = click_state[0]
+    labels = click_state[1]
+    for input_item in inputs:
+        points.append(input_item[:2])
+        labels.append(input_item[2])
+    click_state[0] = points
+    click_state[1] = labels
+    return click_state
+def load_video(video_input, video_state, num_frames):
+    """
+    Load video and extract first frame for mask generation
+    """
+    # Clean up old output files if they exist
+    if video_state is not None and "output_paths" in video_state:
+        cleanup_old_videos(video_state["output_paths"])
+    if video_input is None:
+        return video_state, None, \
+               gr.update(visible=False), gr.update(visible=False), \
+               gr.update(visible=False), gr.update(visible=False)
+    # Extract frames with user-specified number
+    frames, fps = extract_frames_from_video(video_input, max_frames=int(num_frames))
+    if len(frames) == 0:
+        return video_state, None, \
+               gr.update(visible=False), gr.update(visible=False), \
+               gr.update(visible=False), gr.update(visible=False)
+    # Initialize video state
+    video_state = {
+        "frames": frames,
+        "fps": fps,
+        "first_frame_mask": None,
+        "masks": None,
+    }
+    first_frame_pil = Image.fromarray(frames[0])
+    return video_state, first_frame_pil, \
+           gr.update(visible=True), gr.update(visible=True), \
+           gr.update(visible=True), gr.update(visible=False)
+def sam_refine(video_state, point_prompt, click_state, evt: gr.SelectData):
+    """
+    Add click and update mask on first frame
+    Args:
+        video_state: Dictionary with video data
+        point_prompt: "Positive" or "Negative"
+        click_state: [[points], [labels]]
+        evt: Gradio SelectData event with click coordinates
+    """
+    if video_state is None or "frames" not in video_state:
+        return None, video_state, click_state
+    # Add new click
+    x, y = evt.index[0], evt.index[1]
+    label = 1 if point_prompt == "Positive" else 0
+    click_state[0].append([x, y])
+    click_state[1].append(label)
+    print(f"Added {point_prompt} click at ({x}, {y}). Total clicks: {len(click_state[0])}")
+    # Generate mask with SAM2
+    first_frame = video_state["frames"][0]
+    mask = sam2_tracker.get_first_frame_mask(
+        frame=first_frame,
+        points=click_state[0],
+        labels=click_state[1]
+    )
+    # Store mask in video state
+    video_state["first_frame_mask"] = mask
+    # Visualize mask and points
+    painted_image = mask_painter(
+        first_frame.copy(),
+        mask,
+        MASK_COLOR,
+        MASK_ALPHA,
+        CONTOUR_COLOR,
+        CONTOUR_WIDTH
+    )
+    # Paint positive points
+    positive_points = np.array([click_state[0][i] for i in range(len(click_state[0]))
+                               if click_state[1][i] == 1])
+    if len(positive_points) > 0:
+        painted_image = point_painter(
+            painted_image,
+            positive_points,
+            POINT_COLOR_POS,
+            POINT_ALPHA,
+            POINT_RADIUS,
+            CONTOUR_COLOR,
+            CONTOUR_WIDTH
+        )
+    # Paint negative points
+    negative_points = np.array([click_state[0][i] for i in range(len(click_state[0]))
+                               if click_state[1][i] == 0])
+    if len(negative_points) > 0:
+        painted_image = point_painter(
+            painted_image,
+            negative_points,
+            POINT_COLOR_NEG,
+            POINT_ALPHA,
+            POINT_RADIUS,
+            CONTOUR_COLOR,
+            CONTOUR_WIDTH
+        )
+    painted_pil = Image.fromarray(painted_image)
+    return painted_pil, video_state, click_state
+def clear_clicks(video_state, click_state):
+    """Clear all clicks and reset to original first frame"""
+    click_state = [[], []]
+    if video_state is not None and "frames" in video_state:
+        first_frame = video_state["frames"][0]
+        video_state["first_frame_mask"] = None
+        return Image.fromarray(first_frame), video_state, click_state
+    return None, video_state, click_state
+def propagate_masks(video_state, click_state):
+    """
+    Propagate first frame mask through entire video using SAM2
+    """
+    if video_state is None or "frames" not in video_state:
+        return video_state, "No video loaded", gr.update(visible=False)
+    if len(click_state[0]) == 0:
+        return video_state, "⚠️ Please add at least one point first", gr.update(visible=False)
+    frames = video_state["frames"]
+    # Track through video
+    print(f"Tracking object through {len(frames)} frames...")
+    masks = sam2_tracker.track_video(
+        frames=frames,
+        points=click_state[0],
+        labels=click_state[1]
+    )
+    video_state["masks"] = masks
+    status_msg = f"✓ Generated {len(masks)} masks. Ready to run VideoMaMa!"
+    return video_state, status_msg, gr.update(visible=True)
+def run_videomama_with_sam2(video_state, click_state):
+    """
+    Run SAM2 propagation and VideoMaMa inference together
+    """
+    if video_state is None or "frames" not in video_state:
+        return video_state, None, None, None, "⚠️ No video loaded"
+    if len(click_state[0]) == 0:
+        return video_state, None, None, None, "⚠️ Please add at least one point first"
+    frames = video_state["frames"]
+    # Step 1: Track through video with SAM2
+    print(f"🎯 Tracking object through {len(frames)} frames with SAM2...")
+    masks = sam2_tracker.track_video(
+        frames=frames,
+        points=click_state[0],
+        labels=click_state[1]
+    )
+    video_state["masks"] = masks
+    print(f"✓ Generated {len(masks)} masks")
+    # Step 2: Run VideoMaMa
+    print(f"🎨 Running VideoMaMa on {len(frames)} frames...")
+    output_frames = videomama(videomama_pipeline, frames, masks)
+    # Save output videos
+    output_dir = Path("outputs")
+    output_dir.mkdir(exist_ok=True)
+    timestamp = int(time.time())
+    output_video_path = output_dir / f"output_{timestamp}.mp4"
+    mask_video_path = output_dir / f"masks_{timestamp}.mp4"
+    greenscreen_path = output_dir / f"greenscreen_{timestamp}.mp4"
+    # Save matting result
+    save_video(output_frames, output_video_path, video_state["fps"])
+    # Save mask video (for visualization)
+    mask_frames_rgb = [np.stack([m, m, m], axis=-1) for m in masks]
+    save_video(mask_frames_rgb, mask_video_path, video_state["fps"])
+    # Create greenscreen composite: RGB * VideoMaMa_alpha + green * (1 - VideoMaMa_alpha)
+    # VideoMaMa output_frames already contain the alpha matte result
+    greenscreen_frames = []
+    for orig_frame, output_frame in zip(frames, output_frames):
+        # Extract alpha matte from VideoMaMa output
+        # VideoMaMa outputs matted foreground, we use its intensity as alpha
+        gray = cv2.cvtColor(output_frame, cv2.COLOR_RGB2GRAY)
+        alpha = np.clip(gray.astype(np.float32) / 255.0, 0, 1)
+        alpha_3ch = np.stack([alpha, alpha, alpha], axis=-1)
+        # Create green background
+        green_bg = np.zeros_like(orig_frame)
+        green_bg[:, :] = [156, 251, 165]  # Green screen color
+        # Composite: original_RGB * alpha + green * (1 - alpha)
+        composite = (orig_frame.astype(np.float32) * alpha_3ch +
+                    green_bg.astype(np.float32) * (1 - alpha_3ch)).astype(np.uint8)
+        greenscreen_frames.append(composite)
+    save_video(greenscreen_frames, greenscreen_path, video_state["fps"])
+    status_msg = f"✓ Complete! Generated {len(output_frames)} frames."
+    # Store paths for cleanup later
+    video_state["output_paths"] = [str(output_video_path), str(mask_video_path), str(greenscreen_path)]
+    return video_state, str(output_video_path), str(mask_video_path), str(greenscreen_path), status_msg
+def save_video(frames, output_path, fps):
+    """Save frames as video file"""
+    if len(frames) == 0:
+        return
+    height, width = frames[0].shape[:2]
+    fourcc = cv2.VideoWriter_fourcc(*'mp4v')
+    out = cv2.VideoWriter(str(output_path), fourcc, fps, (width, height))
+    for frame in frames:
+        if len(frame.shape) == 2:  # Grayscale
+            frame = cv2.cvtColor(frame, cv2.COLOR_GRAY2BGR)
+        else:  # RGB
+            frame = cv2.cvtColor(frame, cv2.COLOR_RGB2BGR)
+        out.write(frame)
+    out.release()
+    print(f"Saved video to {output_path}")
+def cleanup_old_videos(video_paths):
+    """Remove old output videos to save storage space"""
+    if video_paths is None:
+        return
+    for path in video_paths:
+        try:
+            if os.path.exists(path):
+                os.remove(path)
+                print(f"Cleaned up: {path}")
+        except Exception as e:
+            print(f"Failed to remove {path}: {e}")
+def cleanup_old_outputs(max_age_minutes=30):
+    """
+    Remove output files older than max_age_minutes to prevent storage overflow.
+    Called once at startup (see the __main__ block) to clean up abandoned files.
+    """
+    output_dir = Path("outputs")
+    if not output_dir.exists():
+        return
+    current_time = time.time()
+    max_age_seconds = max_age_minutes * 60
+    for file_path in output_dir.glob("*.mp4"):
+        try:
+            file_age = current_time - file_path.stat().st_mtime
+            if file_age > max_age_seconds:
+                file_path.unlink()
+                print(f"Cleaned up old file: {file_path} (age: {file_age/60:.1f} minutes)")
+        except Exception as e:
+            print(f"Failed to clean up {file_path}: {e}")
+def restart():
+    """Reset all states"""
+    return None, [[], []], None, \
+           gr.update(visible=False), gr.update(visible=False), \
+           gr.update(visible=False), None, None, None, ""
+# CSS styling
+custom_css = """
+.gradio-container {width: 90% !important; margin: 0 auto;}
+.title-text {text-align: center; font-size: 48px; font-weight: bold;
+             background: linear-gradient(to right, #8b5cf6, #10b981);
+             -webkit-background-clip: text; -webkit-text-fill-color: transparent;}
+.description-text {text-align: center; font-size: 18px; margin: 20px 0;}
+button {border-radius: 8px !important;}
+.green_button {background-color: #10b981 !important; color: white !important;}
+.red_button {background-color: #ef4444 !important; color: white !important;}
+.run_matting_button {
+    background: linear-gradient(135deg, #667eea 0%, #764ba2 50%, #f093fb 100%) !important;
+    color: white !important;
+    font-weight: bold !important;
+    font-size: 18px !important;
+    padding: 20px !important;
+    box-shadow: 0 4px 15px 0 rgba(102, 126, 234, 0.75) !important;
+    border: none !important;
+}
+.run_matting_button:hover {
+    background: linear-gradient(135deg, #764ba2 0%, #667eea 50%, #f093fb 100%) !important;
+    box-shadow: 0 6px 20px 0 rgba(102, 126, 234, 0.9) !important;
+    transform: translateY(-2px) !important;
+}
+"""
+# Build Gradio interface
+with gr.Blocks(css=custom_css, title="VideoMaMa Demo") as demo:
+    gr.HTML('<div class="title-text">VideoMaMa Interactive Demo</div>')
+    gr.Markdown(
+        '<div class="description-text">🎬 Upload a video → 🖱️ Click to mark object → ✅ Generate masks → 🎨 Run VideoMaMa</div>'
+    )
+    gr.Markdown(
+        '<div style="text-align: center; color: #6b7280; font-size: 14px; margin-top: -10px;">Note: VideoMaMa processes the selected number of frames (1-50). Longer videos will be randomly sampled.</div>'
+    )
+    # State variables
+    video_state = gr.State(None)
+    click_state = gr.State([[], []])  # [[points], [labels]]
+    with gr.Row():
+        with gr.Column(scale=1):
+            gr.Markdown("### Step 1: Upload Video")
+            video_input = gr.Video(label="Input Video")
+            num_frames_slider = gr.Slider(
+                minimum=1,
+                maximum=50,
+                value=24,
+                step=1,
+                label="Number of Frames to Process",
+                info="VideoMaMa will process only this many frames. More frames = better quality but slower."
+            )
+            load_button = gr.Button("📁 Load Video", variant="primary")
+            gr.Markdown("### Step 2: Mark Object")
+            point_prompt = gr.Radio(
+                choices=["Positive", "Negative"],
+                value="Positive",
+                label="Click Type",
+                info="Positive: object, Negative: background",
+                visible=False
+            )
+            clear_button = gr.Button("🗑️ Clear Clicks", visible=False)
+        with gr.Column(scale=1):
+            gr.Markdown("### First Frame (Click to Add Points)")
+            first_frame_display = gr.Image(
+                label="First Frame",
+                type="pil",
+                interactive=True
+            )
+            run_button = gr.Button("🚀 Run Matting", visible=False, elem_classes="run_matting_button", size="lg")
+    status_text = gr.Textbox(label="Status", value="", interactive=False, visible=False)
+    gr.Markdown("### Outputs")
+    with gr.Row():
+        with gr.Column():
+            output_video = gr.Video(label="Matting Result", autoplay=True)
+        with gr.Column():
+            greenscreen_video = gr.Video(label="Greenscreen Composite", autoplay=True)
+        with gr.Column():
+            mask_video = gr.Video(label="Mask Track", autoplay=True)
+    # Event handlers
+    load_button.click(
+        fn=load_video,
+        inputs=[video_input, video_state, num_frames_slider],
+        outputs=[video_state, first_frame_display,
+                point_prompt, clear_button, run_button, status_text]
+    )
+    first_frame_display.select(
+        fn=sam_refine,
+        inputs=[video_state, point_prompt, click_state],
+        outputs=[first_frame_display, video_state, click_state]
+    )
+    clear_button.click(
+        fn=clear_clicks,
+        inputs=[video_state, click_state],
+        outputs=[first_frame_display, video_state, click_state]
+    )
+    run_button.click(
+        fn=run_videomama_with_sam2,
+        inputs=[video_state, click_state],
+        outputs=[video_state, output_video, mask_video, greenscreen_video, status_text]
+    )
+    video_input.change(
+        fn=restart,
+        inputs=[],
+        outputs=[video_state, click_state, first_frame_display,
+                point_prompt, clear_button, run_button,
+                output_video, mask_video, greenscreen_video, status_text]
+    )
+    # Examples
+    gr.Markdown("---\n### 📦 Example Videos")
+    example_dir = Path("samples")
+    if example_dir.exists():
+        examples = [str(p) for p in sorted(example_dir.glob("*.mp4"))]
+        if examples:
+            gr.Examples(examples=examples, inputs=[video_input])
+if __name__ == "__main__":
+    print("=" * 60)
+    print("VideoMaMa Interactive Demo")
+    print("=" * 60)
+    # Clean up old output files on startup
+    cleanup_old_outputs(max_age_minutes=30)
+    # Initialize models
+    initialize_models()
+    # Launch demo
+    demo.queue()
+    demo.launch(
+        server_name="0.0.0.0",  # bind all interfaces so a hosting proxy (e.g. a Space) can reach the app
+        server_port=7860
+    )
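
Review note on two pieces of arithmetic in `app.py`: the frame subsampling with FPS adjustment in `extract_frames_from_video`, and the greenscreen blend in `run_videomama_with_sam2`. The sketch below restates both with the standard library only, so it can be checked without OpenCV or NumPy. The function names are hypothetical, and `random.sample` stands in for the commit's `np.random.choice(..., replace=False)`; the per-pixel blend truncates to an integer, as the `astype(np.uint8)` cast does for in-range values.

```python
import random


def sample_frame_indices(total_frames, max_frames, seed=None):
    """Pick at most max_frames distinct indices, returned in temporal order."""
    if total_frames <= max_frames:
        return list(range(total_frames))
    rng = random.Random(seed)
    return sorted(rng.sample(range(total_frames), max_frames))


def adjusted_fps(original_fps, kept, total):
    """Scale the FPS so the kept frames still span the original duration."""
    return original_fps * kept / total


def composite_pixel(rgb, alpha, background=(156, 251, 165)):
    """Blend one RGB pixel over the background: rgb * alpha + bg * (1 - alpha)."""
    return tuple(
        int(c * alpha + b * (1.0 - alpha)) for c, b in zip(rgb, background)
    )


indices = sample_frame_indices(total_frames=120, max_frames=24, seed=0)
fps = adjusted_fps(30.0, kept=len(indices), total=120)
# 24 frames played at the adjusted rate last as long as 120 frames at 30 FPS
print(len(indices), fps, len(indices) / fps == 120 / 30.0)
print(composite_pixel((200, 100, 50), alpha=0.5))
```

Because the indices are drawn at random, the spacing between kept frames is uneven, so motion in the output can look irregular even though the total duration is preserved.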

download_checkpoints.sh ADDED

	@@ -0,0 +1,78 @@

+#!/bin/bash
+# Download model checkpoints for VideoMaMa demo
+set -e
+echo "🔽 Downloading model checkpoints for VideoMaMa demo..."
+echo ""
+# Create checkpoints directory
+echo "Creating checkpoints directory..."
+mkdir -p checkpoints/sam2
+echo "✓ Directory created"
+echo ""
+# Download SAM2 checkpoint
+echo "Downloading SAM2 checkpoint..."
+echo "URL: https://dl.fbaipublicfiles.com/segment_anything_2/072824/sam2_hiera_large.pt"
+echo "This may take a few minutes (file size: ~900MB)..."
+if command -v wget &> /dev/null; then
+    wget https://dl.fbaipublicfiles.com/segment_anything_2/072824/sam2_hiera_large.pt \
+         -O checkpoints/sam2/sam2_hiera_large.pt
+elif command -v curl &> /dev/null; then
+    curl -L https://dl.fbaipublicfiles.com/segment_anything_2/072824/sam2_hiera_large.pt \
+         -o checkpoints/sam2/sam2_hiera_large.pt
+else
+    echo "❌ Error: Neither wget nor curl is available. Please install one of them."
+    exit 1
+fi
+echo "✓ SAM2 checkpoint downloaded successfully"
+echo ""
+# Check if VideoMaMa checkpoint exists
+echo "Checking VideoMaMa checkpoint..."
+if [ -d "checkpoints/VideoMaMa" ]; then
+    if [ -f "checkpoints/VideoMaMa/config.json" ] && \
+       { [ -f "checkpoints/VideoMaMa/diffusion_pytorch_model.safetensors" ] || \
+         [ -f "checkpoints/VideoMaMa/diffusion_pytorch_model.bin" ]; }; then
+        echo "✓ VideoMaMa checkpoint already exists"
+    else
+        echo "⚠️  VideoMaMa checkpoint directory exists but is incomplete"
+        echo "   Please add the following files to checkpoints/VideoMaMa/:"
+        echo "   - config.json"
+        echo "   - diffusion_pytorch_model.safetensors (or .bin)"
+    fi
+else
+    echo "⚠️  VideoMaMa checkpoint not found"
+    echo ""
+    echo "📝 Manual step required:"
+    echo "   1. Create directory: checkpoints/VideoMaMa/"
+    echo "   2. Copy your trained VideoMaMa checkpoint files:"
+    echo "      - config.json"
+    echo "      - diffusion_pytorch_model.safetensors (or .bin)"
+    echo ""
+    echo "   Example:"
+    echo "   mkdir -p checkpoints/VideoMaMa"
+    echo "   cp /path/to/your/checkpoint/* checkpoints/VideoMaMa/"
+fi
+echo ""
+printf '=%.0s' {1..70}; echo
+echo "✨ Checkpoint download complete!"
+printf '=%.0s' {1..70}; echo
+echo ""
+echo "Next steps:"
+echo "1. Verify checkpoints are in place:"
+echo "   python test_setup.py"
+echo ""
+echo "2. (Optional) Add sample videos:"
+echo "   mkdir -p samples"
+echo "   cp your_sample.mp4 samples/"
+echo ""
+echo "3. Test locally:"
+echo "   python app.py"
+echo ""
+echo "4. Deploy to Hugging Face Space"
+echo ""
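
Review note on the VideoMaMa checkpoint check in `download_checkpoints.sh`: the same three-way decision (directory missing, directory incomplete, checkpoint present) can be expressed in Python, which may be convenient if the check is ever needed from `app.py`. This is only a sketch mirroring the file names the script tests for; `videomama_checkpoint_status` is a hypothetical helper and is not part of this commit.

```python
import tempfile
from pathlib import Path

# File names taken from the checks in download_checkpoints.sh
WEIGHT_FILES = (
    "diffusion_pytorch_model.safetensors",
    "diffusion_pytorch_model.bin",
)


def videomama_checkpoint_status(ckpt_dir):
    """Return 'missing', 'incomplete', or 'ok' for a checkpoint directory."""
    ckpt_dir = Path(ckpt_dir)
    if not ckpt_dir.is_dir():
        return "missing"
    has_config = (ckpt_dir / "config.json").is_file()
    has_weights = any((ckpt_dir / name).is_file() for name in WEIGHT_FILES)
    return "ok" if has_config and has_weights else "incomplete"


# Small demo in a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    ckpt = Path(tmp) / "VideoMaMa"
    print(videomama_checkpoint_status(ckpt))  # directory not created yet
    ckpt.mkdir()
    (ckpt / "config.json").write_text("{}")
    print(videomama_checkpoint_status(ckpt))  # config only
    (ckpt / WEIGHT_FILES[0]).write_bytes(b"")
    print(videomama_checkpoint_status(ckpt))  # config plus weights
```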

enhanced_ui.py ADDED

	@@ -0,0 +1,72 @@

+import gradio as gr
+import numpy as np
+from PIL import Image
+def create_enhanced_ui():
+    with gr.Blocks() as demo:
+        gr.Markdown("# VideoMaMa - Enhanced Segmentation")
+        with gr.Row():
+            with gr.Column():
+                video_input = gr.Video(label="Upload Video")
+                # Segmentation method selector
+                seg_method = gr.Radio(
+                    ["Click Points", "Brush/Draw", "Text Prompt"],
+                    label="Segmentation Method",
+                    value="Click Points"
+                )
+                # Text prompt input (shown when Text Prompt selected)
+                text_prompt = gr.Textbox(
+                    label="Text Prompt",
+                    placeholder="e.g., 'person', 'piano', 'cat'",
+                    visible=False
+                )
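+                # Image editor with multiple tools
+                # (gr.Image no longer accepts tool/brush_* arguments in Gradio 4.x,
+                # which the README's sdk_version targets; gr.ImageEditor replaces them)
+                image_editor = gr.ImageEditor(
+                    label="Select/Draw Object",
+                    brush=gr.Brush(default_size=15, colors=["#FF0000"])  # Brush tool
+                )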
+                process_btn = gr.Button("Process Video", variant="primary")
+            with gr.Column():
+                output_video = gr.Video(label="Result")
+                mask_preview = gr.Image(label="Mask Preview")
+        # Toggle text input visibility based on method
+        def update_visibility(method):
+            return gr.update(visible=(method == "Text Prompt"))
+        seg_method.change(
+            update_visibility,
+            inputs=[seg_method],
+            outputs=[text_prompt]
+        )
+        process_btn.click(
+            process_video_enhanced,
+            inputs=[video_input, seg_method, text_prompt, image_editor],
+            outputs=[output_video, mask_preview]
+        )
+    return demo
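+def process_video_enhanced(video, method, text_prompt, image_data):
+    # NOTE: text_to_points, image_data_to_mask, extract_points_from_clicks and
+    # videomama_pipeline are not defined or imported anywhere in this file;
+    # they must be provided before this function can run.
+    # Initialize both so neither name is unbound in any branch.
+    points, mask = None, None
+    if method == "Text Prompt":
+        # Use Grounding DINO + SAM2
+        points = text_to_points(text_prompt, video)
+    elif method == "Brush/Draw":
+        # Use drawn mask directly (note: mask is not yet passed to the pipeline)
+        mask = image_data_to_mask(image_data)
+    else:
+        # Use click points (original method)
+        points = extract_points_from_clicks(image_data)
+    # Process with VideoMaMa (existing pipeline).
+    # The click handler declares two outputs (output_video, mask_preview),
+    # so process() is assumed to return both.
+    return videomama_pipeline.process(video, points)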

pipeline_svd_mask.py ADDED

	@@ -0,0 +1,1038 @@

+# pipeline_svd_mask.py
+import inspect
+from dataclasses import dataclass
+from typing import Callable, Dict, List, Optional, Union
+import numpy as np
+import PIL.Image
+import torch
+from transformers import CLIPImageProcessor, CLIPVisionModelWithProjection
+from diffusers.image_processor import PipelineImageInput
+from diffusers.models import AutoencoderKLTemporalDecoder, UNetSpatioTemporalConditionModel
+from diffusers.schedulers import EulerDiscreteScheduler
+from diffusers.utils import BaseOutput, logging, replace_example_docstring
+from diffusers.utils.torch_utils import randn_tensor
+from diffusers.video_processor import VideoProcessor
+from diffusers.pipelines.pipeline_utils import DiffusionPipeline
+# Import necessary helpers from the original SVD pipeline
+from diffusers.pipelines.stable_video_diffusion.pipeline_stable_video_diffusion import (
+    _append_dims,
+    retrieve_timesteps,
+    _resize_with_antialiasing,
+)
+import torch.nn.functional as F
+from einops import rearrange
+logger = logging.get_logger(__name__)  # pylint: disable=invalid-name
+EXAMPLE_DOC_STRING = """
+    Examples:
+        ```py
+        >>> import torch
+        >>> from pipeline_svd_mask import StableVideoDiffusionPipelineWithMask
+        >>> from diffusers.utils import load_image, export_to_video
+        >>> # Load your fine-tuned UNet, VAE, etc.
+        >>> pipe = StableVideoDiffusionPipelineWithMask.from_pretrained(
+        ...     "path/to/your/finetuned_model", torch_dtype=torch.float16, variant="fp16"
+        ... )
+        >>> pipe.to("cuda")
+        >>> # Load the conditioning image and the mask
+        >>> image = load_image("path/to/your/conditioning_image.png").resize((1024, 576))
+        >>> mask = load_image("path/to/your/mask_image.png").resize((1024, 576))
+        >>> # Generate frames
+        >>> frames = pipe(
+        ...     image=image,
+        ...     mask_image=mask,
+        ...     num_frames=25,
+        ...     decode_chunk_size=8
+        ... ).frames[0]
+        >>> export_to_video(frames, "generated_video.mp4", fps=7)
+        ```
+"""
+@dataclass
+class StableVideoDiffusionPipelineOutput(BaseOutput):
+    r"""
+    Output class for the custom Stable Video Diffusion pipeline.
+    Args:
+        frames (`List[List[PIL.Image.Image]]`, `np.ndarray`, or `torch.Tensor`):
+            List of denoised PIL images of length `batch_size` or numpy array or torch tensor of shape
+            `(batch_size, num_frames, height, width, num_channels)`.
+    """
+    frames: Union[List[List[PIL.Image.Image]], np.ndarray, torch.Tensor]
+class StableVideoDiffusionPipelineWithMask(DiffusionPipeline):
+    r"""
+    A custom pipeline based on Stable Video Diffusion that accepts an additional mask for conditioning.
+    This pipeline is designed to work with a UNet fine-tuned to accept either 12 input channels
+    (4 for noise, 4 for VAE-encoded condition image, 4 for VAE-encoded mask) or 9 input channels
+    (4 for noise, 4 for VAE-encoded condition image, 1 for a binarized mask resized to latent resolution).
+    """
+    model_cpu_offload_seq = "image_encoder->unet->vae"
+    _callback_tensor_inputs = ["latents"]
+    def __init__(
+            self,
+            vae: AutoencoderKLTemporalDecoder,
+            image_encoder: CLIPVisionModelWithProjection,
+            unet: UNetSpatioTemporalConditionModel,
+            scheduler: EulerDiscreteScheduler,
+            feature_extractor: CLIPImageProcessor,
+    ):
+        super().__init__()
+        self.register_modules(
+            vae=vae,
+            image_encoder=image_encoder,
+            unet=unet,
+            scheduler=scheduler,
+            feature_extractor=feature_extractor,
+        )
+        self.vae_scale_factor = 2 ** (len(self.vae.config.block_out_channels) - 1)
+        self.video_processor = VideoProcessor(do_resize=True, vae_scale_factor=self.vae_scale_factor)
+    def _encode_image(
+            self,
+            image: PipelineImageInput,
+            device: Union[str, torch.device],
+            num_videos_per_prompt: int,
+    ) -> torch.Tensor:
+        dtype = next(self.image_encoder.parameters()).dtype
+        if not isinstance(image, torch.Tensor):
+            image = self.video_processor.pil_to_numpy(image)
+            image = self.video_processor.numpy_to_pt(image)
+        image = image * 2.0 - 1.0
+        image = _resize_with_antialiasing(image, (224, 224))
+        image = (image + 1.0) / 2.0
+        image = self.feature_extractor(
+            images=image,
+            do_normalize=True,
+            do_center_crop=False,
+            do_resize=False,
+            do_rescale=False,
+            return_tensors="pt",
+        ).pixel_values
+        image = image.to(device=device, dtype=dtype)
+        image_embeddings = self.image_encoder(image).image_embeds
+        image_embeddings = image_embeddings.unsqueeze(1)
+        # The CLIP embeddings are replaced with zeros, so only their shape, dtype, and device reach the UNet.
+        image_embeddings = torch.zeros_like(image_embeddings)
+        return image_embeddings
+    def _encode_vae_image(
+            self,
+            image: torch.Tensor,
+            device: Union[str, torch.device],
+            num_videos_per_prompt: int,
+    ):
+        image = image.to(device=device, dtype=self.vae.dtype)
+        image_latents = self.vae.encode(image).latent_dist.sample()
+        image_latents = image_latents.repeat(num_videos_per_prompt, 1, 1, 1)
+        return image_latents
+    def _get_add_time_ids(
+            self,
+            fps: int,
+            motion_bucket_id: int,
+            noise_aug_strength: float,
+            dtype: torch.dtype,
+            batch_size: int,
+            num_videos_per_prompt: int,
+    ):
+        add_time_ids = [fps, motion_bucket_id, noise_aug_strength]
+        passed_add_embed_dim = self.unet.config.addition_time_embed_dim * len(add_time_ids)
+        expected_add_embed_dim = self.unet.add_embedding.linear_1.in_features
+        if expected_add_embed_dim != passed_add_embed_dim:
+            raise ValueError(
+                f"Model expects an added time embedding vector of length {expected_add_embed_dim}, but a vector of {passed_add_embed_dim} was created."
+            )
+        add_time_ids = torch.tensor([add_time_ids], dtype=dtype)
+        add_time_ids = add_time_ids.repeat(batch_size * num_videos_per_prompt, 1)
+        return add_time_ids
+    def decode_latents(self, latents: torch.Tensor, num_frames: int, decode_chunk_size: int = 14):
+        latents = latents.flatten(0, 1).to(dtype=self.vae.dtype)
+        latents = 1 / self.vae.config.scaling_factor * latents
+        frames = []
+        for i in range(0, latents.shape[0], decode_chunk_size):
+            num_frames_in = latents[i: i + decode_chunk_size].shape[0]
+            frame = self.vae.decode(latents[i: i + decode_chunk_size], num_frames=num_frames_in).sample
+            frames.append(frame)
+        frames = torch.cat(frames, dim=0)
+        frames = frames.reshape(-1, num_frames, *frames.shape[1:]).permute(0, 2, 1, 3, 4)
+        frames = frames.float()
+        return frames
+    def check_inputs(self, image, height, width):
+        if (
+                not isinstance(image, torch.Tensor)
+                and not isinstance(image, PIL.Image.Image)
+                and not isinstance(image, list)
+        ):
+            raise ValueError(
+                f"`image` has to be of type `torch.Tensor`, `PIL.Image.Image`, or `list` but is {type(image)}"
+            )
+        if height % 8 != 0 or width % 8 != 0:
+            raise ValueError(f"`height` and `width` have to be divisible by 8 but are {height} and {width}.")
+    def prepare_latents(
+            self,
+            batch_size: int,
+            num_frames: int,
+            height: int,
+            width: int,
+            dtype: torch.dtype,
+            device: Union[str, torch.device],
+            generator: torch.Generator,
+            latents: Optional[torch.Tensor] = None,
+            initial_latents: Optional[torch.Tensor] = None,
+            denoising_strength: float = 1.0,
+            timestep: Optional[torch.Tensor] = None,
+    ):
+        num_channels_latents = self.unet.config.out_channels
+        shape = (
+            batch_size,
+            num_frames,
+            num_channels_latents,
+            height // self.vae_scale_factor,
+            width // self.vae_scale_factor,
+        )
+        if initial_latents is not None:
+            # Noise is added to the initial latents
+            noise = randn_tensor(shape, generator=generator, device=device, dtype=dtype)
+            # Get the initial latents at the given timestep
+            latents = self.scheduler.add_noise(initial_latents, noise, timestep)
+        else:
+            # Standard pure noise generation
+            if latents is None:
+                latents = randn_tensor(shape, generator=generator, device=device, dtype=dtype)
+            else:
+                latents = latents.to(device)
+            # Scale the initial noise by the standard deviation required by the scheduler
+            latents = latents * self.scheduler.init_noise_sigma
+        return latents
+    def _encode_video_vae(
+            self,
+            video_frames: torch.Tensor,  # Expects (B, F, C, H, W)
+            device: Union[str, torch.device],
+    ):
+        video_frames = video_frames.to(device=device, dtype=self.vae.dtype)
+        batch_size, num_frames = video_frames.shape[:2]
+        # Reshape for VAE encoding
+        video_frames_reshaped = video_frames.reshape(batch_size * num_frames, *video_frames.shape[2:])  # (B*F, C, H, W)
+        latents = self.vae.encode(video_frames_reshaped).latent_dist.sample()  # (B*F, C_latent, H_latent, W_latent)
+        # Reshape back to video format
+        latents = latents.reshape(batch_size, num_frames, *latents.shape[1:])  # (B, F, C_latent, H_latent, W_latent)
+        return latents
+    @torch.no_grad()
+    def __call__(
+            self,
+            image: Union[List[PIL.Image.Image], torch.Tensor],
+            mask_image: Union[List[PIL.Image.Image], torch.Tensor],
+            alpha_matte_image: Optional[Union[List[PIL.Image.Image], torch.Tensor]] = None,
+            denoising_strength: float = 0.7,
+            height: int = 576,
+            width: int = 1024,
+            num_frames: Optional[int] = None,
+            num_inference_steps: int = 30,
+            sigmas: Optional[List[float]] = None,
+            fps: int = 7,
+            motion_bucket_id: int = 127,
+            noise_aug_strength: float = 0.02,
+            decode_chunk_size: Optional[int] = None,
+            num_videos_per_prompt: Optional[int] = 1,
+            generator: Optional[Union[torch.Generator, List[torch.Generator]]] = None,
+            latents: Optional[torch.Tensor] = None,
+            output_type: Optional[str] = "pil",
+            return_dict: bool = True,
+            mask_noise_strength: float = 0.0,
+    ):
+        height = height or self.unet.config.sample_size * self.vae_scale_factor
+        width = width or self.unet.config.sample_size * self.vae_scale_factor
+        if num_frames is None:
+            if isinstance(image, list):
+                num_frames = len(image)
+            else:
+                num_frames = self.unet.config.num_frames
+        decode_chunk_size = decode_chunk_size if decode_chunk_size is not None else num_frames
+        self.check_inputs(image, height, width)
+        self.check_inputs(mask_image, height, width)
+        if alpha_matte_image is not None:
+            self.check_inputs(alpha_matte_image, height, width)
+        batch_size = 1
+        device = self._execution_device
+        dtype = self.unet.dtype
+        # The first frame (first list element or first tensor entry) is what the CLIP encoder sees.
+        image_for_clip = image[0]
+        image_embeddings = self._encode_image(image_for_clip, device, num_videos_per_prompt)
+        fps = fps - 1
+        image_tensor = self.video_processor.preprocess(image, height=height, width=width).to(device).unsqueeze(0)
+        mask_tensor = self.video_processor.preprocess(mask_image, height=height, width=width).to(device).unsqueeze(0)
+        noise = randn_tensor(image_tensor.shape, generator=generator, device=device, dtype=dtype)
+        image_tensor = image_tensor + noise_aug_strength * noise
+        conditional_latents = self._encode_video_vae(image_tensor, device)
+        conditional_latents = conditional_latents / self.vae.config.scaling_factor
+        if self.unet.config.in_channels == 12:
+            mask_latents = self._encode_video_vae(mask_tensor, device)
+            mask_latents = mask_latents / self.vae.config.scaling_factor
+        elif self.unet.config.in_channels == 9:
+            mask_tensor_gray = mask_tensor.mean(dim=2, keepdim=True)
+            binarized_mask = (mask_tensor_gray > 0.0).to(dtype)
+            b, f, c, h, w = binarized_mask.shape
+            binarized_mask_reshaped = binarized_mask.reshape(b * f, c, h, w)
+            target_size = (height // self.vae_scale_factor, width // self.vae_scale_factor)
+            interpolated_mask = F.interpolate(
+                binarized_mask_reshaped,
+                size=target_size,
+                mode='nearest',
+            )
+            mask_latents = interpolated_mask.reshape(b, f, *interpolated_mask.shape[1:])
+        else:
+            raise ValueError(f"Unsupported number of UNet input channels: {self.unet.config.in_channels}.")
+        if mask_noise_strength > 0.0:
+            mask_noise = randn_tensor(mask_latents.shape, generator=generator, device=device, dtype=dtype)
+            mask_latents = mask_latents + mask_noise_strength * mask_noise
+        added_time_ids = self._get_add_time_ids(
+            fps, motion_bucket_id, noise_aug_strength, image_embeddings.dtype, batch_size, num_videos_per_prompt
+        )
+        added_time_ids = added_time_ids.to(device)
+        # --- MODIFIED FOR ALPHA MATTE REFINEMENT ---
+        timesteps, num_inference_steps = retrieve_timesteps(self.scheduler, num_inference_steps, device, None, sigmas)
+        # self.scheduler.set_timesteps(num_inference_steps, device=device)
+        # timesteps = self.scheduler.timesteps
+        initial_latents = None
+        if alpha_matte_image is not None:
+            alpha_matte_tensor = self.video_processor.preprocess(alpha_matte_image, height=height, width=width).to(
+                device).unsqueeze(0)
+            initial_latents = self._encode_video_vae(alpha_matte_tensor, device)
+            initial_latents = initial_latents / self.vae.config.scaling_factor
+            # Adjust the number of steps and the timesteps to start from
+            t_start = max(num_inference_steps - int(num_inference_steps * denoising_strength), 0)
+            timesteps = timesteps[t_start:]
+            # We need the first timestep to add the correct amount of noise
+            start_timestep = timesteps[0]
+        else:
+            start_timestep = timesteps[0]  # Not used, but for clarity
+        latents = self.prepare_latents(
+            batch_size * num_videos_per_prompt,
+            num_frames,
+            height,
+            width,
+            dtype,
+            device,
+            generator,
+            latents,
+            initial_latents=initial_latents,
+            denoising_strength=denoising_strength,
+            timestep=start_timestep if initial_latents is not None else None,
+        )
+        num_warmup_steps = len(timesteps) - num_inference_steps * self.scheduler.order
+        self._num_timesteps = len(timesteps)
+        with self.progress_bar(total=len(timesteps)) as progress_bar:
+            for i, t in enumerate(timesteps):
+                latent_model_input = self.scheduler.scale_model_input(latents, t)
+                latent_model_input = torch.cat([latent_model_input, conditional_latents, mask_latents], dim=2)
+                noise_pred = self.unet(
+                    latent_model_input, t, encoder_hidden_states=image_embeddings, added_time_ids=added_time_ids,
+                    return_dict=False
+                )[0]
+                latents = self.scheduler.step(noise_pred, t, latents).prev_sample
+                if i == len(timesteps) - 1 or ((i + 1) > num_warmup_steps and (i + 1) % self.scheduler.order == 0):
+                    progress_bar.update()
+        frames = self.decode_latents(latents, num_frames, decode_chunk_size)
+        frames = self.video_processor.postprocess_video(video=frames, output_type=output_type)
+        self.maybe_free_model_hooks()
+        if not return_dict:
+            return frames
+        return StableVideoDiffusionPipelineOutput(frames=frames)
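Note on `denoising_strength` in `StableVideoDiffusionPipelineWithMask.__call__` above: when an `alpha_matte_image` is supplied, the pipeline skips the first part of the scheduler's timestep list and starts denoising from the noised matte latents. The sketch below reproduces only that index arithmetic in plain Python. `truncate_timesteps` is a hypothetical helper name, not part of the pipeline, and the integer list stands in for the scheduler's timestep tensor.

```python
from typing import List, Sequence


def truncate_timesteps(timesteps: Sequence[int], denoising_strength: float) -> List[int]:
    """Keep the trailing share of the schedule selected by denoising_strength.

    Mirrors the pipeline's arithmetic:
    t_start = max(n - int(n * denoising_strength), 0)
    """
    num_inference_steps = len(timesteps)
    t_start = max(num_inference_steps - int(num_inference_steps * denoising_strength), 0)
    return list(timesteps[t_start:])


# Stand-in schedule, ordered from most noisy to least noisy.
schedule = list(range(10, 0, -1))

full = truncate_timesteps(schedule, 1.0)   # nothing skipped
half = truncate_timesteps(schedule, 0.5)   # first half skipped
none = truncate_timesteps(schedule, 0.0)   # everything skipped
```

With a strength of 0.0 the truncated list is empty, so the pipeline's `timesteps[0]` lookup would fail. Callers probably want a strength that keeps at least one step.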
+class StableVideoDiffusionPipelineOnestepWithMask(DiffusionPipeline):
+    r"""
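+    A single-step variant of the mask-conditioned Stable Video Diffusion pipeline: instead of running a
+    denoising loop, it calls the UNet once at a fixed timestep and decodes that prediction directly.
+    It expects a UNet fine-tuned to accept either 12 input channels (4 for noise, 4 for VAE-encoded
+    condition image, 4 for VAE-encoded mask) or 9 input channels (4 for noise, 4 for VAE-encoded
+    condition image, 1 for a binarized mask resized to latent resolution).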
+    """
+    model_cpu_offload_seq = "image_encoder->unet->vae"
+    _callback_tensor_inputs = ["latents"]
+    def __init__(
+            self,
+            vae: AutoencoderKLTemporalDecoder,
+            image_encoder: CLIPVisionModelWithProjection,
+            unet: UNetSpatioTemporalConditionModel,
+            scheduler: EulerDiscreteScheduler,
+            feature_extractor: CLIPImageProcessor,
+    ):
+        super().__init__()
+        self.register_modules(
+            vae=vae,
+            image_encoder=image_encoder,
+            unet=unet,
+            scheduler=scheduler,
+            feature_extractor=feature_extractor,
+        )
+        self.vae_scale_factor = 2 ** (len(self.vae.config.block_out_channels) - 1)
+        self.video_processor = VideoProcessor(do_resize=True, vae_scale_factor=self.vae_scale_factor)
+    def _encode_image(
+            self,
+            image: PipelineImageInput,
+            device: Union[str, torch.device],
+            num_videos_per_prompt: int,
+    ) -> torch.Tensor:
+        dtype = next(self.image_encoder.parameters()).dtype
+        if not isinstance(image, torch.Tensor):
+            image = self.video_processor.pil_to_numpy(image)
+            image = self.video_processor.numpy_to_pt(image)
+        image = image * 2.0 - 1.0
+        image = _resize_with_antialiasing(image, (224, 224))
+        image = (image + 1.0) / 2.0
+        image = self.feature_extractor(
+            images=image,
+            do_normalize=True,
+            do_center_crop=False,
+            do_resize=False,
+            do_rescale=False,
+            return_tensors="pt",
+        ).pixel_values
+        image = image.to(device=device, dtype=dtype)
+        image_embeddings = self.image_encoder(image).image_embeds
+        image_embeddings = image_embeddings.unsqueeze(1)
+        # The CLIP embeddings are replaced with zeros, so only their shape, dtype, and device reach the UNet.
+        image_embeddings = torch.zeros_like(image_embeddings)
+        return image_embeddings
+    def _encode_vae_image(
+            self,
+            image: torch.Tensor,
+            device: Union[str, torch.device],
+            num_videos_per_prompt: int,
+    ):
+        image = image.to(device=device, dtype=self.vae.dtype)
+        image_latents = self.vae.encode(image).latent_dist.sample()
+        image_latents = image_latents.repeat(num_videos_per_prompt, 1, 1, 1)
+        return image_latents
+    def _get_add_time_ids(
+            self,
+            fps: int,
+            motion_bucket_id: int,
+            noise_aug_strength: float,
+            dtype: torch.dtype,
+            batch_size: int,
+            num_videos_per_prompt: int,
+    ):
+        add_time_ids = [fps, motion_bucket_id, noise_aug_strength]
+        passed_add_embed_dim = self.unet.config.addition_time_embed_dim * len(add_time_ids)
+        expected_add_embed_dim = self.unet.add_embedding.linear_1.in_features
+        if expected_add_embed_dim != passed_add_embed_dim:
+            raise ValueError(
+                f"Model expects an added time embedding vector of length {expected_add_embed_dim}, but a vector of {passed_add_embed_dim} was created."
+            )
+        add_time_ids = torch.tensor([add_time_ids], dtype=dtype)
+        add_time_ids = add_time_ids.repeat(batch_size * num_videos_per_prompt, 1)
+        return add_time_ids
+    def decode_latents(self, latents: torch.Tensor, num_frames: int, decode_chunk_size: int = 14):
+        latents = latents.flatten(0, 1).to(dtype=self.vae.dtype)
+        latents = 1 / self.vae.config.scaling_factor * latents
+        frames = []
+        for i in range(0, latents.shape[0], decode_chunk_size):
+            num_frames_in = latents[i: i + decode_chunk_size].shape[0]
+            frame = self.vae.decode(latents[i: i + decode_chunk_size], num_frames=num_frames_in).sample
+            frames.append(frame)
+        frames = torch.cat(frames, dim=0)
+        frames = frames.reshape(-1, num_frames, *frames.shape[1:]).permute(0, 2, 1, 3, 4)
+        frames = frames.float()
+        return frames
+    def check_inputs(self, image, height, width):
+        if (
+                not isinstance(image, torch.Tensor)
+                and not isinstance(image, PIL.Image.Image)
+                and not isinstance(image, list)
+        ):
+            raise ValueError(
+                f"`image` has to be of type `torch.Tensor`, `PIL.Image.Image`, or `list` but is {type(image)}"
+            )
+        if height % 8 != 0 or width % 8 != 0:
+            raise ValueError(f"`height` and `width` have to be divisible by 8 but are {height} and {width}.")
+    def prepare_latents(
+            self,
+            batch_size: int,
+            num_frames: int,
+            height: int,
+            width: int,
+            dtype: torch.dtype,
+            device: Union[str, torch.device],
+            generator: torch.Generator,
+            latents: Optional[torch.Tensor] = None,
+    ):
+        # The number of channels for the initial noise is based on the UNet's out_channels
+        num_channels_latents = self.unet.config.out_channels
+        shape = (
+            batch_size,
+            num_frames,
+            num_channels_latents,
+            height // self.vae_scale_factor,
+            width // self.vae_scale_factor,
+        )
+        if isinstance(generator, list) and len(generator) != batch_size:
+            raise ValueError(f"batch size {batch_size} must match the length of the generators {len(generator)}.")
+        if latents is None:
+            latents = randn_tensor(shape, generator=generator, device=device, dtype=dtype)
+        else:
+            latents = latents.to(device)
+        latents = latents * self.scheduler.init_noise_sigma
+        return latents
+    def _encode_video_vae(
+            self,
+            video_frames: torch.Tensor,  # Expects (B, F, C, H, W)
+            device: Union[str, torch.device],
+    ):
+        video_frames = video_frames.to(device=device, dtype=self.vae.dtype)
+        batch_size, num_frames = video_frames.shape[:2]
+        # Reshape for VAE encoding
+        video_frames_reshaped = video_frames.reshape(batch_size * num_frames, *video_frames.shape[2:])  # (B*F, C, H, W)
+        latents = self.vae.encode(video_frames_reshaped).latent_dist.sample()  # (B*F, C_latent, H_latent, W_latent)
+        # Reshape back to video format
+        latents = latents.reshape(batch_size, num_frames, *latents.shape[1:])  # (B, F, C_latent, H_latent, W_latent)
+        return latents
+    @torch.no_grad()
+    def __call__(
+            self,
+            image: Union[List[PIL.Image.Image], torch.Tensor],
+            mask_image: Union[List[PIL.Image.Image], torch.Tensor],
+            height: int = 576,
+            width: int = 1024,
+            num_frames: Optional[int] = None,
+            fps: int = 7,
+            motion_bucket_id: int = 127,
+            noise_aug_strength: float = 0.0,
+            decode_chunk_size: Optional[int] = None,
+            num_videos_per_prompt: Optional[int] = 1,
+            generator: Optional[Union[torch.Generator, List[torch.Generator]]] = None,
+            latents: Optional[torch.Tensor] = None,
+            output_type: Optional[str] = "pil",
+            return_dict: bool = True,
+            mask_noise_strength: float = 0.0,
+    ):
+        height = height or self.unet.config.sample_size * self.vae_scale_factor
+        width = width or self.unet.config.sample_size * self.vae_scale_factor
+        if num_frames is None:
+            if isinstance(image, list):
+                num_frames = len(image)
+            else:
+                num_frames = self.unet.config.num_frames
+        decode_chunk_size = decode_chunk_size if decode_chunk_size is not None else num_frames
+        self.check_inputs(image, height, width)
+        self.check_inputs(mask_image, height, width)
+        if isinstance(image, list) and isinstance(mask_image, list):
+            if len(image) != len(mask_image):
+                raise ValueError("`image` and `mask_image` must have the same number of frames.")
+            if num_frames != len(image):
+                logger.warning(
+                    f"Mismatch between `num_frames` ({num_frames}) and number of input images ({len(image)}). Using {len(image)}.")
+                num_frames = len(image)
+        batch_size = 1
+        device = self._execution_device
+        dtype = self.unet.dtype
+        # The first frame (first list element or first tensor entry) is what the CLIP encoder sees.
+        image_for_clip = image[0]
+        image_embeddings = self._encode_image(image_for_clip, device, num_videos_per_prompt)
+        fps = fps - 1
+        image_tensor = self.video_processor.preprocess(image, height=height, width=width).to(device).unsqueeze(0)
+        mask_tensor = self.video_processor.preprocess(mask_image, height=height, width=width).to(
+            device).unsqueeze(0)
+        noise = randn_tensor(image_tensor.shape, generator=generator, device=device, dtype=dtype)
+        image_tensor = image_tensor + noise_aug_strength * noise
+        conditional_latents = self._encode_video_vae(image_tensor, device)
+        conditional_latents = conditional_latents / self.vae.config.scaling_factor
+        if self.unet.config.in_channels == 12:
+            mask_latents = self._encode_video_vae(mask_tensor, device)
+            mask_latents = mask_latents / self.vae.config.scaling_factor
+        elif self.unet.config.in_channels == 9:
+            mask_tensor_gray = mask_tensor.mean(dim=2, keepdim=True)
+            binarized_mask = (mask_tensor_gray > 0.0).to(dtype)
+            b, f, c, h, w = binarized_mask.shape
+            binarized_mask_reshaped = binarized_mask.reshape(b * f, c, h, w)
+            target_size = (height // self.vae_scale_factor, width // self.vae_scale_factor)
+            interpolated_mask = F.interpolate(
+                binarized_mask_reshaped,
+                size=target_size,
+                mode='nearest',
+            )
+            mask_latents = interpolated_mask.reshape(b, f, *interpolated_mask.shape[1:])
+        else:
+            raise ValueError(
+                f"Unsupported number of UNet input channels: {self.unet.config.in_channels}. "
+                "This pipeline only supports 9 (for interpolated mask) or 12 (for VAE mask)."
+            )
+        if mask_noise_strength > 0.0:
+            mask_noise = randn_tensor(mask_latents.shape, generator=generator, device=device, dtype=dtype)
+            mask_latents = mask_latents + mask_noise_strength * mask_noise
+        added_time_ids = self._get_add_time_ids(
+            fps, motion_bucket_id, noise_aug_strength, image_embeddings.dtype, batch_size, num_videos_per_prompt
+        )
+        added_time_ids = added_time_ids.to(device)
+        # **MODIFIED FOR SINGLE-STEP**: Prepare initial noise
+        num_channels_latents = self.unet.config.out_channels
+        shape = (
+            batch_size * num_videos_per_prompt,
+            num_frames,
+            num_channels_latents,
+            height // self.vae_scale_factor,
+            width // self.vae_scale_factor,
+        )
+        if latents is None:
+            latents = randn_tensor(shape, generator=generator, device=device, dtype=dtype)
+        # **MODIFIED FOR SINGLE-STEP**: Set a fixed high timestep
+        timestep = torch.tensor([1.0], dtype=dtype, device=device)  # Use a high sigma value
+        # **MODIFIED FOR SINGLE-STEP**: Single forward pass
+        latent_model_input = torch.cat([latents, conditional_latents, mask_latents], dim=2)
+        noise_pred = self.unet(
+            latent_model_input, timestep, encoder_hidden_states=image_embeddings, added_time_ids=added_time_ids,
+            return_dict=False
+        )[0]
+        # The model's prediction is the final denoised latent
+        denoised_latents = noise_pred
+        frames = self.decode_latents(denoised_latents, num_frames, decode_chunk_size)
+        frames = self.video_processor.postprocess_video(video=frames, output_type=output_type)
+        self.maybe_free_model_hooks()
+        if not return_dict:
+            return frames
+        return StableVideoDiffusionPipelineOutput(frames=frames)
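Note on `decode_latents` in the two pipelines above: it decodes the flattened latents in slices of `decode_chunk_size` and passes each slice's actual length to the VAE as `num_frames`. The plain-Python sketch below shows only the chunk bookkeeping, to make clear why the last slice can be shorter. `chunk_lengths` is a hypothetical helper and no VAE is involved.

```python
from typing import List


def chunk_lengths(total_frames: int, decode_chunk_size: int) -> List[int]:
    """Return the number of frames in each slice of a chunked decode."""
    if decode_chunk_size <= 0:
        raise ValueError("decode_chunk_size must be positive")
    lengths = []
    for start in range(0, total_frames, decode_chunk_size):
        # The final slice stops at total_frames, so it may be shorter.
        end = min(start + decode_chunk_size, total_frames)
        lengths.append(end - start)
    return lengths


even = chunk_lengths(16, 8)     # two full chunks
ragged = chunk_lengths(25, 8)   # three full chunks plus one leftover frame
single = chunk_lengths(5, 14)   # chunk size larger than the clip
```

A decoder call that always reported `decode_chunk_size` as the frame count would mis-describe the final slice whenever the frame total is not a multiple of the chunk size.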
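+class StableVideoDiffusionPipelineWithCrossAtnnMask(DiffusionPipeline):
+    r"""
+    A Stable Video Diffusion variant that conditions on the mask through cross-attention.
+    `mask_projector` maps each mask frame to an embedding that is passed to the UNet as
+    `encoder_hidden_states`, while the VAE-encoded image is concatenated with the noisy latents
+    along the channel axis.
+    """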
+    model_cpu_offload_seq = "image_encoder->unet->vae"
+    _callback_tensor_inputs = ["latents"]
+    def __init__(
+            self,
+            vae: AutoencoderKLTemporalDecoder,
+            unet: UNetSpatioTemporalConditionModel,
+            scheduler: EulerDiscreteScheduler,
+            mask_projector: torch.nn.Module,
+            # CLIP models are not strictly needed for inference if embeddings are not used
+            image_encoder: CLIPVisionModelWithProjection = None,
+            feature_extractor: CLIPImageProcessor = None,
+    ):
+        super().__init__()
+        self.register_modules(
+            vae=vae,
+            unet=unet,
+            scheduler=scheduler,
+            mask_projector=mask_projector,
+            image_encoder=image_encoder,
+            feature_extractor=feature_extractor,
+        )
+        self.vae_scale_factor = 2 ** (len(self.vae.config.block_out_channels) - 1)
+        self.video_processor = VideoProcessor(do_resize=False, vae_scale_factor=self.vae_scale_factor)
+    def _encode_image_vae(self, image: torch.Tensor, device: Union[str, torch.device]):
+        image = image.to(device=device, dtype=self.vae.dtype)
+        latent = self.vae.encode(image).latent_dist.sample()
+        return latent
+    def decode_latents(self, latents: torch.Tensor, num_frames: int, decode_chunk_size: int):
+        latents = latents.flatten(0, 1).to(dtype=self.vae.dtype)
+        latents = 1 / self.vae.config.scaling_factor * latents
+        frames = []
+        for i in range(0, latents.shape[0], decode_chunk_size):
+            # The last chunk may be shorter than decode_chunk_size, so pass its real length.
+            num_frames_in = latents[i: i + decode_chunk_size].shape[0]
+            frame = self.vae.decode(latents[i: i + decode_chunk_size], num_frames=num_frames_in).sample
+            frames.append(frame)
+        frames = torch.cat(frames, dim=0)
+        frames = frames.reshape(-1, num_frames, *frames.shape[1:]).permute(0, 2, 1, 3, 4)
+        frames = frames.float()
+        return frames
+    def _encode_video_vae(
+            self,
+            video_frames: torch.Tensor,  # Expects (B, F, C, H, W)
+            device: Union[str, torch.device],
+    ):
+        video_frames = video_frames.to(device=device, dtype=self.vae.dtype)
+        batch_size, num_frames = video_frames.shape[:2]
+        # Reshape for VAE encoding
+        video_frames_reshaped = video_frames.reshape(batch_size * num_frames, *video_frames.shape[2:])  # (B*F, C, H, W)
+        latents = self.vae.encode(video_frames_reshaped).latent_dist.sample()  # (B*F, C_latent, H_latent, W_latent)
+        # Reshape back to video format
+        latents = latents.reshape(batch_size, num_frames, *latents.shape[1:])  # (B, F, C_latent, H_latent, W_latent)
+        return latents
+    @torch.no_grad()
+    def __call__(
+            self,
+            image: Union[PIL.Image.Image, torch.Tensor],  # Static image for appearance
+            mask_image: List[PIL.Image.Image],  # Video mask for motion
+            height: int = 576,
+            width: int = 1024,
+            num_frames: Optional[int] = None,
+            num_inference_steps: int = 25,
+            fps: int = 7,
+            motion_bucket_id: int = 127,
+            noise_aug_strength: float = 0.0,  # Noise is added to latents now
+            decode_chunk_size: Optional[int] = 8,
+            generator: Optional[Union[torch.Generator, List[torch.Generator]]] = None,
+            output_type: Optional[str] = "pil",
+            return_dict: bool = True,
+    ):
+        device = self._execution_device
+        dtype = self.unet.dtype
+        num_frames = num_frames if num_frames is not None else len(mask_image)
+        decode_chunk_size = decode_chunk_size if decode_chunk_size is not None else num_frames
+        # 1. PREPARE STATIC IMAGE CONDITION
+        image_tensor = self.video_processor.preprocess(image, height, width).to(device).unsqueeze(0)
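+        conditional_latents = self._encode_video_vae(image_tensor, device)
+        conditional_latents = conditional_latents / self.vae.config.scaling_factor
+        # A single static image yields one latent frame; repeat it across frames so it can be
+        # concatenated with the per-frame noisy latents along the channel axis.
+        if conditional_latents.shape[1] == 1 and num_frames > 1:
+            conditional_latents = conditional_latents.repeat(1, num_frames, 1, 1, 1)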
+        # 2. PREPARE MASK MOTION CONDITION
+        mask_tensor = self.video_processor.preprocess(mask_image, height, width)
+        if mask_tensor.shape[1] > 1:
+            mask_tensor = mask_tensor.mean(dim=1, keepdim=True)
+        # The mask is already laid out as (T, C, H, W); only move it to the UNet's device and dtype.
+        mask_for_projection = mask_tensor.to(device=device, dtype=dtype)
+        encoder_hidden_states = self.mask_projector(mask_for_projection)
+        encoder_hidden_states = encoder_hidden_states.unsqueeze(1)  # (T, 1, D)
+        # Add batch dimension for UNet
+        encoder_hidden_states = encoder_hidden_states.unsqueeze(0)  # (1, T, 1, D)
+        # The UNet will handle flattening this to (B*T, 1, D) where B=1
+        # To be safe, we pass it pre-flattened.
+        encoder_hidden_states = rearrange(encoder_hidden_states, "b f s d -> (b f) s d")
+        # 3. PREPARE LATENTS
+        shape = (1, num_frames, self.unet.config.out_channels, height // self.vae_scale_factor,
+                 width // self.vae_scale_factor)
+        latents = randn_tensor(shape, generator=generator, device=device, dtype=dtype)
+        if noise_aug_strength > 0:
+            latents += noise_aug_strength * randn_tensor(latents.shape, generator=generator, device=device,
+                                                         dtype=dtype)
+        latents = latents * self.scheduler.init_noise_sigma
+        # 4. GET ADDED TIME IDS
+        # For pipeline, batch size is 1
+        added_time_ids = [fps - 1, motion_bucket_id, 0.0]  # noise_aug_strength for add_time_ids is 0 for inference
+        added_time_ids = torch.tensor([added_time_ids], dtype=dtype, device=device)
+        # 5. DENOISING LOOP
+        self.scheduler.set_timesteps(num_inference_steps, device=device)
+        timesteps = self.scheduler.timesteps
+        with self.progress_bar(total=num_inference_steps) as progress_bar:
+            for t in timesteps:
+                latent_model_input = self.scheduler.scale_model_input(latents, t)
+                unet_input = torch.cat([latent_model_input, conditional_latents], dim=2)
+                noise_pred = self.unet(
+                    unet_input, t, encoder_hidden_states=encoder_hidden_states, added_time_ids=added_time_ids
+                ).sample
+                latents = self.scheduler.step(noise_pred, t, latents).prev_sample
+                progress_bar.update()
+        # 6. DECODE
+        frames = self.decode_latents(latents, num_frames, decode_chunk_size)
+        frames = self.video_processor.postprocess_video(video=frames, output_type=output_type)
+        if not return_dict:
+            return (frames,)
+        return StableVideoDiffusionPipelineOutput(frames=frames)
+# pipeline.py
+import torch
+import torch.nn.functional as F
+from PIL import Image
+from einops import rearrange
+from torchvision import transforms
+from diffusers import AutoencoderKLTemporalDecoder, UNetSpatioTemporalConditionModel
+from transformers import CLIPImageProcessor, CLIPVisionModelWithProjection
+class VideoInferencePipeline:
+    """
+    A reusable pipeline for single-step video diffusion inference.
+    This class encapsulates the models and the core inference logic,
+    separating it from data loading and saving, which can vary between tasks.
+    """
+    def __init__(self, base_model_path: str, unet_checkpoint_path: str, device: str = "cuda",
+                 weight_dtype: torch.dtype = torch.float16):
+        """
+        Loads all necessary models into memory.
+        Args:
+            base_model_path (str): Path to the base Stable Video Diffusion model.
+            unet_checkpoint_path (str): Path to the fine-tuned UNet checkpoint.
+            device (str): The device to run models on ('cuda' or 'cpu').
+            weight_dtype (torch.dtype): The precision for model weights (float16 or bfloat16).
+        """
+        print("--- Initializing Inference Pipeline and Loading Models ---")
+        self.device = torch.device(device if torch.cuda.is_available() else "cpu")
+        self.weight_dtype = weight_dtype
+        # Load models from pretrained paths
+        try:
+            self.feature_extractor = CLIPImageProcessor.from_pretrained(base_model_path, subfolder="feature_extractor")
+            self.image_encoder = CLIPVisionModelWithProjection.from_pretrained(base_model_path,
+                                                                               subfolder="image_encoder",
+                                                                               variant="fp16")
+            self.vae = AutoencoderKLTemporalDecoder.from_pretrained(base_model_path, subfolder="vae", variant="fp16")
+            self.unet = UNetSpatioTemporalConditionModel.from_pretrained(unet_checkpoint_path, subfolder="unet")
+        except Exception as e:
+            raise IOError(f"Fatal error loading models: {e}") from e
+        # Move models to the specified device and set to evaluation mode
+        self.image_encoder.to(self.device, dtype=self.weight_dtype).eval()
+        self.vae.to(self.device, dtype=self.weight_dtype).eval()
+        self.unet.to(self.device, dtype=self.weight_dtype).eval()
+        print(f"--- Models Loaded Successfully on {self.device} ---")
+    def run(self, cond_frames, mask_frames, seed=42, mask_cond_mode="vae", fps=7, motion_bucket_id=127,
+            noise_aug_strength=0.0):
+        """
+        Runs the core inference process on a sequence of conditioning and mask frames.
+        Args:
+            cond_frames (list[Image.Image]): List of PIL images for conditioning.
+            mask_frames (list[Image.Image]): List of PIL images for the masks.
+            seed (int): Random seed for generation.
+            mask_cond_mode (str): How the mask is conditioned ("vae" or "interpolate").
+            fps (int): Frames per second to condition the model with.
+            motion_bucket_id (int): Motion bucket ID for conditioning.
+            noise_aug_strength (float): Noise augmentation strength.
+        Returns:
+            list[Image.Image]: A list of the generated video frames as PIL Images.
+        """
+        # --- 1. Prepare Tensors ---
+        cond_video_tensor = self._pil_to_tensor(cond_frames).to(self.device)
+        mask_video_tensor = self._pil_to_tensor(mask_frames).to(self.device)
+        if mask_video_tensor.shape[2] != 3:
+            mask_video_tensor = mask_video_tensor.repeat(1, 1, 3, 1, 1)
+        with torch.no_grad():
+            # --- 2. Get CLIP Image Embeddings ---
+            first_frame_tensor = cond_video_tensor[:, 0, :, :, :]
+            pixel_values_for_clip = self._resize_with_antialiasing(first_frame_tensor, (224, 224))
+            pixel_values_for_clip = ((pixel_values_for_clip + 1.0) / 2.0).clamp(0, 1)
+            pixel_values = self.feature_extractor(images=pixel_values_for_clip, return_tensors="pt").pixel_values
+            image_embeddings = self.image_encoder(pixel_values.to(self.device, dtype=self.weight_dtype)).image_embeds
+            # The CLIP embedding is computed only for its shape; the UNet is conditioned on zeros
+            encoder_hidden_states = torch.zeros_like(image_embeddings).unsqueeze(1)
+            # --- 3. Prepare Latents ---
+            cond_latents = self._tensor_to_vae_latent(cond_video_tensor.to(self.weight_dtype))
+            cond_latents = cond_latents / self.vae.config.scaling_factor
+            if mask_cond_mode == "vae":
+                mask_latents = self._tensor_to_vae_latent(mask_video_tensor.to(self.weight_dtype))
+                mask_latents = mask_latents / self.vae.config.scaling_factor
+            elif mask_cond_mode == "interpolate":
+                target_shape = cond_latents.shape[-2:]
+                b, t, c, h, w = mask_video_tensor.shape
+                mask_video_reshaped = rearrange(mask_video_tensor, "b t c h w -> (b t) c h w")
+                interpolated_mask = F.interpolate(mask_video_reshaped, size=target_shape, mode='bilinear',
+                                                  align_corners=False)
+                mask_latents = rearrange(interpolated_mask, "(b t) c h w -> b t c h w", b=b)
+            else:
+                raise ValueError(f"Unknown mask_cond_mode: {mask_cond_mode}")
+            # --- 4. Run UNet Single-Step Inference ---
+            generator = torch.Generator(device=self.device).manual_seed(seed)
+            noisy_latents = torch.randn(cond_latents.shape, generator=generator, device=self.device,
+                                        dtype=self.weight_dtype)
+            timesteps = torch.full((1,), 1, device=self.device, dtype=torch.long)
+            added_time_ids = self._get_add_time_ids(fps, motion_bucket_id, noise_aug_strength, batch_size=1)
+            unet_input = torch.cat([noisy_latents, cond_latents, mask_latents], dim=2)
+            pred_latents = self.unet(unet_input, timesteps, encoder_hidden_states, added_time_ids=added_time_ids).sample
+            # --- 5. Decode Latents to Video Frames ---
+            pred_latents = (1 / self.vae.config.scaling_factor) * pred_latents.squeeze(0)
+            frames = []
+            # Process in chunks to avoid VRAM issues, especially for long videos
+            for i in range(0, pred_latents.shape[0], 8):
+                chunk = pred_latents[i: i + 8]
+                decoded_chunk = self.vae.decode(chunk, num_frames=chunk.shape[0]).sample
+                frames.append(decoded_chunk)
+            video_tensor = torch.cat(frames, dim=0)
+            video_tensor = (video_tensor / 2.0 + 0.5).clamp(0, 1).mean(dim=1, keepdim=True).repeat(1, 3, 1, 1)
+            # Return a list of PIL images
+            return [transforms.ToPILImage()(frame) for frame in video_tensor]
+    def _pil_to_tensor(self, frames: list[Image.Image]):
+        """Converts a list of PIL images to a normalized video tensor."""
+        video_tensor = torch.stack([transforms.ToTensor()(f) for f in frames]).unsqueeze(0)
+        return video_tensor * 2.0 - 1.0
+    def _tensor_to_vae_latent(self, t: torch.Tensor):
+        """Encodes a video tensor into the VAE's latent space."""
+        video_length = t.shape[1]
+        t = rearrange(t, "b f c h w -> (b f) c h w")
+        latents = self.vae.encode(t).latent_dist.sample()
+        latents = rearrange(latents, "(b f) c h w -> b f c h w", f=video_length)
+        return latents * self.vae.config.scaling_factor
+    def _get_add_time_ids(self, fps, motion_bucket_id, noise_aug_strength, batch_size):
+        """Creates the additional time IDs for conditioning the UNet."""
+        add_time_ids_list = [fps, motion_bucket_id, noise_aug_strength]
+        passed_add_embed_dim = self.unet.config.addition_time_embed_dim * len(add_time_ids_list)
+        expected_add_embed_dim = self.unet.add_embedding.linear_1.in_features
+        if expected_add_embed_dim != passed_add_embed_dim:
+            raise ValueError(
+                f"Model expects an added time embedding vector of length {expected_add_embed_dim}, but a vector of {passed_add_embed_dim} was created.")
+        add_time_ids = torch.tensor([add_time_ids_list], dtype=self.weight_dtype, device=self.device)
+        return add_time_ids.repeat(batch_size, 1)
+    def _resize_with_antialiasing(self, input_tensor, size, interpolation="bicubic", align_corners=True):
+        """
+        Resizes a tensor with anti-aliasing for CLIP input, mirroring k-diffusion.
+        Copied directly from the helper function in the original scripts.
+        """
+        h, w = input_tensor.shape[-2:]
+        factors = (h / size[0], w / size[1])
+        sigmas = (max((factors[0] - 1.0) / 2.0, 0.001), max((factors[1] - 1.0) / 2.0, 0.001))
+        ks = int(max(2.0 * 2 * sigmas[0], 3)), int(max(2.0 * 2 * sigmas[1], 3))
+        # Force odd kernel sizes so each kernel has a centre tap
+        if ks[0] % 2 == 0:
+            ks = ks[0] + 1, ks[1]
+        if ks[1] % 2 == 0:
+            ks = ks[0], ks[1] + 1
+        def _compute_padding(kernel_size):
+            computed = [k - 1 for k in kernel_size]
+            out_padding = 2 * len(kernel_size) * [0]
+            for i in range(len(kernel_size)):
+                computed_tmp = computed[-(i + 1)]
+                pad_front = computed_tmp // 2
+                pad_rear = computed_tmp - pad_front
+                out_padding[2 * i + 0] = pad_front
+                out_padding[2 * i + 1] = pad_rear
+            return out_padding
+        def _filter2d(input_tensor, kernel):
+            b, c, h, w = input_tensor.shape
+            tmp_kernel = kernel[:, None, ...].to(device=input_tensor.device, dtype=input_tensor.dtype)
+            tmp_kernel = tmp_kernel.expand(-1, c, -1, -1)
+            height, width = tmp_kernel.shape[-2:]
+            padding_shape = _compute_padding([height, width])
+            input_tensor_padded = F.pad(input_tensor, padding_shape, mode="reflect")
+            tmp_kernel = tmp_kernel.reshape(-1, 1, height, width)
+            input_tensor_padded = input_tensor_padded.view(-1, tmp_kernel.size(0), input_tensor_padded.size(-2),
+                                                           input_tensor_padded.size(-1))
+            output = F.conv2d(input_tensor_padded, tmp_kernel, groups=tmp_kernel.size(0), padding=0, stride=1)
+            return output.view(b, c, h, w)
+        def _gaussian(window_size, sigma):
+            if isinstance(sigma, float):
+                sigma = torch.tensor([[sigma]])
+            x = (torch.arange(window_size, device=sigma.device, dtype=sigma.dtype) - window_size // 2).expand(
+                sigma.shape[0], -1)
+            if window_size % 2 == 0:
+                x = x + 0.5
+            gauss = torch.exp(-x.pow(2.0) / (2 * sigma.pow(2.0)))
+            return gauss / gauss.sum(-1, keepdim=True)
+        def _gaussian_blur2d(input_tensor, kernel_size, sigma):
+            if isinstance(sigma, tuple):
+                sigma = torch.tensor([sigma], dtype=input_tensor.dtype)
+            else:
+                sigma = sigma.to(dtype=input_tensor.dtype)
+            ky, kx = int(kernel_size[0]), int(kernel_size[1])
+            bs = sigma.shape[0]
+            kernel_x = _gaussian(kx, sigma[:, 1].view(bs, 1))
+            kernel_y = _gaussian(ky, sigma[:, 0].view(bs, 1))
+            out_x = _filter2d(input_tensor, kernel_x[..., None, :])
+            return _filter2d(out_x, kernel_y[..., None])
+        blurred_input = _gaussian_blur2d(input_tensor, ks, sigmas)
+        return F.interpolate(blurred_input, size=size, mode=interpolation, align_corners=align_corners)
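
The kernel-size arithmetic at the top of `_resize_with_antialiasing` is easy to check in isolation. The following is a minimal pure-Python sketch of that step only. The helper name `antialias_kernel_params` is hypothetical and not part of this commit, and the blur and interpolation themselves are omitted.

```python
def antialias_kernel_params(in_size, out_size):
    """Return (sigmas, kernel_sizes) for a Gaussian pre-blur before downscaling.

    Hypothetical helper mirroring the arithmetic in `_resize_with_antialiasing`.
    """
    h, w = in_size
    factors = (h / out_size[0], w / out_size[1])
    # Sigma grows with the downscale factor; 0.001 keeps it positive when no downscaling occurs
    sigmas = tuple(max((f - 1.0) / 2.0, 0.001) for f in factors)
    # Kernel width is 4 * sigma (truncated to an integer), with a minimum of 3 taps
    ks = [int(max(2.0 * 2 * s, 3)) for s in sigmas]
    # Force odd sizes so the kernel has a centre tap
    ks = tuple(k + 1 if k % 2 == 0 else k for k in ks)
    return sigmas, ks


# Default pipeline resolution (576 x 1024) resized to the CLIP input size
sigmas, ks = antialias_kernel_params((576, 1024), (224, 224))
print(sigmas, ks)
```

The wider axis gets the larger kernel because it is downscaled by a larger factor, so it needs more blur to suppress aliasing.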

requirements.txt ADDED Viewed

	@@ -0,0 +1,31 @@

+# Hugging Face Space Requirements for VideoMaMa Demo
+# Core frameworks
+torch>=2.0.0
+torchvision>=0.15.0
+diffusers>=0.24.0
+transformers>=4.30.0
+# Gradio for UI
+gradio==5.12.0
+# Image and video processing
+# Install only one OpenCV distribution; the contrib build already includes the main modules
+opencv-contrib-python>=4.8.0
+Pillow>=10.0.0
+numpy>=1.24.0
+scipy>=1.10.0
+# SAM2 dependencies
+git+https://github.com/facebookresearch/sam2.git
+# Additional utilities
+accelerate>=0.20.0
+einops>=0.6.0
+tqdm>=4.65.0
+safetensors>=0.3.0
+# For video export
+imageio>=2.31.0
+imageio-ffmpeg>=0.4.9
+pydantic==2.10.6

sam2_hiera_l.yaml ADDED Viewed

	@@ -0,0 +1,124 @@

+# Model Configuration for SAM2
+# This file should be placed alongside the SAM2 checkpoint
+# SAM 2 Hiera Large Configuration
+model:
+  _target_: sam2.modeling.sam2_base.SAM2Base
+  image_encoder:
+    _target_: sam2.modeling.backbones.image_encoder.ImageEncoder
+    trunk:
+      _target_: sam2.modeling.backbones.hieradet.Hiera
+      embed_dim: 144
+      num_heads: 2
+      stages: [2, 6, 36, 4]
+      global_att_blocks: [23, 33, 43]
+      window_pos_embed_bkg_spatial_size: [7, 7]
+      window_spec: [8, 4, 16, 8]
+    neck:
+      _target_: sam2.modeling.backbones.image_encoder.FpnNeck
+      position_encoding:
+        _target_: sam2.modeling.position_encoding.PositionEmbeddingSine
+        num_pos_feats: 256
+        normalize: true
+        scale: null
+        temperature: 10000
+      d_model: 256
+      backbone_channel_list: [1152, 576, 288, 144]
+      fpn_top_down_levels: [2, 3]
+      fpn_interp_model: nearest
+  memory_attention:
+    _target_: sam2.modeling.memory_attention.MemoryAttention
+    d_model: 256
+    pos_enc_at_input: true
+    layer:
+      _target_: sam2.modeling.memory_attention.MemoryAttentionLayer
+      activation: relu
+      dim_feedforward: 2048
+      dropout: 0.1
+      pos_enc_at_attn: false
+      self_attention:
+        _target_: sam2.modeling.sam.transformer.RoPEAttention
+        rope_theta: 10000.0
+        feat_sizes: [32, 32]
+        embedding_dim: 256
+        num_heads: 1
+        downsample_rate: 1
+        dropout: 0.1
+      d_model: 256
+      pos_enc_at_cross_attn_keys: true
+      pos_enc_at_cross_attn_queries: false
+      cross_attention:
+        _target_: sam2.modeling.sam.transformer.RoPEAttention
+        rope_theta: 10000.0
+        feat_sizes: [32, 32]
+        rope_k_repeat: True
+        embedding_dim: 256
+        num_heads: 1
+        downsample_rate: 1
+        dropout: 0.1
+        kv_in_dim: 64
+    num_layers: 4
+  memory_encoder:
+    _target_: sam2.modeling.memory_encoder.MemoryEncoder
+    out_dim: 64
+    position_encoding:
+      _target_: sam2.modeling.position_encoding.PositionEmbeddingSine
+      num_pos_feats: 64
+      normalize: true
+      scale: null
+      temperature: 10000
+    mask_downsampler:
+      _target_: sam2.modeling.memory_encoder.MaskDownSampler
+      kernel_size: 3
+      stride: 2
+      padding: 1
+    fuser:
+      _target_: sam2.modeling.memory_encoder.Fuser
+      layer:
+        _target_: sam2.modeling.memory_encoder.CXBlock
+        dim: 256
+        kernel_size: 7
+        padding: 3
+        layer_scale_init_value: 1e-6
+        use_dwconv: True
+      num_layers: 2
+  num_maskmem: 7
+  image_size: 1024
+  sigmoid_scale_for_mem_enc: 20.0
+  sigmoid_bias_for_mem_enc: -10.0
+  use_mask_input_as_output_without_sam: true
+  directly_add_no_mem_embed: true
+  use_high_res_features_in_sam: true
+  multimask_output_in_sam: true
+  multimask_min_pt_num: 0
+  multimask_max_pt_num: 1
+  multimask_output_for_tracking: true
+  use_multimask_token_for_obj_ptr: true
+  iou_prediction_use_sigmoid: True
+  memory_temporal_stride_for_eval: 1
+  non_overlap_masks_for_mem_enc: true
+  use_obj_ptrs_in_encoder: true
+  max_obj_ptrs_in_encoder: 16
+  add_tpos_enc_to_obj_ptrs: false
+  proj_tpos_enc_in_obj_ptrs: false
+  use_signed_tpos_enc_to_obj_ptrs: false
+  only_obj_ptrs_in_the_past_for_eval: true
+  pred_obj_scores: true
+  pred_obj_scores_mlp: true
+  fixed_no_obj_ptr: true
+  soft_no_obj_ptr: false
+  use_mlp_for_obj_ptr_proj: true
+  no_obj_embed_spatial: true
+  sam_mask_decoder_extra_args:
+    dynamic_multimask_via_stability: true
+    dynamic_multimask_stability_delta: 0.05
+    dynamic_multimask_stability_thresh: 0.98
+    pred_obj_scores: true
+    pred_obj_scores_mlp: true
+    use_multimask_token_for_obj_ptr: true
+  compile_image_encoder: False
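
Several values in this configuration are tied together, and a quick check can catch a mismatched edit. Below is a minimal sketch, assuming the trunk doubles its channel width at every stage (which matches the values listed here) and that `global_att_blocks` indexes into the flattened list of blocks. The helper names are hypothetical and not part of SAM2.

```python
def stage_channels(embed_dim, num_stages):
    """Channel width per stage, deepest stage first, assuming width doubles each stage."""
    return [embed_dim * 2 ** i for i in reversed(range(num_stages))]


def check_hiera_config(embed_dim, stages, global_att_blocks, backbone_channel_list):
    """Return a list of human-readable problems; an empty list means the values agree."""
    problems = []
    expected = stage_channels(embed_dim, len(stages))
    if backbone_channel_list != expected:
        problems.append(f"backbone_channel_list should be {expected}")
    total_blocks = sum(stages)
    for idx in global_att_blocks:
        if not 0 <= idx < total_blocks:
            problems.append(f"global attention block {idx} is outside 0..{total_blocks - 1}")
    return problems


# Values copied from the configuration above
problems = check_hiera_config(
    embed_dim=144,
    stages=[2, 6, 36, 4],
    global_att_blocks=[23, 33, 43],
    backbone_channel_list=[1152, 576, 288, 144],
)
print(problems or "config values are consistent")
```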

sam2_wrapper.py ADDED Viewed

	@@ -0,0 +1,172 @@

+"""
+SAM2 Wrapper for Video Mask Tracking
+Handles mask generation and propagation through video
+"""
+import sys
+sys.path.append("/home/cvlab19/project/samuel/CVPR/sam2")
+import cv2
+import numpy as np
+import torch
+from PIL import Image
+from pathlib import Path
+from typing import List, Tuple
+import tempfile
+import shutil
+from sam2.build_sam import build_sam2_video_predictor
+class SAM2VideoTracker:
+    def __init__(self, checkpoint_path, config_file, device="cuda"):
+        """
+        Initialize SAM2 video tracker
+        Args:
+            checkpoint_path: Path to SAM2 checkpoint
+            config_file: Path to SAM2 config file
+            device: Device to run on
+        """
+        self.device = device
+        self.predictor = build_sam2_video_predictor(
+            config_file=config_file,
+            ckpt_path=checkpoint_path,
+            device=device
+        )
+        print(f"SAM2 video tracker initialized on {device}")
+    def track_video(self, frames: List[np.ndarray], points: List[List[int]],
+                   labels: List[int]) -> List[np.ndarray]:
+        """
+        Track object through video using SAM2
+        Args:
+            frames: List of numpy arrays, [(H,W,3)]*n, uint8 RGB frames
+            points: List of [x, y] coordinates for prompts
+            labels: List of labels (1 for positive, 0 for negative)
+        Returns:
+            masks: List of numpy arrays, [(H,W)]*n, uint8 masks with values 0 (background) or 255 (object)
+        """
+        # Create temporary directory for frames
+        temp_dir = Path(tempfile.mkdtemp())
+        frames_dir = temp_dir / "frames"
+        frames_dir.mkdir(exist_ok=True)
+        try:
+            # Save frames to temp directory
+            print(f"Saving {len(frames)} frames to temporary directory...")
+            for i, frame in enumerate(frames):
+                frame_path = frames_dir / f"{i:05d}.jpg"
+                Image.fromarray(frame).save(frame_path, quality=95)
+            # Initialize SAM2 video predictor
+            print("Initializing SAM2 inference state...")
+            inference_state = self.predictor.init_state(video_path=str(frames_dir))
+            # Add prompts on first frame
+            points_array = np.array(points, dtype=np.float32)
+            labels_array = np.array(labels, dtype=np.int32)
+            print(f"Adding {len(points)} point prompts on first frame...")
+            _, out_obj_ids, out_mask_logits = self.predictor.add_new_points(
+                inference_state=inference_state,
+                frame_idx=0,
+                obj_id=1,
+                points=points_array,
+                labels=labels_array,
+            )
+            # Propagate through video
+            print("Propagating masks through video...")
+            masks = []
+            for frame_idx, object_ids, mask_logits in self.predictor.propagate_in_video(inference_state):
+                # Get mask for object ID 1
+                # object_ids can be a tensor or a list
+                obj_ids_list = object_ids.tolist() if hasattr(object_ids, 'tolist') else object_ids
+                if 1 in obj_ids_list:
+                    mask_idx = obj_ids_list.index(1)
+                    mask = (mask_logits[mask_idx] > 0.0).cpu().numpy()
+                    mask_uint8 = (mask.squeeze() * 255).astype(np.uint8)
+                    masks.append(mask_uint8)
+                else:
+                    # No mask for this frame, use empty mask
+                    h, w = frames[0].shape[:2]
+                    masks.append(np.zeros((h, w), dtype=np.uint8))
+            print(f"Generated {len(masks)} masks")
+            return masks
+        finally:
+            # Clean up temporary directory
+            shutil.rmtree(temp_dir, ignore_errors=True)
+    def get_first_frame_mask(self, frame: np.ndarray, points: List[List[int]],
+                            labels: List[int]) -> np.ndarray:
+        """
+        Get mask for first frame only (for preview)
+        Args:
+            frame: np.ndarray, (H, W, 3), uint8 RGB frame
+            points: List of [x, y] coordinates
+            labels: List of labels (1 for positive, 0 for negative)
+        Returns:
+            mask: np.ndarray, (H, W), uint8 mask with values 0 (background) or 255 (object)
+        """
+        # Create temporary directory
+        temp_dir = Path(tempfile.mkdtemp())
+        frames_dir = temp_dir / "frames"
+        frames_dir.mkdir(exist_ok=True)
+        try:
+            # Save single frame
+            frame_path = frames_dir / "00000.jpg"
+            Image.fromarray(frame).save(frame_path, quality=95)
+            # Initialize SAM2
+            inference_state = self.predictor.init_state(video_path=str(frames_dir))
+            # Add prompts
+            points_array = np.array(points, dtype=np.float32)
+            labels_array = np.array(labels, dtype=np.int32)
+            _, out_obj_ids, out_mask_logits = self.predictor.add_new_points(
+                inference_state=inference_state,
+                frame_idx=0,
+                obj_id=1,
+                points=points_array,
+                labels=labels_array,
+            )
+            # Get mask
+            if len(out_mask_logits) > 0:
+                mask = (out_mask_logits[0] > 0.0).cpu().numpy()
+                mask_uint8 = (mask.squeeze() * 255).astype(np.uint8)
+                return mask_uint8
+            else:
+                return np.zeros(frame.shape[:2], dtype=np.uint8)
+        finally:
+            shutil.rmtree(temp_dir, ignore_errors=True)
+def load_sam2_tracker(device="cuda"):
+    """
+    Load SAM2 video tracker with pretrained weights
+    Args:
+        device: Device to run on
+    Returns:
+        SAM2VideoTracker instance
+    """
+    checkpoint_path = "checkpoints/sam2/sam2.1_hiera_large.pt"
+    config_file = "configs/sam2.1/sam2.1_hiera_l.yaml"
+    print(f"Loading SAM2 from {checkpoint_path}...")
+    tracker = SAM2VideoTracker(checkpoint_path, config_file, device)
+    return tracker
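
Both `track_video` and `get_first_frame_mask` repeat the same steps: create a temporary directory, write zero-padded frame files, and remove the directory in a `finally` block. One way to factor that out is a context manager. This is a sketch using only the standard library. `temp_frames_dir` is a hypothetical helper, and it takes already-encoded frame bytes rather than numpy arrays so that it stays self-contained.

```python
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def temp_frames_dir(encoded_frames):
    """Write frames as 00000.jpg, 00001.jpg, ... and clean up afterwards.

    `encoded_frames` is an iterable of bytes objects (already-encoded images).
    """
    temp_dir = Path(tempfile.mkdtemp())
    frames_dir = temp_dir / "frames"
    frames_dir.mkdir(exist_ok=True)
    try:
        for i, data in enumerate(encoded_frames):
            # Zero-padded names keep lexicographic order equal to frame order
            (frames_dir / f"{i:05d}.jpg").write_bytes(data)
        yield frames_dir
    finally:
        # Runs on normal exit and on exceptions raised inside the with-block
        shutil.rmtree(temp_dir, ignore_errors=True)


# Placeholder bytes stand in for real JPEG data
with temp_frames_dir([b"frame-a", b"frame-b", b"frame-c"]) as frames_dir:
    names = sorted(p.name for p in frames_dir.iterdir())
    print(names)
```

In the wrapper, the body of the `with` block would call `self.predictor.init_state(video_path=str(frames_dir))`, as the existing methods do.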

sam2_wrapper_hf.py ADDED Viewed

	@@ -0,0 +1,196 @@

+"""
+SAM2 Wrapper for Video Mask Tracking - Hugging Face Space Version
+Handles mask generation and propagation through video
+"""
+import sys
+import os
+from pathlib import Path
+# Add SAM2 to path if installed
+try:
+    import sam2
+except ImportError:
+    # Try to add from common locations
+    possible_paths = [
+        "/home/cvlab19/project/samuel/CVPR/sam2",
+        "./sam2"
+    ]
+    for path in possible_paths:
+        if os.path.exists(path):
+            sys.path.append(path)
+            break
+import cv2
+import numpy as np
+import torch
+from PIL import Image
+from typing import List, Tuple
+import tempfile
+import shutil
+from sam2.build_sam import build_sam2_video_predictor
+class SAM2VideoTracker:
+    def __init__(self, checkpoint_path, config_file, device="cuda"):
+        """
+        Initialize SAM2 video tracker
+        Args:
+            checkpoint_path: Path to SAM2 checkpoint
+            config_file: Path to SAM2 config file
+            device: Device to run on
+        """
+        self.device = device
+        self.predictor = build_sam2_video_predictor(
+            config_file=config_file,
+            ckpt_path=checkpoint_path,
+            device=device
+        )
+        print(f"SAM2 video tracker initialized on {device}")
+    def track_video(self, frames: List[np.ndarray], points: List[List[int]],
+                   labels: List[int]) -> List[np.ndarray]:
+        """
+        Track object through video using SAM2
+        Args:
+            frames: List of numpy arrays, [(H,W,3)]*n, uint8 RGB frames
+            points: List of [x, y] coordinates for prompts
+            labels: List of labels (1 for positive, 0 for negative)
+        Returns:
+            masks: List of numpy arrays, [(H,W)]*n, uint8 masks with values 0 (background) or 255 (object)
+        """
+        # Create temporary directory for frames
+        temp_dir = Path(tempfile.mkdtemp())
+        frames_dir = temp_dir / "frames"
+        frames_dir.mkdir(exist_ok=True)
+        try:
+            # Save frames to temp directory
+            print(f"Saving {len(frames)} frames to temporary directory...")
+            for i, frame in enumerate(frames):
+                frame_path = frames_dir / f"{i:05d}.jpg"
+                Image.fromarray(frame).save(frame_path, quality=95)
+            # Initialize SAM2 video predictor
+            print("Initializing SAM2 inference state...")
+            inference_state = self.predictor.init_state(video_path=str(frames_dir))
+            # Add prompts on first frame
+            points_array = np.array(points, dtype=np.float32)
+            labels_array = np.array(labels, dtype=np.int32)
+            print(f"Adding {len(points)} point prompts on first frame...")
+            _, out_obj_ids, out_mask_logits = self.predictor.add_new_points(
+                inference_state=inference_state,
+                frame_idx=0,
+                obj_id=1,
+                points=points_array,
+                labels=labels_array,
+            )
+            # Propagate through video
+            print("Propagating masks through video...")
+            masks = []
+            for frame_idx, object_ids, mask_logits in self.predictor.propagate_in_video(inference_state):
+                # Get mask for object ID 1
+                obj_ids_list = object_ids.tolist() if hasattr(object_ids, 'tolist') else object_ids
+                if 1 in obj_ids_list:
+                    mask_idx = obj_ids_list.index(1)
+                    mask = (mask_logits[mask_idx] > 0.0).cpu().numpy()
+                    mask_uint8 = (mask.squeeze() * 255).astype(np.uint8)
+                    masks.append(mask_uint8)
+                else:
+                    # No mask for this frame, use empty mask
+                    h, w = frames[0].shape[:2]
+                    masks.append(np.zeros((h, w), dtype=np.uint8))
+            print(f"Generated {len(masks)} masks")
+            return masks
+        finally:
+            # Clean up temporary directory
+            shutil.rmtree(temp_dir, ignore_errors=True)
+    def get_first_frame_mask(self, frame: np.ndarray, points: List[List[int]],
+                            labels: List[int]) -> np.ndarray:
+        """
+        Get mask for first frame only (for preview)
+        Args:
+            frame: np.ndarray, (H, W, 3), uint8 RGB frame
+            points: List of [x, y] coordinates
+            labels: List of labels (1 for positive, 0 for negative)
+        Returns:
+            mask: np.ndarray, (H, W), uint8 mask with values 0 (background) or 255 (object)
+        """
+        # Create temporary directory
+        temp_dir = Path(tempfile.mkdtemp())
+        frames_dir = temp_dir / "frames"
+        frames_dir.mkdir(exist_ok=True)
+        try:
+            # Save single frame
+            frame_path = frames_dir / "00000.jpg"
+            Image.fromarray(frame).save(frame_path, quality=95)
+            # Initialize SAM2
+            inference_state = self.predictor.init_state(video_path=str(frames_dir))
+            # Add prompts
+            points_array = np.array(points, dtype=np.float32)
+            labels_array = np.array(labels, dtype=np.int32)
+            _, out_obj_ids, out_mask_logits = self.predictor.add_new_points(
+                inference_state=inference_state,
+                frame_idx=0,
+                obj_id=1,
+                points=points_array,
+                labels=labels_array,
+            )
+            # Get mask
+            if len(out_mask_logits) > 0:
+                mask = (out_mask_logits[0] > 0.0).cpu().numpy()
+                mask_uint8 = (mask.squeeze() * 255).astype(np.uint8)
+                return mask_uint8
+            else:
+                return np.zeros(frame.shape[:2], dtype=np.uint8)
+        finally:
+            shutil.rmtree(temp_dir, ignore_errors=True)
+def load_sam2_tracker(checkpoint_path=None, device="cuda"):
+    """
+    Load SAM2 video tracker with pretrained weights
+    Args:
+        checkpoint_path: Path to SAM2 checkpoint (if None, uses default location)
+        device: Device to run on
+    Returns:
+        SAM2VideoTracker instance
+    """
+    # Use provided path or default
+    if checkpoint_path is None:
+        checkpoint_path = "checkpoints/sam2.1_hiera_large.pt"
+    # Config file should be in the SAM2 repo
+    config_file = "configs/sam2.1/sam2.1_hiera_l.yaml"
+    # Check if we need to use the local yaml file
+    if not os.path.exists(config_file):
+        config_file = "sam2_hiera_l.yaml"
+    print(f"Loading SAM2 from {checkpoint_path}...")
+    print(f"Using config: {config_file}")
+    tracker = SAM2VideoTracker(checkpoint_path, config_file, device)
+    return tracker
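
The config fallback in `load_sam2_tracker` picks a path based on what exists on disk. The same idea generalizes to an ordered list of candidates. Below is a minimal sketch with `first_existing` as a hypothetical helper name. It only checks the filesystem and makes no assumption about how SAM2 itself resolves config names.

```python
import os
import tempfile


def first_existing(candidates, default=None):
    """Return the first candidate path that exists on disk, else `default`."""
    for path in candidates:
        if os.path.exists(path):
            return path
    return default


# Demonstration with a temporary file standing in for the local YAML
with tempfile.TemporaryDirectory() as tmp:
    local_yaml = os.path.join(tmp, "sam2_hiera_l.yaml")
    with open(local_yaml, "w") as fh:
        fh.write("model: {}\n")
    missing = os.path.join(tmp, "configs", "sam2.1", "sam2.1_hiera_l.yaml")
    chosen = first_existing([missing, local_yaml])
    print(os.path.basename(chosen))
```

Returning an explicit `default` (or `None`) lets the caller fail early with a clear message instead of passing a non-existent path on to the model builder.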

tools/__init__.py ADDED Viewed

	@@ -0,0 +1 @@


+# Tools module

tools/base_segmenter.py ADDED Viewed

	@@ -0,0 +1,68 @@

+"""
+SAM2 Base Segmenter
+Adapted from MatAnyone demo
+"""
+import sys
+sys.path.append("/home/cvlab19/project/samuel/CVPR/sam2")
+import torch
+import numpy as np
+from sam2.build_sam import build_sam2_video_predictor
+class BaseSegmenter:
+    def __init__(self, SAM_checkpoint, model_type, device):
+        """
+        Initialize SAM2 segmenter
+        Args:
+            SAM_checkpoint: Path to SAM2 checkpoint
+            model_type: SAM2 model config file
+            device: Device to run on
+        """
+        self.device = device
+        self.model_type = model_type
+        # Build SAM2 video predictor
+        self.sam_predictor = build_sam2_video_predictor(
+            config_file=model_type,
+            ckpt_path=SAM_checkpoint,
+            device=device
+        )
+        self.orignal_image = None
+        self.inference_state = None
+    def set_image(self, image: np.ndarray):
+        """Set the current image for segmentation"""
+        self.orignal_image = image
+    def reset_image(self):
+        """Reset the current image"""
+        self.orignal_image = None
+        self.inference_state = None
+    def predict(self, prompts, prompt_type, multimask=True):
+        """
+        Predict mask from prompts
+        Args:
+            prompts: Dictionary with point_coords, point_labels, mask_input
+            prompt_type: 'point' or 'both'
+            multimask: Whether to return multiple masks
+        Returns:
+            masks, scores, logits
+        """
+        # Placeholder: SAM2 prediction through the video predictor is not wired in yet.
+        # Returns an all-zero mask, a score of 1.0, and all-zero logits in the documented shapes.
+        if self.orignal_image is None:
+            raise ValueError("Call set_image() before predict()")
+        h, w = self.orignal_image.shape[:2]
+        dummy_mask = np.zeros((h, w), dtype=bool)
+        dummy_score = np.array([1.0])
+        dummy_logit = np.zeros((h, w), dtype=np.float32)
+        return np.array([dummy_mask]), dummy_score, np.array([dummy_logit])

tools/interact_tools.py ADDED Viewed

	@@ -0,0 +1,121 @@

+"""
+SAM2 Interaction Tools
+Handles SAM2 mask generation with user clicks
+"""
+import sys
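+# NOTE: machine-specific path to a local SAM2 checkout; it has no effect where this directory is absent
+sys.path.append("/home/cvlab19/project/samuel/CVPR/sam2")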
+import numpy as np
+from PIL import Image
+from .base_segmenter import BaseSegmenter
+from .painter import mask_painter, point_painter
+mask_color = 3
+mask_alpha = 0.7
+contour_color = 1
+contour_width = 5
+point_color_ne = 8  # palette ID for positive points (8: orange)
+point_color_ps = 50 # palette ID for negative points (50 % 10 = 0: black)
+point_alpha = 0.9
+point_radius = 15
+class SamControler:
+    def __init__(self, SAM_checkpoint, model_type, device):
+        """
+        Initialize SAM controller
+        Args:
+            SAM_checkpoint: Path to SAM2 checkpoint
+            model_type: SAM2 model config file
+            device: Device to run on
+        """
+        self.sam_controler = BaseSegmenter(SAM_checkpoint, model_type, device)
+        self.device = device
+    def first_frame_click(self, image: np.ndarray, points: np.ndarray,
+                         labels: np.ndarray, multimask=True, mask_color=3):
+        """
+        Generate mask from clicks on first frame
+        Args:
+            image: np.ndarray, (H, W, 3), RGB image
+            points: np.ndarray, (N, 2), [x, y] coordinates
+            labels: np.ndarray, (N,), 1 for positive, 0 for negative
+            multimask: bool, whether to generate multiple masks
+            mask_color: int, color ID for mask overlay
+        Returns:
+            mask: np.ndarray, (H, W), binary mask
+            logit: np.ndarray, (H, W), mask logits
+            painted_image: PIL.Image, visualization with mask and points
+        """
+        # Branch on the label of the most recent click
+        neg_flag = labels[-1]
+        if neg_flag == 1:  # Last click is positive: predict, then refine with the logits
+            # First pass with points only
+            prompts = {
+                'point_coords': points,
+                'point_labels': labels,
+            }
+            masks, scores, logits = self.sam_controler.predict(prompts, 'point', multimask)
+            mask, logit = masks[np.argmax(scores)], logits[np.argmax(scores), :, :]
+            # Refine with mask input
+            prompts = {
+                'point_coords': points,
+                'point_labels': labels,
+                'mask_input': logit[None, :, :]
+            }
+            masks, scores, logits = self.sam_controler.predict(prompts, 'both', multimask)
+            mask, logit = masks[np.argmax(scores)], logits[np.argmax(scores), :, :]
+        else:  # Last click is negative: single pass with points only
+            prompts = {
+                'point_coords': points,
+                'point_labels': labels,
+            }
+            masks, scores, logits = self.sam_controler.predict(prompts, 'point', multimask)
+            mask, logit = masks[np.argmax(scores)], logits[np.argmax(scores), :, :]
+        # Paint mask on image
+        painted_image = mask_painter(
+            image,
+            mask.astype('uint8'),
+            mask_color,
+            mask_alpha,
+            contour_color,
+            contour_width
+        )
+        # Paint positive points (label > 0)
+        positive_points = np.squeeze(points[np.argwhere(labels > 0)], axis=1)
+        if len(positive_points) > 0:
+            painted_image = point_painter(
+                painted_image,
+                positive_points,
+                point_color_ne,
+                point_alpha,
+                point_radius,
+                contour_color,
+                contour_width
+            )
+        # Paint negative points (label < 1)
+        negative_points = np.squeeze(points[np.argwhere(labels < 1)], axis=1)
+        if len(negative_points) > 0:
+            painted_image = point_painter(
+                painted_image,
+                negative_points,
+                point_color_ps,
+                point_alpha,
+                point_radius,
+                contour_color,
+                contour_width
+            )
+        painted_image = Image.fromarray(painted_image)
+        return mask, logit, painted_image
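
Two small steps in `first_frame_click` can be isolated: choosing the highest-scoring mask and splitting the clicks by label. The sketch below shows both with plain NumPy boolean indexing, which yields the same `(K, 2)` arrays as the `np.argwhere` plus `np.squeeze` form. The helper names `split_points_by_label` and `pick_best_mask` are hypothetical and not part of this commit.

```python
import numpy as np


def split_points_by_label(points, labels):
    """Split (N, 2) click coordinates into positive and negative groups."""
    points = np.asarray(points).reshape(-1, 2)
    labels = np.asarray(labels).reshape(-1)
    # Same thresholds as the original: label > 0 is positive, label < 1 is negative.
    return points[labels > 0], points[labels < 1]


def pick_best_mask(masks, scores, logits):
    """Return the mask and logit map with the highest score."""
    best = int(np.argmax(scores))
    return masks[best], logits[best]


points = np.array([[10, 20], [30, 40], [50, 60]])
labels = np.array([1, 0, 1])
positive, negative = split_points_by_label(points, labels)
print(positive.tolist(), negative.tolist())

masks = np.zeros((3, 4, 4), dtype=bool)
masks[1, :2, :2] = True
scores = np.array([0.2, 0.9, 0.5])
logits = np.zeros((3, 4, 4), dtype=np.float32)
mask, logit = pick_best_mask(masks, scores, logits)
print(int(mask.sum()))
```

Computing the `argmax` index once also avoids evaluating it twice per call, as the original expression does.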

tools/painter.py ADDED Viewed

	@@ -0,0 +1,126 @@

+"""
+Mask and point painting utilities
+Adapted from MatAnyone demo
+"""
+import cv2
+import numpy as np
+from PIL import Image
+def mask_painter(input_image, input_mask, mask_color=5, mask_alpha=0.7,
+                 contour_color=1, contour_width=5):
+    """
+    Paint mask on image with transparency
+    Args:
+        input_image: np.ndarray, (H, W, 3)
+        input_mask: np.ndarray, (H, W), binary mask
+        mask_color: int, color ID for mask
+        mask_alpha: float, transparency
+        contour_color: int, color ID for contour
+        contour_width: int, width of contour
+    Returns:
+        painted_image: np.ndarray, (H, W, 3)
+    """
+    assert input_image.shape[:2] == input_mask.shape, "Image and mask must have same dimensions"
+    # Color palette
+    palette = np.array([
+        [0, 0, 0],        # 0: black
+        [255, 0, 0],      # 1: red
+        [0, 255, 0],      # 2: green
+        [0, 0, 255],      # 3: blue
+        [255, 255, 0],    # 4: yellow
+        [255, 0, 255],    # 5: magenta
+        [0, 255, 255],    # 6: cyan
+        [128, 128, 128],  # 7: gray
+        [255, 165, 0],    # 8: orange
+        [128, 0, 128],    # 9: purple
+    ])
+    mask_color_rgb = palette[mask_color % len(palette)]
+    contour_color_rgb = palette[contour_color % len(palette)]
+    # Create colored mask
+    painted_image = input_image.copy()
+    colored_mask = np.zeros_like(input_image)
+    colored_mask[input_mask > 0] = mask_color_rgb
+    # Blend with alpha
+    mask_region = input_mask > 0
+    painted_image[mask_region] = (
+        painted_image[mask_region] * (1 - mask_alpha) +
+        colored_mask[mask_region] * mask_alpha
+    ).astype(np.uint8)
+    # Draw contour
+    if contour_width > 0:
+        contours, _ = cv2.findContours(
+            input_mask.astype(np.uint8),
+            cv2.RETR_EXTERNAL,
+            cv2.CHAIN_APPROX_SIMPLE
+        )
+        cv2.drawContours(
+            painted_image,
+            contours,
+            -1,
+            contour_color_rgb.tolist(),
+            contour_width
+        )
+    return painted_image
+def point_painter(input_image, input_points, point_color=8, point_alpha=0.9,
+                  point_radius=15, contour_color=2, contour_width=3):
+    """
+    Paint points on image
+    Args:
+        input_image: np.ndarray, (H, W, 3)
+        input_points: np.ndarray, (N, 2), [x, y] coordinates
+        point_color: int, color ID for points
+        point_alpha: float, transparency
+        point_radius: int, radius of point circles
+        contour_color: int, color ID for contour
+        contour_width: int, width of contour
+    Returns:
+        painted_image: np.ndarray, (H, W, 3)
+    """
+    if len(input_points) == 0:
+        return input_image
+    palette = np.array([
+        [0, 0, 0],        # 0: black
+        [255, 0, 0],      # 1: red
+        [0, 255, 0],      # 2: green
+        [0, 0, 255],      # 3: blue
+        [255, 255, 0],    # 4: yellow
+        [255, 0, 255],    # 5: magenta
+        [0, 255, 255],    # 6: cyan
+        [128, 128, 128],  # 7: gray
+        [255, 165, 0],    # 8: orange
+        [128, 0, 128],    # 9: purple
+    ])
+    point_color_rgb = palette[point_color % len(palette)]
+    contour_color_rgb = palette[contour_color % len(palette)]
+    painted_image = input_image.copy()
+    for point in input_points:
+        x, y = int(point[0]), int(point[1])
+        # Draw filled circle with alpha blending
+        overlay = painted_image.copy()
+        cv2.circle(overlay, (x, y), point_radius, point_color_rgb.tolist(), -1)
+        cv2.addWeighted(overlay, point_alpha, painted_image, 1 - point_alpha, 0, painted_image)
+        # Draw contour
+        if contour_width > 0:
+            cv2.circle(painted_image, (x, y), point_radius, contour_color_rgb.tolist(), contour_width)
+    return painted_image
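
The blending step in `mask_painter` is an ordinary per-pixel alpha blend restricted to the masked region. A minimal NumPy-only sketch of that step is given here. `blend_mask` is a hypothetical helper, and unlike the original it rounds to the nearest integer instead of truncating on the cast to `uint8`.

```python
import numpy as np


def blend_mask(image, mask, color, alpha):
    """Alpha-blend `color` into `image` wherever `mask` is nonzero."""
    out = image.astype(np.float32)
    region = mask > 0
    # out = image * (1 - alpha) + color * alpha, applied only inside the mask
    out[region] = out[region] * (1.0 - alpha) + np.asarray(color, dtype=np.float32) * alpha
    return np.clip(np.rint(out), 0, 255).astype(np.uint8)


image = np.full((2, 2, 3), 100, dtype=np.uint8)
mask = np.array([[1, 0], [0, 0]], dtype=np.uint8)
blended = blend_mask(image, mask, color=(0, 0, 200), alpha=0.5)
print(blended[0, 0].tolist(), blended[1, 1].tolist())
```

Pixels outside the mask are returned unchanged, so the function can be applied repeatedly for several masks.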

videomama_wrapper.py ADDED Viewed

	@@ -0,0 +1,88 @@

+"""
+VideoMaMa Inference Wrapper
+Handles video matting with mask conditioning
+"""
+import sys
+sys.path.append("../")
+sys.path.append("../../")
+import torch
+import numpy as np
+from PIL import Image
+from pathlib import Path
+from typing import List
+import tqdm
+from pipeline_svd_mask import VideoInferencePipeline
+def videomama(pipeline, frames_np, mask_frames_np):
+    """
+    Run VideoMaMa inference on video frames with mask conditioning
+    Args:
+        pipeline: VideoInferencePipeline instance
+        frames_np: List of numpy arrays, [(H,W,3)]*n, uint8 RGB frames
+        mask_frames_np: List of numpy arrays, [(H,W)]*n, uint8 grayscale masks
+    Returns:
+        output_frames: List of numpy arrays, [(H,W,3)]*n, uint8 RGB outputs
+    """
+    # Convert numpy arrays to PIL Images
+    frames_pil = [Image.fromarray(f) for f in frames_np]
+    mask_frames_pil = [Image.fromarray(m, mode='L') for m in mask_frames_np]
+    # Resize to model input size
+    target_width, target_height = 1024, 576
+    frames_resized = [f.resize((target_width, target_height), Image.Resampling.BILINEAR)
+                     for f in frames_pil]
+    masks_resized = [m.resize((target_width, target_height), Image.Resampling.BILINEAR)
+                    for m in mask_frames_pil]
+    # Run inference
+    print(f"Running VideoMaMa inference on {len(frames_resized)} frames...")
+    output_frames_pil = pipeline.run(
+        cond_frames=frames_resized,
+        mask_frames=masks_resized,
+        seed=42,
+        mask_cond_mode="vae"
+    )
+    # Resize back to original resolution
+    original_size = frames_pil[0].size
+    output_frames_resized = [f.resize(original_size, Image.Resampling.BILINEAR)
+                            for f in output_frames_pil]
+    # Convert back to numpy arrays
+    output_frames_np = [np.array(f) for f in output_frames_resized]
+    return output_frames_np
+def load_videomama_pipeline(device="cuda"):
+    """
+    Load VideoMaMa pipeline with pretrained weights
+    Args:
+        device: Device to run on
+    Returns:
+        VideoInferencePipeline instance
+    """
+    # Local paths for testing
+    base_model_path = "checkpoints/stable-video-diffusion-img2vid-xt"
+    unet_checkpoint_path = "checkpoints/VideoMaMa"
+    print(f"Loading VideoMaMa pipeline from {unet_checkpoint_path}...")
+    pipeline = VideoInferencePipeline(
+        base_model_path=base_model_path,
+        unet_checkpoint_path=unet_checkpoint_path,
+        weight_dtype=torch.float16,
+        device=device
+    )
+    print("VideoMaMa pipeline loaded successfully!")
+    return pipeline
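
`videomama` takes the output size from the first frame only and assumes the frame and mask lists line up as described in its docstring. A caller could check those assumptions up front. The sketch below is one way to do that; `validate_frames_and_masks` is a hypothetical helper that is not part of this commit.

```python
import numpy as np


def validate_frames_and_masks(frames_np, mask_frames_np):
    """Raise ValueError if frames and masks do not match the documented layout."""
    if len(frames_np) == 0:
        raise ValueError("expected at least one frame")
    if len(frames_np) != len(mask_frames_np):
        raise ValueError(
            f"got {len(frames_np)} frames but {len(mask_frames_np)} masks"
        )
    for i, (frame, mask) in enumerate(zip(frames_np, mask_frames_np)):
        if frame.ndim != 3 or frame.shape[2] != 3:
            raise ValueError(f"frame {i} has shape {frame.shape}, expected (H, W, 3)")
        if mask.shape != frame.shape[:2]:
            raise ValueError(f"mask {i} has shape {mask.shape}, expected {frame.shape[:2]}")
        if frame.dtype != np.uint8 or mask.dtype != np.uint8:
            raise ValueError(f"frame {i} and its mask must both be uint8")


frames = [np.zeros((4, 6, 3), dtype=np.uint8) for _ in range(2)]
masks = [np.zeros((4, 6), dtype=np.uint8) for _ in range(2)]
validate_frames_and_masks(frames, masks)  # passes silently

try:
    validate_frames_and_masks(frames, masks[:1])
except ValueError as exc:
    print(exc)
```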

videomama_wrapper_hf.py ADDED Viewed

	@@ -0,0 +1,110 @@

+"""
+VideoMaMa Inference Wrapper - Hugging Face Space Version
+Handles video matting with mask conditioning
+"""
+import sys
+import os
+from pathlib import Path
+# Add parent directories to path for imports
+sys.path.append(str(Path(__file__).parent))
+sys.path.append(str(Path(__file__).parent.parent))
+import torch
+import numpy as np
+from PIL import Image
+from typing import List
+from pipeline_svd_mask import VideoInferencePipeline
+def videomama(pipeline, frames_np, mask_frames_np):
+    """
+    Run VideoMaMa inference on video frames with mask conditioning
+    Args:
+        pipeline: VideoInferencePipeline instance
+        frames_np: List of numpy arrays, [(H,W,3)]*n, uint8 RGB frames
+        mask_frames_np: List of numpy arrays, [(H,W)]*n, uint8 grayscale masks
+    Returns:
+        output_frames: List of numpy arrays, [(H,W,3)]*n, uint8 RGB outputs
+    """
+    # Convert numpy arrays to PIL Images
+    frames_pil = [Image.fromarray(f) for f in frames_np]
+    mask_frames_pil = [Image.fromarray(m, mode='L') for m in mask_frames_np]
+    # Resize to model input size
+    target_width, target_height = 1024, 576
+    frames_resized = [f.resize((target_width, target_height), Image.Resampling.BILINEAR)
+                     for f in frames_pil]
+    masks_resized = [m.resize((target_width, target_height), Image.Resampling.BILINEAR)
+                    for m in mask_frames_pil]
+    # Run inference
+    print(f"Running VideoMaMa inference on {len(frames_resized)} frames...")
+    output_frames_pil = pipeline.run(
+        cond_frames=frames_resized,
+        mask_frames=masks_resized,
+        seed=42,
+        mask_cond_mode="vae"
+    )
+    # Resize back to original resolution
+    original_size = frames_pil[0].size
+    output_frames_resized = [f.resize(original_size, Image.Resampling.BILINEAR)
+                            for f in output_frames_pil]
+    # Convert back to numpy arrays
+    output_frames_np = [np.array(f) for f in output_frames_resized]
+    return output_frames_np
+def load_videomama_pipeline(base_model_path=None, unet_checkpoint_path=None, device="cuda"):
+    """
+    Load VideoMaMa pipeline with pretrained weights
+    Args:
+        base_model_path: Path to SVD base model (if None, uses default)
+        unet_checkpoint_path: Path to VideoMaMa UNet checkpoint (if None, uses default)
+        device: Device to run on
+    Returns:
+        VideoInferencePipeline instance
+    """
+    # Use provided paths or defaults
+    if base_model_path is None:
+        base_model_path = "checkpoints/stable-video-diffusion-img2vid-xt"
+    if unet_checkpoint_path is None:
+        unet_checkpoint_path = "checkpoints/videomama"
+    # Check if paths exist
+    if not os.path.exists(base_model_path):
+        raise FileNotFoundError(
+            f"SVD base model not found at {base_model_path}. "
+            f"Please ensure models are downloaded correctly."
+        )
+    if not os.path.exists(unet_checkpoint_path):
+        raise FileNotFoundError(
+            f"VideoMaMa checkpoint not found at {unet_checkpoint_path}. "
+            f"Please upload your VideoMaMa model to Hugging Face Hub and update the download logic."
+        )
+    print("Loading VideoMaMa pipeline...")
+    print(f"  Base model: {base_model_path}")
+    print(f"  UNet checkpoint: {unet_checkpoint_path}")
+    pipeline = VideoInferencePipeline(
+        base_model_path=base_model_path,
+        unet_checkpoint_path=unet_checkpoint_path,
+        weight_dtype=torch.float16,
+        device=device
+    )
+    print("VideoMaMa pipeline loaded successfully!")
+    return pipeline
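
`load_videomama_pipeline` checks the two checkpoint paths one after the other, so a user with both missing sees only the first error. A possible variation, sketched here under the assumption that reporting everything at once is preferable, collects all missing paths first. `find_missing_paths` and `require_paths` are hypothetical names that do not appear in this commit.

```python
import os
import tempfile


def find_missing_paths(named_paths):
    """Return the (name, path) pairs whose path does not exist."""
    return [(name, path) for name, path in named_paths.items() if not os.path.exists(path)]


def require_paths(named_paths):
    """Raise one FileNotFoundError listing every missing path."""
    missing_items = find_missing_paths(named_paths)
    if missing_items:
        details = "; ".join(f"{name} not found at {path}" for name, path in missing_items)
        raise FileNotFoundError(details)


with tempfile.TemporaryDirectory() as tmp:
    present = os.path.join(tmp, "stable-video-diffusion-img2vid-xt")
    os.makedirs(present)
    absent = os.path.join(tmp, "videomama")  # deliberately not created
    missing = find_missing_paths({"SVD base model": present, "VideoMaMa checkpoint": absent})
    print([name for name, _ in missing])
```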