MONAD - Interactive Segmentation & Tracking for Robotics

Part of the ANIMA Perception Suite by Robot Flow Labs / AIFLOW LABS LIMITED.

Overview

MONAD is an interactive video object segmentation and tracking module designed for robotics applications. It combines SAM (Segment Anything Model) for initial mask generation with XMem for long-term video object segmentation, providing a unified pipeline for real-time object tracking from a single point or box prompt.

Paper

IST-ROS: A flexible object segmentation and tracking framework for robotics applications. Published in SoftwareX, 2025.

The paper presents a ROS-integrated framework for interactive segmentation and tracking, enabling robots to segment and track arbitrary objects in real-time from minimal user input.

Architecture

MONAD implements a two-stage inference pipeline:

  1. SAM (Segment Anything Model) - Generates initial segmentation mask from point/box prompt

    • vit_h variant (2.4 GB, higher quality) or vit_b variant (358 MB, faster)
    • Produces binary mask from a single click or bounding box
  2. XMem (Video Object Segmentation) - Propagates mask across video frames

    • Long-term memory with sensory, working, and long-term stores
    • Handles occlusion, appearance changes, and fast motion
    • Configurable memory management for edge deployment
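The two-stage handoff can be sketched as follows. This is a minimal control-flow illustration only: `sam_segment`, `xmem_propagate`, and `track` are hypothetical stand-ins for the real models, showing the shape of the pipeline (one prompt in, one mask per frame out), not MONAD's actual code.

```python
# Control-flow sketch of the SAM -> XMem pipeline (stubbed, illustrative).

def sam_segment(frame, point):
    """Stage 1: produce a binary mask from a single point prompt."""
    h, w = len(frame), len(frame[0])
    # Real SAM predicts the whole object around the click; the stub
    # just marks the clicked pixel.
    mask = [[0] * w for _ in range(h)]
    mask[point[1]][point[0]] = 1
    return mask

def xmem_propagate(frames, initial_mask):
    """Stage 2: propagate the mask across all subsequent frames."""
    masks = [initial_mask]
    for _ in frames[1:]:
        # Real XMem consults its sensory/working/long-term memory here;
        # the stub simply carries the previous mask forward.
        masks.append(masks[-1])
    return masks

def track(frames, point):
    first_mask = sam_segment(frames[0], point)   # SAM runs once
    return xmem_propagate(frames, first_mask)    # XMem handles the rest

# Four dummy 4x4 frames, click at (x=1, y=2)
frames = [[[0] * 4 for _ in range(4)] for _ in range(4)]
masks = track(frames, (1, 2))
print(len(masks))  # one mask per frame -> 4
```

The key property the sketch preserves is that SAM is invoked only once per prompt; all subsequent frames are handled by the (much cheaper) propagation stage.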

Exported Formats

Format         File                              Size     Use Case
SafeTensors    pytorch/sam_vit_h_v1.safetensors  2564 MB  SAM vit_h - safe, fast loading
SafeTensors    pytorch/sam_vit_b_v1.safetensors  375 MB   SAM vit_b - edge deployment
SafeTensors    pytorch/xmem_v1.safetensors       249 MB   XMem tracker - safe format
TorchScript    pytorch/sam_vit_h_encoder.pt      2431 MB  SAM vit_h encoder - JIT compiled
TorchScript    pytorch/sam_vit_b_encoder.pt      342 MB   SAM vit_b encoder - JIT compiled
Frozen Bundle  pytorch/xmem_frozen.pt            238 MB   XMem - frozen inference bundle

GPU Benchmark Results (NVIDIA L4, 22GB)

Segmentation (Single Frame)

SAM Model  Resolution  Mean Latency  Throughput  Peak VRAM
vit_h      640x480     1125.6 ms     0.9 fps     5977 MB
vit_b      640x480     324.1 ms      3.1 fps     3008 MB
vit_b      1280x720    353.8 ms      2.8 fps     3007 MB

Tracking (24 Frames, SAM + XMem)

SAM Model  Resolution  Mean Latency  Throughput  Peak VRAM
vit_h      640x480     87.2 ms/f     1.9 fps     18.4 GB
vit_b      640x480     89.4 ms/f     4.8 fps     16.3 GB
vit_b      320x240     67.6 ms/f     5.3 fps     4.5 GB

vit_b at 640x480 is the recommended configuration for NVIDIA L4 GPUs. Tracking at 1280x720 runs out of memory on 22 GB cards; use a GPU with 24 GB or more for HD tracking.

Quick Start

import torch
from safetensors.torch import load_file

# Load SAM vit_b weights (recommended for edge)
sam_state = load_file("pytorch/sam_vit_b_v1.safetensors")

# Load XMem weights
xmem_state = load_file("pytorch/xmem_v1.safetensors")

Full Pipeline Usage

# Install MONAD
git clone https://github.com/RobotFlow-Labs/project_monad
cd project_monad
uv sync --extra gpu

# Download required base weights separately:
# - SAM vit_h: https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth
# - SAM vit_b: https://dl.fbaipublicfiles.com/segment_anything/sam_vit_b_01ec64.pth
# - XMem: https://github.com/hkchengrex/XMem/releases

# Run inference
CUDA_VISIBLE_DEVICES=0 uv run python scripts/run_baseline_inference.py \
  --config configs/gpu_server.toml \
  --input-path your_video.mp4 \
  --output-dir output/ \
  --max-frames 24 --point 320 240

# Start API server
uv run python -m monad.server

Hardware Requirements

Deployment        Min VRAM  Recommended VRAM  Max Frames
Edge (vit_b)      6 GB      8 GB              ~15 frames
Standard (vit_b)  16 GB     24 GB             ~30 frames
Full (vit_h)      18 GB     24 GB             ~30 frames
Extended          48 GB+    80 GB             100+ frames
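For deployment scripts, the table above can be encoded as a small lookup that picks the most capable tier fitting the available VRAM. This is a convenience sketch; the tier names and thresholds simply mirror the table.

```python
# Deployment tiers from the hardware table: (min VRAM in GB, tier name),
# ordered from most to least demanding.
TIERS = [
    (48, "Extended"),
    (18, "Full (vit_h)"),
    (16, "Standard (vit_b)"),
    (6, "Edge (vit_b)"),
]

def pick_tier(vram_gb):
    """Return the most capable tier that fits in the given VRAM."""
    for min_vram, name in TIERS:
        if vram_gb >= min_vram:
            return name
    return None  # below the 6 GB edge minimum

print(pick_tier(22))  # a 22 GB L4 -> "Full (vit_h)"
print(pick_tier(8))   # -> "Edge (vit_b)"
```

Note that a 22 GB L4 lands in the Full (vit_h) tier, which is consistent with the 18.4 GB peak VRAM measured for vit_h tracking in the benchmark above.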

CPU fallback is functional but ~5x slower than GPU inference.

Repository Contents

pytorch/                          - Model weights (SafeTensors + TorchScript)
configs/                          - Training & inference configurations
  paper_mvp.toml                  - Paper-faithful MVP config
  gpu_server.toml                 - GPU server optimized config
logs/                             - Validation results & benchmarks
  gpu_validation_results.json     - Full GPU benchmark data
inference_runs/                   - Sample inference outputs
  crossing_objects/               - Demo masks + overlay video
frozen_models_manifest.json       - Export manifest with checksums
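Since the export manifest ships checksums, downloaded weights can be verified before loading. The helper below is a sketch: it assumes SHA-256 and a manifest entry shape of `{"file": ..., "sha256": ...}` under a `"models"` key, so check frozen_models_manifest.json for its actual schema and algorithm.

```python
import hashlib
import json
import pathlib

def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 without loading it all into memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_manifest(manifest_path):
    """Compare each entry's recorded checksum against the file on disk.

    The entry layout here ({"file": ..., "sha256": ...}) is an assumption;
    adapt the keys to the real frozen_models_manifest.json schema.
    """
    manifest = json.loads(pathlib.Path(manifest_path).read_text())
    return {
        entry["file"]: sha256_of(entry["file"]) == entry["sha256"]
        for entry in manifest["models"]
    }
```

Verifying checksums is especially worthwhile for the multi-gigabyte vit_h exports, where a truncated download would otherwise only surface as a load-time error.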

Dual Backend Support

MONAD supports both CUDA (GPU server / Jetson) and MLX (Apple Silicon) backends:

  • CUDA: bf16 mixed precision, TorchScript exports
  • MLX: fp32, native Apple GPU acceleration
  • CPU: Fallback for any platform

Backend is auto-detected at runtime: CUDA > MLX > CPU.
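The CUDA > MLX > CPU fallback order can be sketched as a pair of guarded imports. This is an illustration of the detection order described above, not MONAD's own implementation.

```python
def detect_backend():
    """Pick the best available backend: CUDA > MLX > CPU (a sketch)."""
    try:
        import torch
        if torch.cuda.is_available():
            return "cuda"
    except ImportError:
        pass
    try:
        import mlx.core  # noqa: F401  (only importable on Apple Silicon)
        return "mlx"
    except ImportError:
        pass
    return "cpu"

print(detect_backend())
```

Because each probe is wrapped in a try/except, the function degrades gracefully on machines where torch or mlx is not installed at all.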

Validation

  • 71/74 tests pass on CUDA (3 failures are macOS-only MLX tests)
  • 82% code coverage
  • 67 tests pass on macOS with MLX backend
  • Inference validated on NVIDIA L4 (22GB) with 3 demo videos

Citation

@article{monad2025,
  title={IST-ROS: A flexible object segmentation and tracking framework for robotics applications},
  journal={SoftwareX},
  year={2025},
  publisher={Elsevier}
}

License

Apache 2.0 - Robot Flow Labs / AIFLOW LABS LIMITED
