MONAD - Interactive Segmentation & Tracking for Robotics

Part of the ANIMA Perception Suite by Robot Flow Labs / AIFLOW LABS LIMITED.

Overview

MONAD is an interactive video object segmentation and tracking module designed for robotics applications. It combines SAM (Segment Anything Model) for initial mask generation with XMem for long-term video object segmentation, providing a unified pipeline for real-time object tracking from a single point or box prompt.

Paper

IST-ROS: A flexible object segmentation and tracking framework for robotics applications. Published in SoftwareX, 2025.

The paper presents a ROS-integrated framework for interactive segmentation and tracking, enabling robots to segment and track arbitrary objects in real-time from minimal user input.

Architecture

MONAD implements a two-stage inference pipeline:

  1. SAM (Segment Anything Model) - Generates initial segmentation mask from point/box prompt

    • vit_h variant (2.4 GB, higher quality) or vit_b variant (358 MB, faster)
    • Produces binary mask from a single click or bounding box
  2. XMem (Video Object Segmentation) - Propagates mask across video frames

    • Long-term memory with sensory, working, and long-term stores
    • Handles occlusion, appearance changes, and fast motion
    • Configurable memory management for edge deployment
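The two-stage handoff can be sketched as follows. This is a minimal control-flow illustration only: `sam_segment`, `xmem_propagate`, and `track` are hypothetical stand-ins for the real models, showing the shape of the pipeline (one prompt in, one mask per frame out), not MONAD's actual code.

```python
# Control-flow sketch of the SAM -> XMem pipeline (stubbed, illustrative).

def sam_segment(frame, point):
    """Stage 1: produce a binary mask from a single point prompt."""
    h, w = len(frame), len(frame[0])
    # Real SAM predicts the whole object around the click; the stub
    # just marks the clicked pixel.
    mask = [[0] * w for _ in range(h)]
    mask[point[1]][point[0]] = 1
    return mask

def xmem_propagate(frames, initial_mask):
    """Stage 2: propagate the mask across all subsequent frames."""
    masks = [initial_mask]
    for _ in frames[1:]:
        # Real XMem consults its sensory/working/long-term memory here;
        # the stub simply carries the previous mask forward.
        masks.append(masks[-1])
    return masks

def track(frames, point):
    first_mask = sam_segment(frames[0], point)   # SAM runs once
    return xmem_propagate(frames, first_mask)    # XMem handles the rest

# Four dummy 4x4 frames, click at (x=1, y=2)
frames = [[[0] * 4 for _ in range(4)] for _ in range(4)]
masks = track(frames, (1, 2))
print(len(masks))  # one mask per frame -> 4
```

The key property the sketch preserves is that SAM is invoked only once per prompt; all subsequent frames are handled by the (much cheaper) propagation stage.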

Exported Formats

Format         File                              Size     Use Case
SafeTensors    pytorch/sam_vit_h_v1.safetensors  2564 MB  SAM vit_h - safe, fast loading
SafeTensors    pytorch/sam_vit_b_v1.safetensors  375 MB   SAM vit_b - edge deployment
SafeTensors    pytorch/xmem_v1.safetensors       249 MB   XMem tracker - safe format
TorchScript    pytorch/sam_vit_h_encoder.pt      2431 MB  SAM vit_h encoder - JIT compiled
TorchScript    pytorch/sam_vit_b_encoder.pt      342 MB   SAM vit_b encoder - JIT compiled
Frozen Bundle  pytorch/xmem_frozen.pt            238 MB   XMem - frozen inference bundle

GPU Benchmark Results (NVIDIA L4, 22GB)

Segmentation (Single Frame)

SAM Model  Resolution  Mean Latency  Throughput  Peak VRAM
vit_h      640x480     1125.6 ms     0.9 fps     5977 MB
vit_b      640x480     324.1 ms      3.1 fps     3008 MB
vit_b      1280x720    353.8 ms      2.8 fps     3007 MB

Tracking (24 Frames, SAM + XMem)

SAM Model  Resolution  Mean Latency  Throughput  Peak VRAM
vit_h      640x480     87.2 ms/f     1.9 fps     18.4 GB
vit_b      640x480     89.4 ms/f     4.8 fps     16.3 GB
vit_b      320x240     67.6 ms/f     5.3 fps     4.5 GB

vit_b at 640x480 is the recommended configuration for NVIDIA L4 GPUs. Tracking at 1280x720 runs out of memory on 22 GB cards; use a GPU with 24 GB or more for HD tracking.

Quick Start

import torch
from safetensors.torch import load_file

# Load SAM vit_b weights (recommended for edge)
sam_state = load_file("pytorch/sam_vit_b_v1.safetensors")

# Load XMem weights
xmem_state = load_file("pytorch/xmem_v1.safetensors")

Full Pipeline Usage

# Install MONAD
git clone https://github.com/RobotFlow-Labs/project_monad
cd project_monad
uv sync --extra gpu

# Download required base weights separately:
# - SAM vit_h: https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth
# - SAM vit_b: https://dl.fbaipublicfiles.com/segment_anything/sam_vit_b_01ec64.pth
# - XMem: https://github.com/hkchengrex/XMem/releases

# Run inference
CUDA_VISIBLE_DEVICES=0 uv run python scripts/run_baseline_inference.py \
  --config configs/gpu_server.toml \
  --input-path your_video.mp4 \
  --output-dir output/ \
  --max-frames 24 --point 320 240

# Start API server
uv run python -m monad.server

Hardware Requirements

Deployment        Min VRAM  Recommended VRAM  Max Frames
Edge (vit_b)      6 GB      8 GB              ~15 frames
Standard (vit_b)  16 GB     24 GB             ~30 frames
Full (vit_h)      18 GB     24 GB             ~30 frames
Extended          48 GB+    80 GB             100+ frames
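For deployment scripts, the table above can be encoded as a small lookup that picks the most capable tier fitting the available VRAM. This is a convenience sketch; the tier names and thresholds simply mirror the table.

```python
# Deployment tiers from the hardware table: (min VRAM in GB, tier name),
# ordered from most to least demanding.
TIERS = [
    (48, "Extended"),
    (18, "Full (vit_h)"),
    (16, "Standard (vit_b)"),
    (6, "Edge (vit_b)"),
]

def pick_tier(vram_gb):
    """Return the most capable tier that fits in the given VRAM."""
    for min_vram, name in TIERS:
        if vram_gb >= min_vram:
            return name
    return None  # below the 6 GB edge minimum

print(pick_tier(22))  # a 22 GB L4 -> "Full (vit_h)"
print(pick_tier(8))   # -> "Edge (vit_b)"
```

Note that a 22 GB L4 lands in the Full (vit_h) tier, which is consistent with the 18.4 GB peak VRAM measured for vit_h tracking in the benchmark above.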

CPU fallback is functional but ~5x slower than GPU inference.

Repository Contents

pytorch/                          - Model weights (SafeTensors + TorchScript)
configs/                          - Training & inference configurations
  paper_mvp.toml                  - Paper-faithful MVP config
  gpu_server.toml                 - GPU server optimized config
logs/                             - Validation results & benchmarks
  gpu_validation_results.json     - Full GPU benchmark data
inference_runs/                   - Sample inference outputs
  crossing_objects/               - Demo masks + overlay video
frozen_models_manifest.json       - Export manifest with checksums
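Since the export manifest ships checksums, downloaded weights can be verified before loading. The helper below is a sketch: it assumes SHA-256 and a manifest entry shape of `{"file": ..., "sha256": ...}` under a `"models"` key, so check frozen_models_manifest.json for its actual schema and algorithm.

```python
import hashlib
import json
import pathlib

def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 without loading it all into memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_manifest(manifest_path):
    """Compare each entry's recorded checksum against the file on disk.

    The entry layout here ({"file": ..., "sha256": ...}) is an assumption;
    adapt the keys to the real frozen_models_manifest.json schema.
    """
    manifest = json.loads(pathlib.Path(manifest_path).read_text())
    return {
        entry["file"]: sha256_of(entry["file"]) == entry["sha256"]
        for entry in manifest["models"]
    }
```

Verifying checksums is especially worthwhile for the multi-gigabyte vit_h exports, where a truncated download would otherwise only surface as a load-time error.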

Dual Backend Support

MONAD supports both CUDA (GPU server / Jetson) and MLX (Apple Silicon) backends:

  • CUDA: bf16 mixed precision, TorchScript exports
  • MLX: fp32, native Apple GPU acceleration
  • CPU: Fallback for any platform

Backend is auto-detected at runtime: CUDA > MLX > CPU.
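The CUDA > MLX > CPU fallback order can be sketched as a pair of guarded imports. This is an illustration of the detection order described above, not MONAD's own implementation.

```python
def detect_backend():
    """Pick the best available backend: CUDA > MLX > CPU (a sketch)."""
    try:
        import torch
        if torch.cuda.is_available():
            return "cuda"
    except ImportError:
        pass
    try:
        import mlx.core  # noqa: F401  (only importable on Apple Silicon)
        return "mlx"
    except ImportError:
        pass
    return "cpu"

print(detect_backend())
```

Because each probe is wrapped in a try/except, the function degrades gracefully on machines where torch or mlx is not installed at all.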

Validation

  • 71/74 tests pass on CUDA (3 failures are macOS-only MLX tests)
  • 82% code coverage
  • 67 tests pass on macOS with MLX backend
  • Inference validated on NVIDIA L4 (22GB) with 3 demo videos

Citation

@article{monad2025,
  title={IST-ROS: A flexible object segmentation and tracking framework for robotics applications},
  journal={SoftwareX},
  year={2025},
  publisher={Elsevier}
}

License

Apache 2.0 - Robot Flow Labs / AIFLOW LABS LIMITED
