MONAD - Interactive Segmentation & Tracking for Robotics
Part of the ANIMA Perception Suite by Robot Flow Labs / AIFLOW LABS LIMITED.
Overview
MONAD is an interactive video object segmentation and tracking module designed for robotics applications. It combines SAM (Segment Anything Model) for initial mask generation with XMem for long-term video object segmentation, providing a unified pipeline for real-time object tracking from a single point or box prompt.
Paper
*IST-ROS: A flexible object segmentation and tracking framework for robotics applications*. Published in SoftwareX, 2025.
The paper presents a ROS-integrated framework for interactive segmentation and tracking, enabling robots to segment and track arbitrary objects in real-time from minimal user input.
Architecture
MONAD implements a two-stage inference pipeline:
1. SAM (Segment Anything Model) - generates the initial segmentation mask from a point or box prompt
   - `vit_h` variant (2.4 GB, higher quality) or `vit_b` variant (358 MB, faster)
   - Produces a binary mask from a single click or bounding box
2. XMem (Video Object Segmentation) - propagates the mask across video frames
   - Long-term memory with sensory, working, and long-term stores
   - Handles occlusion, appearance changes, and fast motion
   - Configurable memory management for edge deployment
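The bounded-memory idea behind XMem's working store can be illustrated with a minimal sketch. This is plain Python for intuition only, not XMem's actual key/value attention memory: once the store is full, the oldest entries are evicted, which is what keeps memory bounded for edge deployment.

```python
from collections import deque

class WorkingMemory:
    """Toy bounded working-memory store. XMem's real working memory holds
    key/value feature maps and is read via attention; here we just show
    the eviction behavior that bounds memory growth."""

    def __init__(self, max_entries=5):
        # deque with maxlen silently drops the oldest entry when full
        self.entries = deque(maxlen=max_entries)

    def add(self, frame_features):
        self.entries.append(frame_features)

    def __len__(self):
        return len(self.entries)

mem = WorkingMemory(max_entries=3)
for t in range(5):
    mem.add({"t": t})

print(len(mem))             # 3 - store stays bounded
print(mem.entries[0]["t"])  # 2 - frames 0 and 1 were evicted
```

A real deployment would tune the store size (and XMem's long-term consolidation) against the VRAM budget of the target device.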
Exported Formats
| Format | File | Size | Use Case |
|---|---|---|---|
| SafeTensors | `pytorch/sam_vit_h_v1.safetensors` | 2564 MB | SAM vit_h - safe, fast loading |
| SafeTensors | `pytorch/sam_vit_b_v1.safetensors` | 375 MB | SAM vit_b - edge deployment |
| SafeTensors | `pytorch/xmem_v1.safetensors` | 249 MB | XMem tracker - safe format |
| TorchScript | `pytorch/sam_vit_h_encoder.pt` | 2431 MB | SAM vit_h encoder - JIT compiled |
| TorchScript | `pytorch/sam_vit_b_encoder.pt` | 342 MB | SAM vit_b encoder - JIT compiled |
| Frozen Bundle | `pytorch/xmem_frozen.pt` | 238 MB | XMem - frozen inference bundle |
GPU Benchmark Results (NVIDIA L4, 22GB)
Segmentation (Single Frame)
| SAM Model | Resolution | Mean Latency | Throughput | Peak VRAM |
|---|---|---|---|---|
| vit_h | 640x480 | 1125.6 ms | 0.9 fps | 5977 MB |
| vit_b | 640x480 | 324.1 ms | 3.1 fps | 3008 MB |
| vit_b | 1280x720 | 353.8 ms | 2.8 fps | 3007 MB |
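For the single-frame table above, throughput is simply the inverse of mean latency; a quick sanity check (the conversion helper is ours, values are from the table):

```python
def fps_from_latency_ms(latency_ms):
    """Convert a per-frame mean latency in milliseconds to frames per second."""
    return 1000.0 / latency_ms

# Values from the single-frame segmentation benchmark above
print(round(fps_from_latency_ms(1125.6), 1))  # 0.9 (vit_h @ 640x480)
print(round(fps_from_latency_ms(324.1), 1))   # 3.1 (vit_b @ 640x480)
print(round(fps_from_latency_ms(353.8), 1))   # 2.8 (vit_b @ 1280x720)
```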
Tracking (24 Frames, SAM + XMem)
| SAM Model | Resolution | Mean Latency | Throughput | Peak VRAM |
|---|---|---|---|---|
| vit_h | 640x480 | 87.2 ms/f | 1.9 fps | 18.4 GB |
| vit_b | 640x480 | 89.4 ms/f | 4.8 fps | 16.3 GB |
| vit_b | 320x240 | 67.6 ms/f | 5.3 fps | 4.5 GB |
`vit_b` at 640x480 is the recommended tracking configuration for NVIDIA L4 GPUs. Tracking at 1280x720 causes an out-of-memory error on 22 GB GPUs; use a 24 GB+ GPU for HD tracking.
Quick Start
```python
import torch
from safetensors.torch import load_file

# Load SAM vit_b weights (recommended for edge)
sam_state = load_file("pytorch/sam_vit_b_v1.safetensors")

# Load XMem weights
xmem_state = load_file("pytorch/xmem_v1.safetensors")
```
Full Pipeline Usage
```bash
# Install MONAD
git clone https://github.com/RobotFlow-Labs/project_monad
cd project_monad
uv sync --extra gpu

# Download required base weights separately:
# - SAM vit_h: https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth
# - SAM vit_b: https://dl.fbaipublicfiles.com/segment_anything/sam_vit_b_01ec64.pth
# - XMem: https://github.com/hkchengrex/XMem/releases

# Run inference
CUDA_VISIBLE_DEVICES=0 uv run python scripts/run_baseline_inference.py \
    --config configs/gpu_server.toml \
    --input-path your_video.mp4 \
    --output-dir output/ \
    --max-frames 24 --point 320 240

# Start API server
uv run python -m monad.server
```
Hardware Requirements
| Deployment | Min VRAM | Recommended | Max Frames |
|---|---|---|---|
| Edge (vit_b) | 6 GB | 8 GB | ~15 frames |
| Standard (vit_b) | 16 GB | 24 GB | ~30 frames |
| Full (vit_h) | 18 GB | 24 GB | ~30 frames |
| Extended | 48 GB+ | 80 GB | 100+ frames |
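The tiers above can be encoded as a small selection helper; a sketch with thresholds taken from the table (the function name and tier labels are ours, not part of the MONAD API):

```python
def pick_deployment(vram_gb):
    """Map available VRAM (GB) to a deployment tier from the table above."""
    if vram_gb >= 48:
        return "extended"   # 100+ frames
    if vram_gb >= 18:
        return "full"       # vit_h, ~30 frames
    if vram_gb >= 16:
        return "standard"   # vit_b, ~30 frames
    if vram_gb >= 6:
        return "edge"       # vit_b, ~15 frames
    return "cpu"            # CPU fallback, ~5x slower

print(pick_deployment(22))  # "full" (e.g. NVIDIA L4)
print(pick_deployment(8))   # "edge"
```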
CPU fallback is functional but ~5x slower than GPU inference.
Repository Contents
- `pytorch/` - model weights (SafeTensors + TorchScript)
- `configs/` - training & inference configurations
  - `paper_mvp.toml` - paper-faithful MVP config
  - `gpu_server.toml` - GPU server optimized config
- `logs/` - validation results & benchmarks
  - `gpu_validation_results.json` - full GPU benchmark data
- `inference_runs/` - sample inference outputs
  - `crossing_objects/` - demo masks + overlay video
- `frozen_models_manifest.json` - export manifest with checksums
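Since `frozen_models_manifest.json` records export checksums, downloaded weights can be verified with a short helper. The entry field names (`file`, `sha256`) below are an assumption; adapt them to the manifest's actual schema.

```python
import hashlib
import json

def sha256sum(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 in 1 MB chunks so multi-GB
    weight files never have to be loaded into RAM at once."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_manifest(manifest_path="frozen_models_manifest.json"):
    # Assumed schema: a list of {"file": ..., "sha256": ...} entries.
    with open(manifest_path) as f:
        manifest = json.load(f)
    for entry in manifest:
        ok = sha256sum(entry["file"]) == entry["sha256"]
        print(f"{entry['file']}: {'OK' if ok else 'MISMATCH'}")
```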
Dual Backend Support
MONAD supports both CUDA (GPU server / Jetson) and MLX (Apple Silicon) backends:
- CUDA: bf16 mixed precision, TorchScript exports
- MLX: fp32, native Apple GPU acceleration
- CPU: Fallback for any platform
Backend is auto-detected at runtime: CUDA > MLX > CPU.
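The CUDA > MLX > CPU priority can be sketched with stdlib-only detection; MONAD's actual selection logic may differ, but the ordering is the same:

```python
from importlib.util import find_spec

def detect_backend():
    """Pick the best available backend in priority order: CUDA > MLX > CPU."""
    if find_spec("torch") is not None:
        import torch
        if torch.cuda.is_available():
            return "cuda"  # bf16 mixed precision path
    if find_spec("mlx") is not None:
        return "mlx"       # Apple Silicon, fp32
    return "cpu"           # universal fallback

print(detect_backend())
```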
Validation
- 71/74 tests pass on CUDA (3 failures are macOS-only MLX tests)
- 82% code coverage
- 67 tests pass on macOS with MLX backend
- Inference validated on NVIDIA L4 (22GB) with 3 demo videos
Citation
```bibtex
@article{monad2025,
  title={IST-ROS: A flexible object segmentation and tracking framework for robotics applications},
  journal={SoftwareX},
  year={2025},
  publisher={Elsevier}
}
```
License
Apache 2.0 - Robot Flow Labs / AIFLOW LABS LIMITED