ilessio-aiflowlab commited on 21 days ago

Commit

a2ff36b

verified ·

1 Parent(s): 0f0795c

Upload folder using huggingface_hub

Browse files

Files changed (37) hide show

README.md +172 -0
configs/gpu_server.toml +99 -0
configs/paper_mvp.toml +96 -0
frozen_models_manifest.json +44 -0
inference_runs/crossing_objects/masks/mask_0000.png +0 -0
inference_runs/crossing_objects/masks/mask_0001.png +0 -0
inference_runs/crossing_objects/masks/mask_0002.png +0 -0
inference_runs/crossing_objects/masks/mask_0003.png +0 -0
inference_runs/crossing_objects/masks/mask_0004.png +0 -0
inference_runs/crossing_objects/masks/mask_0005.png +0 -0
inference_runs/crossing_objects/masks/mask_0006.png +0 -0
inference_runs/crossing_objects/masks/mask_0007.png +0 -0
inference_runs/crossing_objects/masks/mask_0008.png +0 -0
inference_runs/crossing_objects/masks/mask_0009.png +0 -0
inference_runs/crossing_objects/masks/mask_0010.png +0 -0
inference_runs/crossing_objects/masks/mask_0011.png +0 -0
inference_runs/crossing_objects/masks/mask_0012.png +0 -0
inference_runs/crossing_objects/masks/mask_0013.png +0 -0
inference_runs/crossing_objects/masks/mask_0014.png +0 -0
inference_runs/crossing_objects/masks/mask_0015.png +0 -0
inference_runs/crossing_objects/masks/mask_0016.png +0 -0
inference_runs/crossing_objects/masks/mask_0017.png +0 -0
inference_runs/crossing_objects/masks/mask_0018.png +0 -0
inference_runs/crossing_objects/masks/mask_0019.png +0 -0
inference_runs/crossing_objects/masks/mask_0020.png +0 -0
inference_runs/crossing_objects/masks/mask_0021.png +0 -0
inference_runs/crossing_objects/masks/mask_0022.png +0 -0
inference_runs/crossing_objects/masks/mask_0023.png +0 -0
inference_runs/crossing_objects/metrics.json +17 -0
inference_runs/crossing_objects/overlay.mp4 +0 -0
logs/gpu_validation_results.json +118 -0
pytorch/sam_vit_b_encoder.pt +3 -0
pytorch/sam_vit_b_v1.safetensors +3 -0
pytorch/sam_vit_h_encoder.pt +3 -0
pytorch/sam_vit_h_v1.safetensors +3 -0
pytorch/xmem_frozen.pt +3 -0
pytorch/xmem_v1.safetensors +3 -0

README.md ADDED Viewed

	@@ -0,0 +1,172 @@

+---
+tags:
+  - robotics
+  - anima
+  - monad
+  - segmentation
+  - tracking
+  - video-object-segmentation
+  - sam
+  - xmem
+  - robot-flow-labs
+library_name: pytorch
+pipeline_tag: image-segmentation
+license: apache-2.0
+datasets:
+  - n/a
+language:
+  - en
+---
+# MONAD - Interactive Segmentation & Tracking for Robotics
+Part of the [ANIMA Perception Suite](https://github.com/RobotFlow-Labs) by **Robot Flow Labs** / AIFLOW LABS LIMITED.
+## Overview
+MONAD is an interactive video object segmentation and tracking module designed for robotics applications. It combines **SAM (Segment Anything Model)** for initial mask generation with **XMem** for long-term video object segmentation, providing a unified pipeline for real-time object tracking from a single point or box prompt.
+## Paper
+**IST-ROS: A flexible object segmentation and tracking framework for robotics applications**
+Published in *SoftwareX*, 2025.
+The paper presents a ROS-integrated framework for interactive segmentation and tracking, enabling robots to segment and track arbitrary objects in real-time from minimal user input.
+## Architecture
+MONAD implements a two-stage inference pipeline:
+1. **SAM (Segment Anything Model)** - Generates initial segmentation mask from point/box prompt
+   - `vit_h` variant (2.4 GB, higher quality) or `vit_b` variant (358 MB, faster)
+   - Produces binary mask from a single click or bounding box
+2. **XMem (Video Object Segmentation)** - Propagates mask across video frames
+   - Long-term memory with sensory, working, and long-term stores
+   - Handles occlusion, appearance changes, and fast motion
+   - Configurable memory management for edge deployment
+## Exported Formats
+| Format | File | Size | Use Case |
+|--------|------|------|----------|
+| SafeTensors | `pytorch/sam_vit_h_v1.safetensors` | 2564 MB | SAM vit_h - safe, fast loading |
+| SafeTensors | `pytorch/sam_vit_b_v1.safetensors` | 375 MB | SAM vit_b - edge deployment |
+| SafeTensors | `pytorch/xmem_v1.safetensors` | 249 MB | XMem tracker - safe format |
+| TorchScript | `pytorch/sam_vit_h_encoder.pt` | 2431 MB | SAM vit_h encoder - JIT compiled |
+| TorchScript | `pytorch/sam_vit_b_encoder.pt` | 342 MB | SAM vit_b encoder - JIT compiled |
+| Frozen Bundle | `pytorch/xmem_frozen.pt` | 238 MB | XMem - frozen inference bundle |
+## GPU Benchmark Results (NVIDIA L4, 22GB)
+### Segmentation (Single Frame)
+| SAM Model | Resolution | Mean Latency | Throughput | Peak VRAM |
+|-----------|-----------|-------------|------------|-----------|
+| vit_h | 640x480 | 1125.6 ms | 0.9 fps | 5977 MB |
+| vit_b | 640x480 | 324.1 ms | 3.1 fps | 3008 MB |
+| vit_b | 1280x720 | 353.8 ms | 2.8 fps | 3007 MB |
+### Tracking (24 Frames, SAM + XMem)
+| SAM Model | Resolution | Mean Latency | Throughput | Peak VRAM |
+|-----------|-----------|-------------|------------|-----------|
+| vit_h | 640x480 | 87.2 ms/f | 1.9 fps | 18.4 GB |
+| vit_b | 640x480 | 89.4 ms/f | 4.8 fps | 16.3 GB |
+| vit_b | 320x240 | 67.6 ms/f | 5.3 fps | 4.5 GB |
+> vit_b at 640x480 is the recommended configuration for NVIDIA L4 GPUs.
+> 1280x720 causes OOM on 22GB GPUs. Use 24GB+ for HD tracking.
+## Quick Start
+```python
+import torch
+from safetensors.torch import load_file
+# Load SAM vit_b weights (recommended for edge)
+sam_state = load_file("pytorch/sam_vit_b_v1.safetensors")
+# Load XMem weights
+xmem_state = load_file("pytorch/xmem_v1.safetensors")
+```
+### Full Pipeline Usage
+```bash
+# Install MONAD
+git clone https://github.com/RobotFlow-Labs/project_monad
+cd project_monad
+uv sync --extra gpu
+# Download required base weights separately:
+# - SAM vit_h: https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth
+# - SAM vit_b: https://dl.fbaipublicfiles.com/segment_anything/sam_vit_b_01ec64.pth
+# - XMem: https://github.com/hkchengrex/XMem/releases
+# Run inference
+CUDA_VISIBLE_DEVICES=0 uv run python scripts/run_baseline_inference.py \
+  --config configs/gpu_server.toml \
+  --input-path your_video.mp4 \
+  --output-dir output/ \
+  --max-frames 24 --point 320 240
+# Start API server
+uv run python -m monad.server
+```
+## Hardware Requirements
+| Deployment | Min VRAM | Recommended | Max Frames |
+|-----------|----------|-------------|------------|
+| Edge (vit_b) | 6 GB | 8 GB | ~15 frames |
+| Standard (vit_b) | 16 GB | 24 GB | ~30 frames |
+| Full (vit_h) | 18 GB | 24 GB | ~30 frames |
+| Extended | 48 GB+ | 80 GB | 100+ frames |
+CPU fallback is functional but ~5x slower than GPU inference.
+## Repository Contents
+```
+pytorch/                          - Model weights (SafeTensors + TorchScript)
+configs/                          - Training & inference configurations
+  paper_mvp.toml                  - Paper-faithful MVP config
+  gpu_server.toml                 - GPU server optimized config
+logs/                             - Validation results & benchmarks
+  gpu_validation_results.json     - Full GPU benchmark data
+inference_runs/                   - Sample inference outputs
+  crossing_objects/               - Demo masks + overlay video
+frozen_models_manifest.json       - Export manifest with checksums
+```
+## Dual Backend Support
+MONAD supports both CUDA (GPU server / Jetson) and MLX (Apple Silicon) backends:
+- **CUDA**: bf16 mixed precision, TorchScript exports
+- **MLX**: fp32, native Apple GPU acceleration
+- **CPU**: Fallback for any platform
+Backend is auto-detected at runtime: CUDA > MLX > CPU.
+## Validation
+- **71/74 tests pass** on CUDA (3 failures are macOS-only MLX tests)
+- **82% code coverage**
+- **67 tests pass** on macOS with MLX backend
+- Inference validated on NVIDIA L4 (22GB) with 3 demo videos
+## Citation
+```bibtex
+@article{monad2025,
+  title={IST-ROS: A flexible object segmentation and tracking framework for robotics applications},
+  journal={SoftwareX},
+  year={2025},
+  publisher={Elsevier}
+}
+```
+## License
+Apache 2.0 - Robot Flow Labs / AIFLOW LABS LIMITED

configs/gpu_server.toml ADDED Viewed

	@@ -0,0 +1,99 @@

+# MONAD GPU Server Configuration
+# 8x NVIDIA L4 (23GB each) — inference validation
+[model]
+backend = "sam_xmem"
+device = "cuda"
+cache_dir = "/mnt/forge-data/.hf_cache"
+[segmentation]
+mask_confidence_threshold = 0.90
+binary_threshold = 0.50
+quality_threshold = 0.85
+stability_threshold = 0.80
+default_point_radius = 24
+return_single_mask = true
+[sam]
+model_type = "vit_h"
+checkpoint = "/mnt/forge-data/models/sam_vit_h_4b8939.pth"
+[xmem]
+checkpoint = "/mnt/forge-data/models/XMem.pth"
+buffer_size = 50
+num_objects = 1
+max_mid_term_frames = 10
+min_mid_term_frames = 5
+max_long_term_elements = 100000
+num_prototypes = 128
+top_k = 30
+mem_every = 10
+deep_update_every = -1
+enable_long_term = true
+enable_long_term_count_usage = true
+size = -1
+key_dim = 64
+value_dim = 512
+hidden_dim = 64
+single_object = false
+[mlx]
+enabled = false
+mode = "segment_anything"
+examples_dir = ".cache/monad/mlx-examples"
+sam_model = "vit_b"
+checkpoint = ""
+max_image_side = 1024
+prefer_unified_memory = false
+[tracking]
+tracker_type = "single_object"
+max_tracked_objects = 8
+tracking_memory_frames = 30
+confidence_decay = 0.98
+[inference]
+batch_size = 1
+max_batch_size = 4
+timeout_segmentation = 30.0
+timeout_tracking = 60.0
+[logging]
+level = "INFO"
+format = "json"
+structured = true
+[metrics]
+enable_metrics = true
+metrics_port = 9090
+track_inference_time = true
+track_memory_usage = true
+[api]
+host = "0.0.0.0"
+port = 8000
+workers = 1
+reload = false
+cors_origins = ["http://localhost:8000"]
+cors_allow_credentials = false
+cors_allow_methods = ["GET", "POST"]
+cors_allow_headers = ["Content-Type"]
+[grpc]
+enabled = false
+port = 50051
+[storage]
+models_cache_dir = "/mnt/forge-data/.hf_cache"
+temp_dir = "/tmp/monad"
+artifacts_dir = "/mnt/artifacts-datai/models/project_monad"
+max_cache_size_gb = 50
+run_retention_max_runs = 25
+run_retention_max_age_days = 30
+[features]
+enable_rest_api = true
+enable_grpc = false
+enable_websocket = false
+enable_batch_processing = true
+enable_stream_processing = false

configs/paper_mvp.toml ADDED Viewed

	@@ -0,0 +1,96 @@

+[model]
+backend = "sam_xmem"
+device = "auto"
+cache_dir = ".cache/monad"
+[segmentation]
+mask_confidence_threshold = 0.90
+binary_threshold = 0.50
+quality_threshold = 0.85
+stability_threshold = 0.80
+default_point_radius = 24
+return_single_mask = true
+[sam]
+model_type = "vit_h"
+checkpoint = "data/weights/sam_vit_h_4b8939.pth"
+[xmem]
+checkpoint = "data/weights/XMem.pth"
+buffer_size = 50
+num_objects = 1
+max_mid_term_frames = 10
+min_mid_term_frames = 5
+max_long_term_elements = 100000
+num_prototypes = 128
+top_k = 30
+mem_every = 10
+deep_update_every = -1
+enable_long_term = true
+enable_long_term_count_usage = true
+size = -1
+key_dim = 64
+value_dim = 512
+hidden_dim = 64
+single_object = false
+[mlx]
+enabled = false
+mode = "segment_anything"
+examples_dir = ".cache/monad/mlx-examples"
+sam_model = "vit_b"
+checkpoint = "data/weights/sam_vit_b_01ec64.pth"
+max_image_side = 1024
+prefer_unified_memory = true
+[tracking]
+tracker_type = "single_object"
+max_tracked_objects = 8
+tracking_memory_frames = 30
+confidence_decay = 0.98
+[inference]
+batch_size = 1
+max_batch_size = 4
+timeout_segmentation = 30.0
+timeout_tracking = 60.0
+[logging]
+level = "INFO"
+format = "json"
+structured = true
+[metrics]
+enable_metrics = true
+metrics_port = 9090
+track_inference_time = true
+track_memory_usage = true
+[api]
+host = "0.0.0.0"
+port = 8000
+workers = 1
+reload = false
+cors_origins = ["http://localhost:3000", "http://localhost:8000"]
+cors_allow_credentials = false
+cors_allow_methods = ["GET", "POST", "DELETE", "OPTIONS"]
+cors_allow_headers = ["Content-Type", "Authorization"]
+[grpc]
+enabled = false
+port = 50051
+[storage]
+models_cache_dir = ".cache/monad/models"
+temp_dir = "/tmp/monad"
+artifacts_dir = "artifacts"
+max_cache_size_gb = 50
+run_retention_max_runs = 25
+run_retention_max_age_days = 30
+[features]
+enable_rest_api = true
+enable_grpc = false
+enable_websocket = false
+enable_batch_processing = true
+enable_stream_processing = false

frozen_models_manifest.json ADDED Viewed

	@@ -0,0 +1,44 @@

+{
+  "timestamp": "2026-03-21T09:07:38",
+  "gpu": "NVIDIA L4",
+  "torch_version": "2.6.0+cu124",
+  "cuda_version": "12.4",
+  "exports": {
+    "sam_vit_h_encoder": {
+      "path": "/mnt/artifacts-datai/exports/project_monad/sam_vit_h_encoder.pt",
+      "size_mb": 2430.8,
+      "format": "torchscript",
+      "status": "ok"
+    },
+    "sam_vit_h_state_dict": {
+      "path": "/mnt/artifacts-datai/exports/project_monad/sam_vit_h_state_dict.pt",
+      "size_mb": 2445.8,
+      "format": "state_dict",
+      "status": "ok"
+    },
+    "sam_vit_b_encoder": {
+      "path": "/mnt/artifacts-datai/exports/project_monad/sam_vit_b_encoder.pt",
+      "size_mb": 342.3,
+      "format": "torchscript",
+      "status": "ok"
+    },
+    "sam_vit_b_state_dict": {
+      "path": "/mnt/artifacts-datai/exports/project_monad/sam_vit_b_state_dict.pt",
+      "size_mb": 357.7,
+      "format": "state_dict",
+      "status": "ok"
+    },
+    "xmem_state_dict": {
+      "path": "/mnt/artifacts-datai/exports/project_monad/xmem_state_dict.pt",
+      "size_mb": 237.5,
+      "format": "state_dict",
+      "status": "ok"
+    },
+    "xmem_frozen": {
+      "path": "/mnt/artifacts-datai/exports/project_monad/xmem_frozen.pt",
+      "size_mb": 237.5,
+      "format": "frozen_bundle",
+      "status": "ok"
+    }
+  }
+}

inference_runs/crossing_objects/masks/mask_0000.png ADDED Viewed

inference_runs/crossing_objects/masks/mask_0001.png ADDED Viewed

inference_runs/crossing_objects/masks/mask_0002.png ADDED Viewed

inference_runs/crossing_objects/masks/mask_0003.png ADDED Viewed

inference_runs/crossing_objects/masks/mask_0004.png ADDED Viewed

inference_runs/crossing_objects/masks/mask_0005.png ADDED Viewed

inference_runs/crossing_objects/masks/mask_0006.png ADDED Viewed

inference_runs/crossing_objects/masks/mask_0007.png ADDED Viewed

inference_runs/crossing_objects/masks/mask_0008.png ADDED Viewed

inference_runs/crossing_objects/masks/mask_0009.png ADDED Viewed

inference_runs/crossing_objects/masks/mask_0010.png ADDED Viewed

inference_runs/crossing_objects/masks/mask_0011.png ADDED Viewed

inference_runs/crossing_objects/masks/mask_0012.png ADDED Viewed

inference_runs/crossing_objects/masks/mask_0013.png ADDED Viewed

inference_runs/crossing_objects/masks/mask_0014.png ADDED Viewed

inference_runs/crossing_objects/masks/mask_0015.png ADDED Viewed

inference_runs/crossing_objects/masks/mask_0016.png ADDED Viewed

inference_runs/crossing_objects/masks/mask_0017.png ADDED Viewed

inference_runs/crossing_objects/masks/mask_0018.png ADDED Viewed

inference_runs/crossing_objects/masks/mask_0019.png ADDED Viewed

inference_runs/crossing_objects/masks/mask_0020.png ADDED Viewed

inference_runs/crossing_objects/masks/mask_0021.png ADDED Viewed

inference_runs/crossing_objects/masks/mask_0022.png ADDED Viewed

inference_runs/crossing_objects/masks/mask_0023.png ADDED Viewed

inference_runs/crossing_objects/metrics.json ADDED Viewed

	@@ -0,0 +1,17 @@

+{
+  "backend": "sam_xmem",
+  "requested_device": "cuda",
+  "resolved_device": "cuda",
+  "backend_ready": true,
+  "backend_notes": [
+    "Research backend available"
+  ],
+  "frames_processed": 24,
+  "latency_ms": {
+    "mean": 84.62345065214797,
+    "max": 125.68026799999643
+  },
+  "tracker_id": "trk_b03b9f3248",
+  "object_id": "obj_ac9a2a710a",
+  "frame_limit": 24
+}

inference_runs/crossing_objects/overlay.mp4 ADDED Viewed

Binary file (33.1 kB). View file

logs/gpu_validation_results.json ADDED Viewed

	@@ -0,0 +1,118 @@

+{
+  "timestamp": "2026-03-21T08:44:14",
+  "cuda_devices": 2,
+  "gpu_names": [
+    "NVIDIA L4",
+    "NVIDIA L4"
+  ],
+  "model_loading": {
+    "vram_before": {
+      "gpu_0": {
+        "allocated_mb": 0.0,
+        "reserved_mb": 0.0,
+        "total_mb": 22563.1,
+        "utilization_pct": 0.0
+      },
+      "gpu_1": {
+        "allocated_mb": 0.0,
+        "reserved_mb": 0.0,
+        "total_mb": 22563.1,
+        "utilization_pct": 0.0
+      }
+    },
+    "vram_after": {
+      "gpu_0": {
+        "allocated_mb": 2695.1,
+        "reserved_mb": 2732.0,
+        "total_mb": 22563.1,
+        "utilization_pct": 11.9
+      },
+      "gpu_1": {
+        "allocated_mb": 0.0,
+        "reserved_mb": 0.0,
+        "total_mb": 22563.1,
+        "utilization_pct": 0.0
+      }
+    },
+    "load_time_s": 12.648,
+    "total_init_time_s": 12.755,
+    "backend": "sam_xmem",
+    "device_requested": "cuda",
+    "device_resolved": "cuda",
+    "sam_model_type": "vit_h",
+    "sam_checkpoint": "/mnt/forge-data/models/sam_vit_h_4b8939.pth",
+    "xmem_checkpoint": "/mnt/forge-data/models/XMem.pth"
+  },
+  "inference_benchmarks": [
+    {
+      "video": "crossing_objects.mp4",
+      "resolution": "640x480",
+      "video_fps": 24.0,
+      "total_video_frames": 36,
+      "frames_processed": 24,
+      "max_frames_limit": 24,
+      "latency_mean_ms": 84.62,
+      "latency_max_ms": 125.68,
+      "total_time_s": 3.479,
+      "peak_vram_mb": 18250.4,
+      "throughput_fps": 6.9,
+      "backend": "sam_xmem",
+      "device": "cuda",
+      "tracker_id": "trk_b03b9f3248",
+      "object_id": "obj_ac9a2a710a"
+    },
+    {
+      "video": "moving_circle.mp4",
+      "resolution": "640x480",
+      "video_fps": 24.0,
+      "total_video_frames": 48,
+      "frames_processed": 24,
+      "max_frames_limit": 24,
+      "latency_mean_ms": 82.75,
+      "latency_max_ms": 103.4,
+      "total_time_s": 3.31,
+      "peak_vram_mb": 18250.4,
+      "throughput_fps": 7.25,
+      "backend": "sam_xmem",
+      "device": "cuda",
+      "tracker_id": "trk_6c1a8b0c67",
+      "object_id": "obj_1bb04c1f69"
+    },
+    {
+      "video": "static_scene.mp4",
+      "resolution": "640x480",
+      "video_fps": 24.0,
+      "total_video_frames": 12,
+      "frames_processed": 12,
+      "max_frames_limit": 24,
+      "latency_mean_ms": 73.22,
+      "latency_max_ms": 94.14,
+      "total_time_s": 1.994,
+      "peak_vram_mb": 10417.9,
+      "throughput_fps": 6.02,
+      "backend": "sam_xmem",
+      "device": "cuda",
+      "tracker_id": "trk_fb592fd895",
+      "object_id": "obj_527b82ec72"
+    },
+    {
+      "video": "synthetic_stress_test.mp4",
+      "error": "OOM at 48 frames: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.03 GiB of which 5.",
+      "note": "SAM vit_h + XMem exceeds 22GB L4 at this frame count"
+    }
+  ],
+  "final_vram": {
+    "gpu_0": {
+      "allocated_mb": 22248.9,
+      "reserved_mb": 22318.0,
+      "total_mb": 22563.1,
+      "utilization_pct": 98.6
+    },
+    "gpu_1": {
+      "allocated_mb": 0.0,
+      "reserved_mb": 0.0,
+      "total_mb": 22563.1,
+      "utilization_pct": 0.0
+    }
+  }
+}

pytorch/sam_vit_b_encoder.pt ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:3dd0a2dddd7ba744d9638035feaf065cd85f02855bb1a09e394610cf7f08d756
+size 358970620

pytorch/sam_vit_b_v1.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:b2d6ca146991a3484ecca745be30df7e142aa136296b9fbb740951b09aa85ab9
+size 374979208

pytorch/sam_vit_h_encoder.pt ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:1389261ae8fbc6c6770017d3561aa6666985701a8d412333d93a4228b7ede683
+size 2548843796

pytorch/sam_vit_h_v1.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:298e1126d179717313e8781b55c7d044f60f0ee40d8316d5c3438ac835346070
+size 2564431552

pytorch/xmem_frozen.pt ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:11630d50f4a0058d4e62905e0f461b4b072871e38edc839ced1f47a89a935592
+size 249028922

pytorch/xmem_v1.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:76faa1d6fe3a4726a7d4f5e81d8e6e7e88a8669cc5df524f52f540232b6f0bb5
+size 248924912