Upload folder using huggingface_hub
Browse files- README.md +172 -0
- configs/gpu_server.toml +99 -0
- configs/paper_mvp.toml +96 -0
- frozen_models_manifest.json +44 -0
- inference_runs/crossing_objects/masks/mask_0000.png +0 -0
- inference_runs/crossing_objects/masks/mask_0001.png +0 -0
- inference_runs/crossing_objects/masks/mask_0002.png +0 -0
- inference_runs/crossing_objects/masks/mask_0003.png +0 -0
- inference_runs/crossing_objects/masks/mask_0004.png +0 -0
- inference_runs/crossing_objects/masks/mask_0005.png +0 -0
- inference_runs/crossing_objects/masks/mask_0006.png +0 -0
- inference_runs/crossing_objects/masks/mask_0007.png +0 -0
- inference_runs/crossing_objects/masks/mask_0008.png +0 -0
- inference_runs/crossing_objects/masks/mask_0009.png +0 -0
- inference_runs/crossing_objects/masks/mask_0010.png +0 -0
- inference_runs/crossing_objects/masks/mask_0011.png +0 -0
- inference_runs/crossing_objects/masks/mask_0012.png +0 -0
- inference_runs/crossing_objects/masks/mask_0013.png +0 -0
- inference_runs/crossing_objects/masks/mask_0014.png +0 -0
- inference_runs/crossing_objects/masks/mask_0015.png +0 -0
- inference_runs/crossing_objects/masks/mask_0016.png +0 -0
- inference_runs/crossing_objects/masks/mask_0017.png +0 -0
- inference_runs/crossing_objects/masks/mask_0018.png +0 -0
- inference_runs/crossing_objects/masks/mask_0019.png +0 -0
- inference_runs/crossing_objects/masks/mask_0020.png +0 -0
- inference_runs/crossing_objects/masks/mask_0021.png +0 -0
- inference_runs/crossing_objects/masks/mask_0022.png +0 -0
- inference_runs/crossing_objects/masks/mask_0023.png +0 -0
- inference_runs/crossing_objects/metrics.json +17 -0
- inference_runs/crossing_objects/overlay.mp4 +0 -0
- logs/gpu_validation_results.json +118 -0
- pytorch/sam_vit_b_encoder.pt +3 -0
- pytorch/sam_vit_b_v1.safetensors +3 -0
- pytorch/sam_vit_h_encoder.pt +3 -0
- pytorch/sam_vit_h_v1.safetensors +3 -0
- pytorch/xmem_frozen.pt +3 -0
- pytorch/xmem_v1.safetensors +3 -0
README.md
ADDED
|
@@ -0,0 +1,172 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
---
|
| 2 |
+
tags:
|
| 3 |
+
- robotics
|
| 4 |
+
- anima
|
| 5 |
+
- monad
|
| 6 |
+
- segmentation
|
| 7 |
+
- tracking
|
| 8 |
+
- video-object-segmentation
|
| 9 |
+
- sam
|
| 10 |
+
- xmem
|
| 11 |
+
- robot-flow-labs
|
| 12 |
+
library_name: pytorch
|
| 13 |
+
pipeline_tag: image-segmentation
|
| 14 |
+
license: apache-2.0
|
| 15 |
+
datasets:
|
| 16 |
+
- n/a
|
| 17 |
+
language:
|
| 18 |
+
- en
|
| 19 |
+
---
|
| 20 |
+
|
| 21 |
+
# MONAD - Interactive Segmentation & Tracking for Robotics
|
| 22 |
+
|
| 23 |
+
Part of the [ANIMA Perception Suite](https://github.com/RobotFlow-Labs) by **Robot Flow Labs** / AIFLOW LABS LIMITED.
|
| 24 |
+
|
| 25 |
+
## Overview
|
| 26 |
+
|
| 27 |
+
MONAD is an interactive video object segmentation and tracking module designed for robotics applications. It combines **SAM (Segment Anything Model)** for initial mask generation with **XMem** for long-term video object segmentation, providing a unified pipeline for real-time object tracking from a single point or box prompt.
|
| 28 |
+
|
| 29 |
+
## Paper
|
| 30 |
+
|
| 31 |
+
**IST-ROS: A flexible object segmentation and tracking framework for robotics applications**
|
| 32 |
+
Published in *SoftwareX*, 2025.
|
| 33 |
+
|
| 34 |
+
The paper presents a ROS-integrated framework for interactive segmentation and tracking, enabling robots to segment and track arbitrary objects in real-time from minimal user input.
|
| 35 |
+
|
| 36 |
+
## Architecture
|
| 37 |
+
|
| 38 |
+
MONAD implements a two-stage inference pipeline:
|
| 39 |
+
|
| 40 |
+
1. **SAM (Segment Anything Model)** - Generates initial segmentation mask from point/box prompt
|
| 41 |
+
- `vit_h` variant (2.4 GB, higher quality) or `vit_b` variant (358 MB, faster)
|
| 42 |
+
- Produces binary mask from a single click or bounding box
|
| 43 |
+
|
| 44 |
+
2. **XMem (Video Object Segmentation)** - Propagates mask across video frames
|
| 45 |
+
- Long-term memory with sensory, working, and long-term stores
|
| 46 |
+
- Handles occlusion, appearance changes, and fast motion
|
| 47 |
+
- Configurable memory management for edge deployment
|
| 48 |
+
|
| 49 |
+
## Exported Formats
|
| 50 |
+
|
| 51 |
+
| Format | File | Size | Use Case |
|
| 52 |
+
|--------|------|------|----------|
|
| 53 |
+
| SafeTensors | `pytorch/sam_vit_h_v1.safetensors` | 2564 MB | SAM vit_h - safe, fast loading |
|
| 54 |
+
| SafeTensors | `pytorch/sam_vit_b_v1.safetensors` | 375 MB | SAM vit_b - edge deployment |
|
| 55 |
+
| SafeTensors | `pytorch/xmem_v1.safetensors` | 249 MB | XMem tracker - safe format |
|
| 56 |
+
| TorchScript | `pytorch/sam_vit_h_encoder.pt` | 2431 MB | SAM vit_h encoder - JIT compiled |
|
| 57 |
+
| TorchScript | `pytorch/sam_vit_b_encoder.pt` | 342 MB | SAM vit_b encoder - JIT compiled |
|
| 58 |
+
| Frozen Bundle | `pytorch/xmem_frozen.pt` | 238 MB | XMem - frozen inference bundle |
|
| 59 |
+
|
| 60 |
+
## GPU Benchmark Results (NVIDIA L4, 22GB)
|
| 61 |
+
|
| 62 |
+
### Segmentation (Single Frame)
|
| 63 |
+
|
| 64 |
+
| SAM Model | Resolution | Mean Latency | Throughput | Peak VRAM |
|
| 65 |
+
|-----------|-----------|-------------|------------|-----------|
|
| 66 |
+
| vit_h | 640x480 | 1125.6 ms | 0.9 fps | 5977 MB |
|
| 67 |
+
| vit_b | 640x480 | 324.1 ms | 3.1 fps | 3008 MB |
|
| 68 |
+
| vit_b | 1280x720 | 353.8 ms | 2.8 fps | 3007 MB |
|
| 69 |
+
|
| 70 |
+
### Tracking (24 Frames, SAM + XMem)
|
| 71 |
+
|
| 72 |
+
| SAM Model | Resolution | Mean Latency | Throughput | Peak VRAM |
|
| 73 |
+
|-----------|-----------|-------------|------------|-----------|
|
| 74 |
+
| vit_h | 640x480 | 87.2 ms/f | 1.9 fps | 18.4 GB |
|
| 75 |
+
| vit_b | 640x480 | 89.4 ms/f | 4.8 fps | 16.3 GB |
|
| 76 |
+
| vit_b | 320x240 | 67.6 ms/f | 5.3 fps | 4.5 GB |
|
| 77 |
+
|
| 78 |
+
> vit_b at 640x480 is the recommended configuration for NVIDIA L4 GPUs.
|
| 79 |
+
> 1280x720 causes OOM on 22GB GPUs. Use 24GB+ for HD tracking.
|
| 80 |
+
|
| 81 |
+
## Quick Start
|
| 82 |
+
|
| 83 |
+
```python
|
| 84 |
+
import torch
|
| 85 |
+
from safetensors.torch import load_file
|
| 86 |
+
|
| 87 |
+
# Load SAM vit_b weights (recommended for edge)
|
| 88 |
+
sam_state = load_file("pytorch/sam_vit_b_v1.safetensors")
|
| 89 |
+
|
| 90 |
+
# Load XMem weights
|
| 91 |
+
xmem_state = load_file("pytorch/xmem_v1.safetensors")
|
| 92 |
+
```
|
| 93 |
+
|
| 94 |
+
### Full Pipeline Usage
|
| 95 |
+
|
| 96 |
+
```bash
|
| 97 |
+
# Install MONAD
|
| 98 |
+
git clone https://github.com/RobotFlow-Labs/project_monad
|
| 99 |
+
cd project_monad
|
| 100 |
+
uv sync --extra gpu
|
| 101 |
+
|
| 102 |
+
# Download required base weights separately:
|
| 103 |
+
# - SAM vit_h: https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth
|
| 104 |
+
# - SAM vit_b: https://dl.fbaipublicfiles.com/segment_anything/sam_vit_b_01ec64.pth
|
| 105 |
+
# - XMem: https://github.com/hkchengrex/XMem/releases
|
| 106 |
+
|
| 107 |
+
# Run inference
|
| 108 |
+
CUDA_VISIBLE_DEVICES=0 uv run python scripts/run_baseline_inference.py \
|
| 109 |
+
--config configs/gpu_server.toml \
|
| 110 |
+
--input-path your_video.mp4 \
|
| 111 |
+
--output-dir output/ \
|
| 112 |
+
--max-frames 24 --point 320 240
|
| 113 |
+
|
| 114 |
+
# Start API server
|
| 115 |
+
uv run python -m monad.server
|
| 116 |
+
```
|
| 117 |
+
|
| 118 |
+
## Hardware Requirements
|
| 119 |
+
|
| 120 |
+
| Deployment | Min VRAM | Recommended | Max Frames |
|
| 121 |
+
|-----------|----------|-------------|------------|
|
| 122 |
+
| Edge (vit_b) | 6 GB | 8 GB | ~15 frames |
|
| 123 |
+
| Standard (vit_b) | 16 GB | 24 GB | ~30 frames |
|
| 124 |
+
| Full (vit_h) | 18 GB | 24 GB | ~30 frames |
|
| 125 |
+
| Extended | 48 GB+ | 80 GB | 100+ frames |
|
| 126 |
+
|
| 127 |
+
CPU fallback is functional but ~5x slower than GPU inference.
|
| 128 |
+
|
| 129 |
+
## Repository Contents
|
| 130 |
+
|
| 131 |
+
```
|
| 132 |
+
pytorch/ - Model weights (SafeTensors + TorchScript)
|
| 133 |
+
configs/ - Training & inference configurations
|
| 134 |
+
paper_mvp.toml - Paper-faithful MVP config
|
| 135 |
+
gpu_server.toml - GPU server optimized config
|
| 136 |
+
logs/ - Validation results & benchmarks
|
| 137 |
+
gpu_validation_results.json - Full GPU benchmark data
|
| 138 |
+
inference_runs/ - Sample inference outputs
|
| 139 |
+
crossing_objects/ - Demo masks + overlay video
|
| 140 |
+
frozen_models_manifest.json - Export manifest with checksums
|
| 141 |
+
```
|
| 142 |
+
|
| 143 |
+
## Dual Backend Support
|
| 144 |
+
|
| 145 |
+
MONAD supports both CUDA (GPU server / Jetson) and MLX (Apple Silicon) backends:
|
| 146 |
+
- **CUDA**: bf16 mixed precision, TorchScript exports
|
| 147 |
+
- **MLX**: fp32, native Apple GPU acceleration
|
| 148 |
+
- **CPU**: Fallback for any platform
|
| 149 |
+
|
| 150 |
+
Backend is auto-detected at runtime: CUDA > MLX > CPU.
|
| 151 |
+
|
| 152 |
+
## Validation
|
| 153 |
+
|
| 154 |
+
- **71/74 tests pass** on CUDA (3 failures are macOS-only MLX tests)
|
| 155 |
+
- **82% code coverage**
|
| 156 |
+
- **67 tests pass** on macOS with MLX backend
|
| 157 |
+
- Inference validated on NVIDIA L4 (22GB) with 3 demo videos
|
| 158 |
+
|
| 159 |
+
## Citation
|
| 160 |
+
|
| 161 |
+
```bibtex
|
| 162 |
+
@article{monad2025,
|
| 163 |
+
title={IST-ROS: A flexible object segmentation and tracking framework for robotics applications},
|
| 164 |
+
journal={SoftwareX},
|
| 165 |
+
year={2025},
|
| 166 |
+
publisher={Elsevier}
|
| 167 |
+
}
|
| 168 |
+
```
|
| 169 |
+
|
| 170 |
+
## License
|
| 171 |
+
|
| 172 |
+
Apache 2.0 - Robot Flow Labs / AIFLOW LABS LIMITED
|
configs/gpu_server.toml
ADDED
|
@@ -0,0 +1,99 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# MONAD GPU Server Configuration
|
| 2 |
+
# 8x NVIDIA L4 (23GB each) — inference validation
|
| 3 |
+
|
| 4 |
+
[model]
|
| 5 |
+
backend = "sam_xmem"
|
| 6 |
+
device = "cuda"
|
| 7 |
+
cache_dir = "/mnt/forge-data/.hf_cache"
|
| 8 |
+
|
| 9 |
+
[segmentation]
|
| 10 |
+
mask_confidence_threshold = 0.90
|
| 11 |
+
binary_threshold = 0.50
|
| 12 |
+
quality_threshold = 0.85
|
| 13 |
+
stability_threshold = 0.80
|
| 14 |
+
default_point_radius = 24
|
| 15 |
+
return_single_mask = true
|
| 16 |
+
|
| 17 |
+
[sam]
|
| 18 |
+
model_type = "vit_h"
|
| 19 |
+
checkpoint = "/mnt/forge-data/models/sam_vit_h_4b8939.pth"
|
| 20 |
+
|
| 21 |
+
[xmem]
|
| 22 |
+
checkpoint = "/mnt/forge-data/models/XMem.pth"
|
| 23 |
+
buffer_size = 50
|
| 24 |
+
num_objects = 1
|
| 25 |
+
max_mid_term_frames = 10
|
| 26 |
+
min_mid_term_frames = 5
|
| 27 |
+
max_long_term_elements = 100000
|
| 28 |
+
num_prototypes = 128
|
| 29 |
+
top_k = 30
|
| 30 |
+
mem_every = 10
|
| 31 |
+
deep_update_every = -1
|
| 32 |
+
enable_long_term = true
|
| 33 |
+
enable_long_term_count_usage = true
|
| 34 |
+
size = -1
|
| 35 |
+
key_dim = 64
|
| 36 |
+
value_dim = 512
|
| 37 |
+
hidden_dim = 64
|
| 38 |
+
single_object = false
|
| 39 |
+
|
| 40 |
+
[mlx]
|
| 41 |
+
enabled = false
|
| 42 |
+
mode = "segment_anything"
|
| 43 |
+
examples_dir = ".cache/monad/mlx-examples"
|
| 44 |
+
sam_model = "vit_b"
|
| 45 |
+
checkpoint = ""
|
| 46 |
+
max_image_side = 1024
|
| 47 |
+
prefer_unified_memory = false
|
| 48 |
+
|
| 49 |
+
[tracking]
|
| 50 |
+
tracker_type = "single_object"
|
| 51 |
+
max_tracked_objects = 8
|
| 52 |
+
tracking_memory_frames = 30
|
| 53 |
+
confidence_decay = 0.98
|
| 54 |
+
|
| 55 |
+
[inference]
|
| 56 |
+
batch_size = 1
|
| 57 |
+
max_batch_size = 4
|
| 58 |
+
timeout_segmentation = 30.0
|
| 59 |
+
timeout_tracking = 60.0
|
| 60 |
+
|
| 61 |
+
[logging]
|
| 62 |
+
level = "INFO"
|
| 63 |
+
format = "json"
|
| 64 |
+
structured = true
|
| 65 |
+
|
| 66 |
+
[metrics]
|
| 67 |
+
enable_metrics = true
|
| 68 |
+
metrics_port = 9090
|
| 69 |
+
track_inference_time = true
|
| 70 |
+
track_memory_usage = true
|
| 71 |
+
|
| 72 |
+
[api]
|
| 73 |
+
host = "0.0.0.0"
|
| 74 |
+
port = 8000
|
| 75 |
+
workers = 1
|
| 76 |
+
reload = false
|
| 77 |
+
cors_origins = ["http://localhost:8000"]
|
| 78 |
+
cors_allow_credentials = false
|
| 79 |
+
cors_allow_methods = ["GET", "POST"]
|
| 80 |
+
cors_allow_headers = ["Content-Type"]
|
| 81 |
+
|
| 82 |
+
[grpc]
|
| 83 |
+
enabled = false
|
| 84 |
+
port = 50051
|
| 85 |
+
|
| 86 |
+
[storage]
|
| 87 |
+
models_cache_dir = "/mnt/forge-data/.hf_cache"
|
| 88 |
+
temp_dir = "/tmp/monad"
|
| 89 |
+
artifacts_dir = "/mnt/artifacts-datai/models/project_monad"
|
| 90 |
+
max_cache_size_gb = 50
|
| 91 |
+
run_retention_max_runs = 25
|
| 92 |
+
run_retention_max_age_days = 30
|
| 93 |
+
|
| 94 |
+
[features]
|
| 95 |
+
enable_rest_api = true
|
| 96 |
+
enable_grpc = false
|
| 97 |
+
enable_websocket = false
|
| 98 |
+
enable_batch_processing = true
|
| 99 |
+
enable_stream_processing = false
|
configs/paper_mvp.toml
ADDED
|
@@ -0,0 +1,96 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
[model]
|
| 2 |
+
backend = "sam_xmem"
|
| 3 |
+
device = "auto"
|
| 4 |
+
cache_dir = ".cache/monad"
|
| 5 |
+
|
| 6 |
+
[segmentation]
|
| 7 |
+
mask_confidence_threshold = 0.90
|
| 8 |
+
binary_threshold = 0.50
|
| 9 |
+
quality_threshold = 0.85
|
| 10 |
+
stability_threshold = 0.80
|
| 11 |
+
default_point_radius = 24
|
| 12 |
+
return_single_mask = true
|
| 13 |
+
|
| 14 |
+
[sam]
|
| 15 |
+
model_type = "vit_h"
|
| 16 |
+
checkpoint = "data/weights/sam_vit_h_4b8939.pth"
|
| 17 |
+
|
| 18 |
+
[xmem]
|
| 19 |
+
checkpoint = "data/weights/XMem.pth"
|
| 20 |
+
buffer_size = 50
|
| 21 |
+
num_objects = 1
|
| 22 |
+
max_mid_term_frames = 10
|
| 23 |
+
min_mid_term_frames = 5
|
| 24 |
+
max_long_term_elements = 100000
|
| 25 |
+
num_prototypes = 128
|
| 26 |
+
top_k = 30
|
| 27 |
+
mem_every = 10
|
| 28 |
+
deep_update_every = -1
|
| 29 |
+
enable_long_term = true
|
| 30 |
+
enable_long_term_count_usage = true
|
| 31 |
+
size = -1
|
| 32 |
+
key_dim = 64
|
| 33 |
+
value_dim = 512
|
| 34 |
+
hidden_dim = 64
|
| 35 |
+
single_object = false
|
| 36 |
+
|
| 37 |
+
[mlx]
|
| 38 |
+
enabled = false
|
| 39 |
+
mode = "segment_anything"
|
| 40 |
+
examples_dir = ".cache/monad/mlx-examples"
|
| 41 |
+
sam_model = "vit_b"
|
| 42 |
+
checkpoint = "data/weights/sam_vit_b_01ec64.pth"
|
| 43 |
+
max_image_side = 1024
|
| 44 |
+
prefer_unified_memory = true
|
| 45 |
+
|
| 46 |
+
[tracking]
|
| 47 |
+
tracker_type = "single_object"
|
| 48 |
+
max_tracked_objects = 8
|
| 49 |
+
tracking_memory_frames = 30
|
| 50 |
+
confidence_decay = 0.98
|
| 51 |
+
|
| 52 |
+
[inference]
|
| 53 |
+
batch_size = 1
|
| 54 |
+
max_batch_size = 4
|
| 55 |
+
timeout_segmentation = 30.0
|
| 56 |
+
timeout_tracking = 60.0
|
| 57 |
+
|
| 58 |
+
[logging]
|
| 59 |
+
level = "INFO"
|
| 60 |
+
format = "json"
|
| 61 |
+
structured = true
|
| 62 |
+
|
| 63 |
+
[metrics]
|
| 64 |
+
enable_metrics = true
|
| 65 |
+
metrics_port = 9090
|
| 66 |
+
track_inference_time = true
|
| 67 |
+
track_memory_usage = true
|
| 68 |
+
|
| 69 |
+
[api]
|
| 70 |
+
host = "0.0.0.0"
|
| 71 |
+
port = 8000
|
| 72 |
+
workers = 1
|
| 73 |
+
reload = false
|
| 74 |
+
cors_origins = ["http://localhost:3000", "http://localhost:8000"]
|
| 75 |
+
cors_allow_credentials = false
|
| 76 |
+
cors_allow_methods = ["GET", "POST", "DELETE", "OPTIONS"]
|
| 77 |
+
cors_allow_headers = ["Content-Type", "Authorization"]
|
| 78 |
+
|
| 79 |
+
[grpc]
|
| 80 |
+
enabled = false
|
| 81 |
+
port = 50051
|
| 82 |
+
|
| 83 |
+
[storage]
|
| 84 |
+
models_cache_dir = ".cache/monad/models"
|
| 85 |
+
temp_dir = "/tmp/monad"
|
| 86 |
+
artifacts_dir = "artifacts"
|
| 87 |
+
max_cache_size_gb = 50
|
| 88 |
+
run_retention_max_runs = 25
|
| 89 |
+
run_retention_max_age_days = 30
|
| 90 |
+
|
| 91 |
+
[features]
|
| 92 |
+
enable_rest_api = true
|
| 93 |
+
enable_grpc = false
|
| 94 |
+
enable_websocket = false
|
| 95 |
+
enable_batch_processing = true
|
| 96 |
+
enable_stream_processing = false
|
frozen_models_manifest.json
ADDED
|
@@ -0,0 +1,44 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
{
|
| 2 |
+
"timestamp": "2026-03-21T09:07:38",
|
| 3 |
+
"gpu": "NVIDIA L4",
|
| 4 |
+
"torch_version": "2.6.0+cu124",
|
| 5 |
+
"cuda_version": "12.4",
|
| 6 |
+
"exports": {
|
| 7 |
+
"sam_vit_h_encoder": {
|
| 8 |
+
"path": "/mnt/artifacts-datai/exports/project_monad/sam_vit_h_encoder.pt",
|
| 9 |
+
"size_mb": 2430.8,
|
| 10 |
+
"format": "torchscript",
|
| 11 |
+
"status": "ok"
|
| 12 |
+
},
|
| 13 |
+
"sam_vit_h_state_dict": {
|
| 14 |
+
"path": "/mnt/artifacts-datai/exports/project_monad/sam_vit_h_state_dict.pt",
|
| 15 |
+
"size_mb": 2445.8,
|
| 16 |
+
"format": "state_dict",
|
| 17 |
+
"status": "ok"
|
| 18 |
+
},
|
| 19 |
+
"sam_vit_b_encoder": {
|
| 20 |
+
"path": "/mnt/artifacts-datai/exports/project_monad/sam_vit_b_encoder.pt",
|
| 21 |
+
"size_mb": 342.3,
|
| 22 |
+
"format": "torchscript",
|
| 23 |
+
"status": "ok"
|
| 24 |
+
},
|
| 25 |
+
"sam_vit_b_state_dict": {
|
| 26 |
+
"path": "/mnt/artifacts-datai/exports/project_monad/sam_vit_b_state_dict.pt",
|
| 27 |
+
"size_mb": 357.7,
|
| 28 |
+
"format": "state_dict",
|
| 29 |
+
"status": "ok"
|
| 30 |
+
},
|
| 31 |
+
"xmem_state_dict": {
|
| 32 |
+
"path": "/mnt/artifacts-datai/exports/project_monad/xmem_state_dict.pt",
|
| 33 |
+
"size_mb": 237.5,
|
| 34 |
+
"format": "state_dict",
|
| 35 |
+
"status": "ok"
|
| 36 |
+
},
|
| 37 |
+
"xmem_frozen": {
|
| 38 |
+
"path": "/mnt/artifacts-datai/exports/project_monad/xmem_frozen.pt",
|
| 39 |
+
"size_mb": 237.5,
|
| 40 |
+
"format": "frozen_bundle",
|
| 41 |
+
"status": "ok"
|
| 42 |
+
}
|
| 43 |
+
}
|
| 44 |
+
}
|
inference_runs/crossing_objects/masks/mask_0000.png
ADDED
|
inference_runs/crossing_objects/masks/mask_0001.png
ADDED
|
inference_runs/crossing_objects/masks/mask_0002.png
ADDED
|
inference_runs/crossing_objects/masks/mask_0003.png
ADDED
|
inference_runs/crossing_objects/masks/mask_0004.png
ADDED
|
inference_runs/crossing_objects/masks/mask_0005.png
ADDED
|
inference_runs/crossing_objects/masks/mask_0006.png
ADDED
|
inference_runs/crossing_objects/masks/mask_0007.png
ADDED
|
inference_runs/crossing_objects/masks/mask_0008.png
ADDED
|
inference_runs/crossing_objects/masks/mask_0009.png
ADDED
|
inference_runs/crossing_objects/masks/mask_0010.png
ADDED
|
inference_runs/crossing_objects/masks/mask_0011.png
ADDED
|
inference_runs/crossing_objects/masks/mask_0012.png
ADDED
|
inference_runs/crossing_objects/masks/mask_0013.png
ADDED
|
inference_runs/crossing_objects/masks/mask_0014.png
ADDED
|
inference_runs/crossing_objects/masks/mask_0015.png
ADDED
|
inference_runs/crossing_objects/masks/mask_0016.png
ADDED
|
inference_runs/crossing_objects/masks/mask_0017.png
ADDED
|
inference_runs/crossing_objects/masks/mask_0018.png
ADDED
|
inference_runs/crossing_objects/masks/mask_0019.png
ADDED
|
inference_runs/crossing_objects/masks/mask_0020.png
ADDED
|
inference_runs/crossing_objects/masks/mask_0021.png
ADDED
|
inference_runs/crossing_objects/masks/mask_0022.png
ADDED
|
inference_runs/crossing_objects/masks/mask_0023.png
ADDED
|
inference_runs/crossing_objects/metrics.json
ADDED
|
@@ -0,0 +1,17 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
{
|
| 2 |
+
"backend": "sam_xmem",
|
| 3 |
+
"requested_device": "cuda",
|
| 4 |
+
"resolved_device": "cuda",
|
| 5 |
+
"backend_ready": true,
|
| 6 |
+
"backend_notes": [
|
| 7 |
+
"Research backend available"
|
| 8 |
+
],
|
| 9 |
+
"frames_processed": 24,
|
| 10 |
+
"latency_ms": {
|
| 11 |
+
"mean": 84.62345065214797,
|
| 12 |
+
"max": 125.68026799999643
|
| 13 |
+
},
|
| 14 |
+
"tracker_id": "trk_b03b9f3248",
|
| 15 |
+
"object_id": "obj_ac9a2a710a",
|
| 16 |
+
"frame_limit": 24
|
| 17 |
+
}
|
inference_runs/crossing_objects/overlay.mp4
ADDED
|
Binary file (33.1 kB). View file
|
|
|
logs/gpu_validation_results.json
ADDED
|
@@ -0,0 +1,118 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
{
|
| 2 |
+
"timestamp": "2026-03-21T08:44:14",
|
| 3 |
+
"cuda_devices": 2,
|
| 4 |
+
"gpu_names": [
|
| 5 |
+
"NVIDIA L4",
|
| 6 |
+
"NVIDIA L4"
|
| 7 |
+
],
|
| 8 |
+
"model_loading": {
|
| 9 |
+
"vram_before": {
|
| 10 |
+
"gpu_0": {
|
| 11 |
+
"allocated_mb": 0.0,
|
| 12 |
+
"reserved_mb": 0.0,
|
| 13 |
+
"total_mb": 22563.1,
|
| 14 |
+
"utilization_pct": 0.0
|
| 15 |
+
},
|
| 16 |
+
"gpu_1": {
|
| 17 |
+
"allocated_mb": 0.0,
|
| 18 |
+
"reserved_mb": 0.0,
|
| 19 |
+
"total_mb": 22563.1,
|
| 20 |
+
"utilization_pct": 0.0
|
| 21 |
+
}
|
| 22 |
+
},
|
| 23 |
+
"vram_after": {
|
| 24 |
+
"gpu_0": {
|
| 25 |
+
"allocated_mb": 2695.1,
|
| 26 |
+
"reserved_mb": 2732.0,
|
| 27 |
+
"total_mb": 22563.1,
|
| 28 |
+
"utilization_pct": 11.9
|
| 29 |
+
},
|
| 30 |
+
"gpu_1": {
|
| 31 |
+
"allocated_mb": 0.0,
|
| 32 |
+
"reserved_mb": 0.0,
|
| 33 |
+
"total_mb": 22563.1,
|
| 34 |
+
"utilization_pct": 0.0
|
| 35 |
+
}
|
| 36 |
+
},
|
| 37 |
+
"load_time_s": 12.648,
|
| 38 |
+
"total_init_time_s": 12.755,
|
| 39 |
+
"backend": "sam_xmem",
|
| 40 |
+
"device_requested": "cuda",
|
| 41 |
+
"device_resolved": "cuda",
|
| 42 |
+
"sam_model_type": "vit_h",
|
| 43 |
+
"sam_checkpoint": "/mnt/forge-data/models/sam_vit_h_4b8939.pth",
|
| 44 |
+
"xmem_checkpoint": "/mnt/forge-data/models/XMem.pth"
|
| 45 |
+
},
|
| 46 |
+
"inference_benchmarks": [
|
| 47 |
+
{
|
| 48 |
+
"video": "crossing_objects.mp4",
|
| 49 |
+
"resolution": "640x480",
|
| 50 |
+
"video_fps": 24.0,
|
| 51 |
+
"total_video_frames": 36,
|
| 52 |
+
"frames_processed": 24,
|
| 53 |
+
"max_frames_limit": 24,
|
| 54 |
+
"latency_mean_ms": 84.62,
|
| 55 |
+
"latency_max_ms": 125.68,
|
| 56 |
+
"total_time_s": 3.479,
|
| 57 |
+
"peak_vram_mb": 18250.4,
|
| 58 |
+
"throughput_fps": 6.9,
|
| 59 |
+
"backend": "sam_xmem",
|
| 60 |
+
"device": "cuda",
|
| 61 |
+
"tracker_id": "trk_b03b9f3248",
|
| 62 |
+
"object_id": "obj_ac9a2a710a"
|
| 63 |
+
},
|
| 64 |
+
{
|
| 65 |
+
"video": "moving_circle.mp4",
|
| 66 |
+
"resolution": "640x480",
|
| 67 |
+
"video_fps": 24.0,
|
| 68 |
+
"total_video_frames": 48,
|
| 69 |
+
"frames_processed": 24,
|
| 70 |
+
"max_frames_limit": 24,
|
| 71 |
+
"latency_mean_ms": 82.75,
|
| 72 |
+
"latency_max_ms": 103.4,
|
| 73 |
+
"total_time_s": 3.31,
|
| 74 |
+
"peak_vram_mb": 18250.4,
|
| 75 |
+
"throughput_fps": 7.25,
|
| 76 |
+
"backend": "sam_xmem",
|
| 77 |
+
"device": "cuda",
|
| 78 |
+
"tracker_id": "trk_6c1a8b0c67",
|
| 79 |
+
"object_id": "obj_1bb04c1f69"
|
| 80 |
+
},
|
| 81 |
+
{
|
| 82 |
+
"video": "static_scene.mp4",
|
| 83 |
+
"resolution": "640x480",
|
| 84 |
+
"video_fps": 24.0,
|
| 85 |
+
"total_video_frames": 12,
|
| 86 |
+
"frames_processed": 12,
|
| 87 |
+
"max_frames_limit": 24,
|
| 88 |
+
"latency_mean_ms": 73.22,
|
| 89 |
+
"latency_max_ms": 94.14,
|
| 90 |
+
"total_time_s": 1.994,
|
| 91 |
+
"peak_vram_mb": 10417.9,
|
| 92 |
+
"throughput_fps": 6.02,
|
| 93 |
+
"backend": "sam_xmem",
|
| 94 |
+
"device": "cuda",
|
| 95 |
+
"tracker_id": "trk_fb592fd895",
|
| 96 |
+
"object_id": "obj_527b82ec72"
|
| 97 |
+
},
|
| 98 |
+
{
|
| 99 |
+
"video": "synthetic_stress_test.mp4",
|
| 100 |
+
"error": "OOM at 48 frames: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.03 GiB of which 5.",
|
| 101 |
+
"note": "SAM vit_h + XMem exceeds 22GB L4 at this frame count"
|
| 102 |
+
}
|
| 103 |
+
],
|
| 104 |
+
"final_vram": {
|
| 105 |
+
"gpu_0": {
|
| 106 |
+
"allocated_mb": 22248.9,
|
| 107 |
+
"reserved_mb": 22318.0,
|
| 108 |
+
"total_mb": 22563.1,
|
| 109 |
+
"utilization_pct": 98.6
|
| 110 |
+
},
|
| 111 |
+
"gpu_1": {
|
| 112 |
+
"allocated_mb": 0.0,
|
| 113 |
+
"reserved_mb": 0.0,
|
| 114 |
+
"total_mb": 22563.1,
|
| 115 |
+
"utilization_pct": 0.0
|
| 116 |
+
}
|
| 117 |
+
}
|
| 118 |
+
}
|
pytorch/sam_vit_b_encoder.pt
ADDED
|
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:3dd0a2dddd7ba744d9638035feaf065cd85f02855bb1a09e394610cf7f08d756
|
| 3 |
+
size 358970620
|
pytorch/sam_vit_b_v1.safetensors
ADDED
|
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:b2d6ca146991a3484ecca745be30df7e142aa136296b9fbb740951b09aa85ab9
|
| 3 |
+
size 374979208
|
pytorch/sam_vit_h_encoder.pt
ADDED
|
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:1389261ae8fbc6c6770017d3561aa6666985701a8d412333d93a4228b7ede683
|
| 3 |
+
size 2548843796
|
pytorch/sam_vit_h_v1.safetensors
ADDED
|
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:298e1126d179717313e8781b55c7d044f60f0ee40d8316d5c3438ac835346070
|
| 3 |
+
size 2564431552
|
pytorch/xmem_frozen.pt
ADDED
|
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:11630d50f4a0058d4e62905e0f461b4b072871e38edc839ced1f47a89a935592
|
| 3 |
+
size 249028922
|
pytorch/xmem_v1.safetensors
ADDED
|
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:76faa1d6fe3a4726a7d4f5e81d8e6e7e88a8669cc5df524f52f540232b6f0bb5
|
| 3 |
+
size 248924912
|