Miroslav Purkrabek commited on
Commit
322535b
·
1 Parent(s): e5057aa

add BMPv2 code

Browse files
This view is limited to 50 files because it contains too many changes.   See raw diff
Files changed (50) hide show
  1. BMPv2_README.md +311 -0
  2. SAM3D_INTEGRATION.md +302 -0
  3. app.py +188 -172
  4. bboxmaskpose/__init__.py +10 -0
  5. bboxmaskpose/api.py +515 -0
  6. {configs → bboxmaskpose/configs}/README.md +0 -0
  7. {configs → bboxmaskpose/configs}/bmp_D3.yaml +9 -2
  8. {configs → bboxmaskpose/configs}/bmp_J1.yaml +5 -0
  9. bboxmaskpose/configs/bmp_v2.yaml +34 -0
  10. {demo → bboxmaskpose}/demo_utils.py +30 -110
  11. {demo → bboxmaskpose}/posevis_lite.py +12 -12
  12. {sam2 → bboxmaskpose/sam2}/__init__.py +1 -1
  13. {sam2 → bboxmaskpose/sam2}/automatic_mask_generator.py +20 -50
  14. {sam2 → bboxmaskpose/sam2}/benchmark.py +3 -9
  15. {sam2 → bboxmaskpose/sam2}/build_sam.py +34 -9
  16. {sam2 → bboxmaskpose/sam2}/colorblind.py +8 -16
  17. bboxmaskpose/sam2/configs/sam-pose2seg/sam-pose2seg_hiera_b+.yaml +118 -0
  18. {sam2 → bboxmaskpose/sam2}/configs/sam2.1/sam2.1_hiera_b+.yaml +14 -14
  19. {sam2 → bboxmaskpose/sam2}/configs/sam2.1/sam2.1_hiera_l.yaml +14 -14
  20. {sam2 → bboxmaskpose/sam2}/configs/sam2.1/sam2.1_hiera_s.yaml +14 -14
  21. {sam2 → bboxmaskpose/sam2}/configs/sam2.1/sam2.1_hiera_t.yaml +14 -14
  22. bboxmaskpose/sam2/configs/sam2.1_training/sam2.1_hiera_b+_COCO+CIHP_finetune_sam-pose2seg.yaml +343 -0
  23. {sam2 → bboxmaskpose/sam2}/configs/sam2.1_training/sam2.1_hiera_b+_COCO_1024_prompt.yaml +15 -23
  24. {sam2 → bboxmaskpose/sam2}/configs/sam2.1_training/sam2.1_hiera_b+_COCO_finetune.yaml +15 -24
  25. {sam2 → bboxmaskpose/sam2}/configs/sam2.1_training/sam2.1_hiera_b+_COCO_finetune_prompt+decoder.yaml +15 -24
  26. {sam2 → bboxmaskpose/sam2}/configs/sam2.1_training/sam2.1_hiera_b+_MOSE_finetune.yaml +15 -21
  27. {sam2 → bboxmaskpose/sam2}/configs/sam2/sam2_hiera_b+.yaml +14 -14
  28. {sam2 → bboxmaskpose/sam2}/configs/sam2/sam2_hiera_l.yaml +14 -14
  29. {sam2 → bboxmaskpose/sam2}/configs/sam2/sam2_hiera_s.yaml +14 -14
  30. {sam2 → bboxmaskpose/sam2}/configs/sam2/sam2_hiera_t.yaml +14 -14
  31. {sam2 → bboxmaskpose/sam2}/csrc/connected_components.cu +0 -0
  32. {sam2 → bboxmaskpose/sam2}/distinctipy.py +7 -14
  33. {sam2 → bboxmaskpose/sam2}/modeling/__init__.py +0 -0
  34. {sam2 → bboxmaskpose/sam2}/modeling/backbones/__init__.py +0 -0
  35. {sam2 → bboxmaskpose/sam2}/modeling/backbones/hieradet.py +10 -31
  36. {sam2 → bboxmaskpose/sam2}/modeling/backbones/image_encoder.py +1 -3
  37. {sam2 → bboxmaskpose/sam2}/modeling/backbones/utils.py +2 -6
  38. {sam2 → bboxmaskpose/sam2}/modeling/memory_attention.py +4 -7
  39. {sam2 → bboxmaskpose/sam2}/modeling/memory_encoder.py +3 -9
  40. {sam2 → bboxmaskpose/sam2}/modeling/position_encoding.py +8 -31
  41. {sam2 → bboxmaskpose/sam2}/modeling/sam/__init__.py +0 -0
  42. {sam2 → bboxmaskpose/sam2}/modeling/sam/mask_decoder.py +11 -32
  43. {sam2 → bboxmaskpose/sam2}/modeling/sam/pose_encoder.py +7 -19
  44. {sam2 → bboxmaskpose/sam2}/modeling/sam/prompt_encoder.py +21 -26
  45. {sam2 → bboxmaskpose/sam2}/modeling/sam/transformer.py +12 -30
  46. {sam2 → bboxmaskpose/sam2}/modeling/sam2_base.py +72 -246
  47. {sam2 → bboxmaskpose/sam2}/modeling/sam2_base_pose.py +45 -87
  48. {sam2 → bboxmaskpose/sam2}/modeling/sam2_utils.py +5 -13
  49. {sam2 → bboxmaskpose/sam2}/sam2_image_predictor.py +32 -81
  50. {sam2 → bboxmaskpose/sam2}/sam2_video_predictor.py +35 -298
BMPv2_README.md ADDED
@@ -0,0 +1,311 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ </h1><div id="toc">
2
+ <ul align="center" style="list-style: none; padding: 0; margin: 0;">
3
+ <summary>
4
+ <h1 style="margin-bottom: 0.0em;">
5
+ BBoxMaskPose v2
6
+ </h1>
7
+ </summary>
8
+ </ul>
9
+ </div>
10
+ </h1><div id="toc">
11
+ <ul align="center" style="list-style: none; padding: 0; margin: 0;">
12
+ <summary>
13
+ <h2 style="margin-bottom: 0.2em;">
14
+ CVPR 2025 + ICCV 2025
15
+ </h2>
16
+ </summary>
17
+ </ul>
18
+ </div>
19
+
20
+ <div align="center">
21
+ <img src="data/assets/BMP_043+076+174.gif" alt="BBoxMaskPose v2 loop" height="500px">
22
+
23
+ [![Website](https://img.shields.io/badge/Website-BBoxMaskPose-green)](https://mirapurkrabek.github.io/BBox-Mask-Pose/) &nbsp;&nbsp;&nbsp;
24
+ [![License](https://img.shields.io/badge/License-GPL%203.0-orange.svg)](LICENSE) &nbsp;&nbsp;&nbsp;
25
+ [![Video](https://img.shields.io/badge/Video-YouTube-red?logo=youtube)](https://youtu.be/U05yUP4b2LQ)
26
+
27
+ [![Paper](https://img.shields.io/badge/ProbPose-CVPR%202025-blue)](https://arxiv.org/abs/2412.02254) &nbsp;&nbsp;&nbsp;
28
+ [![Paper](https://img.shields.io/badge/BMPv1-ICCV%202025-blue)](https://arxiv.org/abs/2412.01562) &nbsp;&nbsp;&nbsp;
29
+ [![Paper](https://img.shields.io/badge/SAMpose2seg-CVWW%202026-blue)](https://arxiv.org/abs/2601.08982) &nbsp;&nbsp;&nbsp;
30
+ [![Paper](https://img.shields.io/badge/BMPv2-arXiv-blue)](https://arxiv.org/abs/2601.15200) &nbsp;&nbsp;&nbsp;
31
+
32
+
33
+
34
+ <!-- Papers with code:
35
+ [![2D Pose AP on OCHuman: 42.5](https://img.shields.io/badge/OCHuman-2D_Pose:_49.2_AP-blue)](https://paperswithcode.com/sota/2d-human-pose-estimation-on-ochuman?p=detection-pose-estimation-and-segmentation-1) &nbsp;&nbsp;
36
+ [![Human Instance Segmentation AP on OCHuman: 34.0](https://img.shields.io/badge/OCHuman-Human_Instance_Segmentation:_34.0_AP-blue)](https://paperswithcode.com/sota/human-instance-segmentation-on-ochuman?p=detection-pose-estimation-and-segmentation-1) -->
37
+
38
+ </div>
39
+
40
+ > [!CAUTION]
41
+ > This branch is a **work in progress**!
42
+ >
43
+ > Until merged with <code>main</code>, use at your own discretion. For the stable version, please refer to the <code>main</code> branch with BMPv1.
44
+
45
+ ## 📢 News
46
+
47
+ - **Feb 2026**: Version 2.0 with improved (1) pose and (2) SAM and (3) wiring to 3D prediction released.
48
+ - **Feb 2026**: SAM-pose2seg won a Best Paper Award at CVWW 2026 🎉
49
+ - **Jan 2026**: [BMPv2 paper](https://arxiv.org/abs/2601.15200) is available on arXiv
50
+ - **Aug 2025**: [HuggingFace Image Demo](https://huggingface.co/spaces/purkrmir/BBoxMaskPose-demo) is out! 🎮
51
+ - **Jul 2025**: Version 1.1 with easy-to-run image demo released
52
+ - **Jun 2025**: BMPv1 paper accepted to ICCV 2025! 🎉
53
+ - **Dec 2024**: BMPv1 code is available
54
+ - **Nov 2024**: The [project website](https://MiraPurkrabek.github.io/BBox-Mask-Pose) is online
55
+
56
+ ## 📑 Table of Contents
57
+
58
+ - [Installation](#-installation)
59
+ - [Demo](#-demo)
60
+ - [API Examples](#api-examples)
61
+ - [Pre-trained Models](#-pre-trained-models)
62
+ - [Acknowledgments](#-acknowledgments)
63
+ - [Citation](#-citation)
64
+
65
+
66
+ ## 📋 Project Overview
67
+
68
+ Bounding boxes, masks, and poses capture complementary aspects of the human body. BBoxMaskPose links detection, segmentation, and pose estimation iteratively, where each prediction refines the others. PMPose combines probabilistic modeling with mask conditioning for robust pose estimation in crowds. Together, these components achieve state-of-the-art results on COCO and OCHuman, being the first method to exceed 50 AP on OCHuman.
69
+
70
+
71
+ ### Repository Structure
72
+
73
+ The repository is organized into two main packages with stable public APIs:
74
+
75
+ ```
76
+ BBoxMaskPose/
77
+ ├── pmpose/ # PMPose package (pose estimation)
78
+ │ └── pmpose/
79
+ │ ├── api.py # PUBLIC API: PMPose class
80
+ │ ├── mm_utils.py # Internal utilities
81
+ │ └── posevis_lite.py # Visualization
82
+ ├── mmpose/ # MMPose fork with our edits
83
+ ├── bboxmaskpose/ # BBoxMaskPose package (full pipeline)
84
+ │ └── bboxmaskpose/
85
+ │ ├── api.py # PUBLIC API: BBoxMaskPose class
86
+ │ ├── sam2/ # SAM2 implementation
87
+ │ ├── configs/ # BMP configurations
88
+ │ └── *_utils.py # Internal utilities
89
+ ├── demos/ # Public API demos
90
+ │ ├── PMPose_demo.py # PMPose usage example
91
+ │ ├── BMP_demo.py # BBoxMaskPose usage example
92
+ │ └── quickstart.ipynb # Interactive notebook
93
+ └── demo/ # Legacy demo (still functional)
94
+ ```
95
+
96
+ Key contributions:
97
+ 1. **MaskPose**: a pose estimation model conditioned by segmentation masks instead of bounding boxes, boosting performance in dense scenes without adding parameters
98
+ - Download pre-trained weights below
99
+ 2. **BBox-MaskPose (BMP)**: method linking bounding boxes, segmentation masks, and poses to simultaneously address multi-body detection, segmentation and pose estimation
100
+ - Try the demo!
101
+ 3. Fine-tuned RTMDet adapted for iterative detection (ignoring 'holes')
102
+ - Download pre-trained weights below
103
+ 4. Support for multi-dataset training of ViTPose, previously implemented in the official ViTPose repository but absent in MMPose.
104
+
105
+ For more details, please visit our [project website](https://mirapurkrabek.github.io/BBox-Mask-Pose/).
106
+
107
+
108
+
109
+ ## 🚀 Installation
110
+
111
+ ### Docker Installation (Recommended)
112
+
113
+ The fastest way to get started with GPU support:
114
+
115
+ ```bash
116
+ # Clone and build
117
+ git clone https://github.com/mirapurkrabek/BBoxMaskPose.git
118
+ cd BBoxMaskPose
119
+ docker-compose build
120
+
121
+ # Run the demo
122
+ docker-compose up
123
+ ```
124
+
125
+ Requires: Docker Engine 19.03+, [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html), NVIDIA GPU with CUDA 12.1 support.
126
+
127
+ ### Manual Installation
128
+
129
+ This project is built on top of [MMPose](https://github.com/open-mmlab/mmpose) and [SAM 2.1](https://github.com/facebookresearch/sam2).
130
+ Please refer to the [MMPose installation guide](https://mmpose.readthedocs.io/en/latest/installation.html) or [SAM installation guide](https://github.com/facebookresearch/sam2/blob/main/INSTALL.md) for detailed setup instructions.
131
+
132
+ Basic installation steps:
133
+ ```bash
134
+ # Clone the repository
135
+ git clone https://github.com/mirapurkrabek/BBoxMaskPose.git BBoxMaskPose/
136
+ cd BBoxMaskPose
137
+
138
+ # Install your version of torch, torchvision, OpenCV and NumPy
139
+ pip install torch==2.1.2+cu121 torchvision==0.16.2+cu121 --extra-index-url https://download.pytorch.org/whl/cu121
140
+ pip install numpy==1.25.1 opencv-python==4.9.0.80
141
+
142
+ # Install MMLibrary
143
+ pip install -U openmim
144
+ mim install mmengine "mmcv==2.1.0" "mmdet==3.3.0" "mmpretrain==1.2.0"
145
+
146
+ # Install dependencies
147
+ pip install -r requirements.txt
148
+ pip install -e .
149
+ ```
150
+
151
+ ## 🎮 Demo
152
+
153
+ #### PMPose Demo (Pose Estimation Only)
154
+ ```bash
155
+ python demos/PMPose_demo.py --image data/004806.jpg --device cuda
156
+ ```
157
+
158
+ #### BBoxMaskPose Demo (Full Pipeline)
159
+ ```bash
160
+ python demos/BMP_demo.py --image data/004806.jpg --device cuda
161
+ ```
162
+
163
+ After running the demo, outputs are in `outputs/004806/`. The expected output should look like this:
164
+ <div align="center">
165
+ <a href="data/assets/004806_mask.jpg" target="_blank">
166
+ <img src="data/assets/004806_mask.jpg" alt="Detection results" width="200" />
167
+ </a>
168
+ &nbsp;&nbsp;&nbsp;&nbsp;
169
+ <a href="data/assets/004806_pose.jpg" target="_blank">
170
+ <img src="data/assets/004806_pose.jpg" alt="Pose results" width="200" style="margin-right:10px;" />
171
+ </a>
172
+ </div>
173
+
174
+ #### BBoxMaskPose v2 Demo (Full Pipeline + 3D Mesh Recovery)
175
+ This demo extends BMP with [SAM-3D-Body](https://github.com/facebookresearch/sam-3d-body) for 3D human mesh recovery:
176
+ ```bash
177
+ # Basic usage (auto-downloads checkpoint from HuggingFace)
178
+ python demos/BMPv2_demo.py --image data/004806.jpg --device cuda
179
+
180
+ # With local checkpoint
181
+ python demos/BMPv2_demo.py --image data/004806.jpg --device cuda \
182
+ --sam3d_checkpoint checkpoints/sam-3d-body-dinov3/model.ckpt \
183
+ --mhr_path checkpoints/sam-3d-body-dinov3/assets/mhr_model.pt
184
+ ```
185
+
186
+ **SAM-3D-Body Installation (Optional):**
187
+ BMPv2 requires SAM-3D-Body for 3D mesh recovery. Install it separately:
188
+ ```bash
189
+ # 1. Install dependencies
190
+ pip install -r requirements/sam3d.txt
191
+
192
+ # 2. Install detectron2
193
+ pip install 'git+https://github.com/facebookresearch/detectron2.git@a1ce2f9' --no-build-isolation --no-deps
194
+
195
+ # 3. Install MoGe (optional, for FOV estimation)
196
+ pip install git+https://github.com/microsoft/MoGe.git
197
+
198
+ # 4. Install adapted SAM-3D-Body repository
199
+ pip install git+https://github.com/MiraPurkrabek/sam-3d-body.git
200
+
201
+ # 5. Request access to checkpoints at https://huggingface.co/facebook/sam-3d-body-dinov3
202
+ ```
203
+
204
+ For more details, see [SAM-3D-Body installation guide](https://github.com/facebookresearch/sam-3d-body/blob/main/INSTALL.md).
205
+
206
+ #### Jupyter Notebook
207
+ Interactive demo with both PMPose and BBoxMaskPose:
208
+ ```bash
209
+ jupyter notebook demos/quickstart.ipynb
210
+ ```
211
+
212
+ ## API Examples
213
+
214
+ **PMPose API** - Pose estimation with bounding boxes:
215
+ ```python
216
+ from pmpose import PMPose
217
+
218
+ # Initialize model
219
+ pose_model = PMPose(device="cuda", from_pretrained=True)
220
+
221
+ # Run inference
222
+ keypoints, presence, visibility, heatmaps = pose_model.predict(
223
+ image="demo/data/004806.jpg",
224
+ bboxes=[[100, 100, 300, 400]], # [x1, y1, x2, y2]
225
+ )
226
+
227
+ # Visualize
228
+ vis_img = pose_model.visualize(image="demo/data/004806.jpg", keypoints=keypoints)
229
+ ```
230
+
231
+ **BBoxMaskPose API** - Full detection + pose + segmentation:
232
+
233
+ ```python
234
+ from pmpose import PMPose
235
+ from bboxmaskpose import BBoxMaskPose
236
+
237
+ # Create pose model
238
+ pose_model = PMPose(device="cuda", from_pretrained=True)
239
+
240
+ # Inject into BMP
241
+ bmp_model = BBoxMaskPose(config="BMP_D3", device="cuda", pose_model=pose_model)
242
+ result = bmp_model.predict(image="demo/data/004806.jpg")
243
+
244
+ # Visualize
245
+ vis_img = bmp_model.visualize(image="demo/data/004806.jpg", result=result)
246
+ ```
247
+
248
+
249
+ ## 📦 Pre-trained Models
250
+
251
+ Pre-trained models are available on [VRG Hugging Face 🤗](https://huggingface.co/vrg-prague/BBoxMaskPose/).
252
+ To run the demo, you only need to download SAM weights with the [enclosed script](models/SAM/download_ckpts.sh).
253
+ Our detector and pose estimator will be downloaded during the runtime.
254
+
255
+ If you want to download our weights yourself, here are the links to our HuggingFace:
256
+ - ViTPose-b trained on COCO+MPII+AIC -- [download weights](https://huggingface.co/vrg-prague/BBoxMaskPose/resolve/main/ViTPose-b-multi_mmpose20.pth)
257
+ - MaskPose-b -- [download weights](https://huggingface.co/vrg-prague/BBoxMaskPose/resolve/main/MaskPose-b.pth)
258
+ - Fine-tuned RTMDet-L -- [download weights](https://huggingface.co/vrg-prague/BBoxMaskPose/resolve/main/rtmdet-ins-l-mask.pth)
259
+
260
+ ## 🙏 Acknowledgments
261
+
262
+ The code combines [MMDetection](https://github.com/open-mmlab/mmdetection), [MMPose 2.0](https://github.com/open-mmlab/mmpose), [ViTPose](https://github.com/ViTAE-Transformer/ViTPose), [SAM 2.1](https://github.com/facebookresearch/sam2) and [SAM-3D-Body](https://github.com/facebookresearch/sam-3d-body).
263
+
264
+ Our visualizations integrate [Distinctipy](https://github.com/alan-turing-institute/distinctipy) for automatic color selection.
265
+
266
+ This repository combines our work on BBoxMaskPose project with our previous work on [probabilistic 2D human pose estimation modelling](https://mirapurkrabek.github.io/ProbPose/).
267
+
268
+ ## 📝 Citation
269
+
270
+ The code was implemented by [Miroslav Purkrábek](https://mirapurkrabek.github.io/) and Constantin Kolomiiets.
271
+ If you use this work, kindly cite it using the references provided below.
272
+
273
+ For questions, please use the Issues or Discussions.
274
+
275
+ ```
276
+ @InProceedings{Purkrabek2025BMPv1,
277
+ author = {Purkrabek, Miroslav and Matas, Jiri},
278
+ title = {Detection, Pose Estimation and Segmentation for Multiple Bodies: Closing the Virtuous Circle},
279
+ booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
280
+ month = {October},
281
+ year = {2025},
282
+ pages = {9004-9013}
283
+ }
284
+ ```
285
+
286
+ ```
287
+ @InProceedings{Purkrabek2026BMPv2,
288
+ author = {Purkrabek, Miroslav and Kolomiiets, Constantin and Matas, Jiri},
289
+ title = {BBoxMaskPose v2: Expanding Mutual Conditioning to 3D},
290
+ booktitle = {arXiv preprint arXiv:2601.15200},
291
+ year = {2026}
292
+ }
293
+ ```
294
+
295
+ ```
296
+ @article{yang2025sam3dbody,
297
+ title={SAM 3D Body: Robust Full-Body Human Mesh Recovery},
298
+ author={Yang, Xitong and Kukreja, Devansh and Pinkus, Don and Sagar, Anushka and Fan, Taosha and Park, Jinhyung and Shin, Soyong and Cao, Jinkun and Liu, Jiawei and Ugrinovic, Nicolas and Feiszli, Matt and Malik, Jitendra and Dollar, Piotr and Kitani, Kris},
299
+ journal={arXiv preprint; identifier to be added},
300
+ year={2025}
301
+ }
302
+ ```
303
+
304
+ ```
305
+ @InProceedings{Kolomiiets2026CVWW,
306
+ author = {Kolomiiets, Constantin and Purkrabek, Miroslav and Matas, Jiri},
307
+ title = {SAM-pose2seg: Pose-Guided Human Instance Segmentation in Crowds},
308
+ booktitle = {Computer Vision Winter Workshop (CVWW)},
309
+ year = {2026}
310
+ }
311
+ ```
SAM3D_INTEGRATION.md ADDED
@@ -0,0 +1,302 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # SAM-3D-Body Integration Guide
2
+
3
+ This guide explains how to integrate and use SAM-3D-Body for 3D human mesh recovery within the BBoxMaskPose pipeline.
4
+
5
+ ## Overview
6
+
7
+ BBoxMaskPose v2 extends the original BMP pipeline with [SAM-3D-Body](https://github.com/facebookresearch/sam-3d-body) from Meta AI, enabling full 3D human mesh recovery from single images. The integration leverages BMP's high-quality 2D pose estimates and segmentation masks as prompts to SAM-3D-Body, resulting in accurate 3D reconstructions even in crowded scenes.
8
+
9
+ **Pipeline Flow:**
10
+ ```
11
+ Input Image
12
+
13
+ BBoxMaskPose (Detection + 2D Pose + Segmentation)
14
+
15
+ 2D Bboxes + Masks + Poses
16
+
17
+ SAM-3D-Body (3D Mesh Recovery)
18
+
19
+ 3D Human Meshes (vertices, joints, faces)
20
+ ```
21
+
22
+ ## Installation
23
+
24
+ ### Prerequisites
25
+
26
+ - BBoxMaskPose must be already installed and working
27
+ - CUDA-capable GPU recommended (CPU inference is very slow)
28
+ - Python 3.8+ (Python 3.11 recommended for SAM-3D-Body)
29
+
30
+ ### Step 1: Install SAM-3D-Body Dependencies
31
+
32
+ ```bash
33
+ # Navigate to BBoxMaskPose root directory
34
+ cd /path/to/BBoxMaskPose
35
+
36
+ # Install SAM-3D-Body dependencies
37
+ pip install -r requirements/sam3d.txt
38
+ ```
39
+
40
+ ### Step 2: Install Detectron2
41
+
42
+ SAM-3D-Body requires a specific version of Detectron2:
43
+
44
+ ```bash
45
+ pip install 'git+https://github.com/facebookresearch/detectron2.git@a1ce2f9' \
46
+ --no-build-isolation --no-deps
47
+ ```
48
+
49
+ ### Step 3: Install MoGe (Optional but Recommended)
50
+
51
+ MoGe provides FOV (field-of-view) estimation for better camera calibration:
52
+
53
+ ```bash
54
+ pip install git+https://github.com/microsoft/MoGe.git
55
+ ```
56
+
57
+ ### Step 4: Install SAM-3D-Body
58
+
59
+ ```bash
60
+ # Install adapted SAM-3D-Body repository
61
+ pip install git+https://github.com/MiraPurkrabek/sam-3d-body.git
62
+ ```
63
+
64
+ ### Step 5: Get Model Checkpoints
65
+
66
+ SAM-3D-Body checkpoints are hosted on HuggingFace. You need to:
67
+
68
+ 1. **Request access** at [facebook/sam-3d-body-dinov3](https://huggingface.co/facebook/sam-3d-body-dinov3)
69
+ 2. **Wait for approval** (usually within 24 hours)
70
+ 3. **Authenticate** with HuggingFace:
71
+ ```bash
72
+ pip install huggingface_hub
73
+ huggingface-cli login
74
+ ```
75
+
76
+ The BMPv2 demo will auto-download the checkpoint on first use, or you can download manually to the default location for auto-detection:
77
+
78
+ ```bash
79
+ # Download checkpoint manually to default location (will be auto-detected)
80
+ mkdir -p checkpoints
81
+ huggingface-cli download facebook/sam-3d-body-dinov3 \
82
+ --local-dir checkpoints/sam-3d-body-dinov3
83
+ ```
84
+
85
+ ## Usage
86
+
87
+ ### Basic Usage
88
+
89
+ Run the BMPv2 demo with automatic checkpoint handling:
90
+
91
+ ```bash
92
+ python demos/BMPv2_demo.py --image data/004806.jpg --device cuda
93
+ ```
94
+
95
+ **The demo will:**
96
+ 1. Auto-detect checkpoint in `checkpoints/sam-3d-body-dinov3/` OR download from HuggingFace (~3.5 GB)
97
+ 2. Run BMP pipeline to get 2D detections, poses, and masks
98
+ 3. Run SAM-3D-Body to recover 3D meshes
99
+ 4. Save visualizations to `demos/outputs/bboxmaskpose_v2/`
100
+
101
+ ### Advanced Usage
102
+
103
+ #### Use Local Checkpoint (Auto-Detection)
104
+
105
+ Download checkpoint to the default location for automatic detection:
106
+
107
+ ```bash
108
+ # The demo automatically detects checkpoints in this location
109
+ huggingface-cli download facebook/sam-3d-body-dinov3 \
110
+ --local-dir checkpoints/sam-3d-body-dinov3
111
+
112
+ # Then just run the demo - no checkpoint arguments needed!
113
+ python demos/BMPv2_demo.py --image data/004806.jpg --device cuda
114
+ ```
115
+
116
+ #### Use Custom Checkpoint Path
117
+
118
+ If your checkpoint is in a different location:
119
+
120
+ ```bash
121
+ python demos/BMPv2_demo.py \
122
+ --image data/004806.jpg \
123
+ --device cuda \
124
+ --sam3d_checkpoint /path/to/model.ckpt \
125
+ --mhr_path /path/to/mhr_model.pt
126
+ ```
127
+
128
+ #### Speed vs Quality Trade-offs
129
+
130
+ ```bash
131
+ # Fastest: body-only inference without mask conditioning
132
+ python demos/BMPv2_demo.py --image data/004806.jpg \
133
+ --inference_type body --no_mask_conditioning
134
+
135
+ # Balanced: body-only with mask conditioning
136
+ python demos/BMPv2_demo.py --image data/004806.jpg \
137
+ --inference_type body
138
+
139
+ # Best quality: full inference with mask conditioning (default)
140
+ python demos/BMPv2_demo.py --image data/004806.jpg
141
+ ```
142
+
143
+ #### Disable Mask Conditioning
144
+
145
+ Faster but less accurate (doesn't use segmentation masks as prompts):
146
+
147
+ ```bash
148
+ python demos/BMPv2_demo.py \
149
+ --image data/004806.jpg \
150
+ --no_mask_conditioning
151
+ ```
152
+
153
+ #### Skip 3D Recovery
154
+
155
+ Run only BMP pipeline (useful for testing BMP without SAM-3D-Body):
156
+
157
+ ```bash
158
+ python demos/BMPv2_demo.py \
159
+ --image data/004806.jpg \
160
+ --skip_3d
161
+ ```
162
+
163
+ ### Output Files
164
+
165
+ The demo saves the following visualizations:
166
+
167
+ - `{image_name}_bmp_pose.jpg` - 2D pose estimation results
168
+ - `{image_name}_bmp_mask.jpg` - Segmentation mask results
169
+ - `{image_name}_3d_mesh.jpg` - 3D mesh overlay on image
170
+ - `{image_name}_combined.jpg` - Side-by-side comparison of all results
171
+
172
+ ## Programmatic API
173
+
174
+ You can also use SAM-3D-Body programmatically:
175
+
176
+ ```python
177
+ from bboxmaskpose import BBoxMaskPose
178
+ from bboxmaskpose.sam3d_utils import SAM3DBodyWrapper, visualize_3d_meshes
179
+
180
+ # Step 1: Run BMP pipeline
181
+ bmp = BBoxMaskPose(config="bmp_D3", device="cuda")
182
+ result = bmp.predict(image="path/to/image.jpg")
183
+
184
+ # Step 2: Initialize SAM-3D-Body
185
+ sam3d = SAM3DBodyWrapper(device="cuda")
186
+
187
+ # Step 3: Predict 3D meshes from BMP outputs
188
+ outputs_3d = sam3d.predict(
189
+ image="path/to/image.jpg",
190
+ bboxes=result['bboxes'],
191
+ masks=result['masks'],
192
+ use_mask=True,
193
+ inference_type="full", # Options: "full", "body", "hand"
194
+ )
195
+
196
+ # Step 4: Visualize results
197
+ import cv2
198
+ img = cv2.imread("path/to/image.jpg")
199
+ vis = visualize_3d_meshes(img, outputs_3d, sam3d.faces)
200
+ cv2.imwrite("output_3d.jpg", vis)
201
+ ```
202
+
203
+ ### Access 3D Mesh Data
204
+
205
+ Each element in `outputs_3d` is a dictionary containing:
206
+
207
+ ```python
208
+ output_3d[0].keys()
209
+ # dict_keys(['vertices', 'joints', 'bbox', 'mask', ...])
210
+
211
+ # 3D mesh vertices in camera coordinates (V, 3)
212
+ vertices = outputs_3d[0]['vertices']
213
+
214
+ # 3D joint locations (J, 3)
215
+ joints_3d = outputs_3d[0]['joints']
216
+
217
+ # Mesh faces (shared across all people)
218
+ faces = sam3d.faces # (F, 3)
219
+ ```
220
+
221
+ ## Integration Architecture
222
+
223
+ ### Wrapper Design
224
+
225
+ The integration follows BBoxMaskPose's modular design pattern:
226
+
227
+ ```
228
+ bboxmaskpose/
229
+ ├── sam3d_utils.py # SAM-3D-Body wrapper (new)
230
+ │ ├── SAM3DBodyWrapper # Main wrapper class
231
+ │ ├── visualize_3d_meshes # Visualization helper
232
+ │ └── check_sam3d_available
233
+
234
+ demos/
235
+ ├── BMP_demo.py # Original BMP demo
236
+ └── BMPv2_demo.py # New demo with 3D (new)
237
+ ```
238
+
239
+ ### Why a Wrapper?
240
+
241
+ The `SAM3DBodyWrapper` class:
242
+ - **Simplifies** SAM-3D-Body's complex initialization
243
+ - **Adapts** BMP outputs (bboxes, masks) to SAM-3D-Body inputs
244
+ - **Handles** optional dependencies gracefully (no hard requirement)
245
+ - **Follows** BMP's design patterns (similar to PMPose wrapper)
246
+
247
+ ### Key Design Decisions
248
+
249
+ 1. **Optional Dependency**: SAM-3D-Body is not required for core BMP functionality
250
+ 2. **No Code Duplication**: Reuses SAM-3D-Body's existing code via wrapper
251
+ 3. **Mask Conditioning**: Leverages BMP's high-quality masks as prompts
252
+ 4. **No Internal Detector**: Disables SAM-3D-Body's detector (BMP already detects)
253
+
254
+ ## Troubleshooting
255
+
256
+ ### Import Error: `sam_3d_body` not found
257
+
258
+ **Solution**: Install SAM-3D-Body following Step 4 above.
259
+
260
+ ### HuggingFace Authentication Error
261
+
262
+ **Solution**:
263
+ 1. Request access at https://huggingface.co/facebook/sam-3d-body-dinov3
264
+ 2. Login: `huggingface-cli login`
265
+
266
+ ### MoGe Import Error (FOV Estimator)
267
+
268
+ **Solution**: Either:
269
+ - Install MoGe: `pip install git+https://github.com/microsoft/MoGe.git`
270
+ - Or disable FOV estimation (uses default FOV instead)
271
+
272
+ ### Detectron2 Build Errors
273
+
274
+ **Solution**: Make sure you have:
275
+ - CUDA toolkit installed and matching PyTorch CUDA version
276
+ - GCC/G++ compiler available
277
+ - Use the exact commit hash: `@a1ce2f9`
278
+
279
+ ## References
280
+
281
+ - **SAM-3D-Body**: [GitHub](https://github.com/facebookresearch/sam-3d-body) | [Paper](https://ai.meta.com/research/publications/sam-3d-body-robust-full-body-human-mesh-recovery/)
282
+ - **BBoxMaskPose**: [GitHub](https://github.com/MiraPurkrabek/BBoxMaskPose) | [Paper](https://arxiv.org/abs/2601.15200)
283
+
284
+ ## Citation
285
+
286
+ If you use this integration, please cite both works:
287
+
288
+ ```bibtex
289
+ @article{yang2025sam3dbody,
290
+ title={SAM 3D Body: Robust Full-Body Human Mesh Recovery},
291
+ author={Yang, Xitong and Kukreja, Devansh and Pinkus, Don and Sagar, Anushka and Fan, Taosha and Park, Jinhyung and Shin, Soyong and Cao, Jinkun and Liu, Jiawei and Ugrinovic, Nicolas and Feiszli, Matt and Malik, Jitendra and Dollar, Piotr and Kitani, Kris},
292
+ journal={arXiv preprint; identifier to be added},
293
+ year={2025}
294
+ }
295
+
296
+ @InProceedings{Purkrabek2026BMPv2,
297
+ author = {Purkrabek, Miroslav and Kolomiiets, Constantin and Matas, Jiri},
298
+ title = {BBoxMaskPose v2: Expanding Mutual Conditioning to 3D},
299
+ booktitle = {arXiv preprint arXiv:2601.15200},
300
+ year = {2026}
301
+ }
302
+ ```
app.py CHANGED
@@ -1,188 +1,204 @@
1
- import gradio as gr
2
- import spaces
3
-
4
- from pathlib import Path
5
 
 
 
6
  import numpy as np
7
- import yaml
8
- from demo.demo_utils import DotDict, concat_instances, filter_instances, pose_nms, visualize_demo
9
- from demo.mm_utils import run_MMDetector, run_MMPose
10
- from mmdet.apis import init_detector
11
- from demo.sam2_utils import prepare_model as prepare_sam2_model
12
- from demo.sam2_utils import process_image_with_SAM
13
-
14
- from mmpose.apis import init_model as init_pose_estimator
15
- from mmpose.utils import adapt_mmdet_pipeline
16
-
17
- # Default thresholds
18
- DEFAULT_CAT_ID: int = 0
19
-
20
- DEFAULT_BBOX_THR: float = 0.3
21
- DEFAULT_NMS_THR: float = 0.3
22
- DEFAULT_KPT_THR: float = 0.3
23
-
24
- # Global models variable
25
- det_model = None
26
- pose_model = None
27
- sam2_model = None
28
-
29
- def _parse_yaml_config(yaml_path: Path) -> DotDict:
30
- """
31
- Load BMP configuration from a YAML file.
32
-
33
- Args:
34
- yaml_path (Path): Path to YAML config.
35
- Returns:
36
- DotDict: Nested config dictionary.
37
- """
38
- with open(yaml_path, "r") as f:
39
- cfg = yaml.safe_load(f)
40
- return DotDict(cfg)
41
-
42
- def load_models(bmp_config):
43
- device = 'cuda:0'
44
-
45
- global det_model, pose_model, sam2_model
46
-
47
- # build detectors
48
- det_model = init_detector(bmp_config.detector.det_config, bmp_config.detector.det_checkpoint, device='cpu') # Detect with CPU because of installation issues on HF
49
- det_model.cfg = adapt_mmdet_pipeline(det_model.cfg)
50
-
51
-
52
- # build pose estimator
53
- pose_model = init_pose_estimator(
54
- bmp_config.pose_estimator.pose_config,
55
- bmp_config.pose_estimator.pose_checkpoint,
56
- device=device,
57
- cfg_options=dict(model=dict(test_cfg=dict(output_heatmaps=False))),
58
- )
59
 
60
- sam2_model = prepare_sam2_model(
61
- model_cfg=bmp_config.sam2.sam2_config,
62
- model_checkpoint=bmp_config.sam2.sam2_checkpoint,
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
63
  )
64
-
65
- return det_model, pose_model, sam2_model
 
 
 
 
 
 
 
 
 
66
 
67
  @spaces.GPU(duration=60)
68
  def process_image_with_BMP(
69
  img: np.ndarray
70
  ) -> tuple[np.ndarray, np.ndarray]:
71
  """
72
- Run the full BMP pipeline on a single image: detection, pose, SAM mask refinement, and visualization.
73
-
74
- Args:
75
- args (Namespace): Parsed CLI arguments.
76
- bmp_config (DotDict): Configuration parameters.
77
- img_path (Path): Path to the input image.
78
- detector: Primary MMDetection model.
79
- detector_prime: Secondary MMDetection model for iterations.
80
- pose_estimator: MMPose model for keypoint estimation.
81
- sam2_model: SAM model for mask refinement.
82
- Returns:
83
- InstanceData: Final merged detections and refined masks.
84
  """
85
- bmp_config = _parse_yaml_config(Path("configs/bmp_D3.yaml"))
86
- load_models(bmp_config)
87
-
88
- # img: RGB -> BGR
89
- img = img[..., ::-1]
90
-
91
- img_for_detection = img.copy()
92
- rtmdet_result = None
93
- all_detections = None
94
- for iteration in range(bmp_config.num_bmp_iters):
95
-
96
- # Step 1: Detection
97
- det_instances = run_MMDetector(
98
- det_model,
99
- img_for_detection,
100
- det_cat_id=DEFAULT_CAT_ID,
101
- bbox_thr=DEFAULT_BBOX_THR,
102
- nms_thr=DEFAULT_NMS_THR,
103
- )
104
- if len(det_instances.bboxes) == 0:
105
- continue
106
-
107
- # Step 2: Pose estimation
108
- pose_instances = run_MMPose(
109
- pose_model,
110
- img.copy(),
111
- detections=det_instances,
112
- kpt_thr=DEFAULT_KPT_THR,
113
- )
114
-
115
- # Restrict to first 17 COCO keypoints
116
- pose_instances.keypoints = pose_instances.keypoints[:, :17, :]
117
- pose_instances.keypoint_scores = pose_instances.keypoint_scores[:, :17]
118
- pose_instances.keypoints = np.concatenate(
119
- [pose_instances.keypoints, pose_instances.keypoint_scores[:, :, None]], axis=-1
120
- )
121
-
122
- # Step 3: Pose-NMS and SAM refinement
123
- all_keypoints = (
124
- pose_instances.keypoints
125
- if all_detections is None
126
- else np.concatenate([all_detections.keypoints, pose_instances.keypoints], axis=0)
127
- )
128
- all_bboxes = (
129
- pose_instances.bboxes
130
- if all_detections is None
131
- else np.concatenate([all_detections.bboxes, pose_instances.bboxes], axis=0)
132
- )
133
- num_valid_kpts = np.sum(all_keypoints[:, :, 2] > bmp_config.sam2.prompting.confidence_thr, axis=1)
134
- keep_indices = pose_nms(
135
- DotDict({"confidence_thr": bmp_config.sam2.prompting.confidence_thr, "oks_thr": bmp_config.oks_nms_thr}),
136
- image_kpts=all_keypoints,
137
- image_bboxes=all_bboxes,
138
- num_valid_kpts=num_valid_kpts,
139
- )
140
- keep_indices = sorted(keep_indices) # Sort by original index
141
- num_old_detections = 0 if all_detections is None else len(all_detections.bboxes)
142
- keep_new_indices = [i - num_old_detections for i in keep_indices if i >= num_old_detections]
143
- keep_old_indices = [i for i in keep_indices if i < num_old_detections]
144
- if len(keep_new_indices) == 0:
145
- continue
146
- # filter new detections and compute scores
147
- new_dets = filter_instances(pose_instances, keep_new_indices)
148
- new_dets.scores = pose_instances.keypoint_scores[keep_new_indices].mean(axis=-1)
149
- old_dets = None
150
- if len(keep_old_indices) > 0:
151
- old_dets = filter_instances(all_detections, keep_old_indices)
152
-
153
- new_detections = process_image_with_SAM(
154
- DotDict(bmp_config.sam2.prompting),
155
- img.copy(),
156
- sam2_model,
157
- new_dets,
158
- old_dets if old_dets is not None else None,
159
- )
160
-
161
- # Merge detections
162
- if all_detections is None:
163
- all_detections = new_detections
164
- else:
165
- all_detections = concat_instances(all_detections, new_dets)
166
-
167
- # Step 4: Visualization
168
- img_for_detection, rtmdet_r, _ = visualize_demo(
169
- img.copy(),
170
- all_detections,
171
- )
172
-
173
- if iteration == 0:
174
- rtmdet_result = rtmdet_r
175
-
176
- _, _, bmp_result = visualize_demo(
177
- img.copy(),
178
- all_detections,
179
  )
 
 
 
 
 
 
 
 
 
180
 
181
- # img: BGR -> RGB
182
- rtmdet_result = rtmdet_result[..., ::-1]
183
- bmp_result = bmp_result[..., ::-1]
184
 
185
- return rtmdet_result, bmp_result
 
186
 
187
 
188
  with gr.Blocks() as app:
@@ -281,4 +297,4 @@ with gr.Blocks() as app:
281
  )
282
 
283
  # Launch the demo
284
- app.launch()
 
1
+ from typing import Any
 
 
 
2
 
3
+ import cv2
4
+ import gradio as gr
5
  import numpy as np
6
+ import spaces
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
7
 
8
+ from bboxmaskpose import BBoxMaskPose
9
+
10
# Global BMP model singleton, lazily created and cached by _get_bmp_model().
bmp_model = None
# Config alias and device the cached model was built with; a mismatch with the
# requested values triggers a rebuild on the next _get_bmp_model() call.
bmp_model_config = "bmp_v2"
bmp_model_device = "cuda:0"
14
+
15
+
16
+ def _to_numpy(value: Any):
17
+ """Convert model outputs to numpy arrays when needed."""
18
+ if value is None:
19
+ return None
20
+ if isinstance(value, np.ndarray):
21
+ return value
22
+ if hasattr(value, "detach"):
23
+ return value.detach().cpu().numpy()
24
+ if hasattr(value, "cpu") and hasattr(value, "numpy"):
25
+ return value.cpu().numpy()
26
+ return np.asarray(value)
27
+
28
+
29
+ def _empty_result(height: int, width: int) -> dict[str, np.ndarray]:
30
+ """Create an empty result dictionary compatible with BBoxMaskPose.visualize()."""
31
+ return {
32
+ "bboxes": np.zeros((0, 4), dtype=np.float32),
33
+ "masks": np.zeros((0, height, width), dtype=np.uint8),
34
+ "keypoints": np.zeros((0, 17, 3), dtype=np.float32),
35
+ "presence": np.zeros((0, 17), dtype=np.float32),
36
+ "visibility": np.zeros((0, 17), dtype=np.float32),
37
+ }
38
+
39
+
40
def _normalize_result(result: dict, height: int, width: int) -> dict[str, np.ndarray]:
    """Normalize a prediction dictionary into a robust shape for visualization.

    Guarantees float32 (N, 4) bboxes, uint8 (N, H, W) masks, float32
    (N, 17, 3) keypoints, and float32 (N, 17) presence/visibility arrays,
    repairing missing or oddly-shaped entries along the way.

    Args:
        result: Prediction dict from BBoxMaskPose.predict() (keys may be
            missing, None, or tensors).
        height: Image height used for fallback mask shapes.
        width: Image width used for fallback mask shapes.

    Returns:
        Dict with normalized 'bboxes', 'masks', 'keypoints', 'presence',
        and 'visibility' numpy arrays.
    """
    bboxes = _to_numpy(result.get("bboxes"))
    keypoints = _to_numpy(result.get("keypoints"))
    masks = _to_numpy(result.get("masks"))
    presence = _to_numpy(result.get("presence"))
    visibility = _to_numpy(result.get("visibility"))

    if bboxes is None:
        bboxes = np.zeros((0, 4), dtype=np.float32)
    bboxes = np.asarray(bboxes, dtype=np.float32).reshape(-1, 4)
    num_instances = bboxes.shape[0]

    # Fix: with zero instances, `reshape(0, -1)` on the presence/visibility
    # arrays below raises ValueError (numpy cannot infer -1 from a size-0
    # array), so short-circuit to a fully-empty result.
    if num_instances == 0:
        return {
            "bboxes": bboxes,
            "masks": np.zeros((0, height, width), dtype=np.uint8),
            "keypoints": np.zeros((0, 17, 3), dtype=np.float32),
            "presence": np.zeros((0, 17), dtype=np.float32),
            "visibility": np.zeros((0, 17), dtype=np.float32),
        }

    if keypoints is None:
        keypoints = np.zeros((num_instances, 17, 3), dtype=np.float32)
    keypoints = np.asarray(keypoints, dtype=np.float32)
    if keypoints.ndim == 2:
        # Single instance provided without the leading batch axis.
        keypoints = keypoints[None, ...]
    if keypoints.shape[0] != num_instances:
        # Instance-count mismatch: discard rather than misalign with bboxes.
        keypoints = np.zeros((num_instances, 17, 3), dtype=np.float32)
    if keypoints.shape[1] > 17:
        keypoints = keypoints[:, :17, :]
    if keypoints.shape[1] < 17:
        padded = np.zeros((num_instances, 17, 3), dtype=np.float32)
        padded[:, : keypoints.shape[1], : min(keypoints.shape[2], 3)] = keypoints[:, :, :3]
        keypoints = padded
    if keypoints.shape[2] == 2:
        # (x, y) only: assume full confidence for the missing score channel.
        scores = np.ones((num_instances, 17, 1), dtype=np.float32)
        keypoints = np.concatenate([keypoints, scores], axis=-1)
    elif keypoints.shape[2] > 3:
        keypoints = keypoints[:, :, :3]

    if masks is None:
        masks = np.zeros((num_instances, height, width), dtype=np.uint8)
    masks = np.asarray(masks)
    if masks.ndim == 2:
        masks = masks[None, ...]
    if masks.ndim == 4 and masks.shape[-1] == 1:
        masks = masks.squeeze(-1)
    if masks.shape[0] != num_instances:
        masks = np.zeros((num_instances, height, width), dtype=np.uint8)
    masks = masks.astype(np.uint8)

    if presence is None:
        # Fall back to the keypoint confidence channel.
        presence = keypoints[:, :, 2]
    presence = np.asarray(presence, dtype=np.float32).reshape(num_instances, -1)
    if presence.shape[1] > 17:
        presence = presence[:, :17]
    if presence.shape[1] < 17:
        padded_presence = np.zeros((num_instances, 17), dtype=np.float32)
        padded_presence[:, : presence.shape[1]] = presence
        presence = padded_presence

    if visibility is None:
        # Fall back to the keypoint confidence channel.
        visibility = keypoints[:, :, 2]
    visibility = np.asarray(visibility, dtype=np.float32).reshape(num_instances, -1)
    if visibility.shape[1] > 17:
        visibility = visibility[:, :17]
    if visibility.shape[1] < 17:
        padded_visibility = np.zeros((num_instances, 17), dtype=np.float32)
        padded_visibility[:, : visibility.shape[1]] = visibility
        visibility = padded_visibility

    return {
        "bboxes": bboxes,
        "masks": masks,
        "keypoints": keypoints,
        "presence": presence,
        "visibility": visibility,
    }
110
+
111
+
112
def _extract_baseline_result(intermediates: Any, fallback_result: dict, height: int, width: int) -> dict[str, np.ndarray]:
    """Build the baseline result from the first intermediate pose output.

    Falls back to *fallback_result* whenever the first BMP iteration did not
    record usable pose instances.
    """
    first = intermediates[0] if intermediates else None
    poses = first.get("poses") if first is not None else None
    if poses is None:
        return _normalize_result(fallback_result, height, width)

    raw = {
        "bboxes": _to_numpy(getattr(poses, "bboxes", None)),
        "keypoints": _to_numpy(getattr(poses, "keypoints", None)),
        "masks": _to_numpy(getattr(poses, "masks", None)),
        "presence": _to_numpy(getattr(poses, "keypoint_prob", None)),
        "visibility": _to_numpy(getattr(poses, "keypoint_vis", None)),
    }
    return _normalize_result(raw, height, width)
133
+
134
+
135
def _blend_pose_and_mask(model: BBoxMaskPose, image_bgr: np.ndarray, result: dict[str, np.ndarray]) -> np.ndarray:
    """Render pose and mask overlays for *result* and blend them 50/50."""
    overlays = [
        model.visualize(image=image_bgr, result=result, vis_type=kind)
        for kind in ("pose", "mask")
    ]
    return cv2.addWeighted(overlays[0], 0.5, overlays[1], 0.5, 0.0)
140
+
141
+
142
def _get_bmp_model(config_name: str = "bmp_v2", device: str = "cuda:0") -> BBoxMaskPose:
    """Return the cached BBoxMaskPose instance, rebuilding it on config/device change."""
    global bmp_model, bmp_model_config, bmp_model_device

    cache_is_valid = (
        bmp_model is not None
        and bmp_model_config == config_name
        and bmp_model_device == device
    )
    if not cache_is_valid:
        try:
            bmp_model = BBoxMaskPose(config=config_name, device=device)
        except Exception as exc:
            raise RuntimeError(
                f"Failed to initialize BBoxMaskPose with config='{config_name}', device='{device}'."
            ) from exc
        # Record what the cached model was built with (constructor succeeded).
        bmp_model_config = config_name
        bmp_model_device = device

    return bmp_model
162
 
163
@spaces.GPU(duration=60)
def process_image_with_BMP(
    img: np.ndarray
) -> tuple[np.ndarray, np.ndarray]:
    """
    Run BMP inference using the public BBoxMaskPose API.

    The function keeps the original Gradio interface:
    - output 1: baseline-style result from first intermediate pass
    - output 2: final BMP-refined result

    Args:
        img: RGB image from the Gradio input component.

    Returns:
        Tuple of two RGB images (baseline visualization, BMP visualization).
        When nothing is detected, the unmodified input is returned twice.

    Raises:
        ValueError: If *img* is None (no image was provided).
    """
    if img is None:
        raise ValueError("Input image is None.")

    # Gradio image is RGB; BMP API expects BGR.
    img_bgr = img[..., ::-1].copy()
    height, width = img_bgr.shape[:2]

    model = _get_bmp_model(config_name=bmp_model_config, device=bmp_model_device)
    final_result = model.predict(
        image=img_bgr,
        bboxes=None,
        return_intermediates=True,
    )
    normalized_final = _normalize_result(final_result, height, width)

    # No-detection robustness: return original image for both outputs.
    if normalized_final["bboxes"].shape[0] == 0:
        original_rgb = img_bgr[..., ::-1]
        return original_rgb, original_rgb

    # Baseline output is derived from the first BMP iteration's poses.
    intermediates = final_result.get("intermediates", [])
    baseline_result = _extract_baseline_result(intermediates, normalized_final, height, width)

    baseline_vis = _blend_pose_and_mask(model, img_bgr, baseline_result)
    bmp_vis = _blend_pose_and_mask(model, img_bgr, normalized_final)

    # BGR -> RGB for Gradio
    return baseline_vis[..., ::-1], bmp_vis[..., ::-1]
202
 
203
 
204
  with gr.Blocks() as app:
 
297
  )
298
 
299
  # Launch the demo
300
+ app.launch()
bboxmaskpose/__init__.py ADDED
@@ -0,0 +1,10 @@
 
 
 
 
 
 
 
 
 
 
 
1
+ # Copyright (c) authors of BBoxMaskPose (BMPv2). All rights reserved.
2
+ """
3
+ BBoxMaskPose package - Public API for detection, pose estimation, and segmentation.
4
+
5
+ This package provides a stable wrapper for the full BBoxMaskPose pipeline.
6
+ """
7
+
8
+ from .api import BBoxMaskPose
9
+
10
+ __all__ = ["BBoxMaskPose"]
bboxmaskpose/api.py ADDED
@@ -0,0 +1,515 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Copyright (c) authors of BBoxMaskPose (BMPv2). All rights reserved.
2
+
3
+ """
4
+ Public API for BBoxMaskPose wrapper.
5
+
6
+ This module provides a stable, user-friendly interface for the full
7
+ BBoxMaskPose pipeline: detection, pose estimation, and mask refinement.
8
+ """
9
+
10
+ import glob
11
+ import os
12
+ from pathlib import Path
13
+ from typing import Dict, List, Optional, Union
14
+
15
+ import cv2
16
+ import mmengine
17
+ import numpy as np
18
+ import yaml
19
+ from mmdet.apis import inference_detector, init_detector
20
+ from mmengine.structures import InstanceData
21
+
22
+ from .demo_utils import DotDict, _visualize_predictions, concat_instances, filter_instances, pose_nms
23
+ from .posevis_lite import pose_visualization
24
+
25
+ # Import from BBoxMaskPose package
26
+ from .sam2_utils import prepare_model as prepare_sam2_model, process_image_with_SAM
27
+
28
+ BMP_ROOT = Path(__file__).parent.parent
29
+
30
+ # Note: PMPose will be imported when needed to avoid circular imports
31
+ # from pmpose import PMPose
32
+
33
# Default detector and pose config
DEFAULT_DET_CAT_ID: int = 0  # detector class id kept by _run_detector (0 — presumably 'person'; verify against detector config)
DEFAULT_BBOX_THR: float = 0.3  # minimum detection score kept by _run_detector
DEFAULT_NMS_THR: float = 0.3  # IoU threshold for box NMS in _run_detector
DEFAULT_KPT_THR: float = 0.3  # keypoint confidence threshold (not referenced in this module; NOTE(review): confirm intended consumer)

# Pretrained config URLs (for future use)
PRETRAINED_CONFIGS = {
    "bmp-d3": "BMP_D3",
    "bmp-j1": "BMP_J1",
}
44
+
45
+
46
+ class BBoxMaskPose:
47
+ """
48
+ Public wrapper API for BBoxMaskPose pipeline.
49
+
50
+ This class provides a complete pipeline for detection, pose estimation,
51
+ and mask refinement using SAM2.
52
+
53
+ Example:
54
+ >>> bmp_model = BBoxMaskPose(config="BMP_D3", device="cuda")
55
+ >>> result = bmp_model.predict(
56
+ ... image="path/to/image.jpg",
57
+ ... return_intermediates=True
58
+ ... )
59
+ """
60
+
61
+ def __init__(
62
+ self,
63
+ config: str = "BMP_D3",
64
+ device: str = "cuda",
65
+ config_path: Optional[str] = None,
66
+ pose_model=None, # Type hint removed to avoid import at module level
67
+ pretrained_id: Optional[str] = None,
68
+ n_kpts_to_work_with: Optional[int] = 17,
69
+ ):
70
+ """
71
+ Initialize BBoxMaskPose model.
72
+
73
+ Args:
74
+ config (str): Config alias ('BMP_D3', 'BMP_J1'). Defaults to 'BMP_D3'.
75
+ device (str): Device for inference. Defaults to 'cuda'.
76
+ config_path (str, optional): Path to custom YAML config file.
77
+ pose_model (PMPose, optional): Pre-initialized PMPose instance.
78
+ If None, will create internal pose model.
79
+ pretrained_id (str, optional): Alias for pretrained config.
80
+ n_kpts_to_work_with (int, optional): Number of keypoints to work with.
81
+ Defaults to 17 (COCO keypoints).
82
+ """
83
+ self.device = device
84
+ self.config_name = config
85
+
86
+ self.n_kpts_to_work_with = 17 # Hard-code 17 as no experiments were done with other values. Keep the argument for future flexibility, but ignore it for now.
87
+ if n_kpts_to_work_with != 17:
88
+ print(
89
+ f"Warning: n_kpts_to_work_with is set to {n_kpts_to_work_with}, but currently only 17 keypoints are supported. Ignoring this argument for now."
90
+ )
91
+
92
+ # Determine config path
93
+ if config_path is not None:
94
+ self.config_path = config_path
95
+ else:
96
+ bmp_configs_root = os.path.join(BMP_ROOT, "bboxmaskpose", "configs")
97
+ config_file = f"{config}.yaml"
98
+ self.config_path = os.path.join(bmp_configs_root, config_file)
99
+
100
+ if not os.path.exists(self.config_path):
101
+ available_configs = glob.glob(os.path.join(bmp_configs_root, "*.yaml"))
102
+ available_configs = [os.path.basename(f).replace(".yaml", "") for f in available_configs]
103
+ raise FileNotFoundError(f"Config file not found: {self.config_path}. " f"Available configs: {', '.join(available_configs)}")
104
+
105
+ # Load config
106
+ self.config = self._load_config(self.config_path)
107
+
108
+ # Initialize or use provided pose model
109
+ if pose_model is not None:
110
+ self.pose_model = pose_model
111
+ self._owns_pose_model = False
112
+ else:
113
+ # Create internal PMPose instance
114
+ self.pose_model = self._create_pose_model()
115
+ self._owns_pose_model = True
116
+
117
+ # Initialize detector and SAM2
118
+ self.detector = None
119
+ self.detector_prime = None
120
+ self.sam2_model = None
121
+ self._initialize_models()
122
+
123
+ def _load_config(self, config_path: str) -> DotDict:
124
+ """Load BMP configuration from YAML file."""
125
+ with open(config_path, "r") as f:
126
+ cfg = yaml.safe_load(f)
127
+ return DotDict(cfg)
128
+
129
+ def _create_pose_model(self):
130
+ """Create internal PMPose model from config."""
131
+ # Import PMPose here to avoid circular imports
132
+ from pmpose import PMPose
133
+
134
+ # Extract pose config from BMP config
135
+ pose_config = self.config.pose_estimator.pose_config
136
+ pose_checkpoint = self.config.pose_estimator.pose_checkpoint
137
+
138
+ # Create PMPose instance with custom config
139
+ full_pose_config = str(BMP_ROOT / pose_config)
140
+
141
+ pose_model = PMPose(
142
+ device=self.device,
143
+ config_path=full_pose_config,
144
+ from_pretrained=True,
145
+ )
146
+
147
+ # Load checkpoint if it's a local path
148
+ if not pose_checkpoint.startswith("http"):
149
+ pose_model.load_from_file(pose_checkpoint)
150
+
151
+ return pose_model
152
+
153
    def _initialize_models(self):
        """Initialize the detector(s) and the SAM2 segmentation model.

        Sets self.detector, self.detector_prime (shared with self.detector
        when no separate D' is configured) and self.sam2_model.
        """
        # Initialize detector
        self.detector = init_detector(self.config.detector.det_config, self.config.detector.det_checkpoint, device=self.device)

        # Adapt detector pipeline for use with mmpose-style inference.
        from mmpose.utils import adapt_mmdet_pipeline

        self.detector.cfg = adapt_mmdet_pipeline(self.detector.cfg)

        # Initialize detector prime (may be same as detector). D' is reused
        # when its config/checkpoint equal D's, or when either is unset.
        if (
            self.config.detector.det_config == self.config.detector.det_prime_config
            and self.config.detector.det_checkpoint == self.config.detector.det_prime_checkpoint
        ) or (self.config.detector.det_prime_config is None or self.config.detector.det_prime_checkpoint is None):
            self.detector_prime = self.detector
        else:
            self.detector_prime = init_detector(
                self.config.detector.det_prime_config, self.config.detector.det_prime_checkpoint, device=self.device
            )
            self.detector_prime.cfg = adapt_mmdet_pipeline(self.detector_prime.cfg)

        # Initialize SAM2 from a config file bundled inside this package.
        sam2_config_path = os.path.join(BMP_ROOT, "bboxmaskpose", "sam2", self.config.sam2.sam2_config)
        self.sam2_model = prepare_sam2_model(
            model_cfg=sam2_config_path,
            model_checkpoint=self.config.sam2.sam2_checkpoint,
        )
181
+
182
    def predict(
        self,
        image: Union[str, np.ndarray],
        bboxes: Optional[np.ndarray] = None,
        return_intermediates: bool = False,
        return_probmaps: bool = False,
    ) -> Dict:
        """
        Run full BBoxMaskPose pipeline on image.

        Args:
            image: Image path (str) or BGR numpy array.
            bboxes: Optional (N, 4) bboxes in [x1, y1, x2, y2] format.
                If None, run detector.
            return_intermediates: If True, return intermediate outputs.
            return_probmaps: If True, request heatmaps from pose model.

        Returns:
            Dict with keys:
                - 'bboxes': (N, 4) final bounding boxes
                - 'masks': (N, H, W) refined binary masks
                - 'keypoints': (N, K, 3) keypoints with scores
                - 'presence': (N, K) presence probabilities
                - 'visibility': (N, K) visibility flags
                - 'intermediates': (optional) per-iteration dicts with
                  'iteration', 'detections', 'poses', 'refined' when
                  return_intermediates is True

        Raises:
            ValueError: If *image* is a path that cannot be read.
        """
        # Load image
        if isinstance(image, str):
            img = cv2.imread(image)
            if img is None:
                raise ValueError(f"Failed to load image from {image}")
        else:
            # Copy so the caller's array is never mutated by the pipeline.
            img = image.copy()

        # Run BMP iterations
        all_detections = None
        intermediate_results = [] if return_intermediates else None

        for iteration in range(self.config.num_bmp_iters):
            # Step 1: Detection
            if iteration == 0 and bboxes is not None:
                # Use provided bboxes for first iteration
                det_instances = InstanceData(bboxes=bboxes, bbox_scores=np.ones(len(bboxes)), masks=None)
            else:
                # Run detector. From the second iteration on, previously
                # segmented instances are blacked out so D' can find new ones.
                det_instances = self._run_detector(
                    self.detector if iteration == 0 else self.detector_prime,
                    img if all_detections is None else self._mask_out_image(img, all_detections),
                )

            if len(det_instances.bboxes) == 0:
                continue

            # Step 2: Pose estimation using PMPose wrapper
            pose_results = self._run_pose_estimation(img, det_instances, return_probmaps=return_probmaps)

            # Step 3: Pose NMS and SAM refinement
            new_detections, old_detections = self._refine_with_sam(
                img,
                pose_results,
                all_detections,
            )

            # Merge detections. _refine_with_sam may return
            # new_detections=None (all new poses suppressed by pose-NMS);
            # concat_instances tolerates None operands.
            if all_detections is None:
                all_detections = new_detections
            else:
                all_detections = concat_instances(old_detections, new_detections)

            # Store intermediates if requested
            if return_intermediates:
                intermediate_results.append(
                    {
                        "iteration": iteration,
                        "detections": det_instances,
                        "poses": pose_results,
                        "refined": new_detections,
                    }
                )

        # Prepare final result
        result = self._format_result(all_detections, img.shape[:2])

        if return_intermediates:
            result["intermediates"] = intermediate_results

        return result
270
+
271
    def _run_detector(
        self,
        detector,
        img: np.ndarray,
    ) -> InstanceData:
        """Run MMDetection detector and return filtered person candidates.

        Keeps predictions of category DEFAULT_DET_CAT_ID with score above
        DEFAULT_BBOX_THR, sorts by score, and applies box NMS with
        DEFAULT_NMS_THR. Instance masks (when the detector provides them)
        are kept index-aligned with the surviving boxes.

        Args:
            detector: An initialized MMDetection model.
            img: BGR image (possibly with previous detections masked out).

        Returns:
            InstanceData with 'bboxes' (N, 4), 'bbox_scores' (N,) and
            'masks' (None when the detector produces no masks).
        """
        from mmpose.evaluation.functional import nms

        # Run detection
        det_result = inference_detector(detector, img)
        pred_instances = det_result.pred_instances.cpu().numpy()

        # Aggregate bboxes and scores into (N, 5) [x1, y1, x2, y2, score]
        bboxes_all = np.concatenate((pred_instances.bboxes, pred_instances.scores[:, None]), axis=1)

        # Filter by category and score
        keep_mask = np.logical_and(pred_instances.labels == DEFAULT_DET_CAT_ID, pred_instances.scores > DEFAULT_BBOX_THR)

        if not np.any(keep_mask):
            return InstanceData(bboxes=np.zeros((0, 4)), bbox_scores=np.zeros((0,)), masks=np.zeros((0, 1, 1)))

        bboxes = bboxes_all[keep_mask]
        masks = getattr(pred_instances, "masks", None)
        if masks is not None:
            masks = masks[keep_mask]

        # Sort by score (descending), keeping masks index-aligned
        order = np.argsort(bboxes[:, 4])[::-1]
        bboxes = bboxes[order]
        if masks is not None:
            masks = masks[order]

        # Apply NMS
        keep_indices = nms(bboxes, DEFAULT_NMS_THR)
        bboxes = bboxes[keep_indices]
        if masks is not None:
            masks = masks[keep_indices]

        return InstanceData(bboxes=bboxes[:, :4], bbox_scores=bboxes[:, 4], masks=masks)
310
+
311
    def _run_pose_estimation(
        self,
        img: np.ndarray,
        det_instances: InstanceData,
        return_probmaps: bool = False,
    ) -> InstanceData:
        """Run top-down pose estimation with the PMPose wrapper.

        Args:
            img: BGR image.
            det_instances: Detections with 'bboxes', 'bbox_scores' and
                optional 'masks' conditioning the pose model.
            return_probmaps: If True, also attach per-keypoint heatmaps.

        Returns:
            InstanceData with 'keypoints' (N, 17, 3), 'keypoint_scores',
            'keypoint_vis', 'keypoint_prob', plus the input boxes/masks.
            'heatmaps' is attached only when requested and available.
        """
        bboxes = det_instances.bboxes
        masks = det_instances.masks

        # No detections: return an empty, correctly-shaped result.
        if len(bboxes) == 0:
            return InstanceData(
                keypoints=np.zeros((0, self.n_kpts_to_work_with, 3)),
                keypoint_scores=np.zeros((0, self.n_kpts_to_work_with)),
                bboxes=bboxes,
                bbox_scores=det_instances.bbox_scores,
                masks=masks,
            )

        # Call PMPose public API
        keypoints, probabilities, visibilities, heatmaps = self.pose_model.predict(
            img,
            bboxes,
            masks=masks,
            return_probmaps=return_probmaps,
        )

        # Restrict to first 17 COCO keypoints
        keypoints = keypoints[:, : self.n_kpts_to_work_with, :]
        probabilities = probabilities[:, : self.n_kpts_to_work_with]
        visibilities = visibilities[:, : self.n_kpts_to_work_with]

        if heatmaps is not None:
            heatmaps = heatmaps[:, : self.n_kpts_to_work_with, :, :]

        # Create InstanceData with results; keypoint_scores mirrors the
        # confidence channel of the keypoints array.
        result = InstanceData(
            keypoints=keypoints,
            keypoint_scores=keypoints[:, :, 2],
            bboxes=bboxes,
            bbox_scores=det_instances.bbox_scores,
            masks=masks,
            keypoint_vis=visibilities,
            keypoint_prob=probabilities,
        )

        if return_probmaps and heatmaps is not None:
            result.heatmaps = heatmaps

        return result
361
+
362
    def _refine_with_sam(
        self,
        img: np.ndarray,
        pose_instances: InstanceData,
        all_detections: Optional[InstanceData],
    ) -> tuple:
        """Perform Pose-NMS against previous detections, then SAM refinement.

        Args:
            img: BGR image.
            pose_instances: Poses estimated in the current BMP iteration.
            all_detections: Instances accumulated in earlier iterations
                (None on the first iteration).

        Returns:
            Tuple (new_detections, old_detections):
                new_detections: SAM-refined instances surviving pose-NMS,
                    or None when every new pose was suppressed.
                old_detections: previously accumulated instances surviving
                    pose-NMS (None when none survived; the unfiltered
                    'all_detections' when no new pose survived).
        """
        # Combine keypoints with scores
        keypoints_with_scores = pose_instances.keypoints

        # Perform Pose-NMS jointly over old + new instances; indices below
        # num_old_detections refer to old instances, the rest to new ones.
        all_keypoints = (
            keypoints_with_scores if all_detections is None else np.concatenate([all_detections.keypoints, keypoints_with_scores], axis=0)
        )
        all_bboxes = (
            pose_instances.bboxes if all_detections is None else np.concatenate([all_detections.bboxes, pose_instances.bboxes], axis=0)
        )

        num_valid_kpts = np.sum(all_keypoints[:, :, 2] > self.config.sam2.prompting.confidence_thr, axis=1)

        keep_indices = pose_nms(
            DotDict({"confidence_thr": self.config.sam2.prompting.confidence_thr, "oks_thr": self.config.oks_nms_thr}),
            image_kpts=all_keypoints,
            image_bboxes=all_bboxes,
            num_valid_kpts=num_valid_kpts,
        )

        # Sort by original index, then split back into old/new index spaces.
        keep_indices = sorted(keep_indices)
        num_old_detections = 0 if all_detections is None else len(all_detections.bboxes)
        keep_new_indices = [i - num_old_detections for i in keep_indices if i >= num_old_detections]
        keep_old_indices = [i for i in keep_indices if i < num_old_detections]

        if len(keep_new_indices) == 0:
            return None, all_detections

        # Filter new detections; instance score = mean keypoint score.
        new_dets = filter_instances(pose_instances, keep_new_indices)
        new_dets.scores = pose_instances.keypoint_scores[keep_new_indices].mean(axis=-1)

        old_dets = None
        if len(keep_old_indices) > 0:
            old_dets = filter_instances(all_detections, keep_old_indices)

        # Run SAM refinement prompted by the surviving new poses.
        new_detections = process_image_with_SAM(
            DotDict(self.config.sam2.prompting),
            img.copy(),
            self.sam2_model,
            new_dets,
            old_dets if old_dets is not None else None,
        )

        return new_detections, old_dets
415
+
416
+ def _mask_out_image(
417
+ self,
418
+ img: np.ndarray,
419
+ detections: InstanceData,
420
+ ) -> np.ndarray:
421
+ """Mask out detected instances from image for next iteration."""
422
+ masked_img = img.copy()
423
+ if hasattr(detections, "refined_masks") and detections.refined_masks is not None:
424
+ for mask in detections.refined_masks:
425
+ if mask is not None:
426
+ masked_img[mask.astype(bool)] = 0
427
+ return masked_img
428
+
429
+ def _format_result(
430
+ self,
431
+ detections: Optional[InstanceData],
432
+ img_shape: tuple,
433
+ ) -> Dict:
434
+ """Format detection results into standard output dict."""
435
+ if detections is None or len(detections.bboxes) == 0:
436
+ return {
437
+ "bboxes": np.zeros((0, 4)),
438
+ "masks": np.zeros((0, img_shape[0], img_shape[1])),
439
+ "keypoints": np.zeros((0, 17, 3)),
440
+ "presence": np.zeros((0, 17)),
441
+ "visibility": np.zeros((0, 17)),
442
+ }
443
+
444
+ # Extract refined masks if available
445
+ if hasattr(detections, "refined_masks") and detections.refined_masks is not None:
446
+ masks = detections.refined_masks
447
+ elif hasattr(detections, "pred_masks") and detections.pred_masks is not None:
448
+ masks = detections.pred_masks
449
+ elif hasattr(detections, "masks") and detections.masks is not None:
450
+ masks = detections.masks
451
+ else:
452
+ masks = np.zeros((len(detections.bboxes), img_shape[0], img_shape[1]))
453
+
454
+ return {
455
+ "bboxes": detections.bboxes,
456
+ "masks": masks,
457
+ "keypoints": detections.keypoints,
458
+ "presence": detections.keypoint_prob,
459
+ "visibility": detections.keypoint_vis,
460
+ }
461
+
462
    def visualize(
        self,
        image: Union[str, np.ndarray],
        result: Dict,
        save_path: Optional[str] = None,
        vis_type: str = "pose",
    ) -> np.ndarray:
        """
        Visualize BBoxMaskPose results on image.

        Args:
            image: Image path (str) or BGR numpy array.
            result: Result dict from predict().
            save_path: Optional path to save visualization.
            vis_type: Type of visualization ("pose" or "mask"; anything
                other than "mask" falls back to the pose rendering).

        Returns:
            np.ndarray: Visualization image (BGR).

        Raises:
            ValueError: If *image* is a path that cannot be read.
        """
        # Load image
        if isinstance(image, str):
            img = cv2.imread(image)
            if img is None:
                raise ValueError(f"Failed to load image from {image}")
        else:
            # Work on a copy so the caller's image is left untouched.
            img = image.copy()

        if vis_type == "mask":
            # Mask overlay; all-ones scores since only masks are drawn here.
            vis_img, _ = _visualize_predictions(
                img,
                bboxes=result["bboxes"],
                scores=np.ones(len(result["bboxes"])),
                masks=result["masks"],
                poses=result["keypoints"],
                vis_type="mask",
                mask_is_binary=True,
            )
            img = vis_img
        else:
            # Visualize using posevis_lite
            keypoints = result["keypoints"]
            keypoints = keypoints[:, :17, :]  # Use first 17 COCO keypoints
            img = pose_visualization(
                img,
                keypoints,
                width_multiplier=8,
                differ_individuals=True,
                keep_image_size=True,
            )

        # Save if requested
        if save_path is not None:
            cv2.imwrite(save_path, img)

        return img
{configs → bboxmaskpose/configs}/README.md RENAMED
File without changes
{configs → bboxmaskpose/configs}/bmp_D3.yaml RENAMED
@@ -1,3 +1,8 @@
 
 
 
 
 
1
  # BBoxMaskPose Hyperparameters from Experiment D3.
2
  # For details, see the paper: https://arxiv.org/abs/2412.01562, Tab 8. in the supplementary.
3
 
@@ -11,8 +16,10 @@ detector:
11
  det_prime_checkpoint: null
12
 
13
  pose_estimator:
14
- pose_config: 'mmpose/configs/MaskPose/ViTb-multi_mask.py'
15
- pose_checkpoint: 'https://huggingface.co/vrg-prague/BBoxMaskPose/resolve/main/MaskPose-b.pth'
 
 
16
 
17
  sam2:
18
  sam2_config: 'configs/samurai/sam2.1_hiera_b+.yaml' # Use SAMURAI as it has img_size 1024 (SAM-2.1 has 512)
 
1
+ ######################################################################################
2
+ ### THIS CONFIG IS DEPRECATED AND KEPT ONLY FOR REPRODUCTION OF BMPv1 EXPERIMENTS. ###
3
+ ### FOR BMPv2 EXPERIMENTS, PLEASE USE THE bmp_v2.yaml CONFIG. ###
4
+ ######################################################################################
5
+
6
  # BBoxMaskPose Hyperparameters from Experiment D3.
7
  # For details, see the paper: https://arxiv.org/abs/2412.01562, Tab 8. in the supplementary.
8
 
 
16
  det_prime_checkpoint: null
17
 
18
  pose_estimator:
19
+ pose_config: 'mmpose/configs/ProbMaskPose/PMPose-b-1.0.0.py'
20
+ pose_checkpoint: 'models/pose_estimators/PMPose-b-1.0.0.pth'
21
+ # pose_config: 'mmpose/configs/MaskPose/ViTb-multi_mask.py'
22
+ # pose_checkpoint: 'https://huggingface.co/vrg-prague/BBoxMaskPose/resolve/main/MaskPose-b.pth'
23
 
24
  sam2:
25
  sam2_config: 'configs/samurai/sam2.1_hiera_b+.yaml' # Use SAMURAI as it has img_size 1024 (SAM-2.1 has 512)
{configs → bboxmaskpose/configs}/bmp_J1.yaml RENAMED
@@ -1,3 +1,8 @@
 
 
 
 
 
1
  # BBoxMaskPose Hyperparameters from Experiment J1.
2
  # For details, see the paper: https://arxiv.org/abs/2412.01562, Tab 8. in the supplementary.
3
 
 
1
+ ######################################################################################
2
+ ### THIS CONFIG IS DEPRECATED AND KEPT ONLY FOR REPRODUCTION OF BMPv1 EXPERIMENTS. ###
3
+ ### FOR BMPv2 EXPERIMENTS, PLEASE USE THE bmp_v2.yaml CONFIG. ###
4
+ ######################################################################################
5
+
6
  # BBoxMaskPose Hyperparameters from Experiment J1.
7
  # For details, see the paper: https://arxiv.org/abs/2412.01562, Tab 8. in the supplementary.
8
 
bboxmaskpose/configs/bmp_v2.yaml ADDED
@@ -0,0 +1,34 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # This configuration is good for the BMP loop as was used for most of the experiments.
2
+ detector:
3
+ det_config: 'mmpose/configs/mmdet/rtmdet/rtmdet-ins_l_8xb32-300e_coco.py'
4
+ det_checkpoint: 'https://huggingface.co/vrg-prague/BBoxMaskPose/resolve/main/rtmdet-ins-l-mask.pth'
5
+
6
+ # Detectors D and D' could be different.
7
+ det_prime_config: null
8
+ det_prime_checkpoint: null
9
+
10
+ pose_estimator:
11
+ pose_config: 'mmpose/configs/ProbMaskPose/PMPose-b-1.0.0.py'
12
+ pose_checkpoint: 'https://huggingface.co/vrg-prague/BBoxMaskPose/resolve/main/PMPose/PMPose-b-1.0.0.pth'
13
+
14
+ sam2:
15
+ sam2_config: 'configs/sam-pose2seg/sam-pose2seg_hiera_b+.yaml'
16
+ sam2_checkpoint: 'https://huggingface.co/vrg-prague/BBoxMaskPose/resolve/main/SAM-pose2seg_hiera_b%2B.pt'
17
+ prompting:
18
+ batch: False
19
+ use_bbox: False
20
+ num_pos_keypoints: 3
21
+ num_pos_keypoints_if_crowd: 3
22
+ num_neg_keypoints: 0
23
+ confidence_thr: 0.5 # not used
24
+ visibility_thr: 0.5 # not used
25
+ selection_method: 'k_most_visible'
26
+ extend_bbox: False
27
+ pose_mask_consistency: False
28
+ crowd_by_max_iou: False # Determine if the instance is in the multi-body scenario. If yes, use different amount of keypoints and NO BBOX. If no, use bbox according to 'use_bbox' argument.
29
+ crop: False
30
+ exclusive_masks: True
31
+ ignore_small_bboxes: False
32
+
33
+ num_bmp_iters: 2
34
+ oks_nms_thr: 0.8
{demo → bboxmaskpose}/demo_utils.py RENAMED
@@ -1,3 +1,4 @@
 
1
  """
2
  Utilities for the BMP demo:
3
  - Visualization of detections, masks, and poses
@@ -18,9 +19,10 @@ import numpy as np
18
  from mmengine.logging import print_log
19
  from mmengine.structures import InstanceData
20
  from pycocotools import mask as Mask
21
- from sam2.distinctipy import get_colors
22
  from tqdm import tqdm
23
 
 
 
24
  ### Visualization hyperparameters
25
  MIN_CONTOUR_AREA: int = 50
26
  BBOX_WEIGHT: float = 0.9
@@ -38,6 +40,21 @@ except ImportError:
38
  from .posevis_lite import pose_visualization
39
 
40
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
41
  class DotDict(dict):
42
  """Dictionary with attribute access and nested dict wrapping."""
43
 
@@ -68,17 +85,7 @@ def filter_instances(instances: InstanceData, indices):
68
  return None
69
  data = {}
70
  # Attributes to filter
71
- for attr in [
72
- "bboxes",
73
- "bbox_scores",
74
- "keypoints",
75
- "keypoint_scores",
76
- "scores",
77
- "pred_masks",
78
- "refined_masks",
79
- "sam_scores",
80
- "sam_kpts",
81
- ]:
82
  if hasattr(instances, attr):
83
  arr = getattr(instances, attr)
84
  data[attr] = arr[indices] if arr is not None else None
@@ -95,17 +102,7 @@ def concat_instances(instances1: InstanceData, instances2: InstanceData):
95
  if instances2 is None:
96
  return instances1
97
  data = {}
98
- for attr in [
99
- "bboxes",
100
- "bbox_scores",
101
- "keypoints",
102
- "keypoint_scores",
103
- "scores",
104
- "pred_masks",
105
- "refined_masks",
106
- "sam_scores",
107
- "sam_kpts",
108
- ]:
109
  arr1 = getattr(instances1, attr, None)
110
  arr2 = getattr(instances2, attr, None)
111
  if arr1 is None and arr2 is None:
@@ -145,43 +142,20 @@ def _visualize_predictions(
145
  """
146
  vis_types = vis_type.split("+")
147
 
148
- # # Filter-out small detections to make the visualization more clear
149
- # new_bboxes = []
150
- # new_scores = []
151
- # new_masks = []
152
- # new_poses = []
153
- # size_thr = img.shape[0] * img.shape[1] * 0.01
154
- # for bbox, score, mask, pose in zip(bboxes, scores, masks, poses):
155
- # area = mask.sum() # Assume binary mask. OK for demo purposes
156
- # if area > size_thr:
157
- # new_bboxes.append(bbox)
158
- # new_scores.append(score)
159
- # new_masks.append(mask)
160
- # new_poses.append(pose)
161
- # bboxes = np.array(new_bboxes)
162
- # scores = np.array(new_scores)
163
- # masks = new_masks
164
- # poses = new_poses
165
-
166
  if mask_is_binary:
167
  poly_masks: List[Optional[List[np.ndarray]]] = []
168
  for binary_mask in masks:
169
  if binary_mask is not None:
170
- contours, _ = cv2.findContours(
171
- (binary_mask * 255).astype(np.uint8), cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE
172
- )
173
  polys = [cnt.flatten() for cnt in contours if cv2.contourArea(cnt) >= MIN_CONTOUR_AREA]
174
  else:
175
  polys = None
176
  poly_masks.append(polys)
177
  masks = poly_masks # type: ignore
178
 
179
- # Exclude white, black, and green colors from the palette as they are not distinctive
180
- colors = (np.array(get_colors(len(bboxes), exclude_colors=[(0, 1, 0), (.5, .5, .5), (0, 0, 0), (1, 1, 1)], rng=0)) * 255).astype(
181
- int
182
- )
183
-
184
-
185
  if "inv-mask" in vis_types:
186
  stencil = np.zeros_like(img)
187
 
@@ -272,9 +246,7 @@ def visualize_itteration(
272
  label = "BMP {:d}x: {}".format(iteration_idx + 1, vis_def["label"])
273
  cv2.putText(vis_img, label, (10, 30), cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 0, 0), 3)
274
  cv2.putText(vis_img, label, (10, 30), cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 255), 2)
275
- out_path = os.path.join(
276
- output_root, "{}_iter{}_{}.jpg".format(img_name, iteration_idx + 1, vis_def["label"].replace(" ", "_"))
277
- )
278
  cv2.imwrite(str(out_path), vis_img)
279
 
280
  # Show prompting keypoints
@@ -311,43 +283,6 @@ def visualize_itteration(
311
  return masked_out
312
 
313
 
314
- def visualize_demo(
315
- img: np.ndarray, detections: Any,
316
- ) -> Optional[np.ndarray]:
317
- """
318
- Generate and save visualization images for each BMP iteration.
319
-
320
- Args:
321
- img (np.ndarray): Original input image.
322
- detections: InstanceData containing bboxes, scores, masks, keypoints.
323
- iteration_idx (int): Current iteration index (0-based).
324
- output_root (Path): Directory to save output images.
325
- img_name (str): Base name of the image without extension.
326
- with_text (bool): Whether to overlay text labels.
327
-
328
- Returns:
329
- Optional[np.ndarray]: The masked-out image if generated, else None.
330
- """
331
- bboxes = detections.bboxes
332
- scores = detections.scores
333
- pred_masks = detections.pred_masks
334
- refined_masks = detections.refined_masks
335
- keypoints = detections.keypoints
336
-
337
- returns = []
338
- for vis_def in [
339
- {"type": "mask-out", "masks": refined_masks, "label": ""},
340
- {"type": "mask+pose", "masks": pred_masks, "label": "RTMDet-L"},
341
- {"type": "mask+pose", "masks": refined_masks, "label": "BMP"},
342
- ]:
343
- vis_img, colors = _visualize_predictions(
344
- img.copy(), bboxes, scores, vis_def["masks"], keypoints, vis_type=vis_def["type"], mask_is_binary=True
345
- )
346
- returns.append(vis_img)
347
-
348
- return returns
349
-
350
-
351
  def create_GIF(
352
  img_path: Path,
353
  output_root: Path,
@@ -419,7 +354,6 @@ def create_GIF(
419
  # Add 'before' and 'after' images
420
  after1_img = os.path.join(dirname, "{}_iter{}_Final_Poses.jpg".format(img_name_wo_ext, bmp_x))
421
  after2_img = os.path.join(dirname, "{}_iter{}_SAM_Masks.jpg".format(img_name_wo_ext, bmp_x))
422
- # gif_images.append(os.path.join(dirname, "black_image.jpg")) # Add black image at the end
423
  gif_images.append(after1_img)
424
  gif_images.append(after2_img)
425
  gif_images.append(os.path.join(dirname, "black_image.jpg")) # Add black image at the end
@@ -457,10 +391,7 @@ def create_GIF(
457
  right = "[{}:v]".format(i)
458
  out = "[v{}]".format(i)
459
  offset = (i - 1) * (display_dur + fade_dur) + display_dur
460
- parts.append(
461
- "{}{}xfade=transition=fade:".format(left, right)
462
- + "duration={}:offset={:.3f}{}".format(fade_dur, offset, out)
463
- )
464
  filter_complex = ";".join(parts)
465
 
466
  # 3. make MP4 slideshow
@@ -544,9 +475,7 @@ def create_GIF(
544
  print_log(f"GIF saved as '{gif_output_path}'", logger="current")
545
 
546
 
547
- def _update_bbox_by_mask(
548
- bbox: List[int], mask_poly: Optional[List[List[int]]], image_shape: Tuple[int, int, int]
549
- ) -> List[int]:
550
  """
551
  Adjust bounding box to tightly fit mask polygon.
552
 
@@ -591,11 +520,6 @@ def pose_nms(config: Any, image_kpts: np.ndarray, image_bboxes: np.ndarray, num_
591
  Returns:
592
  np.ndarray: Indices of kept instances.
593
  """
594
- # Sort image kpts by average score - lowest first
595
- # scores = image_kpts[:, :, 2].mean(axis=1)
596
- # sort_idx = np.argsort(scores)
597
- # image_kpts = image_kpts[sort_idx, :, :]
598
-
599
  # Compute OKS between all pairs of poses
600
  oks_matrix = np.zeros((image_kpts.shape[0], image_kpts.shape[0]))
601
  for i in range(image_kpts.shape[0]):
@@ -611,8 +535,7 @@ def pose_nms(config: Any, image_kpts: np.ndarray, image_bboxes: np.ndarray, num_
611
  dt = {"keypoints": image_kpts[j].copy(), "bbox": gt_bbox_xyxy}
612
  gt["keypoints"][:, 2] = (gt["keypoints"][:, 2] > config.confidence_thr) * 2
613
  oks = compute_oks(gt, dt)
614
- if oks > 1:
615
- breakpoint()
616
  oks_matrix[i, j] = oks
617
 
618
  np.fill_diagonal(oks_matrix, -1)
@@ -653,13 +576,10 @@ def compute_oks(gt: Dict[str, Any], dt: Dict[str, Any], use_area: bool = True, p
653
  Returns:
654
  float: OKS score or mean OKS.
655
  """
656
- sigmas = (
657
- np.array([0.26, 0.25, 0.25, 0.35, 0.35, 0.79, 0.79, 0.72, 0.72, 0.62, 0.62, 1.07, 1.07, 0.87, 0.87, 0.89, 0.89])
658
- / 10.0
659
- )
660
  vars = (sigmas * 2) ** 2
661
  k = len(sigmas)
662
- visibility_condition = lambda x: x > 0
663
  g = np.array(gt["keypoints"]).reshape(k, 3)
664
  xg = g[:, 0]
665
  yg = g[:, 1]
 
1
+ # Copyright (c) authors of BBoxMaskPose (BMPv2). All rights reserved.
2
  """
3
  Utilities for the BMP demo:
4
  - Visualization of detections, masks, and poses
 
19
  from mmengine.logging import print_log
20
  from mmengine.structures import InstanceData
21
  from pycocotools import mask as Mask
 
22
  from tqdm import tqdm
23
 
24
+ from bboxmaskpose.sam2.distinctipy import get_colors
25
+
26
  ### Visualization hyperparameters
27
  MIN_CONTOUR_AREA: int = 50
28
  BBOX_WEIGHT: float = 0.9
 
40
  from .posevis_lite import pose_visualization
41
 
42
 
43
+ WHITELIST_ATTRIBUTES = [
44
+ "bboxes",
45
+ "bbox_scores",
46
+ "keypoints",
47
+ "keypoint_scores",
48
+ "scores",
49
+ "pred_masks",
50
+ "refined_masks",
51
+ "sam_scores",
52
+ "sam_kpts",
53
+ "keypoint_vis",
54
+ "keypoint_prob",
55
+ ]
56
+
57
+
58
  class DotDict(dict):
59
  """Dictionary with attribute access and nested dict wrapping."""
60
 
 
85
  return None
86
  data = {}
87
  # Attributes to filter
88
+ for attr in WHITELIST_ATTRIBUTES:
 
 
 
 
 
 
 
 
 
 
89
  if hasattr(instances, attr):
90
  arr = getattr(instances, attr)
91
  data[attr] = arr[indices] if arr is not None else None
 
102
  if instances2 is None:
103
  return instances1
104
  data = {}
105
+ for attr in WHITELIST_ATTRIBUTES:
 
 
 
 
 
 
 
 
 
 
106
  arr1 = getattr(instances1, attr, None)
107
  arr2 = getattr(instances2, attr, None)
108
  if arr1 is None and arr2 is None:
 
142
  """
143
  vis_types = vis_type.split("+")
144
 
145
+ # Exclude white, black, and green colors from the palette as they are not distinctive
146
+ colors = (np.array(get_colors(len(bboxes), exclude_colors=[(0, 1, 0), (0, 0, 0), (1, 1, 1)], rng=0)) * 255).astype(int)
147
+
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
148
  if mask_is_binary:
149
  poly_masks: List[Optional[List[np.ndarray]]] = []
150
  for binary_mask in masks:
151
  if binary_mask is not None:
152
+ contours, _ = cv2.findContours((binary_mask * 255).astype(np.uint8), cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
 
 
153
  polys = [cnt.flatten() for cnt in contours if cv2.contourArea(cnt) >= MIN_CONTOUR_AREA]
154
  else:
155
  polys = None
156
  poly_masks.append(polys)
157
  masks = poly_masks # type: ignore
158
 
 
 
 
 
 
 
159
  if "inv-mask" in vis_types:
160
  stencil = np.zeros_like(img)
161
 
 
246
  label = "BMP {:d}x: {}".format(iteration_idx + 1, vis_def["label"])
247
  cv2.putText(vis_img, label, (10, 30), cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 0, 0), 3)
248
  cv2.putText(vis_img, label, (10, 30), cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 255), 2)
249
+ out_path = os.path.join(output_root, "{}_iter{}_{}.jpg".format(img_name, iteration_idx + 1, vis_def["label"].replace(" ", "_")))
 
 
250
  cv2.imwrite(str(out_path), vis_img)
251
 
252
  # Show prompting keypoints
 
283
  return masked_out
284
 
285
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
286
  def create_GIF(
287
  img_path: Path,
288
  output_root: Path,
 
354
  # Add 'before' and 'after' images
355
  after1_img = os.path.join(dirname, "{}_iter{}_Final_Poses.jpg".format(img_name_wo_ext, bmp_x))
356
  after2_img = os.path.join(dirname, "{}_iter{}_SAM_Masks.jpg".format(img_name_wo_ext, bmp_x))
 
357
  gif_images.append(after1_img)
358
  gif_images.append(after2_img)
359
  gif_images.append(os.path.join(dirname, "black_image.jpg")) # Add black image at the end
 
391
  right = "[{}:v]".format(i)
392
  out = "[v{}]".format(i)
393
  offset = (i - 1) * (display_dur + fade_dur) + display_dur
394
+ parts.append("{}{}xfade=transition=fade:".format(left, right) + "duration={}:offset={:.3f}{}".format(fade_dur, offset, out))
 
 
 
395
  filter_complex = ";".join(parts)
396
 
397
  # 3. make MP4 slideshow
 
475
  print_log(f"GIF saved as '{gif_output_path}'", logger="current")
476
 
477
 
478
+ def _update_bbox_by_mask(bbox: List[int], mask_poly: Optional[List[List[int]]], image_shape: Tuple[int, int, int]) -> List[int]:
 
 
479
  """
480
  Adjust bounding box to tightly fit mask polygon.
481
 
 
520
  Returns:
521
  np.ndarray: Indices of kept instances.
522
  """
 
 
 
 
 
523
  # Compute OKS between all pairs of poses
524
  oks_matrix = np.zeros((image_kpts.shape[0], image_kpts.shape[0]))
525
  for i in range(image_kpts.shape[0]):
 
535
  dt = {"keypoints": image_kpts[j].copy(), "bbox": gt_bbox_xyxy}
536
  gt["keypoints"][:, 2] = (gt["keypoints"][:, 2] > config.confidence_thr) * 2
537
  oks = compute_oks(gt, dt)
538
+ assert oks <= 1.0, f"OKS value {oks} exceeds 1.0, which indicates a bug in compute_oks"
 
539
  oks_matrix[i, j] = oks
540
 
541
  np.fill_diagonal(oks_matrix, -1)
 
576
  Returns:
577
  float: OKS score or mean OKS.
578
  """
579
+ sigmas = np.array([0.26, 0.25, 0.25, 0.35, 0.35, 0.79, 0.79, 0.72, 0.72, 0.62, 0.62, 1.07, 1.07, 0.87, 0.87, 0.89, 0.89]) / 10.0
 
 
 
580
  vars = (sigmas * 2) ** 2
581
  k = len(sigmas)
582
+ visibility_condition = lambda x: x > 0.3
583
  g = np.array(gt["keypoints"]).reshape(k, 3)
584
  xg = g[:, 0]
585
  yg = g[:, 1]
{demo → bboxmaskpose}/posevis_lite.py RENAMED
@@ -1,9 +1,13 @@
 
 
1
  import os
2
  from typing import Any, Dict, List, Optional, Tuple, Union
3
 
4
  import cv2
5
  import numpy as np
6
 
 
 
7
  NEUTRAL_COLOR = (52, 235, 107)
8
 
9
  LEFT_ARM_COLOR = (216, 235, 52)
@@ -85,14 +89,6 @@ def _draw_line(
85
  start = np.array(start)[:2]
86
  stop = np.array(stop)[:2]
87
  if line_type.lower() == "solid":
88
- img = cv2.line(
89
- img,
90
- (int(start[0]), int(start[1])),
91
- (int(stop[0]), int(stop[1])),
92
- color=(0, 0, 0),
93
- thickness=thickness+1,
94
- lineType=cv2.LINE_AA,
95
- )
96
  img = cv2.line(
97
  img,
98
  (int(start[0]), int(start[1])),
@@ -193,7 +189,14 @@ def pose_visualization(
193
  if not isinstance(color, (list, tuple)):
194
  color = [color for keypoint in keypoints]
195
  else:
196
- color = [None for keypoint in keypoints]
 
 
 
 
 
 
 
197
 
198
  max_padding = [0, 0, 0, 0]
199
  for keypoint, clr in zip(keypoints, color):
@@ -243,12 +246,9 @@ def pose_visualization(
243
  # If conf >= confidence_thr: conf = 2
244
  vis_is_float = np.any(np.logical_and(keypoints[:, -1] > 0, keypoints[:, -1] < 1))
245
  if keypoints.shape[1] == 3 and vis_is_float:
246
- # print("before", keypoints[:, -1])
247
  lower_idx = keypoints[:, -1] < confidence_thr
248
  keypoints[lower_idx, -1] = 1
249
  keypoints[~lower_idx, -1] = 2
250
- # print("after", keypoints[:, -1])
251
- # print("-"*20)
252
 
253
  # All visibility values should be ints
254
  keypoints[:, -1] = keypoints[:, -1].astype(int)
 
1
+ # Copyright (c) authors of BBoxMaskPose (BMPv2). All rights reserved.
2
+
3
  import os
4
  from typing import Any, Dict, List, Optional, Tuple, Union
5
 
6
  import cv2
7
  import numpy as np
8
 
9
+ from bboxmaskpose.sam2.distinctipy import get_colors
10
+
11
  NEUTRAL_COLOR = (52, 235, 107)
12
 
13
  LEFT_ARM_COLOR = (216, 235, 52)
 
89
  start = np.array(start)[:2]
90
  stop = np.array(stop)[:2]
91
  if line_type.lower() == "solid":
 
 
 
 
 
 
 
 
92
  img = cv2.line(
93
  img,
94
  (int(start[0]), int(start[1])),
 
189
  if not isinstance(color, (list, tuple)):
190
  color = [color for keypoint in keypoints]
191
  else:
192
+ if differ_individuals:
193
+ color = (
194
+ (np.array(get_colors(len(keypoints), exclude_colors=[(0, 1, 0), (0, 0, 0), (1, 1, 1)], rng=0)) * 255)
195
+ .astype(int)
196
+ .tolist()
197
+ )
198
+ else:
199
+ color = [None for keypoint in keypoints]
200
 
201
  max_padding = [0, 0, 0, 0]
202
  for keypoint, clr in zip(keypoints, color):
 
246
  # If conf >= confidence_thr: conf = 2
247
  vis_is_float = np.any(np.logical_and(keypoints[:, -1] > 0, keypoints[:, -1] < 1))
248
  if keypoints.shape[1] == 3 and vis_is_float:
 
249
  lower_idx = keypoints[:, -1] < confidence_thr
250
  keypoints[lower_idx, -1] = 1
251
  keypoints[~lower_idx, -1] = 2
 
 
252
 
253
  # All visibility values should be ints
254
  keypoints[:, -1] = keypoints[:, -1].astype(int)
{sam2 → bboxmaskpose/sam2}/__init__.py RENAMED
@@ -8,4 +8,4 @@ from hydra import initialize_config_module
8
  from hydra.core.global_hydra import GlobalHydra
9
 
10
  if not GlobalHydra.instance().is_initialized():
11
- initialize_config_module("sam2", version_base="1.2")
 
8
  from hydra.core.global_hydra import GlobalHydra
9
 
10
  if not GlobalHydra.instance().is_initialized():
11
+ initialize_config_module("bboxmaskpose.sam2", version_base="1.2")
{sam2 → bboxmaskpose/sam2}/automatic_mask_generator.py RENAMED
@@ -11,9 +11,10 @@ import numpy as np
11
  import torch
12
  from torchvision.ops.boxes import batched_nms, box_area # type: ignore
13
 
14
- from sam2.modeling.sam2_base import SAM2Base
15
- from sam2.sam2_image_predictor import SAM2ImagePredictor
16
- from sam2.utils.amg import (
 
17
  area_from_rle,
18
  batch_iterator,
19
  batched_mask_to_box,
@@ -24,7 +25,6 @@ from sam2.utils.amg import (
24
  generate_crop_boxes,
25
  is_box_near_crop_edge,
26
  mask_to_rle_pytorch,
27
- MaskData,
28
  remove_small_regions,
29
  rle_to_mask,
30
  uncrop_boxes_xyxy,
@@ -103,9 +103,7 @@ class SAM2AutomaticMaskGenerator:
103
  multimask_output (bool): Whether to output multimask at each point of the grid.
104
  """
105
 
106
- assert (points_per_side is None) != (
107
- point_grids is None
108
- ), "Exactly one of points_per_side or point_grid must be provided."
109
  if points_per_side is not None:
110
  self.point_grids = build_all_layer_point_grids(
111
  points_per_side,
@@ -161,7 +159,7 @@ class SAM2AutomaticMaskGenerator:
161
  Returns:
162
  (SAM2AutomaticMaskGenerator): The loaded model.
163
  """
164
- from sam2.build_sam import build_sam2_hf
165
 
166
  sam_model = build_sam2_hf(model_id, **kwargs)
167
  return cls(sam_model, **kwargs)
@@ -197,9 +195,7 @@ class SAM2AutomaticMaskGenerator:
197
 
198
  # Encode masks
199
  if self.output_mode == "coco_rle":
200
- mask_data["segmentations"] = [
201
- coco_encode_rle(rle) for rle in mask_data["rles"]
202
- ]
203
  elif self.output_mode == "binary_mask":
204
  mask_data["segmentations"] = [rle_to_mask(rle) for rle in mask_data["rles"]]
205
  else:
@@ -223,9 +219,7 @@ class SAM2AutomaticMaskGenerator:
223
 
224
  def _generate_masks(self, image: np.ndarray) -> MaskData:
225
  orig_size = image.shape[:2]
226
- crop_boxes, layer_idxs = generate_crop_boxes(
227
- orig_size, self.crop_n_layers, self.crop_overlap_ratio
228
- )
229
 
230
  # Iterate over image crops
231
  data = MaskData()
@@ -268,9 +262,7 @@ class SAM2AutomaticMaskGenerator:
268
  # Generate masks for this crop in batches
269
  data = MaskData()
270
  for (points,) in batch_iterator(self.points_per_batch, points_for_image):
271
- batch_data = self._process_batch(
272
- points, cropped_im_size, crop_box, orig_size, normalize=True
273
- )
274
  data.cat(batch_data)
275
  del batch_data
276
  self.predictor.reset_predictor()
@@ -302,15 +294,9 @@ class SAM2AutomaticMaskGenerator:
302
  orig_h, orig_w = orig_size
303
 
304
  # Run model on this batch
305
- points = torch.as_tensor(
306
- points, dtype=torch.float32, device=self.predictor.device
307
- )
308
- in_points = self.predictor._transforms.transform_coords(
309
- points, normalize=normalize, orig_hw=im_size
310
- )
311
- in_labels = torch.ones(
312
- in_points.shape[0], dtype=torch.int, device=in_points.device
313
- )
314
  masks, iou_preds, low_res_masks = self.predictor._predict(
315
  in_points[:, None, :],
316
  in_labels[:, None],
@@ -334,23 +320,15 @@ class SAM2AutomaticMaskGenerator:
334
  data.filter(keep_mask)
335
 
336
  # Calculate and filter by stability score
337
- data["stability_score"] = calculate_stability_score(
338
- data["masks"], self.mask_threshold, self.stability_score_offset
339
- )
340
  if self.stability_score_thresh > 0.0:
341
  keep_mask = data["stability_score"] >= self.stability_score_thresh
342
  data.filter(keep_mask)
343
  else:
344
  # One step refinement using previous mask predictions
345
- in_points = self.predictor._transforms.transform_coords(
346
- data["points"], normalize=normalize, orig_hw=im_size
347
- )
348
- labels = torch.ones(
349
- in_points.shape[0], dtype=torch.int, device=in_points.device
350
- )
351
- masks, ious = self.refine_with_m2m(
352
- in_points, labels, data["low_res_masks"], self.points_per_batch
353
- )
354
  data["masks"] = masks.squeeze(1)
355
  data["iou_preds"] = ious.squeeze(1)
356
 
@@ -358,9 +336,7 @@ class SAM2AutomaticMaskGenerator:
358
  keep_mask = data["iou_preds"] > self.pred_iou_thresh
359
  data.filter(keep_mask)
360
 
361
- data["stability_score"] = calculate_stability_score(
362
- data["masks"], self.mask_threshold, self.stability_score_offset
363
- )
364
  if self.stability_score_thresh > 0.0:
365
  keep_mask = data["stability_score"] >= self.stability_score_thresh
366
  data.filter(keep_mask)
@@ -370,9 +346,7 @@ class SAM2AutomaticMaskGenerator:
370
  data["boxes"] = batched_mask_to_box(data["masks"])
371
 
372
  # Filter boxes that touch crop boundaries
373
- keep_mask = ~is_box_near_crop_edge(
374
- data["boxes"], crop_box, [0, 0, orig_w, orig_h]
375
- )
376
  if not torch.all(keep_mask):
377
  data.filter(keep_mask)
378
 
@@ -384,9 +358,7 @@ class SAM2AutomaticMaskGenerator:
384
  return data
385
 
386
  @staticmethod
387
- def postprocess_small_regions(
388
- mask_data: MaskData, min_area: int, nms_thresh: float
389
- ) -> MaskData:
390
  """
391
  Removes small disconnected regions and holes in masks, then reruns
392
  box NMS to remove any new duplicates.
@@ -438,9 +410,7 @@ class SAM2AutomaticMaskGenerator:
438
  new_masks = []
439
  new_iou_preds = []
440
 
441
- for cur_points, cur_point_labels, low_res_mask in batch_iterator(
442
- points_per_batch, points, point_labels, low_res_masks
443
- ):
444
  best_masks, best_iou_preds, _ = self.predictor._predict(
445
  cur_points[:, None, :],
446
  cur_point_labels[:, None],
 
11
  import torch
12
  from torchvision.ops.boxes import batched_nms, box_area # type: ignore
13
 
14
+ from bboxmaskpose.sam2.modeling.sam2_base import SAM2Base
15
+ from bboxmaskpose.sam2.sam2_image_predictor import SAM2ImagePredictor
16
+ from bboxmaskpose.sam2.utils.amg import (
17
+ MaskData,
18
  area_from_rle,
19
  batch_iterator,
20
  batched_mask_to_box,
 
25
  generate_crop_boxes,
26
  is_box_near_crop_edge,
27
  mask_to_rle_pytorch,
 
28
  remove_small_regions,
29
  rle_to_mask,
30
  uncrop_boxes_xyxy,
 
103
  multimask_output (bool): Whether to output multimask at each point of the grid.
104
  """
105
 
106
+ assert (points_per_side is None) != (point_grids is None), "Exactly one of points_per_side or point_grid must be provided."
 
 
107
  if points_per_side is not None:
108
  self.point_grids = build_all_layer_point_grids(
109
  points_per_side,
 
159
  Returns:
160
  (SAM2AutomaticMaskGenerator): The loaded model.
161
  """
162
+ from bboxmaskpose.sam2.build_sam import build_sam2_hf
163
 
164
  sam_model = build_sam2_hf(model_id, **kwargs)
165
  return cls(sam_model, **kwargs)
 
195
 
196
  # Encode masks
197
  if self.output_mode == "coco_rle":
198
+ mask_data["segmentations"] = [coco_encode_rle(rle) for rle in mask_data["rles"]]
 
 
199
  elif self.output_mode == "binary_mask":
200
  mask_data["segmentations"] = [rle_to_mask(rle) for rle in mask_data["rles"]]
201
  else:
 
219
 
220
  def _generate_masks(self, image: np.ndarray) -> MaskData:
221
  orig_size = image.shape[:2]
222
+ crop_boxes, layer_idxs = generate_crop_boxes(orig_size, self.crop_n_layers, self.crop_overlap_ratio)
 
 
223
 
224
  # Iterate over image crops
225
  data = MaskData()
 
262
  # Generate masks for this crop in batches
263
  data = MaskData()
264
  for (points,) in batch_iterator(self.points_per_batch, points_for_image):
265
+ batch_data = self._process_batch(points, cropped_im_size, crop_box, orig_size, normalize=True)
 
 
266
  data.cat(batch_data)
267
  del batch_data
268
  self.predictor.reset_predictor()
 
294
  orig_h, orig_w = orig_size
295
 
296
  # Run model on this batch
297
+ points = torch.as_tensor(points, dtype=torch.float32, device=self.predictor.device)
298
+ in_points = self.predictor._transforms.transform_coords(points, normalize=normalize, orig_hw=im_size)
299
+ in_labels = torch.ones(in_points.shape[0], dtype=torch.int, device=in_points.device)
 
 
 
 
 
 
300
  masks, iou_preds, low_res_masks = self.predictor._predict(
301
  in_points[:, None, :],
302
  in_labels[:, None],
 
320
  data.filter(keep_mask)
321
 
322
  # Calculate and filter by stability score
323
+ data["stability_score"] = calculate_stability_score(data["masks"], self.mask_threshold, self.stability_score_offset)
 
 
324
  if self.stability_score_thresh > 0.0:
325
  keep_mask = data["stability_score"] >= self.stability_score_thresh
326
  data.filter(keep_mask)
327
  else:
328
  # One step refinement using previous mask predictions
329
+ in_points = self.predictor._transforms.transform_coords(data["points"], normalize=normalize, orig_hw=im_size)
330
+ labels = torch.ones(in_points.shape[0], dtype=torch.int, device=in_points.device)
331
+ masks, ious = self.refine_with_m2m(in_points, labels, data["low_res_masks"], self.points_per_batch)
 
 
 
 
 
 
332
  data["masks"] = masks.squeeze(1)
333
  data["iou_preds"] = ious.squeeze(1)
334
 
 
336
  keep_mask = data["iou_preds"] > self.pred_iou_thresh
337
  data.filter(keep_mask)
338
 
339
+ data["stability_score"] = calculate_stability_score(data["masks"], self.mask_threshold, self.stability_score_offset)
 
 
340
  if self.stability_score_thresh > 0.0:
341
  keep_mask = data["stability_score"] >= self.stability_score_thresh
342
  data.filter(keep_mask)
 
346
  data["boxes"] = batched_mask_to_box(data["masks"])
347
 
348
  # Filter boxes that touch crop boundaries
349
+ keep_mask = ~is_box_near_crop_edge(data["boxes"], crop_box, [0, 0, orig_w, orig_h])
 
 
350
  if not torch.all(keep_mask):
351
  data.filter(keep_mask)
352
 
 
358
  return data
359
 
360
  @staticmethod
361
+ def postprocess_small_regions(mask_data: MaskData, min_area: int, nms_thresh: float) -> MaskData:
 
 
362
  """
363
  Removes small disconnected regions and holes in masks, then reruns
364
  box NMS to remove any new duplicates.
 
410
  new_masks = []
411
  new_iou_preds = []
412
 
413
+ for cur_points, cur_point_labels, low_res_mask in batch_iterator(points_per_batch, points, point_labels, low_res_masks):
 
 
414
  best_masks, best_iou_preds, _ = self.predictor._predict(
415
  cur_points[:, None, :],
416
  cur_point_labels[:, None],
{sam2 → bboxmaskpose/sam2}/benchmark.py RENAMED
@@ -11,7 +11,7 @@ import numpy as np
11
  import torch
12
  from tqdm import tqdm
13
 
14
- from sam2.build_sam import build_sam2_video_predictor
15
 
16
  # Only cuda supported
17
  assert torch.cuda.is_available()
@@ -28,19 +28,13 @@ sam2_checkpoint = "checkpoints/sam2.1_hiera_base_plus.pt"
28
  model_cfg = "configs/sam2.1/sam2.1_hiera_b+.yaml"
29
 
30
  # Build video predictor with vos_optimized=True setting
31
- predictor = build_sam2_video_predictor(
32
- model_cfg, sam2_checkpoint, device=device, vos_optimized=True
33
- )
34
 
35
 
36
  # Initialize with video
37
  video_dir = "notebooks/videos/bedroom"
38
  # scan all the JPEG frame names in this directory
39
- frame_names = [
40
- p
41
- for p in os.listdir(video_dir)
42
- if os.path.splitext(p)[-1] in [".jpg", ".jpeg", ".JPG", ".JPEG"]
43
- ]
44
  frame_names.sort(key=lambda p: int(os.path.splitext(p)[0]))
45
  inference_state = predictor.init_state(video_path=video_dir)
46
 
 
11
  import torch
12
  from tqdm import tqdm
13
 
14
+ from bboxmaskpose.sam2.build_sam import build_sam2_video_predictor
15
 
16
  # Only cuda supported
17
  assert torch.cuda.is_available()
 
28
  model_cfg = "configs/sam2.1/sam2.1_hiera_b+.yaml"
29
 
30
  # Build video predictor with vos_optimized=True setting
31
+ predictor = build_sam2_video_predictor(model_cfg, sam2_checkpoint, device=device, vos_optimized=True)
 
 
32
 
33
 
34
  # Initialize with video
35
  video_dir = "notebooks/videos/bedroom"
36
  # scan all the JPEG frame names in this directory
37
+ frame_names = [p for p in os.listdir(video_dir) if os.path.splitext(p)[-1] in [".jpg", ".jpeg", ".JPG", ".JPEG"]]
 
 
 
 
38
  frame_names.sort(key=lambda p: int(os.path.splitext(p)[0]))
39
  inference_state = predictor.init_state(video_path=video_dir)
40
 
{sam2 → bboxmaskpose/sam2}/build_sam.py RENAMED
@@ -6,14 +6,16 @@
6
 
7
  import logging
8
  import os
 
9
 
10
  import torch
11
- from hydra import compose
 
 
 
12
  from hydra.utils import instantiate
13
  from omegaconf import OmegaConf
14
 
15
- import sam2
16
-
17
  # Check if the user is running Python from the parent directory of the sam2 repo
18
  # (i.e. the directory where this repo is cloned into) -- this is not supported since
19
  # it could shadow the sam2 package and cause issues.
@@ -86,13 +88,26 @@ def build_sam2(
86
  "++model.sam_mask_decoder_extra_args.dynamic_multimask_stability_delta=0.05",
87
  "++model.sam_mask_decoder_extra_args.dynamic_multimask_stability_thresh=0.98",
88
  ]
 
 
 
 
 
 
 
 
 
 
 
89
  # Read config and init model
90
  try:
91
- cfg = compose(config_name=config_file)
 
 
92
  except Exception as e:
93
  logging.error(f"Error loading config: {e}")
94
  cfg = compose(config_name=config_file, overrides=hydra_overrides_extra)
95
-
96
  OmegaConf.resolve(cfg)
97
  model = instantiate(cfg.model, _recursive_=True)
98
  _load_checkpoint(model, ckpt_path)
@@ -161,14 +176,23 @@ def build_sam2_hf(model_id, **kwargs):
161
 
162
  def build_sam2_video_predictor_hf(model_id, **kwargs):
163
  config_name, ckpt_path = _hf_download(model_id)
164
- return build_sam2_video_predictor(
165
- config_file=config_name, ckpt_path=ckpt_path, **kwargs
166
- )
 
 
167
 
168
 
169
  def _load_checkpoint(model, ckpt_path):
170
  if ckpt_path is not None:
171
- sd = torch.load(ckpt_path, map_location="cpu", weights_only=True)["model"]
 
 
 
 
 
 
 
172
  missing_keys, unexpected_keys = model.load_state_dict(sd)
173
  if missing_keys:
174
  logging.error(missing_keys)
@@ -176,4 +200,5 @@ def _load_checkpoint(model, ckpt_path):
176
  if unexpected_keys:
177
  logging.error(unexpected_keys)
178
  raise RuntimeError()
 
179
  logging.info("Loaded checkpoint sucessfully")
 
6
 
7
  import logging
8
  import os
9
+ import urllib.parse as urlparse
10
 
11
  import torch
12
+
13
+ import bboxmaskpose.sam2 as sam2
14
+ from hydra import compose, initialize_config_dir
15
+ from hydra.core.global_hydra import GlobalHydra
16
  from hydra.utils import instantiate
17
  from omegaconf import OmegaConf
18
 
 
 
19
  # Check if the user is running Python from the parent directory of the sam2 repo
20
  # (i.e. the directory where this repo is cloned into) -- this is not supported since
21
  # it could shadow the sam2 package and cause issues.
 
88
  "++model.sam_mask_decoder_extra_args.dynamic_multimask_stability_delta=0.05",
89
  "++model.sam_mask_decoder_extra_args.dynamic_multimask_stability_thresh=0.98",
90
  ]
91
+
92
+ # IMPORTANT: compose() requires Hydra to be initialized with a config source.
93
+ # Also important if build_sam2() can be called multiple times in one process.
94
+ GlobalHydra.instance().clear()
95
+
96
+ # Point Hydra at the directory that contains the SAM2 yaml configs
97
+ config_dir = os.path.dirname(config_file)
98
+
99
+ # Hydra expects config_name WITHOUT .yaml
100
+ config_name = os.path.basename(config_file).replace(".yaml", "")
101
+
102
  # Read config and init model
103
  try:
104
+ with initialize_config_dir(version_base=None, config_dir=str(config_dir)):
105
+ cfg = compose(config_name=config_name, overrides=hydra_overrides_extra)
106
+ # cfg = compose(config_name=config_file)
107
  except Exception as e:
108
  logging.error(f"Error loading config: {e}")
109
  cfg = compose(config_name=config_file, overrides=hydra_overrides_extra)
110
+
111
  OmegaConf.resolve(cfg)
112
  model = instantiate(cfg.model, _recursive_=True)
113
  _load_checkpoint(model, ckpt_path)
 
176
 
177
  def build_sam2_video_predictor_hf(model_id, **kwargs):
178
  config_name, ckpt_path = _hf_download(model_id)
179
+ return build_sam2_video_predictor(config_file=config_name, ckpt_path=ckpt_path, **kwargs)
180
+
181
+
182
+ def _is_url(path: str) -> bool:
183
+ return urlparse.urlparse(path).scheme != ""
184
 
185
 
186
  def _load_checkpoint(model, ckpt_path):
187
  if ckpt_path is not None:
188
+
189
+ if _is_url(ckpt_path):
190
+ sd = torch.hub.load_state_dict_from_url(ckpt_path, map_location="cpu", weights_only=True)["model"]
191
+ elif os.path.exists(ckpt_path):
192
+ sd = torch.load(ckpt_path, map_location="cpu", weights_only=True)["model"]
193
+ else:
194
+ raise FileNotFoundError(f"Checkpoint not found: {ckpt_path}")
195
+
196
  missing_keys, unexpected_keys = model.load_state_dict(sd)
197
  if missing_keys:
198
  logging.error(missing_keys)
 
200
  if unexpected_keys:
201
  logging.error(unexpected_keys)
202
  raise RuntimeError()
203
+
204
  logging.info("Loaded checkpoint sucessfully")
{sam2 → bboxmaskpose/sam2}/colorblind.py RENAMED
@@ -1,7 +1,10 @@
 
 
1
  """
2
  Adapted from "The Color Blind Simulation function" by Matthew Wickline
3
  and the Human - Computer Interaction Resource Network (http://hcirn.com/), 2000 - 2001.
4
  """
 
5
  import numpy as np
6
 
7
  rBlind = {
@@ -261,16 +264,13 @@ def simulate_colors(colors, colorblind_type="Deuteranomaly", one_row=None, show=
261
  :return:
262
  """
263
  import matplotlib.pyplot as plt
264
-
265
  from distinctipy import distinctipy
266
 
267
  filtered_colors = [colorblind_filter(color, colorblind_type) for color in colors]
268
 
269
  fig, axes = plt.subplots(1, 2, figsize=(8, 4))
270
 
271
- distinctipy.color_swatch(
272
- colors, ax=axes[0], one_row=one_row, title="Viewed with Normal Sight"
273
- )
274
 
275
  distinctipy.color_swatch(
276
  filtered_colors,
@@ -324,30 +324,22 @@ def simulate_clusters(
324
  """
325
  import matplotlib.pyplot as plt
326
  import pandas as pd
327
-
328
  from distinctipy import distinctipy
329
 
330
  if dataset not in ("s1", "s2", "s3", "s4", "a1", "a2", "a3", "b1"):
331
  raise ValueError("dataset must be s1, s2, s3, s4, a1, a2, a3 or b1")
332
 
333
- URL = (
334
- "https://raw.githubusercontent.com/alan-turing-institute/distinctipy/"
335
- "main/distinctipy/datasets/"
336
- )
337
  df = pd.read_csv(URL + dataset + ".csv")
338
 
339
  if colorblind_distinct:
340
- orig_colors = distinctipy.get_colors(
341
- df["cluster"].nunique(), colorblind_type=colorblind_type
342
- )
343
  else:
344
  orig_colors = distinctipy.get_colors(df["cluster"].nunique())
345
 
346
  orig_cmap = distinctipy.get_colormap(orig_colors)
347
 
348
- filtered_colors = [
349
- colorblind_filter(color, colorblind_type) for color in orig_colors
350
- ]
351
  filtered_cmap = distinctipy.get_colormap(filtered_colors)
352
 
353
  fig, axes = plt.subplots(1, 2, figsize=(10, 5))
@@ -376,4 +368,4 @@ def _main():
376
 
377
 
378
  if __name__ == "__main__":
379
- _main()
 
1
+ # Adapted from the distinctipy repository (https://github.com/alan-turing-institute/distinctipy).
2
+ # Original authors: distinctipy contributors. Included with minor modifications.
3
  """
4
  Adapted from "The Color Blind Simulation function" by Matthew Wickline
5
  and the Human - Computer Interaction Resource Network (http://hcirn.com/), 2000 - 2001.
6
  """
7
+
8
  import numpy as np
9
 
10
  rBlind = {
 
264
  :return:
265
  """
266
  import matplotlib.pyplot as plt
 
267
  from distinctipy import distinctipy
268
 
269
  filtered_colors = [colorblind_filter(color, colorblind_type) for color in colors]
270
 
271
  fig, axes = plt.subplots(1, 2, figsize=(8, 4))
272
 
273
+ distinctipy.color_swatch(colors, ax=axes[0], one_row=one_row, title="Viewed with Normal Sight")
 
 
274
 
275
  distinctipy.color_swatch(
276
  filtered_colors,
 
324
  """
325
  import matplotlib.pyplot as plt
326
  import pandas as pd
 
327
  from distinctipy import distinctipy
328
 
329
  if dataset not in ("s1", "s2", "s3", "s4", "a1", "a2", "a3", "b1"):
330
  raise ValueError("dataset must be s1, s2, s3, s4, a1, a2, a3 or b1")
331
 
332
+ URL = "https://raw.githubusercontent.com/alan-turing-institute/distinctipy/" "main/distinctipy/datasets/"
 
 
 
333
  df = pd.read_csv(URL + dataset + ".csv")
334
 
335
  if colorblind_distinct:
336
+ orig_colors = distinctipy.get_colors(df["cluster"].nunique(), colorblind_type=colorblind_type)
 
 
337
  else:
338
  orig_colors = distinctipy.get_colors(df["cluster"].nunique())
339
 
340
  orig_cmap = distinctipy.get_colormap(orig_colors)
341
 
342
+ filtered_colors = [colorblind_filter(color, colorblind_type) for color in orig_colors]
 
 
343
  filtered_cmap = distinctipy.get_colormap(filtered_colors)
344
 
345
  fig, axes = plt.subplots(1, 2, figsize=(10, 5))
 
368
 
369
 
370
  if __name__ == "__main__":
371
+ _main()
bboxmaskpose/sam2/configs/sam-pose2seg/sam-pose2seg_hiera_b+.yaml ADDED
@@ -0,0 +1,118 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # @package _global_
2
+
3
+ # Model
4
+ model:
5
+ _target_: bboxmaskpose.sam2.modeling.sam2_base.SAM2Base
6
+ image_encoder:
7
+ _target_: bboxmaskpose.sam2.modeling.backbones.image_encoder.ImageEncoder
8
+ scalp: 1
9
+ trunk:
10
+ _target_: bboxmaskpose.sam2.modeling.backbones.hieradet.Hiera
11
+ embed_dim: 112
12
+ num_heads: 2
13
+ neck:
14
+ _target_: bboxmaskpose.sam2.modeling.backbones.image_encoder.FpnNeck
15
+ position_encoding:
16
+ _target_: bboxmaskpose.sam2.modeling.position_encoding.PositionEmbeddingSine
17
+ num_pos_feats: 256
18
+ normalize: true
19
+ scale: null
20
+ temperature: 10000
21
+ d_model: 256
22
+ backbone_channel_list: [896, 448, 224, 112]
23
+ fpn_top_down_levels: [2, 3] # output level 0 and 1 directly use the backbone features
24
+ fpn_interp_model: nearest
25
+
26
+ memory_attention:
27
+ _target_: bboxmaskpose.sam2.modeling.memory_attention.MemoryAttention
28
+ d_model: 256
29
+ pos_enc_at_input: true
30
+ layer:
31
+ _target_: bboxmaskpose.sam2.modeling.memory_attention.MemoryAttentionLayer
32
+ activation: relu
33
+ dim_feedforward: 2048
34
+ dropout: 0.1
35
+ pos_enc_at_attn: false
36
+ self_attention:
37
+ _target_: bboxmaskpose.sam2.modeling.sam.transformer.RoPEAttention
38
+ rope_theta: 10000.0
39
+ feat_sizes: [64, 64]
40
+ embedding_dim: 256
41
+ num_heads: 1
42
+ downsample_rate: 1
43
+ dropout: 0.1
44
+ d_model: 256
45
+ pos_enc_at_cross_attn_keys: true
46
+ pos_enc_at_cross_attn_queries: false
47
+ cross_attention:
48
+ _target_: bboxmaskpose.sam2.modeling.sam.transformer.RoPEAttention
49
+ rope_theta: 10000.0
50
+ feat_sizes: [64, 64]
51
+ rope_k_repeat: True
52
+ embedding_dim: 256
53
+ num_heads: 1
54
+ downsample_rate: 1
55
+ dropout: 0.1
56
+ kv_in_dim: 64
57
+ num_layers: 4
58
+
59
+ memory_encoder:
60
+ _target_: bboxmaskpose.sam2.modeling.memory_encoder.MemoryEncoder
61
+ out_dim: 64
62
+ position_encoding:
63
+ _target_: bboxmaskpose.sam2.modeling.position_encoding.PositionEmbeddingSine
64
+ num_pos_feats: 64
65
+ normalize: true
66
+ scale: null
67
+ temperature: 10000
68
+ mask_downsampler:
69
+ _target_: bboxmaskpose.sam2.modeling.memory_encoder.MaskDownSampler
70
+ kernel_size: 3
71
+ stride: 2
72
+ padding: 1
73
+ fuser:
74
+ _target_: bboxmaskpose.sam2.modeling.memory_encoder.Fuser
75
+ layer:
76
+ _target_: bboxmaskpose.sam2.modeling.memory_encoder.CXBlock
77
+ dim: 256
78
+ kernel_size: 7
79
+ padding: 3
80
+ layer_scale_init_value: 1e-6
81
+ use_dwconv: True # depth-wise convs
82
+ num_layers: 2
83
+
84
+ num_maskmem: 7
85
+ image_size: 1024
86
+ # apply scaled sigmoid on mask logits for memory encoder, and directly feed input mask as output mask
87
+ sigmoid_scale_for_mem_enc: 20.0
88
+ sigmoid_bias_for_mem_enc: -10.0
89
+ use_mask_input_as_output_without_sam: true
90
+ # Memory
91
+ directly_add_no_mem_embed: true
92
+ no_obj_embed_spatial: true
93
+ # use high-resolution feature map in the SAM mask decoder
94
+ use_high_res_features_in_sam: true
95
+ # output 3 masks on the first click on initial conditioning frames
96
+ multimask_output_in_sam: true
97
+ # SAM heads
98
+ iou_prediction_use_sigmoid: True
99
+ # cross-attend to object pointers from other frames (based on SAM output tokens) in the encoder
100
+ use_obj_ptrs_in_encoder: true
101
+ add_tpos_enc_to_obj_ptrs: true
102
+ proj_tpos_enc_in_obj_ptrs: true
103
+ use_signed_tpos_enc_to_obj_ptrs: true
104
+ only_obj_ptrs_in_the_past_for_eval: true
105
+ # object occlusion prediction
106
+ pred_obj_scores: true
107
+ pred_obj_scores_mlp: true
108
+ fixed_no_obj_ptr: true
109
+ # multimask tracking settings
110
+ multimask_output_for_tracking: true
111
+ use_multimask_token_for_obj_ptr: true
112
+ multimask_min_pt_num: 0
113
+ multimask_max_pt_num: 1
114
+ use_mlp_for_obj_ptr_proj: true
115
+
116
+ n_kpts_encoder: 8
117
+ # Compilation flag
118
+ # compile_image_encoder: False
{sam2 → bboxmaskpose/sam2}/configs/sam2.1/sam2.1_hiera_b+.yaml RENAMED
@@ -2,18 +2,18 @@
2
 
3
  # Model
4
  model:
5
- _target_: sam2.modeling.sam2_base.SAM2Base
6
  image_encoder:
7
- _target_: sam2.modeling.backbones.image_encoder.ImageEncoder
8
  scalp: 1
9
  trunk:
10
- _target_: sam2.modeling.backbones.hieradet.Hiera
11
  embed_dim: 112
12
  num_heads: 2
13
  neck:
14
- _target_: sam2.modeling.backbones.image_encoder.FpnNeck
15
  position_encoding:
16
- _target_: sam2.modeling.position_encoding.PositionEmbeddingSine
17
  num_pos_feats: 256
18
  normalize: true
19
  scale: null
@@ -24,17 +24,17 @@ model:
24
  fpn_interp_model: nearest
25
 
26
  memory_attention:
27
- _target_: sam2.modeling.memory_attention.MemoryAttention
28
  d_model: 256
29
  pos_enc_at_input: true
30
  layer:
31
- _target_: sam2.modeling.memory_attention.MemoryAttentionLayer
32
  activation: relu
33
  dim_feedforward: 2048
34
  dropout: 0.1
35
  pos_enc_at_attn: false
36
  self_attention:
37
- _target_: sam2.modeling.sam.transformer.RoPEAttention
38
  rope_theta: 10000.0
39
  feat_sizes: [64, 64]
40
  embedding_dim: 256
@@ -45,7 +45,7 @@ model:
45
  pos_enc_at_cross_attn_keys: true
46
  pos_enc_at_cross_attn_queries: false
47
  cross_attention:
48
- _target_: sam2.modeling.sam.transformer.RoPEAttention
49
  rope_theta: 10000.0
50
  feat_sizes: [64, 64]
51
  rope_k_repeat: True
@@ -57,23 +57,23 @@ model:
57
  num_layers: 4
58
 
59
  memory_encoder:
60
- _target_: sam2.modeling.memory_encoder.MemoryEncoder
61
  out_dim: 64
62
  position_encoding:
63
- _target_: sam2.modeling.position_encoding.PositionEmbeddingSine
64
  num_pos_feats: 64
65
  normalize: true
66
  scale: null
67
  temperature: 10000
68
  mask_downsampler:
69
- _target_: sam2.modeling.memory_encoder.MaskDownSampler
70
  kernel_size: 3
71
  stride: 2
72
  padding: 1
73
  fuser:
74
- _target_: sam2.modeling.memory_encoder.Fuser
75
  layer:
76
- _target_: sam2.modeling.memory_encoder.CXBlock
77
  dim: 256
78
  kernel_size: 7
79
  padding: 3
 
2
 
3
  # Model
4
  model:
5
+ _target_: bboxmaskpose.sam2.modeling.sam2_base.SAM2Base
6
  image_encoder:
7
+ _target_: bboxmaskpose.sam2.modeling.backbones.image_encoder.ImageEncoder
8
  scalp: 1
9
  trunk:
10
+ _target_: bboxmaskpose.sam2.modeling.backbones.hieradet.Hiera
11
  embed_dim: 112
12
  num_heads: 2
13
  neck:
14
+ _target_: bboxmaskpose.sam2.modeling.backbones.image_encoder.FpnNeck
15
  position_encoding:
16
+ _target_: bboxmaskpose.sam2.modeling.position_encoding.PositionEmbeddingSine
17
  num_pos_feats: 256
18
  normalize: true
19
  scale: null
 
24
  fpn_interp_model: nearest
25
 
26
  memory_attention:
27
+ _target_: bboxmaskpose.sam2.modeling.memory_attention.MemoryAttention
28
  d_model: 256
29
  pos_enc_at_input: true
30
  layer:
31
+ _target_: bboxmaskpose.sam2.modeling.memory_attention.MemoryAttentionLayer
32
  activation: relu
33
  dim_feedforward: 2048
34
  dropout: 0.1
35
  pos_enc_at_attn: false
36
  self_attention:
37
+ _target_: bboxmaskpose.sam2.modeling.sam.transformer.RoPEAttention
38
  rope_theta: 10000.0
39
  feat_sizes: [64, 64]
40
  embedding_dim: 256
 
45
  pos_enc_at_cross_attn_keys: true
46
  pos_enc_at_cross_attn_queries: false
47
  cross_attention:
48
+ _target_: bboxmaskpose.sam2.modeling.sam.transformer.RoPEAttention
49
  rope_theta: 10000.0
50
  feat_sizes: [64, 64]
51
  rope_k_repeat: True
 
57
  num_layers: 4
58
 
59
  memory_encoder:
60
+ _target_: bboxmaskpose.sam2.modeling.memory_encoder.MemoryEncoder
61
  out_dim: 64
62
  position_encoding:
63
+ _target_: bboxmaskpose.sam2.modeling.position_encoding.PositionEmbeddingSine
64
  num_pos_feats: 64
65
  normalize: true
66
  scale: null
67
  temperature: 10000
68
  mask_downsampler:
69
+ _target_: bboxmaskpose.sam2.modeling.memory_encoder.MaskDownSampler
70
  kernel_size: 3
71
  stride: 2
72
  padding: 1
73
  fuser:
74
+ _target_: bboxmaskpose.sam2.modeling.memory_encoder.Fuser
75
  layer:
76
+ _target_: bboxmaskpose.sam2.modeling.memory_encoder.CXBlock
77
  dim: 256
78
  kernel_size: 7
79
  padding: 3
{sam2 → bboxmaskpose/sam2}/configs/sam2.1/sam2.1_hiera_l.yaml RENAMED
@@ -2,12 +2,12 @@
2
 
3
  # Model
4
  model:
5
- _target_: sam2.modeling.sam2_base.SAM2Base
6
  image_encoder:
7
- _target_: sam2.modeling.backbones.image_encoder.ImageEncoder
8
  scalp: 1
9
  trunk:
10
- _target_: sam2.modeling.backbones.hieradet.Hiera
11
  embed_dim: 144
12
  num_heads: 2
13
  stages: [2, 6, 36, 4]
@@ -15,9 +15,9 @@ model:
15
  window_pos_embed_bkg_spatial_size: [7, 7]
16
  window_spec: [8, 4, 16, 8]
17
  neck:
18
- _target_: sam2.modeling.backbones.image_encoder.FpnNeck
19
  position_encoding:
20
- _target_: sam2.modeling.position_encoding.PositionEmbeddingSine
21
  num_pos_feats: 256
22
  normalize: true
23
  scale: null
@@ -28,17 +28,17 @@ model:
28
  fpn_interp_model: nearest
29
 
30
  memory_attention:
31
- _target_: sam2.modeling.memory_attention.MemoryAttention
32
  d_model: 256
33
  pos_enc_at_input: true
34
  layer:
35
- _target_: sam2.modeling.memory_attention.MemoryAttentionLayer
36
  activation: relu
37
  dim_feedforward: 2048
38
  dropout: 0.1
39
  pos_enc_at_attn: false
40
  self_attention:
41
- _target_: sam2.modeling.sam.transformer.RoPEAttention
42
  rope_theta: 10000.0
43
  feat_sizes: [64, 64]
44
  embedding_dim: 256
@@ -49,7 +49,7 @@ model:
49
  pos_enc_at_cross_attn_keys: true
50
  pos_enc_at_cross_attn_queries: false
51
  cross_attention:
52
- _target_: sam2.modeling.sam.transformer.RoPEAttention
53
  rope_theta: 10000.0
54
  feat_sizes: [64, 64]
55
  rope_k_repeat: True
@@ -61,23 +61,23 @@ model:
61
  num_layers: 4
62
 
63
  memory_encoder:
64
- _target_: sam2.modeling.memory_encoder.MemoryEncoder
65
  out_dim: 64
66
  position_encoding:
67
- _target_: sam2.modeling.position_encoding.PositionEmbeddingSine
68
  num_pos_feats: 64
69
  normalize: true
70
  scale: null
71
  temperature: 10000
72
  mask_downsampler:
73
- _target_: sam2.modeling.memory_encoder.MaskDownSampler
74
  kernel_size: 3
75
  stride: 2
76
  padding: 1
77
  fuser:
78
- _target_: sam2.modeling.memory_encoder.Fuser
79
  layer:
80
- _target_: sam2.modeling.memory_encoder.CXBlock
81
  dim: 256
82
  kernel_size: 7
83
  padding: 3
 
2
 
3
  # Model
4
  model:
5
+ _target_: bboxmaskpose.sam2.modeling.sam2_base.SAM2Base
6
  image_encoder:
7
+ _target_: bboxmaskpose.sam2.modeling.backbones.image_encoder.ImageEncoder
8
  scalp: 1
9
  trunk:
10
+ _target_: bboxmaskpose.sam2.modeling.backbones.hieradet.Hiera
11
  embed_dim: 144
12
  num_heads: 2
13
  stages: [2, 6, 36, 4]
 
15
  window_pos_embed_bkg_spatial_size: [7, 7]
16
  window_spec: [8, 4, 16, 8]
17
  neck:
18
+ _target_: bboxmaskpose.sam2.modeling.backbones.image_encoder.FpnNeck
19
  position_encoding:
20
+ _target_: bboxmaskpose.sam2.modeling.position_encoding.PositionEmbeddingSine
21
  num_pos_feats: 256
22
  normalize: true
23
  scale: null
 
28
  fpn_interp_model: nearest
29
 
30
  memory_attention:
31
+ _target_: bboxmaskpose.sam2.modeling.memory_attention.MemoryAttention
32
  d_model: 256
33
  pos_enc_at_input: true
34
  layer:
35
+ _target_: bboxmaskpose.sam2.modeling.memory_attention.MemoryAttentionLayer
36
  activation: relu
37
  dim_feedforward: 2048
38
  dropout: 0.1
39
  pos_enc_at_attn: false
40
  self_attention:
41
+ _target_: bboxmaskpose.sam2.modeling.sam.transformer.RoPEAttention
42
  rope_theta: 10000.0
43
  feat_sizes: [64, 64]
44
  embedding_dim: 256
 
49
  pos_enc_at_cross_attn_keys: true
50
  pos_enc_at_cross_attn_queries: false
51
  cross_attention:
52
+ _target_: bboxmaskpose.sam2.modeling.sam.transformer.RoPEAttention
53
  rope_theta: 10000.0
54
  feat_sizes: [64, 64]
55
  rope_k_repeat: True
 
61
  num_layers: 4
62
 
63
  memory_encoder:
64
+ _target_: bboxmaskpose.sam2.modeling.memory_encoder.MemoryEncoder
65
  out_dim: 64
66
  position_encoding:
67
+ _target_: bboxmaskpose.sam2.modeling.position_encoding.PositionEmbeddingSine
68
  num_pos_feats: 64
69
  normalize: true
70
  scale: null
71
  temperature: 10000
72
  mask_downsampler:
73
+ _target_: bboxmaskpose.sam2.modeling.memory_encoder.MaskDownSampler
74
  kernel_size: 3
75
  stride: 2
76
  padding: 1
77
  fuser:
78
+ _target_: bboxmaskpose.sam2.modeling.memory_encoder.Fuser
79
  layer:
80
+ _target_: bboxmaskpose.sam2.modeling.memory_encoder.CXBlock
81
  dim: 256
82
  kernel_size: 7
83
  padding: 3
{sam2 → bboxmaskpose/sam2}/configs/sam2.1/sam2.1_hiera_s.yaml RENAMED
@@ -2,21 +2,21 @@
2
 
3
  # Model
4
  model:
5
- _target_: sam2.modeling.sam2_base.SAM2Base
6
  image_encoder:
7
- _target_: sam2.modeling.backbones.image_encoder.ImageEncoder
8
  scalp: 1
9
  trunk:
10
- _target_: sam2.modeling.backbones.hieradet.Hiera
11
  embed_dim: 96
12
  num_heads: 1
13
  stages: [1, 2, 11, 2]
14
  global_att_blocks: [7, 10, 13]
15
  window_pos_embed_bkg_spatial_size: [7, 7]
16
  neck:
17
- _target_: sam2.modeling.backbones.image_encoder.FpnNeck
18
  position_encoding:
19
- _target_: sam2.modeling.position_encoding.PositionEmbeddingSine
20
  num_pos_feats: 256
21
  normalize: true
22
  scale: null
@@ -27,17 +27,17 @@ model:
27
  fpn_interp_model: nearest
28
 
29
  memory_attention:
30
- _target_: sam2.modeling.memory_attention.MemoryAttention
31
  d_model: 256
32
  pos_enc_at_input: true
33
  layer:
34
- _target_: sam2.modeling.memory_attention.MemoryAttentionLayer
35
  activation: relu
36
  dim_feedforward: 2048
37
  dropout: 0.1
38
  pos_enc_at_attn: false
39
  self_attention:
40
- _target_: sam2.modeling.sam.transformer.RoPEAttention
41
  rope_theta: 10000.0
42
  feat_sizes: [64, 64]
43
  embedding_dim: 256
@@ -48,7 +48,7 @@ model:
48
  pos_enc_at_cross_attn_keys: true
49
  pos_enc_at_cross_attn_queries: false
50
  cross_attention:
51
- _target_: sam2.modeling.sam.transformer.RoPEAttention
52
  rope_theta: 10000.0
53
  feat_sizes: [64, 64]
54
  rope_k_repeat: True
@@ -60,23 +60,23 @@ model:
60
  num_layers: 4
61
 
62
  memory_encoder:
63
- _target_: sam2.modeling.memory_encoder.MemoryEncoder
64
  out_dim: 64
65
  position_encoding:
66
- _target_: sam2.modeling.position_encoding.PositionEmbeddingSine
67
  num_pos_feats: 64
68
  normalize: true
69
  scale: null
70
  temperature: 10000
71
  mask_downsampler:
72
- _target_: sam2.modeling.memory_encoder.MaskDownSampler
73
  kernel_size: 3
74
  stride: 2
75
  padding: 1
76
  fuser:
77
- _target_: sam2.modeling.memory_encoder.Fuser
78
  layer:
79
- _target_: sam2.modeling.memory_encoder.CXBlock
80
  dim: 256
81
  kernel_size: 7
82
  padding: 3
 
2
 
3
  # Model
4
  model:
5
+ _target_: bboxmaskpose.sam2.modeling.sam2_base.SAM2Base
6
  image_encoder:
7
+ _target_: bboxmaskpose.sam2.modeling.backbones.image_encoder.ImageEncoder
8
  scalp: 1
9
  trunk:
10
+ _target_: bboxmaskpose.sam2.modeling.backbones.hieradet.Hiera
11
  embed_dim: 96
12
  num_heads: 1
13
  stages: [1, 2, 11, 2]
14
  global_att_blocks: [7, 10, 13]
15
  window_pos_embed_bkg_spatial_size: [7, 7]
16
  neck:
17
+ _target_: bboxmaskpose.sam2.modeling.backbones.image_encoder.FpnNeck
18
  position_encoding:
19
+ _target_: bboxmaskpose.sam2.modeling.position_encoding.PositionEmbeddingSine
20
  num_pos_feats: 256
21
  normalize: true
22
  scale: null
 
27
  fpn_interp_model: nearest
28
 
29
  memory_attention:
30
+ _target_: bboxmaskpose.sam2.modeling.memory_attention.MemoryAttention
31
  d_model: 256
32
  pos_enc_at_input: true
33
  layer:
34
+ _target_: bboxmaskpose.sam2.modeling.memory_attention.MemoryAttentionLayer
35
  activation: relu
36
  dim_feedforward: 2048
37
  dropout: 0.1
38
  pos_enc_at_attn: false
39
  self_attention:
40
+ _target_: bboxmaskpose.sam2.modeling.sam.transformer.RoPEAttention
41
  rope_theta: 10000.0
42
  feat_sizes: [64, 64]
43
  embedding_dim: 256
 
48
  pos_enc_at_cross_attn_keys: true
49
  pos_enc_at_cross_attn_queries: false
50
  cross_attention:
51
+ _target_: bboxmaskpose.sam2.modeling.sam.transformer.RoPEAttention
52
  rope_theta: 10000.0
53
  feat_sizes: [64, 64]
54
  rope_k_repeat: True
 
60
  num_layers: 4
61
 
62
  memory_encoder:
63
+ _target_: bboxmaskpose.sam2.modeling.memory_encoder.MemoryEncoder
64
  out_dim: 64
65
  position_encoding:
66
+ _target_: bboxmaskpose.sam2.modeling.position_encoding.PositionEmbeddingSine
67
  num_pos_feats: 64
68
  normalize: true
69
  scale: null
70
  temperature: 10000
71
  mask_downsampler:
72
+ _target_: bboxmaskpose.sam2.modeling.memory_encoder.MaskDownSampler
73
  kernel_size: 3
74
  stride: 2
75
  padding: 1
76
  fuser:
77
+ _target_: bboxmaskpose.sam2.modeling.memory_encoder.Fuser
78
  layer:
79
+ _target_: bboxmaskpose.sam2.modeling.memory_encoder.CXBlock
80
  dim: 256
81
  kernel_size: 7
82
  padding: 3
{sam2 → bboxmaskpose/sam2}/configs/sam2.1/sam2.1_hiera_t.yaml RENAMED
@@ -2,21 +2,21 @@
2
 
3
  # Model
4
  model:
5
- _target_: sam2.modeling.sam2_base.SAM2Base
6
  image_encoder:
7
- _target_: sam2.modeling.backbones.image_encoder.ImageEncoder
8
  scalp: 1
9
  trunk:
10
- _target_: sam2.modeling.backbones.hieradet.Hiera
11
  embed_dim: 96
12
  num_heads: 1
13
  stages: [1, 2, 7, 2]
14
  global_att_blocks: [5, 7, 9]
15
  window_pos_embed_bkg_spatial_size: [7, 7]
16
  neck:
17
- _target_: sam2.modeling.backbones.image_encoder.FpnNeck
18
  position_encoding:
19
- _target_: sam2.modeling.position_encoding.PositionEmbeddingSine
20
  num_pos_feats: 256
21
  normalize: true
22
  scale: null
@@ -27,17 +27,17 @@ model:
27
  fpn_interp_model: nearest
28
 
29
  memory_attention:
30
- _target_: sam2.modeling.memory_attention.MemoryAttention
31
  d_model: 256
32
  pos_enc_at_input: true
33
  layer:
34
- _target_: sam2.modeling.memory_attention.MemoryAttentionLayer
35
  activation: relu
36
  dim_feedforward: 2048
37
  dropout: 0.1
38
  pos_enc_at_attn: false
39
  self_attention:
40
- _target_: sam2.modeling.sam.transformer.RoPEAttention
41
  rope_theta: 10000.0
42
  feat_sizes: [64, 64]
43
  embedding_dim: 256
@@ -48,7 +48,7 @@ model:
48
  pos_enc_at_cross_attn_keys: true
49
  pos_enc_at_cross_attn_queries: false
50
  cross_attention:
51
- _target_: sam2.modeling.sam.transformer.RoPEAttention
52
  rope_theta: 10000.0
53
  feat_sizes: [64, 64]
54
  rope_k_repeat: True
@@ -60,23 +60,23 @@ model:
60
  num_layers: 4
61
 
62
  memory_encoder:
63
- _target_: sam2.modeling.memory_encoder.MemoryEncoder
64
  out_dim: 64
65
  position_encoding:
66
- _target_: sam2.modeling.position_encoding.PositionEmbeddingSine
67
  num_pos_feats: 64
68
  normalize: true
69
  scale: null
70
  temperature: 10000
71
  mask_downsampler:
72
- _target_: sam2.modeling.memory_encoder.MaskDownSampler
73
  kernel_size: 3
74
  stride: 2
75
  padding: 1
76
  fuser:
77
- _target_: sam2.modeling.memory_encoder.Fuser
78
  layer:
79
- _target_: sam2.modeling.memory_encoder.CXBlock
80
  dim: 256
81
  kernel_size: 7
82
  padding: 3
 
2
 
3
  # Model
4
  model:
5
+ _target_: bboxmaskpose.sam2.modeling.sam2_base.SAM2Base
6
  image_encoder:
7
+ _target_: bboxmaskpose.sam2.modeling.backbones.image_encoder.ImageEncoder
8
  scalp: 1
9
  trunk:
10
+ _target_: bboxmaskpose.sam2.modeling.backbones.hieradet.Hiera
11
  embed_dim: 96
12
  num_heads: 1
13
  stages: [1, 2, 7, 2]
14
  global_att_blocks: [5, 7, 9]
15
  window_pos_embed_bkg_spatial_size: [7, 7]
16
  neck:
17
+ _target_: bboxmaskpose.sam2.modeling.backbones.image_encoder.FpnNeck
18
  position_encoding:
19
+ _target_: bboxmaskpose.sam2.modeling.position_encoding.PositionEmbeddingSine
20
  num_pos_feats: 256
21
  normalize: true
22
  scale: null
 
27
  fpn_interp_model: nearest
28
 
29
  memory_attention:
30
+ _target_: bboxmaskpose.sam2.modeling.memory_attention.MemoryAttention
31
  d_model: 256
32
  pos_enc_at_input: true
33
  layer:
34
+ _target_: bboxmaskpose.sam2.modeling.memory_attention.MemoryAttentionLayer
35
  activation: relu
36
  dim_feedforward: 2048
37
  dropout: 0.1
38
  pos_enc_at_attn: false
39
  self_attention:
40
+ _target_: bboxmaskpose.sam2.modeling.sam.transformer.RoPEAttention
41
  rope_theta: 10000.0
42
  feat_sizes: [64, 64]
43
  embedding_dim: 256
 
48
  pos_enc_at_cross_attn_keys: true
49
  pos_enc_at_cross_attn_queries: false
50
  cross_attention:
51
+ _target_: bboxmaskpose.sam2.modeling.sam.transformer.RoPEAttention
52
  rope_theta: 10000.0
53
  feat_sizes: [64, 64]
54
  rope_k_repeat: True
 
60
  num_layers: 4
61
 
62
  memory_encoder:
63
+ _target_: bboxmaskpose.sam2.modeling.memory_encoder.MemoryEncoder
64
  out_dim: 64
65
  position_encoding:
66
+ _target_: bboxmaskpose.sam2.modeling.position_encoding.PositionEmbeddingSine
67
  num_pos_feats: 64
68
  normalize: true
69
  scale: null
70
  temperature: 10000
71
  mask_downsampler:
72
+ _target_: bboxmaskpose.sam2.modeling.memory_encoder.MaskDownSampler
73
  kernel_size: 3
74
  stride: 2
75
  padding: 1
76
  fuser:
77
+ _target_: bboxmaskpose.sam2.modeling.memory_encoder.Fuser
78
  layer:
79
+ _target_: bboxmaskpose.sam2.modeling.memory_encoder.CXBlock
80
  dim: 256
81
  kernel_size: 7
82
  padding: 3
bboxmaskpose/sam2/configs/sam2.1_training/sam2.1_hiera_b+_COCO+CIHP_finetune_sam-pose2seg.yaml ADDED
@@ -0,0 +1,343 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # @package _global_
2
+
3
+ scratch:
4
+ resolution: 1024
5
+ train_batch_size: 1
6
+ num_train_workers: 10
7
+ num_frames: 1
8
+ max_num_objects: 1
9
+ base_lr: 5.0e-6
10
+ vision_lr: 3.0e-06
11
+ phases_per_epoch: 1
12
+ num_epochs: 15
13
+
14
+ dataset:
15
+ # PATHS to Dataset
16
+ img_folder: path/to/datasett
17
+ gt_folder: path/to/dataset
18
+ multiplier: 2
19
+
20
+ # Video transforms
21
+ vos:
22
+ train_transforms:
23
+ - _target_: training.dataset.transforms.ComposeAPI
24
+ transforms:
25
+ - _target_: training.dataset.transforms.RandomHorizontalFlip
26
+ consistent_transform: True
27
+ - _target_: training.dataset.transforms.RandomAffine
28
+ degrees: 25
29
+ shear: 20
30
+ image_interpolation: bilinear
31
+ consistent_transform: True
32
+ - _target_: training.dataset.transforms.RandomResizeAPI
33
+ sizes: ${scratch.resolution}
34
+ square: true
35
+ consistent_transform: True
36
+ - _target_: training.dataset.transforms.ColorJitter
37
+ consistent_transform: True
38
+ brightness: 0.1
39
+ contrast: 0.03
40
+ saturation: 0.03
41
+ hue: null
42
+ - _target_: training.dataset.transforms.RandomGrayscale
43
+ p: 0.05
44
+ consistent_transform: True
45
+ - _target_: training.dataset.transforms.ColorJitter
46
+ consistent_transform: False
47
+ brightness: 0.1
48
+ contrast: 0.05
49
+ saturation: 0.05
50
+ hue: null
51
+ - _target_: training.dataset.transforms.ToTensorAPI
52
+ - _target_: training.dataset.transforms.NormalizeAPI
53
+ mean: [0.485, 0.456, 0.406]
54
+ std: [0.229, 0.224, 0.225]
55
+
56
+ trainer:
57
+ _target_: training.trainer.Trainer
58
+ mode: train_only # change to train ? (a.k.a. train + val)
59
+ max_epochs: ${times:${scratch.num_epochs},${scratch.phases_per_epoch}}
60
+ accelerator: cuda
61
+ seed_value: 123
62
+
63
+ model:
64
+ _target_: training.model.sam2.SAM2Train
65
+ image_encoder:
66
+ _target_: sam2.modeling.backbones.image_encoder.ImageEncoder
67
+ scalp: 1
68
+ trunk:
69
+ _target_: sam2.modeling.backbones.hieradet.Hiera
70
+ embed_dim: 112
71
+ num_heads: 2
72
+ drop_path_rate: 0.1
73
+ neck:
74
+ _target_: sam2.modeling.backbones.image_encoder.FpnNeck
75
+ position_encoding:
76
+ _target_: sam2.modeling.position_encoding.PositionEmbeddingSine
77
+ num_pos_feats: 256
78
+ normalize: true
79
+ scale: null
80
+ temperature: 10000
81
+ d_model: 256
82
+ backbone_channel_list: [896, 448, 224, 112]
83
+ fpn_top_down_levels: [2, 3] # output level 0 and 1 directly use the backbone features
84
+ fpn_interp_model: nearest
85
+
86
+ memory_attention:
87
+ _target_: sam2.modeling.memory_attention.MemoryAttention
88
+ d_model: 256
89
+ pos_enc_at_input: true
90
+ layer:
91
+ _target_: sam2.modeling.memory_attention.MemoryAttentionLayer
92
+ activation: relu
93
+ dim_feedforward: 2048
94
+ dropout: 0.1
95
+ pos_enc_at_attn: false
96
+ self_attention:
97
+ _target_: sam2.modeling.sam.transformer.RoPEAttention
98
+ rope_theta: 10000.0
99
+ feat_sizes: [64, 64]
100
+ embedding_dim: 256
101
+ num_heads: 1
102
+ downsample_rate: 1
103
+ dropout: 0.1
104
+ d_model: 256
105
+ pos_enc_at_cross_attn_keys: true
106
+ pos_enc_at_cross_attn_queries: false
107
+ cross_attention:
108
+ _target_: sam2.modeling.sam.transformer.RoPEAttention
109
+ rope_theta: 10000.0
110
+ feat_sizes: [64, 64]
111
+ rope_k_repeat: True
112
+ embedding_dim: 256
113
+ num_heads: 1
114
+ downsample_rate: 1
115
+ dropout: 0.1
116
+ kv_in_dim: 64
117
+ num_layers: 4
118
+
119
+ memory_encoder:
120
+ _target_: sam2.modeling.memory_encoder.MemoryEncoder
121
+ out_dim: 64
122
+ position_encoding:
123
+ _target_: sam2.modeling.position_encoding.PositionEmbeddingSine
124
+ num_pos_feats: 64
125
+ normalize: true
126
+ scale: null
127
+ temperature: 10000
128
+ mask_downsampler:
129
+ _target_: sam2.modeling.memory_encoder.MaskDownSampler
130
+ kernel_size: 3
131
+ stride: 2
132
+ padding: 1
133
+ fuser:
134
+ _target_: sam2.modeling.memory_encoder.Fuser
135
+ layer:
136
+ _target_: sam2.modeling.memory_encoder.CXBlock
137
+ dim: 256
138
+ kernel_size: 7
139
+ padding: 3
140
+ layer_scale_init_value: 1e-6
141
+ use_dwconv: True # depth-wise convs
142
+ num_layers: 2
143
+
144
+ num_maskmem: 7
145
+ image_size: ${scratch.resolution}
146
+ # apply scaled sigmoid on mask logits for memory encoder, and directly feed input mask as output mask
147
+ sigmoid_scale_for_mem_enc: 20.0
148
+ sigmoid_bias_for_mem_enc: -10.0
149
+ use_mask_input_as_output_without_sam: true
150
+ # Memory
151
+ directly_add_no_mem_embed: true
152
+ no_obj_embed_spatial: true
153
+ # use high-resolution feature map in the SAM mask decoder
154
+ use_high_res_features_in_sam: true
155
+ # output 3 masks on the first click on initial conditioning frames
156
+ multimask_output_in_sam: true
157
+ # SAM heads
158
+ iou_prediction_use_sigmoid: True
159
+ # cross-attend to object pointers from other frames (based on SAM output tokens) in the encoder
160
+ use_obj_ptrs_in_encoder: true
161
+ add_tpos_enc_to_obj_ptrs: true
162
+ proj_tpos_enc_in_obj_ptrs: true
163
+ use_signed_tpos_enc_to_obj_ptrs: true
164
+ only_obj_ptrs_in_the_past_for_eval: true
165
+ # object occlusion prediction
166
+ pred_obj_scores: true
167
+ pred_obj_scores_mlp: true
168
+ fixed_no_obj_ptr: true
169
+ # multimask tracking settings
170
+ multimask_output_for_tracking: true
171
+ use_multimask_token_for_obj_ptr: false
172
+ multimask_min_pt_num: 0
173
+ multimask_max_pt_num: 1
174
+ use_mlp_for_obj_ptr_proj: true
175
+
176
+ n_kpts_encoder: 8
177
+ # Compilation flag
178
+ # compile_image_encoder: False
179
+
180
+ ####### Training specific params #######
181
+ # box/point input and corrections
182
+ prob_to_use_pt_input_for_train: 1.0
183
+ prob_to_use_pt_input_for_eval: 0.0
184
+ prob_to_use_box_input_for_train: 0.0 # 0.5*0.5 = 0.25 prob to use box instead of points
185
+ prob_to_use_box_input_for_eval: 0.0
186
+ prob_to_sample_from_gt_for_train: 0.1 # with a small prob, sampling correction points from GT mask instead of prediction errors
187
+ num_frames_to_correct_for_train: 2 # iteratively sample on random 1~2 frames (always include the first frame)
188
+ num_frames_to_correct_for_eval: 1 # only iteratively sample on first frame
189
+ rand_frames_to_correct_for_train: True # random #init-cond-frame ~ 2
190
+ add_all_frames_to_correct_as_cond: True # when a frame receives a correction click, it becomes a conditioning frame (even if it's not initially a conditioning frame)
191
+ # maximum 2 initial conditioning frames
192
+ num_init_cond_frames_for_train: 2
193
+ rand_init_cond_frames_for_train: True # random 1~2
194
+ num_correction_pt_per_frame: 7 ## CHANGED
195
+ use_act_ckpt_iterative_pt_sampling: false
196
+
197
+
198
+
199
+ num_init_cond_frames_for_eval: 1 # only mask on the first frame
200
+ forward_backbone_per_frame_for_eval: True
201
+
202
+
203
+ data:
204
+ train:
205
+ _target_: training.dataset.sam2_datasets.TorchTrainMixedDataset
206
+ phases_per_epoch: ${scratch.phases_per_epoch}
207
+ batch_sizes:
208
+ - ${scratch.train_batch_size}
209
+
210
+ datasets:
211
+ - _target_: training.dataset.utils.RepeatFactorWrapper
212
+ dataset:
213
+ _target_: training.dataset.utils.ConcatDataset
214
+ datasets:
215
+ - _target_: training.dataset.vos_dataset.VOSDataset
216
+ transforms: ${vos.train_transforms}
217
+ training: true
218
+ video_dataset:
219
+ _target_: training.dataset.vos_raw_dataset.SA1BRawDataset
220
+ img_folder: ${dataset.img_folder}
221
+ gt_folder: ${dataset.gt_folder}
222
+ # file_list_txt: ${dataset.file_list_txt}
223
+ sampler:
224
+ _target_: training.dataset.vos_sampler.RandomUniformSampler
225
+ num_frames: ${scratch.num_frames}
226
+ max_num_objects: ${scratch.max_num_objects}
227
+ multiplier: ${dataset.multiplier}
228
+ shuffle: True
229
+ num_workers: ${scratch.num_train_workers}
230
+ pin_memory: True
231
+ drop_last: True
232
+ collate_fn:
233
+ _target_: training.utils.data_utils.collate_fn
234
+ _partial_: true
235
+ dict_key: all
236
+
237
+ # val:
238
+
239
+
240
+ optim:
241
+ amp:
242
+ enabled: True
243
+ amp_dtype: bfloat16
244
+
245
+ optimizer:
246
+ _target_: torch.optim.AdamW
247
+
248
+ gradient_clip:
249
+ _target_: training.optimizer.GradientClipper
250
+ max_norm: 0.1
251
+ norm_type: 2
252
+
253
+ param_group_modifiers:
254
+ - _target_: training.optimizer.layer_decay_param_modifier
255
+ _partial_: True
256
+ layer_decay_value: 0.9
257
+ apply_to: 'image_encoder.trunk'
258
+ overrides:
259
+ - pattern: '*pos_embed*'
260
+ value: 1.0
261
+
262
+ options:
263
+ lr:
264
+ - scheduler:
265
+ _target_: fvcore.common.param_scheduler.CosineParamScheduler
266
+ start_value: ${scratch.base_lr}
267
+ end_value: ${divide:${scratch.base_lr},10}
268
+ - scheduler:
269
+ _target_: fvcore.common.param_scheduler.CosineParamScheduler
270
+ start_value: ${scratch.vision_lr}
271
+ end_value: ${divide:${scratch.vision_lr},10}
272
+ param_names:
273
+ - 'image_encoder.*'
274
+ weight_decay:
275
+ - scheduler:
276
+ _target_: fvcore.common.param_scheduler.ConstantParamScheduler
277
+ value: 0.1
278
+ - scheduler:
279
+ _target_: fvcore.common.param_scheduler.ConstantParamScheduler
280
+ value: 0.0
281
+ param_names:
282
+ - '*bias*'
283
+ module_cls_names: ['torch.nn.LayerNorm']
284
+
285
+ loss:
286
+ all:
287
+ _target_: training.loss_fns.MultiStepMultiMasksAndIous
288
+ weight_dict:
289
+ loss_mask: 20
290
+ loss_dice: 1
291
+ loss_iou: 1
292
+ loss_class: 1
293
+ supervise_all_iou: true
294
+ iou_use_l1_loss: true
295
+ pred_obj_scores: true
296
+ focal_gamma_obj_score: 0.0
297
+ focal_alpha_obj_score: -1.0
298
+
299
+ distributed:
300
+ backend: nccl
301
+ find_unused_parameters: True
302
+
303
+ logging:
304
+ tensorboard_writer:
305
+ _target_: training.utils.logger.make_tensorboard_logger
306
+ log_dir: ${launcher.experiment_log_dir}/tensorboard
307
+ flush_secs: 120
308
+ should_log: True
309
+ log_dir: ${launcher.experiment_log_dir}/logs
310
+ log_freq: 10
311
+
312
+ # initialize from a SAM 2 checkpoint
313
+ checkpoint:
314
+ save_dir: ${launcher.experiment_log_dir}/checkpoints
315
+ save_freq: 0 # 0 only last checkpoint is saved.
316
+ model_weight_initializer:
317
+ _partial_: True
318
+ _target_: training.utils.checkpoint_utils.load_state_dict_into_model
319
+ strict: True
320
+ ignore_unexpected_keys: null
321
+ ignore_missing_keys: null
322
+
323
+ state_dict:
324
+ _target_: training.utils.checkpoint_utils.load_checkpoint_and_apply_kernels
325
+ checkpoint_path: ./checkpoints/sam2.1_hiera_base_plus.pt ## CHANGED - PATH to SAM 2.1 checkpoint
326
+ ckpt_state_dict_keys: ['model']
327
+
328
+ launcher:
329
+ num_nodes: 1
330
+ gpus_per_node: 8
331
+ experiment_log_dir: null # Path to log directory, defaults to ./sam2_logs/${config_name}
332
+
333
+ # SLURM args if running on a cluster
334
+ submitit:
335
+ partition: null
336
+ account: null
337
+ qos: null
338
+ cpus_per_task: 10
339
+ use_cluster: false
340
+ timeout_hour: 24
341
+ name: null
342
+ port_range: [10000, 65000]
343
+
{sam2 → bboxmaskpose/sam2}/configs/sam2.1_training/sam2.1_hiera_b+_COCO_1024_prompt.yaml RENAMED
@@ -11,14 +11,6 @@ scratch:
11
  phases_per_epoch: 1
12
  num_epochs: 40
13
 
14
- dataset:
15
- # PATHS to Dataset
16
- img_folder: /mnt/personal/purkrmir/data/COCO/original/train2017/ # PATH to MOSE JPEGImages folder
17
- gt_folder: /mnt/personal/purkrmir/data/COCO/original/annotations/ # PATH to MOSE Annotations folder
18
- # img_folder: /datagrid/personal/purkrmir/data/COCO/original/val2017/ # PATH to MOSE JPEGImages folder
19
- # gt_folder: /datagrid/personal/purkrmir/data/COCO/original/annotations/ # PATH to MOSE Annotations folder
20
- file_list_txt: null # Optional PATH to filelist containing a subset of videos to be used for training
21
- multiplier: 2
22
 
23
  # Video transforms
24
  vos:
@@ -69,19 +61,19 @@ trainer:
69
  unfreeze_decoder: False
70
 
71
  model:
72
- _target_: training.model.sam2.SAM2Train
73
  image_encoder:
74
- _target_: sam2.modeling.backbones.image_encoder.ImageEncoder
75
  scalp: 1
76
  trunk:
77
- _target_: sam2.modeling.backbones.hieradet.Hiera
78
  embed_dim: 112
79
  num_heads: 2
80
  drop_path_rate: 0.1
81
  neck:
82
- _target_: sam2.modeling.backbones.image_encoder.FpnNeck
83
  position_encoding:
84
- _target_: sam2.modeling.position_encoding.PositionEmbeddingSine
85
  num_pos_feats: 256
86
  normalize: true
87
  scale: null
@@ -92,17 +84,17 @@ trainer:
92
  fpn_interp_model: nearest
93
 
94
  memory_attention:
95
- _target_: sam2.modeling.memory_attention.MemoryAttention
96
  d_model: 256
97
  pos_enc_at_input: true
98
  layer:
99
- _target_: sam2.modeling.memory_attention.MemoryAttentionLayer
100
  activation: relu
101
  dim_feedforward: 2048
102
  dropout: 0.1
103
  pos_enc_at_attn: false
104
  self_attention:
105
- _target_: sam2.modeling.sam.transformer.RoPEAttention
106
  rope_theta: 10000.0
107
  feat_sizes: [64, 64]
108
  embedding_dim: 256
@@ -113,7 +105,7 @@ trainer:
113
  pos_enc_at_cross_attn_keys: true
114
  pos_enc_at_cross_attn_queries: false
115
  cross_attention:
116
- _target_: sam2.modeling.sam.transformer.RoPEAttention
117
  rope_theta: 10000.0
118
  feat_sizes: [64, 64]
119
  rope_k_repeat: True
@@ -125,23 +117,23 @@ trainer:
125
  num_layers: 4
126
 
127
  memory_encoder:
128
- _target_: sam2.modeling.memory_encoder.MemoryEncoder
129
  out_dim: 64
130
  position_encoding:
131
- _target_: sam2.modeling.position_encoding.PositionEmbeddingSine
132
  num_pos_feats: 64
133
  normalize: true
134
  scale: null
135
  temperature: 10000
136
  mask_downsampler:
137
- _target_: sam2.modeling.memory_encoder.MaskDownSampler
138
  kernel_size: 3
139
  stride: 2
140
  padding: 1
141
  fuser:
142
- _target_: sam2.modeling.memory_encoder.Fuser
143
  layer:
144
- _target_: sam2.modeling.memory_encoder.CXBlock
145
  dim: 256
146
  kernel_size: 7
147
  padding: 3
@@ -325,7 +317,7 @@ trainer:
325
 
326
  state_dict:
327
  _target_: training.utils.checkpoint_utils.load_checkpoint_and_apply_kernels
328
- checkpoint_path: ./checkpoints/sam2.1_hiera_base_plus.pt # PATH to SAM 2.1 checkpoint
329
  ckpt_state_dict_keys: ['model']
330
 
331
  launcher:
 
11
  phases_per_epoch: 1
12
  num_epochs: 40
13
 
 
 
 
 
 
 
 
 
14
 
15
  # Video transforms
16
  vos:
 
61
  unfreeze_decoder: False
62
 
63
  model:
64
+ _target_: training.model.bboxmaskpose.sam2.sam2Train
65
  image_encoder:
66
+ _target_: bboxmaskpose.sam2.modeling.backbones.image_encoder.ImageEncoder
67
  scalp: 1
68
  trunk:
69
+ _target_: bboxmaskpose.sam2.modeling.backbones.hieradet.Hiera
70
  embed_dim: 112
71
  num_heads: 2
72
  drop_path_rate: 0.1
73
  neck:
74
+ _target_: bboxmaskpose.sam2.modeling.backbones.image_encoder.FpnNeck
75
  position_encoding:
76
+ _target_: bboxmaskpose.sam2.modeling.position_encoding.PositionEmbeddingSine
77
  num_pos_feats: 256
78
  normalize: true
79
  scale: null
 
84
  fpn_interp_model: nearest
85
 
86
  memory_attention:
87
+ _target_: bboxmaskpose.sam2.modeling.memory_attention.MemoryAttention
88
  d_model: 256
89
  pos_enc_at_input: true
90
  layer:
91
+ _target_: bboxmaskpose.sam2.modeling.memory_attention.MemoryAttentionLayer
92
  activation: relu
93
  dim_feedforward: 2048
94
  dropout: 0.1
95
  pos_enc_at_attn: false
96
  self_attention:
97
+ _target_: bboxmaskpose.sam2.modeling.sam.transformer.RoPEAttention
98
  rope_theta: 10000.0
99
  feat_sizes: [64, 64]
100
  embedding_dim: 256
 
105
  pos_enc_at_cross_attn_keys: true
106
  pos_enc_at_cross_attn_queries: false
107
  cross_attention:
108
+ _target_: bboxmaskpose.sam2.modeling.sam.transformer.RoPEAttention
109
  rope_theta: 10000.0
110
  feat_sizes: [64, 64]
111
  rope_k_repeat: True
 
117
  num_layers: 4
118
 
119
  memory_encoder:
120
+ _target_: bboxmaskpose.sam2.modeling.memory_encoder.MemoryEncoder
121
  out_dim: 64
122
  position_encoding:
123
+ _target_: bboxmaskpose.sam2.modeling.position_encoding.PositionEmbeddingSine
124
  num_pos_feats: 64
125
  normalize: true
126
  scale: null
127
  temperature: 10000
128
  mask_downsampler:
129
+ _target_: bboxmaskpose.sam2.modeling.memory_encoder.MaskDownSampler
130
  kernel_size: 3
131
  stride: 2
132
  padding: 1
133
  fuser:
134
+ _target_: bboxmaskpose.sam2.modeling.memory_encoder.Fuser
135
  layer:
136
+ _target_: bboxmaskpose.sam2.modeling.memory_encoder.CXBlock
137
  dim: 256
138
  kernel_size: 7
139
  padding: 3
 
317
 
318
  state_dict:
319
  _target_: training.utils.checkpoint_utils.load_checkpoint_and_apply_kernels
320
+ checkpoint_path: ./checkpoints/bboxmaskpose.sam2.1_hiera_base_plus.pt # PATH to SAM 2.1 checkpoint
321
  ckpt_state_dict_keys: ['model']
322
 
323
  launcher:
{sam2 → bboxmaskpose/sam2}/configs/sam2.1_training/sam2.1_hiera_b+_COCO_finetune.yaml RENAMED
@@ -11,15 +11,6 @@ scratch:
11
  phases_per_epoch: 1
12
  num_epochs: 40
13
 
14
- dataset:
15
- # PATHS to Dataset
16
- img_folder: /mnt/personal/purkrmir/data/COCO/original/train2017/ # PATH to MOSE JPEGImages folder
17
- gt_folder: /mnt/personal/purkrmir/data/COCO/original/annotations/ # PATH to MOSE Annotations folder
18
- # img_folder: /datagrid/personal/purkrmir/data/COCO/original/val2017/ # PATH to MOSE JPEGImages folder
19
- # gt_folder: /datagrid/personal/purkrmir/data/COCO/original/annotations/ # PATH to MOSE Annotations folder
20
- file_list_txt: null # Optional PATH to filelist containing a subset of videos to be used for training
21
- multiplier: 2
22
-
23
  # Video transforms
24
  vos:
25
  train_transforms:
@@ -69,19 +60,19 @@ trainer:
69
  unfreeze_decoder: False
70
 
71
  model:
72
- _target_: training.model.sam2.SAM2Train
73
  image_encoder:
74
- _target_: sam2.modeling.backbones.image_encoder.ImageEncoder
75
  scalp: 1
76
  trunk:
77
- _target_: sam2.modeling.backbones.hieradet.Hiera
78
  embed_dim: 112
79
  num_heads: 2
80
  drop_path_rate: 0.1
81
  neck:
82
- _target_: sam2.modeling.backbones.image_encoder.FpnNeck
83
  position_encoding:
84
- _target_: sam2.modeling.position_encoding.PositionEmbeddingSine
85
  num_pos_feats: 256
86
  normalize: true
87
  scale: null
@@ -92,17 +83,17 @@ trainer:
92
  fpn_interp_model: nearest
93
 
94
  memory_attention:
95
- _target_: sam2.modeling.memory_attention.MemoryAttention
96
  d_model: 256
97
  pos_enc_at_input: true
98
  layer:
99
- _target_: sam2.modeling.memory_attention.MemoryAttentionLayer
100
  activation: relu
101
  dim_feedforward: 2048
102
  dropout: 0.1
103
  pos_enc_at_attn: false
104
  self_attention:
105
- _target_: sam2.modeling.sam.transformer.RoPEAttention
106
  rope_theta: 10000.0
107
  feat_sizes: [64, 64]
108
  embedding_dim: 256
@@ -113,7 +104,7 @@ trainer:
113
  pos_enc_at_cross_attn_keys: true
114
  pos_enc_at_cross_attn_queries: false
115
  cross_attention:
116
- _target_: sam2.modeling.sam.transformer.RoPEAttention
117
  rope_theta: 10000.0
118
  feat_sizes: [64, 64]
119
  rope_k_repeat: True
@@ -125,23 +116,23 @@ trainer:
125
  num_layers: 4
126
 
127
  memory_encoder:
128
- _target_: sam2.modeling.memory_encoder.MemoryEncoder
129
  out_dim: 64
130
  position_encoding:
131
- _target_: sam2.modeling.position_encoding.PositionEmbeddingSine
132
  num_pos_feats: 64
133
  normalize: true
134
  scale: null
135
  temperature: 10000
136
  mask_downsampler:
137
- _target_: sam2.modeling.memory_encoder.MaskDownSampler
138
  kernel_size: 3
139
  stride: 2
140
  padding: 1
141
  fuser:
142
- _target_: sam2.modeling.memory_encoder.Fuser
143
  layer:
144
- _target_: sam2.modeling.memory_encoder.CXBlock
145
  dim: 256
146
  kernel_size: 7
147
  padding: 3
@@ -325,7 +316,7 @@ trainer:
325
 
326
  state_dict:
327
  _target_: training.utils.checkpoint_utils.load_checkpoint_and_apply_kernels
328
- checkpoint_path: ./checkpoints/sam2.1_hiera_base_plus.pt # PATH to SAM 2.1 checkpoint
329
  ckpt_state_dict_keys: ['model']
330
 
331
  launcher:
 
11
  phases_per_epoch: 1
12
  num_epochs: 40
13
 
 
 
 
 
 
 
 
 
 
14
  # Video transforms
15
  vos:
16
  train_transforms:
 
60
  unfreeze_decoder: False
61
 
62
  model:
63
+ _target_: training.model.bboxmaskpose.sam2.sam2Train
64
  image_encoder:
65
+ _target_: bboxmaskpose.sam2.modeling.backbones.image_encoder.ImageEncoder
66
  scalp: 1
67
  trunk:
68
+ _target_: bboxmaskpose.sam2.modeling.backbones.hieradet.Hiera
69
  embed_dim: 112
70
  num_heads: 2
71
  drop_path_rate: 0.1
72
  neck:
73
+ _target_: bboxmaskpose.sam2.modeling.backbones.image_encoder.FpnNeck
74
  position_encoding:
75
+ _target_: bboxmaskpose.sam2.modeling.position_encoding.PositionEmbeddingSine
76
  num_pos_feats: 256
77
  normalize: true
78
  scale: null
 
83
  fpn_interp_model: nearest
84
 
85
  memory_attention:
86
+ _target_: bboxmaskpose.sam2.modeling.memory_attention.MemoryAttention
87
  d_model: 256
88
  pos_enc_at_input: true
89
  layer:
90
+ _target_: bboxmaskpose.sam2.modeling.memory_attention.MemoryAttentionLayer
91
  activation: relu
92
  dim_feedforward: 2048
93
  dropout: 0.1
94
  pos_enc_at_attn: false
95
  self_attention:
96
+ _target_: bboxmaskpose.sam2.modeling.sam.transformer.RoPEAttention
97
  rope_theta: 10000.0
98
  feat_sizes: [64, 64]
99
  embedding_dim: 256
 
104
  pos_enc_at_cross_attn_keys: true
105
  pos_enc_at_cross_attn_queries: false
106
  cross_attention:
107
+ _target_: bboxmaskpose.sam2.modeling.sam.transformer.RoPEAttention
108
  rope_theta: 10000.0
109
  feat_sizes: [64, 64]
110
  rope_k_repeat: True
 
116
  num_layers: 4
117
 
118
  memory_encoder:
119
+ _target_: bboxmaskpose.sam2.modeling.memory_encoder.MemoryEncoder
120
  out_dim: 64
121
  position_encoding:
122
+ _target_: bboxmaskpose.sam2.modeling.position_encoding.PositionEmbeddingSine
123
  num_pos_feats: 64
124
  normalize: true
125
  scale: null
126
  temperature: 10000
127
  mask_downsampler:
128
+ _target_: bboxmaskpose.sam2.modeling.memory_encoder.MaskDownSampler
129
  kernel_size: 3
130
  stride: 2
131
  padding: 1
132
  fuser:
133
+ _target_: bboxmaskpose.sam2.modeling.memory_encoder.Fuser
134
  layer:
135
+ _target_: bboxmaskpose.sam2.modeling.memory_encoder.CXBlock
136
  dim: 256
137
  kernel_size: 7
138
  padding: 3
 
316
 
317
  state_dict:
318
  _target_: training.utils.checkpoint_utils.load_checkpoint_and_apply_kernels
319
+ checkpoint_path: ./checkpoints/bboxmaskpose.sam2.1_hiera_base_plus.pt # PATH to SAM 2.1 checkpoint
320
  ckpt_state_dict_keys: ['model']
321
 
322
  launcher:
{sam2 → bboxmaskpose/sam2}/configs/sam2.1_training/sam2.1_hiera_b+_COCO_finetune_prompt+decoder.yaml RENAMED
@@ -11,15 +11,6 @@ scratch:
11
  phases_per_epoch: 1
12
  num_epochs: 40
13
 
14
- dataset:
15
- # PATHS to Dataset
16
- img_folder: /mnt/personal/purkrmir/data/COCO/original/train2017/ # PATH to MOSE JPEGImages folder
17
- gt_folder: /mnt/personal/purkrmir/data/COCO/original/annotations/ # PATH to MOSE Annotations folder
18
- # img_folder: /datagrid/personal/purkrmir/data/COCO/original/train2017/ # PATH to MOSE JPEGImages folder
19
- # gt_folder: /datagrid/personal/purkrmir/data/COCO/original/annotations/ # PATH to MOSE Annotations folder
20
- file_list_txt: null # Optional PATH to filelist containing a subset of videos to be used for training
21
- multiplier: 2
22
-
23
  # Video transforms
24
  vos:
25
  train_transforms:
@@ -69,19 +60,19 @@ trainer:
69
  unfreeze_decoder: True
70
 
71
  model:
72
- _target_: training.model.sam2.SAM2Train
73
  image_encoder:
74
- _target_: sam2.modeling.backbones.image_encoder.ImageEncoder
75
  scalp: 1
76
  trunk:
77
- _target_: sam2.modeling.backbones.hieradet.Hiera
78
  embed_dim: 112
79
  num_heads: 2
80
  drop_path_rate: 0.1
81
  neck:
82
- _target_: sam2.modeling.backbones.image_encoder.FpnNeck
83
  position_encoding:
84
- _target_: sam2.modeling.position_encoding.PositionEmbeddingSine
85
  num_pos_feats: 256
86
  normalize: true
87
  scale: null
@@ -92,17 +83,17 @@ trainer:
92
  fpn_interp_model: nearest
93
 
94
  memory_attention:
95
- _target_: sam2.modeling.memory_attention.MemoryAttention
96
  d_model: 256
97
  pos_enc_at_input: true
98
  layer:
99
- _target_: sam2.modeling.memory_attention.MemoryAttentionLayer
100
  activation: relu
101
  dim_feedforward: 2048
102
  dropout: 0.1
103
  pos_enc_at_attn: false
104
  self_attention:
105
- _target_: sam2.modeling.sam.transformer.RoPEAttention
106
  rope_theta: 10000.0
107
  feat_sizes: [64, 64]
108
  embedding_dim: 256
@@ -113,7 +104,7 @@ trainer:
113
  pos_enc_at_cross_attn_keys: true
114
  pos_enc_at_cross_attn_queries: false
115
  cross_attention:
116
- _target_: sam2.modeling.sam.transformer.RoPEAttention
117
  rope_theta: 10000.0
118
  feat_sizes: [64, 64]
119
  rope_k_repeat: True
@@ -125,23 +116,23 @@ trainer:
125
  num_layers: 4
126
 
127
  memory_encoder:
128
- _target_: sam2.modeling.memory_encoder.MemoryEncoder
129
  out_dim: 64
130
  position_encoding:
131
- _target_: sam2.modeling.position_encoding.PositionEmbeddingSine
132
  num_pos_feats: 64
133
  normalize: true
134
  scale: null
135
  temperature: 10000
136
  mask_downsampler:
137
- _target_: sam2.modeling.memory_encoder.MaskDownSampler
138
  kernel_size: 3
139
  stride: 2
140
  padding: 1
141
  fuser:
142
- _target_: sam2.modeling.memory_encoder.Fuser
143
  layer:
144
- _target_: sam2.modeling.memory_encoder.CXBlock
145
  dim: 256
146
  kernel_size: 7
147
  padding: 3
@@ -325,7 +316,7 @@ trainer:
325
 
326
  state_dict:
327
  _target_: training.utils.checkpoint_utils.load_checkpoint_and_apply_kernels
328
- checkpoint_path: ./checkpoints/sam2.1_hiera_base_plus.pt # PATH to SAM 2.1 checkpoint
329
  ckpt_state_dict_keys: ['model']
330
 
331
  launcher:
 
11
  phases_per_epoch: 1
12
  num_epochs: 40
13
 
 
 
 
 
 
 
 
 
 
14
  # Video transforms
15
  vos:
16
  train_transforms:
 
60
  unfreeze_decoder: True
61
 
62
  model:
63
+ _target_: training.model.bboxmaskpose.sam2.sam2Train
64
  image_encoder:
65
+ _target_: bboxmaskpose.sam2.modeling.backbones.image_encoder.ImageEncoder
66
  scalp: 1
67
  trunk:
68
+ _target_: bboxmaskpose.sam2.modeling.backbones.hieradet.Hiera
69
  embed_dim: 112
70
  num_heads: 2
71
  drop_path_rate: 0.1
72
  neck:
73
+ _target_: bboxmaskpose.sam2.modeling.backbones.image_encoder.FpnNeck
74
  position_encoding:
75
+ _target_: bboxmaskpose.sam2.modeling.position_encoding.PositionEmbeddingSine
76
  num_pos_feats: 256
77
  normalize: true
78
  scale: null
 
83
  fpn_interp_model: nearest
84
 
85
  memory_attention:
86
+ _target_: bboxmaskpose.sam2.modeling.memory_attention.MemoryAttention
87
  d_model: 256
88
  pos_enc_at_input: true
89
  layer:
90
+ _target_: bboxmaskpose.sam2.modeling.memory_attention.MemoryAttentionLayer
91
  activation: relu
92
  dim_feedforward: 2048
93
  dropout: 0.1
94
  pos_enc_at_attn: false
95
  self_attention:
96
+ _target_: bboxmaskpose.sam2.modeling.sam.transformer.RoPEAttention
97
  rope_theta: 10000.0
98
  feat_sizes: [64, 64]
99
  embedding_dim: 256
 
104
  pos_enc_at_cross_attn_keys: true
105
  pos_enc_at_cross_attn_queries: false
106
  cross_attention:
107
+ _target_: bboxmaskpose.sam2.modeling.sam.transformer.RoPEAttention
108
  rope_theta: 10000.0
109
  feat_sizes: [64, 64]
110
  rope_k_repeat: True
 
116
  num_layers: 4
117
 
118
  memory_encoder:
119
+ _target_: bboxmaskpose.sam2.modeling.memory_encoder.MemoryEncoder
120
  out_dim: 64
121
  position_encoding:
122
+ _target_: bboxmaskpose.sam2.modeling.position_encoding.PositionEmbeddingSine
123
  num_pos_feats: 64
124
  normalize: true
125
  scale: null
126
  temperature: 10000
127
  mask_downsampler:
128
+ _target_: bboxmaskpose.sam2.modeling.memory_encoder.MaskDownSampler
129
  kernel_size: 3
130
  stride: 2
131
  padding: 1
132
  fuser:
133
+ _target_: bboxmaskpose.sam2.modeling.memory_encoder.Fuser
134
  layer:
135
+ _target_: bboxmaskpose.sam2.modeling.memory_encoder.CXBlock
136
  dim: 256
137
  kernel_size: 7
138
  padding: 3
 
316
 
317
  state_dict:
318
  _target_: training.utils.checkpoint_utils.load_checkpoint_and_apply_kernels
319
+ checkpoint_path: ./checkpoints/bboxmaskpose.sam2.1_hiera_base_plus.pt # PATH to SAM 2.1 checkpoint
320
  ckpt_state_dict_keys: ['model']
321
 
322
  launcher:
{sam2 → bboxmaskpose/sam2}/configs/sam2.1_training/sam2.1_hiera_b+_MOSE_finetune.yaml RENAMED
@@ -11,12 +11,6 @@ scratch:
11
  phases_per_epoch: 1
12
  num_epochs: 40
13
 
14
- dataset:
15
- # PATHS to Dataset
16
- img_folder: /datagrid/personal/purkrmir/data/MOSE/train/JPEGImages/ # PATH to MOSE JPEGImages folder
17
- gt_folder: /datagrid/personal/purkrmir/data/MOSE/train/Annotations/ # PATH to MOSE Annotations folder
18
- file_list_txt: training/assets/MOSE_sample_train_list.txt # Optional PATH to filelist containing a subset of videos to be used for training
19
- multiplier: 2
20
 
21
  # Video transforms
22
  vos:
@@ -62,19 +56,19 @@ trainer:
62
  seed_value: 123
63
 
64
  model:
65
- _target_: training.model.sam2.SAM2Train
66
  image_encoder:
67
- _target_: sam2.modeling.backbones.image_encoder.ImageEncoder
68
  scalp: 1
69
  trunk:
70
- _target_: sam2.modeling.backbones.hieradet.Hiera
71
  embed_dim: 112
72
  num_heads: 2
73
  drop_path_rate: 0.1
74
  neck:
75
- _target_: sam2.modeling.backbones.image_encoder.FpnNeck
76
  position_encoding:
77
- _target_: sam2.modeling.position_encoding.PositionEmbeddingSine
78
  num_pos_feats: 256
79
  normalize: true
80
  scale: null
@@ -85,17 +79,17 @@ trainer:
85
  fpn_interp_model: nearest
86
 
87
  memory_attention:
88
- _target_: sam2.modeling.memory_attention.MemoryAttention
89
  d_model: 256
90
  pos_enc_at_input: true
91
  layer:
92
- _target_: sam2.modeling.memory_attention.MemoryAttentionLayer
93
  activation: relu
94
  dim_feedforward: 2048
95
  dropout: 0.1
96
  pos_enc_at_attn: false
97
  self_attention:
98
- _target_: sam2.modeling.sam.transformer.RoPEAttention
99
  rope_theta: 10000.0
100
  feat_sizes: [64, 64]
101
  embedding_dim: 256
@@ -106,7 +100,7 @@ trainer:
106
  pos_enc_at_cross_attn_keys: true
107
  pos_enc_at_cross_attn_queries: false
108
  cross_attention:
109
- _target_: sam2.modeling.sam.transformer.RoPEAttention
110
  rope_theta: 10000.0
111
  feat_sizes: [64, 64]
112
  rope_k_repeat: True
@@ -118,23 +112,23 @@ trainer:
118
  num_layers: 4
119
 
120
  memory_encoder:
121
- _target_: sam2.modeling.memory_encoder.MemoryEncoder
122
  out_dim: 64
123
  position_encoding:
124
- _target_: sam2.modeling.position_encoding.PositionEmbeddingSine
125
  num_pos_feats: 64
126
  normalize: true
127
  scale: null
128
  temperature: 10000
129
  mask_downsampler:
130
- _target_: sam2.modeling.memory_encoder.MaskDownSampler
131
  kernel_size: 3
132
  stride: 2
133
  padding: 1
134
  fuser:
135
- _target_: sam2.modeling.memory_encoder.Fuser
136
  layer:
137
- _target_: sam2.modeling.memory_encoder.CXBlock
138
  dim: 256
139
  kernel_size: 7
140
  padding: 3
@@ -318,7 +312,7 @@ trainer:
318
 
319
  state_dict:
320
  _target_: training.utils.checkpoint_utils.load_checkpoint_and_apply_kernels
321
- checkpoint_path: ./checkpoints/sam2.1_hiera_base_plus.pt # PATH to SAM 2.1 checkpoint
322
  ckpt_state_dict_keys: ['model']
323
 
324
  launcher:
 
11
  phases_per_epoch: 1
12
  num_epochs: 40
13
 
 
 
 
 
 
 
14
 
15
  # Video transforms
16
  vos:
 
56
  seed_value: 123
57
 
58
  model:
59
+ _target_: training.model.bboxmaskpose.sam2.sam2Train
60
  image_encoder:
61
+ _target_: bboxmaskpose.sam2.modeling.backbones.image_encoder.ImageEncoder
62
  scalp: 1
63
  trunk:
64
+ _target_: bboxmaskpose.sam2.modeling.backbones.hieradet.Hiera
65
  embed_dim: 112
66
  num_heads: 2
67
  drop_path_rate: 0.1
68
  neck:
69
+ _target_: bboxmaskpose.sam2.modeling.backbones.image_encoder.FpnNeck
70
  position_encoding:
71
+ _target_: bboxmaskpose.sam2.modeling.position_encoding.PositionEmbeddingSine
72
  num_pos_feats: 256
73
  normalize: true
74
  scale: null
 
79
  fpn_interp_model: nearest
80
 
81
  memory_attention:
82
+ _target_: bboxmaskpose.sam2.modeling.memory_attention.MemoryAttention
83
  d_model: 256
84
  pos_enc_at_input: true
85
  layer:
86
+ _target_: bboxmaskpose.sam2.modeling.memory_attention.MemoryAttentionLayer
87
  activation: relu
88
  dim_feedforward: 2048
89
  dropout: 0.1
90
  pos_enc_at_attn: false
91
  self_attention:
92
+ _target_: bboxmaskpose.sam2.modeling.sam.transformer.RoPEAttention
93
  rope_theta: 10000.0
94
  feat_sizes: [64, 64]
95
  embedding_dim: 256
 
100
  pos_enc_at_cross_attn_keys: true
101
  pos_enc_at_cross_attn_queries: false
102
  cross_attention:
103
+ _target_: bboxmaskpose.sam2.modeling.sam.transformer.RoPEAttention
104
  rope_theta: 10000.0
105
  feat_sizes: [64, 64]
106
  rope_k_repeat: True
 
112
  num_layers: 4
113
 
114
  memory_encoder:
115
+ _target_: bboxmaskpose.sam2.modeling.memory_encoder.MemoryEncoder
116
  out_dim: 64
117
  position_encoding:
118
+ _target_: bboxmaskpose.sam2.modeling.position_encoding.PositionEmbeddingSine
119
  num_pos_feats: 64
120
  normalize: true
121
  scale: null
122
  temperature: 10000
123
  mask_downsampler:
124
+ _target_: bboxmaskpose.sam2.modeling.memory_encoder.MaskDownSampler
125
  kernel_size: 3
126
  stride: 2
127
  padding: 1
128
  fuser:
129
+ _target_: bboxmaskpose.sam2.modeling.memory_encoder.Fuser
130
  layer:
131
+ _target_: bboxmaskpose.sam2.modeling.memory_encoder.CXBlock
132
  dim: 256
133
  kernel_size: 7
134
  padding: 3
 
312
 
313
  state_dict:
314
  _target_: training.utils.checkpoint_utils.load_checkpoint_and_apply_kernels
315
+ checkpoint_path: ./checkpoints/bboxmaskpose.sam2.1_hiera_base_plus.pt # PATH to SAM 2.1 checkpoint
316
  ckpt_state_dict_keys: ['model']
317
 
318
  launcher:
{sam2 → bboxmaskpose/sam2}/configs/sam2/sam2_hiera_b+.yaml RENAMED
@@ -2,18 +2,18 @@
2
 
3
  # Model
4
  model:
5
- _target_: sam2.modeling.sam2_base.SAM2Base
6
  image_encoder:
7
- _target_: sam2.modeling.backbones.image_encoder.ImageEncoder
8
  scalp: 1
9
  trunk:
10
- _target_: sam2.modeling.backbones.hieradet.Hiera
11
  embed_dim: 112
12
  num_heads: 2
13
  neck:
14
- _target_: sam2.modeling.backbones.image_encoder.FpnNeck
15
  position_encoding:
16
- _target_: sam2.modeling.position_encoding.PositionEmbeddingSine
17
  num_pos_feats: 256
18
  normalize: true
19
  scale: null
@@ -24,17 +24,17 @@ model:
24
  fpn_interp_model: nearest
25
 
26
  memory_attention:
27
- _target_: sam2.modeling.memory_attention.MemoryAttention
28
  d_model: 256
29
  pos_enc_at_input: true
30
  layer:
31
- _target_: sam2.modeling.memory_attention.MemoryAttentionLayer
32
  activation: relu
33
  dim_feedforward: 2048
34
  dropout: 0.1
35
  pos_enc_at_attn: false
36
  self_attention:
37
- _target_: sam2.modeling.sam.transformer.RoPEAttention
38
  rope_theta: 10000.0
39
  feat_sizes: [32, 32]
40
  embedding_dim: 256
@@ -45,7 +45,7 @@ model:
45
  pos_enc_at_cross_attn_keys: true
46
  pos_enc_at_cross_attn_queries: false
47
  cross_attention:
48
- _target_: sam2.modeling.sam.transformer.RoPEAttention
49
  rope_theta: 10000.0
50
  feat_sizes: [32, 32]
51
  rope_k_repeat: True
@@ -57,23 +57,23 @@ model:
57
  num_layers: 4
58
 
59
  memory_encoder:
60
- _target_: sam2.modeling.memory_encoder.MemoryEncoder
61
  out_dim: 64
62
  position_encoding:
63
- _target_: sam2.modeling.position_encoding.PositionEmbeddingSine
64
  num_pos_feats: 64
65
  normalize: true
66
  scale: null
67
  temperature: 10000
68
  mask_downsampler:
69
- _target_: sam2.modeling.memory_encoder.MaskDownSampler
70
  kernel_size: 3
71
  stride: 2
72
  padding: 1
73
  fuser:
74
- _target_: sam2.modeling.memory_encoder.Fuser
75
  layer:
76
- _target_: sam2.modeling.memory_encoder.CXBlock
77
  dim: 256
78
  kernel_size: 7
79
  padding: 3
 
2
 
3
  # Model
4
  model:
5
+ _target_: bboxmaskpose.sam2.modeling.sam2_base.SAM2Base
6
  image_encoder:
7
+ _target_: bboxmaskpose.sam2.modeling.backbones.image_encoder.ImageEncoder
8
  scalp: 1
9
  trunk:
10
+ _target_: bboxmaskpose.sam2.modeling.backbones.hieradet.Hiera
11
  embed_dim: 112
12
  num_heads: 2
13
  neck:
14
+ _target_: bboxmaskpose.sam2.modeling.backbones.image_encoder.FpnNeck
15
  position_encoding:
16
+ _target_: bboxmaskpose.sam2.modeling.position_encoding.PositionEmbeddingSine
17
  num_pos_feats: 256
18
  normalize: true
19
  scale: null
 
24
  fpn_interp_model: nearest
25
 
26
  memory_attention:
27
+ _target_: bboxmaskpose.sam2.modeling.memory_attention.MemoryAttention
28
  d_model: 256
29
  pos_enc_at_input: true
30
  layer:
31
+ _target_: bboxmaskpose.sam2.modeling.memory_attention.MemoryAttentionLayer
32
  activation: relu
33
  dim_feedforward: 2048
34
  dropout: 0.1
35
  pos_enc_at_attn: false
36
  self_attention:
37
+ _target_: bboxmaskpose.sam2.modeling.sam.transformer.RoPEAttention
38
  rope_theta: 10000.0
39
  feat_sizes: [32, 32]
40
  embedding_dim: 256
 
45
  pos_enc_at_cross_attn_keys: true
46
  pos_enc_at_cross_attn_queries: false
47
  cross_attention:
48
+ _target_: bboxmaskpose.sam2.modeling.sam.transformer.RoPEAttention
49
  rope_theta: 10000.0
50
  feat_sizes: [32, 32]
51
  rope_k_repeat: True
 
57
  num_layers: 4
58
 
59
  memory_encoder:
60
+ _target_: bboxmaskpose.sam2.modeling.memory_encoder.MemoryEncoder
61
  out_dim: 64
62
  position_encoding:
63
+ _target_: bboxmaskpose.sam2.modeling.position_encoding.PositionEmbeddingSine
64
  num_pos_feats: 64
65
  normalize: true
66
  scale: null
67
  temperature: 10000
68
  mask_downsampler:
69
+ _target_: bboxmaskpose.sam2.modeling.memory_encoder.MaskDownSampler
70
  kernel_size: 3
71
  stride: 2
72
  padding: 1
73
  fuser:
74
+ _target_: bboxmaskpose.sam2.modeling.memory_encoder.Fuser
75
  layer:
76
+ _target_: bboxmaskpose.sam2.modeling.memory_encoder.CXBlock
77
  dim: 256
78
  kernel_size: 7
79
  padding: 3
{sam2 → bboxmaskpose/sam2}/configs/sam2/sam2_hiera_l.yaml RENAMED
@@ -2,12 +2,12 @@
2
 
3
  # Model
4
  model:
5
- _target_: sam2.modeling.sam2_base.SAM2Base
6
  image_encoder:
7
- _target_: sam2.modeling.backbones.image_encoder.ImageEncoder
8
  scalp: 1
9
  trunk:
10
- _target_: sam2.modeling.backbones.hieradet.Hiera
11
  embed_dim: 144
12
  num_heads: 2
13
  stages: [2, 6, 36, 4]
@@ -15,9 +15,9 @@ model:
15
  window_pos_embed_bkg_spatial_size: [7, 7]
16
  window_spec: [8, 4, 16, 8]
17
  neck:
18
- _target_: sam2.modeling.backbones.image_encoder.FpnNeck
19
  position_encoding:
20
- _target_: sam2.modeling.position_encoding.PositionEmbeddingSine
21
  num_pos_feats: 256
22
  normalize: true
23
  scale: null
@@ -28,17 +28,17 @@ model:
28
  fpn_interp_model: nearest
29
 
30
  memory_attention:
31
- _target_: sam2.modeling.memory_attention.MemoryAttention
32
  d_model: 256
33
  pos_enc_at_input: true
34
  layer:
35
- _target_: sam2.modeling.memory_attention.MemoryAttentionLayer
36
  activation: relu
37
  dim_feedforward: 2048
38
  dropout: 0.1
39
  pos_enc_at_attn: false
40
  self_attention:
41
- _target_: sam2.modeling.sam.transformer.RoPEAttention
42
  rope_theta: 10000.0
43
  feat_sizes: [32, 32]
44
  embedding_dim: 256
@@ -49,7 +49,7 @@ model:
49
  pos_enc_at_cross_attn_keys: true
50
  pos_enc_at_cross_attn_queries: false
51
  cross_attention:
52
- _target_: sam2.modeling.sam.transformer.RoPEAttention
53
  rope_theta: 10000.0
54
  feat_sizes: [32, 32]
55
  rope_k_repeat: True
@@ -61,23 +61,23 @@ model:
61
  num_layers: 4
62
 
63
  memory_encoder:
64
- _target_: sam2.modeling.memory_encoder.MemoryEncoder
65
  out_dim: 64
66
  position_encoding:
67
- _target_: sam2.modeling.position_encoding.PositionEmbeddingSine
68
  num_pos_feats: 64
69
  normalize: true
70
  scale: null
71
  temperature: 10000
72
  mask_downsampler:
73
- _target_: sam2.modeling.memory_encoder.MaskDownSampler
74
  kernel_size: 3
75
  stride: 2
76
  padding: 1
77
  fuser:
78
- _target_: sam2.modeling.memory_encoder.Fuser
79
  layer:
80
- _target_: sam2.modeling.memory_encoder.CXBlock
81
  dim: 256
82
  kernel_size: 7
83
  padding: 3
 
2
 
3
  # Model
4
  model:
5
+ _target_: bboxmaskpose.sam2.modeling.sam2_base.SAM2Base
6
  image_encoder:
7
+ _target_: bboxmaskpose.sam2.modeling.backbones.image_encoder.ImageEncoder
8
  scalp: 1
9
  trunk:
10
+ _target_: bboxmaskpose.sam2.modeling.backbones.hieradet.Hiera
11
  embed_dim: 144
12
  num_heads: 2
13
  stages: [2, 6, 36, 4]
 
15
  window_pos_embed_bkg_spatial_size: [7, 7]
16
  window_spec: [8, 4, 16, 8]
17
  neck:
18
+ _target_: bboxmaskpose.sam2.modeling.backbones.image_encoder.FpnNeck
19
  position_encoding:
20
+ _target_: bboxmaskpose.sam2.modeling.position_encoding.PositionEmbeddingSine
21
  num_pos_feats: 256
22
  normalize: true
23
  scale: null
 
28
  fpn_interp_model: nearest
29
 
30
  memory_attention:
31
+ _target_: bboxmaskpose.sam2.modeling.memory_attention.MemoryAttention
32
  d_model: 256
33
  pos_enc_at_input: true
34
  layer:
35
+ _target_: bboxmaskpose.sam2.modeling.memory_attention.MemoryAttentionLayer
36
  activation: relu
37
  dim_feedforward: 2048
38
  dropout: 0.1
39
  pos_enc_at_attn: false
40
  self_attention:
41
+ _target_: bboxmaskpose.sam2.modeling.sam.transformer.RoPEAttention
42
  rope_theta: 10000.0
43
  feat_sizes: [32, 32]
44
  embedding_dim: 256
 
49
  pos_enc_at_cross_attn_keys: true
50
  pos_enc_at_cross_attn_queries: false
51
  cross_attention:
52
+ _target_: bboxmaskpose.sam2.modeling.sam.transformer.RoPEAttention
53
  rope_theta: 10000.0
54
  feat_sizes: [32, 32]
55
  rope_k_repeat: True
 
61
  num_layers: 4
62
 
63
  memory_encoder:
64
+ _target_: bboxmaskpose.sam2.modeling.memory_encoder.MemoryEncoder
65
  out_dim: 64
66
  position_encoding:
67
+ _target_: bboxmaskpose.sam2.modeling.position_encoding.PositionEmbeddingSine
68
  num_pos_feats: 64
69
  normalize: true
70
  scale: null
71
  temperature: 10000
72
  mask_downsampler:
73
+ _target_: bboxmaskpose.sam2.modeling.memory_encoder.MaskDownSampler
74
  kernel_size: 3
75
  stride: 2
76
  padding: 1
77
  fuser:
78
+ _target_: bboxmaskpose.sam2.modeling.memory_encoder.Fuser
79
  layer:
80
+ _target_: bboxmaskpose.sam2.modeling.memory_encoder.CXBlock
81
  dim: 256
82
  kernel_size: 7
83
  padding: 3
{sam2 → bboxmaskpose/sam2}/configs/sam2/sam2_hiera_s.yaml RENAMED
@@ -2,21 +2,21 @@
2
 
3
  # Model
4
  model:
5
- _target_: sam2.modeling.sam2_base.SAM2Base
6
  image_encoder:
7
- _target_: sam2.modeling.backbones.image_encoder.ImageEncoder
8
  scalp: 1
9
  trunk:
10
- _target_: sam2.modeling.backbones.hieradet.Hiera
11
  embed_dim: 96
12
  num_heads: 1
13
  stages: [1, 2, 11, 2]
14
  global_att_blocks: [7, 10, 13]
15
  window_pos_embed_bkg_spatial_size: [7, 7]
16
  neck:
17
- _target_: sam2.modeling.backbones.image_encoder.FpnNeck
18
  position_encoding:
19
- _target_: sam2.modeling.position_encoding.PositionEmbeddingSine
20
  num_pos_feats: 256
21
  normalize: true
22
  scale: null
@@ -27,17 +27,17 @@ model:
27
  fpn_interp_model: nearest
28
 
29
  memory_attention:
30
- _target_: sam2.modeling.memory_attention.MemoryAttention
31
  d_model: 256
32
  pos_enc_at_input: true
33
  layer:
34
- _target_: sam2.modeling.memory_attention.MemoryAttentionLayer
35
  activation: relu
36
  dim_feedforward: 2048
37
  dropout: 0.1
38
  pos_enc_at_attn: false
39
  self_attention:
40
- _target_: sam2.modeling.sam.transformer.RoPEAttention
41
  rope_theta: 10000.0
42
  feat_sizes: [32, 32]
43
  embedding_dim: 256
@@ -48,7 +48,7 @@ model:
48
  pos_enc_at_cross_attn_keys: true
49
  pos_enc_at_cross_attn_queries: false
50
  cross_attention:
51
- _target_: sam2.modeling.sam.transformer.RoPEAttention
52
  rope_theta: 10000.0
53
  feat_sizes: [32, 32]
54
  rope_k_repeat: True
@@ -60,23 +60,23 @@ model:
60
  num_layers: 4
61
 
62
  memory_encoder:
63
- _target_: sam2.modeling.memory_encoder.MemoryEncoder
64
  out_dim: 64
65
  position_encoding:
66
- _target_: sam2.modeling.position_encoding.PositionEmbeddingSine
67
  num_pos_feats: 64
68
  normalize: true
69
  scale: null
70
  temperature: 10000
71
  mask_downsampler:
72
- _target_: sam2.modeling.memory_encoder.MaskDownSampler
73
  kernel_size: 3
74
  stride: 2
75
  padding: 1
76
  fuser:
77
- _target_: sam2.modeling.memory_encoder.Fuser
78
  layer:
79
- _target_: sam2.modeling.memory_encoder.CXBlock
80
  dim: 256
81
  kernel_size: 7
82
  padding: 3
 
2
 
3
  # Model
4
  model:
5
+ _target_: bboxmaskpose.sam2.modeling.sam2_base.SAM2Base
6
  image_encoder:
7
+ _target_: bboxmaskpose.sam2.modeling.backbones.image_encoder.ImageEncoder
8
  scalp: 1
9
  trunk:
10
+ _target_: bboxmaskpose.sam2.modeling.backbones.hieradet.Hiera
11
  embed_dim: 96
12
  num_heads: 1
13
  stages: [1, 2, 11, 2]
14
  global_att_blocks: [7, 10, 13]
15
  window_pos_embed_bkg_spatial_size: [7, 7]
16
  neck:
17
+ _target_: bboxmaskpose.sam2.modeling.backbones.image_encoder.FpnNeck
18
  position_encoding:
19
+ _target_: bboxmaskpose.sam2.modeling.position_encoding.PositionEmbeddingSine
20
  num_pos_feats: 256
21
  normalize: true
22
  scale: null
 
27
  fpn_interp_model: nearest
28
 
29
  memory_attention:
30
+ _target_: bboxmaskpose.sam2.modeling.memory_attention.MemoryAttention
31
  d_model: 256
32
  pos_enc_at_input: true
33
  layer:
34
+ _target_: bboxmaskpose.sam2.modeling.memory_attention.MemoryAttentionLayer
35
  activation: relu
36
  dim_feedforward: 2048
37
  dropout: 0.1
38
  pos_enc_at_attn: false
39
  self_attention:
40
+ _target_: bboxmaskpose.sam2.modeling.sam.transformer.RoPEAttention
41
  rope_theta: 10000.0
42
  feat_sizes: [32, 32]
43
  embedding_dim: 256
 
48
  pos_enc_at_cross_attn_keys: true
49
  pos_enc_at_cross_attn_queries: false
50
  cross_attention:
51
+ _target_: bboxmaskpose.sam2.modeling.sam.transformer.RoPEAttention
52
  rope_theta: 10000.0
53
  feat_sizes: [32, 32]
54
  rope_k_repeat: True
 
60
  num_layers: 4
61
 
62
  memory_encoder:
63
+ _target_: bboxmaskpose.sam2.modeling.memory_encoder.MemoryEncoder
64
  out_dim: 64
65
  position_encoding:
66
+ _target_: bboxmaskpose.sam2.modeling.position_encoding.PositionEmbeddingSine
67
  num_pos_feats: 64
68
  normalize: true
69
  scale: null
70
  temperature: 10000
71
  mask_downsampler:
72
+ _target_: bboxmaskpose.sam2.modeling.memory_encoder.MaskDownSampler
73
  kernel_size: 3
74
  stride: 2
75
  padding: 1
76
  fuser:
77
+ _target_: bboxmaskpose.sam2.modeling.memory_encoder.Fuser
78
  layer:
79
+ _target_: bboxmaskpose.sam2.modeling.memory_encoder.CXBlock
80
  dim: 256
81
  kernel_size: 7
82
  padding: 3
{sam2 → bboxmaskpose/sam2}/configs/sam2/sam2_hiera_t.yaml RENAMED
@@ -2,21 +2,21 @@
2
 
3
  # Model
4
  model:
5
- _target_: sam2.modeling.sam2_base.SAM2Base
6
  image_encoder:
7
- _target_: sam2.modeling.backbones.image_encoder.ImageEncoder
8
  scalp: 1
9
  trunk:
10
- _target_: sam2.modeling.backbones.hieradet.Hiera
11
  embed_dim: 96
12
  num_heads: 1
13
  stages: [1, 2, 7, 2]
14
  global_att_blocks: [5, 7, 9]
15
  window_pos_embed_bkg_spatial_size: [7, 7]
16
  neck:
17
- _target_: sam2.modeling.backbones.image_encoder.FpnNeck
18
  position_encoding:
19
- _target_: sam2.modeling.position_encoding.PositionEmbeddingSine
20
  num_pos_feats: 256
21
  normalize: true
22
  scale: null
@@ -27,17 +27,17 @@ model:
27
  fpn_interp_model: nearest
28
 
29
  memory_attention:
30
- _target_: sam2.modeling.memory_attention.MemoryAttention
31
  d_model: 256
32
  pos_enc_at_input: true
33
  layer:
34
- _target_: sam2.modeling.memory_attention.MemoryAttentionLayer
35
  activation: relu
36
  dim_feedforward: 2048
37
  dropout: 0.1
38
  pos_enc_at_attn: false
39
  self_attention:
40
- _target_: sam2.modeling.sam.transformer.RoPEAttention
41
  rope_theta: 10000.0
42
  feat_sizes: [32, 32]
43
  embedding_dim: 256
@@ -48,7 +48,7 @@ model:
48
  pos_enc_at_cross_attn_keys: true
49
  pos_enc_at_cross_attn_queries: false
50
  cross_attention:
51
- _target_: sam2.modeling.sam.transformer.RoPEAttention
52
  rope_theta: 10000.0
53
  feat_sizes: [32, 32]
54
  rope_k_repeat: True
@@ -60,23 +60,23 @@ model:
60
  num_layers: 4
61
 
62
  memory_encoder:
63
- _target_: sam2.modeling.memory_encoder.MemoryEncoder
64
  out_dim: 64
65
  position_encoding:
66
- _target_: sam2.modeling.position_encoding.PositionEmbeddingSine
67
  num_pos_feats: 64
68
  normalize: true
69
  scale: null
70
  temperature: 10000
71
  mask_downsampler:
72
- _target_: sam2.modeling.memory_encoder.MaskDownSampler
73
  kernel_size: 3
74
  stride: 2
75
  padding: 1
76
  fuser:
77
- _target_: sam2.modeling.memory_encoder.Fuser
78
  layer:
79
- _target_: sam2.modeling.memory_encoder.CXBlock
80
  dim: 256
81
  kernel_size: 7
82
  padding: 3
 
2
 
3
  # Model
4
  model:
5
+ _target_: bboxmaskpose.sam2.modeling.sam2_base.SAM2Base
6
  image_encoder:
7
+ _target_: bboxmaskpose.sam2.modeling.backbones.image_encoder.ImageEncoder
8
  scalp: 1
9
  trunk:
10
+ _target_: bboxmaskpose.sam2.modeling.backbones.hieradet.Hiera
11
  embed_dim: 96
12
  num_heads: 1
13
  stages: [1, 2, 7, 2]
14
  global_att_blocks: [5, 7, 9]
15
  window_pos_embed_bkg_spatial_size: [7, 7]
16
  neck:
17
+ _target_: bboxmaskpose.sam2.modeling.backbones.image_encoder.FpnNeck
18
  position_encoding:
19
+ _target_: bboxmaskpose.sam2.modeling.position_encoding.PositionEmbeddingSine
20
  num_pos_feats: 256
21
  normalize: true
22
  scale: null
 
27
  fpn_interp_model: nearest
28
 
29
  memory_attention:
30
+ _target_: bboxmaskpose.sam2.modeling.memory_attention.MemoryAttention
31
  d_model: 256
32
  pos_enc_at_input: true
33
  layer:
34
+ _target_: bboxmaskpose.sam2.modeling.memory_attention.MemoryAttentionLayer
35
  activation: relu
36
  dim_feedforward: 2048
37
  dropout: 0.1
38
  pos_enc_at_attn: false
39
  self_attention:
40
+ _target_: bboxmaskpose.sam2.modeling.sam.transformer.RoPEAttention
41
  rope_theta: 10000.0
42
  feat_sizes: [32, 32]
43
  embedding_dim: 256
 
48
  pos_enc_at_cross_attn_keys: true
49
  pos_enc_at_cross_attn_queries: false
50
  cross_attention:
51
+ _target_: bboxmaskpose.sam2.modeling.sam.transformer.RoPEAttention
52
  rope_theta: 10000.0
53
  feat_sizes: [32, 32]
54
  rope_k_repeat: True
 
60
  num_layers: 4
61
 
62
  memory_encoder:
63
+ _target_: bboxmaskpose.sam2.modeling.memory_encoder.MemoryEncoder
64
  out_dim: 64
65
  position_encoding:
66
+ _target_: bboxmaskpose.sam2.modeling.position_encoding.PositionEmbeddingSine
67
  num_pos_feats: 64
68
  normalize: true
69
  scale: null
70
  temperature: 10000
71
  mask_downsampler:
72
+ _target_: bboxmaskpose.sam2.modeling.memory_encoder.MaskDownSampler
73
  kernel_size: 3
74
  stride: 2
75
  padding: 1
76
  fuser:
77
+ _target_: bboxmaskpose.sam2.modeling.memory_encoder.Fuser
78
  layer:
79
+ _target_: bboxmaskpose.sam2.modeling.memory_encoder.CXBlock
80
  dim: 256
81
  kernel_size: 7
82
  padding: 3
{sam2 → bboxmaskpose/sam2}/csrc/connected_components.cu RENAMED
File without changes
{sam2 → bboxmaskpose/sam2}/distinctipy.py RENAMED
@@ -1,3 +1,5 @@
 
 
1
  import math
2
  import random
3
 
@@ -125,9 +127,7 @@ def color_distance(c1, c2):
125
  return distance
126
 
127
 
128
- def distinct_color(
129
- exclude_colors, pastel_factor=0.0, n_attempts=1000, colorblind_type=None, rng=None
130
- ):
131
  """
132
  Generate a colour as distinct as possible from the colours defined in exclude_colors
133
  Inspired by: https://gist.github.com/adewes/5884820
@@ -164,10 +164,7 @@ def distinct_color(
164
  return get_random_color(pastel_factor=pastel_factor, rng=rng)
165
 
166
  if colorblind_type:
167
- exclude_colors = [
168
- colorblind.colorblind_filter(color, colorblind_type)
169
- for color in exclude_colors
170
- ]
171
 
172
  max_distance = None
173
  best_color = None
@@ -181,9 +178,7 @@ def distinct_color(
181
  else:
182
  compare_color = color
183
 
184
- distance_to_nearest = min(
185
- [color_distance(compare_color, c) for c in exclude_colors]
186
- )
187
 
188
  if (not max_distance) or (distance_to_nearest > max_distance):
189
  max_distance = distance_to_nearest
@@ -202,9 +197,7 @@ def distinct_color(
202
  else:
203
  compare_color = color
204
 
205
- distance_to_nearest = min(
206
- [color_distance(compare_color, c) for c in exclude_colors]
207
- )
208
 
209
  if (not max_distance) or (distance_to_nearest > max_distance):
210
  max_distance = distance_to_nearest
@@ -500,4 +493,4 @@ def get_colormap(list_of_colors, name="distinctipy"):
500
 
501
  cmap = matplotlib.colors.ListedColormap(list_of_colors, name=name)
502
 
503
- return cmap
 
1
+ # Adapted from the distinctipy repository (https://github.com/alan-turing-institute/distinctipy).
2
+ # Original authors: distinctipy contributors. Included with minor modifications.
3
  import math
4
  import random
5
 
 
127
  return distance
128
 
129
 
130
+ def distinct_color(exclude_colors, pastel_factor=0.0, n_attempts=1000, colorblind_type=None, rng=None):
 
 
131
  """
132
  Generate a colour as distinct as possible from the colours defined in exclude_colors
133
  Inspired by: https://gist.github.com/adewes/5884820
 
164
  return get_random_color(pastel_factor=pastel_factor, rng=rng)
165
 
166
  if colorblind_type:
167
+ exclude_colors = [colorblind.colorblind_filter(color, colorblind_type) for color in exclude_colors]
 
 
 
168
 
169
  max_distance = None
170
  best_color = None
 
178
  else:
179
  compare_color = color
180
 
181
+ distance_to_nearest = min([color_distance(compare_color, c) for c in exclude_colors])
 
 
182
 
183
  if (not max_distance) or (distance_to_nearest > max_distance):
184
  max_distance = distance_to_nearest
 
197
  else:
198
  compare_color = color
199
 
200
+ distance_to_nearest = min([color_distance(compare_color, c) for c in exclude_colors])
 
 
201
 
202
  if (not max_distance) or (distance_to_nearest > max_distance):
203
  max_distance = distance_to_nearest
 
493
 
494
  cmap = matplotlib.colors.ListedColormap(list_of_colors, name=name)
495
 
496
+ return cmap
{sam2 → bboxmaskpose/sam2}/modeling/__init__.py RENAMED
File without changes
{sam2 → bboxmaskpose/sam2}/modeling/backbones/__init__.py RENAMED
File without changes
{sam2 → bboxmaskpose/sam2}/modeling/backbones/hieradet.py RENAMED
@@ -11,15 +11,10 @@ from typing import List, Tuple, Union
11
  import torch
12
  import torch.nn as nn
13
  import torch.nn.functional as F
14
- from iopath.common.file_io import g_pathmgr
15
-
16
- from sam2.modeling.backbones.utils import (
17
- PatchEmbed,
18
- window_partition,
19
- window_unpartition,
20
- )
21
 
22
- from sam2.modeling.sam2_utils import DropPath, MLP
 
 
23
 
24
 
25
  def do_pool(x: torch.Tensor, pool: nn.Module, norm: nn.Module = None) -> torch.Tensor:
@@ -107,9 +102,7 @@ class MultiScaleBlock(nn.Module):
107
 
108
  self.pool, self.q_stride = None, q_stride
109
  if self.q_stride:
110
- self.pool = nn.MaxPool2d(
111
- kernel_size=q_stride, stride=q_stride, ceil_mode=False
112
- )
113
 
114
  self.attn = MultiScaleAttention(
115
  dim,
@@ -218,16 +211,10 @@ class Hiera(nn.Module):
218
 
219
  # Windowed positional embedding (https://arxiv.org/abs/2311.05613)
220
  self.window_pos_embed_bkg_spatial_size = window_pos_embed_bkg_spatial_size
221
- self.pos_embed = nn.Parameter(
222
- torch.zeros(1, embed_dim, *self.window_pos_embed_bkg_spatial_size)
223
- )
224
- self.pos_embed_window = nn.Parameter(
225
- torch.zeros(1, embed_dim, self.window_spec[0], self.window_spec[0])
226
- )
227
 
228
- dpr = [
229
- x.item() for x in torch.linspace(0, drop_path_rate, depth)
230
- ] # stochastic depth decay rule
231
 
232
  cur_stage = 1
233
  self.blocks = nn.ModuleList()
@@ -259,11 +246,7 @@ class Hiera(nn.Module):
259
  embed_dim = dim_out
260
  self.blocks.append(block)
261
 
262
- self.channel_list = (
263
- [self.blocks[i].dim_out for i in self.stage_ends[::-1]]
264
- if return_interm_layers
265
- else [self.blocks[-1].dim_out]
266
- )
267
 
268
  if weights_path is not None:
269
  with g_pathmgr.open(weights_path, "rb") as f:
@@ -274,9 +257,7 @@ class Hiera(nn.Module):
274
  h, w = hw
275
  window_embed = self.pos_embed_window
276
  pos_embed = F.interpolate(self.pos_embed, size=(h, w), mode="bicubic")
277
- pos_embed = pos_embed + window_embed.tile(
278
- [x // y for x, y in zip(pos_embed.shape, window_embed.shape)]
279
- )
280
  pos_embed = pos_embed.permute(0, 2, 3, 1)
281
  return pos_embed
282
 
@@ -290,9 +271,7 @@ class Hiera(nn.Module):
290
  outputs = []
291
  for i, blk in enumerate(self.blocks):
292
  x = blk(x)
293
- if (i == self.stage_ends[-1]) or (
294
- i in self.stage_ends and self.return_interm_layers
295
- ):
296
  feats = x.permute(0, 3, 1, 2)
297
  outputs.append(feats)
298
 
 
11
  import torch
12
  import torch.nn as nn
13
  import torch.nn.functional as F
 
 
 
 
 
 
 
14
 
15
+ from bboxmaskpose.sam2.modeling.backbones.utils import PatchEmbed, window_partition, window_unpartition
16
+ from bboxmaskpose.sam2.modeling.sam2_utils import MLP, DropPath
17
+ from iopath.common.file_io import g_pathmgr
18
 
19
 
20
  def do_pool(x: torch.Tensor, pool: nn.Module, norm: nn.Module = None) -> torch.Tensor:
 
102
 
103
  self.pool, self.q_stride = None, q_stride
104
  if self.q_stride:
105
+ self.pool = nn.MaxPool2d(kernel_size=q_stride, stride=q_stride, ceil_mode=False)
 
 
106
 
107
  self.attn = MultiScaleAttention(
108
  dim,
 
211
 
212
  # Windowed positional embedding (https://arxiv.org/abs/2311.05613)
213
  self.window_pos_embed_bkg_spatial_size = window_pos_embed_bkg_spatial_size
214
+ self.pos_embed = nn.Parameter(torch.zeros(1, embed_dim, *self.window_pos_embed_bkg_spatial_size))
215
+ self.pos_embed_window = nn.Parameter(torch.zeros(1, embed_dim, self.window_spec[0], self.window_spec[0]))
 
 
 
 
216
 
217
+ dpr = [x.item() for x in torch.linspace(0, drop_path_rate, depth)] # stochastic depth decay rule
 
 
218
 
219
  cur_stage = 1
220
  self.blocks = nn.ModuleList()
 
246
  embed_dim = dim_out
247
  self.blocks.append(block)
248
 
249
+ self.channel_list = [self.blocks[i].dim_out for i in self.stage_ends[::-1]] if return_interm_layers else [self.blocks[-1].dim_out]
 
 
 
 
250
 
251
  if weights_path is not None:
252
  with g_pathmgr.open(weights_path, "rb") as f:
 
257
  h, w = hw
258
  window_embed = self.pos_embed_window
259
  pos_embed = F.interpolate(self.pos_embed, size=(h, w), mode="bicubic")
260
+ pos_embed = pos_embed + window_embed.tile([x // y for x, y in zip(pos_embed.shape, window_embed.shape)])
 
 
261
  pos_embed = pos_embed.permute(0, 2, 3, 1)
262
  return pos_embed
263
 
 
271
  outputs = []
272
  for i, blk in enumerate(self.blocks):
273
  x = blk(x)
274
+ if (i == self.stage_ends[-1]) or (i in self.stage_ends and self.return_interm_layers):
 
 
275
  feats = x.permute(0, 3, 1, 2)
276
  outputs.append(feats)
277
 
{sam2 → bboxmaskpose/sam2}/modeling/backbones/image_encoder.py RENAMED
@@ -117,9 +117,7 @@ class FpnNeck(nn.Module):
117
  prev_features.to(dtype=torch.float32),
118
  scale_factor=2.0,
119
  mode=self.fpn_interp_model,
120
- align_corners=(
121
- None if self.fpn_interp_model == "nearest" else False
122
- ),
123
  antialias=False,
124
  )
125
  prev_features = lateral_features + top_down_features
 
117
  prev_features.to(dtype=torch.float32),
118
  scale_factor=2.0,
119
  mode=self.fpn_interp_model,
120
+ align_corners=(None if self.fpn_interp_model == "nearest" else False),
 
 
121
  antialias=False,
122
  )
123
  prev_features = lateral_features + top_down_features
{sam2 → bboxmaskpose/sam2}/modeling/backbones/utils.py RENAMED
@@ -50,9 +50,7 @@ def window_unpartition(windows, window_size, pad_hw, hw):
50
  Hp, Wp = pad_hw
51
  H, W = hw
52
  B = windows.shape[0] // (Hp * Wp // window_size // window_size)
53
- x = windows.reshape(
54
- B, Hp // window_size, Wp // window_size, window_size, window_size, -1
55
- )
56
  x = x.permute(0, 1, 3, 2, 4, 5).reshape(B, Hp, Wp, -1)
57
 
58
  if Hp > H or Wp > W:
@@ -82,9 +80,7 @@ class PatchEmbed(nn.Module):
82
  embed_dim (int): embed_dim (int): Patch embedding dimension.
83
  """
84
  super().__init__()
85
- self.proj = nn.Conv2d(
86
- in_chans, embed_dim, kernel_size=kernel_size, stride=stride, padding=padding
87
- )
88
 
89
  def forward(self, x: torch.Tensor) -> torch.Tensor:
90
  x = self.proj(x)
 
50
  Hp, Wp = pad_hw
51
  H, W = hw
52
  B = windows.shape[0] // (Hp * Wp // window_size // window_size)
53
+ x = windows.reshape(B, Hp // window_size, Wp // window_size, window_size, window_size, -1)
 
 
54
  x = x.permute(0, 1, 3, 2, 4, 5).reshape(B, Hp, Wp, -1)
55
 
56
  if Hp > H or Wp > W:
 
80
  embed_dim (int): embed_dim (int): Patch embedding dimension.
81
  """
82
  super().__init__()
83
+ self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=kernel_size, stride=stride, padding=padding)
 
 
84
 
85
  def forward(self, x: torch.Tensor) -> torch.Tensor:
86
  x = self.proj(x)
{sam2 → bboxmaskpose/sam2}/modeling/memory_attention.py RENAMED
@@ -7,11 +7,10 @@
7
  from typing import Optional
8
 
9
  import torch
10
- from torch import nn, Tensor
11
 
12
- from sam2.modeling.sam.transformer import RoPEAttention
13
-
14
- from sam2.modeling.sam2_utils import get_activation_fn, get_clones
15
 
16
 
17
  class MemoryAttentionLayer(nn.Module):
@@ -132,9 +131,7 @@ class MemoryAttention(nn.Module):
132
  curr_pos[0],
133
  )
134
 
135
- assert (
136
- curr.shape[1] == memory.shape[1]
137
- ), "Batch size must be the same for curr and memory"
138
 
139
  output = curr
140
  if self.pos_enc_at_input and curr_pos is not None:
 
7
  from typing import Optional
8
 
9
  import torch
10
+ from torch import Tensor, nn
11
 
12
+ from bboxmaskpose.sam2.modeling.sam2_utils import get_activation_fn, get_clones
13
+ from bboxmaskpose.sam2.modeling.sam.transformer import RoPEAttention
 
14
 
15
 
16
  class MemoryAttentionLayer(nn.Module):
 
131
  curr_pos[0],
132
  )
133
 
134
+ assert curr.shape[1] == memory.shape[1], "Batch size must be the same for curr and memory"
 
 
135
 
136
  output = curr
137
  if self.pos_enc_at_input and curr_pos is not None:
{sam2 → bboxmaskpose/sam2}/modeling/memory_encoder.py RENAMED
@@ -11,7 +11,7 @@ import torch
11
  import torch.nn as nn
12
  import torch.nn.functional as F
13
 
14
- from sam2.modeling.sam2_utils import DropPath, get_clones, LayerNorm2d
15
 
16
 
17
  class MaskDownSampler(nn.Module):
@@ -89,16 +89,10 @@ class CXBlock(nn.Module):
89
  groups=dim if use_dwconv else 1,
90
  ) # depthwise conv
91
  self.norm = LayerNorm2d(dim, eps=1e-6)
92
- self.pwconv1 = nn.Linear(
93
- dim, 4 * dim
94
- ) # pointwise/1x1 convs, implemented with linear layers
95
  self.act = nn.GELU()
96
  self.pwconv2 = nn.Linear(4 * dim, dim)
97
- self.gamma = (
98
- nn.Parameter(layer_scale_init_value * torch.ones((dim)), requires_grad=True)
99
- if layer_scale_init_value > 0
100
- else None
101
- )
102
  self.drop_path = DropPath(drop_path) if drop_path > 0.0 else nn.Identity()
103
 
104
  def forward(self, x):
 
11
  import torch.nn as nn
12
  import torch.nn.functional as F
13
 
14
+ from bboxmaskpose.sam2.modeling.sam2_utils import DropPath, LayerNorm2d, get_clones
15
 
16
 
17
  class MaskDownSampler(nn.Module):
 
89
  groups=dim if use_dwconv else 1,
90
  ) # depthwise conv
91
  self.norm = LayerNorm2d(dim, eps=1e-6)
92
+ self.pwconv1 = nn.Linear(dim, 4 * dim) # pointwise/1x1 convs, implemented with linear layers
 
 
93
  self.act = nn.GELU()
94
  self.pwconv2 = nn.Linear(4 * dim, dim)
95
+ self.gamma = nn.Parameter(layer_scale_init_value * torch.ones((dim)), requires_grad=True) if layer_scale_init_value > 0 else None
 
 
 
 
96
  self.drop_path = DropPath(drop_path) if drop_path > 0.0 else nn.Identity()
97
 
98
  def forward(self, x):
{sam2 → bboxmaskpose/sam2}/modeling/position_encoding.py RENAMED
@@ -8,7 +8,6 @@ import math
8
  from typing import Any, Optional, Tuple
9
 
10
  import numpy as np
11
-
12
  import torch
13
  from torch import nn
14
 
@@ -61,12 +60,8 @@ class PositionEmbeddingSine(nn.Module):
61
 
62
  pos_x = x_embed[:, None] / dim_t
63
  pos_y = y_embed[:, None] / dim_t
64
- pos_x = torch.stack(
65
- (pos_x[:, 0::2].sin(), pos_x[:, 1::2].cos()), dim=2
66
- ).flatten(1)
67
- pos_y = torch.stack(
68
- (pos_y[:, 0::2].sin(), pos_y[:, 1::2].cos()), dim=2
69
- ).flatten(1)
70
  return pos_x, pos_y
71
 
72
  @torch.no_grad()
@@ -92,16 +87,8 @@ class PositionEmbeddingSine(nn.Module):
92
  if cache_key in self.cache:
93
  return self.cache[cache_key].to(device)[None].repeat(B, 1, 1, 1)
94
 
95
- y_embed = (
96
- torch.arange(1, H + 1, dtype=torch.float32, device=device)
97
- .view(1, -1, 1)
98
- .repeat(B, 1, W)
99
- )
100
- x_embed = (
101
- torch.arange(1, W + 1, dtype=torch.float32, device=device)
102
- .view(1, 1, -1)
103
- .repeat(B, H, 1)
104
- )
105
 
106
  if self.normalize:
107
  eps = 1e-6
@@ -113,12 +100,8 @@ class PositionEmbeddingSine(nn.Module):
113
 
114
  pos_x = x_embed[:, :, :, None] / dim_t
115
  pos_y = y_embed[:, :, :, None] / dim_t
116
- pos_x = torch.stack(
117
- (pos_x[:, :, :, 0::2].sin(), pos_x[:, :, :, 1::2].cos()), dim=4
118
- ).flatten(3)
119
- pos_y = torch.stack(
120
- (pos_y[:, :, :, 0::2].sin(), pos_y[:, :, :, 1::2].cos()), dim=4
121
- ).flatten(3)
122
  pos = torch.cat((pos_y, pos_x), dim=3).permute(0, 3, 1, 2)
123
  self.cache[cache_key] = pos[0]
124
  return pos
@@ -166,9 +149,7 @@ class PositionEmbeddingRandom(nn.Module):
166
  pe = self._pe_encoding(torch.stack([x_embed, y_embed], dim=-1))
167
  return pe.permute(2, 0, 1) # C x H x W
168
 
169
- def forward_with_coords(
170
- self, coords_input: torch.Tensor, image_size: Tuple[int, int]
171
- ) -> torch.Tensor:
172
  """Positionally encode points that are not normalized to [0,1]."""
173
  coords = coords_input.clone()
174
  coords[:, :, 0] = coords[:, :, 0] / image_size[1]
@@ -216,11 +197,7 @@ def apply_rotary_enc(
216
  repeat_freqs_k: bool = False,
217
  ):
218
  xq_ = torch.view_as_complex(xq.float().reshape(*xq.shape[:-1], -1, 2))
219
- xk_ = (
220
- torch.view_as_complex(xk.float().reshape(*xk.shape[:-1], -1, 2))
221
- if xk.shape[-2] != 0
222
- else None
223
- )
224
  freqs_cis = reshape_for_broadcast(freqs_cis, xq_)
225
  xq_out = torch.view_as_real(xq_ * freqs_cis).flatten(3)
226
  if xk_ is None:
 
8
  from typing import Any, Optional, Tuple
9
 
10
  import numpy as np
 
11
  import torch
12
  from torch import nn
13
 
 
60
 
61
  pos_x = x_embed[:, None] / dim_t
62
  pos_y = y_embed[:, None] / dim_t
63
+ pos_x = torch.stack((pos_x[:, 0::2].sin(), pos_x[:, 1::2].cos()), dim=2).flatten(1)
64
+ pos_y = torch.stack((pos_y[:, 0::2].sin(), pos_y[:, 1::2].cos()), dim=2).flatten(1)
 
 
 
 
65
  return pos_x, pos_y
66
 
67
  @torch.no_grad()
 
87
  if cache_key in self.cache:
88
  return self.cache[cache_key].to(device)[None].repeat(B, 1, 1, 1)
89
 
90
+ y_embed = torch.arange(1, H + 1, dtype=torch.float32, device=device).view(1, -1, 1).repeat(B, 1, W)
91
+ x_embed = torch.arange(1, W + 1, dtype=torch.float32, device=device).view(1, 1, -1).repeat(B, H, 1)
 
 
 
 
 
 
 
 
92
 
93
  if self.normalize:
94
  eps = 1e-6
 
100
 
101
  pos_x = x_embed[:, :, :, None] / dim_t
102
  pos_y = y_embed[:, :, :, None] / dim_t
103
+ pos_x = torch.stack((pos_x[:, :, :, 0::2].sin(), pos_x[:, :, :, 1::2].cos()), dim=4).flatten(3)
104
+ pos_y = torch.stack((pos_y[:, :, :, 0::2].sin(), pos_y[:, :, :, 1::2].cos()), dim=4).flatten(3)
 
 
 
 
105
  pos = torch.cat((pos_y, pos_x), dim=3).permute(0, 3, 1, 2)
106
  self.cache[cache_key] = pos[0]
107
  return pos
 
149
  pe = self._pe_encoding(torch.stack([x_embed, y_embed], dim=-1))
150
  return pe.permute(2, 0, 1) # C x H x W
151
 
152
+ def forward_with_coords(self, coords_input: torch.Tensor, image_size: Tuple[int, int]) -> torch.Tensor:
 
 
153
  """Positionally encode points that are not normalized to [0,1]."""
154
  coords = coords_input.clone()
155
  coords[:, :, 0] = coords[:, :, 0] / image_size[1]
 
197
  repeat_freqs_k: bool = False,
198
  ):
199
  xq_ = torch.view_as_complex(xq.float().reshape(*xq.shape[:-1], -1, 2))
200
+ xk_ = torch.view_as_complex(xk.float().reshape(*xk.shape[:-1], -1, 2)) if xk.shape[-2] != 0 else None
 
 
 
 
201
  freqs_cis = reshape_for_broadcast(freqs_cis, xq_)
202
  xq_out = torch.view_as_real(xq_ * freqs_cis).flatten(3)
203
  if xk_ is None:
{sam2 → bboxmaskpose/sam2}/modeling/sam/__init__.py RENAMED
File without changes
{sam2 → bboxmaskpose/sam2}/modeling/sam/mask_decoder.py RENAMED
@@ -9,7 +9,7 @@ from typing import List, Optional, Tuple, Type
9
  import torch
10
  from torch import nn
11
 
12
- from sam2.modeling.sam2_utils import LayerNorm2d, MLP
13
 
14
 
15
  class MaskDecoder(nn.Module):
@@ -63,30 +63,19 @@ class MaskDecoder(nn.Module):
63
  self.use_multimask_token_for_obj_ptr = use_multimask_token_for_obj_ptr
64
 
65
  self.output_upscaling = nn.Sequential(
66
- nn.ConvTranspose2d(
67
- transformer_dim, transformer_dim // 4, kernel_size=2, stride=2
68
- ),
69
  LayerNorm2d(transformer_dim // 4),
70
  activation(),
71
- nn.ConvTranspose2d(
72
- transformer_dim // 4, transformer_dim // 8, kernel_size=2, stride=2
73
- ),
74
  activation(),
75
  )
76
  self.use_high_res_features = use_high_res_features
77
  if use_high_res_features:
78
- self.conv_s0 = nn.Conv2d(
79
- transformer_dim, transformer_dim // 8, kernel_size=1, stride=1
80
- )
81
- self.conv_s1 = nn.Conv2d(
82
- transformer_dim, transformer_dim // 4, kernel_size=1, stride=1
83
- )
84
 
85
  self.output_hypernetworks_mlps = nn.ModuleList(
86
- [
87
- MLP(transformer_dim, transformer_dim, transformer_dim // 8, 3)
88
- for i in range(self.num_mask_tokens)
89
- ]
90
  )
91
 
92
  self.iou_prediction_head = MLP(
@@ -188,12 +177,8 @@ class MaskDecoder(nn.Module):
188
  )
189
  s = 1
190
  else:
191
- output_tokens = torch.cat(
192
- [self.iou_token.weight, self.mask_tokens.weight], dim=0
193
- )
194
- output_tokens = output_tokens.unsqueeze(0).expand(
195
- sparse_prompt_embeddings.size(0), -1, -1
196
- )
197
  tokens = torch.cat((output_tokens, sparse_prompt_embeddings), dim=1)
198
 
199
  # Expand per-image data in batch direction to be per-mask
@@ -203,9 +188,7 @@ class MaskDecoder(nn.Module):
203
  assert image_embeddings.shape[0] == tokens.shape[0]
204
  src = image_embeddings
205
  src = src + dense_prompt_embeddings
206
- assert (
207
- image_pe.size(0) == 1
208
- ), "image_pe should have size 1 in batch dim (from `get_dense_pe()`)"
209
  pos_src = torch.repeat_interleave(image_pe, tokens.shape[0], dim=0)
210
  b, c, h, w = src.shape
211
 
@@ -226,9 +209,7 @@ class MaskDecoder(nn.Module):
226
 
227
  hyper_in_list: List[torch.Tensor] = []
228
  for i in range(self.num_mask_tokens):
229
- hyper_in_list.append(
230
- self.output_hypernetworks_mlps[i](mask_tokens_out[:, i, :])
231
- )
232
  hyper_in = torch.stack(hyper_in_list, dim=1)
233
  b, c, h, w = upscaled_embedding.shape
234
  masks = (hyper_in @ upscaled_embedding.view(b, c, h * w)).view(b, -1, h, w)
@@ -267,9 +248,7 @@ class MaskDecoder(nn.Module):
267
  multimask_logits = all_mask_logits[:, 1:, :, :]
268
  multimask_iou_scores = all_iou_scores[:, 1:]
269
  best_scores_inds = torch.argmax(multimask_iou_scores, dim=-1)
270
- batch_inds = torch.arange(
271
- multimask_iou_scores.size(0), device=all_iou_scores.device
272
- )
273
  best_multimask_logits = multimask_logits[batch_inds, best_scores_inds]
274
  best_multimask_logits = best_multimask_logits.unsqueeze(1)
275
  best_multimask_iou_scores = multimask_iou_scores[batch_inds, best_scores_inds]
 
9
  import torch
10
  from torch import nn
11
 
12
+ from bboxmaskpose.sam2.modeling.sam2_utils import MLP, LayerNorm2d
13
 
14
 
15
  class MaskDecoder(nn.Module):
 
63
  self.use_multimask_token_for_obj_ptr = use_multimask_token_for_obj_ptr
64
 
65
  self.output_upscaling = nn.Sequential(
66
+ nn.ConvTranspose2d(transformer_dim, transformer_dim // 4, kernel_size=2, stride=2),
 
 
67
  LayerNorm2d(transformer_dim // 4),
68
  activation(),
69
+ nn.ConvTranspose2d(transformer_dim // 4, transformer_dim // 8, kernel_size=2, stride=2),
 
 
70
  activation(),
71
  )
72
  self.use_high_res_features = use_high_res_features
73
  if use_high_res_features:
74
+ self.conv_s0 = nn.Conv2d(transformer_dim, transformer_dim // 8, kernel_size=1, stride=1)
75
+ self.conv_s1 = nn.Conv2d(transformer_dim, transformer_dim // 4, kernel_size=1, stride=1)
 
 
 
 
76
 
77
  self.output_hypernetworks_mlps = nn.ModuleList(
78
+ [MLP(transformer_dim, transformer_dim, transformer_dim // 8, 3) for i in range(self.num_mask_tokens)]
 
 
 
79
  )
80
 
81
  self.iou_prediction_head = MLP(
 
177
  )
178
  s = 1
179
  else:
180
+ output_tokens = torch.cat([self.iou_token.weight, self.mask_tokens.weight], dim=0)
181
+ output_tokens = output_tokens.unsqueeze(0).expand(sparse_prompt_embeddings.size(0), -1, -1)
 
 
 
 
182
  tokens = torch.cat((output_tokens, sparse_prompt_embeddings), dim=1)
183
 
184
  # Expand per-image data in batch direction to be per-mask
 
188
  assert image_embeddings.shape[0] == tokens.shape[0]
189
  src = image_embeddings
190
  src = src + dense_prompt_embeddings
191
+ assert image_pe.size(0) == 1, "image_pe should have size 1 in batch dim (from `get_dense_pe()`)"
 
 
192
  pos_src = torch.repeat_interleave(image_pe, tokens.shape[0], dim=0)
193
  b, c, h, w = src.shape
194
 
 
209
 
210
  hyper_in_list: List[torch.Tensor] = []
211
  for i in range(self.num_mask_tokens):
212
+ hyper_in_list.append(self.output_hypernetworks_mlps[i](mask_tokens_out[:, i, :]))
 
 
213
  hyper_in = torch.stack(hyper_in_list, dim=1)
214
  b, c, h, w = upscaled_embedding.shape
215
  masks = (hyper_in @ upscaled_embedding.view(b, c, h * w)).view(b, -1, h, w)
 
248
  multimask_logits = all_mask_logits[:, 1:, :, :]
249
  multimask_iou_scores = all_iou_scores[:, 1:]
250
  best_scores_inds = torch.argmax(multimask_iou_scores, dim=-1)
251
+ batch_inds = torch.arange(multimask_iou_scores.size(0), device=all_iou_scores.device)
 
 
252
  best_multimask_logits = multimask_logits[batch_inds, best_scores_inds]
253
  best_multimask_logits = best_multimask_logits.unsqueeze(1)
254
  best_multimask_iou_scores = multimask_iou_scores[batch_inds, best_scores_inds]
{sam2 → bboxmaskpose/sam2}/modeling/sam/pose_encoder.py RENAMED
@@ -9,9 +9,8 @@ from typing import Optional, Tuple, Type
9
  import torch
10
  from torch import nn
11
 
12
- from sam2.modeling.position_encoding import PositionEmbeddingRandom
13
-
14
- from sam2.modeling.sam2_utils import LayerNorm2d
15
 
16
 
17
  class PoseEncoder(nn.Module):
@@ -44,9 +43,7 @@ class PoseEncoder(nn.Module):
44
  self.pe_layer = PositionEmbeddingRandom(embed_dim // 2)
45
 
46
  self.num_point_embeddings: int = 17 # 17 COCO keypoints
47
- point_embeddings = [
48
- nn.Embedding(1, embed_dim) for i in range(self.num_point_embeddings)
49
- ]
50
  self.point_embeddings = nn.ModuleList(point_embeddings)
51
  self.not_a_point_embed = nn.Embedding(1, embed_dim)
52
 
@@ -89,17 +86,12 @@ class PoseEncoder(nn.Module):
89
  padding_label = -torch.ones((labels.shape[0], 1), device=labels.device)
90
  points = torch.cat([points, padding_point], dim=1)
91
  labels = torch.cat([labels, padding_label], dim=1)
92
- point_embedding = self.pe_layer.forward_with_coords(
93
- points, self.input_image_size
94
- )
95
 
96
  kpt_embeddings = torch.cat([self.point_embeddings[i].weight for i in range(self.num_point_embeddings)], dim=0)
97
  negative_embedding = torch.zeros_like(point_embedding) + self.not_a_point_embed.weight
98
  positive_embedding = point_embedding + kpt_embeddings
99
- weighted_embedding = (
100
- positive_embedding * labels.unsqueeze(-1).float() +
101
- negative_embedding * (1 - labels.unsqueeze(-1).float())
102
- )
103
 
104
  point_embedding = torch.where(
105
  (labels == 0).unsqueeze(-1),
@@ -112,9 +104,7 @@ class PoseEncoder(nn.Module):
112
  """Embeds box prompts."""
113
  boxes = boxes + 0.5 # Shift to center of pixel
114
  coords = boxes.reshape(-1, 2, 2)
115
- corner_embedding = self.pe_layer.forward_with_coords(
116
- coords, self.input_image_size
117
- )
118
  corner_embedding[:, 0, :] += self.point_embeddings[2].weight
119
  corner_embedding[:, 1, :] += self.point_embeddings[3].weight
120
  return corner_embedding
@@ -170,9 +160,7 @@ class PoseEncoder(nn.Module):
170
  Bx(embed_dim)x(embed_H)x(embed_W)
171
  """
172
  bs = self._get_batch_size(points, boxes, masks)
173
- sparse_embeddings = torch.empty(
174
- (bs, 0, self.embed_dim), device=self._get_device()
175
- )
176
  if points is not None:
177
  coords, labels = points
178
  point_embeddings = self._embed_points(coords, labels, pad=(boxes is None))
 
9
  import torch
10
  from torch import nn
11
 
12
+ from bboxmaskpose.sam2.modeling.position_encoding import PositionEmbeddingRandom
13
+ from bboxmaskpose.sam2.modeling.sam2_utils import LayerNorm2d
 
14
 
15
 
16
  class PoseEncoder(nn.Module):
 
43
  self.pe_layer = PositionEmbeddingRandom(embed_dim // 2)
44
 
45
  self.num_point_embeddings: int = 17 # 17 COCO keypoints
46
+ point_embeddings = [nn.Embedding(1, embed_dim) for i in range(self.num_point_embeddings)]
 
 
47
  self.point_embeddings = nn.ModuleList(point_embeddings)
48
  self.not_a_point_embed = nn.Embedding(1, embed_dim)
49
 
 
86
  padding_label = -torch.ones((labels.shape[0], 1), device=labels.device)
87
  points = torch.cat([points, padding_point], dim=1)
88
  labels = torch.cat([labels, padding_label], dim=1)
89
+ point_embedding = self.pe_layer.forward_with_coords(points, self.input_image_size)
 
 
90
 
91
  kpt_embeddings = torch.cat([self.point_embeddings[i].weight for i in range(self.num_point_embeddings)], dim=0)
92
  negative_embedding = torch.zeros_like(point_embedding) + self.not_a_point_embed.weight
93
  positive_embedding = point_embedding + kpt_embeddings
94
+ weighted_embedding = positive_embedding * labels.unsqueeze(-1).float() + negative_embedding * (1 - labels.unsqueeze(-1).float())
 
 
 
95
 
96
  point_embedding = torch.where(
97
  (labels == 0).unsqueeze(-1),
 
104
  """Embeds box prompts."""
105
  boxes = boxes + 0.5 # Shift to center of pixel
106
  coords = boxes.reshape(-1, 2, 2)
107
+ corner_embedding = self.pe_layer.forward_with_coords(coords, self.input_image_size)
 
 
108
  corner_embedding[:, 0, :] += self.point_embeddings[2].weight
109
  corner_embedding[:, 1, :] += self.point_embeddings[3].weight
110
  return corner_embedding
 
160
  Bx(embed_dim)x(embed_H)x(embed_W)
161
  """
162
  bs = self._get_batch_size(points, boxes, masks)
163
+ sparse_embeddings = torch.empty((bs, 0, self.embed_dim), device=self._get_device())
 
 
164
  if points is not None:
165
  coords, labels = points
166
  point_embeddings = self._embed_points(coords, labels, pad=(boxes is None))
{sam2 → bboxmaskpose/sam2}/modeling/sam/prompt_encoder.py RENAMED
@@ -6,12 +6,12 @@
6
 
7
  from typing import Optional, Tuple, Type
8
 
 
9
  import torch
10
  from torch import nn
11
 
12
- from sam2.modeling.position_encoding import PositionEmbeddingRandom
13
-
14
- from sam2.modeling.sam2_utils import LayerNorm2d
15
 
16
 
17
  class PromptEncoder(nn.Module):
@@ -22,6 +22,7 @@ class PromptEncoder(nn.Module):
22
  input_image_size: Tuple[int, int],
23
  mask_in_chans: int,
24
  activation: Type[nn.Module] = nn.GELU,
 
25
  ) -> None:
26
  """
27
  Encodes prompts for input to SAM's mask decoder.
@@ -44,9 +45,7 @@ class PromptEncoder(nn.Module):
44
  self.pe_layer = PositionEmbeddingRandom(embed_dim // 2)
45
 
46
  self.num_point_embeddings: int = 4 # pos/neg point + 2 box corners
47
- point_embeddings = [
48
- nn.Embedding(1, embed_dim) for i in range(self.num_point_embeddings)
49
- ]
50
  self.point_embeddings = nn.ModuleList(point_embeddings)
51
  self.not_a_point_embed = nn.Embedding(1, embed_dim)
52
 
@@ -63,6 +62,7 @@ class PromptEncoder(nn.Module):
63
  activation(),
64
  nn.Conv2d(mask_in_chans, embed_dim, kernel_size=1),
65
  )
 
66
  self.no_mask_embed = nn.Embedding(1, embed_dim)
67
 
68
  def get_dense_pe(self) -> torch.Tensor:
@@ -76,45 +76,41 @@ class PromptEncoder(nn.Module):
76
  """
77
  return self.pe_layer(self.image_embedding_size).unsqueeze(0)
78
 
79
- def _embed_points(
80
- self,
81
- points: torch.Tensor,
82
- labels: torch.Tensor,
83
- pad: bool,
84
  ) -> torch.Tensor:
85
  """Embeds point prompts."""
 
86
  points = points + 0.5 # Shift to center of pixel
87
  if pad:
88
  padding_point = torch.zeros((points.shape[0], 1, 2), device=points.device)
89
  padding_label = -torch.ones((labels.shape[0], 1), device=labels.device)
90
  points = torch.cat([points, padding_point], dim=1)
91
  labels = torch.cat([labels, padding_label], dim=1)
92
- point_embedding = self.pe_layer.forward_with_coords(
93
- points, self.input_image_size
94
- )
95
 
 
96
  point_embedding = torch.where(
97
  (labels == -1).unsqueeze(-1),
98
  torch.zeros_like(point_embedding) + self.not_a_point_embed.weight,
99
  point_embedding,
100
  )
101
- point_embedding = torch.where(
102
  (labels == 0).unsqueeze(-1),
103
  point_embedding + self.point_embeddings[0].weight,
104
  point_embedding,
105
  )
106
  point_embedding = torch.where(
107
- (labels == 1).unsqueeze(-1),
108
  point_embedding + self.point_embeddings[1].weight,
109
  point_embedding,
110
  )
111
  point_embedding = torch.where(
112
- (labels == 2).unsqueeze(-1),
113
  point_embedding + self.point_embeddings[2].weight,
114
  point_embedding,
115
  )
116
  point_embedding = torch.where(
117
- (labels == 3).unsqueeze(-1),
118
  point_embedding + self.point_embeddings[3].weight,
119
  point_embedding,
120
  )
@@ -124,9 +120,7 @@ class PromptEncoder(nn.Module):
124
  """Embeds box prompts."""
125
  boxes = boxes + 0.5 # Shift to center of pixel
126
  coords = boxes.reshape(-1, 2, 2)
127
- corner_embedding = self.pe_layer.forward_with_coords(
128
- coords, self.input_image_size
129
- )
130
  corner_embedding[:, 0, :] += self.point_embeddings[2].weight
131
  corner_embedding[:, 1, :] += self.point_embeddings[3].weight
132
  return corner_embedding
@@ -160,9 +154,9 @@ class PromptEncoder(nn.Module):
160
  def forward(
161
  self,
162
  points: Optional[Tuple[torch.Tensor, torch.Tensor]],
163
- # skeletons: Optional[Tuple[torch.Tensor, torch.Tensor]],
164
  boxes: Optional[torch.Tensor],
165
  masks: Optional[torch.Tensor],
 
166
  ) -> Tuple[torch.Tensor, torch.Tensor]:
167
  """
168
  Embeds different types of prompts, returning both sparse and dense
@@ -182,12 +176,13 @@ class PromptEncoder(nn.Module):
182
  Bx(embed_dim)x(embed_H)x(embed_W)
183
  """
184
  bs = self._get_batch_size(points, boxes, masks)
185
- sparse_embeddings = torch.empty(
186
- (bs, 0, self.embed_dim), device=self._get_device()
187
- )
188
  if points is not None:
189
  coords, labels = points
190
- point_embeddings = self._embed_points(coords, labels, pad=(boxes is None))
 
 
 
191
  sparse_embeddings = torch.cat([sparse_embeddings, point_embeddings], dim=1)
192
  if boxes is not None:
193
  box_embeddings = self._embed_boxes(boxes)
 
6
 
7
  from typing import Optional, Tuple, Type
8
 
9
+ import numpy as np
10
  import torch
11
  from torch import nn
12
 
13
+ from bboxmaskpose.sam2.modeling.position_encoding import PositionEmbeddingRandom
14
+ from bboxmaskpose.sam2.modeling.sam2_utils import LayerNorm2d
 
15
 
16
 
17
  class PromptEncoder(nn.Module):
 
22
  input_image_size: Tuple[int, int],
23
  mask_in_chans: int,
24
  activation: Type[nn.Module] = nn.GELU,
25
+ n_kpts_encoder: int = -1,
26
  ) -> None:
27
  """
28
  Encodes prompts for input to SAM's mask decoder.
 
45
  self.pe_layer = PositionEmbeddingRandom(embed_dim // 2)
46
 
47
  self.num_point_embeddings: int = 4 # pos/neg point + 2 box corners
48
+ point_embeddings = [nn.Embedding(1, embed_dim) for i in range(self.num_point_embeddings)]
 
 
49
  self.point_embeddings = nn.ModuleList(point_embeddings)
50
  self.not_a_point_embed = nn.Embedding(1, embed_dim)
51
 
 
62
  activation(),
63
  nn.Conv2d(mask_in_chans, embed_dim, kernel_size=1),
64
  )
65
+ self.n_kpts_encoder = n_kpts_encoder
66
  self.no_mask_embed = nn.Embedding(1, embed_dim)
67
 
68
  def get_dense_pe(self) -> torch.Tensor:
 
76
  """
77
  return self.pe_layer(self.image_embedding_size).unsqueeze(0)
78
 
79
+ def _embed_points( ## embeds the points into a high-dimensional space (e.g., 256-dim) using learned embeddings
80
+ self, points: torch.Tensor, labels: torch.Tensor, pad: bool, normalize: bool
 
 
 
81
  ) -> torch.Tensor:
82
  """Embeds point prompts."""
83
+ # print("EMBED points ", points) # KPTS OUTPUT
84
  points = points + 0.5 # Shift to center of pixel
85
  if pad:
86
  padding_point = torch.zeros((points.shape[0], 1, 2), device=points.device)
87
  padding_label = -torch.ones((labels.shape[0], 1), device=labels.device)
88
  points = torch.cat([points, padding_point], dim=1)
89
  labels = torch.cat([labels, padding_label], dim=1)
 
 
 
90
 
91
+ point_embedding = self.pe_layer.forward_with_coords(points, self.input_image_size)
92
  point_embedding = torch.where(
93
  (labels == -1).unsqueeze(-1),
94
  torch.zeros_like(point_embedding) + self.not_a_point_embed.weight,
95
  point_embedding,
96
  )
97
+ point_embedding = torch.where( ## negative pts
98
  (labels == 0).unsqueeze(-1),
99
  point_embedding + self.point_embeddings[0].weight,
100
  point_embedding,
101
  )
102
  point_embedding = torch.where(
103
+ (labels == 1).unsqueeze(-1), ## positive pts
104
  point_embedding + self.point_embeddings[1].weight,
105
  point_embedding,
106
  )
107
  point_embedding = torch.where(
108
+ (labels == 2).unsqueeze(-1), ## bbox top left
109
  point_embedding + self.point_embeddings[2].weight,
110
  point_embedding,
111
  )
112
  point_embedding = torch.where(
113
+ (labels == 3).unsqueeze(-1), ## bbox bottom right
114
  point_embedding + self.point_embeddings[3].weight,
115
  point_embedding,
116
  )
 
120
  """Embeds box prompts."""
121
  boxes = boxes + 0.5 # Shift to center of pixel
122
  coords = boxes.reshape(-1, 2, 2)
123
+ corner_embedding = self.pe_layer.forward_with_coords(coords, self.input_image_size)
 
 
124
  corner_embedding[:, 0, :] += self.point_embeddings[2].weight
125
  corner_embedding[:, 1, :] += self.point_embeddings[3].weight
126
  return corner_embedding
 
154
  def forward(
155
  self,
156
  points: Optional[Tuple[torch.Tensor, torch.Tensor]],
 
157
  boxes: Optional[torch.Tensor],
158
  masks: Optional[torch.Tensor],
159
+ normalize: bool = True,
160
  ) -> Tuple[torch.Tensor, torch.Tensor]:
161
  """
162
  Embeds different types of prompts, returning both sparse and dense
 
176
  Bx(embed_dim)x(embed_H)x(embed_W)
177
  """
178
  bs = self._get_batch_size(points, boxes, masks)
179
+ sparse_embeddings = torch.empty((bs, 0, self.embed_dim), device=self._get_device())
 
 
180
  if points is not None:
181
  coords, labels = points
182
+ coords = coords.to(self._get_device())
183
+ labels = labels.to(self._get_device())
184
+ point_embeddings = self._embed_points(coords, labels, pad=(boxes is None), normalize=normalize)
185
+ # point_embeddings = self._embed_points(coords, labels, pad=(boxes is None))
186
  sparse_embeddings = torch.cat([sparse_embeddings, point_embeddings], dim=1)
187
  if boxes is not None:
188
  box_embeddings = self._embed_boxes(boxes)
{sam2 → bboxmaskpose/sam2}/modeling/sam/transformer.py RENAMED
@@ -10,10 +10,10 @@ from typing import Tuple, Type
10
 
11
  import torch
12
  import torch.nn.functional as F
13
- from torch import nn, Tensor
14
 
15
- from sam2.modeling.position_encoding import apply_rotary_enc, compute_axial_cis
16
- from sam2.modeling.sam2_utils import MLP
17
 
18
 
19
  class TwoWayTransformer(nn.Module):
@@ -57,9 +57,7 @@ class TwoWayTransformer(nn.Module):
57
  )
58
  )
59
 
60
- self.final_attn_token_to_image = Attention(
61
- embedding_dim, num_heads, downsample_rate=attention_downsample_rate
62
- )
63
  self.norm_final_attn = nn.LayerNorm(embedding_dim)
64
 
65
  def forward(
@@ -136,26 +134,18 @@ class TwoWayAttentionBlock(nn.Module):
136
  self.self_attn = Attention(embedding_dim, num_heads)
137
  self.norm1 = nn.LayerNorm(embedding_dim)
138
 
139
- self.cross_attn_token_to_image = Attention(
140
- embedding_dim, num_heads, downsample_rate=attention_downsample_rate
141
- )
142
  self.norm2 = nn.LayerNorm(embedding_dim)
143
 
144
- self.mlp = MLP(
145
- embedding_dim, mlp_dim, embedding_dim, num_layers=2, activation=activation
146
- )
147
  self.norm3 = nn.LayerNorm(embedding_dim)
148
 
149
  self.norm4 = nn.LayerNorm(embedding_dim)
150
- self.cross_attn_image_to_token = Attention(
151
- embedding_dim, num_heads, downsample_rate=attention_downsample_rate
152
- )
153
 
154
  self.skip_first_layer_pe = skip_first_layer_pe
155
 
156
- def forward(
157
- self, queries: Tensor, keys: Tensor, query_pe: Tensor, key_pe: Tensor
158
- ) -> Tuple[Tensor, Tensor]:
159
  # Self attention block
160
  if self.skip_first_layer_pe:
161
  queries = self.self_attn(q=queries, k=queries, v=queries)
@@ -206,9 +196,7 @@ class Attention(nn.Module):
206
  self.kv_in_dim = kv_in_dim if kv_in_dim is not None else embedding_dim
207
  self.internal_dim = embedding_dim // downsample_rate
208
  self.num_heads = num_heads
209
- assert (
210
- self.internal_dim % num_heads == 0
211
- ), "num_heads must divide embedding_dim."
212
 
213
  self.q_proj = nn.Linear(embedding_dim, self.internal_dim)
214
  self.k_proj = nn.Linear(self.kv_in_dim, self.internal_dim)
@@ -263,18 +251,12 @@ class RoPEAttention(Attention):
263
  ):
264
  super().__init__(*args, **kwargs)
265
 
266
- self.compute_cis = partial(
267
- compute_axial_cis, dim=self.internal_dim // self.num_heads, theta=rope_theta
268
- )
269
  freqs_cis = self.compute_cis(end_x=feat_sizes[0], end_y=feat_sizes[1])
270
- self.freqs_cis = (
271
- freqs_cis.to("cuda") if torch.cuda.is_available() else freqs_cis
272
- )
273
  self.rope_k_repeat = rope_k_repeat
274
 
275
- def forward(
276
- self, q: Tensor, k: Tensor, v: Tensor, num_k_exclude_rope: int = 0
277
- ) -> Tensor:
278
  # Input projections
279
  q = self.q_proj(q)
280
  k = self.k_proj(k)
 
10
 
11
  import torch
12
  import torch.nn.functional as F
13
+ from torch import Tensor, nn
14
 
15
+ from bboxmaskpose.sam2.modeling.position_encoding import apply_rotary_enc, compute_axial_cis
16
+ from bboxmaskpose.sam2.modeling.sam2_utils import MLP
17
 
18
 
19
  class TwoWayTransformer(nn.Module):
 
57
  )
58
  )
59
 
60
+ self.final_attn_token_to_image = Attention(embedding_dim, num_heads, downsample_rate=attention_downsample_rate)
 
 
61
  self.norm_final_attn = nn.LayerNorm(embedding_dim)
62
 
63
  def forward(
 
134
  self.self_attn = Attention(embedding_dim, num_heads)
135
  self.norm1 = nn.LayerNorm(embedding_dim)
136
 
137
+ self.cross_attn_token_to_image = Attention(embedding_dim, num_heads, downsample_rate=attention_downsample_rate)
 
 
138
  self.norm2 = nn.LayerNorm(embedding_dim)
139
 
140
+ self.mlp = MLP(embedding_dim, mlp_dim, embedding_dim, num_layers=2, activation=activation)
 
 
141
  self.norm3 = nn.LayerNorm(embedding_dim)
142
 
143
  self.norm4 = nn.LayerNorm(embedding_dim)
144
+ self.cross_attn_image_to_token = Attention(embedding_dim, num_heads, downsample_rate=attention_downsample_rate)
 
 
145
 
146
  self.skip_first_layer_pe = skip_first_layer_pe
147
 
148
+ def forward(self, queries: Tensor, keys: Tensor, query_pe: Tensor, key_pe: Tensor) -> Tuple[Tensor, Tensor]:
 
 
149
  # Self attention block
150
  if self.skip_first_layer_pe:
151
  queries = self.self_attn(q=queries, k=queries, v=queries)
 
196
  self.kv_in_dim = kv_in_dim if kv_in_dim is not None else embedding_dim
197
  self.internal_dim = embedding_dim // downsample_rate
198
  self.num_heads = num_heads
199
+ assert self.internal_dim % num_heads == 0, "num_heads must divide embedding_dim."
 
 
200
 
201
  self.q_proj = nn.Linear(embedding_dim, self.internal_dim)
202
  self.k_proj = nn.Linear(self.kv_in_dim, self.internal_dim)
 
251
  ):
252
  super().__init__(*args, **kwargs)
253
 
254
+ self.compute_cis = partial(compute_axial_cis, dim=self.internal_dim // self.num_heads, theta=rope_theta)
 
 
255
  freqs_cis = self.compute_cis(end_x=feat_sizes[0], end_y=feat_sizes[1])
256
+ self.freqs_cis = freqs_cis.to("cuda") if torch.cuda.is_available() else freqs_cis
 
 
257
  self.rope_k_repeat = rope_k_repeat
258
 
259
+ def forward(self, q: Tensor, k: Tensor, v: Tensor, num_k_exclude_rope: int = 0) -> Tensor:
 
 
260
  # Input projections
261
  q = self.q_proj(q)
262
  k = self.k_proj(k)
{sam2 → bboxmaskpose/sam2}/modeling/sam2_base.py RENAMED
@@ -4,20 +4,15 @@
4
  # This source code is licensed under the license found in the
5
  # LICENSE file in the root directory of this source tree.
6
 
7
- from loguru import logger
8
-
9
  import torch
10
  import torch.distributed
11
  import torch.nn.functional as F
12
-
13
  from torch.nn.init import trunc_normal_
14
 
15
- from sam2.modeling.sam.mask_decoder import MaskDecoder
16
- from sam2.modeling.sam.prompt_encoder import PromptEncoder
17
- from sam2.modeling.sam.transformer import TwoWayTransformer
18
- from sam2.modeling.sam2_utils import get_1d_sine_pe, MLP, select_closest_cond_frames
19
-
20
- from sam2.utils.kalman_filter import KalmanFilter
21
 
22
  # a large negative value as a placeholder score for missing objects
23
  NO_OBJ_SCORE = -1024.0
@@ -97,19 +92,10 @@ class SAM2Base(torch.nn.Module):
97
  # extra arguments used to construct the SAM mask decoder; if not None, it should be a dict of kwargs to be passed into `MaskDecoder` class.
98
  sam_mask_decoder_extra_args=None,
99
  compile_image_encoder: bool = False,
100
- # Whether to use SAMURAI or original SAM 2
101
- samurai_mode: bool = False,
102
- # Hyperparameters for SAMURAI
103
- stable_frames_threshold: int = 15,
104
- stable_ious_threshold: float = 0.3,
105
- min_obj_score_logits: float = -1,
106
- kf_score_weight: float = 0.15,
107
- memory_bank_iou_threshold: float = 0.5,
108
- memory_bank_obj_score_threshold: float = 0.0,
109
- memory_bank_kf_score_threshold: float = 0.0,
110
  ):
111
  super().__init__()
112
-
113
  # Part 1: the image backbone
114
  self.image_encoder = image_encoder
115
  # Use level 0, 1, 2 for high-res setting, or just level 2 for the default setting
@@ -137,16 +123,12 @@ class SAM2Base(torch.nn.Module):
137
  # Part 3: memory encoder for the previous frame's outputs
138
  self.memory_encoder = memory_encoder
139
  self.mem_dim = self.hidden_dim
140
- if hasattr(self.memory_encoder, "out_proj") and hasattr(
141
- self.memory_encoder.out_proj, "weight"
142
- ):
143
  # if there is compression of memories along channel dim
144
  self.mem_dim = self.memory_encoder.out_proj.weight.shape[0]
145
  self.num_maskmem = num_maskmem # Number of memories accessible
146
  # Temporal encoding of the memories
147
- self.maskmem_tpos_enc = torch.nn.Parameter(
148
- torch.zeros(num_maskmem, 1, 1, self.mem_dim)
149
- )
150
  trunc_normal_(self.maskmem_tpos_enc, std=0.02)
151
  # a single token to indicate no memory embedding from previous frames
152
  self.no_mem_embed = torch.nn.Parameter(torch.zeros(1, 1, self.hidden_dim))
@@ -194,37 +176,10 @@ class SAM2Base(torch.nn.Module):
194
 
195
  self._build_sam_heads()
196
  self.max_cond_frames_in_attn = max_cond_frames_in_attn
197
-
198
- # Whether to use SAMURAI or original SAM 2
199
- self.samurai_mode = samurai_mode
200
-
201
- # Init Kalman Filter
202
- self.kf = KalmanFilter()
203
- self.kf_mean = None
204
- self.kf_covariance = None
205
- self.stable_frames = 0
206
-
207
- # Debug purpose
208
- self.history = {} # debug
209
- self.frame_cnt = 0 # debug
210
-
211
- # Hyperparameters for SAMURAI
212
- self.stable_frames_threshold = stable_frames_threshold
213
- self.stable_ious_threshold = stable_ious_threshold
214
- self.min_obj_score_logits = min_obj_score_logits
215
- self.kf_score_weight = kf_score_weight
216
- self.memory_bank_iou_threshold = memory_bank_iou_threshold
217
- self.memory_bank_obj_score_threshold = memory_bank_obj_score_threshold
218
- self.memory_bank_kf_score_threshold = memory_bank_kf_score_threshold
219
-
220
- print(f"\033[93mSAMURAI mode: {self.samurai_mode}\033[0m")
221
-
222
  # Model compilation
223
  if compile_image_encoder:
224
  # Compile the forward function (not the full module) to allow loading checkpoints.
225
- print(
226
- "Image encoder compilation is enabled. First forward pass will be slow."
227
- )
228
  self.image_encoder.forward = torch.compile(
229
  self.image_encoder.forward,
230
  mode="max-autotune",
@@ -232,6 +187,15 @@ class SAM2Base(torch.nn.Module):
232
  dynamic=False,
233
  )
234
 
 
 
 
 
 
 
 
 
 
235
  @property
236
  def device(self):
237
  return next(self.parameters()).device
@@ -257,7 +221,9 @@ class SAM2Base(torch.nn.Module):
257
  ),
258
  input_image_size=(self.image_size, self.image_size),
259
  mask_in_chans=16,
 
260
  )
 
261
  self.sam_mask_decoder = MaskDecoder(
262
  num_multimask_outputs=3,
263
  transformer=TwoWayTransformer(
@@ -276,13 +242,16 @@ class SAM2Base(torch.nn.Module):
276
  use_multimask_token_for_obj_ptr=self.use_multimask_token_for_obj_ptr,
277
  **(self.sam_mask_decoder_extra_args or {}),
278
  )
 
 
 
 
 
279
  if self.use_obj_ptrs_in_encoder:
280
  # a linear projection on SAM output tokens to turn them into object pointers
281
  self.obj_ptr_proj = torch.nn.Linear(self.hidden_dim, self.hidden_dim)
282
  if self.use_mlp_for_obj_ptr_proj:
283
- self.obj_ptr_proj = MLP(
284
- self.hidden_dim, self.hidden_dim, self.hidden_dim, 3
285
- )
286
  else:
287
  self.obj_ptr_proj = torch.nn.Identity()
288
  if self.proj_tpos_enc_in_obj_ptrs:
@@ -395,7 +364,7 @@ class SAM2Base(torch.nn.Module):
395
  high_res_features=high_res_features,
396
  )
397
  if self.pred_obj_scores:
398
- is_obj_appearing = object_score_logits > self.min_obj_score_logits
399
 
400
  # Mask used for spatial memories is always a *hard* choice between obj and no obj,
401
  # consistent with the actual mask prediction
@@ -416,87 +385,7 @@ class SAM2Base(torch.nn.Module):
416
  )
417
 
418
  sam_output_token = sam_output_tokens[:, 0]
419
- kf_ious = None
420
- if multimask_output and self.samurai_mode:
421
- if self.kf_mean is None and self.kf_covariance is None or self.stable_frames == 0:
422
- best_iou_inds = torch.argmax(ious, dim=-1)
423
- batch_inds = torch.arange(B, device=device)
424
- low_res_masks = low_res_multimasks[batch_inds, best_iou_inds].unsqueeze(1)
425
- high_res_masks = high_res_multimasks[batch_inds, best_iou_inds].unsqueeze(1)
426
- non_zero_indices = torch.argwhere(high_res_masks[0][0] > 0.0)
427
- if len(non_zero_indices) == 0:
428
- high_res_bbox = [0, 0, 0, 0]
429
- else:
430
- y_min, x_min = non_zero_indices.min(dim=0).values
431
- y_max, x_max = non_zero_indices.max(dim=0).values
432
- high_res_bbox = [x_min.item(), y_min.item(), x_max.item(), y_max.item()]
433
- self.kf_mean, self.kf_covariance = self.kf.initiate(self.kf.xyxy_to_xyah(high_res_bbox))
434
- if sam_output_tokens.size(1) > 1:
435
- sam_output_token = sam_output_tokens[batch_inds, best_iou_inds]
436
- self.frame_cnt += 1
437
- self.stable_frames += 1
438
- elif self.stable_frames < self.stable_frames_threshold:
439
- self.kf_mean, self.kf_covariance = self.kf.predict(self.kf_mean, self.kf_covariance)
440
- best_iou_inds = torch.argmax(ious, dim=-1)
441
- batch_inds = torch.arange(B, device=device)
442
- low_res_masks = low_res_multimasks[batch_inds, best_iou_inds].unsqueeze(1)
443
- high_res_masks = high_res_multimasks[batch_inds, best_iou_inds].unsqueeze(1)
444
- non_zero_indices = torch.argwhere(high_res_masks[0][0] > 0.0)
445
- if len(non_zero_indices) == 0:
446
- high_res_bbox = [0, 0, 0, 0]
447
- else:
448
- y_min, x_min = non_zero_indices.min(dim=0).values
449
- y_max, x_max = non_zero_indices.max(dim=0).values
450
- high_res_bbox = [x_min.item(), y_min.item(), x_max.item(), y_max.item()]
451
- if ious[0][best_iou_inds] > self.stable_ious_threshold:
452
- self.kf_mean, self.kf_covariance = self.kf.update(self.kf_mean, self.kf_covariance, self.kf.xyxy_to_xyah(high_res_bbox))
453
- self.stable_frames += 1
454
- else:
455
- self.stable_frames = 0
456
- if sam_output_tokens.size(1) > 1:
457
- sam_output_token = sam_output_tokens[batch_inds, best_iou_inds]
458
- self.frame_cnt += 1
459
- else:
460
- self.kf_mean, self.kf_covariance = self.kf.predict(self.kf_mean, self.kf_covariance)
461
- high_res_multibboxes = []
462
- batch_inds = torch.arange(B, device=device)
463
- for i in range(ious.shape[1]):
464
- non_zero_indices = torch.argwhere(high_res_multimasks[batch_inds, i].unsqueeze(1)[0][0] > 0.0)
465
- if len(non_zero_indices) == 0:
466
- high_res_multibboxes.append([0, 0, 0, 0])
467
- else:
468
- y_min, x_min = non_zero_indices.min(dim=0).values
469
- y_max, x_max = non_zero_indices.max(dim=0).values
470
- high_res_multibboxes.append([x_min.item(), y_min.item(), x_max.item(), y_max.item()])
471
- # compute the IoU between the predicted bbox and the high_res_multibboxes
472
- kf_ious = torch.tensor(self.kf.compute_iou(self.kf_mean[:4], high_res_multibboxes), device=device)
473
- # weighted iou
474
- weighted_ious = self.kf_score_weight * kf_ious + (1 - self.kf_score_weight) * ious
475
- best_iou_inds = torch.argmax(weighted_ious, dim=-1)
476
- batch_inds = torch.arange(B, device=device)
477
- low_res_masks = low_res_multimasks[batch_inds, best_iou_inds].unsqueeze(1)
478
- high_res_masks = high_res_multimasks[batch_inds, best_iou_inds].unsqueeze(1)
479
- if sam_output_tokens.size(1) > 1:
480
- sam_output_token = sam_output_tokens[batch_inds, best_iou_inds]
481
-
482
- if False:
483
- # make all these on cpu
484
- self.history[self.frame_cnt] = {
485
- "kf_predicted_bbox": self.kf.xyah_to_xyxy(self.kf_mean[:4]),
486
- # "multi_masks": high_res_multimasks.cpu(),
487
- "ious": ious.cpu(),
488
- "multi_bboxes": high_res_multibboxes,
489
- "kf_ious": kf_ious,
490
- "weighted_ious": weighted_ious.cpu(),
491
- "final_selection": best_iou_inds.cpu(),
492
- }
493
- self.frame_cnt += 1
494
-
495
- if ious[0][best_iou_inds] < self.stable_ious_threshold:
496
- self.stable_frames = 0
497
- else:
498
- self.kf_mean, self.kf_covariance = self.kf.update(self.kf_mean, self.kf_covariance, self.kf.xyxy_to_xyah(high_res_multibboxes[best_iou_inds]))
499
- elif multimask_output and not self.samurai_mode:
500
  # take the best mask prediction (with the highest IoU estimation)
501
  best_iou_inds = torch.argmax(ious, dim=-1)
502
  batch_inds = torch.arange(B, device=device)
@@ -505,7 +394,6 @@ class SAM2Base(torch.nn.Module):
505
  if sam_output_tokens.size(1) > 1:
506
  sam_output_token = sam_output_tokens[batch_inds, best_iou_inds]
507
  else:
508
- best_iou_inds = 0
509
  low_res_masks, high_res_masks = low_res_multimasks, high_res_multimasks
510
 
511
  # Extract object pointer from the SAM output token (with occlusion handling)
@@ -529,8 +417,6 @@ class SAM2Base(torch.nn.Module):
529
  high_res_masks,
530
  obj_ptr,
531
  object_score_logits,
532
- ious[0][best_iou_inds],
533
- kf_ious[best_iou_inds] if kf_ious is not None else None,
534
  )
535
 
536
  def _use_mask_as_output(self, backbone_features, high_res_features, mask_inputs):
@@ -553,12 +439,10 @@ class SAM2Base(torch.nn.Module):
553
  ious = mask_inputs.new_ones(mask_inputs.size(0), 1).float()
554
  if not self.use_obj_ptrs_in_encoder:
555
  # all zeros as a dummy object pointer (of shape [B, C])
556
- obj_ptr = torch.zeros(
557
- mask_inputs.size(0), self.hidden_dim, device=mask_inputs.device
558
- )
559
  else:
560
  # produce an object pointer using the SAM decoder from the mask input
561
- _, _, _, _, _, obj_ptr, _, _, _ = self._forward_sam_heads(
562
  backbone_features=backbone_features,
563
  mask_inputs=self.mask_downsample(mask_inputs_float),
564
  high_res_features=high_res_features,
@@ -591,12 +475,8 @@ class SAM2Base(torch.nn.Module):
591
  if self.use_high_res_features_in_sam:
592
  # precompute projected level 0 and level 1 features in SAM decoder
593
  # to avoid running it again on every SAM click
594
- backbone_out["backbone_fpn"][0] = self.sam_mask_decoder.conv_s0(
595
- backbone_out["backbone_fpn"][0]
596
- )
597
- backbone_out["backbone_fpn"][1] = self.sam_mask_decoder.conv_s1(
598
- backbone_out["backbone_fpn"][1]
599
- )
600
  return backbone_out
601
 
602
  def _prepare_backbone_features(self, backbone_out):
@@ -657,63 +537,36 @@ class SAM2Base(torch.nn.Module):
657
  # We also allow taking the memory frame non-consecutively (with stride>1), in which case
658
  # we take (self.num_maskmem - 2) frames among every stride-th frames plus the last frame.
659
  stride = 1 if self.training else self.memory_temporal_stride_for_eval
660
-
661
- if self.samurai_mode:
662
- valid_indices = []
663
- if frame_idx > 1: # Ensure we have previous frames to evaluate
664
- for i in range(frame_idx - 1, 1, -1): # Iterate backwards through previous frames
665
- iou_score = output_dict["non_cond_frame_outputs"][i]["best_iou_score"] # Get mask affinity score
666
- obj_score = output_dict["non_cond_frame_outputs"][i]["object_score_logits"] # Get object score
667
- kf_score = output_dict["non_cond_frame_outputs"][i]["kf_score"] if "kf_score" in output_dict["non_cond_frame_outputs"][i] else None # Get motion score if available
668
- # Check if the scores meet the criteria for being a valid index
669
- if iou_score.item() > self.memory_bank_iou_threshold and \
670
- obj_score.item() > self.memory_bank_obj_score_threshold and \
671
- (kf_score is None or kf_score.item() > self.memory_bank_kf_score_threshold):
672
- valid_indices.insert(0, i)
673
- # Check the number of valid indices
674
- if len(valid_indices) >= self.max_obj_ptrs_in_encoder - 1:
675
- break
676
- if frame_idx - 1 not in valid_indices:
677
- valid_indices.append(frame_idx - 1)
678
- for t_pos in range(1, self.num_maskmem): # Iterate over the number of mask memories
679
- idx = t_pos - self.num_maskmem # Calculate the index for valid indices
680
- if idx < -len(valid_indices): # Skip if index is out of bounds
681
- continue
682
- out = output_dict["non_cond_frame_outputs"].get(valid_indices[idx], None) # Get output for the valid index
683
- if out is None: # If not found, check unselected outputs
684
- out = unselected_cond_outputs.get(valid_indices[idx], None)
685
- t_pos_and_prevs.append((t_pos, out)) # Append the temporal position and output to the list
686
- else:
687
- for t_pos in range(1, self.num_maskmem):
688
- t_rel = self.num_maskmem - t_pos # how many frames before current frame
689
- if t_rel == 1:
690
- # for t_rel == 1, we take the last frame (regardless of r)
691
- if not track_in_reverse:
692
- # the frame immediately before this frame (i.e. frame_idx - 1)
693
- prev_frame_idx = frame_idx - t_rel
694
- else:
695
- # the frame immediately after this frame (i.e. frame_idx + 1)
696
- prev_frame_idx = frame_idx + t_rel
697
  else:
698
- # for t_rel >= 2, we take the memory frame from every r-th frames
699
- if not track_in_reverse:
700
- # first find the nearest frame among every r-th frames before this frame
701
- # for r=1, this would be (frame_idx - 2)
702
- prev_frame_idx = ((frame_idx - 2) // stride) * stride
703
- # then seek further among every r-th frames
704
- prev_frame_idx = prev_frame_idx - (t_rel - 2) * stride
705
- else:
706
- # first find the nearest frame among every r-th frames after this frame
707
- # for r=1, this would be (frame_idx + 2)
708
- prev_frame_idx = -(-(frame_idx + 2) // stride) * stride
709
- # then seek further among every r-th frames
710
- prev_frame_idx = prev_frame_idx + (t_rel - 2) * stride
711
- out = output_dict["non_cond_frame_outputs"].get(prev_frame_idx, None)
712
- if out is None:
713
- # If an unselected conditioning frame is among the last (self.num_maskmem - 1)
714
- # frames, we still attend to it as if it's a non-conditioning frame.
715
- out = unselected_cond_outputs.get(prev_frame_idx, None)
716
- t_pos_and_prevs.append((t_pos, out))
 
 
 
717
 
718
  for t_pos, prev in t_pos_and_prevs:
719
  if prev is None:
@@ -726,9 +579,7 @@ class SAM2Base(torch.nn.Module):
726
  maskmem_enc = prev["maskmem_pos_enc"][-1].to(device)
727
  maskmem_enc = maskmem_enc.flatten(2).permute(2, 0, 1)
728
  # Temporal positional encoding
729
- maskmem_enc = (
730
- maskmem_enc + self.maskmem_tpos_enc[self.num_maskmem - t_pos - 1]
731
- )
732
  to_cat_memory_pos_embed.append(maskmem_enc)
733
 
734
  # Construct the list of past object pointers
@@ -738,20 +589,14 @@ class SAM2Base(torch.nn.Module):
738
  # (optionally, only include object pointers in the past during evaluation)
739
  if not self.training and self.only_obj_ptrs_in_the_past_for_eval:
740
  ptr_cond_outputs = {
741
- t: out
742
- for t, out in selected_cond_outputs.items()
743
- if (t >= frame_idx if track_in_reverse else t <= frame_idx)
744
  }
745
  else:
746
  ptr_cond_outputs = selected_cond_outputs
747
  pos_and_ptrs = [
748
  # Temporal pos encoding contains how far away each pointer is from current frame
749
  (
750
- (
751
- (frame_idx - t) * tpos_sign_mul
752
- if self.use_signed_tpos_enc_to_obj_ptrs
753
- else abs(frame_idx - t)
754
- ),
755
  out["obj_ptr"],
756
  )
757
  for t, out in ptr_cond_outputs.items()
@@ -761,9 +606,7 @@ class SAM2Base(torch.nn.Module):
761
  t = frame_idx + t_diff if track_in_reverse else frame_idx - t_diff
762
  if t < 0 or (num_frames is not None and t >= num_frames):
763
  break
764
- out = output_dict["non_cond_frame_outputs"].get(
765
- t, unselected_cond_outputs.get(t, None)
766
- )
767
  if out is not None:
768
  pos_and_ptrs.append((t_diff, out["obj_ptr"]))
769
  # If we have at least one object pointer, add them to the across attention
@@ -776,9 +619,7 @@ class SAM2Base(torch.nn.Module):
776
  if self.add_tpos_enc_to_obj_ptrs:
777
  t_diff_max = max_obj_ptrs_in_encoder - 1
778
  tpos_dim = C if self.proj_tpos_enc_in_obj_ptrs else self.mem_dim
779
- obj_pos = torch.tensor(pos_list).to(
780
- device=device, non_blocking=True
781
- )
782
  obj_pos = get_1d_sine_pe(obj_pos / t_diff_max, dim=tpos_dim)
783
  obj_pos = self.obj_ptr_tpos_proj(obj_pos)
784
  obj_pos = obj_pos.unsqueeze(1).expand(-1, B, self.mem_dim)
@@ -786,9 +627,7 @@ class SAM2Base(torch.nn.Module):
786
  obj_pos = obj_ptrs.new_zeros(len(pos_list), B, self.mem_dim)
787
  if self.mem_dim < C:
788
  # split a pointer into (C // self.mem_dim) tokens for self.mem_dim < C
789
- obj_ptrs = obj_ptrs.reshape(
790
- -1, B, C // self.mem_dim, self.mem_dim
791
- )
792
  obj_ptrs = obj_ptrs.permute(0, 2, 1, 3).flatten(0, 1)
793
  obj_pos = obj_pos.repeat_interleave(C // self.mem_dim, dim=0)
794
  to_cat_memory.append(obj_ptrs)
@@ -841,9 +680,7 @@ class SAM2Base(torch.nn.Module):
841
  # optionally, apply non-overlapping constraints to the masks (it's applied
842
  # in the batch dimension and should only be used during eval, where all
843
  # the objects come from the same video under batch size 1).
844
- pred_masks_high_res = self._apply_non_overlapping_constraints(
845
- pred_masks_high_res
846
- )
847
  # scale the raw mask logits with a temperature before applying sigmoid
848
  binarize = self.binarize_mask_from_pts_for_mem_enc and is_mask_from_pts
849
  if binarize and not self.training:
@@ -856,18 +693,14 @@ class SAM2Base(torch.nn.Module):
856
  mask_for_mem = mask_for_mem * self.sigmoid_scale_for_mem_enc
857
  if self.sigmoid_bias_for_mem_enc != 0.0:
858
  mask_for_mem = mask_for_mem + self.sigmoid_bias_for_mem_enc
859
- maskmem_out = self.memory_encoder(
860
- pix_feat, mask_for_mem, skip_mask_sigmoid=True # sigmoid already applied
861
- )
862
  maskmem_features = maskmem_out["vision_features"]
863
  maskmem_pos_enc = maskmem_out["vision_pos_enc"]
864
  # add a no-object embedding to the spatial memory to indicate that the frame
865
  # is predicted to be occluded (i.e. no object is appearing in the frame)
866
  if self.no_obj_embed_spatial is not None:
867
  is_obj_appearing = (object_score_logits > 0).float()
868
- maskmem_features += (
869
- 1 - is_obj_appearing[..., None, None]
870
- ) * self.no_obj_embed_spatial[..., None, None].expand(
871
  *maskmem_features.shape
872
  )
873
 
@@ -891,8 +724,7 @@ class SAM2Base(torch.nn.Module):
891
  # High-resolution feature maps for the SAM head, reshape (HW)BC => BCHW
892
  if len(current_vision_feats) > 1:
893
  high_res_features = [
894
- x.permute(1, 2, 0).view(x.size(1), x.size(2), *s)
895
- for x, s in zip(current_vision_feats[:-1], feat_sizes[:-1])
896
  ]
897
  else:
898
  high_res_features = None
@@ -901,9 +733,7 @@ class SAM2Base(torch.nn.Module):
901
  # (see it as a GT mask) without using a SAM prompt encoder + mask decoder.
902
  pix_feat = current_vision_feats[-1].permute(1, 2, 0)
903
  pix_feat = pix_feat.view(-1, self.hidden_dim, *feat_sizes[-1])
904
- sam_outputs = self._use_mask_as_output(
905
- pix_feat, high_res_features, mask_inputs
906
- )
907
  else:
908
  # fused the visual feature with previous memory features in the memory bank
909
  pix_feat = self._prepare_memory_conditioned_features(
@@ -1002,15 +832,11 @@ class SAM2Base(torch.nn.Module):
1002
  high_res_masks,
1003
  obj_ptr,
1004
  object_score_logits,
1005
- best_iou_score,
1006
- kf_ious
1007
  ) = sam_outputs
1008
 
1009
  current_out["pred_masks"] = low_res_masks
1010
  current_out["pred_masks_high_res"] = high_res_masks
1011
  current_out["obj_ptr"] = obj_ptr
1012
- current_out["best_iou_score"] = best_iou_score
1013
- current_out["kf_ious"] = kf_ious
1014
  if not self.training:
1015
  # Only add this in inference (to avoid unused param in activation checkpointing;
1016
  # it's mainly used in the demo to encode spatial memories w/ consolidated masks)
 
4
  # This source code is licensed under the license found in the
5
  # LICENSE file in the root directory of this source tree.
6
 
 
 
7
  import torch
8
  import torch.distributed
9
  import torch.nn.functional as F
 
10
  from torch.nn.init import trunc_normal_
11
 
12
+ from bboxmaskpose.sam2.modeling.sam2_utils import MLP, get_1d_sine_pe, select_closest_cond_frames
13
+ from bboxmaskpose.sam2.modeling.sam.mask_decoder import MaskDecoder
14
+ from bboxmaskpose.sam2.modeling.sam.prompt_encoder import PromptEncoder
15
+ from bboxmaskpose.sam2.modeling.sam.transformer import TwoWayTransformer
 
 
16
 
17
  # a large negative value as a placeholder score for missing objects
18
  NO_OBJ_SCORE = -1024.0
 
92
  # extra arguments used to construct the SAM mask decoder; if not None, it should be a dict of kwargs to be passed into `MaskDecoder` class.
93
  sam_mask_decoder_extra_args=None,
94
  compile_image_encoder: bool = False,
95
+ n_kpts_encoder: int = -1,
 
 
 
 
 
 
 
 
 
96
  ):
97
  super().__init__()
98
+ self.n_kpts_encoder = n_kpts_encoder
99
  # Part 1: the image backbone
100
  self.image_encoder = image_encoder
101
  # Use level 0, 1, 2 for high-res setting, or just level 2 for the default setting
 
123
  # Part 3: memory encoder for the previous frame's outputs
124
  self.memory_encoder = memory_encoder
125
  self.mem_dim = self.hidden_dim
126
+ if hasattr(self.memory_encoder, "out_proj") and hasattr(self.memory_encoder.out_proj, "weight"):
 
 
127
  # if there is compression of memories along channel dim
128
  self.mem_dim = self.memory_encoder.out_proj.weight.shape[0]
129
  self.num_maskmem = num_maskmem # Number of memories accessible
130
  # Temporal encoding of the memories
131
+ self.maskmem_tpos_enc = torch.nn.Parameter(torch.zeros(num_maskmem, 1, 1, self.mem_dim))
 
 
132
  trunc_normal_(self.maskmem_tpos_enc, std=0.02)
133
  # a single token to indicate no memory embedding from previous frames
134
  self.no_mem_embed = torch.nn.Parameter(torch.zeros(1, 1, self.hidden_dim))
 
176
 
177
  self._build_sam_heads()
178
  self.max_cond_frames_in_attn = max_cond_frames_in_attn
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
179
  # Model compilation
180
  if compile_image_encoder:
181
  # Compile the forward function (not the full module) to allow loading checkpoints.
182
+ print("Image encoder compilation is enabled. First forward pass will be slow.")
 
 
183
  self.image_encoder.forward = torch.compile(
184
  self.image_encoder.forward,
185
  mode="max-autotune",
 
187
  dynamic=False,
188
  )
189
 
190
+ freeze_prompt_encoder = False
191
+ freeze_mask_decoder = False
192
+ if freeze_prompt_encoder:
193
+ for p in self.sam_prompt_encoder.parameters():
194
+ p.requires_grad = False
195
+ if freeze_mask_decoder:
196
+ for p in self.sam_mask_decoder.parameters():
197
+ p.requires_grad = False
198
+
199
  @property
200
  def device(self):
201
  return next(self.parameters()).device
 
221
  ),
222
  input_image_size=(self.image_size, self.image_size),
223
  mask_in_chans=16,
224
+ n_kpts_encoder=self.n_kpts_encoder,
225
  )
226
+
227
  self.sam_mask_decoder = MaskDecoder(
228
  num_multimask_outputs=3,
229
  transformer=TwoWayTransformer(
 
242
  use_multimask_token_for_obj_ptr=self.use_multimask_token_for_obj_ptr,
243
  **(self.sam_mask_decoder_extra_args or {}),
244
  )
245
+ for p in self.sam_prompt_encoder.parameters():
246
+ p.requires_grad = True
247
+ for p in self.sam_mask_decoder.parameters():
248
+ p.requires_grad = True
249
+
250
  if self.use_obj_ptrs_in_encoder:
251
  # a linear projection on SAM output tokens to turn them into object pointers
252
  self.obj_ptr_proj = torch.nn.Linear(self.hidden_dim, self.hidden_dim)
253
  if self.use_mlp_for_obj_ptr_proj:
254
+ self.obj_ptr_proj = MLP(self.hidden_dim, self.hidden_dim, self.hidden_dim, 3)
 
 
255
  else:
256
  self.obj_ptr_proj = torch.nn.Identity()
257
  if self.proj_tpos_enc_in_obj_ptrs:
 
364
  high_res_features=high_res_features,
365
  )
366
  if self.pred_obj_scores:
367
+ is_obj_appearing = object_score_logits > 0
368
 
369
  # Mask used for spatial memories is always a *hard* choice between obj and no obj,
370
  # consistent with the actual mask prediction
 
385
  )
386
 
387
  sam_output_token = sam_output_tokens[:, 0]
388
+ if multimask_output:
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
389
  # take the best mask prediction (with the highest IoU estimation)
390
  best_iou_inds = torch.argmax(ious, dim=-1)
391
  batch_inds = torch.arange(B, device=device)
 
394
  if sam_output_tokens.size(1) > 1:
395
  sam_output_token = sam_output_tokens[batch_inds, best_iou_inds]
396
  else:
 
397
  low_res_masks, high_res_masks = low_res_multimasks, high_res_multimasks
398
 
399
  # Extract object pointer from the SAM output token (with occlusion handling)
 
417
  high_res_masks,
418
  obj_ptr,
419
  object_score_logits,
 
 
420
  )
421
 
422
  def _use_mask_as_output(self, backbone_features, high_res_features, mask_inputs):
 
439
  ious = mask_inputs.new_ones(mask_inputs.size(0), 1).float()
440
  if not self.use_obj_ptrs_in_encoder:
441
  # all zeros as a dummy object pointer (of shape [B, C])
442
+ obj_ptr = torch.zeros(mask_inputs.size(0), self.hidden_dim, device=mask_inputs.device)
 
 
443
  else:
444
  # produce an object pointer using the SAM decoder from the mask input
445
+ _, _, _, _, _, obj_ptr, _ = self._forward_sam_heads(
446
  backbone_features=backbone_features,
447
  mask_inputs=self.mask_downsample(mask_inputs_float),
448
  high_res_features=high_res_features,
 
475
  if self.use_high_res_features_in_sam:
476
  # precompute projected level 0 and level 1 features in SAM decoder
477
  # to avoid running it again on every SAM click
478
+ backbone_out["backbone_fpn"][0] = self.sam_mask_decoder.conv_s0(backbone_out["backbone_fpn"][0])
479
+ backbone_out["backbone_fpn"][1] = self.sam_mask_decoder.conv_s1(backbone_out["backbone_fpn"][1])
 
 
 
 
480
  return backbone_out
481
 
482
  def _prepare_backbone_features(self, backbone_out):
 
537
  # We also allow taking the memory frame non-consecutively (with stride>1), in which case
538
  # we take (self.num_maskmem - 2) frames among every stride-th frames plus the last frame.
539
  stride = 1 if self.training else self.memory_temporal_stride_for_eval
540
+ for t_pos in range(1, self.num_maskmem):
541
+ t_rel = self.num_maskmem - t_pos # how many frames before current frame
542
+ if t_rel == 1:
543
+ # for t_rel == 1, we take the last frame (regardless of r)
544
+ if not track_in_reverse:
545
+ # the frame immediately before this frame (i.e. frame_idx - 1)
546
+ prev_frame_idx = frame_idx - t_rel
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
547
  else:
548
+ # the frame immediately after this frame (i.e. frame_idx + 1)
549
+ prev_frame_idx = frame_idx + t_rel
550
+ else:
551
+ # for t_rel >= 2, we take the memory frame from every r-th frames
552
+ if not track_in_reverse:
553
+ # first find the nearest frame among every r-th frames before this frame
554
+ # for r=1, this would be (frame_idx - 2)
555
+ prev_frame_idx = ((frame_idx - 2) // stride) * stride
556
+ # then seek further among every r-th frames
557
+ prev_frame_idx = prev_frame_idx - (t_rel - 2) * stride
558
+ else:
559
+ # first find the nearest frame among every r-th frames after this frame
560
+ # for r=1, this would be (frame_idx + 2)
561
+ prev_frame_idx = -(-(frame_idx + 2) // stride) * stride
562
+ # then seek further among every r-th frames
563
+ prev_frame_idx = prev_frame_idx + (t_rel - 2) * stride
564
+ out = output_dict["non_cond_frame_outputs"].get(prev_frame_idx, None)
565
+ if out is None:
566
+ # If an unselected conditioning frame is among the last (self.num_maskmem - 1)
567
+ # frames, we still attend to it as if it's a non-conditioning frame.
568
+ out = unselected_cond_outputs.get(prev_frame_idx, None)
569
+ t_pos_and_prevs.append((t_pos, out))
570
 
571
  for t_pos, prev in t_pos_and_prevs:
572
  if prev is None:
 
579
  maskmem_enc = prev["maskmem_pos_enc"][-1].to(device)
580
  maskmem_enc = maskmem_enc.flatten(2).permute(2, 0, 1)
581
  # Temporal positional encoding
582
+ maskmem_enc = maskmem_enc + self.maskmem_tpos_enc[self.num_maskmem - t_pos - 1]
 
 
583
  to_cat_memory_pos_embed.append(maskmem_enc)
584
 
585
  # Construct the list of past object pointers
 
589
  # (optionally, only include object pointers in the past during evaluation)
590
  if not self.training and self.only_obj_ptrs_in_the_past_for_eval:
591
  ptr_cond_outputs = {
592
+ t: out for t, out in selected_cond_outputs.items() if (t >= frame_idx if track_in_reverse else t <= frame_idx)
 
 
593
  }
594
  else:
595
  ptr_cond_outputs = selected_cond_outputs
596
  pos_and_ptrs = [
597
  # Temporal pos encoding contains how far away each pointer is from current frame
598
  (
599
+ ((frame_idx - t) * tpos_sign_mul if self.use_signed_tpos_enc_to_obj_ptrs else abs(frame_idx - t)),
 
 
 
 
600
  out["obj_ptr"],
601
  )
602
  for t, out in ptr_cond_outputs.items()
 
606
  t = frame_idx + t_diff if track_in_reverse else frame_idx - t_diff
607
  if t < 0 or (num_frames is not None and t >= num_frames):
608
  break
609
+ out = output_dict["non_cond_frame_outputs"].get(t, unselected_cond_outputs.get(t, None))
 
 
610
  if out is not None:
611
  pos_and_ptrs.append((t_diff, out["obj_ptr"]))
612
  # If we have at least one object pointer, add them to the across attention
 
619
  if self.add_tpos_enc_to_obj_ptrs:
620
  t_diff_max = max_obj_ptrs_in_encoder - 1
621
  tpos_dim = C if self.proj_tpos_enc_in_obj_ptrs else self.mem_dim
622
+ obj_pos = torch.tensor(pos_list).to(device=device, non_blocking=True)
 
 
623
  obj_pos = get_1d_sine_pe(obj_pos / t_diff_max, dim=tpos_dim)
624
  obj_pos = self.obj_ptr_tpos_proj(obj_pos)
625
  obj_pos = obj_pos.unsqueeze(1).expand(-1, B, self.mem_dim)
 
627
  obj_pos = obj_ptrs.new_zeros(len(pos_list), B, self.mem_dim)
628
  if self.mem_dim < C:
629
  # split a pointer into (C // self.mem_dim) tokens for self.mem_dim < C
630
+ obj_ptrs = obj_ptrs.reshape(-1, B, C // self.mem_dim, self.mem_dim)
 
 
631
  obj_ptrs = obj_ptrs.permute(0, 2, 1, 3).flatten(0, 1)
632
  obj_pos = obj_pos.repeat_interleave(C // self.mem_dim, dim=0)
633
  to_cat_memory.append(obj_ptrs)
 
680
  # optionally, apply non-overlapping constraints to the masks (it's applied
681
  # in the batch dimension and should only be used during eval, where all
682
  # the objects come from the same video under batch size 1).
683
+ pred_masks_high_res = self._apply_non_overlapping_constraints(pred_masks_high_res)
 
 
684
  # scale the raw mask logits with a temperature before applying sigmoid
685
  binarize = self.binarize_mask_from_pts_for_mem_enc and is_mask_from_pts
686
  if binarize and not self.training:
 
693
  mask_for_mem = mask_for_mem * self.sigmoid_scale_for_mem_enc
694
  if self.sigmoid_bias_for_mem_enc != 0.0:
695
  mask_for_mem = mask_for_mem + self.sigmoid_bias_for_mem_enc
696
+ maskmem_out = self.memory_encoder(pix_feat, mask_for_mem, skip_mask_sigmoid=True) # sigmoid already applied
 
 
697
  maskmem_features = maskmem_out["vision_features"]
698
  maskmem_pos_enc = maskmem_out["vision_pos_enc"]
699
  # add a no-object embedding to the spatial memory to indicate that the frame
700
  # is predicted to be occluded (i.e. no object is appearing in the frame)
701
  if self.no_obj_embed_spatial is not None:
702
  is_obj_appearing = (object_score_logits > 0).float()
703
+ maskmem_features += (1 - is_obj_appearing[..., None, None]) * self.no_obj_embed_spatial[..., None, None].expand(
 
 
704
  *maskmem_features.shape
705
  )
706
 
 
724
  # High-resolution feature maps for the SAM head, reshape (HW)BC => BCHW
725
  if len(current_vision_feats) > 1:
726
  high_res_features = [
727
+ x.permute(1, 2, 0).view(x.size(1), x.size(2), *s) for x, s in zip(current_vision_feats[:-1], feat_sizes[:-1])
 
728
  ]
729
  else:
730
  high_res_features = None
 
733
  # (see it as a GT mask) without using a SAM prompt encoder + mask decoder.
734
  pix_feat = current_vision_feats[-1].permute(1, 2, 0)
735
  pix_feat = pix_feat.view(-1, self.hidden_dim, *feat_sizes[-1])
736
+ sam_outputs = self._use_mask_as_output(pix_feat, high_res_features, mask_inputs)
 
 
737
  else:
738
  # fused the visual feature with previous memory features in the memory bank
739
  pix_feat = self._prepare_memory_conditioned_features(
 
832
  high_res_masks,
833
  obj_ptr,
834
  object_score_logits,
 
 
835
  ) = sam_outputs
836
 
837
  current_out["pred_masks"] = low_res_masks
838
  current_out["pred_masks_high_res"] = high_res_masks
839
  current_out["obj_ptr"] = obj_ptr
 
 
840
  if not self.training:
841
  # Only add this in inference (to avoid unused param in activation checkpointing;
842
  # it's mainly used in the demo to encode spatial memories w/ consolidated masks)
{sam2 → bboxmaskpose/sam2}/modeling/sam2_base_pose.py RENAMED
@@ -4,20 +4,17 @@
4
  # This source code is licensed under the license found in the
5
  # LICENSE file in the root directory of this source tree.
6
 
7
- from loguru import logger
8
-
9
  import torch
10
  import torch.distributed
11
  import torch.nn.functional as F
12
-
13
  from torch.nn.init import trunc_normal_
14
 
15
- from sam2.modeling.sam.mask_decoder import MaskDecoder
16
- from sam2.modeling.sam.pose_encoder import PoseEncoder
17
- from sam2.modeling.sam.transformer import TwoWayTransformer
18
- from sam2.modeling.sam2_utils import get_1d_sine_pe, MLP, select_closest_cond_frames
19
-
20
- from sam2.utils.kalman_filter import KalmanFilter
21
 
22
  # a large negative value as a placeholder score for missing objects
23
  NO_OBJ_SCORE = -1024.0
@@ -137,16 +134,12 @@ class SAM2Base(torch.nn.Module):
137
  # Part 3: memory encoder for the previous frame's outputs
138
  self.memory_encoder = memory_encoder
139
  self.mem_dim = self.hidden_dim
140
- if hasattr(self.memory_encoder, "out_proj") and hasattr(
141
- self.memory_encoder.out_proj, "weight"
142
- ):
143
  # if there is compression of memories along channel dim
144
  self.mem_dim = self.memory_encoder.out_proj.weight.shape[0]
145
  self.num_maskmem = num_maskmem # Number of memories accessible
146
  # Temporal encoding of the memories
147
- self.maskmem_tpos_enc = torch.nn.Parameter(
148
- torch.zeros(num_maskmem, 1, 1, self.mem_dim)
149
- )
150
  trunc_normal_(self.maskmem_tpos_enc, std=0.02)
151
  # a single token to indicate no memory embedding from previous frames
152
  self.no_mem_embed = torch.nn.Parameter(torch.zeros(1, 1, self.hidden_dim))
@@ -205,8 +198,8 @@ class SAM2Base(torch.nn.Module):
205
  self.stable_frames = 0
206
 
207
  # Debug purpose
208
- self.history = {} # debug
209
- self.frame_cnt = 0 # debug
210
 
211
  # Hyperparameters for SAMURAI
212
  self.stable_frames_threshold = stable_frames_threshold
@@ -222,9 +215,7 @@ class SAM2Base(torch.nn.Module):
222
  # Model compilation
223
  if compile_image_encoder:
224
  # Compile the forward function (not the full module) to allow loading checkpoints.
225
- print(
226
- "Image encoder compilation is enabled. First forward pass will be slow."
227
- )
228
  self.image_encoder.forward = torch.compile(
229
  self.image_encoder.forward,
230
  mode="max-autotune",
@@ -280,9 +271,7 @@ class SAM2Base(torch.nn.Module):
280
  # a linear projection on SAM output tokens to turn them into object pointers
281
  self.obj_ptr_proj = torch.nn.Linear(self.hidden_dim, self.hidden_dim)
282
  if self.use_mlp_for_obj_ptr_proj:
283
- self.obj_ptr_proj = MLP(
284
- self.hidden_dim, self.hidden_dim, self.hidden_dim, 3
285
- )
286
  else:
287
  self.obj_ptr_proj = torch.nn.Identity()
288
  if self.proj_tpos_enc_in_obj_ptrs:
@@ -480,7 +469,7 @@ class SAM2Base(torch.nn.Module):
480
  sam_output_token = sam_output_tokens[batch_inds, best_iou_inds]
481
 
482
  if False:
483
- # make all these on cpu
484
  self.history[self.frame_cnt] = {
485
  "kf_predicted_bbox": self.kf.xyah_to_xyxy(self.kf_mean[:4]),
486
  # "multi_masks": high_res_multimasks.cpu(),
@@ -495,7 +484,9 @@ class SAM2Base(torch.nn.Module):
495
  if ious[0][best_iou_inds] < self.stable_ious_threshold:
496
  self.stable_frames = 0
497
  else:
498
- self.kf_mean, self.kf_covariance = self.kf.update(self.kf_mean, self.kf_covariance, self.kf.xyxy_to_xyah(high_res_multibboxes[best_iou_inds]))
 
 
499
  elif multimask_output and not self.samurai_mode:
500
  # take the best mask prediction (with the highest IoU estimation)
501
  best_iou_inds = torch.argmax(ious, dim=-1)
@@ -553,9 +544,7 @@ class SAM2Base(torch.nn.Module):
553
  ious = mask_inputs.new_ones(mask_inputs.size(0), 1).float()
554
  if not self.use_obj_ptrs_in_encoder:
555
  # all zeros as a dummy object pointer (of shape [B, C])
556
- obj_ptr = torch.zeros(
557
- mask_inputs.size(0), self.hidden_dim, device=mask_inputs.device
558
- )
559
  else:
560
  # produce an object pointer using the SAM decoder from the mask input
561
  _, _, _, _, _, obj_ptr, _, _, _ = self._forward_sam_heads(
@@ -591,12 +580,8 @@ class SAM2Base(torch.nn.Module):
591
  if self.use_high_res_features_in_sam:
592
  # precompute projected level 0 and level 1 features in SAM decoder
593
  # to avoid running it again on every SAM click
594
- backbone_out["backbone_fpn"][0] = self.sam_mask_decoder.conv_s0(
595
- backbone_out["backbone_fpn"][0]
596
- )
597
- backbone_out["backbone_fpn"][1] = self.sam_mask_decoder.conv_s1(
598
- backbone_out["backbone_fpn"][1]
599
- )
600
  return backbone_out
601
 
602
  def _prepare_backbone_features(self, backbone_out):
@@ -659,21 +644,27 @@ class SAM2Base(torch.nn.Module):
659
  stride = 1 if self.training else self.memory_temporal_stride_for_eval
660
 
661
  if self.samurai_mode:
662
- valid_indices = []
663
  if frame_idx > 1: # Ensure we have previous frames to evaluate
664
  for i in range(frame_idx - 1, 1, -1): # Iterate backwards through previous frames
665
  iou_score = output_dict["non_cond_frame_outputs"][i]["best_iou_score"] # Get mask affinity score
666
  obj_score = output_dict["non_cond_frame_outputs"][i]["object_score_logits"] # Get object score
667
- kf_score = output_dict["non_cond_frame_outputs"][i]["kf_score"] if "kf_score" in output_dict["non_cond_frame_outputs"][i] else None # Get motion score if available
 
 
 
 
668
  # Check if the scores meet the criteria for being a valid index
669
- if iou_score.item() > self.memory_bank_iou_threshold and \
670
- obj_score.item() > self.memory_bank_obj_score_threshold and \
671
- (kf_score is None or kf_score.item() > self.memory_bank_kf_score_threshold):
672
- valid_indices.insert(0, i)
 
 
673
  # Check the number of valid indices
674
- if len(valid_indices) >= self.max_obj_ptrs_in_encoder - 1:
675
  break
676
- if frame_idx - 1 not in valid_indices:
677
  valid_indices.append(frame_idx - 1)
678
  for t_pos in range(1, self.num_maskmem): # Iterate over the number of mask memories
679
  idx = t_pos - self.num_maskmem # Calculate the index for valid indices
@@ -726,9 +717,7 @@ class SAM2Base(torch.nn.Module):
726
  maskmem_enc = prev["maskmem_pos_enc"][-1].to(device)
727
  maskmem_enc = maskmem_enc.flatten(2).permute(2, 0, 1)
728
  # Temporal positional encoding
729
- maskmem_enc = (
730
- maskmem_enc + self.maskmem_tpos_enc[self.num_maskmem - t_pos - 1]
731
- )
732
  to_cat_memory_pos_embed.append(maskmem_enc)
733
 
734
  # Construct the list of past object pointers
@@ -738,20 +727,14 @@ class SAM2Base(torch.nn.Module):
738
  # (optionally, only include object pointers in the past during evaluation)
739
  if not self.training and self.only_obj_ptrs_in_the_past_for_eval:
740
  ptr_cond_outputs = {
741
- t: out
742
- for t, out in selected_cond_outputs.items()
743
- if (t >= frame_idx if track_in_reverse else t <= frame_idx)
744
  }
745
  else:
746
  ptr_cond_outputs = selected_cond_outputs
747
  pos_and_ptrs = [
748
  # Temporal pos encoding contains how far away each pointer is from current frame
749
  (
750
- (
751
- (frame_idx - t) * tpos_sign_mul
752
- if self.use_signed_tpos_enc_to_obj_ptrs
753
- else abs(frame_idx - t)
754
- ),
755
  out["obj_ptr"],
756
  )
757
  for t, out in ptr_cond_outputs.items()
@@ -761,9 +744,7 @@ class SAM2Base(torch.nn.Module):
761
  t = frame_idx + t_diff if track_in_reverse else frame_idx - t_diff
762
  if t < 0 or (num_frames is not None and t >= num_frames):
763
  break
764
- out = output_dict["non_cond_frame_outputs"].get(
765
- t, unselected_cond_outputs.get(t, None)
766
- )
767
  if out is not None:
768
  pos_and_ptrs.append((t_diff, out["obj_ptr"]))
769
  # If we have at least one object pointer, add them to the across attention
@@ -776,9 +757,7 @@ class SAM2Base(torch.nn.Module):
776
  if self.add_tpos_enc_to_obj_ptrs:
777
  t_diff_max = max_obj_ptrs_in_encoder - 1
778
  tpos_dim = C if self.proj_tpos_enc_in_obj_ptrs else self.mem_dim
779
- obj_pos = torch.tensor(pos_list).to(
780
- device=device, non_blocking=True
781
- )
782
  obj_pos = get_1d_sine_pe(obj_pos / t_diff_max, dim=tpos_dim)
783
  obj_pos = self.obj_ptr_tpos_proj(obj_pos)
784
  obj_pos = obj_pos.unsqueeze(1).expand(-1, B, self.mem_dim)
@@ -786,9 +765,7 @@ class SAM2Base(torch.nn.Module):
786
  obj_pos = obj_ptrs.new_zeros(len(pos_list), B, self.mem_dim)
787
  if self.mem_dim < C:
788
  # split a pointer into (C // self.mem_dim) tokens for self.mem_dim < C
789
- obj_ptrs = obj_ptrs.reshape(
790
- -1, B, C // self.mem_dim, self.mem_dim
791
- )
792
  obj_ptrs = obj_ptrs.permute(0, 2, 1, 3).flatten(0, 1)
793
  obj_pos = obj_pos.repeat_interleave(C // self.mem_dim, dim=0)
794
  to_cat_memory.append(obj_ptrs)
@@ -841,9 +818,7 @@ class SAM2Base(torch.nn.Module):
841
  # optionally, apply non-overlapping constraints to the masks (it's applied
842
  # in the batch dimension and should only be used during eval, where all
843
  # the objects come from the same video under batch size 1).
844
- pred_masks_high_res = self._apply_non_overlapping_constraints(
845
- pred_masks_high_res
846
- )
847
  # scale the raw mask logits with a temperature before applying sigmoid
848
  binarize = self.binarize_mask_from_pts_for_mem_enc and is_mask_from_pts
849
  if binarize and not self.training:
@@ -856,18 +831,14 @@ class SAM2Base(torch.nn.Module):
856
  mask_for_mem = mask_for_mem * self.sigmoid_scale_for_mem_enc
857
  if self.sigmoid_bias_for_mem_enc != 0.0:
858
  mask_for_mem = mask_for_mem + self.sigmoid_bias_for_mem_enc
859
- maskmem_out = self.memory_encoder(
860
- pix_feat, mask_for_mem, skip_mask_sigmoid=True # sigmoid already applied
861
- )
862
  maskmem_features = maskmem_out["vision_features"]
863
  maskmem_pos_enc = maskmem_out["vision_pos_enc"]
864
  # add a no-object embedding to the spatial memory to indicate that the frame
865
  # is predicted to be occluded (i.e. no object is appearing in the frame)
866
  if self.no_obj_embed_spatial is not None:
867
  is_obj_appearing = (object_score_logits > 0).float()
868
- maskmem_features += (
869
- 1 - is_obj_appearing[..., None, None]
870
- ) * self.no_obj_embed_spatial[..., None, None].expand(
871
  *maskmem_features.shape
872
  )
873
 
@@ -891,8 +862,7 @@ class SAM2Base(torch.nn.Module):
891
  # High-resolution feature maps for the SAM head, reshape (HW)BC => BCHW
892
  if len(current_vision_feats) > 1:
893
  high_res_features = [
894
- x.permute(1, 2, 0).view(x.size(1), x.size(2), *s)
895
- for x, s in zip(current_vision_feats[:-1], feat_sizes[:-1])
896
  ]
897
  else:
898
  high_res_features = None
@@ -901,9 +871,7 @@ class SAM2Base(torch.nn.Module):
901
  # (see it as a GT mask) without using a SAM prompt encoder + mask decoder.
902
  pix_feat = current_vision_feats[-1].permute(1, 2, 0)
903
  pix_feat = pix_feat.view(-1, self.hidden_dim, *feat_sizes[-1])
904
- sam_outputs = self._use_mask_as_output(
905
- pix_feat, high_res_features, mask_inputs
906
- )
907
  else:
908
  # fused the visual feature with previous memory features in the memory bank
909
  pix_feat = self._prepare_memory_conditioned_features(
@@ -994,17 +962,7 @@ class SAM2Base(torch.nn.Module):
994
  prev_sam_mask_logits,
995
  )
996
 
997
- (
998
- _,
999
- _,
1000
- _,
1001
- low_res_masks,
1002
- high_res_masks,
1003
- obj_ptr,
1004
- object_score_logits,
1005
- best_iou_score,
1006
- kf_ious
1007
- ) = sam_outputs
1008
 
1009
  current_out["pred_masks"] = low_res_masks
1010
  current_out["pred_masks_high_res"] = high_res_masks
 
4
  # This source code is licensed under the license found in the
5
  # LICENSE file in the root directory of this source tree.
6
 
 
 
7
  import torch
8
  import torch.distributed
9
  import torch.nn.functional as F
 
10
  from torch.nn.init import trunc_normal_
11
 
12
+ from bboxmaskpose.sam2.modeling.sam2_utils import MLP, get_1d_sine_pe, select_closest_cond_frames
13
+ from bboxmaskpose.sam2.modeling.sam.mask_decoder import MaskDecoder
14
+ from bboxmaskpose.sam2.modeling.sam.pose_encoder import PoseEncoder
15
+ from bboxmaskpose.sam2.modeling.sam.transformer import TwoWayTransformer
16
+ from bboxmaskpose.sam2.utils.kalman_filter import KalmanFilter
17
+ from loguru import logger
18
 
19
  # a large negative value as a placeholder score for missing objects
20
  NO_OBJ_SCORE = -1024.0
 
134
  # Part 3: memory encoder for the previous frame's outputs
135
  self.memory_encoder = memory_encoder
136
  self.mem_dim = self.hidden_dim
137
+ if hasattr(self.memory_encoder, "out_proj") and hasattr(self.memory_encoder.out_proj, "weight"):
 
 
138
  # if there is compression of memories along channel dim
139
  self.mem_dim = self.memory_encoder.out_proj.weight.shape[0]
140
  self.num_maskmem = num_maskmem # Number of memories accessible
141
  # Temporal encoding of the memories
142
+ self.maskmem_tpos_enc = torch.nn.Parameter(torch.zeros(num_maskmem, 1, 1, self.mem_dim))
 
 
143
  trunc_normal_(self.maskmem_tpos_enc, std=0.02)
144
  # a single token to indicate no memory embedding from previous frames
145
  self.no_mem_embed = torch.nn.Parameter(torch.zeros(1, 1, self.hidden_dim))
 
198
  self.stable_frames = 0
199
 
200
  # Debug purpose
201
+ self.history = {} # debug
202
+ self.frame_cnt = 0 # debug
203
 
204
  # Hyperparameters for SAMURAI
205
  self.stable_frames_threshold = stable_frames_threshold
 
215
  # Model compilation
216
  if compile_image_encoder:
217
  # Compile the forward function (not the full module) to allow loading checkpoints.
218
+ print("Image encoder compilation is enabled. First forward pass will be slow.")
 
 
219
  self.image_encoder.forward = torch.compile(
220
  self.image_encoder.forward,
221
  mode="max-autotune",
 
271
  # a linear projection on SAM output tokens to turn them into object pointers
272
  self.obj_ptr_proj = torch.nn.Linear(self.hidden_dim, self.hidden_dim)
273
  if self.use_mlp_for_obj_ptr_proj:
274
+ self.obj_ptr_proj = MLP(self.hidden_dim, self.hidden_dim, self.hidden_dim, 3)
 
 
275
  else:
276
  self.obj_ptr_proj = torch.nn.Identity()
277
  if self.proj_tpos_enc_in_obj_ptrs:
 
469
  sam_output_token = sam_output_tokens[batch_inds, best_iou_inds]
470
 
471
  if False:
472
+ # make all these on cpu
473
  self.history[self.frame_cnt] = {
474
  "kf_predicted_bbox": self.kf.xyah_to_xyxy(self.kf_mean[:4]),
475
  # "multi_masks": high_res_multimasks.cpu(),
 
484
  if ious[0][best_iou_inds] < self.stable_ious_threshold:
485
  self.stable_frames = 0
486
  else:
487
+ self.kf_mean, self.kf_covariance = self.kf.update(
488
+ self.kf_mean, self.kf_covariance, self.kf.xyxy_to_xyah(high_res_multibboxes[best_iou_inds])
489
+ )
490
  elif multimask_output and not self.samurai_mode:
491
  # take the best mask prediction (with the highest IoU estimation)
492
  best_iou_inds = torch.argmax(ious, dim=-1)
 
544
  ious = mask_inputs.new_ones(mask_inputs.size(0), 1).float()
545
  if not self.use_obj_ptrs_in_encoder:
546
  # all zeros as a dummy object pointer (of shape [B, C])
547
+ obj_ptr = torch.zeros(mask_inputs.size(0), self.hidden_dim, device=mask_inputs.device)
 
 
548
  else:
549
  # produce an object pointer using the SAM decoder from the mask input
550
  _, _, _, _, _, obj_ptr, _, _, _ = self._forward_sam_heads(
 
580
  if self.use_high_res_features_in_sam:
581
  # precompute projected level 0 and level 1 features in SAM decoder
582
  # to avoid running it again on every SAM click
583
+ backbone_out["backbone_fpn"][0] = self.sam_mask_decoder.conv_s0(backbone_out["backbone_fpn"][0])
584
+ backbone_out["backbone_fpn"][1] = self.sam_mask_decoder.conv_s1(backbone_out["backbone_fpn"][1])
 
 
 
 
585
  return backbone_out
586
 
587
  def _prepare_backbone_features(self, backbone_out):
 
644
  stride = 1 if self.training else self.memory_temporal_stride_for_eval
645
 
646
  if self.samurai_mode:
647
+ valid_indices = []
648
  if frame_idx > 1: # Ensure we have previous frames to evaluate
649
  for i in range(frame_idx - 1, 1, -1): # Iterate backwards through previous frames
650
  iou_score = output_dict["non_cond_frame_outputs"][i]["best_iou_score"] # Get mask affinity score
651
  obj_score = output_dict["non_cond_frame_outputs"][i]["object_score_logits"] # Get object score
652
+ kf_score = (
653
+ output_dict["non_cond_frame_outputs"][i]["kf_score"]
654
+ if "kf_score" in output_dict["non_cond_frame_outputs"][i]
655
+ else None
656
+ ) # Get motion score if available
657
  # Check if the scores meet the criteria for being a valid index
658
+ if (
659
+ iou_score.item() > self.memory_bank_iou_threshold
660
+ and obj_score.item() > self.memory_bank_obj_score_threshold
661
+ and (kf_score is None or kf_score.item() > self.memory_bank_kf_score_threshold)
662
+ ):
663
+ valid_indices.insert(0, i)
664
  # Check the number of valid indices
665
+ if len(valid_indices) >= self.max_obj_ptrs_in_encoder - 1:
666
  break
667
+ if frame_idx - 1 not in valid_indices:
668
  valid_indices.append(frame_idx - 1)
669
  for t_pos in range(1, self.num_maskmem): # Iterate over the number of mask memories
670
  idx = t_pos - self.num_maskmem # Calculate the index for valid indices
 
717
  maskmem_enc = prev["maskmem_pos_enc"][-1].to(device)
718
  maskmem_enc = maskmem_enc.flatten(2).permute(2, 0, 1)
719
  # Temporal positional encoding
720
+ maskmem_enc = maskmem_enc + self.maskmem_tpos_enc[self.num_maskmem - t_pos - 1]
 
 
721
  to_cat_memory_pos_embed.append(maskmem_enc)
722
 
723
  # Construct the list of past object pointers
 
727
  # (optionally, only include object pointers in the past during evaluation)
728
  if not self.training and self.only_obj_ptrs_in_the_past_for_eval:
729
  ptr_cond_outputs = {
730
+ t: out for t, out in selected_cond_outputs.items() if (t >= frame_idx if track_in_reverse else t <= frame_idx)
 
 
731
  }
732
  else:
733
  ptr_cond_outputs = selected_cond_outputs
734
  pos_and_ptrs = [
735
  # Temporal pos encoding contains how far away each pointer is from current frame
736
  (
737
+ ((frame_idx - t) * tpos_sign_mul if self.use_signed_tpos_enc_to_obj_ptrs else abs(frame_idx - t)),
 
 
 
 
738
  out["obj_ptr"],
739
  )
740
  for t, out in ptr_cond_outputs.items()
 
744
  t = frame_idx + t_diff if track_in_reverse else frame_idx - t_diff
745
  if t < 0 or (num_frames is not None and t >= num_frames):
746
  break
747
+ out = output_dict["non_cond_frame_outputs"].get(t, unselected_cond_outputs.get(t, None))
 
 
748
  if out is not None:
749
  pos_and_ptrs.append((t_diff, out["obj_ptr"]))
750
  # If we have at least one object pointer, add them to the across attention
 
757
  if self.add_tpos_enc_to_obj_ptrs:
758
  t_diff_max = max_obj_ptrs_in_encoder - 1
759
  tpos_dim = C if self.proj_tpos_enc_in_obj_ptrs else self.mem_dim
760
+ obj_pos = torch.tensor(pos_list).to(device=device, non_blocking=True)
 
 
761
  obj_pos = get_1d_sine_pe(obj_pos / t_diff_max, dim=tpos_dim)
762
  obj_pos = self.obj_ptr_tpos_proj(obj_pos)
763
  obj_pos = obj_pos.unsqueeze(1).expand(-1, B, self.mem_dim)
 
765
  obj_pos = obj_ptrs.new_zeros(len(pos_list), B, self.mem_dim)
766
  if self.mem_dim < C:
767
  # split a pointer into (C // self.mem_dim) tokens for self.mem_dim < C
768
+ obj_ptrs = obj_ptrs.reshape(-1, B, C // self.mem_dim, self.mem_dim)
 
 
769
  obj_ptrs = obj_ptrs.permute(0, 2, 1, 3).flatten(0, 1)
770
  obj_pos = obj_pos.repeat_interleave(C // self.mem_dim, dim=0)
771
  to_cat_memory.append(obj_ptrs)
 
818
  # optionally, apply non-overlapping constraints to the masks (it's applied
819
  # in the batch dimension and should only be used during eval, where all
820
  # the objects come from the same video under batch size 1).
821
+ pred_masks_high_res = self._apply_non_overlapping_constraints(pred_masks_high_res)
 
 
822
  # scale the raw mask logits with a temperature before applying sigmoid
823
  binarize = self.binarize_mask_from_pts_for_mem_enc and is_mask_from_pts
824
  if binarize and not self.training:
 
831
  mask_for_mem = mask_for_mem * self.sigmoid_scale_for_mem_enc
832
  if self.sigmoid_bias_for_mem_enc != 0.0:
833
  mask_for_mem = mask_for_mem + self.sigmoid_bias_for_mem_enc
834
+ maskmem_out = self.memory_encoder(pix_feat, mask_for_mem, skip_mask_sigmoid=True) # sigmoid already applied
 
 
835
  maskmem_features = maskmem_out["vision_features"]
836
  maskmem_pos_enc = maskmem_out["vision_pos_enc"]
837
  # add a no-object embedding to the spatial memory to indicate that the frame
838
  # is predicted to be occluded (i.e. no object is appearing in the frame)
839
  if self.no_obj_embed_spatial is not None:
840
  is_obj_appearing = (object_score_logits > 0).float()
841
+ maskmem_features += (1 - is_obj_appearing[..., None, None]) * self.no_obj_embed_spatial[..., None, None].expand(
 
 
842
  *maskmem_features.shape
843
  )
844
 
 
862
  # High-resolution feature maps for the SAM head, reshape (HW)BC => BCHW
863
  if len(current_vision_feats) > 1:
864
  high_res_features = [
865
+ x.permute(1, 2, 0).view(x.size(1), x.size(2), *s) for x, s in zip(current_vision_feats[:-1], feat_sizes[:-1])
 
866
  ]
867
  else:
868
  high_res_features = None
 
871
  # (see it as a GT mask) without using a SAM prompt encoder + mask decoder.
872
  pix_feat = current_vision_feats[-1].permute(1, 2, 0)
873
  pix_feat = pix_feat.view(-1, self.hidden_dim, *feat_sizes[-1])
874
+ sam_outputs = self._use_mask_as_output(pix_feat, high_res_features, mask_inputs)
 
 
875
  else:
876
  # fused the visual feature with previous memory features in the memory bank
877
  pix_feat = self._prepare_memory_conditioned_features(
 
962
  prev_sam_mask_logits,
963
  )
964
 
965
+ _, _, _, low_res_masks, high_res_masks, obj_ptr, object_score_logits, best_iou_score, kf_ious = sam_outputs
 
 
 
 
 
 
 
 
 
 
966
 
967
  current_out["pred_masks"] = low_res_masks
968
  current_out["pred_masks_high_res"] = high_res_masks
{sam2 → bboxmaskpose/sam2}/modeling/sam2_utils.py RENAMED
@@ -13,7 +13,7 @@ import torch
13
  import torch.nn as nn
14
  import torch.nn.functional as F
15
 
16
- from sam2.utils.misc import mask_to_box
17
 
18
 
19
  def select_closest_cond_frames(frame_idx, cond_frame_outputs, max_cond_frame_num):
@@ -54,9 +54,7 @@ def select_closest_cond_frames(frame_idx, cond_frame_outputs, max_cond_frame_num
54
  key=lambda x: abs(x - frame_idx),
55
  )[:num_remain]
56
  selected_outputs.update((t, cond_frame_outputs[t]) for t in inds_remain)
57
- unselected_outputs = {
58
- t: v for t, v in cond_frame_outputs.items() if t not in selected_outputs
59
- }
60
 
61
  return selected_outputs, unselected_outputs
62
 
@@ -122,9 +120,7 @@ class MLP(nn.Module):
122
  super().__init__()
123
  self.num_layers = num_layers
124
  h = [hidden_dim] * (num_layers - 1)
125
- self.layers = nn.ModuleList(
126
- nn.Linear(n, k) for n, k in zip([input_dim] + h, h + [output_dim])
127
- )
128
  self.sigmoid_output = sigmoid_output
129
  self.act = activation()
130
 
@@ -175,9 +171,7 @@ def sample_box_points(
175
  device = masks.device
176
  box_coords = mask_to_box(masks)
177
  B, _, H, W = masks.shape
178
- box_labels = torch.tensor(
179
- [top_left_label, bottom_right_label], dtype=torch.int, device=device
180
- ).repeat(B)
181
  if noise > 0.0:
182
  if not isinstance(noise_bound, torch.Tensor):
183
  noise_bound = torch.tensor(noise_bound, device=device)
@@ -189,9 +183,7 @@ def sample_box_points(
189
  box_noise = box_noise * torch.stack((max_dx, max_dy, max_dx, max_dy), dim=-1)
190
 
191
  box_coords = box_coords + box_noise
192
- img_bounds = (
193
- torch.tensor([W, H, W, H], device=device) - 1
194
- ) # uncentered pixel coords
195
  box_coords.clamp_(torch.zeros_like(img_bounds), img_bounds) # In place clamping
196
 
197
  box_coords = box_coords.reshape(-1, 2, 2) # always 2 points
 
13
  import torch.nn as nn
14
  import torch.nn.functional as F
15
 
16
+ from bboxmaskpose.sam2.utils.misc import mask_to_box
17
 
18
 
19
  def select_closest_cond_frames(frame_idx, cond_frame_outputs, max_cond_frame_num):
 
54
  key=lambda x: abs(x - frame_idx),
55
  )[:num_remain]
56
  selected_outputs.update((t, cond_frame_outputs[t]) for t in inds_remain)
57
+ unselected_outputs = {t: v for t, v in cond_frame_outputs.items() if t not in selected_outputs}
 
 
58
 
59
  return selected_outputs, unselected_outputs
60
 
 
120
  super().__init__()
121
  self.num_layers = num_layers
122
  h = [hidden_dim] * (num_layers - 1)
123
+ self.layers = nn.ModuleList(nn.Linear(n, k) for n, k in zip([input_dim] + h, h + [output_dim]))
 
 
124
  self.sigmoid_output = sigmoid_output
125
  self.act = activation()
126
 
 
171
  device = masks.device
172
  box_coords = mask_to_box(masks)
173
  B, _, H, W = masks.shape
174
+ box_labels = torch.tensor([top_left_label, bottom_right_label], dtype=torch.int, device=device).repeat(B)
 
 
175
  if noise > 0.0:
176
  if not isinstance(noise_bound, torch.Tensor):
177
  noise_bound = torch.tensor(noise_bound, device=device)
 
183
  box_noise = box_noise * torch.stack((max_dx, max_dy, max_dx, max_dy), dim=-1)
184
 
185
  box_coords = box_coords + box_noise
186
+ img_bounds = torch.tensor([W, H, W, H], device=device) - 1 # uncentered pixel coords
 
 
187
  box_coords.clamp_(torch.zeros_like(img_bounds), img_bounds) # In place clamping
188
 
189
  box_coords = box_coords.reshape(-1, 2, 2) # always 2 points
{sam2 → bboxmaskpose/sam2}/sam2_image_predictor.py RENAMED
@@ -5,16 +5,14 @@
5
  # LICENSE file in the root directory of this source tree.
6
 
7
  import logging
8
-
9
  from typing import List, Optional, Tuple, Union
10
 
11
  import numpy as np
12
  import torch
13
  from PIL.Image import Image
14
 
15
- from sam2.modeling.sam2_base import SAM2Base
16
-
17
- from sam2.utils.transforms import SAM2Transforms
18
 
19
 
20
  class SAM2ImagePredictor:
@@ -61,9 +59,9 @@ class SAM2ImagePredictor:
61
  # Spatial dim for backbone feature maps
62
  isize = self.model.image_size
63
  self._bb_feat_sizes = [
64
- (isize//4, isize//4),
65
- (isize//8, isize//8),
66
- (isize//16, isize//16),
67
  ]
68
 
69
  @classmethod
@@ -78,7 +76,7 @@ class SAM2ImagePredictor:
78
  Returns:
79
  (SAM2ImagePredictor): The loaded model.
80
  """
81
- from sam2.build_sam import build_sam2_hf
82
 
83
  sam_model = build_sam2_hf(model_id, **kwargs)
84
  return cls(sam_model, **kwargs)
@@ -111,9 +109,7 @@ class SAM2ImagePredictor:
111
  input_image = self._transforms(image)
112
  input_image = input_image[None, ...].to(self.device)
113
 
114
- assert (
115
- len(input_image.shape) == 4 and input_image.shape[1] == 3
116
- ), f"input_image must be of size 1x3xHxW, got {input_image.shape}"
117
  logging.info("Computing image embeddings for the provided image...")
118
  backbone_out = self.model.forward_image(input_image)
119
  _, vision_feats, _, _ = self.model._prepare_backbone_features(backbone_out)
@@ -122,10 +118,9 @@ class SAM2ImagePredictor:
122
  vision_feats[-1] = vision_feats[-1] + self.model.no_mem_embed
123
 
124
  # breakpoint()
125
- feats = [
126
- feat.permute(1, 2, 0).view(1, -1, *feat_size)
127
- for feat, feat_size in zip(vision_feats[::-1], self._bb_feat_sizes[::-1])
128
- ][::-1]
129
  self._features = {"image_embed": feats[-1], "high_res_feats": feats[:-1]}
130
  self._is_image_set = True
131
  logging.info("Image embeddings computed.")
@@ -148,17 +143,13 @@ class SAM2ImagePredictor:
148
  assert isinstance(image_list, list)
149
  self._orig_hw = []
150
  for image in image_list:
151
- assert isinstance(
152
- image, np.ndarray
153
- ), "Images are expected to be an np.ndarray in RGB format, and of shape HWC"
154
  self._orig_hw.append(image.shape[:2])
155
  # Transform the image to the form expected by the model
156
  img_batch = self._transforms.forward_batch(image_list)
157
  img_batch = img_batch.to(self.device)
158
  batch_size = img_batch.shape[0]
159
- assert (
160
- len(img_batch.shape) == 4 and img_batch.shape[1] == 3
161
- ), f"img_batch must be of size Bx3xHxW, got {img_batch.shape}"
162
  logging.info("Computing image embeddings for the provided images...")
163
  backbone_out = self.model.forward_image(img_batch)
164
  _, vision_feats, _, _ = self.model._prepare_backbone_features(backbone_out)
@@ -167,8 +158,7 @@ class SAM2ImagePredictor:
167
  vision_feats[-1] = vision_feats[-1] + self.model.no_mem_embed
168
 
169
  feats = [
170
- feat.permute(1, 2, 0).view(batch_size, -1, *feat_size)
171
- for feat, feat_size in zip(vision_feats[::-1], self._bb_feat_sizes[::-1])
172
  ][::-1]
173
  self._features = {"image_embed": feats[-1], "high_res_feats": feats[:-1]}
174
  self._is_image_set = True
@@ -190,25 +180,17 @@ class SAM2ImagePredictor:
190
  """
191
  assert self._is_batch, "This function should only be used when in batched mode"
192
  if not self._is_image_set:
193
- raise RuntimeError(
194
- "An image must be set with .set_image_batch(...) before mask prediction."
195
- )
196
  num_images = len(self._features["image_embed"])
197
  all_masks = []
198
  all_ious = []
199
  all_low_res_masks = []
200
  for img_idx in range(num_images):
201
  # Transform input prompts
202
- point_coords = (
203
- point_coords_batch[img_idx] if point_coords_batch is not None else None
204
- )
205
- point_labels = (
206
- point_labels_batch[img_idx] if point_labels_batch is not None else None
207
- )
208
  box = box_batch[img_idx] if box_batch is not None else None
209
- mask_input = (
210
- mask_input_batch[img_idx] if mask_input_batch is not None else None
211
- )
212
  mask_input, unnorm_coords, labels, unnorm_box = self._prep_prompts(
213
  point_coords,
214
  point_labels,
@@ -227,9 +209,7 @@ class SAM2ImagePredictor:
227
  img_idx=img_idx,
228
  )
229
  masks_np = masks.squeeze(0).float().detach().cpu().numpy()
230
- iou_predictions_np = (
231
- iou_predictions.squeeze(0).float().detach().cpu().numpy()
232
- )
233
  low_res_masks_np = low_res_masks.squeeze(0).float().detach().cpu().numpy()
234
  all_masks.append(masks_np)
235
  all_ious.append(iou_predictions_np)
@@ -281,15 +261,11 @@ class SAM2ImagePredictor:
281
  a subsequent iteration as mask input.
282
  """
283
  if not self._is_image_set:
284
- raise RuntimeError(
285
- "An image must be set with .set_image(...) before mask prediction."
286
- )
287
 
288
  # Transform input prompts
289
 
290
- mask_input, unnorm_coords, labels, unnorm_box = self._prep_prompts(
291
- point_coords, point_labels, box, mask_input, normalize_coords
292
- )
293
 
294
  masks, iou_predictions, low_res_masks = self._predict(
295
  unnorm_coords,
@@ -305,33 +281,21 @@ class SAM2ImagePredictor:
305
  low_res_masks_np = low_res_masks.squeeze(0).float().detach().cpu().numpy()
306
  return masks_np, iou_predictions_np, low_res_masks_np
307
 
308
- def _prep_prompts(
309
- self, point_coords, point_labels, box, mask_logits, normalize_coords, img_idx=-1
310
- ):
311
 
312
  unnorm_coords, labels, unnorm_box, mask_input = None, None, None, None
313
  if point_coords is not None:
314
- assert (
315
- point_labels is not None
316
- ), "point_labels must be supplied if point_coords is supplied."
317
- point_coords = torch.as_tensor(
318
- point_coords, dtype=torch.float, device=self.device
319
- )
320
- unnorm_coords = self._transforms.transform_coords(
321
- point_coords, normalize=normalize_coords, orig_hw=self._orig_hw[img_idx]
322
- )
323
  labels = torch.as_tensor(point_labels, dtype=torch.int, device=self.device)
324
  if len(unnorm_coords.shape) == 2:
325
  unnorm_coords, labels = unnorm_coords[None, ...], labels[None, ...]
326
  if box is not None:
327
  box = torch.as_tensor(box, dtype=torch.float, device=self.device)
328
- unnorm_box = self._transforms.transform_boxes(
329
- box, normalize=normalize_coords, orig_hw=self._orig_hw[img_idx]
330
- ) # Bx2x2
331
  if mask_logits is not None:
332
- mask_input = torch.as_tensor(
333
- mask_logits, dtype=torch.float, device=self.device
334
- )
335
  if len(mask_input.shape) == 3:
336
  mask_input = mask_input[None, :, :, :]
337
  return mask_input, unnorm_coords, labels, unnorm_box
@@ -383,9 +347,7 @@ class SAM2ImagePredictor:
383
  a subsequent iteration as mask input.
384
  """
385
  if not self._is_image_set:
386
- raise RuntimeError(
387
- "An image must be set with .set_image(...) before mask prediction."
388
- )
389
 
390
  if point_coords is not None:
391
  concat_points = (point_coords, point_labels)
@@ -413,13 +375,8 @@ class SAM2ImagePredictor:
413
  )
414
 
415
  # Predict masks
416
- batched_mode = (
417
- concat_points is not None and concat_points[0].shape[0] > 1
418
- ) # multi object prediction
419
- high_res_features = [
420
- feat_level[img_idx].unsqueeze(0)
421
- for feat_level in self._features["high_res_feats"]
422
- ]
423
  low_res_masks, iou_predictions, _, _ = self.model.sam_mask_decoder(
424
  image_embeddings=self._features["image_embed"][img_idx].unsqueeze(0),
425
  image_pe=self.model.sam_prompt_encoder.get_dense_pe(),
@@ -431,9 +388,7 @@ class SAM2ImagePredictor:
431
  )
432
 
433
  # Upscale the masks to the original image resolution
434
- masks = self._transforms.postprocess_masks(
435
- low_res_masks, self._orig_hw[img_idx]
436
- )
437
  low_res_masks = torch.clamp(low_res_masks, -32.0, 32.0)
438
  if not return_logits:
439
  masks = masks > self.mask_threshold
@@ -447,12 +402,8 @@ class SAM2ImagePredictor:
447
  the embedding spatial dimension of SAM (typically C=256, H=W=64).
448
  """
449
  if not self._is_image_set:
450
- raise RuntimeError(
451
- "An image must be set with .set_image(...) to generate an embedding."
452
- )
453
- assert (
454
- self._features is not None
455
- ), "Features must exist if an image has been set."
456
  return self._features["image_embed"]
457
 
458
  @property
 
5
  # LICENSE file in the root directory of this source tree.
6
 
7
  import logging
 
8
  from typing import List, Optional, Tuple, Union
9
 
10
  import numpy as np
11
  import torch
12
  from PIL.Image import Image
13
 
14
+ from bboxmaskpose.sam2.modeling.sam2_base import SAM2Base
15
+ from bboxmaskpose.sam2.utils.transforms import SAM2Transforms
 
16
 
17
 
18
  class SAM2ImagePredictor:
 
59
  # Spatial dim for backbone feature maps
60
  isize = self.model.image_size
61
  self._bb_feat_sizes = [
62
+ (isize // 4, isize // 4),
63
+ (isize // 8, isize // 8),
64
+ (isize // 16, isize // 16),
65
  ]
66
 
67
  @classmethod
 
76
  Returns:
77
  (SAM2ImagePredictor): The loaded model.
78
  """
79
+ from bboxmaskpose.sam2.build_sam import build_sam2_hf
80
 
81
  sam_model = build_sam2_hf(model_id, **kwargs)
82
  return cls(sam_model, **kwargs)
 
109
  input_image = self._transforms(image)
110
  input_image = input_image[None, ...].to(self.device)
111
 
112
+ assert len(input_image.shape) == 4 and input_image.shape[1] == 3, f"input_image must be of size 1x3xHxW, got {input_image.shape}"
 
 
113
  logging.info("Computing image embeddings for the provided image...")
114
  backbone_out = self.model.forward_image(input_image)
115
  _, vision_feats, _, _ = self.model._prepare_backbone_features(backbone_out)
 
118
  vision_feats[-1] = vision_feats[-1] + self.model.no_mem_embed
119
 
120
  # breakpoint()
121
+ feats = [feat.permute(1, 2, 0).view(1, -1, *feat_size) for feat, feat_size in zip(vision_feats[::-1], self._bb_feat_sizes[::-1])][
122
+ ::-1
123
+ ]
 
124
  self._features = {"image_embed": feats[-1], "high_res_feats": feats[:-1]}
125
  self._is_image_set = True
126
  logging.info("Image embeddings computed.")
 
143
  assert isinstance(image_list, list)
144
  self._orig_hw = []
145
  for image in image_list:
146
+ assert isinstance(image, np.ndarray), "Images are expected to be an np.ndarray in RGB format, and of shape HWC"
 
 
147
  self._orig_hw.append(image.shape[:2])
148
  # Transform the image to the form expected by the model
149
  img_batch = self._transforms.forward_batch(image_list)
150
  img_batch = img_batch.to(self.device)
151
  batch_size = img_batch.shape[0]
152
+ assert len(img_batch.shape) == 4 and img_batch.shape[1] == 3, f"img_batch must be of size Bx3xHxW, got {img_batch.shape}"
 
 
153
  logging.info("Computing image embeddings for the provided images...")
154
  backbone_out = self.model.forward_image(img_batch)
155
  _, vision_feats, _, _ = self.model._prepare_backbone_features(backbone_out)
 
158
  vision_feats[-1] = vision_feats[-1] + self.model.no_mem_embed
159
 
160
  feats = [
161
+ feat.permute(1, 2, 0).view(batch_size, -1, *feat_size) for feat, feat_size in zip(vision_feats[::-1], self._bb_feat_sizes[::-1])
 
162
  ][::-1]
163
  self._features = {"image_embed": feats[-1], "high_res_feats": feats[:-1]}
164
  self._is_image_set = True
 
180
  """
181
  assert self._is_batch, "This function should only be used when in batched mode"
182
  if not self._is_image_set:
183
+ raise RuntimeError("An image must be set with .set_image_batch(...) before mask prediction.")
 
 
184
  num_images = len(self._features["image_embed"])
185
  all_masks = []
186
  all_ious = []
187
  all_low_res_masks = []
188
  for img_idx in range(num_images):
189
  # Transform input prompts
190
+ point_coords = point_coords_batch[img_idx] if point_coords_batch is not None else None
191
+ point_labels = point_labels_batch[img_idx] if point_labels_batch is not None else None
 
 
 
 
192
  box = box_batch[img_idx] if box_batch is not None else None
193
+ mask_input = mask_input_batch[img_idx] if mask_input_batch is not None else None
 
 
194
  mask_input, unnorm_coords, labels, unnorm_box = self._prep_prompts(
195
  point_coords,
196
  point_labels,
 
209
  img_idx=img_idx,
210
  )
211
  masks_np = masks.squeeze(0).float().detach().cpu().numpy()
212
+ iou_predictions_np = iou_predictions.squeeze(0).float().detach().cpu().numpy()
 
 
213
  low_res_masks_np = low_res_masks.squeeze(0).float().detach().cpu().numpy()
214
  all_masks.append(masks_np)
215
  all_ious.append(iou_predictions_np)
 
261
  a subsequent iteration as mask input.
262
  """
263
  if not self._is_image_set:
264
+ raise RuntimeError("An image must be set with .set_image(...) before mask prediction.")
 
 
265
 
266
  # Transform input prompts
267
 
268
+ mask_input, unnorm_coords, labels, unnorm_box = self._prep_prompts(point_coords, point_labels, box, mask_input, normalize_coords)
 
 
269
 
270
  masks, iou_predictions, low_res_masks = self._predict(
271
  unnorm_coords,
 
281
  low_res_masks_np = low_res_masks.squeeze(0).float().detach().cpu().numpy()
282
  return masks_np, iou_predictions_np, low_res_masks_np
283
 
284
+ def _prep_prompts(self, point_coords, point_labels, box, mask_logits, normalize_coords, img_idx=-1):
 
 
285
 
286
  unnorm_coords, labels, unnorm_box, mask_input = None, None, None, None
287
  if point_coords is not None:
288
+ assert point_labels is not None, "point_labels must be supplied if point_coords is supplied."
289
+ point_coords = torch.as_tensor(point_coords, dtype=torch.float, device=self.device)
290
+ unnorm_coords = self._transforms.transform_coords(point_coords, normalize=normalize_coords, orig_hw=self._orig_hw[img_idx])
 
 
 
 
 
 
291
  labels = torch.as_tensor(point_labels, dtype=torch.int, device=self.device)
292
  if len(unnorm_coords.shape) == 2:
293
  unnorm_coords, labels = unnorm_coords[None, ...], labels[None, ...]
294
  if box is not None:
295
  box = torch.as_tensor(box, dtype=torch.float, device=self.device)
296
+ unnorm_box = self._transforms.transform_boxes(box, normalize=normalize_coords, orig_hw=self._orig_hw[img_idx]) # Bx2x2
 
 
297
  if mask_logits is not None:
298
+ mask_input = torch.as_tensor(mask_logits, dtype=torch.float, device=self.device)
 
 
299
  if len(mask_input.shape) == 3:
300
  mask_input = mask_input[None, :, :, :]
301
  return mask_input, unnorm_coords, labels, unnorm_box
 
347
  a subsequent iteration as mask input.
348
  """
349
  if not self._is_image_set:
350
+ raise RuntimeError("An image must be set with .set_image(...) before mask prediction.")
 
 
351
 
352
  if point_coords is not None:
353
  concat_points = (point_coords, point_labels)
 
375
  )
376
 
377
  # Predict masks
378
+ batched_mode = concat_points is not None and concat_points[0].shape[0] > 1 # multi object prediction
379
+ high_res_features = [feat_level[img_idx].unsqueeze(0) for feat_level in self._features["high_res_feats"]]
 
 
 
 
 
380
  low_res_masks, iou_predictions, _, _ = self.model.sam_mask_decoder(
381
  image_embeddings=self._features["image_embed"][img_idx].unsqueeze(0),
382
  image_pe=self.model.sam_prompt_encoder.get_dense_pe(),
 
388
  )
389
 
390
  # Upscale the masks to the original image resolution
391
+ masks = self._transforms.postprocess_masks(low_res_masks, self._orig_hw[img_idx])
 
 
392
  low_res_masks = torch.clamp(low_res_masks, -32.0, 32.0)
393
  if not return_logits:
394
  masks = masks > self.mask_threshold
 
402
  the embedding spatial dimension of SAM (typically C=256, H=W=64).
403
  """
404
  if not self._is_image_set:
405
+ raise RuntimeError("An image must be set with .set_image(...) to generate an embedding.")
406
+ assert self._features is not None, "Features must exist if an image has been set."
 
 
 
 
407
  return self._features["image_embed"]
408
 
409
  @property
{sam2 → bboxmaskpose/sam2}/sam2_video_predictor.py RENAMED
@@ -9,7 +9,6 @@ from collections import OrderedDict
9
 
10
  import torch
11
  import torch.nn.functional as F
12
-
13
  from tqdm import tqdm
14
 
15
  from sam2.modeling.sam2_base import NO_OBJ_SCORE, SAM2Base
@@ -27,11 +26,6 @@ class SAM2VideoPredictor(SAM2Base):
27
  # whether to clear non-conditioning memory of the surrounding frames (which may contain outdated information) after adding correction clicks;
28
  # note that this would only apply to *single-object tracking* unless `clear_non_cond_mem_for_multi_obj` is also set to True)
29
  clear_non_cond_mem_around_input=False,
30
- <<<<<<< HEAD
31
- # whether to also clear non-conditioning memory of the surrounding frames (only effective when `clear_non_cond_mem_around_input` is True).
32
- clear_non_cond_mem_for_multi_obj=False,
33
- =======
34
- >>>>>>> 2b90b9f5ceec907a1c18123530e92e794ad901a4
35
  # if `add_all_frames_to_correct_as_cond` is True, we also append to the conditioning frame list any frame that receives a later correction click
36
  # if `add_all_frames_to_correct_as_cond` is False, we conditioning frame list to only use those initial conditioning frames
37
  add_all_frames_to_correct_as_cond=False,
@@ -41,10 +35,6 @@ class SAM2VideoPredictor(SAM2Base):
41
  self.fill_hole_area = fill_hole_area
42
  self.non_overlap_masks = non_overlap_masks
43
  self.clear_non_cond_mem_around_input = clear_non_cond_mem_around_input
44
- <<<<<<< HEAD
45
- self.clear_non_cond_mem_for_multi_obj = clear_non_cond_mem_for_multi_obj
46
- =======
47
- >>>>>>> 2b90b9f5ceec907a1c18123530e92e794ad901a4
48
  self.add_all_frames_to_correct_as_cond = add_all_frames_to_correct_as_cond
49
 
50
  @torch.inference_mode()
@@ -296,9 +286,7 @@ class SAM2VideoPredictor(SAM2Base):
296
  is_cond=is_cond,
297
  consolidate_at_video_res=True,
298
  )
299
- _, video_res_masks = self._get_orig_video_res_output(
300
- inference_state, consolidated_out["pred_masks_video_res"]
301
- )
302
  return frame_idx, obj_ids, video_res_masks
303
 
304
  def add_new_points(self, *args, **kwargs):
@@ -384,9 +372,7 @@ class SAM2VideoPredictor(SAM2Base):
384
  is_cond=is_cond,
385
  consolidate_at_video_res=True,
386
  )
387
- _, video_res_masks = self._get_orig_video_res_output(
388
- inference_state, consolidated_out["pred_masks_video_res"]
389
- )
390
  return frame_idx, obj_ids, video_res_masks
391
 
392
  def _get_orig_video_res_output(self, inference_state, any_res_masks):
@@ -450,23 +436,6 @@ class SAM2VideoPredictor(SAM2Base):
450
  dtype=torch.float32,
451
  device=inference_state["storage_device"],
452
  ),
453
- <<<<<<< HEAD
454
- "obj_ptr": torch.full(
455
- size=(batch_size, self.hidden_dim),
456
- fill_value=NO_OBJ_SCORE,
457
- dtype=torch.float32,
458
- device=inference_state["device"],
459
- ),
460
- "object_score_logits": torch.full(
461
- size=(batch_size, 1),
462
- # default to 10.0 for object_score_logits, i.e. assuming the object is
463
- # present as sigmoid(10)=1, same as in `predict_masks` of `MaskDecoder`
464
- fill_value=10.0,
465
- dtype=torch.float32,
466
- device=inference_state["device"],
467
- ),
468
- =======
469
- >>>>>>> 2b90b9f5ceec907a1c18123530e92e794ad901a4
470
  }
471
  for obj_idx in range(batch_size):
472
  obj_temp_output_dict = inference_state["temp_output_dict_per_obj"][obj_idx]
@@ -499,36 +468,6 @@ class SAM2VideoPredictor(SAM2Base):
499
  align_corners=False,
500
  )
501
  consolidated_pred_masks[obj_idx : obj_idx + 1] = resized_obj_mask
502
- <<<<<<< HEAD
503
- consolidated_out["obj_ptr"][obj_idx : obj_idx + 1] = out["obj_ptr"]
504
- consolidated_out["object_score_logits"][obj_idx : obj_idx + 1] = out[
505
- "object_score_logits"
506
- ]
507
-
508
- # Optionally, apply non-overlapping constraints on the consolidated scores
509
- # and rerun the memory encoder
510
- if run_mem_encoder:
511
- device = inference_state["device"]
512
- high_res_masks = torch.nn.functional.interpolate(
513
- consolidated_out["pred_masks"].to(device, non_blocking=True),
514
- size=(self.image_size, self.image_size),
515
- mode="bilinear",
516
- align_corners=False,
517
- )
518
- if self.non_overlap_masks_for_mem_enc:
519
- high_res_masks = self._apply_non_overlapping_constraints(high_res_masks)
520
- maskmem_features, maskmem_pos_enc = self._run_memory_encoder(
521
- inference_state=inference_state,
522
- frame_idx=frame_idx,
523
- batch_size=batch_size,
524
- high_res_masks=high_res_masks,
525
- object_score_logits=consolidated_out["object_score_logits"],
526
- is_mask_from_pts=True, # these frames are what the user interacted with
527
- )
528
- consolidated_out["maskmem_features"] = maskmem_features
529
- consolidated_out["maskmem_pos_enc"] = maskmem_pos_enc
530
- =======
531
- >>>>>>> 2b90b9f5ceec907a1c18123530e92e794ad901a4
532
 
533
  return consolidated_out
534
 
@@ -538,9 +477,7 @@ class SAM2VideoPredictor(SAM2Base):
538
  # Check and make sure that every object has received input points or masks.
539
  batch_size = self._get_obj_num(inference_state)
540
  if batch_size == 0:
541
- raise RuntimeError(
542
- "No input points or masks are provided for any object; please add inputs first."
543
- )
544
 
545
  # Consolidate per-object temporary outputs in "temp_output_dict_per_obj" and
546
  # add them into "output_dict".
@@ -549,9 +486,7 @@ class SAM2VideoPredictor(SAM2Base):
549
  obj_temp_output_dict = inference_state["temp_output_dict_per_obj"][obj_idx]
550
  for is_cond in [False, True]:
551
  # Separately consolidate conditioning and non-conditioning temp outputs
552
- storage_key = (
553
- "cond_frame_outputs" if is_cond else "non_cond_frame_outputs"
554
- )
555
  # Find all the frames that contain temporary outputs for any objects
556
  # (these should be the frames that have just received clicks for mask inputs
557
  # via `add_new_points_or_box` or `add_new_mask`)
@@ -579,9 +514,7 @@ class SAM2VideoPredictor(SAM2Base):
579
  obj_output_dict[storage_key][frame_idx] = out
580
  if self.clear_non_cond_mem_around_input:
581
  # clear non-conditioning memory of the surrounding frames
582
- self._clear_obj_non_cond_mem_around_input(
583
- inference_state, frame_idx, obj_idx
584
- )
585
 
586
  # clear temporary outputs in `temp_output_dict_per_obj`
587
  obj_temp_output_dict[storage_key].clear()
@@ -590,9 +523,7 @@ class SAM2VideoPredictor(SAM2Base):
590
  obj_output_dict = inference_state["output_dict_per_obj"][obj_idx]
591
  if len(obj_output_dict["cond_frame_outputs"]) == 0:
592
  obj_id = self._obj_idx_to_id(inference_state, obj_idx)
593
- raise RuntimeError(
594
- f"No input points or masks are provided for object id {obj_id}; please add inputs first."
595
- )
596
  # edge case: if an output is added to "cond_frame_outputs", we remove any prior
597
  # output on the same frame in "non_cond_frame_outputs"
598
  for frame_idx in obj_output_dict["cond_frame_outputs"]:
@@ -617,9 +548,7 @@ class SAM2VideoPredictor(SAM2Base):
617
  if start_frame_idx is None:
618
  # default: start from the earliest frame with input points
619
  start_frame_idx = min(
620
- t
621
- for obj_output_dict in inference_state["output_dict_per_obj"].values()
622
- for t in obj_output_dict["cond_frame_outputs"]
623
  )
624
  if max_frame_num_to_track is None:
625
  # default: track all the frames in the video
@@ -631,9 +560,7 @@ class SAM2VideoPredictor(SAM2Base):
631
  else:
632
  processing_order = [] # skip reverse tracking if starting from frame 0
633
  else:
634
- end_frame_idx = min(
635
- start_frame_idx + max_frame_num_to_track, num_frames - 1
636
- )
637
  processing_order = range(start_frame_idx, end_frame_idx + 1)
638
 
639
  for frame_idx in tqdm(processing_order, desc="propagate in video"):
@@ -651,9 +578,7 @@ class SAM2VideoPredictor(SAM2Base):
651
  pred_masks = current_out["pred_masks"].to(device, non_blocking=True)
652
  if self.clear_non_cond_mem_around_input:
653
  # clear non-conditioning memory of the surrounding frames
654
- self._clear_obj_non_cond_mem_around_input(
655
- inference_state, frame_idx, obj_idx
656
- )
657
  else:
658
  storage_key = "non_cond_frame_outputs"
659
  current_out, pred_masks = self._run_single_frame_inference(
@@ -669,9 +594,7 @@ class SAM2VideoPredictor(SAM2Base):
669
  )
670
  obj_output_dict[storage_key][frame_idx] = current_out
671
 
672
- inference_state["frames_tracked_per_obj"][obj_idx][frame_idx] = {
673
- "reverse": reverse
674
- }
675
  pred_masks_per_obj[obj_idx] = pred_masks
676
 
677
  # Resize the output mask to the original video resolution (we directly use
@@ -680,42 +603,11 @@ class SAM2VideoPredictor(SAM2Base):
680
  all_pred_masks = torch.cat(pred_masks_per_obj, dim=0)
681
  else:
682
  all_pred_masks = pred_masks_per_obj[0]
683
- _, video_res_masks = self._get_orig_video_res_output(
684
- inference_state, all_pred_masks
685
- )
686
  yield frame_idx, obj_ids, video_res_masks
687
 
688
  @torch.inference_mode()
689
- def clear_all_prompts_in_frame(
690
- self, inference_state, frame_idx, obj_id, need_output=True
691
- ):
692
- <<<<<<< HEAD
693
- """
694
- Split a multi-object output into per-object output slices and add them into
695
- `output_dict_per_obj`. The resulting slices share the same tensor storage.
696
- """
697
- maskmem_features = current_out["maskmem_features"]
698
- assert maskmem_features is None or isinstance(maskmem_features, torch.Tensor)
699
-
700
- maskmem_pos_enc = current_out["maskmem_pos_enc"]
701
- assert maskmem_pos_enc is None or isinstance(maskmem_pos_enc, list)
702
-
703
- output_dict_per_obj = inference_state["output_dict_per_obj"]
704
- for obj_idx, obj_output_dict in output_dict_per_obj.items():
705
- obj_slice = slice(obj_idx, obj_idx + 1)
706
- obj_out = {
707
- "maskmem_features": None,
708
- "maskmem_pos_enc": None,
709
- "pred_masks": current_out["pred_masks"][obj_slice],
710
- "obj_ptr": current_out["obj_ptr"][obj_slice],
711
- "object_score_logits": current_out["object_score_logits"][obj_slice],
712
- }
713
- if maskmem_features is not None:
714
- obj_out["maskmem_features"] = maskmem_features[obj_slice]
715
- if maskmem_pos_enc is not None:
716
- obj_out["maskmem_pos_enc"] = [x[obj_slice] for x in maskmem_pos_enc]
717
- obj_output_dict[storage_key][frame_idx] = obj_out
718
- =======
719
  """Remove all input points or mask in a specific frame for a given object."""
720
  obj_idx = self._obj_id_to_idx(inference_state, obj_id)
721
 
@@ -740,91 +632,14 @@ class SAM2VideoPredictor(SAM2Base):
740
  return
741
  # Finally, output updated masks per object (after removing the inputs above)
742
  obj_ids = inference_state["obj_ids"]
743
- is_cond = any(
744
- frame_idx in obj_temp_output_dict["cond_frame_outputs"]
745
- for obj_temp_output_dict in temp_output_dict_per_obj.values()
746
- )
747
  consolidated_out = self._consolidate_temp_output_across_obj(
748
  inference_state,
749
  frame_idx,
750
  is_cond=is_cond,
751
  consolidate_at_video_res=True,
752
  )
753
- _, video_res_masks = self._get_orig_video_res_output(
754
- inference_state, consolidated_out["pred_masks_video_res"]
755
- )
756
- return frame_idx, obj_ids, video_res_masks
757
- >>>>>>> 2b90b9f5ceec907a1c18123530e92e794ad901a4
758
-
759
- @torch.inference_mode()
760
- def clear_all_prompts_in_frame(
761
- self, inference_state, frame_idx, obj_id, need_output=True
762
- ):
763
- """Remove all input points or mask in a specific frame for a given object."""
764
- obj_idx = self._obj_id_to_idx(inference_state, obj_id)
765
-
766
- # Clear the conditioning information on the given frame
767
- inference_state["point_inputs_per_obj"][obj_idx].pop(frame_idx, None)
768
- inference_state["mask_inputs_per_obj"][obj_idx].pop(frame_idx, None)
769
-
770
- temp_output_dict_per_obj = inference_state["temp_output_dict_per_obj"]
771
- temp_output_dict_per_obj[obj_idx]["cond_frame_outputs"].pop(frame_idx, None)
772
- temp_output_dict_per_obj[obj_idx]["non_cond_frame_outputs"].pop(frame_idx, None)
773
-
774
- # Check and see if there are still any inputs left on this frame
775
- batch_size = self._get_obj_num(inference_state)
776
- frame_has_input = False
777
- for obj_idx2 in range(batch_size):
778
- if frame_idx in inference_state["point_inputs_per_obj"][obj_idx2]:
779
- frame_has_input = True
780
- break
781
- if frame_idx in inference_state["mask_inputs_per_obj"][obj_idx2]:
782
- frame_has_input = True
783
- break
784
-
785
- # If this frame has no remaining inputs for any objects, we further clear its
786
- # conditioning frame status
787
- if not frame_has_input:
788
- output_dict = inference_state["output_dict"]
789
- consolidated_frame_inds = inference_state["consolidated_frame_inds"]
790
- consolidated_frame_inds["cond_frame_outputs"].discard(frame_idx)
791
- consolidated_frame_inds["non_cond_frame_outputs"].discard(frame_idx)
792
- # Remove the frame's conditioning output (possibly downgrading it to non-conditioning)
793
- out = output_dict["cond_frame_outputs"].pop(frame_idx, None)
794
- if out is not None:
795
- # The frame is not a conditioning frame anymore since it's not receiving inputs,
796
- # so we "downgrade" its output (if exists) to a non-conditioning frame output.
797
- output_dict["non_cond_frame_outputs"][frame_idx] = out
798
- inference_state["frames_already_tracked"].pop(frame_idx, None)
799
- # Similarly, do it for the sliced output on each object.
800
- for obj_idx2 in range(batch_size):
801
- obj_output_dict = inference_state["output_dict_per_obj"][obj_idx2]
802
- obj_out = obj_output_dict["cond_frame_outputs"].pop(frame_idx, None)
803
- if obj_out is not None:
804
- obj_output_dict["non_cond_frame_outputs"][frame_idx] = obj_out
805
-
806
- # If all the conditioning frames have been removed, we also clear the tracking outputs
807
- if len(output_dict["cond_frame_outputs"]) == 0:
808
- self._reset_tracking_results(inference_state)
809
-
810
- if not need_output:
811
- return
812
- # Finally, output updated masks per object (after removing the inputs above)
813
- obj_ids = inference_state["obj_ids"]
814
- is_cond = any(
815
- frame_idx in obj_temp_output_dict["cond_frame_outputs"]
816
- for obj_temp_output_dict in temp_output_dict_per_obj.values()
817
- )
818
- consolidated_out = self._consolidate_temp_output_across_obj(
819
- inference_state,
820
- frame_idx,
821
- is_cond=is_cond,
822
- run_mem_encoder=False,
823
- consolidate_at_video_res=True,
824
- )
825
- _, video_res_masks = self._get_orig_video_res_output(
826
- inference_state, consolidated_out["pred_masks_video_res"]
827
- )
828
  return frame_idx, obj_ids, video_res_masks
829
 
830
  @torch.inference_mode()
@@ -859,9 +674,7 @@ class SAM2VideoPredictor(SAM2Base):
859
  def _get_image_feature(self, inference_state, frame_idx, batch_size):
860
  """Compute the image features on a given frame."""
861
  # Look up in the cache first
862
- image, backbone_out = inference_state["cached_features"].get(
863
- frame_idx, (None, None)
864
- )
865
  if backbone_out is None:
866
  # Cache miss -- we will run inference on a single image
867
  device = inference_state["device"]
@@ -878,9 +691,7 @@ class SAM2VideoPredictor(SAM2Base):
878
  "vision_pos_enc": backbone_out["vision_pos_enc"].copy(),
879
  }
880
  for i, feat in enumerate(expanded_backbone_out["backbone_fpn"]):
881
- expanded_backbone_out["backbone_fpn"][i] = feat.expand(
882
- batch_size, -1, -1, -1
883
- )
884
  for i, pos in enumerate(expanded_backbone_out["vision_pos_enc"]):
885
  pos = pos.expand(batch_size, -1, -1, -1)
886
  expanded_backbone_out["vision_pos_enc"][i] = pos
@@ -935,33 +746,23 @@ class SAM2VideoPredictor(SAM2Base):
935
  if maskmem_features is not None:
936
  maskmem_features = maskmem_features.to(torch.bfloat16)
937
  maskmem_features = maskmem_features.to(storage_device, non_blocking=True)
938
- pred_masks_gpu = current_out["pred_masks"] # (B, 1, H, W)
939
  # potentially fill holes in the predicted masks
940
  if self.fill_hole_area > 0:
941
- pred_masks_gpu = fill_holes_in_mask_scores(
942
- pred_masks_gpu, self.fill_hole_area
943
- )
944
  pred_masks = pred_masks_gpu.to(storage_device, non_blocking=True)
945
  # "maskmem_pos_enc" is the same across frames, so we only need to store one copy of it
946
  maskmem_pos_enc = self._get_maskmem_pos_enc(inference_state, current_out)
947
  # object pointer is a small tensor, so we always keep it on GPU memory for fast access
948
  obj_ptr = current_out["obj_ptr"]
949
  object_score_logits = current_out["object_score_logits"]
950
- <<<<<<< HEAD
951
- best_iou_score = current_out["best_iou_score"]
952
- =======
953
- >>>>>>> 2b90b9f5ceec907a1c18123530e92e794ad901a4
954
  # make a compact version of this frame's output to reduce the state size
955
  compact_current_out = {
956
- "maskmem_features": maskmem_features, # (B, C, H, W)
957
- "maskmem_pos_enc": maskmem_pos_enc,
958
  "pred_masks": pred_masks,
959
  "obj_ptr": obj_ptr,
960
  "object_score_logits": object_score_logits,
961
- <<<<<<< HEAD
962
- "best_iou_score": best_iou_score,
963
- =======
964
- >>>>>>> 2b90b9f5ceec907a1c18123530e92e794ad901a4
965
  }
966
  return compact_current_out, pred_masks_gpu
967
 
@@ -980,9 +781,7 @@ class SAM2VideoPredictor(SAM2Base):
980
  memory also need to be computed again with the memory encoder.
981
  """
982
  # Retrieve correct image features
983
- _, _, current_vision_feats, _, feat_sizes = self._get_image_feature(
984
- inference_state, frame_idx, batch_size
985
- )
986
  maskmem_features, maskmem_pos_enc = self._encode_new_memory(
987
  current_vision_feats=current_vision_feats,
988
  feat_sizes=feat_sizes,
@@ -996,9 +795,7 @@ class SAM2VideoPredictor(SAM2Base):
996
  maskmem_features = maskmem_features.to(torch.bfloat16)
997
  maskmem_features = maskmem_features.to(storage_device, non_blocking=True)
998
  # "maskmem_pos_enc" is the same across frames, so we only need to store one copy of it
999
- maskmem_pos_enc = self._get_maskmem_pos_enc(
1000
- inference_state, {"maskmem_pos_enc": maskmem_pos_enc}
1001
- )
1002
  return maskmem_features, maskmem_pos_enc
1003
 
1004
  def _get_maskmem_pos_enc(self, inference_state, current_out):
@@ -1019,9 +816,7 @@ class SAM2VideoPredictor(SAM2Base):
1019
  maskmem_pos_enc = model_constants["maskmem_pos_enc"]
1020
  # expand the cached maskmem_pos_enc to the actual batch size
1021
  batch_size = out_maskmem_pos_enc[0].size(0)
1022
- expanded_maskmem_pos_enc = [
1023
- x.expand(batch_size, -1, -1, -1) for x in maskmem_pos_enc
1024
- ]
1025
  else:
1026
  expanded_maskmem_pos_enc = None
1027
  return expanded_maskmem_pos_enc
@@ -1039,8 +834,7 @@ class SAM2VideoPredictor(SAM2Base):
1039
  if not strict:
1040
  return inference_state["obj_ids"], updated_frames
1041
  raise RuntimeError(
1042
- f"Cannot remove object id {obj_id} as it doesn't exist. "
1043
- f"All existing object ids: {inference_state['obj_ids']}."
1044
  )
1045
 
1046
  # If this is the only remaining object id, we simply reset the state.
@@ -1054,16 +848,10 @@ class SAM2VideoPredictor(SAM2Base):
1054
  # (note that this step is required as it might downgrade conditioning frames to
1055
  # non-conditioning ones)
1056
  obj_input_frames_inds = set()
1057
- obj_input_frames_inds.update(
1058
- inference_state["point_inputs_per_obj"][old_obj_idx_to_rm]
1059
- )
1060
- obj_input_frames_inds.update(
1061
- inference_state["mask_inputs_per_obj"][old_obj_idx_to_rm]
1062
- )
1063
  for frame_idx in obj_input_frames_inds:
1064
- self.clear_all_prompts_in_frame(
1065
- inference_state, frame_idx, obj_id, need_output=False
1066
- )
1067
 
1068
  # Step 1: Update the object id mapping (note that it must be done after Step 0,
1069
  # since Step 0 still requires the old object id mappings in inference_state)
@@ -1080,11 +868,6 @@ class SAM2VideoPredictor(SAM2Base):
1080
  inference_state["obj_ids"] = new_obj_ids
1081
 
1082
  # Step 2: For per-object tensor storage, we shift their obj_idx in the dict keys.
1083
- <<<<<<< HEAD
1084
- # (note that "consolidated_frame_inds" doesn't need to be updated in this step as
1085
- # it's already handled in Step 0)
1086
- =======
1087
- >>>>>>> 2b90b9f5ceec907a1c18123530e92e794ad901a4
1088
  def _map_keys(container):
1089
  new_kvs = []
1090
  for k in old_obj_inds:
@@ -1097,57 +880,23 @@ class SAM2VideoPredictor(SAM2Base):
1097
  _map_keys(inference_state["mask_inputs_per_obj"])
1098
  _map_keys(inference_state["output_dict_per_obj"])
1099
  _map_keys(inference_state["temp_output_dict_per_obj"])
1100
- <<<<<<< HEAD
1101
-
1102
- # Step 3: For packed tensor storage, we index the remaining ids and rebuild the per-object slices.
1103
- def _slice_state(output_dict, storage_key):
1104
- for frame_idx, out in output_dict[storage_key].items():
1105
- out["maskmem_features"] = out["maskmem_features"][remain_old_obj_inds]
1106
- out["maskmem_pos_enc"] = [
1107
- x[remain_old_obj_inds] for x in out["maskmem_pos_enc"]
1108
- ]
1109
- # "maskmem_pos_enc" is the same across frames, so we only need to store one copy of it
1110
- out["maskmem_pos_enc"] = self._get_maskmem_pos_enc(inference_state, out)
1111
- out["pred_masks"] = out["pred_masks"][remain_old_obj_inds]
1112
- out["obj_ptr"] = out["obj_ptr"][remain_old_obj_inds]
1113
- out["object_score_logits"] = out["object_score_logits"][
1114
- remain_old_obj_inds
1115
- ]
1116
- # also update the per-object slices
1117
- self._add_output_per_object(
1118
- inference_state, frame_idx, out, storage_key
1119
- )
1120
-
1121
- _slice_state(inference_state["output_dict"], "cond_frame_outputs")
1122
- _slice_state(inference_state["output_dict"], "non_cond_frame_outputs")
1123
-
1124
- # Step 4: Further collect the outputs on those frames in `obj_input_frames_inds`, which
1125
- =======
1126
  _map_keys(inference_state["frames_tracked_per_obj"])
1127
 
1128
  # Step 3: Further collect the outputs on those frames in `obj_input_frames_inds`, which
1129
- >>>>>>> 2b90b9f5ceec907a1c18123530e92e794ad901a4
1130
  # could show an updated mask for objects previously occluded by the object being removed
1131
  if need_output:
1132
  temp_output_dict_per_obj = inference_state["temp_output_dict_per_obj"]
1133
  for frame_idx in obj_input_frames_inds:
1134
  is_cond = any(
1135
- frame_idx in obj_temp_output_dict["cond_frame_outputs"]
1136
- for obj_temp_output_dict in temp_output_dict_per_obj.values()
1137
  )
1138
  consolidated_out = self._consolidate_temp_output_across_obj(
1139
  inference_state,
1140
  frame_idx,
1141
  is_cond=is_cond,
1142
- <<<<<<< HEAD
1143
- run_mem_encoder=False,
1144
- =======
1145
- >>>>>>> 2b90b9f5ceec907a1c18123530e92e794ad901a4
1146
  consolidate_at_video_res=True,
1147
  )
1148
- _, video_res_masks = self._get_orig_video_res_output(
1149
- inference_state, consolidated_out["pred_masks_video_res"]
1150
- )
1151
  updated_frames.append((frame_idx, video_res_masks))
1152
 
1153
  return inference_state["obj_ids"], updated_frames
@@ -1218,18 +967,12 @@ class SAM2VideoPredictorVOS(SAM2VideoPredictor):
1218
  if self.use_high_res_features_in_sam:
1219
  # precompute projected level 0 and level 1 features in SAM decoder
1220
  # to avoid running it again on every SAM click
1221
- backbone_out["backbone_fpn"][0] = self.sam_mask_decoder.conv_s0(
1222
- backbone_out["backbone_fpn"][0]
1223
- )
1224
- backbone_out["backbone_fpn"][1] = self.sam_mask_decoder.conv_s1(
1225
- backbone_out["backbone_fpn"][1]
1226
- )
1227
  # Clone to help torch.compile
1228
  for i in range(len(backbone_out["backbone_fpn"])):
1229
  backbone_out["backbone_fpn"][i] = backbone_out["backbone_fpn"][i].clone()
1230
- backbone_out["vision_pos_enc"][i] = backbone_out["vision_pos_enc"][
1231
- i
1232
- ].clone()
1233
  return backbone_out
1234
 
1235
  def _forward_sam_heads(
@@ -1388,9 +1131,7 @@ class SAM2VideoPredictorVOS(SAM2VideoPredictor):
1388
  # optionally, apply non-overlapping constraints to the masks (it's applied
1389
  # in the batch dimension and should only be used during eval, where all
1390
  # the objects come from the same video under batch size 1).
1391
- pred_masks_high_res = self._apply_non_overlapping_constraints(
1392
- pred_masks_high_res
1393
- )
1394
  # scale the raw mask logits with a temperature before applying sigmoid
1395
  binarize = self.binarize_mask_from_pts_for_mem_enc and is_mask_from_pts
1396
  if binarize and not self.training:
@@ -1403,9 +1144,7 @@ class SAM2VideoPredictorVOS(SAM2VideoPredictor):
1403
  mask_for_mem = mask_for_mem * self.sigmoid_scale_for_mem_enc
1404
  if self.sigmoid_bias_for_mem_enc != 0.0:
1405
  mask_for_mem = mask_for_mem + self.sigmoid_bias_for_mem_enc
1406
- maskmem_out = self.memory_encoder(
1407
- pix_feat, mask_for_mem, skip_mask_sigmoid=True # sigmoid already applied
1408
- )
1409
  # Clone the feats and pos_enc to enable compilation
1410
  maskmem_features = maskmem_out["vision_features"].clone()
1411
  maskmem_pos_enc = [m.clone() for m in maskmem_out["vision_pos_enc"]]
@@ -1413,9 +1152,7 @@ class SAM2VideoPredictorVOS(SAM2VideoPredictor):
1413
  # is predicted to be occluded (i.e. no object is appearing in the frame)
1414
  if self.no_obj_embed_spatial is not None:
1415
  is_obj_appearing = (object_score_logits > 0).float()
1416
- maskmem_features += (
1417
- 1 - is_obj_appearing[..., None, None]
1418
- ) * self.no_obj_embed_spatial[..., None, None].expand(
1419
  *maskmem_features.shape
1420
  )
1421
 
 
9
 
10
  import torch
11
  import torch.nn.functional as F
 
12
  from tqdm import tqdm
13
 
14
  from sam2.modeling.sam2_base import NO_OBJ_SCORE, SAM2Base
 
26
  # whether to clear non-conditioning memory of the surrounding frames (which may contain outdated information) after adding correction clicks;
27
  # note that this would only apply to *single-object tracking* unless `clear_non_cond_mem_for_multi_obj` is also set to True)
28
  clear_non_cond_mem_around_input=False,
 
 
 
 
 
29
  # if `add_all_frames_to_correct_as_cond` is True, we also append to the conditioning frame list any frame that receives a later correction click
30
  # if `add_all_frames_to_correct_as_cond` is False, we conditioning frame list to only use those initial conditioning frames
31
  add_all_frames_to_correct_as_cond=False,
 
35
  self.fill_hole_area = fill_hole_area
36
  self.non_overlap_masks = non_overlap_masks
37
  self.clear_non_cond_mem_around_input = clear_non_cond_mem_around_input
 
 
 
 
38
  self.add_all_frames_to_correct_as_cond = add_all_frames_to_correct_as_cond
39
 
40
  @torch.inference_mode()
 
286
  is_cond=is_cond,
287
  consolidate_at_video_res=True,
288
  )
289
+ _, video_res_masks = self._get_orig_video_res_output(inference_state, consolidated_out["pred_masks_video_res"])
 
 
290
  return frame_idx, obj_ids, video_res_masks
291
 
292
  def add_new_points(self, *args, **kwargs):
 
372
  is_cond=is_cond,
373
  consolidate_at_video_res=True,
374
  )
375
+ _, video_res_masks = self._get_orig_video_res_output(inference_state, consolidated_out["pred_masks_video_res"])
 
 
376
  return frame_idx, obj_ids, video_res_masks
377
 
378
  def _get_orig_video_res_output(self, inference_state, any_res_masks):
 
436
  dtype=torch.float32,
437
  device=inference_state["storage_device"],
438
  ),
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
439
  }
440
  for obj_idx in range(batch_size):
441
  obj_temp_output_dict = inference_state["temp_output_dict_per_obj"][obj_idx]
 
468
  align_corners=False,
469
  )
470
  consolidated_pred_masks[obj_idx : obj_idx + 1] = resized_obj_mask
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
471
 
472
  return consolidated_out
473
 
 
477
  # Check and make sure that every object has received input points or masks.
478
  batch_size = self._get_obj_num(inference_state)
479
  if batch_size == 0:
480
+ raise RuntimeError("No input points or masks are provided for any object; please add inputs first.")
 
 
481
 
482
  # Consolidate per-object temporary outputs in "temp_output_dict_per_obj" and
483
  # add them into "output_dict".
 
486
  obj_temp_output_dict = inference_state["temp_output_dict_per_obj"][obj_idx]
487
  for is_cond in [False, True]:
488
  # Separately consolidate conditioning and non-conditioning temp outputs
489
+ storage_key = "cond_frame_outputs" if is_cond else "non_cond_frame_outputs"
 
 
490
  # Find all the frames that contain temporary outputs for any objects
491
  # (these should be the frames that have just received clicks for mask inputs
492
  # via `add_new_points_or_box` or `add_new_mask`)
 
514
  obj_output_dict[storage_key][frame_idx] = out
515
  if self.clear_non_cond_mem_around_input:
516
  # clear non-conditioning memory of the surrounding frames
517
+ self._clear_obj_non_cond_mem_around_input(inference_state, frame_idx, obj_idx)
 
 
518
 
519
  # clear temporary outputs in `temp_output_dict_per_obj`
520
  obj_temp_output_dict[storage_key].clear()
 
523
  obj_output_dict = inference_state["output_dict_per_obj"][obj_idx]
524
  if len(obj_output_dict["cond_frame_outputs"]) == 0:
525
  obj_id = self._obj_idx_to_id(inference_state, obj_idx)
526
+ raise RuntimeError(f"No input points or masks are provided for object id {obj_id}; please add inputs first.")
 
 
527
  # edge case: if an output is added to "cond_frame_outputs", we remove any prior
528
  # output on the same frame in "non_cond_frame_outputs"
529
  for frame_idx in obj_output_dict["cond_frame_outputs"]:
 
548
  if start_frame_idx is None:
549
  # default: start from the earliest frame with input points
550
  start_frame_idx = min(
551
+ t for obj_output_dict in inference_state["output_dict_per_obj"].values() for t in obj_output_dict["cond_frame_outputs"]
 
 
552
  )
553
  if max_frame_num_to_track is None:
554
  # default: track all the frames in the video
 
560
  else:
561
  processing_order = [] # skip reverse tracking if starting from frame 0
562
  else:
563
+ end_frame_idx = min(start_frame_idx + max_frame_num_to_track, num_frames - 1)
 
 
564
  processing_order = range(start_frame_idx, end_frame_idx + 1)
565
 
566
  for frame_idx in tqdm(processing_order, desc="propagate in video"):
 
578
  pred_masks = current_out["pred_masks"].to(device, non_blocking=True)
579
  if self.clear_non_cond_mem_around_input:
580
  # clear non-conditioning memory of the surrounding frames
581
+ self._clear_obj_non_cond_mem_around_input(inference_state, frame_idx, obj_idx)
 
 
582
  else:
583
  storage_key = "non_cond_frame_outputs"
584
  current_out, pred_masks = self._run_single_frame_inference(
 
594
  )
595
  obj_output_dict[storage_key][frame_idx] = current_out
596
 
597
+ inference_state["frames_tracked_per_obj"][obj_idx][frame_idx] = {"reverse": reverse}
 
 
598
  pred_masks_per_obj[obj_idx] = pred_masks
599
 
600
  # Resize the output mask to the original video resolution (we directly use
 
603
  all_pred_masks = torch.cat(pred_masks_per_obj, dim=0)
604
  else:
605
  all_pred_masks = pred_masks_per_obj[0]
606
+ _, video_res_masks = self._get_orig_video_res_output(inference_state, all_pred_masks)
 
 
607
  yield frame_idx, obj_ids, video_res_masks
608
 
609
  @torch.inference_mode()
610
+ def clear_all_prompts_in_frame(self, inference_state, frame_idx, obj_id, need_output=True):
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
611
  """Remove all input points or mask in a specific frame for a given object."""
612
  obj_idx = self._obj_id_to_idx(inference_state, obj_id)
613
 
 
632
  return
633
  # Finally, output updated masks per object (after removing the inputs above)
634
  obj_ids = inference_state["obj_ids"]
635
+ is_cond = any(frame_idx in obj_temp_output_dict["cond_frame_outputs"] for obj_temp_output_dict in temp_output_dict_per_obj.values())
 
 
 
636
  consolidated_out = self._consolidate_temp_output_across_obj(
637
  inference_state,
638
  frame_idx,
639
  is_cond=is_cond,
640
  consolidate_at_video_res=True,
641
  )
642
+ _, video_res_masks = self._get_orig_video_res_output(inference_state, consolidated_out["pred_masks_video_res"])
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
643
  return frame_idx, obj_ids, video_res_masks
644
 
645
  @torch.inference_mode()
 
674
  def _get_image_feature(self, inference_state, frame_idx, batch_size):
675
  """Compute the image features on a given frame."""
676
  # Look up in the cache first
677
+ image, backbone_out = inference_state["cached_features"].get(frame_idx, (None, None))
 
 
678
  if backbone_out is None:
679
  # Cache miss -- we will run inference on a single image
680
  device = inference_state["device"]
 
691
  "vision_pos_enc": backbone_out["vision_pos_enc"].copy(),
692
  }
693
  for i, feat in enumerate(expanded_backbone_out["backbone_fpn"]):
694
+ expanded_backbone_out["backbone_fpn"][i] = feat.expand(batch_size, -1, -1, -1)
 
 
695
  for i, pos in enumerate(expanded_backbone_out["vision_pos_enc"]):
696
  pos = pos.expand(batch_size, -1, -1, -1)
697
  expanded_backbone_out["vision_pos_enc"][i] = pos
 
746
  if maskmem_features is not None:
747
  maskmem_features = maskmem_features.to(torch.bfloat16)
748
  maskmem_features = maskmem_features.to(storage_device, non_blocking=True)
749
+ pred_masks_gpu = current_out["pred_masks"]
750
  # potentially fill holes in the predicted masks
751
  if self.fill_hole_area > 0:
752
+ pred_masks_gpu = fill_holes_in_mask_scores(pred_masks_gpu, self.fill_hole_area)
 
 
753
  pred_masks = pred_masks_gpu.to(storage_device, non_blocking=True)
754
  # "maskmem_pos_enc" is the same across frames, so we only need to store one copy of it
755
  maskmem_pos_enc = self._get_maskmem_pos_enc(inference_state, current_out)
756
  # object pointer is a small tensor, so we always keep it on GPU memory for fast access
757
  obj_ptr = current_out["obj_ptr"]
758
  object_score_logits = current_out["object_score_logits"]
 
 
 
 
759
  # make a compact version of this frame's output to reduce the state size
760
  compact_current_out = {
761
+ "maskmem_features": maskmem_features,
762
+ "maskmem_pos_enc": maskmem_pos_enc,
763
  "pred_masks": pred_masks,
764
  "obj_ptr": obj_ptr,
765
  "object_score_logits": object_score_logits,
 
 
 
 
766
  }
767
  return compact_current_out, pred_masks_gpu
768
 
 
781
  memory also need to be computed again with the memory encoder.
782
  """
783
  # Retrieve correct image features
784
+ _, _, current_vision_feats, _, feat_sizes = self._get_image_feature(inference_state, frame_idx, batch_size)
 
 
785
  maskmem_features, maskmem_pos_enc = self._encode_new_memory(
786
  current_vision_feats=current_vision_feats,
787
  feat_sizes=feat_sizes,
 
795
  maskmem_features = maskmem_features.to(torch.bfloat16)
796
  maskmem_features = maskmem_features.to(storage_device, non_blocking=True)
797
  # "maskmem_pos_enc" is the same across frames, so we only need to store one copy of it
798
+ maskmem_pos_enc = self._get_maskmem_pos_enc(inference_state, {"maskmem_pos_enc": maskmem_pos_enc})
 
 
799
  return maskmem_features, maskmem_pos_enc
800
 
801
  def _get_maskmem_pos_enc(self, inference_state, current_out):
 
816
  maskmem_pos_enc = model_constants["maskmem_pos_enc"]
817
  # expand the cached maskmem_pos_enc to the actual batch size
818
  batch_size = out_maskmem_pos_enc[0].size(0)
819
+ expanded_maskmem_pos_enc = [x.expand(batch_size, -1, -1, -1) for x in maskmem_pos_enc]
 
 
820
  else:
821
  expanded_maskmem_pos_enc = None
822
  return expanded_maskmem_pos_enc
 
834
  if not strict:
835
  return inference_state["obj_ids"], updated_frames
836
  raise RuntimeError(
837
+ f"Cannot remove object id {obj_id} as it doesn't exist. " f"All existing object ids: {inference_state['obj_ids']}."
 
838
  )
839
 
840
  # If this is the only remaining object id, we simply reset the state.
 
848
  # (note that this step is required as it might downgrade conditioning frames to
849
  # non-conditioning ones)
850
  obj_input_frames_inds = set()
851
+ obj_input_frames_inds.update(inference_state["point_inputs_per_obj"][old_obj_idx_to_rm])
852
+ obj_input_frames_inds.update(inference_state["mask_inputs_per_obj"][old_obj_idx_to_rm])
 
 
 
 
853
  for frame_idx in obj_input_frames_inds:
854
+ self.clear_all_prompts_in_frame(inference_state, frame_idx, obj_id, need_output=False)
 
 
855
 
856
  # Step 1: Update the object id mapping (note that it must be done after Step 0,
857
  # since Step 0 still requires the old object id mappings in inference_state)
 
868
  inference_state["obj_ids"] = new_obj_ids
869
 
870
  # Step 2: For per-object tensor storage, we shift their obj_idx in the dict keys.
 
 
 
 
 
871
  def _map_keys(container):
872
  new_kvs = []
873
  for k in old_obj_inds:
 
880
  _map_keys(inference_state["mask_inputs_per_obj"])
881
  _map_keys(inference_state["output_dict_per_obj"])
882
  _map_keys(inference_state["temp_output_dict_per_obj"])
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
883
  _map_keys(inference_state["frames_tracked_per_obj"])
884
 
885
  # Step 3: Further collect the outputs on those frames in `obj_input_frames_inds`, which
 
886
  # could show an updated mask for objects previously occluded by the object being removed
887
  if need_output:
888
  temp_output_dict_per_obj = inference_state["temp_output_dict_per_obj"]
889
  for frame_idx in obj_input_frames_inds:
890
  is_cond = any(
891
+ frame_idx in obj_temp_output_dict["cond_frame_outputs"] for obj_temp_output_dict in temp_output_dict_per_obj.values()
 
892
  )
893
  consolidated_out = self._consolidate_temp_output_across_obj(
894
  inference_state,
895
  frame_idx,
896
  is_cond=is_cond,
 
 
 
 
897
  consolidate_at_video_res=True,
898
  )
899
+ _, video_res_masks = self._get_orig_video_res_output(inference_state, consolidated_out["pred_masks_video_res"])
 
 
900
  updated_frames.append((frame_idx, video_res_masks))
901
 
902
  return inference_state["obj_ids"], updated_frames
 
967
  if self.use_high_res_features_in_sam:
968
  # precompute projected level 0 and level 1 features in SAM decoder
969
  # to avoid running it again on every SAM click
970
+ backbone_out["backbone_fpn"][0] = self.sam_mask_decoder.conv_s0(backbone_out["backbone_fpn"][0])
971
+ backbone_out["backbone_fpn"][1] = self.sam_mask_decoder.conv_s1(backbone_out["backbone_fpn"][1])
 
 
 
 
972
  # Clone to help torch.compile
973
  for i in range(len(backbone_out["backbone_fpn"])):
974
  backbone_out["backbone_fpn"][i] = backbone_out["backbone_fpn"][i].clone()
975
+ backbone_out["vision_pos_enc"][i] = backbone_out["vision_pos_enc"][i].clone()
 
 
976
  return backbone_out
977
 
978
  def _forward_sam_heads(
 
1131
  # optionally, apply non-overlapping constraints to the masks (it's applied
1132
  # in the batch dimension and should only be used during eval, where all
1133
  # the objects come from the same video under batch size 1).
1134
+ pred_masks_high_res = self._apply_non_overlapping_constraints(pred_masks_high_res)
 
 
1135
  # scale the raw mask logits with a temperature before applying sigmoid
1136
  binarize = self.binarize_mask_from_pts_for_mem_enc and is_mask_from_pts
1137
  if binarize and not self.training:
 
1144
  mask_for_mem = mask_for_mem * self.sigmoid_scale_for_mem_enc
1145
  if self.sigmoid_bias_for_mem_enc != 0.0:
1146
  mask_for_mem = mask_for_mem + self.sigmoid_bias_for_mem_enc
1147
+ maskmem_out = self.memory_encoder(pix_feat, mask_for_mem, skip_mask_sigmoid=True) # sigmoid already applied
 
 
1148
  # Clone the feats and pos_enc to enable compilation
1149
  maskmem_features = maskmem_out["vision_features"].clone()
1150
  maskmem_pos_enc = [m.clone() for m in maskmem_out["vision_pos_enc"]]
 
1152
  # is predicted to be occluded (i.e. no object is appearing in the frame)
1153
  if self.no_obj_embed_spatial is not None:
1154
  is_obj_appearing = (object_score_logits > 0).float()
1155
+ maskmem_features += (1 - is_obj_appearing[..., None, None]) * self.no_obj_embed_spatial[..., None, None].expand(
 
 
1156
  *maskmem_features.shape
1157
  )
1158