PanoSAMic
PanoSAMic is a multi-modal semantic segmentation model for panoramic (360°) images. It integrates the frozen Segment Anything Model (SAM) encoder, modified to output multi-stage features, with a spatio-modal fusion module (MCBAM), a spherical-attention semantic decoder, and dual-view fusion to handle the distortion and edge discontinuity of equirectangular images.
- Paper: PanoSAMic: Panoramic Image Segmentation from SAM Feature Encoding and Dual View Fusion (ICPR 2026)
- Code: https://github.com/dfki-av/PanoSAMic
- arXiv: https://arxiv.org/abs/2601.07447
- Authors: Mahdi Chamseddine, Didier Stricker, Jason Rambach (DFKI / RPTU Kaiserslautern-Landau)
What is in this repository
Only the trainable PanoSAMic components are hosted here:
- Feature fusion blocks (MCBAM) — spatio-modal cross-attention applied to the branch features extracted by the frozen encoder
- Semantic decoder — convolutional decoder with spherical attention and dual-view fusion head
The full model state dict has two parts:
| Module prefix | Trainable | In Hub checkpoint |
|---|---|---|
| feature_fuser.* | ✅ yes | ✅ yes |
| semantic_decoder.* | ✅ yes | ✅ yes |
| image_encoder.* | ❌ frozen (SAM ViT) | ❌ no |
| prompt_encoder.* | ❌ frozen (SAM) | ❌ no |
| mask_decoder.* | ❌ frozen (SAM) | ❌ no |
The frozen SAM ViT backbone is NOT hosted here. It is downloaded separately from Meta's official release (Apache-2.0) and combined at load time. This keeps each checkpoint small and avoids redistributing the SAM weights.
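
The split between hosted and non-hosted weights can be illustrated with a small prefix filter. This is only a sketch: the prefixes come from the table above, while the helper name `split_state_dict` and the example parameter names are hypothetical and not part of the PanoSAMic code base.

```python
# Prefixes taken from the table above; everything else here is illustrative.
TRAINABLE_PREFIXES = ("feature_fuser.", "semantic_decoder.")
FROZEN_PREFIXES = ("image_encoder.", "prompt_encoder.", "mask_decoder.")


def split_state_dict(state_dict):
    """Split a full state dict into (trainable, frozen) parts by key prefix."""
    trainable, frozen = {}, {}
    for key, value in state_dict.items():
        if key.startswith(TRAINABLE_PREFIXES):
            trainable[key] = value
        elif key.startswith(FROZEN_PREFIXES):
            frozen[key] = value
        else:
            raise KeyError(f"unexpected parameter prefix: {key}")
    return trainable, frozen


# Toy example: made-up parameter names, integers standing in for tensors.
full = {
    "feature_fuser.block0.weight": 1,
    "semantic_decoder.head.bias": 2,
    "image_encoder.patch_embed.weight": 3,
}
trainable, frozen = split_state_dict(full)
print(sorted(trainable))  # the part that would be stored in a Hub checkpoint
print(sorted(frozen))     # the part that comes from the SAM release
```

Only the first dictionary corresponds to what a Hub checkpoint contains; the second is restored from the separately downloaded SAM weights.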
Available checkpoints
Each variant lives in its own subfolder of dfki-av/PanoSAMic
(e.g. stanford2d3ds-vith-rgbdn-fold1/model.safetensors).
The Stanford2D3DS checkpoints are published per fold, so each one can be evaluated on its own held-out split.
| Checkpoint | Backbone | Modalities | Dataset | Split |
|---|---|---|---|---|
| stanford2d3ds-vith-rgb-fold1 | ViT-H | RGB | Stanford2D3DS | Fold 1 |
| stanford2d3ds-vith-rgb-fold2 | ViT-H | RGB | Stanford2D3DS | Fold 2 |
| stanford2d3ds-vith-rgb-fold3 | ViT-H | RGB | Stanford2D3DS | Fold 3 |
| stanford2d3ds-vith-rgbd-fold1 | ViT-H | RGB-D | Stanford2D3DS | Fold 1 |
| stanford2d3ds-vith-rgbd-fold2 | ViT-H | RGB-D | Stanford2D3DS | Fold 2 |
| stanford2d3ds-vith-rgbd-fold3 | ViT-H | RGB-D | Stanford2D3DS | Fold 3 |
| stanford2d3ds-vith-rgbdn-fold1 | ViT-H | RGB-D-N | Stanford2D3DS | Fold 1 |
| stanford2d3ds-vith-rgbdn-fold2 | ViT-H | RGB-D-N | Stanford2D3DS | Fold 2 |
| stanford2d3ds-vith-rgbdn-fold3 | ViT-H | RGB-D-N | Stanford2D3DS | Fold 3 |
| stanford2d3ds-vitl-rgbdn-fold1 | ViT-L | RGB-D-N | Stanford2D3DS | Fold 1 |
| stanford2d3ds-vitl-rgbdn-fold2 | ViT-L | RGB-D-N | Stanford2D3DS | Fold 2 |
| stanford2d3ds-vitl-rgbdn-fold3 | ViT-L | RGB-D-N | Stanford2D3DS | Fold 3 |
| stanford2d3ds-vitb-rgbdn-fold1 | ViT-B | RGB-D-N | Stanford2D3DS | Fold 1 |
| stanford2d3ds-vitb-rgbdn-fold2 | ViT-B | RGB-D-N | Stanford2D3DS | Fold 2 |
| stanford2d3ds-vitb-rgbdn-fold3 | ViT-B | RGB-D-N | Stanford2D3DS | Fold 3 |
| matterport3d-vith-rgb | ViT-H | RGB | Matterport3D | BEV360 |
| matterport3d-vith-rgbd | ViT-H | RGB-D | Matterport3D | BEV360 |
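
The subfolder names in the table follow a regular pattern (dataset, backbone, modality tag, optional fold), so they can be generated rather than typed by hand. The sketch below is an illustration only: the helper `checkpoint_subfolder` is hypothetical, and the mapping from modality tuples to the tags `rgb`, `rgbd` and `rgbdn` is an assumption inferred from the names listed above.

```python
# Assumed mapping from modality tuples to the short tags used in the table.
MODALITY_TAGS = {
    ("image",): "rgb",
    ("image", "depth"): "rgbd",
    ("image", "depth", "normals"): "rgbdn",
}


def checkpoint_subfolder(dataset, vit_model, modalities, fold=None):
    """Build a subfolder name such as 'stanford2d3ds-vith-rgbdn-fold1'."""
    backbone = vit_model.replace("_", "")  # "vit_h" -> "vith"
    name = f"{dataset}-{backbone}-{MODALITY_TAGS[tuple(modalities)]}"
    # The Matterport3D rows carry no fold suffix, so fold stays optional.
    return name if fold is None else f"{name}-fold{fold}"


print(checkpoint_subfolder("stanford2d3ds", "vit_h", ("image", "depth", "normals"), fold=1))
print(checkpoint_subfolder("matterport3d", "vit_h", ("image", "depth")))
```

The helper builds names mechanically, so a generated name is only valid if the corresponding row exists in the table.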
Reported results
Stanford2D3DS (3-fold validation), main table:
| Checkpoint | mIoU % | mAcc % | Trainable params (M) |
|---|---|---|---|
| stanford2d3ds-vith-rgb | 59.62 | 74.11 | 178 |
| stanford2d3ds-vith-rgbd | 60.90 | 73.95 | 184 |
| stanford2d3ds-vith-rgbdn | 61.57 | 74.04 | 191 |
Encoder-size study (Stanford2D3DS, 3-fold, RGB-D-N):
| Checkpoint | mIoU % | mAcc % |
|---|---|---|
| stanford2d3ds-vitb-rgbdn | 56.68 | 70.49 |
| stanford2d3ds-vitl-rgbdn | 60.90 | 73.09 |
| stanford2d3ds-vith-rgbdn | 61.57 | 74.04 |
Matterport3D (BEV360 splits):
| Checkpoint | mIoU % |
|---|---|
| matterport3d-vith-rgb | 46.59 |
| matterport3d-vith-rgbd | 48.43 |
How to reproduce
1. Environment
- Python 3.11+
- Install with uv sync from the GitHub repo (pyproject.toml pins dependencies)
- 1× GPU with ≥16 GB VRAM for ViT-H inference (≥24 GB for training)
2. Get the frozen SAM backbone
Download the official SAM weights from Meta and place them in sam_weights/:
- sam_vit_h_4b8939.pth
- sam_vit_l_0b3195.pth
- sam_vit_b_01ec64.pth
(See https://github.com/facebookresearch/segment-anything#model-checkpoints)
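
Before loading a model it can help to check that the right SAM file is in place. The following is a minimal sketch, assuming the backbone identifiers `vit_h`, `vit_l` and `vit_b` (only `vit_h` appears explicitly in this card) and a hypothetical helper name `sam_checkpoint_path`. The file names are the ones listed above.

```python
from pathlib import Path

# File names of the official SAM release, as listed above.
SAM_CHECKPOINTS = {
    "vit_h": "sam_vit_h_4b8939.pth",
    "vit_l": "sam_vit_l_0b3195.pth",
    "vit_b": "sam_vit_b_01ec64.pth",
}


def sam_checkpoint_path(vit_model, weights_dir="sam_weights"):
    """Return the expected local path of the SAM checkpoint for a backbone."""
    try:
        filename = SAM_CHECKPOINTS[vit_model]
    except KeyError:
        raise ValueError(f"unknown backbone {vit_model!r}") from None
    return Path(weights_dir) / filename


path = sam_checkpoint_path("vit_h")
print(path, "found" if path.is_file() else "missing")
```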
3. Load a checkpoint
```python
from panosamic.model import PanoSAMic

model = PanoSAMic.from_pretrained_panosamic(
    "dfki-av/PanoSAMic",
    subfolder="stanford2d3ds-vith-rgbdn-fold1",
    config_path="config/config_stanford2d3ds_dv.json",
    vit_model="vit_h",
    modalities=("image", "depth", "normals"),
    num_classes=13,
    sam_weights_path="./sam_weights",  # omit to auto-download from Meta's servers
)
```
from_pretrained_panosamic loads only the trainable weights from the Hub,
initialises the frozen SAM backbone from the local sam_weights/ directory
(auto-downloaded if not present), and returns the model in eval() mode.
4. Run inference
```python
import torch
from panosamic.model.instance_semantic_fusion import refine_semantic_with_instances

# batched_input: list of dicts, one per image.
# Each dict maps modality name → float tensor (3, H, W), values in [0, 255].
# Image must be equirectangular 2:1 (e.g. 512 × 1024).
batched_input = [{"image": image_tensor, "depth": depth_tensor, "normals": normals_tensor}]

with torch.no_grad():
    outputs = model(batched_input)

sem_preds = outputs[0]["sem_preds"]  # (num_classes, H, W), logits
instance_masks = outputs[0]["instance_masks"]

# Instance-guided refinement: each SAM mask is assigned the majority
# semantic class within it, sharpening boundaries.
if instance_masks:
    sem_preds = refine_semantic_with_instances(sem_preds, instance_masks)

seg_map = sem_preds.argmax(dim=0)  # (H, W), integer class indices
```
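
The idea behind the instance-guided refinement, majority voting inside each mask, can be shown on a toy label map. This is a sketch of the general technique only, not the repository's `refine_semantic_with_instances`: that function receives logits, whereas this hypothetical `majority_refine` works on an already-argmaxed label map stored as nested lists, and the mask format is likewise an assumption.

```python
from collections import Counter


def majority_refine(labels, masks):
    """Assign each mask the most frequent label found inside it.

    labels: H x W nested list of integer class indices.
    masks:  list of H x W nested lists of booleans, one per instance.
    """
    refined = [row[:] for row in labels]  # copy so the input stays untouched
    for mask in masks:
        pixels = [
            (r, c)
            for r, row in enumerate(mask)
            for c, inside in enumerate(row)
            if inside
        ]
        if not pixels:
            continue  # empty mask: nothing to vote on
        # Vote on the original labels so masks do not influence each other.
        winner, _ = Counter(labels[r][c] for r, c in pixels).most_common(1)[0]
        for r, c in pixels:
            refined[r][c] = winner
    return refined


labels = [
    [0, 0, 1],
    [0, 2, 1],
]
mask = [
    [True, True, False],
    [True, True, False],
]
print(majority_refine(labels, [mask]))  # [[0, 0, 1], [0, 0, 1]]
```

In the toy example the single pixel labelled 2 is outvoted by the three pixels labelled 0 inside the same mask, which is the boundary-sharpening effect described in the comment above.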
5. Prepare the data
Use the exact splits reported in the paper:
- Stanford2D3DS: the authors' 3-fold cross-validation splits. Source: https://github.com/alexsax/2D-3D-Semantics . Preprocess with panosamic/data_preparation/ into the processed structure documented in the repo README.
- Matterport3D: the BEV360 pre-processed data and splits (20-class subset) for a fair comparison. Source: https://github.com/InSAI-Lab/360BEV .
6. Run evaluation
From a released Hub checkpoint (trainable weights only, SAM loaded separately):
```bash
python panosamic/evaluation/evaluate.py \
    --dataset_path /path/to/processed/dataset \
    --config_path config/config_stanford2d3ds_dv.json \
    --checkpoint dfki-av/PanoSAMic \
    --subfolder stanford2d3ds-vith-rgbdn-fold1 \
    --sam_weights_path ./sam_weights \
    --dataset stanford2d3ds \
    --fold 1 \
    --vit_model vit_h \
    --modalities image,depth,normals \
    --num_gpus 1
```
From a local training run (full checkpoint including frozen backbone):
```bash
python panosamic/evaluation/evaluate.py \
    --dataset_path /path/to/processed/dataset \
    --config_path config/config_stanford2d3ds_dv.json \
    --experiments_path ./experiments \
    --dataset stanford2d3ds \
    --fold 1 \
    --vit_model vit_h \
    --modalities image,depth,normals \
    --num_gpus 1
```
Repeat for folds 1–3 and average for the 3-fold numbers. For Matterport3D use
config/config_matterport3d_dv.json, --dataset matterport3d, and the
modalities for that row.
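
Repeating the evaluation over the three folds can be scripted. The sketch below only builds the Hub-checkpoint command for each fold, using the flags shown above, and leaves execution to the reader. The helper names are hypothetical, and treating "average" as an unweighted mean of the per-fold scores is an assumption. How `evaluate.py` reports its metrics is not covered here, so collecting the scores is left out.

```python
import statistics


def evaluate_command(fold):
    """Build the Hub-checkpoint evaluation command for one Stanford2D3DS fold."""
    return [
        "python", "panosamic/evaluation/evaluate.py",
        "--dataset_path", "/path/to/processed/dataset",
        "--config_path", "config/config_stanford2d3ds_dv.json",
        "--checkpoint", "dfki-av/PanoSAMic",
        "--subfolder", f"stanford2d3ds-vith-rgbdn-fold{fold}",
        "--sam_weights_path", "./sam_weights",
        "--dataset", "stanford2d3ds",
        "--fold", str(fold),
        "--vit_model", "vit_h",
        "--modalities", "image,depth,normals",
        "--num_gpus", "1",
    ]


def average_over_folds(per_fold_scores):
    """Unweighted mean of one metric across folds (assumed averaging scheme)."""
    return statistics.fmean(per_fold_scores)


commands = [evaluate_command(fold) for fold in (1, 2, 3)]
for cmd in commands:
    # To execute, pass cmd to subprocess.run(cmd, check=True).
    print(" ".join(cmd))
```

Once the three per-fold values of a metric are available, `average_over_folds` can be applied to them to obtain the 3-fold number.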
7. Key configuration (matches the paper)
- Frozen SAM ViT-H, encoder depth 32, global attention at blocks [8, 16, 24, 32]
- Batch size 8, 50 epochs, Ranger21 optimizer
- Max LR 0.0005 (Stanford2D3DS) / 0.001 (Matterport3D)
- Input resized to 512 × 1024
- MCBAM window 8×8, stride 4; spherical attention kernel 7×7, stride 1
- Dual-view shift s = W/2
- Loss: Jaccard (Stanford2D3DS); alternating Cross-Entropy/Jaccard schedule (Matterport3D)
- Depth preprocessed to pseudo-disparity (threshold = 99.5th percentile of train depths, rounded to nearest 10 cm), replicated to 3 channels
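
The dual-view shift s = W/2 listed above amounts to a circular roll of the equirectangular image along its width, which moves the left/right seam to the image centre. The following is a minimal sketch on nested lists with a hypothetical helper `roll_width`. It illustrates the shift only and makes no claim about how the model fuses the two views. On tensors, `torch.roll(x, shifts=W // 2, dims=-1)` performs the same operation.

```python
def roll_width(image, shift):
    """Circularly shift an H x W nested list to the right along the width axis."""
    rolled = []
    for row in image:
        k = shift % len(row)  # normalise negative or oversized shifts
        rolled.append(row[-k:] + row[:-k] if k else row[:])
    return rolled


# One image row of width 8; the values stand for column indices.
image = [[0, 1, 2, 3, 4, 5, 6, 7]]
width = len(image[0])

shifted = roll_width(image, width // 2)
print(shifted)  # [[4, 5, 6, 7, 0, 1, 2, 3]]

# Rolling back by the same amount restores the original view.
restored = roll_width(shifted, -(width // 2))
print(restored == image)  # True
```

Columns 7 and 0, which sit on opposite borders of the original view, become neighbours in the shifted view, so content cut by the panorama edge is seen contiguously in one of the two views.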
Intended use and limitations
Indoor panoramic semantic segmentation with RGB / RGB-D / RGB-D-N input. Evaluated only on indoor datasets; outdoor generalization is not guaranteed.
License and access terms
- This model card and the released trainable weights: CC BY-NC-SA 4.0 (Attribution–NonCommercial–ShareAlike). Use is restricted to non-commercial purposes.
- The frozen SAM backbone (downloaded separately) remains under its original Apache-2.0 license from Meta AI.
Citation
@article{chamseddine2026panosamic,
title = {PanoSAMic: Panoramic Image Segmentation from SAM Feature Encoding and Dual View Fusion},
author = {Chamseddine, Mahdi and Stricker, Didier and Rambach, Jason},
journal = {arXiv preprint arXiv:2601.07447},
year = {2026}
}
Acknowledgement
Funded by the European Union as part of the projects HumanTech (Grant Agreement 101058236) and ShieldBOT (Grant Agreement 101235093).