SAM ViT-H Fine-Tuning for Exterior Facade Segmentation

Fine-tuning the Segment Anything Model (SAM) ViT-H variant to accurately segment exterior walls and facades of residential and commercial buildings.

Objective

Segment exterior walls and facades of houses/buildings from real-world exterior images. The fine-tuned model focuses on:

Exterior walls and front elevations
Side walls and major architectural surfaces
Visible wall designs and textures belonging to the facade

The model is trained to exclude:

Sky, trees, roads, vehicles, people
Windows, doors, and balconies on their own
Surrounding background unless part of the main facade region

Dataset: CMP Facade Database

Name: CMP Facade Database
Source: HuggingFace Dataset
Original: Center for Machine Perception, Czech Technical University in Prague (CTU Prague)
License: MIT
Size: 606 rectified facade images
Splits:
- Train: 378 images
- Eval (validation): 114 images
- Test: 114 images
Annotation Format: Multi-class pixel segmentation with 12 architectural classes
- Class 1 = facade (our target)
- Classes 2-12 = windows, doors, cornices, sills, balconies, etc.
Binary Masks: Generated from class 1 only, representing the main facade surface
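
The binary-mask step can be sketched as follows. This is a minimal illustration, assuming the annotation is available as an integer label map and that class ID 1 marks the facade as stated above; the function name `facade_binary_mask` is hypothetical and is not taken from the project scripts.

```python
import numpy as np

FACADE_CLASS_ID = 1  # per the class list above; verify against the actual label files


def facade_binary_mask(label_map: np.ndarray, facade_id: int = FACADE_CLASS_ID) -> np.ndarray:
    """Return a uint8 mask: 1 where the pixel is labelled facade, 0 elsewhere."""
    return (label_map == facade_id).astype(np.uint8)


# Tiny 3x4 label map: 1 = facade, 5 and 6 stand in for other element classes.
labels = np.array([[1, 1, 5, 5],
                   [1, 6, 6, 1],
                   [1, 1, 1, 1]])
mask = facade_binary_mask(labels)
print(mask)
```

Every non-facade class, including windows and doors embedded in the wall, becomes background in this mask, which matches the stated goal of excluding those elements.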

Model: SAM ViT-H

Checkpoint: facebook/sam-vit-huge
Total Parameters: ~641M
Architecture: Vision Transformer-Huge image encoder + prompt encoder + mask decoder

Loading:

from transformers import SamModel, SamProcessor
model = SamModel.from_pretrained("facebook/sam-vit-huge")
processor = SamProcessor.from_pretrained("facebook/sam-vit-huge")

Fine-Tuning Strategy

Component	Status	Rationale
Image Encoder (ViT-H)	Frozen (~637M params)	Preserves powerful pre-trained visual features; fine-tuning would require massive compute and risk catastrophic forgetting
Prompt Encoder	Fine-tuned	Adapts prompt embeddings to facade domain
Mask Decoder	Fine-tuned (~4M params total trainable)	Adapts mask generation to facade-specific shapes and boundaries
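
A minimal sketch of the freezing scheme in the table, assuming the model exposes `vision_encoder`, `prompt_encoder`, and `mask_decoder` submodules (the attribute names used by the `transformers` `SamModel`). The helper name `freeze_image_encoder` is hypothetical, and the project's own training code may do this differently.

```python
def freeze_image_encoder(model):
    """Freeze the image encoder; keep prompt encoder and mask decoder trainable.

    Returns (trainable, total) parameter counts so the split can be checked.
    """
    for p in model.vision_encoder.parameters():
        p.requires_grad = False
    for submodule in (model.prompt_encoder, model.mask_decoder):
        for p in submodule.parameters():
            p.requires_grad = True
    params = list(model.parameters())
    trainable = sum(p.numel() for p in params if p.requires_grad)
    total = sum(p.numel() for p in params)
    return trainable, total


# Usage with the model from the Loading section (downloads the checkpoint):
# from transformers import SamModel
# model = SamModel.from_pretrained("facebook/sam-vit-huge")
# trainable, total = freeze_image_encoder(model)
# print(f"trainable: {trainable / 1e6:.1f}M of {total / 1e6:.1f}M parameters")
```

When building the optimizer, passing only the parameters with `requires_grad=True` avoids allocating optimizer state for the frozen encoder.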

Prompt Strategy: Bounding boxes generated from ground-truth facade masks (with optional jitter for augmentation during training).
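
One way to derive such a box prompt is sketched below. It assumes the `[x_min, y_min, x_max, y_max]` pixel format and an independent integer jitter per coordinate; the function name `box_from_mask` and the uniform jitter distribution are illustrative assumptions, not details confirmed by the project code.

```python
import numpy as np


def box_from_mask(mask, jitter=0, rng=None):
    """Return [x_min, y_min, x_max, y_max] around the nonzero pixels of `mask`.

    Each coordinate is shifted by a random integer in [-jitter, jitter] and
    clipped to the image bounds. Returns None for an empty mask.
    """
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return None
    h, w = mask.shape
    box = np.array([xs.min(), ys.min(), xs.max(), ys.max()], dtype=np.int64)
    if jitter > 0:
        if rng is None:
            rng = np.random.default_rng()
        box += rng.integers(-jitter, jitter + 1, size=4)
    # Clip to the image and keep min <= max even for very small masks.
    x0, x1 = sorted(np.clip(box[[0, 2]], 0, w - 1).tolist())
    y0, y1 = sorted(np.clip(box[[1, 3]], 0, h - 1).tolist())
    return [x0, y0, x1, y1]


mask = np.zeros((100, 120), dtype=np.uint8)
mask[20:60, 30:90] = 1
tight_box = box_from_mask(mask)  # no jitter, e.g. for evaluation
train_box = box_from_mask(mask, jitter=10, rng=np.random.default_rng(0))
print(tight_box, train_box)
```

Jitter makes the training prompts slightly loose or tight, so the decoder does not come to rely on boxes that fit the ground truth exactly.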

Loss: Binary Cross-Entropy between predicted logits and ground-truth binary masks at SAM's native 256x256 output resolution.
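
For reference, the loss can be written out directly from logits. The NumPy sketch below is an illustration of the formula only; in a PyTorch training loop the equivalent built-in is `torch.nn.functional.binary_cross_entropy_with_logits`, and whether the project uses that exact call is an assumption.

```python
import numpy as np


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy computed directly from logits.

    Uses max(x, 0) - x*y + log(1 + exp(-|x|)), which avoids overflow for
    large-magnitude logits.
    """
    x = np.asarray(logits, dtype=np.float64)
    y = np.asarray(targets, dtype=np.float64)
    loss = np.maximum(x, 0.0) - x * y + np.log1p(np.exp(-np.abs(x)))
    return float(loss.mean())


# Ground-truth mask at the 256x256 output resolution.
target = np.zeros((256, 256))
target[64:192, 64:192] = 1.0

uninformed = bce_with_logits(np.zeros((256, 256)), target)  # every pixel at p = 0.5
confident = bce_with_logits(np.where(target == 1, 8.0, -8.0), target)
print(uninformed, confident)
```

Because the loss is computed at 256x256, the ground-truth masks have to be resized to that resolution before comparison.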

Optimizer: Adam, lr=1e-4, StepLR decay at epoch 8 (gamma=0.1)
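
The resulting learning-rate schedule can be sketched in plain Python. This assumes `step_size=8`, 0-indexed epochs, and one scheduler step per epoch, which is one reading of "decay at epoch 8"; the helper `step_lr` is hypothetical.

```python
import math


def step_lr(epoch, base_lr=1e-4, step_size=8, gamma=0.1):
    """Learning rate in effect during `epoch` (0-indexed) under a StepLR-style schedule."""
    return base_lr * gamma ** (epoch // step_size)


schedule = [step_lr(epoch) for epoch in range(15)]
for epoch, lr in enumerate(schedule):
    print(f"epoch {epoch:2d}: lr = {lr:.0e}")
```

Under these assumptions the first 8 epochs run at 1e-4 and the remaining 7 at 1e-5.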

Training Configuration:

Epochs: 15
Batch size: 4
Augmentation: Bounding box jitter +/-10px

Results

Baseline vs Fine-tuned (Test Set)

Metric	Original SAM ViT-H	Fine-tuned SAM ViT-H	Improvement
IoU	0.1534	0.4371	+184.9%
mIoU	0.1534	0.4371	+184.9%
Dice Score	0.2398	0.5677	+136.8%
Pixel Accuracy	0.5667	0.8580	+51.4%
Precision	0.3897	0.5508	+41.3%
Recall	0.2893	0.7235	+150.1%
FPR	0.3060	0.1348	-55.9% (better)
FNR	0.6844	0.2502	-63.4% (better)
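
The metrics in the table follow the standard confusion-matrix definitions, sketched below for a single pair of binary masks. The contents of scripts/metrics.py are not shown here, so the function name `binary_metrics`, the `eps` smoothing term, and the way scores are averaged over images are assumptions.

```python
import numpy as np


def binary_metrics(pred, target, eps=1e-7):
    """Compute standard binary segmentation metrics for one mask pair."""
    pred = np.asarray(pred).astype(bool)
    target = np.asarray(target).astype(bool)
    tp = float(np.sum(pred & target))    # facade predicted, facade in ground truth
    fp = float(np.sum(pred & ~target))   # facade predicted, background in ground truth
    fn = float(np.sum(~pred & target))   # background predicted, facade in ground truth
    tn = float(np.sum(~pred & ~target))  # background in both
    return {
        "iou": tp / (tp + fp + fn + eps),
        "dice": 2 * tp / (2 * tp + fp + fn + eps),
        "pixel_accuracy": (tp + tn) / (tp + tn + fp + fn + eps),
        "precision": tp / (tp + fp + eps),
        "recall": tp / (tp + fn + eps),
        "fpr": fp / (fp + tn + eps),
        "fnr": fn / (fn + tp + eps),
    }


pred = np.array([[1, 1, 0, 0]])
target = np.array([[1, 0, 1, 0]])
scores = binary_metrics(pred, target)
print(scores)
```

In this toy example one pixel falls into each confusion-matrix cell, so IoU is about 1/3 and all other metrics are about 0.5.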

Conclusion: On the test set, the fine-tuned SAM ViT-H clearly outperforms the original pre-trained model on facade segmentation: IoU rises from 0.1534 to 0.4371 (+184.9%) and Recall from 0.2893 to 0.7235 (+150.1%). The false positive rate drops by 55.9% and the false negative rate by 63.4%, indicating cleaner and more complete facade masks. Precision improves least (+41.3%, to 0.5508), which suggests that false positives remain the main source of error.
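
The percentages in the Improvement column are relative changes with respect to the baseline. A short sketch of that arithmetic, using values from the table (the helper name `relative_change` is illustrative):

```python
def relative_change(baseline, finetuned):
    """Percentage change of `finetuned` relative to `baseline`."""
    return (finetuned - baseline) / baseline * 100.0


iou_gain = relative_change(0.1534, 0.4371)    # IoU row
fnr_change = relative_change(0.6844, 0.2502)  # FNR row (negative means lower error)
print(f"IoU: {iou_gain:+.1f}%  FNR: {fnr_change:+.1f}%")
```

A relative gain of +184.9% therefore means the IoU is roughly 2.85 times the baseline value, not that it increased by 184.9 percentage points.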

Qualitative Improvements

Better boundary detection: Fine-tuned model more accurately traces facade edges
Reduced background leakage: Fewer sky, tree, and road pixels included in predictions
Improved wall texture capture: Better segmentation of facade surfaces with visible textures and designs
Cleaner separation: Windows, doors, and balconies are more cleanly excluded from the facade mask

Project Structure

sam_facade_project/
├── data/
│   └── cmp_facade/              # Downloaded & preprocessed dataset
│       ├── train/
│       ├── eval/
│       └── test/
├── outputs/
│   ├── baseline/                # Baseline evaluation results
│   ├── finetuned/               # Training checkpoints & history
│   ├── finetuned_eval/          # Fine-tuned evaluation results
│   ├── comparison/              # Qualitative comparison images
│   ├── inference/               # Inference examples
│   └── dataset_visualization.png
├── scripts/
│   ├── download_and_prepare_dataset.py
│   ├── dataset.py               # PyTorch Dataset + dataloader
│   ├── metrics.py               # IoU, Dice, Pixel Accuracy, etc.
│   ├── baseline_eval.py         # Baseline evaluation script
│   ├── evaluate.py              # Evaluation script (any checkpoint)
│   ├── qualitative_compare.py   # Side-by-side visual comparison
│   ├── inference.py             # Inference on new images
│   └── visualize_dataset.py     # Dataset visualization
├── notebooks/
│   └── SAM_ViT_H_Facade_Segmentation.ipynb
├── requirements.txt
└── README.md

Usage

1. Setup Environment

pip install -r requirements.txt

2. Prepare Dataset

cd scripts
python download_and_prepare_dataset.py

3. Run Baseline Evaluation

python evaluate.py --data_dir data/cmp_facade --split test --batch_size 2 --output_dir outputs/baseline

4. Evaluate Fine-tuned Model

python evaluate.py --checkpoint outputs/finetuned/best_sam_vit_h_facade.pth --data_dir data/cmp_facade --split test --output_dir outputs/finetuned_eval

5. Qualitative Comparison

python qualitative_compare.py --checkpoint outputs/finetuned/best_sam_vit_h_facade.pth --data_dir data/cmp_facade --split test --output_dir outputs/comparison

6. Inference on New Images

python inference.py --checkpoint outputs/finetuned/best_sam_vit_h_facade.pth --image path/to/image.jpg --output outputs/inference_result.png

Jupyter Notebook

The notebook notebooks/SAM_ViT_H_Facade_Segmentation.ipynb contains a complete, beginner-friendly tutorial covering:

Introduction to facade segmentation
Dataset setup and visualization
Baseline evaluation with original SAM ViT-H
Fine-tuning pipeline explanation
Training loop
Evaluation metrics
Quantitative comparison table
Qualitative comparison (success and failure cases)
Inference on new images
Final results and observations

Colab-friendly: All cells are self-contained; simply adjust file paths if running on Google Colab.

Methodology

Why freeze the image encoder?

SAM ViT-H's image encoder contains ~637M parameters, and SAM was trained on the SA-1B dataset (11 million images, over 1 billion masks). Fine-tuning the encoder would:

Require >30GB GPU memory for reasonable batch sizes
Risk catastrophic forgetting of general segmentation knowledge
Take orders of magnitude longer to converge

By freezing the encoder and only fine-tuning the lightweight prompt encoder + mask decoder (~4M params), we:

Preserve powerful zero-shot visual features
Efficiently adapt to the facade domain
Enable training on a single GPU in under 1 hour

Why bounding box prompts?

Buildings have well-defined rectangular extents. Bounding boxes naturally capture this structure and provide a strong initialization for SAM's mask decoder. Alternative point prompts are less reliable for large, contiguous regions like facades.

Training Curves

Training history (loss, IoU, Dice, pixel accuracy) is saved to:

outputs/finetuned/training_history.json
outputs/finetuned/training_curves.png

Future Improvements

Adapter Layers: Insert bottleneck adapters into frozen ViT-H blocks (as in SAM-Med2D / Med-SA) for better domain adaptation without full encoder fine-tuning
Larger Datasets: Combine CMP Facade with street-view facade datasets (e.g., ECP dataset) for more diverse architectural styles
Interactive Refinement: Implement iterative point-sampling from error regions (as in SAM-Med2D) for higher IoU
HQ-SAM Integration: Add a high-quality output token to improve boundary precision
Multi-class Segmentation: Extend from binary facade masks to multi-class facade element segmentation (windows, doors, balconies)

Citation

If you use this project, please cite:

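SAM: Kirillov et al., "Segment Anything", 2023 (arXiv:2304.02643)
CMP Facade: Tyleček & Šára, "Spatial Pattern Templates for Recognition of Objects with Regular Structure", GCPR 2013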

License

This project is provided for research and educational purposes. The CMP Facade dataset follows its original license (MIT). SAM follows the Apache 2.0 license.
