SAM ViT-H Fine-Tuning for Exterior Facade Segmentation

Fine-tuning the Segment Anything Model (SAM) ViT-H variant to accurately segment exterior walls and facades of residential and commercial buildings.

Objective

Segment exterior walls and facades of houses/buildings from real-world exterior images. The fine-tuned model focuses on:

  • Exterior walls and front elevations
  • Side walls and major architectural surfaces
  • Visible wall designs and textures belonging to the facade

While avoiding:

  • Sky, trees, roads, vehicles, people
  • Standalone windows, doors, and balconies
  • Surrounding background unless part of the main facade region

Dataset: CMP Facade Database

  • Name: CMP Facade Database
  • Source: HuggingFace Dataset
  • Original: Center for Machine Perception, Czech Technical University (CTU) in Prague
  • License: MIT
  • Size: 606 rectified facade images
  • Splits:
    • Train: 378 images
    • Eval (validation): 114 images
    • Test: 114 images
  • Annotation Format: Multi-class pixel segmentation with 12 architectural classes
    • Class 1 = facade (our target)
    • Classes 2-12 = windows, doors, cornices, sills, balconies, etc.
  • Binary Masks: Generated from class 1 only, representing the main facade surface
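The binary-mask step can be sketched as follows. This is a minimal illustration, assuming the annotation has already been loaded as an integer array of class IDs (1 to 12); the function name `to_binary_facade_mask` and the constant `FACADE_CLASS_ID` are hypothetical and are not taken from the project's scripts.

```python
import numpy as np

# Assumption: class ID 1 marks the main facade surface, as stated above.
FACADE_CLASS_ID = 1


def to_binary_facade_mask(label_map, facade_id=FACADE_CLASS_ID):
    """Return a uint8 mask that is 1 where the label equals facade_id, else 0."""
    label_map = np.asarray(label_map)
    return (label_map == facade_id).astype(np.uint8)


# Tiny 2x3 label map: facade (1), two other class IDs (2 and 5).
labels = np.array([[1, 1, 2],
                   [5, 1, 1]])
binary = to_binary_facade_mask(labels)
```

Windows, doors, and the other classes all map to 0, so holes in the facade surface stay excluded from the target mask.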

Model: SAM ViT-H

  • Checkpoint: facebook/sam-vit-huge
  • Total Parameters: ~641M
  • Architecture: Vision Transformer-Huge image encoder + prompt encoder + mask decoder
  • Loading:
    from transformers import SamModel, SamProcessor
    model = SamModel.from_pretrained("facebook/sam-vit-huge")
    processor = SamProcessor.from_pretrained("facebook/sam-vit-huge")
    

Fine-Tuning Strategy

| Component | Status | Rationale |
| --- | --- | --- |
| Image Encoder (ViT-H) | Frozen (~637M params) | Preserves powerful pre-trained visual features; fine-tuning would require massive compute and risk catastrophic forgetting |
| Prompt Encoder | Fine-tuned | Adapts prompt embeddings to facade domain |
| Mask Decoder | Fine-tuned (~4M params total trainable) | Adapts mask generation to facade-specific shapes and boundaries |
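One way to apply this split is to toggle `requires_grad` by parameter-name prefix. The sketch below is an assumption about how it could be done, not the project's training code: the helper `freeze_by_prefix` is hypothetical, and it relies on the image encoder being exposed as the `vision_encoder` submodule, which is how the `transformers` `SamModel` class names it.

```python
def freeze_by_prefix(model, frozen_prefixes=("vision_encoder.",)):
    """Freeze parameters whose names start with a given prefix.

    Works on any object with a PyTorch-style named_parameters() method.
    Returns (trainable_count, frozen_count) in number of scalar parameters.
    """
    prefixes = tuple(frozen_prefixes)
    trainable = frozen = 0
    for name, param in model.named_parameters():
        # Parameters outside the frozen prefixes stay trainable.
        param.requires_grad = not name.startswith(prefixes)
        if param.requires_grad:
            trainable += param.numel()
        else:
            frozen += param.numel()
    return trainable, frozen


# Usage (downloads the checkpoint, so it is left commented out):
# from transformers import SamModel
# model = SamModel.from_pretrained("facebook/sam-vit-huge")
# trainable, frozen = freeze_by_prefix(model)
# print(f"trainable: {trainable / 1e6:.1f}M, frozen: {frozen / 1e6:.1f}M")
```

Only the parameters left with `requires_grad=True` then need to be passed to the optimizer.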

Prompt Strategy: Bounding boxes generated from ground-truth facade masks (with optional jitter for augmentation during training).
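A box prompt of this kind can be derived from a binary mask roughly as shown below. This is a sketch under assumptions: the helper name `mask_to_box`, the `[x_min, y_min, x_max, y_max]` ordering, and the uniform per-coordinate jitter are illustrative choices, and the code does not guard against degenerate boxes on very small masks.

```python
import numpy as np


def mask_to_box(mask, jitter=0, rng=None):
    """Tight [x_min, y_min, x_max, y_max] box around the nonzero pixels of a 2D mask.

    If jitter > 0, each coordinate is shifted by a random integer in
    [-jitter, jitter] and then clipped to the image bounds.
    """
    mask = np.asarray(mask)
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        raise ValueError("mask has no foreground pixels")
    h, w = mask.shape
    box = np.array([xs.min(), ys.min(), xs.max(), ys.max()], dtype=np.int64)
    if jitter > 0:
        if rng is None:
            rng = np.random.default_rng()
        box += rng.integers(-jitter, jitter + 1, size=4)
    box[[0, 2]] = np.clip(box[[0, 2]], 0, w - 1)  # x coordinates
    box[[1, 3]] = np.clip(box[[1, 3]], 0, h - 1)  # y coordinates
    return box.tolist()


mask = np.zeros((10, 12), dtype=np.uint8)
mask[2:5, 3:8] = 1
tight_box = mask_to_box(mask)                      # no jitter (evaluation)
train_box = mask_to_box(mask, jitter=10,           # jittered (training)
                        rng=np.random.default_rng(0))
```

With the `transformers` processor, a single box per image would typically be passed as `input_boxes=[[box]]`; whether the project does exactly this is not stated here.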

Loss: Binary Cross-Entropy between predicted logits and ground-truth binary masks at SAM's native 256x256 output resolution.
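For reference, the loss computation can be written out in NumPy as below. This is a sketch, not the training code: it assumes the ground-truth mask is downsampled to 256x256 with nearest-neighbour sampling (the project may resize differently), and both helper names are hypothetical. The loss uses the numerically stable form of binary cross-entropy on logits with a mean over pixels, which is the same quantity PyTorch's `BCEWithLogitsLoss` computes with its default reduction.

```python
import numpy as np


def resize_nearest(mask, size=256):
    """Nearest-neighbour resize of a 2D mask to (size, size)."""
    mask = np.asarray(mask)
    h, w = mask.shape
    rows = np.arange(size) * h // size  # source row for each output row
    cols = np.arange(size) * w // size  # source column for each output column
    return mask[np.ix_(rows, cols)]


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy between raw logits and {0, 1} targets."""
    z = np.asarray(logits, dtype=np.float64)
    y = np.asarray(targets, dtype=np.float64)
    # Stable form of -[y*log(sigmoid(z)) + (1-y)*log(1-sigmoid(z))].
    per_pixel = np.maximum(z, 0) - z * y + np.log1p(np.exp(-np.abs(z)))
    return float(per_pixel.mean())


gt = np.zeros((512, 1024), dtype=np.uint8)
gt[:256] = 1                              # top half is facade
target = resize_nearest(gt)               # shape (256, 256)
logits = np.zeros((256, 256))             # an uninformative prediction
loss = bce_with_logits(logits, target)    # log(2) for all-zero logits
```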

Optimizer: Adam, lr=1e-4, StepLR decay at epoch 8 (gamma=0.1)
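The resulting learning-rate schedule over the 15 epochs can be tabulated with a few lines of plain Python. This sketch assumes the decay corresponds to a step size of 8 with 0-indexed epochs, using the same closed form as PyTorch's `StepLR`; the function name `step_lr` is illustrative.

```python
def step_lr(epoch, base_lr=1e-4, step_size=8, gamma=0.1):
    """Learning rate at a 0-indexed epoch under step decay."""
    return base_lr * gamma ** (epoch // step_size)


# Learning rate used in each of the 15 training epochs.
schedule = [step_lr(epoch) for epoch in range(15)]
for epoch, lr in enumerate(schedule):
    print(f"epoch {epoch:2d}: lr = {lr:.0e}")
```

Under these assumptions, epochs 0 to 7 train at 1e-4 and epochs 8 to 14 at 1e-5.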

Training Configuration:

  • Epochs: 15
  • Batch size: 4
  • Augmentation: Bounding box jitter of ±10 px

Results

Baseline vs Fine-tuned (Test Set)

| Metric | Original SAM ViT-H | Fine-tuned SAM ViT-H | Improvement |
| --- | --- | --- | --- |
| IoU | 0.1534 | 0.4371 | +184.9% |
| mIoU | 0.1534 | 0.4371 | +184.9% |
| Dice Score | 0.2398 | 0.5677 | +136.8% |
| Pixel Accuracy | 0.5667 | 0.8580 | +51.4% |
| Precision | 0.3897 | 0.5508 | +41.3% |
| Recall | 0.2893 | 0.7235 | +150.1% |
| FPR | 0.3060 | 0.1348 | -55.9% (better) |
| FNR | 0.6844 | 0.2502 | -63.4% (better) |
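For readers unfamiliar with these metrics, the sketch below shows their standard definitions in terms of pixel-level true/false positives and negatives for a single predicted mask. It is an illustration only: the project's `metrics.py` may use different smoothing terms or averaging across images, and the function name `binary_metrics` is hypothetical.

```python
import numpy as np


def binary_metrics(pred, target):
    """Standard binary segmentation metrics for one predicted/ground-truth mask pair."""
    pred = np.asarray(pred).astype(bool)
    target = np.asarray(target).astype(bool)
    tp = int(np.sum(pred & target))    # facade predicted as facade
    fp = int(np.sum(pred & ~target))   # background predicted as facade
    fn = int(np.sum(~pred & target))   # facade predicted as background
    tn = int(np.sum(~pred & ~target))  # background predicted as background

    def ratio(num, den):
        # Return 0.0 instead of dividing by zero for empty denominators.
        return num / den if den else 0.0

    return {
        "iou": ratio(tp, tp + fp + fn),
        "dice": ratio(2 * tp, 2 * tp + fp + fn),
        "pixel_accuracy": ratio(tp + tn, tp + fp + fn + tn),
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "fpr": ratio(fp, fp + tn),
        "fnr": ratio(fn, fn + tp),
    }


pred = np.array([[1, 1], [0, 0]])
target = np.array([[1, 0], [1, 0]])
metrics = binary_metrics(pred, target)
```

Because the task has a single foreground class, IoU and mIoU coincide, which is why the two rows in the table are identical.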

Conclusion: The fine-tuned SAM ViT-H dramatically outperforms the original pre-trained model on facade segmentation, with IoU increasing by 184.9% and Recall by 150.1%. False positive and false negative rates both drop by more than 55%, confirming that the fine-tuned model produces much cleaner, more accurate facade masks.
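The percentages quoted here are relative changes with respect to the baseline value. As a quick check, the snippet below recomputes a few of them from the reported scores; the helper name `relative_change` is illustrative.

```python
def relative_change(baseline, finetuned):
    """Percentage change of the fine-tuned value relative to the baseline."""
    return (finetuned - baseline) / baseline * 100.0


# (baseline, fine-tuned) pairs copied from the results reported above.
reported = {
    "IoU": (0.1534, 0.4371),
    "Recall": (0.2893, 0.7235),
    "FPR": (0.3060, 0.1348),
    "FNR": (0.6844, 0.2502),
}
for name, (base, tuned) in reported.items():
    print(f"{name}: {relative_change(base, tuned):+.1f}%")
```

For error rates such as FPR and FNR a negative change is the desired direction.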

Qualitative Improvements

  • Better boundary detection: Fine-tuned model more accurately traces facade edges
  • Reduced background leakage: Fewer sky, tree, and road pixels included in predictions
  • Improved wall texture capture: Better segmentation of facade surfaces with visible textures and designs
  • Cleaner separation: Windows, doors, and balconies are more cleanly excluded from the facade mask

Project Structure

sam_facade_project/
├── data/
│   └── cmp_facade/              # Downloaded & preprocessed dataset
│       ├── train/
│       ├── eval/
│       └── test/
├── outputs/
│   ├── baseline/                # Baseline evaluation results
│   ├── finetuned/               # Training checkpoints & history
│   ├── finetuned_eval/          # Fine-tuned evaluation results
│   ├── comparison/              # Qualitative comparison images
│   ├── inference/               # Inference examples
│   └── dataset_visualization.png
├── scripts/
│   ├── download_and_prepare_dataset.py
│   ├── dataset.py               # PyTorch Dataset + dataloader
│   ├── metrics.py               # IoU, Dice, Pixel Accuracy, etc.
│   ├── baseline_eval.py         # Baseline evaluation script
│   ├── evaluate.py              # Evaluation script (any checkpoint)
│   ├── qualitative_compare.py   # Side-by-side visual comparison
│   ├── inference.py             # Inference on new images
│   └── visualize_dataset.py     # Dataset visualization
├── notebooks/
│   └── SAM_ViT_H_Facade_Segmentation.ipynb
├── requirements.txt
└── README.md

Usage

1. Setup Environment

pip install -r requirements.txt

2. Prepare Dataset

cd scripts
python download_and_prepare_dataset.py

3. Run Baseline Evaluation

python evaluate.py --data_dir data/cmp_facade --split test --batch_size 2 --output_dir outputs/baseline

4. Evaluate Fine-tuned Model

python evaluate.py --checkpoint outputs/finetuned/best_sam_vit_h_facade.pth --data_dir data/cmp_facade --split test --output_dir outputs/finetuned_eval

5. Qualitative Comparison

python qualitative_compare.py --checkpoint outputs/finetuned/best_sam_vit_h_facade.pth --data_dir data/cmp_facade --split test --output_dir outputs/comparison

6. Inference on New Images

python inference.py --checkpoint outputs/finetuned/best_sam_vit_h_facade.pth --image path/to/image.jpg --output outputs/inference_result.png

Jupyter Notebook

The notebook notebooks/SAM_ViT_H_Facade_Segmentation.ipynb contains a complete, beginner-friendly tutorial covering:

  1. Introduction to facade segmentation
  2. Dataset setup and visualization
  3. Baseline evaluation with original SAM ViT-H
  4. Fine-tuning pipeline explanation
  5. Training loop
  6. Evaluation metrics
  7. Quantitative comparison table
  8. Qualitative comparison (success and failure cases)
  9. Inference on new images
  10. Final results and observations

Colab-friendly: All cells are self-contained; simply adjust file paths if running on Google Colab.

Methodology

Why freeze the image encoder?

SAM ViT-H's image encoder contains 637M parameters pre-trained on 11 million images. Fine-tuning it would:

  • Require >30GB GPU memory for reasonable batch sizes
  • Risk catastrophic forgetting of general segmentation knowledge
  • Take orders of magnitude longer to converge

By freezing the encoder and only fine-tuning the lightweight prompt encoder + mask decoder (~4M params), we:

  • Preserve powerful zero-shot visual features
  • Efficiently adapt to the facade domain
  • Enable training on a single GPU in under 1 hour

Why bounding box prompts?

Buildings have well-defined rectangular extents. Bounding boxes naturally capture this structure and provide a strong initialization for SAM's mask decoder. Alternative point prompts are less reliable for large, contiguous regions like facades.

Training Curves

Training history (loss, IoU, Dice, pixel accuracy) is saved to:

  • outputs/finetuned/training_history.json
  • outputs/finetuned/training_curves.png

Future Improvements

  1. Adapter Layers: Insert bottleneck adapters into frozen ViT-H blocks (as in SAM-Med2D / Med-SA) for better domain adaptation without full encoder fine-tuning
  2. Larger Datasets: Combine CMP Facade with street-view facade datasets (e.g., ECP dataset) for more diverse architectural styles
  3. Interactive Refinement: Implement iterative point-sampling from error regions (as in SAM-Med2D) for higher IoU
  4. HQ-SAM Integration: Add a high-quality output token to improve boundary precision
  5. Multi-class Segmentation: Extend from binary facade masks to multi-class facade element segmentation (windows, doors, balconies)

Citation

If you use this project, please cite:

License

This project is provided for research and educational purposes. The CMP Facade dataset follows its original license (MIT). SAM follows the Apache 2.0 license.
