SAM ViT-H Fine-Tuning for Exterior Facade Segmentation
Fine-tuning the Segment Anything Model (SAM) ViT-H variant to accurately segment exterior walls and facades of residential and commercial buildings.
Objective
Segment exterior walls and facades of houses/buildings from real-world exterior images. The fine-tuned model focuses on:
- Exterior walls and front elevations
- Side walls and major architectural surfaces
- Visible wall designs and textures belonging to the facade
While avoiding:
- Sky, trees, roads, vehicles, people
- Standalone windows, doors, and balconies
- Surrounding background unless part of the main facade region
Dataset: CMP Facade Database
- Name: CMP Facade Database
- Source: HuggingFace Dataset
- Original: Center for Machine Perception (CMP), Czech Technical University in Prague
- License: MIT
- Size: 606 rectified facade images
- Splits:
  - Train: 378 images
  - Eval (validation): 114 images
  - Test: 114 images
- Annotation Format: Multi-class pixel segmentation with 12 architectural classes
  - Class 1 = facade (our target)
  - Classes 2-12 = windows, doors, cornices, sills, balconies, etc.
- Binary Masks: Generated from class 1 only, representing the main facade surface
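The binary-mask step described above can be sketched as follows. This is a minimal illustration with NumPy, not the project's actual preprocessing script; the function name `facade_binary_mask` and the toy label map are assumptions made for the example, and the facade class index is taken from the list above.

```python
import numpy as np

FACADE_CLASS_ID = 1  # class index of "facade" as stated in this README


def facade_binary_mask(label_map, facade_id=FACADE_CLASS_ID):
    """Return a uint8 mask that is 1 where the label map equals the facade class."""
    return (np.asarray(label_map) == facade_id).astype(np.uint8)


# Toy 3x4 label map (hypothetical values): 1 = facade, anything else = other classes.
toy_labels = np.array([
    [1, 1, 3, 1],
    [1, 2, 2, 1],
    [4, 1, 1, 1],
])
toy_mask = facade_binary_mask(toy_labels)
print(toy_mask)
```

Every pixel labelled as a window, door, or other element becomes background in the binary mask, which is why those elements are excluded from the target region.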
Model: SAM ViT-H
- Checkpoint: facebook/sam-vit-huge
- Total Parameters: ~641M
- Architecture: Vision Transformer-Huge image encoder + prompt encoder + mask decoder
- Loading:

    from transformers import SamModel, SamProcessor

    model = SamModel.from_pretrained("facebook/sam-vit-huge")
    processor = SamProcessor.from_pretrained("facebook/sam-vit-huge")
Fine-Tuning Strategy
| Component | Status | Rationale |
|---|---|---|
| Image Encoder (ViT-H) | Frozen (~637M params) | Preserves powerful pre-trained visual features; fine-tuning would require massive compute and risk catastrophic forgetting |
| Prompt Encoder | Fine-tuned | Adapts prompt embeddings to facade domain |
| Mask Decoder | Fine-tuned (~4M params total trainable) | Adapts mask generation to facade-specific shapes and boundaries |
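The freezing scheme in the table can be sketched in PyTorch as below. To keep the example small and runnable without downloading the checkpoint, it uses a hypothetical stand-in module, `TinySamLike`, whose three submodules mirror the attribute names `vision_encoder`, `prompt_encoder`, and `mask_decoder` used by the Hugging Face `SamModel`. The helper names are illustrative, and the project's own training code may differ.

```python
import torch
from torch import nn


class TinySamLike(nn.Module):
    """Hypothetical stand-in with the same submodule names as the Hugging Face SamModel."""

    def __init__(self):
        super().__init__()
        self.vision_encoder = nn.Linear(64, 64)  # plays the role of the ViT-H image encoder
        self.prompt_encoder = nn.Linear(4, 8)
        self.mask_decoder = nn.Linear(8, 1)


def freeze_image_encoder(model):
    """Disable gradients for the image encoder; other submodules stay trainable."""
    for param in model.vision_encoder.parameters():
        param.requires_grad = False


def count_params(model):
    """Return (trainable, frozen) parameter counts."""
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    frozen = sum(p.numel() for p in model.parameters() if not p.requires_grad)
    return trainable, frozen


model = TinySamLike()
freeze_image_encoder(model)
trainable, frozen = count_params(model)
print(f"trainable={trainable}, frozen={frozen}")

# Only the parameters that still require gradients are handed to the optimizer.
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)
```

Applied to the real checkpoint, the same two helpers should report roughly the split given in the table (about 637M frozen, about 4M trainable), assuming the submodule names match.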
Prompt Strategy: Bounding boxes generated from ground-truth facade masks (with optional jitter for augmentation during training).
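One way to derive such a box prompt from a ground-truth mask is sketched below. The helper names `mask_to_box` and `jitter_box` are hypothetical, and the clipping and corner-reordering behaviour are assumptions of this sketch; the project's dataset code may handle these details differently.

```python
import numpy as np


def mask_to_box(mask):
    """Tight (x_min, y_min, x_max, y_max) box around the nonzero pixels of a binary mask."""
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        raise ValueError("mask has no foreground pixels")
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())


def jitter_box(box, height, width, max_shift=10, rng=None):
    """Shift each coordinate by a random integer in [-max_shift, max_shift], clipped to the image."""
    rng = rng if rng is not None else np.random.default_rng()
    x0, y0, x1, y1 = box
    dx0, dy0, dx1, dy1 = rng.integers(-max_shift, max_shift + 1, size=4)
    x0 = int(np.clip(x0 + dx0, 0, width - 1))
    y0 = int(np.clip(y0 + dy0, 0, height - 1))
    x1 = int(np.clip(x1 + dx1, 0, width - 1))
    y1 = int(np.clip(y1 + dy1, 0, height - 1))
    # Keep the box well-formed even if jitter made the corners cross.
    return min(x0, x1), min(y0, y1), max(x0, x1), max(y0, y1)


# Toy 100x120 mask with a rectangular facade region.
demo_mask = np.zeros((100, 120), dtype=np.uint8)
demo_mask[20:60, 30:90] = 1
demo_box = mask_to_box(demo_mask)
demo_jittered = jitter_box(demo_box, height=100, width=120, rng=np.random.default_rng(0))
print(demo_box, demo_jittered)
```

Jitter would normally be applied only during training, so that evaluation uses the exact ground-truth box.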
Loss: Binary Cross-Entropy between predicted logits and ground-truth binary masks at SAM's native 256x256 output resolution.
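The loss can be illustrated with a small NumPy sketch. It computes the same quantity as PyTorch's `torch.nn.functional.binary_cross_entropy_with_logits` with mean reduction, written out explicitly so the formula is visible. The nearest-neighbour resize of the ground-truth mask to 256x256 is an assumption of this sketch, as are the function names; it also ignores the resize-and-pad preprocessing that SAM applies to non-square images.

```python
import numpy as np


def resize_mask_nearest(mask, size=256):
    """Nearest-neighbour resize of a 2D binary mask to (size, size)."""
    h, w = mask.shape
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return mask[rows[:, None], cols[None, :]]


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy computed from raw logits in a numerically stable form."""
    logits = np.asarray(logits, dtype=np.float64)
    targets = np.asarray(targets, dtype=np.float64)
    # Equivalent to -[y * log(sigmoid(x)) + (1 - y) * log(1 - sigmoid(x))].
    per_pixel = np.maximum(logits, 0) - logits * targets + np.log1p(np.exp(-np.abs(logits)))
    return float(per_pixel.mean())


# Toy example: a 1024x1024 ground-truth mask and uninformative (all-zero) logits.
gt_full = np.zeros((1024, 1024), dtype=np.uint8)
gt_full[256:768, 128:896] = 1
gt_256 = resize_mask_nearest(gt_full, size=256)
demo_loss = bce_with_logits(np.zeros((256, 256)), gt_256)
print(gt_256.shape, demo_loss)
```

With all-zero logits every pixel is predicted at probability 0.5, so the loss equals ln 2 regardless of the target; a trained decoder should drive it well below that.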
Optimizer: Adam, lr=1e-4, StepLR decay at epoch 8 (gamma=0.1)
Training Configuration:
- Epochs: 15
- Batch size: 4
- Augmentation: Bounding box jitter +/-10px
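The learning-rate schedule implied by these settings can be written out explicitly. The sketch below assumes that "decay at epoch 8" corresponds to `torch.optim.lr_scheduler.StepLR` with `step_size=8` and `gamma=0.1`, and that epochs are counted from 0; the function name `step_lr` is illustrative.

```python
def step_lr(epoch, base_lr=1e-4, step_size=8, gamma=0.1):
    """Learning rate at a 0-indexed epoch under a step decay schedule."""
    return base_lr * gamma ** (epoch // step_size)


# Learning rate for each of the 15 training epochs.
schedule = [step_lr(epoch) for epoch in range(15)]
for epoch, lr in enumerate(schedule):
    print(f"epoch {epoch:2d}: lr = {lr:.0e}")
```

Under these assumptions the first 8 epochs run at 1e-4 and the remaining 7 at 1e-5.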
Results
Baseline vs Fine-tuned (Test Set)
| Metric | Original SAM ViT-H | Fine-tuned SAM ViT-H | Improvement |
|---|---|---|---|
| IoU | 0.1534 | 0.4371 | +184.9% |
| mIoU | 0.1534 | 0.4371 | +184.9% |
| Dice Score | 0.2398 | 0.5677 | +136.8% |
| Pixel Accuracy | 0.5667 | 0.8580 | +51.4% |
| Precision | 0.3897 | 0.5508 | +41.3% |
| Recall | 0.2893 | 0.7235 | +150.1% |
| FPR | 0.3060 | 0.1348 | -55.9% (better) |
| FNR | 0.6844 | 0.2502 | -63.4% (better) |
Conclusion: The fine-tuned SAM ViT-H substantially outperforms the original pre-trained model on the facade test set. IoU rises from 0.1534 to 0.4371 (+184.9%) and recall from 0.2893 to 0.7235 (+150.1%). False positive and false negative rates both drop by more than 55%, indicating that the fine-tuned model produces cleaner and more complete facade masks.
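For reference, the metrics in the table can all be derived from the four confusion counts of a binary mask, and the "Improvement" column is a relative change against the baseline. The sketch below is a minimal NumPy version under that reading; the function names are illustrative, and the project's `metrics.py` may aggregate differently (for example, averaging per image rather than pooling pixels).

```python
import numpy as np


def binary_metrics(pred, target, eps=1e-7):
    """Compute segmentation metrics from two binary masks of the same shape."""
    pred = np.asarray(pred).astype(bool)
    target = np.asarray(target).astype(bool)
    tp = float(np.sum(pred & target))
    fp = float(np.sum(pred & ~target))
    fn = float(np.sum(~pred & target))
    tn = float(np.sum(~pred & ~target))
    return {
        "iou": tp / (tp + fp + fn + eps),
        "dice": 2 * tp / (2 * tp + fp + fn + eps),
        "pixel_accuracy": (tp + tn) / (tp + tn + fp + fn + eps),
        "precision": tp / (tp + fp + eps),
        "recall": tp / (tp + fn + eps),
        "fpr": fp / (fp + tn + eps),
        "fnr": fn / (fn + tp + eps),
    }


def relative_change(baseline, finetuned):
    """Percentage change of the fine-tuned value relative to the baseline."""
    return 100.0 * (finetuned - baseline) / baseline


# Toy 2x2 masks with one pixel in each confusion cell (TP, FP, FN, TN).
toy_metrics = binary_metrics([[1, 1], [0, 0]], [[1, 0], [1, 0]])
print(toy_metrics)

# Relative IoU change using the values reported in the table.
iou_change = relative_change(0.1534, 0.4371)
print(f"{iou_change:+.1f}%")
```

For a single foreground class, IoU and mIoU coincide, which is consistent with the identical rows in the table.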
Qualitative Improvements
- Better boundary detection: Fine-tuned model more accurately traces facade edges
- Reduced background leakage: Fewer sky, tree, and road pixels included in predictions
- Improved wall texture capture: Better segmentation of facade surfaces with visible textures and designs
- Cleaner separation: Windows, doors, and balconies are more cleanly excluded from the facade mask
Project Structure
sam_facade_project/
├── data/
│   └── cmp_facade/                  # Downloaded & preprocessed dataset
│       ├── train/
│       ├── eval/
│       └── test/
├── outputs/
│   ├── baseline/                    # Baseline evaluation results
│   ├── finetuned/                   # Training checkpoints & history
│   ├── finetuned_eval/              # Fine-tuned evaluation results
│   ├── comparison/                  # Qualitative comparison images
│   ├── inference/                   # Inference examples
│   └── dataset_visualization.png
├── scripts/
│   ├── download_and_prepare_dataset.py
│   ├── dataset.py                   # PyTorch Dataset + dataloader
│   ├── metrics.py                   # IoU, Dice, Pixel Accuracy, etc.
│   ├── baseline_eval.py             # Baseline evaluation script
│   ├── evaluate.py                  # Evaluation script (any checkpoint)
│   ├── qualitative_compare.py       # Side-by-side visual comparison
│   ├── inference.py                 # Inference on new images
│   └── visualize_dataset.py         # Dataset visualization
├── notebooks/
│   └── SAM_ViT_H_Facade_Segmentation.ipynb
├── requirements.txt
└── README.md
Usage
1. Setup Environment
pip install -r requirements.txt
2. Prepare Dataset
cd scripts
python download_and_prepare_dataset.py
3. Run Baseline Evaluation
python evaluate.py --data_dir data/cmp_facade --split test --batch_size 2 --output_dir outputs/baseline
4. Evaluate Fine-tuned Model
python evaluate.py --checkpoint outputs/finetuned/best_sam_vit_h_facade.pth --data_dir data/cmp_facade --split test --output_dir outputs/finetuned_eval
5. Qualitative Comparison
python qualitative_compare.py --checkpoint outputs/finetuned/best_sam_vit_h_facade.pth --data_dir data/cmp_facade --split test --output_dir outputs/comparison
6. Inference on New Images
python inference.py --checkpoint outputs/finetuned/best_sam_vit_h_facade.pth --image path/to/image.jpg --output outputs/inference_result.png
Jupyter Notebook
The notebook notebooks/SAM_ViT_H_Facade_Segmentation.ipynb contains a complete, beginner-friendly tutorial covering:
- Introduction to facade segmentation
- Dataset setup and visualization
- Baseline evaluation with original SAM ViT-H
- Fine-tuning pipeline explanation
- Training loop
- Evaluation metrics
- Quantitative comparison table
- Qualitative comparison (success and failure cases)
- Inference on new images
- Final results and observations
Colab-friendly: All cells are self-contained; simply adjust file paths if running on Google Colab.
Methodology
Why freeze the image encoder?
SAM ViT-H's image encoder contains 637M parameters pre-trained on 11 million images. Fine-tuning it would:
- Require >30GB GPU memory for reasonable batch sizes
- Risk catastrophic forgetting of general segmentation knowledge
- Take orders of magnitude longer to converge
By freezing the encoder and only fine-tuning the lightweight prompt encoder + mask decoder (~4M params), we:
- Preserve powerful zero-shot visual features
- Efficiently adapt to the facade domain
- Enable training on a single GPU in under 1 hour
Why bounding box prompts?
Buildings typically have roughly rectangular extents, so a bounding box captures the overall shape of the facade and gives SAM's mask decoder a strong spatial prior. Point prompts are less reliable for large, contiguous regions like facades, where a single click is ambiguous about how much of the surface to include.
Training Curves
Training history (loss, IoU, Dice, pixel accuracy) is saved to:
- outputs/finetuned/training_history.json
- outputs/finetuned/training_curves.png
Future Improvements
- Adapter Layers: Insert bottleneck adapters into frozen ViT-H blocks (as in SAM-Med2D / Med-SA) for better domain adaptation without full encoder fine-tuning
- Larger Datasets: Combine CMP Facade with street-view facade datasets (e.g., ECP dataset) for more diverse architectural styles
- Interactive Refinement: Implement iterative point-sampling from error regions (as in SAM-Med2D) for higher IoU
- HQ-SAM Integration: Add a high-quality output token to improve boundary precision
- Multi-class Segmentation: Extend from binary facade masks to multi-class facade element segmentation (windows, doors, balconies)
Citation
If you use this project, please cite:
- SAM: Kirillov et al., 2023
- CMP Facade: Tylecek & Sara, 2013
License
This project is provided for research and educational purposes. The CMP Facade dataset follows its original license (MIT). SAM follows the Apache 2.0 license.