WHU Building Detection — EfficientNet-B4 + UNet++

A semantic segmentation model for building detection in high-resolution aerial imagery, trained on the WHU Building Dataset.

Model Description

Property	Value
Architecture	UNet++
Encoder	EfficientNet-B4 (ImageNet pretrained)
Framework	segmentation-models-pytorch (SMP)
Training Framework	PyTorch Lightning
Input	3-channel RGB, 512x512 tiles
Output	2-class mask (Background=0, Building=1)
Parameters	~20.8M
Model Size	~84 MB
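
The parameter count and the file size above can be cross-checked against each other. The sketch below assumes the checkpoint stores 32-bit floats (4 bytes per weight); that dtype is an assumption, since it is not stated here.

```python
# Rough checkpoint size estimate. Assumption: float32 weights (4 bytes each);
# the actual dtype of the checkpoint is not stated in this card.
params = 20.8e6           # ~20.8M parameters, from the table above
bytes_per_param = 4       # assumed float32

size_mb = params * bytes_per_param / 1e6   # decimal megabytes
print(f"{size_mb:.1f} MB")
```

Under that assumption the weights alone come to roughly 83 MB, close to the ~84 MB listed. The small remainder would plausibly be non-parameter buffers (such as batch-norm running statistics) and file overhead.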

Performance

Evaluated on the WHU Building Dataset test split (1,228 tiles). Best Val IoU is the highest IoU reached on the validation split during training:

Metric	Score
IoU	0.9054
Dice	0.9503
Best Val IoU	0.9434
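
For readers unfamiliar with the two metrics, here is a minimal sketch of how IoU and Dice can be computed for the Building class from a pair of 0/1 masks. The helper name `iou_and_dice` and the toy masks are illustrative only; this is not the evaluation code used for the scores above, and how those scores were averaged (per image or over the whole split) is not stated here.

```python
import numpy as np

def iou_and_dice(pred, target):
    """IoU and Dice for the Building class (label 1) of two 0/1 masks."""
    pred = np.asarray(pred).astype(bool)
    target = np.asarray(target).astype(bool)
    intersection = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    total = pred.sum() + target.sum()
    # Convention chosen here: two empty masks count as a perfect match.
    iou = intersection / union if union else 1.0
    dice = 2 * intersection / total if total else 1.0
    return float(iou), float(dice)

# Toy 2x3 masks, not real data.
pred = [[1, 1, 0], [0, 1, 0]]
target = [[1, 0, 0], [0, 1, 1]]
iou, dice = iou_and_dice(pred, target)

# For a single confusion matrix, Dice = 2 * IoU / (1 + IoU).
dice_from_iou = 2 * 0.9054 / (1 + 0.9054)
```

Applying that identity to the reported IoU of 0.9054 gives a Dice of about 0.950, which agrees with the reported 0.9503 to within rounding of the inputs.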

Training Details

Dataset: WHU Building Dataset — 5,732 training tiles (512x512 RGB at 0.3m resolution)
Validation split: 20% of training data
Optimizer: AdamW (lr=1e-4, weight_decay=1e-4)
Loss: CrossEntropyLoss
Epochs: 36 (early stopping, patience=10)
Batch size: 16
GPU: NVIDIA RTX 6000 Ada (48GB)
Encoder weights: ImageNet pretrained
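
The optimizer and loss settings listed above translate into plain PyTorch roughly as follows. This is a sketch of a single training step, not the original training script: the one-layer `model` is a stand-in for the real network, and the batch is random data.

```python
import torch
from torch import nn

# Stand-in for the real network: any module mapping (N, 3, H, W) images
# to (N, 2, H, W) logits works for this illustration.
model = nn.Conv2d(3, 2, kernel_size=3, padding=1)

# Settings taken from the list above.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-4)
criterion = nn.CrossEntropyLoss()

# One dummy batch: random images and random 0/1 masks (not real data).
images = torch.rand(4, 3, 64, 64)
masks = torch.randint(0, 2, (4, 64, 64))   # int64 class indices, no channel dim

logits = model(images)                      # shape (4, 2, 64, 64)
loss = criterion(logits, masks)             # scalar, averaged over all pixels

optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Note that `CrossEntropyLoss` takes raw logits and integer class masks, which is why the model outputs two channels rather than a single sigmoid channel. Since training used PyTorch Lightning, the early stopping was presumably handled by its `EarlyStopping` callback with `patience=10`; the monitored metric name is not given here.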

Quick Start

Installation

pip install geoai-py timm segmentation-models-pytorch

Inference with GeoAI

import geoai

# Run building detection on a GeoTIFF
geoai.timm_segmentation_from_hub(
    input_path="input_image.tif",
    output_path="building_prediction.tif",
    repo_id="giswqs/whu-building-unetplusplus-efficientnet-b4",
    window_size=512,
    overlap=256,
    batch_size=4,
)

# Vectorize to building footprints
gdf = geoai.orthogonalize(
    input_path="building_prediction.tif",
    output_path="building_footprints.geojson",
    epsilon=2.0,
)
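
To give a feel for what `window_size=512` and `overlap=256` imply, the sketch below computes window start offsets along one image axis with a stride of `window_size - overlap`. The helper `window_starts` is hypothetical and only illustrates generic sliding-window tiling; how GeoAI actually tiles the raster and merges overlapping predictions is not described here.

```python
def window_starts(length, window_size=512, overlap=256):
    """Start offsets of windows covering `length` pixels along one axis."""
    stride = window_size - overlap
    if stride <= 0:
        raise ValueError("overlap must be smaller than window_size")
    if length <= window_size:
        return [0]
    starts = list(range(0, length - window_size + 1, stride))
    # Add a final window flush with the edge if the stride left a gap.
    if starts[-1] + window_size < length:
        starts.append(length - window_size)
    return starts

# A 1200-pixel-wide image needs four windows along that axis.
starts = window_starts(1200)
print(starts)  # [0, 256, 512, 688]
```

With a 50% overlap, most pixels are covered by more than one window, which tends to reduce artifacts at tile borders at the cost of more forward passes.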

Manual Loading

import json
import torch
import segmentation_models_pytorch as smp

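# Load config (not used below; the architecture settings are written out explicitly)
with open("config.json") as f:
    config = json.load(f)

# Create model with the same architecture as the checkpoint
model = smp.UnetPlusPlus(
    encoder_name="efficientnet-b4",
    encoder_weights=None,  # skip the ImageNet download; weights come from model.pth
    in_channels=3,
    classes=2,
)

# Load weights on CPU and switch to inference mode
state_dict = torch.load("model.pth", map_location="cpu")
model.load_state_dict(state_dict)
model.eval()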
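
Once the model is loaded, turning its output into a mask is an argmax over the two class channels. The sketch below shows that step with a hypothetical helper, `predict_mask`, and a one-layer stand-in network so it runs without the checkpoint. The input normalization the model expects is not stated here, so the random tile is fed in unnormalized purely for illustration.

```python
import torch
from torch import nn

def predict_mask(model, tile):
    """Return an (H, W) mask of class indices for one (3, H, W) float tile."""
    model.eval()
    with torch.no_grad():
        logits = model(tile.unsqueeze(0))       # (1, 2, H, W)
    return logits.argmax(dim=1).squeeze(0)      # (H, W), values 0 or 1

# Stand-in network so the sketch runs without the checkpoint; pass the
# loaded UNet++ `model` instead in practice.
stand_in = nn.Conv2d(3, 2, kernel_size=1)
tile = torch.rand(3, 512, 512)                  # random tile, not real imagery
mask = predict_mask(stand_in, tile)
```

In the resulting mask, 0 marks Background and 1 marks Building, matching the output description in the model table.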

Example Notebook

See the full inference notebook with visualization and analysis:

Dataset

The WHU Building Dataset consists of aerial imagery at 0.3m resolution with binary building masks:

Train: 5,732 tiles (512x512 RGB)
Val: 1,228 tiles
Test: 1,228 tiles
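
The 0.3 m resolution makes it straightforward to convert pixel counts into ground measurements. The sketch below is illustrative: `building_area_m2` is a hypothetical helper, and it assumes square pixels of exactly 0.3 m on each side.

```python
GSD_M = 0.3        # ground sampling distance in metres per pixel
TILE_PX = 512      # tile side length in pixels

tile_side_m = TILE_PX * GSD_M      # about 153.6 m per tile side
pixel_area_m2 = GSD_M ** 2         # about 0.09 square metres per pixel

def building_area_m2(building_pixels):
    """Approximate ground area covered by `building_pixels` mask pixels."""
    return building_pixels * pixel_area_m2

area = building_area_m2(1000)      # roughly 90 square metres
```

Each tile therefore covers a square of roughly 154 m on a side, and a predicted building of 1,000 pixels corresponds to about 90 square metres on the ground.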

Reference

Ji, S., Wei, S., & Lu, M. (2019). Fully Convolutional Networks for Multisource Building Extraction From an Open Aerial and Satellite Imagery Data Set. IEEE Transactions on Geoscience and Remote Sensing, 57(1), 574-586.

License

This model is released under the Apache 2.0 License.
