Model Card: SegFormer-B0 Kathmandu Valley Satellite Segmentation

Model Description

This model is a fine-tuned SegFormer-B0 for semantic segmentation of satellite imagery over Kathmandu Valley, Nepal. It assigns each pixel to one of six annotated classes: Background, Residential Area, Road, River, Forest, and Unused Land. The classification head has 7 output channels; the seventh (index 6) is reserved and has no annotated class. The model is intended for urban planning, GIS analysis, and geospatial research applications.

  • Developed by: praniil
  • Model type: Semantic Segmentation (SegFormer-B0)
  • Language(s) (NLP): N/A (Computer Vision)
  • License: MIT
  • Finetuned from model: nvidia/mit-b0 (SegFormer-B0 pretrained on ImageNet)

Model Sources

  • Repository: https://github.com/praniil/satellite-image-segmentation


Uses

Direct Use

This model can be used out-of-the-box for satellite image segmentation over Kathmandu Valley or similar urban/semi-urban landscapes. It accepts a 512×512 RGB satellite image and outputs a per-pixel land-use classification mask.

import torch
import numpy as np
from PIL import Image
import matplotlib.pyplot as plt
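from transformers import SegformerForSemanticSegmentation, SegformerImageProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"

HF_REPO = "Pranilllllll/segformer-satellite-segementation"
model = SegformerForSemanticSegmentation.from_pretrained(HF_REPO).to(device)
# SegformerImageProcessor replaces the deprecated SegformerFeatureExtractor
processor = SegformerImageProcessor.from_pretrained(HF_REPO)
model.eval()

# Load the tile and apply the resizing/normalization stored with the model
image = Image.open("path_to_your_satellite_image.png").convert("RGB")
inputs = processor(images=image, return_tensors="pt")
pixel_values = inputs["pixel_values"].to(device)

with torch.no_grad():
    outputs = model(pixel_values=pixel_values)
    logits = outputs.logits  # [1, 7, H/4, W/4]: SegFormer predicts at 1/4 of the input resolution

# Upsample the logits to the original image size before taking the per-pixel argmax
logits = torch.nn.functional.interpolate(
    logits, size=image.size[::-1], mode="bilinear", align_corners=False
)
pred_mask = torch.argmax(logits, dim=1).squeeze(0).cpu().numpy()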

colors = np.array([
    [0, 0, 0],       # Background
    [128, 0, 0],     # Residential Area
    [0, 128, 0],     # Road
    [0, 0, 128],     # River
    [0, 128, 128],   # Forest
    [128, 128, 0],   # Unused Land
    [128, 0, 128],   # (reserved)
], dtype=np.uint8)

seg_image = colors[pred_mask]
plt.imsave("prediction.png", seg_image)
print("Inference complete. Prediction saved as prediction.png")

Downstream Use

This model can be plugged into larger GIS pipelines for:

  • Automated land-use/land-cover (LULC) mapping
  • Urban sprawl analysis
  • River and forest change detection
  • Input feature generation for spatial planning models
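
As one illustration of the LULC mapping use case, a predicted mask can be reduced to per-class area shares. This is a minimal sketch rather than part of the released code: the function name `class_area_fractions` and the `CLASS_NAMES` list are assumptions made for illustration, and the class order is taken from the Annotation Classes table.

```python
import numpy as np

# Class order follows the Annotation Classes table; index 6 is the reserved channel.
CLASS_NAMES = ["Background", "Residential Area", "Road", "River",
               "Forest", "Unused Land", "(reserved)"]

def class_area_fractions(pred_mask, num_classes=7):
    """Return the share of pixels assigned to each class ID in a 2-D mask."""
    counts = np.bincount(pred_mask.ravel(), minlength=num_classes)
    return counts / counts.sum()

# Toy 2x4 mask standing in for a real prediction.
toy_mask = np.array([[0, 1, 1, 4],
                     [4, 4, 4, 2]])
fractions = class_area_fractions(toy_mask)
for name, share in zip(CLASS_NAMES, fractions):
    print(f"{name}: {share:.3f}")
```

Converting these shares to ground area would additionally require the ground sampling distance of the tiles, which this card does not state.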

Out-of-Scope Use

  • Not suitable for segmenting non-satellite imagery (street photos, drone footage with different resolution/angle).
  • Performance may degrade on satellite imagery from regions with significantly different land cover patterns than Kathmandu Valley.
  • Not suitable for fine-grained object detection within classes (e.g., identifying individual buildings).

Bias, Risks, and Limitations

  • Geographic bias: Trained exclusively on Kathmandu Valley tiles; may not generalize to other geographies.
  • Class imbalance: Despite weighted loss, rare classes (Road, River) may have lower per-class IoU.
  • Resolution dependency: Expects 512×512 input tiles; other resolutions require resizing and may affect accuracy.
  • Annotation noise: Manual annotations via CVAT may have some boundary ambiguity between classes.
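
For scenes larger than 512×512, one way to work within the resolution dependency is to cut the scene into 512×512 tiles instead of resizing it as a whole. The sketch below is an assumption about how that could be done, not the author's pipeline: `tile_image` is a hypothetical helper, and zero-padding the right and bottom edges is one choice among several.

```python
import numpy as np

def tile_image(image, tile=512):
    """Split an H x W x C array into non-overlapping tile x tile patches.

    The right and bottom edges are zero-padded so every patch is full size.
    Returns the patches and the (row, col) pixel offset of each one.
    """
    h, w = image.shape[:2]
    pad_h = (-h) % tile
    pad_w = (-w) % tile
    padded = np.pad(image, ((0, pad_h), (0, pad_w), (0, 0)), mode="constant")
    patches, offsets = [], []
    for top in range(0, padded.shape[0], tile):
        for left in range(0, padded.shape[1], tile):
            patches.append(padded[top:top + tile, left:left + tile])
            offsets.append((top, left))
    return patches, offsets

# A blank 700 x 1100 scene standing in for a real satellite image.
scene = np.zeros((700, 1100, 3), dtype=np.uint8)
patches, offsets = tile_image(scene)
print(len(patches), patches[0].shape)
```

The offsets make it possible to paste each predicted tile mask back into a full-scene mask and crop away the padding.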

Recommendations

Validate predictions on your specific region before using results for critical planning decisions. Cross-checking against GIS datasets (e.g., OpenStreetMap) is recommended.


How to Get Started with the Model

Install dependencies:

pip install torch transformers numpy Pillow matplotlib

Then use the inference script in the Direct Use section above.


Training Details

Training Data

A custom dataset was built from satellite imagery of Kathmandu Valley, Nepal, divided into a grid of tiles.

  • Total images: ~400 tiles
  • Resolution: 512 × 512 pixels
  • Annotation tool: CVAT
  • Task: Multi-class semantic segmentation

Annotation Classes

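| Class ID | Class Name       | RGB Color     |
|----------|------------------|---------------|
| 0        | Background       | (0, 0, 0)     |
| 1        | Residential Area | (128, 0, 0)   |
| 2        | Road             | (0, 128, 0)   |
| 3        | River            | (0, 0, 128)   |
| 4        | Forest           | (0, 128, 128) |
| 5        | Unused Land      | (128, 128, 0) |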
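
If annotation masks are stored as RGB images in the colors listed above, they have to be converted to class-ID maps before training. The following is a minimal sketch under that assumption: `rgb_to_class_ids` is a hypothetical helper, and mapping unlisted colors to Background is a choice made here, not something the card specifies.

```python
import numpy as np

# RGB colors from the Annotation Classes table, indexed by class ID.
PALETTE = np.array([
    [0, 0, 0],       # 0 Background
    [128, 0, 0],     # 1 Residential Area
    [0, 128, 0],     # 2 Road
    [0, 0, 128],     # 3 River
    [0, 128, 128],   # 4 Forest
    [128, 128, 0],   # 5 Unused Land
], dtype=np.uint8)

def rgb_to_class_ids(rgb_mask):
    """Map an H x W x 3 color mask to an H x W array of class IDs.

    Pixels whose color is not in PALETTE stay 0 (Background); that fallback
    is an assumption of this sketch.
    """
    class_ids = np.zeros(rgb_mask.shape[:2], dtype=np.int64)
    for class_id, color in enumerate(PALETTE):
        matches = np.all(rgb_mask == color, axis=-1)
        class_ids[matches] = class_id
    return class_ids

# Toy 2x2 color mask covering four of the classes.
rgb = np.array([[[128, 0, 0], [0, 128, 0]],
                [[0, 0, 128], [0, 128, 128]]], dtype=np.uint8)
ids = rgb_to_class_ids(rgb)
print(ids)
```

Indexing the palette with the ID map (`PALETTE[ids]`) performs the reverse conversion, which is the same trick the inference script uses to colorize predictions.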

Training Procedure

Preprocessing

  • Images resized to 512 × 512
  • Standard ImageNet normalization via SegformerFeatureExtractor

Data Augmentation

Applied using albumentations:

  • Horizontal and vertical flips
  • Random 90-degree rotations
  • Resize to 512 × 512
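
In segmentation, every geometric transform has to be applied identically to the image and its mask, otherwise the labels no longer line up with the pixels. The sketch below shows that idea in plain NumPy for flips and 90-degree rotations. It is a stand-in for illustration, not the albumentations pipeline used in training: `augment_pair` and the 0.5 flip probabilities are assumptions.

```python
import numpy as np

def augment_pair(image, mask, rng):
    """Apply the same random flips and 90-degree rotation to image and mask."""
    if rng.random() < 0.5:            # horizontal flip
        image, mask = image[:, ::-1], mask[:, ::-1]
    if rng.random() < 0.5:            # vertical flip
        image, mask = image[::-1], mask[::-1]
    k = int(rng.integers(0, 4))       # number of 90-degree turns
    image = np.rot90(image, k, axes=(0, 1))
    mask = np.rot90(mask, k, axes=(0, 1))
    return np.ascontiguousarray(image), np.ascontiguousarray(mask)

rng = np.random.default_rng(0)
image = np.arange(4 * 4 * 3).reshape(4, 4, 3)
mask = image[..., 0] // 3             # one label per pixel, tied to the image content
aug_image, aug_mask = augment_pair(image, mask, rng)
print(aug_image.shape, aug_mask.shape)
```

Because the toy mask is derived from the image, checking that `aug_image[..., 0] // 3` still equals `aug_mask` confirms the two stayed aligned.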

Training Hyperparameters

| Hyperparameter   | Value                               |
|------------------|-------------------------------------|
| Input size       | 512 × 512                           |
| Batch size       | 16                                  |
| Optimizer        | AdamW                               |
| Learning rate    | 3e-5                                |
| Loss function    | Weighted Cross-Entropy              |
| Epochs           | 300 (early stopping, patience=25)   |
| Cross-validation | 3-fold                              |
| Training regime  | bf16 mixed precision                |
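
The early-stopping rule in the table can be sketched as a small counter. The card does not say which quantity is monitored, so this example assumes a validation score where higher is better (for instance validation mIoU); the `EarlyStopping` class and the scores in the toy run are made up for illustration.

```python
class EarlyStopping:
    """Stop training when the monitored score has not improved for `patience` epochs."""

    def __init__(self, patience=25):
        self.patience = patience
        self.best_score = float("-inf")
        self.epochs_without_improvement = 0

    def step(self, score):
        """Record one epoch's validation score; return True if training should stop."""
        if score > self.best_score:
            self.best_score = score
            self.epochs_without_improvement = 0
        else:
            self.epochs_without_improvement += 1
        return self.epochs_without_improvement >= self.patience

# Toy run with patience=2 and made-up validation scores.
stopper = EarlyStopping(patience=2)
stopped_at = None
for epoch, score in enumerate([0.40, 0.45, 0.44, 0.43, 0.50]):
    if stopper.step(score):
        stopped_at = epoch
        break
print(stopped_at, stopper.best_score)
```

With patience=25 as in the table, training would end once 25 consecutive epochs pass without a new best score, or at epoch 300, whichever comes first.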

Class Imbalance Handling

Inverse-frequency class weights were computed from the training set and applied to the cross-entropy loss so that rare classes (Road, River) are not outweighed by frequent classes during training.
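
A minimal sketch of how such weights could be computed from class-ID masks follows. The exact formula used in training is not given in this card, so the normalization (weights averaging 1 over the classes that occur) and the function name `inverse_frequency_weights` are assumptions.

```python
import numpy as np

def inverse_frequency_weights(masks, num_classes):
    """Compute class weights proportional to 1 / pixel frequency.

    Weights are normalized to average 1 over the classes that occur;
    classes absent from `masks` get weight 0.
    """
    counts = np.zeros(num_classes, dtype=np.float64)
    for mask in masks:
        counts += np.bincount(mask.ravel(), minlength=num_classes)
    freq = counts / counts.sum()
    weights = np.zeros(num_classes, dtype=np.float64)
    present = freq > 0
    weights[present] = 1.0 / freq[present]
    weights[present] /= weights[present].mean()
    return weights

# Toy dataset: class 0 covers 6 of 8 pixels, classes 1 and 2 one pixel each.
masks = [np.array([[0, 0, 0, 0], [0, 0, 1, 2]])]
weights = inverse_frequency_weights(masks, num_classes=3)
print(weights)
```

The resulting array can be converted to a tensor and passed as the `weight` argument of `torch.nn.CrossEntropyLoss`.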


Evaluation

Metrics

  • Mean IoU (mIoU): Primary metric; overlap between predicted and ground-truth masks, averaged across all classes.
  • Per-class IoU: Segmentation accuracy per land-use category.
  • Qualitative inspection: Visual comparison of predicted vs. ground truth masks.
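
Per-class IoU and mIoU can be computed from a confusion matrix, as in the sketch below. It is an illustration, not the evaluation code behind this card: `per_class_iou` is a hypothetical helper, and skipping classes that appear in neither mask when averaging is an assumption.

```python
import numpy as np

def per_class_iou(pred, target, num_classes):
    """Return IoU for each class; classes absent from both masks are NaN."""
    pred = pred.ravel()
    target = target.ravel()
    # Confusion matrix: rows are ground-truth classes, columns are predictions.
    confusion = np.bincount(target * num_classes + pred,
                            minlength=num_classes ** 2).reshape(num_classes, num_classes)
    intersection = np.diag(confusion).astype(np.float64)
    union = confusion.sum(axis=0) + confusion.sum(axis=1) - intersection
    iou = np.full(num_classes, np.nan)
    valid = union > 0
    iou[valid] = intersection[valid] / union[valid]
    return iou

# Toy masks: one class-0 pixel is mispredicted as class 1; class 2 never occurs.
target = np.array([[0, 0, 1, 1],
                   [0, 0, 1, 1]])
pred = np.array([[0, 0, 1, 1],
                 [0, 1, 1, 1]])
iou = per_class_iou(pred, target, num_classes=3)
miou = np.nanmean(iou)
print(iou, miou)
```

In the toy case class 0 scores 3/4, class 1 scores 4/5, and the mean over the two present classes is 0.775.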

Results

Cross-validation results are reported as mean ± standard deviation of mIoU across 3 folds. Training curves (loss, mIoU, gradient norm) are available in the eval_plots/ directory.

The gradient norm stayed stable throughout training, which suggests the MiT encoder trained without vanishing-gradient problems.


Model Architecture

  • Backbone: SegFormer-B0 (nvidia/mit-b0)
  • Encoder: MiT (Mix Transformer): hierarchical global context without positional encoding
  • Decoder: Lightweight MLP head: per-pixel class probability predictions
  • Output: 7-channel logits at 1/4 of the input resolution (128×128 for a 512×512 input), upsampled to a 512×512 per-pixel class mask

Environmental Impact

  • Hardware Type: CUDA-enabled GPU
  • Cloud Provider: Not applicable (local training)
  • Compute Region: Nepal
  • Carbon Emitted: Not measured

Citation

@misc{praniil2024kathmandu-segmentation,
  author       = {praniil},
  title        = {Kathmandu Valley Satellite Image Segmentation with SegFormer-B0},
  year         = {2024},
  publisher    = {GitHub},
  howpublished = {\url{https://github.com/praniil/satellite-image-segmentation}},
}

Model Card Authors

praniil

Model Card Contact

Open an issue at https://github.com/praniil/satellite-image-segmentation/issues

Model size: 3.72M params (Safetensors, tensor type F32)