# 🌵 SegFormer-B2: Offroad Desert Semantic Segmentation

YOLO Pune Hackathon 2025 · Duality AI × MIT WPU

Fine-tuned `nvidia/mit-b2` on synthetic desert imagery from Duality AI's Falcon simulation platform. The model segments every pixel of a desert scene into one of 10 classes. It was trained on Desert A and evaluated on a completely unseen Desert B location, making this a domain-shift challenge.


## 📊 Results

| Split | mIoU | Notes |
|---|---|---|
| Val (Desert A holdout) | 0.6293 | 317 images, dedicated Duality AI split |
| Test (Desert B, raw) | 0.2843 | 1,002 images, different location |
| Test (Desert B, corrected) | 0.4061 | 7 present classes only; Flowers, Logs, Ground Clutter absent in Desert B |

### Per-class IoU (Test Set, Desert B)

| Class | Val IoU | Test IoU | Present in Desert B |
|---|---|---|---|
| 🌳 Trees | 0.8572 | 0.3896 | ✅ |
| 🌿 Lush Bushes | 0.6937 | 0.0003 | ✅ |
| 🌾 Dry Grass | 0.6923 | 0.4502 | ✅ |
| 🪨 Dry Bushes | 0.4929 | 0.3831 | ✅ |
| 🌸 Flowers | 0.5757 | 0.0000 | ❌ Absent |
| 🪵 Logs | 0.5214 | 0.0000 | ❌ Absent |
| ⛰️ Rocks | 0.4877 | 0.0402 | ✅ |
| 🏜️ Landscape | 0.6007 | 0.5993 | ✅ |
| ☁️ Sky | 0.9839 | 0.9802 | ✅ |
| 🪨 Ground Clutter | 0.3874 | 0.0000 | ❌ Absent |

**Note on corrected mIoU:** The 3 absent classes (Flowers, Logs, Ground Clutter) score IoU = 0 by definition, since the model never encounters them in the test set. Corrected mIoU averages only the 7 classes that actually appear in Desert B.
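The corrected figure can be reproduced directly from the per-class table above; a minimal sketch using the reported test IoUs:

```python
import numpy as np

# Per-class test IoUs from the table above, in compact-ID order.
test_iou = {
    "Trees": 0.3896, "Lush_Bushes": 0.0003, "Dry_Grass": 0.4502,
    "Dry_Bushes": 0.3831, "Ground_Clutter": 0.0000, "Flowers": 0.0000,
    "Logs": 0.0000, "Rocks": 0.0402, "Landscape": 0.5993, "Sky": 0.9802,
}
absent = {"Flowers", "Logs", "Ground_Clutter"}  # classes not present in Desert B

raw_miou = np.mean(list(test_iou.values()))
corrected_miou = np.mean([v for k, v in test_iou.items() if k not in absent])

print(f"raw mIoU:       {raw_miou:.4f}")        # 0.2843
print(f"corrected mIoU: {corrected_miou:.4f}")  # 0.4061
```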


πŸ—οΈ Model Details

Property Value
Architecture SegFormer-B2 (Mix-Transformer encoder + All-MLP decoder)
Base model nvidia/mit-b2 (ImageNet-1K pretrained)
Total parameters 27.4M
Encoder 23.7M (pretrained)
Decoder head 3.7M (randomly initialised, fine-tuned)
Input resolution 512 Γ— 512 px
Output classes 10
Training platform Kaggle GPU (T4/P100)
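As a sanity check on the parameter count, a B2-shaped model can be instantiated locally without any download. The depth/width values below follow the SegFormer paper's B2 configuration and are assumptions, not read from this repo's `config.json`; the actual fine-tune started from the pretrained `nvidia/mit-b2` encoder.

```python
from transformers import SegformerConfig, SegformerForSemanticSegmentation

# Randomly-initialised B2-shaped model (assumed hyperparameters, no download).
config = SegformerConfig(
    num_labels=10,
    depths=[3, 4, 6, 3],                 # B2 encoder depths (per the paper)
    hidden_sizes=[64, 128, 320, 512],    # B2 stage widths
    decoder_hidden_size=768,             # All-MLP decoder width for B2+
)
model = SegformerForSemanticSegmentation(config)
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.1f}M parameters")
```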

πŸ—‚οΈ Dataset

Synthetic desert images generated by Duality AI's Falcon simulation platform.

Split Images Masks Location
Train 2,857 2,857 Desert A
Val 317 317 Desert A (dedicated split)
Test 1,002 1,002 Desert B (unseen)

Mask label IDs are non-standard sparse integers, remapped to 0–9:

| Raw ID | Class | Compact ID |
|---|---|---|
| 100 | Trees | 0 |
| 200 | Lush Bushes | 1 |
| 300 | Dry Grass | 2 |
| 500 | Dry Bushes | 3 |
| 550 | Ground Clutter | 4 |
| 600 | Flowers ⚡ rare | 5 |
| 700 | Logs ⚡ rare | 6 |
| 800 | Rocks | 7 |
| 7100 | Landscape | 8 |
| 10000 | Sky | 9 |
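The remapping amounts to a lookup per class ID. A minimal sketch (the `remap_mask` helper and its fallback of `ignore_index=255` for unexpected IDs are illustrative, not the repo's actual preprocessing code; 255 matches the loss's `ignore_index`):

```python
import numpy as np

# Sparse Falcon label IDs -> compact 0-9, per the table above.
RAW_TO_COMPACT = {100: 0, 200: 1, 300: 2, 500: 3, 550: 4,
                  600: 5, 700: 6, 800: 7, 7100: 8, 10000: 9}

def remap_mask(raw_mask: np.ndarray, ignore_index: int = 255) -> np.ndarray:
    """Map sparse label IDs to compact IDs; unknown IDs become ignore_index."""
    out = np.full(raw_mask.shape, ignore_index, dtype=np.uint8)
    for raw_id, compact_id in RAW_TO_COMPACT.items():
        out[raw_mask == raw_id] = compact_id
    return out

raw = np.array([[100, 10000], [7100, 42]], dtype=np.int32)
print(remap_mask(raw))  # Trees, Sky / Landscape, ignored
```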

βš™οΈ Training Configuration

Optimiser & Schedule

Optimiser   : AdamW   lr=6e-5   weight_decay=1e-2
LR schedule : CosineAnnealingLR   T_max=50   eta_min=1e-7
Grad clip   : max_norm=1.0
Early stop  : patience=7 epochs on val mIoU
Best epoch  : 19   (V2 model)
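A minimal PyTorch sketch of this optimiser/schedule/clipping setup, using a tiny stand-in module rather than the actual SegFormer-B2:

```python
import torch

# Hypothetical stand-in model; the real training loop uses SegFormer-B2.
model = torch.nn.Conv2d(3, 10, kernel_size=1)

optimizer = torch.optim.AdamW(model.parameters(), lr=6e-5, weight_decay=1e-2)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=50, eta_min=1e-7)

for epoch in range(3):  # the full run trained up to 50 epochs with patience=7
    optimizer.zero_grad()
    loss = model(torch.randn(1, 3, 8, 8)).mean()  # dummy forward/loss
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    scheduler.step()  # cosine decay from 6e-5 towards eta_min

print(f"lr after 3 epochs: {scheduler.get_last_lr()[0]:.2e}")
```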

### Loss Function

Weighted `CrossEntropyLoss` with `ignore_index=255` to handle unlabelled pixels.

| Class | Weight | Reason |
|---|---|---|
| Flowers | 5.0× | Extremely rare; forces the model to notice them |
| Logs | 4.0× | Rare and often partially occluded |
| Ground Clutter | 2.0× | Small objects, easily missed |
| Lush Bushes, Dry Bushes | 2.0× | Medium frequency |
| Trees, Dry Grass, Rocks | 1.5× | Moderate |
| Sky | 0.8× | Dominant; downweighted |
| Landscape | 0.5× | Most dominant; heavily downweighted |
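In PyTorch this corresponds to something like the following; the weight vector is ordered by compact ID (0–9) per the mask-label table, and the tensor shapes are illustrative:

```python
import torch

# Weights in compact-ID order: Trees, Lush_Bushes, Dry_Grass, Dry_Bushes,
# Ground_Clutter, Flowers, Logs, Rocks, Landscape, Sky.
weights = torch.tensor([1.5, 2.0, 1.5, 2.0, 2.0, 5.0, 4.0, 1.5, 0.5, 0.8])
criterion = torch.nn.CrossEntropyLoss(weight=weights, ignore_index=255)

logits = torch.randn(2, 10, 64, 64)           # [B, C, H, W] dummy predictions
target = torch.randint(0, 10, (2, 64, 64))    # [B, H, W]; 255 marks unlabelled
loss = criterion(logits, target)
print(f"loss: {loss.item():.4f}")
```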

### Augmentation Pipeline

The four techniques recommended in Duality AI's official training guide (flip, zoom, shear, mosaic), plus mild colour jitter:

```python
A.Resize(512, 512),
A.HorizontalFlip(p=0.5),                          # 1. Flip: deserts have no L/R bias
A.RandomResizedCrop(size=(512, 512),              # 2. Zoom: simulates camera distance
    scale=(0.6, 1.0), ratio=(0.75, 1.33), p=0.5),
A.Affine(shear=(-15, 15), rotate=(-10, 10), p=0.4),  # 3. Shear: off-level camera angles
# 4. Mosaic: 4 images stitched into a 2x2 grid (30% probability, custom implementation)
A.HueSaturationValue(                             # 5. HSV: MILD to preserve the warm desert palette
    hue_shift_limit=10, sat_shift_limit=20,
    val_shift_limit=15, p=0.4),
A.RandomBrightnessContrast(
    brightness_limit=0.15, contrast_limit=0.15, p=0.4),
A.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
```

⚠️ HSV jitter is intentionally mild. The Duality AI desert palette is warm-toned. Aggressive colour shifts would train the model on unrealistic lighting and hurt generalisation.
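The mosaic step is described as a custom implementation whose code is not shown here. One plausible minimal 2×2 variant, stitching four equal-size image/mask pairs into one (a sketch, not the team's actual implementation):

```python
import numpy as np

def mosaic_2x2(images, masks):
    """Stitch 4 equal-size image/mask pairs into one 2x2 mosaic pair."""
    assert len(images) == len(masks) == 4
    top_img = np.concatenate(images[:2], axis=1)   # left | right
    bot_img = np.concatenate(images[2:], axis=1)
    top_msk = np.concatenate(masks[:2], axis=1)
    bot_msk = np.concatenate(masks[2:], axis=1)
    return (np.concatenate([top_img, bot_img], axis=0),   # top / bottom
            np.concatenate([top_msk, bot_msk], axis=0))

# Four dummy 256x256 tiles produce one 512x512 mosaic.
imgs = [np.full((256, 256, 3), i, np.uint8) for i in range(4)]
msks = [np.full((256, 256), i, np.uint8) for i in range(4)]
mos_img, mos_msk = mosaic_2x2(imgs, msks)
print(mos_img.shape, mos_msk.shape)  # (512, 512, 3) (512, 512)
```

In a real pipeline the mosaic would be applied before `A.Resize`, with the 30% probability gate, so the result is scaled back to the training resolution.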


## 🚀 Usage

### Quick inference

```python
import torch
import numpy as np
from PIL import Image
from transformers import SegformerForSemanticSegmentation
import albumentations as A
from albumentations.pytorch import ToTensorV2
import torch.nn.functional as F

# ── Class definitions ────────────────────────────────────────────────────────
CLASS_NAMES = [
    "Trees", "Lush_Bushes", "Dry_Grass", "Dry_Bushes",
    "Ground_Clutter", "Flowers", "Logs", "Rocks", "Landscape", "Sky"
]
PALETTE = np.array([
    [34,139,34],[0,200,83],[210,180,140],[200,200,180],[139,90,43],
    [255,20,147],[139,69,19],[128,128,128],[205,170,100],[135,206,235]
], dtype=np.uint8)

# ── Load model ───────────────────────────────────────────────────────────────
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model  = SegformerForSemanticSegmentation.from_pretrained(
    "YOUR_HF_USERNAME/segformer-b2-desert-segmentation"
).to(device).eval()

# ── Preprocess ───────────────────────────────────────────────────────────────
transform = A.Compose([
    A.Resize(512, 512),
    A.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ToTensorV2(),
])

# ── Inference ────────────────────────────────────────────────────────────────
img    = np.array(Image.open("desert_image.png").convert("RGB"))
h0, w0 = img.shape[:2]
tensor = transform(image=img)["image"].unsqueeze(0).to(device)

with torch.no_grad():
    logits = model(pixel_values=tensor).logits          # [1, 10, H/4, W/4]
    up     = F.interpolate(logits, size=(h0, w0),
                           mode="bilinear", align_corners=False)
    pred   = up.argmax(dim=1).squeeze(0).cpu().numpy()  # [H, W]  values 0-9

# ── Colour visualisation ─────────────────────────────────────────────────────
def mask_to_rgb(mask):
    rgb = np.zeros((*mask.shape, 3), dtype=np.uint8)
    for c in range(10):
        rgb[mask == c] = PALETTE[c]
    return rgb

pred_rgb = mask_to_rgb(pred)
Image.fromarray(pred_rgb).save("segmentation_output.png")
print("Classes found:", [CLASS_NAMES[c] for c in np.unique(pred)])
```

### Load from checkpoint

```python
import torch
from transformers import SegformerForSemanticSegmentation

# Hugging Face directory
model = SegformerForSemanticSegmentation.from_pretrained(
    "YOUR_HF_USERNAME/segformer-b2-desert-segmentation"
).eval()

# Raw .pth checkpoint (if you downloaded it separately)
ckpt = torch.load("best_model.pth", map_location="cpu")
model.load_state_dict(ckpt["model_state"])
print(f"Loaded epoch {ckpt['epoch']} - val mIoU {ckpt['val_miou']:.4f}")
```

## ⚠️ Limitations & Known Issues

- **Domain shift is real.** The model was trained on Desert A (Duality AI Falcon synthetic data). Performance on real-world desert images or different Falcon biomes may vary significantly.
- **3 classes absent in Desert B.** Flowers, Logs, and Ground Clutter do not appear in the test location. The model has learned to predict them on Desert A but will produce near-zero IoU on any location where they are absent.
- **Lush Bushes generalise poorly** (IoU 0.6937 val → 0.0003 test). Desert B appears to have a very different bush distribution or colour tone from Desert A.
- **Rocks generalise weakly** (0.4877 → 0.0402). Rock textures vary heavily between locations.
- **Sky and Landscape generalise near-perfectly** (Sky: 0.9839 → 0.9802; Landscape: 0.6007 → 0.5993). These classes are visually consistent across desert biomes.

πŸ“ Repository Structure

segformer-b2-desert-segmentation/
β”œβ”€β”€ config.json                  ← model architecture config
β”œβ”€β”€ model.safetensors            ← fine-tuned weights (404 MB)
β”œβ”€β”€ preprocessor_config.json     ← image processor settings
β”œβ”€β”€ metadata.json                ← training metadata & scores
└── README.md                    ← this file

## 📚 Citation

If you use this model, please cite:

```bibtex
@misc{mitwpu2025desert,
  title        = {SegFormer-B2 Fine-tuned on Duality AI Desert Segmentation},
  author       = {MIT WPU Team},
  year         = {2025},
  howpublished = {YOLO Pune Hackathon 2025, Duality AI Challenge},
  url          = {https://huggingface.co/YOUR_HF_USERNAME/segformer-b2-desert-segmentation}
}
```

Base model:

```bibtex
@article{xie2021segformer,
  title   = {SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers},
  author  = {Xie, Enze and Wang, Wenhai and Yu, Zhiding and Anandkumar, Anima and Alvarez, Jose M and Luo, Ping},
  journal = {NeurIPS},
  year    = {2021}
}
```

πŸ… Acknowledgements

  • Duality AI for the Falcon synthetic dataset and challenge
  • NVIDIA for the pretrained SegFormer-B2 backbone
  • YOLO Pune Hackathon 2025 organisers at MIT WPU

Model trained and evaluated by MIT WPU for the Duality AI Offroad Segmentation challenge at YOLO Pune Hackathon 2026.
