Trained a Swin-T from scratch on NWPU-RESISC45 — no pretrained weights, no fine-tuning.
Every component hand-coded in PyTorch: window partitioning, shifted window attention with relative positional bias, patch merging across 4 stages, ~28M parameters.
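For anyone curious what the windowing machinery boils down to, here's a minimal sketch of the partition/reverse helpers as they appear in the Swin paper; the cyclic shift is noted in a comment, and helper names are mine, not necessarily the repo's:

```python
import torch

def window_partition(x, window_size):
    # (B, H, W, C) -> (num_windows*B, window_size, window_size, C)
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    return x.permute(0, 1, 3, 2, 4, 5).contiguous().view(-1, window_size, window_size, C)

def window_reverse(windows, window_size, H, W):
    # inverse of window_partition: (num_windows*B, ws, ws, C) -> (B, H, W, C)
    B = windows.shape[0] // (H * W // window_size // window_size)
    x = windows.view(B, H // window_size, W // window_size, window_size, window_size, -1)
    return x.permute(0, 1, 3, 2, 4, 5).contiguous().view(B, H, W, -1)

# The "shift" in shifted-window attention is just a cyclic roll before
# partitioning (undone after), e.g. with shift = window_size // 2:
#   x = torch.roll(x, shifts=(-shift, -shift), dims=(1, 2))
```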
Architecture:
embed_dim=96, window_size=7, depths=[2, 2, 6, 2]
heads=[3, 6, 12, 24] across stages
Patch embed via Conv2d (4×4, stride 4) → 56×56 feature map
PatchMerging downsamples by concatenating 2×2 neighbors + a linear projection (sketched after this list)
Global average pooling → linear classifier
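The patch embed and PatchMerging steps follow the standard Swin v1 formulation; here's a minimal sketch (module names and layout are mine, not necessarily the repo's):

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    # 224x224x3 input -> 56x56 grid of 96-dim tokens via a 4x4 conv, stride 4
    def __init__(self, embed_dim=96):
        super().__init__()
        self.proj = nn.Conv2d(3, embed_dim, kernel_size=4, stride=4)

    def forward(self, x):
        x = self.proj(x)                      # (B, 96, 56, 56)
        return x.flatten(2).transpose(1, 2)   # (B, 56*56, 96)

class PatchMerging(nn.Module):
    # concatenate each 2x2 neighborhood (4C channels), then project to 2C
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x, H, W):
        B, L, C = x.shape
        x = x.view(B, H, W, C)
        x = torch.cat([x[:, 0::2, 0::2], x[:, 1::2, 0::2],
                       x[:, 0::2, 1::2], x[:, 1::2, 1::2]], dim=-1)  # (B, H/2, W/2, 4C)
        x = x.view(B, -1, 4 * C)
        return self.reduction(self.norm(x))   # (B, H/2 * W/2, 2C)
```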
Training (loop sketched below):
AdamW (lr=3e-4, weight_decay=0.05)
Cosine annealing over 20 epochs with a 3-epoch linear warmup
Mixed precision (autocast + GradScaler)
Gradient clipping (max_norm=1.0)
Label smoothing (0.1)
ImageNet normalization, batch size 32
80/20 train/test split, seed=42
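For reference, roughly what that recipe looks like in PyTorch. This is a sketch, not the repo's actual code: `model` and `train_loader` are hypothetical placeholders, and I'm assuming the 20 epochs include the 3 warmup epochs:

```python
import torch
import torch.nn as nn

# hypothetical: `model` and `train_loader` stand in for the repo's own objects
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.05)

# 3-epoch linear warmup, then cosine decay for the remaining 17 of 20 epochs
warmup = torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=0.01, total_iters=3)
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=17)
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer, schedulers=[warmup, cosine], milestones=[3])

scaler = torch.cuda.amp.GradScaler()
for epoch in range(20):
    for images, labels in train_loader:
        images, labels = images.cuda(), labels.cuda()
        optimizer.zero_grad(set_to_none=True)
        with torch.cuda.amp.autocast():       # mixed-precision forward pass
            loss = criterion(model(images), labels)
        scaler.scale(loss).backward()
        scaler.unscale_(optimizer)            # unscale grads before clipping
        nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        scaler.step(optimizer)
        scaler.update()
    scheduler.step()                          # per-epoch schedule
```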
Result: 82% test accuracy across the 45 land-use categories (31,500 images total).
🔗 Sathya77/swin-transformer-satellite
What accuracy do you think is achievable on NWPU-RESISC45 with Swin-T trained from scratch, without any pretraining?