Trained a Swin-T from scratch on NWPU-RESISC45 — no pretrained weights, no fine-tuning.
Every component hand-coded in PyTorch: window partitioning, shifted window attention with relative positional bias, patch merging across 4 stages, ~28M parameters.
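For anyone curious what the windowing machinery boils down to, here's a minimal sketch of the partition/reverse helpers as they appear in the Swin paper; the cyclic shift is noted in a comment, and helper names are mine, not necessarily the repo's:

```python
import torch

def window_partition(x, window_size):
    # (B, H, W, C) -> (num_windows*B, window_size, window_size, C)
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    return x.permute(0, 1, 3, 2, 4, 5).contiguous().view(-1, window_size, window_size, C)

def window_reverse(windows, window_size, H, W):
    # inverse of window_partition: (num_windows*B, ws, ws, C) -> (B, H, W, C)
    B = windows.shape[0] // (H * W // window_size // window_size)
    x = windows.view(B, H // window_size, W // window_size, window_size, window_size, -1)
    return x.permute(0, 1, 3, 2, 4, 5).contiguous().view(B, H, W, -1)

# The "shift" in shifted-window attention is just a cyclic roll before
# partitioning (undone after), e.g. with shift = window_size // 2:
#   x = torch.roll(x, shifts=(-shift, -shift), dims=(1, 2))
```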
Architecture:
embed_dim=96, window_size=7, depths=[2, 2, 6, 2]
heads=[3, 6, 12, 24] across stages
Patch embed via Conv2d (4×4, stride 4) → 56×56 feature map
PatchMerging downsamples by concatenating 2×2 neighbors + a linear projection (sketched after this list)
Global average pooling → linear classifier
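The patch embed and PatchMerging steps follow the standard Swin v1 formulation; here's a minimal sketch (module names and layout are mine, not necessarily the repo's):

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    # 224x224x3 input -> 56x56 grid of 96-dim tokens via a 4x4 conv, stride 4
    def __init__(self, embed_dim=96):
        super().__init__()
        self.proj = nn.Conv2d(3, embed_dim, kernel_size=4, stride=4)

    def forward(self, x):
        x = self.proj(x)                      # (B, 96, 56, 56)
        return x.flatten(2).transpose(1, 2)   # (B, 56*56, 96)

class PatchMerging(nn.Module):
    # concatenate each 2x2 neighborhood (4C channels), then project to 2C
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x, H, W):
        B, L, C = x.shape
        x = x.view(B, H, W, C)
        x = torch.cat([x[:, 0::2, 0::2], x[:, 1::2, 0::2],
                       x[:, 0::2, 1::2], x[:, 1::2, 1::2]], dim=-1)  # (B, H/2, W/2, 4C)
        x = x.view(B, -1, 4 * C)
        return self.reduction(self.norm(x))   # (B, H/2 * W/2, 2C)
```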
Training (loop sketched below):
AdamW (lr=3e-4, weight_decay=0.05)
Cosine annealing over 20 epochs with a 3-epoch linear warmup
Mixed precision (autocast + GradScaler)
Gradient clipping (max_norm=1.0)
Label smoothing (0.1)
ImageNet normalization, batch size 32
80/20 train/test split, seed=42
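For reference, roughly what that recipe looks like in PyTorch. This is a sketch, not the repo's actual code: `model` and `train_loader` are hypothetical placeholders, and I'm assuming the 20 epochs include the 3 warmup epochs:

```python
import torch
import torch.nn as nn

# hypothetical: `model` and `train_loader` stand in for the repo's own objects
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.05)

# 3-epoch linear warmup, then cosine decay for the remaining 17 of 20 epochs
warmup = torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=0.01, total_iters=3)
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=17)
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer, schedulers=[warmup, cosine], milestones=[3])

scaler = torch.cuda.amp.GradScaler()
for epoch in range(20):
    for images, labels in train_loader:
        images, labels = images.cuda(), labels.cuda()
        optimizer.zero_grad(set_to_none=True)
        with torch.cuda.amp.autocast():       # mixed-precision forward pass
            loss = criterion(model(images), labels)
        scaler.scale(loss).backward()
        scaler.unscale_(optimizer)            # unscale grads before clipping
        nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        scaler.step(optimizer)
        scaler.update()
    scheduler.step()                          # per-epoch schedule
```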
Result: 82% test accuracy across the 45 land-use categories (31,500 images total).
🔗 Sathya77/swin-transformer-satellite
What accuracy do you think is achievable on NWPU-RESISC45 with Swin-T trained from scratch, without any pretraining?