RSBuilding-Swin-T

HuggingFace Transformers version of the RSBuilding Swin-Tiny backbone, converted from the MMDetection/MMSegmentation checkpoint format.

Model Information

  • Architecture: Swin Transformer Tiny
  • Embedding Dimension: 96
  • Depths: [2, 2, 6, 2]
  • Number of Heads: [3, 6, 12, 24]
  • Window Size: 7
  • Image Size: 224×224
  • Patch Size: 4×4
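
As a quick sanity check on these numbers: patch embedding divides the 224×224 input into a 56×56 grid of tokens, and each of the three patch-merging steps halves the resolution while doubling the channel width, ending at 7×7×768. A minimal sketch of that arithmetic (derived here, not read from the checkpoint):

```python
# Derive per-stage feature-map geometry from the config values above.
image_size, patch_size, embed_dim = 224, 4, 96
depths = [2, 2, 6, 2]

resolution = image_size // patch_size  # 56x56 tokens after patch embedding
stages = []
for i in range(len(depths)):
    stages.append((resolution, embed_dim * 2 ** i))
    if i < len(depths) - 1:  # patch merging between stages halves resolution
        resolution //= 2

print(stages)  # [(56, 96), (28, 192), (14, 384), (7, 768)]
```

The final stage width (768) is the hidden size you see in the feature shapes below.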

Important Notes

Missing Buffer Keys (Expected)

When loading this model, you may see messages about missing buffer keys (typically ~12 keys). This is expected and normal.

These missing keys are buffers that are computed dynamically during model initialization:

  • relative_position_index: Precomputed index mapping for window-based attention
  • relative_coords_table: Precomputed coordinate table for relative positions
  • relative_position_bias_table: Precomputed bias table

Why they're missing:

  • These buffers are recalculated each time the model is instantiated based on window_size and other configuration parameters
  • They don't need to be saved in checkpoints because they're deterministic and computed from config
  • This is standard behavior in HuggingFace Swin transformers

Action required: None. The model will work correctly with these buffers computed automatically.
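
To see why recomputing these buffers is safe, note that relative_position_index depends only on window_size. A self-contained sketch mirroring the standard Swin construction (it does not load this checkpoint) shows the buffer is fully determined by the config:

```python
import torch

def build_relative_position_index(window_size: int) -> torch.Tensor:
    # Standard Swin construction: pairwise relative coordinates between
    # all positions inside one attention window, flattened to an index.
    coords = torch.stack(torch.meshgrid(
        torch.arange(window_size), torch.arange(window_size), indexing="ij"))
    coords_flatten = torch.flatten(coords, 1)  # (2, Wh*Ww)
    relative_coords = coords_flatten[:, :, None] - coords_flatten[:, None, :]
    relative_coords = relative_coords.permute(1, 2, 0).contiguous()
    relative_coords[:, :, 0] += window_size - 1  # shift to start from 0
    relative_coords[:, :, 1] += window_size - 1
    relative_coords[:, :, 0] *= 2 * window_size - 1
    return relative_coords.sum(-1)  # (Wh*Ww, Wh*Ww)

idx = build_relative_position_index(7)
print(idx.shape)  # torch.Size([49, 49])
```

Because this is a pure function of window_size, two instantiations of the model always produce identical buffers, so nothing is lost by omitting them from the checkpoint.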

Quick Start

Installation

pip install transformers torch pillow

Inference Example

from transformers import SwinModel, AutoImageProcessor
from PIL import Image
import torch

# Load model and processor
model = SwinModel.from_pretrained("BiliSakura/RSBuilding-Swin-T")
processor = AutoImageProcessor.from_pretrained("BiliSakura/RSBuilding-Swin-T")

# Load and process image
image = Image.open("your_image.jpg")
inputs = processor(image, return_tensors="pt")

# Forward pass
with torch.no_grad():
    outputs = model(**inputs)

# Get features
# outputs.last_hidden_state: (batch_size, num_patches, hidden_size)
# outputs.pooler_output: (batch_size, hidden_size) - pooled representation
features = outputs.last_hidden_state
pooled_features = outputs.pooler_output

print(f"Feature shape: {features.shape}")
print(f"Pooled feature shape: {pooled_features.shape}")
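
For dense tasks such as building extraction, last_hidden_state can be reshaped back into a 2D feature map: a 224×224 input yields a 7×7 grid of 768-dim tokens at the final stage. A sketch using a dummy tensor in place of the real model output:

```python
import torch

# Dummy stand-in for outputs.last_hidden_state: (batch, num_patches, hidden)
last_hidden_state = torch.randn(1, 49, 768)

b, n, c = last_hidden_state.shape
side = int(n ** 0.5)  # 7 for a 224x224 input
# (batch, num_patches, hidden) -> (batch, hidden, height, width)
feature_map = last_hidden_state.transpose(1, 2).reshape(b, c, side, side)
print(feature_map.shape)  # torch.Size([1, 768, 7, 7])
```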

Feature Extraction for Downstream Tasks

from transformers import SwinModel, AutoImageProcessor
from PIL import Image
import torch

model = SwinModel.from_pretrained("BiliSakura/RSBuilding-Swin-T")
processor = AutoImageProcessor.from_pretrained("BiliSakura/RSBuilding-Swin-T")

# Process image
image = Image.open("your_image.jpg")
inputs = processor(image, return_tensors="pt")

# Extract features
with torch.no_grad():
    outputs = model(**inputs)
    
# Use pooled features for classification/regression
features = outputs.pooler_output  # Shape: (1, 768)

# Or use last hidden state for dense prediction tasks
spatial_features = outputs.last_hidden_state  # Shape: (1, num_patches, 768)
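
A common way to use the pooled features downstream is a lightweight linear probe. A sketch with a hypothetical two-class head (the class count and the dummy input are illustrative, not part of this model):

```python
import torch

# Hypothetical linear probe on Swin-T pooled features (hidden size 768).
num_classes = 2  # e.g. building vs. background; illustrative only
head = torch.nn.Linear(768, num_classes)

pooled = torch.randn(1, 768)  # stand-in for outputs.pooler_output
logits = head(pooled)
print(logits.shape)  # torch.Size([1, 2])
```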

Model Configuration

The model uses the following configuration:

  • image_size: 224
  • patch_size: 4
  • num_channels: 3
  • embed_dim: 96
  • depths: [2, 2, 6, 2]
  • num_heads: [3, 6, 12, 24]
  • window_size: 7
  • mlp_ratio: 4.0
  • hidden_act: "gelu"
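
If you need to rebuild this configuration programmatically (for example, to instantiate a randomly initialized model with the same architecture), the values above map directly onto SwinConfig fields; a sketch, assuming the standard transformers API:

```python
from transformers import SwinConfig, SwinModel

config = SwinConfig(
    image_size=224,
    patch_size=4,
    num_channels=3,
    embed_dim=96,
    depths=[2, 2, 6, 2],
    num_heads=[3, 6, 12, 24],
    window_size=7,
    mlp_ratio=4.0,
    hidden_act="gelu",
)
model = SwinModel(config)  # same architecture, untrained weights
```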

Citation

If you use this model, please cite the original RSBuilding paper:

@article{wangRSBuildingGeneralRemote2024a,
  title = {{{RSBuilding}}: {{Toward General Remote Sensing Image Building Extraction}} and {{Change Detection With Foundation Model}}},
  shorttitle = {{{RSBuilding}}},
  author = {Wang, Mingze and Su, Lili and Yan, Cilin and Xu, Sheng and Yuan, Pengcheng and Jiang, Xiaolong and Zhang, Baochang},
  year = {2024},
  journal = {IEEE Transactions on Geoscience and Remote Sensing},
  volume = {62},
  pages = {1--17},
  issn = {1558-0644},
  doi = {10.1109/TGRS.2024.3439395},
  keywords = {Building extraction,Buildings,change detection (CD),Data mining,Feature extraction,federated training,foundation model,Image segmentation,Remote sensing,remote sensing images,Task analysis,Training}
}