---
license: mit
tags:
- remote-sensing
- computer-vision
- vision-transformer
- sam
- building-extraction
- change-detection
- foundation-model
datasets:
- remote-sensing-images
model-index:
- name: RSBuilding-ViT-L
  results: []
library_name: transformers
pipeline_tag: feature-extraction
---

# RSBuilding-ViT-L

HuggingFace Transformers version of RSBuilding ViT-Large model (ViTSAM_Normal), converted from MMCV format to SamVisionModel format.

## Source

- **Source Code**: [https://github.com/Meize0729/RSBuilding](https://github.com/Meize0729/RSBuilding)
- **Original Checkpoint**: [https://huggingface.co/models/BiliSakura/RSBuilding](https://huggingface.co/models/BiliSakura/RSBuilding)

## Model Information

- **Architecture**: Vision Transformer Large (SAM-style)
- **Hidden Size**: 1024
- **Number of Layers**: 24
- **Number of Attention Heads**: 16
- **MLP Dimension**: 4096
- **Image Size**: 512×512
- **Patch Size**: 16×16
- **Window Size**: 7
- **Global Attention Indexes**: [5, 11, 17, 23]

## Important Notes

### Missing Neck Module Keys (Expected)

When loading this model, you may see messages about missing neck module keys (typically ~6 keys). **This is expected and normal.**

**What is the neck module?**
- The neck module is a channel reduction layer that reduces ViT output from 1024 channels to 256 channels
- It consists of: Conv1x1 → LayerNorm → Conv3x3 → LayerNorm
- Purpose: Improves efficiency and prepares features for downstream tasks (mask decoder, etc.)

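The neck layout described above can be checked with a little shape arithmetic. The following is a minimal sketch, not the conversion code: it assumes bias-free convolutions and a 3×3 convolution with padding 1 (as in the Hugging Face SAM neck), uses illustrative parameter names, and takes the sizes from Model Information. Under those assumptions the neck holds six parameter tensors, which is consistent with the ~6 missing keys mentioned above.

```python
# Shape walk-through for the neck: Conv1x1 -> LayerNorm -> Conv3x3 -> LayerNorm.
# Assumptions (not confirmed by this model card): convolutions have no bias,
# and the 3x3 convolution uses padding 1, so the spatial size is unchanged.

hidden_size = 1024      # ViT output channels
output_channels = 256   # neck output channels
image_size = 512
patch_size = 16

grid = image_size // patch_size  # patches per side

# (name, shape) for each neck parameter; the names are illustrative only
neck_params = [
    ("conv1.weight", (output_channels, hidden_size, 1, 1)),      # Conv1x1
    ("layer_norm1.weight", (output_channels,)),
    ("layer_norm1.bias", (output_channels,)),
    ("conv2.weight", (output_channels, output_channels, 3, 3)),  # Conv3x3
    ("layer_norm2.weight", (output_channels,)),
    ("layer_norm2.bias", (output_channels,)),
]


def numel(shape):
    """Number of elements in a tensor of the given shape."""
    count = 1
    for dim in shape:
        count *= dim
    return count


total_params = sum(numel(shape) for _, shape in neck_params)

# Channel-first shapes for a single image
neck_input_shape = (1, hidden_size, grid, grid)
neck_output_shape = (1, output_channels, grid, grid)

for name, shape in neck_params:
    print(f"{name:20s} {shape}")
print("neck parameters:", total_params)
print("input:", neck_input_shape, "-> output:", neck_output_shape)
```

For a 512×512 input this gives a 32×32 patch grid, so the neck maps a 1024-channel 32×32 map to a 256-channel 32×32 map with 852,992 parameters in total, a small fraction of the backbone.
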
**Why they're missing:**
- The source checkpoint (ViTSAM_Normal) may not include neck/channel_reduction weights
- The HuggingFace SamVisionModel expects a neck module as part of its architecture
- Missing neck weights will be initialized using HuggingFace's default initialization

**Action required:**
- For inference: The model will run, but the neck output comes from freshly initialized weights, so you may want to fine-tune the neck module on your downstream task
- For best results: Consider initializing neck weights from a pretrained SAM checkpoint or fine-tuning them

### Missing Buffer Keys (Expected)

You may also see messages about missing buffer keys. These are buffers computed dynamically:
- `relative_position_index`: Precomputed index mapping for window attention
- `relative_coords_table`: Precomputed coordinate table

**Action required:** None. These are computed automatically during initialization.

## Quick Start

### Installation

```bash
pip install transformers torch pillow
```

### Inference Example

```python
from transformers import SamVisionModel, AutoImageProcessor
from PIL import Image
import torch

# Load model and processor
model = SamVisionModel.from_pretrained("BiliSakura/RSBuilding-ViT-L")
processor = AutoImageProcessor.from_pretrained("BiliSakura/RSBuilding-ViT-L")
model.eval()

# Load and process image
image = Image.open("your_image.jpg").convert("RGB")
inputs = processor(image, return_tensors="pt")

# Forward pass. Pass only pixel_values, since the processor may return
# extra keys that the vision model does not accept.
with torch.no_grad():
    outputs = model(pixel_values=inputs["pixel_values"], output_hidden_states=True)

# Get features.
# Assumes the stock SamVisionModel forward pass, which applies the neck
# internally and does not return a pooler_output:
# outputs.hidden_states[-1]: (batch_size, H/16, W/16, hidden_size), before the neck
# outputs.last_hidden_state: (batch_size, 256, H/16, W/16), after the neck
features = outputs.hidden_states[-1]
pooled_features = features.mean(dim=(1, 2))  # mean pooling: (batch_size, hidden_size)

print(f"Feature shape: {features.shape}")
print(f"Pooled feature shape: {pooled_features.shape}")
```

### Feature Extraction for Downstream Tasks

```python
from transformers import SamVisionModel, AutoImageProcessor
from PIL import Image
import torch

model = SamVisionModel.from_pretrained("BiliSakura/RSBuilding-ViT-L")
processor = AutoImageProcessor.from_pretrained("BiliSakura/RSBuilding-ViT-L")

# Process image
image = Image.open("your_image.jpg").convert("RGB")
inputs = processor(image, return_tensors="pt")

# Extract features
with torch.no_grad():
    outputs = model(pixel_values=inputs["pixel_values"], output_hidden_states=True)

# Use the last backbone hidden state (before the neck) for dense prediction tasks
spatial_features = outputs.hidden_states[-1]  # Shape: (1, H, W, 1024)

# Mean-pool it for classification/regression
features = spatial_features.mean(dim=(1, 2))  # Shape: (1, 1024)

# Neck output (after channel reduction to 256).
# Assumes the stock SamVisionModel, which already applies the neck in forward(),
# so the neck must not be called again on last_hidden_state.
neck_output = outputs.last_hidden_state  # Shape: (1, 256, H, W)
```

### Fine-tuning the Neck Module

If you need to fine-tune the neck module:

```python
from transformers import SamVisionModel

model = SamVisionModel.from_pretrained("BiliSakura/RSBuilding-ViT-L")

# Option 1: Freeze the backbone, train only the neck
for param in model.vision_encoder.parameters():
    param.requires_grad = False
for param in model.vision_encoder.neck.parameters():
    param.requires_grad = True

# Option 2: Initialize the neck from pretrained SAM
pretrained_sam = SamVisionModel.from_pretrained("facebook/sam-vit-large")
model.vision_encoder.neck.load_state_dict(pretrained_sam.vision_encoder.neck.state_dict())
```

## Model Configuration

The model uses the following configuration:

- `hidden_size`: 1024
- `num_hidden_layers`: 24
- `num_attention_heads`: 16
- `mlp_dim`: 4096
- `image_size`: 512
- `patch_size`: 16
- `window_size`: 7
- `output_channels`: 256 (neck output)
- `global_attn_indexes`: [5, 11, 17, 23]

## Citation

If you use this model, please cite the original RSBuilding paper:

```bibtex
@article{wangRSBuildingGeneralRemote2024a,
  title = {{{RSBuilding}}: {{Toward General Remote Sensing Image Building Extraction}} and {{Change Detection With Foundation Model}}},
  shorttitle = {{{RSBuilding}}},
  author = {Wang, Mingze and Su, Lili and Yan, Cilin and Xu, Sheng and Yuan, Pengcheng and Jiang, Xiaolong and Zhang, Baochang},
  year = {2024},
  journal = {IEEE Transactions on Geoscience and Remote Sensing},
  volume = {62},
  pages = {1--17},
  issn = {1558-0644},
  doi = {10.1109/TGRS.2024.3439395},
  keywords = {Building extraction,Buildings,change detection (CD),Data mining,Feature extraction,federated training,foundation model,Image segmentation,Remote sensing,remote sensing images,Task analysis,Training}
}
```