BiliSakura committed on
Commit e418fa8 · verified · 1 Parent(s): a5800f1

Upload RSBuilding-ViT-B

Files changed (4):
  1. README.md +186 -0
  2. config.json +32 -0
  3. model.safetensors +3 -0
  4. preprocessor_config.json +23 -0
README.md ADDED
---
license: mit
tags:
- remote-sensing
- computer-vision
- vision-transformer
- sam
- building-extraction
- change-detection
- foundation-model
datasets:
- remote-sensing-images
model-index:
- name: RSBuilding-ViT-B
  results: []
library_name: transformers
pipeline_tag: feature-extraction
---

# RSBuilding-ViT-B

HuggingFace Transformers version of the RSBuilding ViT-Base model (ViTSAM_Normal), converted from the MMCV checkpoint format to the SamVisionModel format.

## Source

- **Source Code**: [https://github.com/Meize0729/RSBuilding](https://github.com/Meize0729/RSBuilding)
- **Original Checkpoint**: [https://huggingface.co/models/BiliSakura/RSBuilding](https://huggingface.co/models/BiliSakura/RSBuilding)

## Model Information

- **Architecture**: Vision Transformer Base (SAM-style)
- **Hidden Size**: 768
- **Number of Layers**: 12
- **Number of Attention Heads**: 12
- **MLP Dimension**: 3072
- **Image Size**: 512×512
- **Patch Size**: 16×16
- **Window Size**: 7
- **Global Attention Indexes**: [2, 5, 8, 11]

## Important Notes

### Missing Neck Module Keys (Expected)

When loading this model, you may see messages about missing neck module keys (typically ~6 keys). **This is expected and normal.**

**What is the neck module?**
- The neck module is a channel-reduction layer that projects the ViT output from 768 channels down to 256 channels
- It consists of: Conv1x1 → LayerNorm → Conv3x3 → LayerNorm
- Purpose: improves efficiency and prepares features for downstream tasks (mask decoder, etc.)
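The Conv1x1 → LayerNorm → Conv3x3 → LayerNorm stack above can be sketched as a small PyTorch module. This is a minimal illustration assuming SAM's channel-first `LayerNorm2d` convention, not the exact `transformers` implementation:

```python
import torch
import torch.nn as nn

class LayerNorm2d(nn.Module):
    """LayerNorm over the channel dimension of a (B, C, H, W) tensor."""
    def __init__(self, num_channels: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(num_channels))
        self.bias = nn.Parameter(torch.zeros(num_channels))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        mean = x.mean(1, keepdim=True)
        var = (x - mean).pow(2).mean(1, keepdim=True)
        x = (x - mean) / torch.sqrt(var + self.eps)
        return self.weight[:, None, None] * x + self.bias[:, None, None]

# Conv1x1 -> LayerNorm -> Conv3x3 -> LayerNorm, reducing 768 -> 256 channels
neck = nn.Sequential(
    nn.Conv2d(768, 256, kernel_size=1, bias=False),
    LayerNorm2d(256),
    nn.Conv2d(256, 256, kernel_size=3, padding=1, bias=False),
    LayerNorm2d(256),
)

# A 512x512 input with 16x16 patches yields a 32x32 feature grid
features = torch.randn(1, 768, 32, 32)
out = neck(features)
print(out.shape)  # torch.Size([1, 256, 32, 32])
```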

**Why they're missing:**
- The source checkpoint (ViTSAM_Normal) may not include neck/channel-reduction weights
- The HuggingFace SamVisionModel expects a neck module as part of its architecture
- Missing neck weights will be initialized using HuggingFace's default initialization

**Action required:**
- For inference: the model will work, but you may want to fine-tune the neck module on your downstream task
- For best results: consider initializing the neck weights from a pretrained SAM checkpoint, or fine-tuning them

### Missing Buffer Keys (Expected)

You may also see messages about missing buffer keys. These are buffers computed dynamically:
- `relative_position_index`: precomputed index mapping for window attention
- `relative_coords_table`: precomputed coordinate table

**Action required:** None. These are computed automatically during initialization.
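To illustrate why these buffers need no stored weights, `relative_position_index` for a 7×7 window can be recomputed from scratch. The sketch below is a minimal pure-Python version of the standard window-attention indexing, not the model's internal code:

```python
WINDOW = 7  # window size used by this model

# Coordinates of every token in a WINDOW x WINDOW attention window
coords = [(h, w) for h in range(WINDOW) for w in range(WINDOW)]

def rel_index(q, k):
    """Flat index of a (query, key) pair into the relative-position
    bias table of size (2*WINDOW - 1) ** 2."""
    dh = q[0] - k[0] + (WINDOW - 1)  # shift offsets into [0, 2*WINDOW-2]
    dw = q[1] - k[1] + (WINDOW - 1)
    return dh * (2 * WINDOW - 1) + dw

relative_position_index = [[rel_index(q, k) for k in coords] for q in coords]

print(len(relative_position_index))            # 49 query positions
print(max(map(max, relative_position_index)))  # 168 == (2*7 - 1)**2 - 1
```

Because the table depends only on the window size, it can always be rebuilt at initialization, which is why the checkpoint does not need to ship it.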

## Quick Start

### Installation

```bash
pip install transformers torch pillow
```

### Inference Example

```python
from transformers import SamVisionModel, AutoImageProcessor
from PIL import Image
import torch

# Load model and processor
model = SamVisionModel.from_pretrained("BiliSakura/RSBuilding-ViT-B")
processor = AutoImageProcessor.from_pretrained("BiliSakura/RSBuilding-ViT-B")

# Load and process image
image = Image.open("your_image.jpg")
inputs = processor(image, return_tensors="pt")

# Forward pass
with torch.no_grad():
    outputs = model(**inputs)

# Get features
# outputs.last_hidden_state: (batch_size, num_patches, hidden_size)
# outputs.pooler_output: (batch_size, hidden_size) - pooled representation
features = outputs.last_hidden_state
pooled_features = outputs.pooler_output

print(f"Feature shape: {features.shape}")
print(f"Pooled feature shape: {pooled_features.shape}")
```

### Feature Extraction for Downstream Tasks

```python
from transformers import SamVisionModel, AutoImageProcessor
from PIL import Image
import torch

model = SamVisionModel.from_pretrained("BiliSakura/RSBuilding-ViT-B")
processor = AutoImageProcessor.from_pretrained("BiliSakura/RSBuilding-ViT-B")

# Process image
image = Image.open("your_image.jpg")
inputs = processor(image, return_tensors="pt")

# Extract features
with torch.no_grad():
    outputs = model(**inputs)

# Use pooled features for classification/regression
features = outputs.pooler_output  # Shape: (1, 768)

# Or use the last hidden state for dense prediction tasks
spatial_features = outputs.last_hidden_state  # Shape: (1, num_patches, 768)

# Access the neck output (after channel reduction to 256)
# Note: this requires accessing model internals, and the exact attribute
# path may differ across transformers versions
neck_output = model.vision_encoder.neck(outputs.last_hidden_state)  # Shape: (1, 256, H, W)
```

### Fine-tuning the Neck Module

If you need to fine-tune the neck module:

```python
from transformers import SamVisionModel
import torch

model = SamVisionModel.from_pretrained("BiliSakura/RSBuilding-ViT-B")

# Option 1: Freeze the backbone, train only the neck
for param in model.vision_encoder.encoder.parameters():
    param.requires_grad = False
for param in model.vision_encoder.neck.parameters():
    param.requires_grad = True

# Option 2: Initialize the neck from a pretrained SAM checkpoint
pretrained_sam = SamVisionModel.from_pretrained("facebook/sam-vit-base")
model.vision_encoder.neck.load_state_dict(pretrained_sam.vision_encoder.neck.state_dict())
```

## Model Configuration

The model uses the following configuration:
- `hidden_size`: 768
- `num_hidden_layers`: 12
- `num_attention_heads`: 12
- `mlp_dim`: 3072
- `image_size`: 512
- `patch_size`: 16
- `window_size`: 7
- `output_channels`: 256 (neck output)
- `global_attn_indexes`: [2, 5, 8, 11]
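A quick sanity check on how these numbers fit together (plain arithmetic, not tied to any particular library):

```python
image_size, patch_size = 512, 16
hidden_size, mlp_dim = 768, 3072
num_layers, num_heads = 12, 12
global_attn_indexes = [2, 5, 8, 11]

# 512x512 input split into 16x16 patches -> 32x32 token grid
grid = image_size // patch_size
num_patches = grid * grid
print(grid, num_patches)  # 32 1024

# Per-head dimension and MLP expansion ratio
head_dim = hidden_size // num_heads
mlp_ratio = mlp_dim / hidden_size
print(head_dim, mlp_ratio)  # 64 4.0

# Layers 2, 5, 8, 11 use global attention; the rest use 7x7 windows
windowed = [i for i in range(num_layers) if i not in global_attn_indexes]
print(windowed)  # [0, 1, 3, 4, 6, 7, 9, 10]
```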

## Citation

If you use this model, please cite the original RSBuilding paper:

```bibtex
@article{wangRSBuildingGeneralRemote2024a,
  title = {{{RSBuilding}}: {{Toward General Remote Sensing Image Building Extraction}} and {{Change Detection With Foundation Model}}},
  shorttitle = {{{RSBuilding}}},
  author = {Wang, Mingze and Su, Lili and Yan, Cilin and Xu, Sheng and Yuan, Pengcheng and Jiang, Xiaolong and Zhang, Baochang},
  year = {2024},
  journal = {IEEE Transactions on Geoscience and Remote Sensing},
  volume = {62},
  pages = {1--17},
  issn = {1558-0644},
  doi = {10.1109/TGRS.2024.3439395},
  keywords = {Building extraction,Buildings,change detection (CD),Data mining,Feature extraction,federated training,foundation model,Image segmentation,Remote sensing,remote sensing images,Task analysis,Training}
}
```
config.json ADDED
{
  "architectures": [
    "SamVisionModel"
  ],
  "attention_dropout": 0.0,
  "dtype": "float32",
  "global_attn_indexes": [
    2,
    5,
    8,
    11
  ],
  "hidden_act": "gelu",
  "hidden_size": 768,
  "image_size": 512,
  "initializer_range": 1e-10,
  "layer_norm_eps": 1e-06,
  "mlp_dim": 3072,
  "mlp_ratio": 4.0,
  "model_type": "sam_vision_model",
  "num_attention_heads": 12,
  "num_channels": 3,
  "num_hidden_layers": 12,
  "num_pos_feats": 128,
  "output_channels": 256,
  "patch_size": 16,
  "qkv_bias": true,
  "transformers_version": "5.0.0.dev0",
  "use_abs_pos": true,
  "use_rel_pos": true,
  "window_size": 7
}
model.safetensors ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:300505439db2b5847c04fd7dce28e3788f0510ff7de44db5f3c77c2205e6f857
size 349077576
preprocessor_config.json ADDED
{
  "image_processor_type": "SamImageProcessor",
  "do_resize": true,
  "size": {
    "longest_edge": 512
  },
  "resample": 2,
  "do_rescale": true,
  "rescale_factor": 0.00392156862745098,
  "do_normalize": true,
  "image_mean": [
    0.485,
    0.456,
    0.406
  ],
  "image_std": [
    0.229,
    0.224,
    0.225
  ],
  "do_convert_rgb": true,
  "do_pad": false
}
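The preprocessor above rescales pixel values by 0.00392156862745098 (which is exactly 1/255) and then normalizes with the ImageNet mean/std. A minimal sketch of that math on a single RGB pixel (pure Python, independent of the actual SamImageProcessor implementation):

```python
rescale_factor = 0.00392156862745098  # == 1/255
image_mean = [0.485, 0.456, 0.406]
image_std = [0.229, 0.224, 0.225]

def preprocess_pixel(rgb):
    """Rescale a uint8 RGB pixel to [0, 1], then normalize per channel."""
    scaled = [c * rescale_factor for c in rgb]
    return [(s - m) / d for s, m, d in zip(scaled, image_mean, image_std)]

# A mid-gray pixel: 128/255 is slightly above each channel mean,
# so the normalized values come out small and positive
print(preprocess_pixel((128, 128, 128)))
```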