BiliSakura committed · Commit d69a6d2 · verified · 1 Parent(s): c40328c

Update README for GeoRSCLIP-ViT-H-14

Files changed (1):
  1. README.md +156 -0
README.md CHANGED
@@ -17,6 +17,162 @@ This model is a mirror/redistribution of the original [GeoRSCLIP](https://huggin
  ## Description
  GeoRSCLIP is a vision-language foundation model for remote sensing, trained on a large-scale dataset of remote sensing image-text pairs (RS5M). It is based on the CLIP architecture and is designed to handle the unique characteristics of remote sensing imagery.
 
+ ## How to use
+
+ ### With `transformers`
+
+ ```python
+ from transformers import CLIPProcessor, CLIPModel
+ from PIL import Image
+ import torch
+
+ # Load model and processor
+ model = CLIPModel.from_pretrained("BiliSakura/GeoRSCLIP-ViT-H-14")
+ processor = CLIPProcessor.from_pretrained("BiliSakura/GeoRSCLIP-ViT-H-14")
+
+ # Load and process image
+ image = Image.open("path/to/your/image.jpg")
+ inputs = processor(
+     text=["a photo of a building", "a photo of vegetation", "a photo of water"],
+     images=image,
+     return_tensors="pt",
+     padding=True
+ )
+
+ # Get image-text similarity scores
+ with torch.inference_mode():
+     outputs = model(**inputs)
+     logits_per_image = outputs.logits_per_image
+     probs = logits_per_image.softmax(dim=1)
+
+ print(f"Similarity scores: {probs}")
+ ```
+
+ **Zero-shot image classification:**
+
+ ```python
+ from transformers import CLIPProcessor, CLIPModel
+ from PIL import Image
+ import torch
+
+ model = CLIPModel.from_pretrained("BiliSakura/GeoRSCLIP-ViT-H-14")
+ processor = CLIPProcessor.from_pretrained("BiliSakura/GeoRSCLIP-ViT-H-14")
+
+ # Define candidate labels
+ candidate_labels = [
+     "a satellite image of urban area",
+     "a satellite image of forest",
+     "a satellite image of agricultural land",
+     "a satellite image of water body"
+ ]
+
+ image = Image.open("path/to/your/image.jpg")
+ inputs = processor(
+     text=candidate_labels,
+     images=image,
+     return_tensors="pt",
+     padding=True
+ )
+
+ with torch.inference_mode():
+     outputs = model(**inputs)
+     probs = outputs.logits_per_image.softmax(dim=1)
+
+ # Get the predicted label
+ predicted_idx = probs.argmax().item()
+ print(f"Predicted label: {candidate_labels[predicted_idx]}")
+ print(f"Confidence: {probs[0][predicted_idx]:.4f}")
+ ```
+
+ **Extracting individual features:**
+
+ ```python
+ from transformers import CLIPProcessor, CLIPModel
+ from PIL import Image
+ import torch
+
+ model = CLIPModel.from_pretrained("BiliSakura/GeoRSCLIP-ViT-H-14")
+ processor = CLIPProcessor.from_pretrained("BiliSakura/GeoRSCLIP-ViT-H-14")
+
+ # Get image features only
+ image = Image.open("path/to/your/image.jpg")
+ image_inputs = processor(images=image, return_tensors="pt")
+
+ with torch.inference_mode():
+     image_features = model.get_image_features(**image_inputs)
+
+ # Get text features only
+ text_inputs = processor(
+     text=["a satellite image of urban area"],
+     return_tensors="pt",
+     padding=True,
+     truncation=True
+ )
+
+ with torch.inference_mode():
+     text_features = model.get_text_features(**text_inputs)
+
+ print(f"Image features shape: {image_features.shape}")
+ print(f"Text features shape: {text_features.shape}")
+ ```
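The features returned by `get_image_features` and `get_text_features` are unnormalized; CLIP-style matching scores are cosine similarities of L2-normalized embeddings. A minimal sketch of that step, using random stand-in tensors in place of the model outputs above (assuming ViT-H-14's 1024-dimensional projection space):

```python
import torch
import torch.nn.functional as F

# Stand-ins for the get_image_features / get_text_features outputs above
# (1 image, 3 candidate texts, 1024-dim CLIP projection space)
image_features = torch.randn(1, 1024)
text_features = torch.randn(3, 1024)

# L2-normalize, then dot products are cosine similarities
image_features = F.normalize(image_features, dim=-1)
text_features = F.normalize(text_features, dim=-1)
similarity = image_features @ text_features.T  # shape (1, 3), values in [-1, 1]

best_match = similarity.argmax(dim=-1).item()
print(f"Best matching text index: {best_match}")
```

This is the same computation the full model performs internally (up to the learned logit scale) before the softmax in the zero-shot example.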
+
+ ### With `diffusers`
+
+ This model's text encoder can be used with Stable Diffusion and other diffusion models:
+
+ ```python
+ from transformers import CLIPTextModel, CLIPTokenizer
+ import torch
+
+ # Load the text encoder and tokenizer (the diffusers-format weights are
+ # assumed to live under this repo's diffusers/ subfolder)
+ text_encoder = CLIPTextModel.from_pretrained(
+     "BiliSakura/GeoRSCLIP-ViT-H-14",
+     subfolder="diffusers/text_encoder",
+     torch_dtype=torch.float16
+ )
+ tokenizer = CLIPTokenizer.from_pretrained(
+     "BiliSakura/GeoRSCLIP-ViT-H-14"
+ )
+
+ # Encode text prompt
+ prompt = "a satellite image of a city with buildings and roads"
+ text_inputs = tokenizer(
+     prompt,
+     padding="max_length",
+     max_length=77,
+     truncation=True,
+     return_tensors="pt"
+ )
+
+ with torch.inference_mode():
+     text_outputs = text_encoder(text_inputs.input_ids)
+     text_embeddings = text_outputs.last_hidden_state
+
+ print(f"Text embeddings shape: {text_embeddings.shape}")
+ ```
+
+ **Using with Stable Diffusion:**
+
+ ```python
+ from diffusers import StableDiffusionPipeline
+ import torch
+
+ # Load a pipeline with the custom text encoder and tokenizer from the
+ # snippet above. Note: an OpenCLIP ViT-H-14 text encoder matches the
+ # Stable Diffusion 2.x architecture (1024-dim cross-attention), not v1.5.
+ pipe = StableDiffusionPipeline.from_pretrained(
+     "stabilityai/stable-diffusion-2-1",
+     text_encoder=text_encoder,
+     tokenizer=tokenizer,
+     torch_dtype=torch.float16
+ )
+ pipe = pipe.to("cuda")
+
+ # Generate image
+ prompt = "a high-resolution satellite image of urban area"
+ image = pipe(prompt, num_inference_steps=50, guidance_scale=7.5).images[0]
+ image.save("generated_image.png")
+ ```
+
  ## Citation
  If you use this model in your research, please cite the original work:
178