---
license: apache-2.0
tags:
- geolocation
- vision
- siglip
- clip
- geoclip
datasets:
- osv5m
pipeline_tag: image-feature-extraction
---

# GeoSpot Base

A geolocation model built on SigLIP2-so400m (512px) that predicts GPS coordinates from images.

## Model Details

- **Backbone**: google/siglip2-so400m-patch16-512 (frozen)
- **Image Resolution**: 512x512
- **Embedding Dim**: 512
- **Training Steps**: 206k
- **Training Data**: ~10.6M streetview images

## Architecture

GeoCLIP-style contrastive learning that aligns two encoders in a shared 512-dimensional embedding space:

- **Image Encoder**: SigLIP2 vision tower + MLP projection (1152 → 512)
- **Location Encoder**: Multi-scale random Fourier feature (RFF) encoding with learnable capsules

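To make the location encoder more concrete, here is a minimal standard-library sketch of multi-scale random Fourier features for a latitude/longitude pair. It is an illustration under stated assumptions, not this model's actual code: the function names, the scale values `(1.0, 16.0, 256.0)`, the feature count, and the simple coordinate normalization are all hypothetical, and the learnable capsule MLPs that would follow each scale are omitted.

```python
import math
import random


def build_banks(sigmas, num_features, seed=0):
    """Draw one bank of random 2-D frequency vectors per scale (hypothetical helper)."""
    rng = random.Random(seed)
    return [
        [(rng.gauss(0.0, sigma), rng.gauss(0.0, sigma)) for _ in range(num_features)]
        for sigma in sigmas
    ]


def rff_encode(lat, lon, freqs):
    """Map (lat, lon) to [cos, sin] pairs for a single frequency bank."""
    # Assumed normalization: scale both coordinates into [-1, 1].
    x, y = lat / 90.0, lon / 180.0
    feats = []
    for bx, by in freqs:
        phase = 2.0 * math.pi * (x * bx + y * by)
        feats.append(math.cos(phase))
        feats.append(math.sin(phase))
    return feats


def multi_scale_encode(lat, lon, banks):
    """Concatenate the RFF features from every scale.

    In a full model each scale would feed its own learnable MLP ("capsule");
    that step is left out of this sketch.
    """
    feats = []
    for freqs in banks:
        feats.extend(rff_encode(lat, lon, freqs))
    return feats


# Illustrative scales: small sigma captures coarse structure, large sigma fine detail.
banks = build_banks(sigmas=(1.0, 16.0, 256.0), num_features=4, seed=0)
features = multi_scale_encode(48.8584, 2.2945, banks)
```

Larger `sigma` values produce higher-frequency features, so nearby coordinates become distinguishable, while smaller values keep the encoding smooth over long distances.
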
## Usage

```python
from geoclip.model.GeoCLIP import GeoCLIP
from safetensors.torch import load_file

# Instantiate the architecture; the weights come from the checkpoint below
model = GeoCLIP(from_pretrained=False, encoder_name="siglip2")

# .safetensors checkpoints must be read with safetensors, not torch.load
state_dict = load_file("model.safetensors")
model.load_state_dict(state_dict)
model.eval()

# Predict the top-5 candidate locations for an image
top_gps, top_probs = model.predict("image.jpg", top_k=5)
```
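
For intuition about what a top-k prediction means in a GeoCLIP-style model, the sketch below ranks a small gallery of candidate GPS coordinates by cosine similarity to an image embedding and converts the scores to probabilities with a softmax. This is an assumption about the general approach rather than a description of this model's `predict` implementation; the function names, the `logit_scale` parameter, and the toy embeddings are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_locations(image_emb, gallery_embs, gallery_gps, top_k=5, logit_scale=1.0):
    """Return the top_k gallery coordinates and their probabilities."""
    img = l2_normalize(image_emb)
    logits = [
        logit_scale * sum(a * b for a, b in zip(img, l2_normalize(loc)))
        for loc in gallery_embs
    ]
    probs = softmax(logits)
    # Indices of the highest-probability gallery entries, best first.
    order = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:top_k]
    return [gallery_gps[i] for i in order], [probs[i] for i in order]


# Toy 2-D embeddings standing in for the real 512-D ones.
gallery_embs = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
gallery_gps = [(48.85, 2.35), (40.71, -74.0), (-33.87, 151.21)]
gps, probs = top_k_locations([0.1, 0.9], gallery_embs, gallery_gps, top_k=2)
```

Because the probabilities are normalized over the whole gallery, the values returned for the top k entries generally sum to less than 1.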