---
base_model:
- openai/clip-vit-large-patch14
- BAAI/bge-small-en-v1.5
- torchgeo/vit_small_patch16_224_sentinel2_all_moco
- DominikM198/OSM-MAE
datasets:
- DominikM198/PP2-M
license: cc-by-4.0
pipeline_tag: feature-extraction
tags:
- SpatialRepresentationLearning
- GeoFoundationModel
- GeoFM
- ContrastiveLearning
- Multimodal
---

# UrbanFusion: Stochastic Multimodal Fusion for Contrastive Learning of Robust Spatial Representations

This repository provides the **pretrained weights** of the **UrbanFusion** model — a framework for learning robust spatial representations through stochastic multimodal fusion, as presented in the paper [UrbanFusion: Stochastic Multimodal Fusion for Contrastive Learning of Robust Spatial Representations](https://huggingface.co/papers/2510.13774).

UrbanFusion can generate **location encodings** from *any subset* of the following modalities (see the masking sketch after this list):  
- 📍 Geographic coordinates  
- 🏙️ Street-view imagery  
- 🛰️ Remote sensing data  
- 🗺️ OSM basemaps  
- 🏬 Points of interest (POIs)  
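
Each modality occupies a fixed position in the model's input list (0 = coordinates, 1 = street view, 2 = remote sensing, 3 = OSM, 4 = POIs; the same indices appear in the usage example below). As a minimal sketch of the masking convention, here is one way to derive `mask_indices` for an arbitrary subset; the `MODALITIES` list and the `mask_for` helper are illustrative, not part of the released API:

```python
# Fixed modality order of the model's input list
# (0=coords, 1=SV, 2=RS, 3=OSM, 4=POI, per the usage example below).
MODALITIES = ["coords", "street_view", "remote_sensing", "osm", "poi"]

def mask_for(*available: str) -> list[int]:
    """Return the indices of all modalities NOT in `available` (these get masked)."""
    return [i for i, m in enumerate(MODALITIES) if m not in available]

mask_for("coords")                 # [1, 2, 3, 4] -> coordinates only
mask_for("coords", "street_view")  # [2, 3, 4]   -> coords + street-view imagery
```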

🔗 The full **source code** is available on [GitHub](https://github.com/DominikM198/UrbanFusion).  

---

## Minimal Usage Example 
Using pretrained models for location encoding is straightforward. The example below demonstrates how to load the model and generate representations based solely on geographic coordinates (latitude and longitude), without requiring any additional input modalities.
```python
import torch
from huggingface_hub import hf_hub_download
from srl.multi_modal_encoder.load import get_urbanfusion

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Coordinates: a batch of 32 (lat, lon) pairs (random values for illustration)
coords = torch.randn(32, 2).to(device)

# Placeholders for other modalities (SV, RS, OSM, POI)
placeholder = torch.empty(32).to(device)
inputs = [coords, placeholder, placeholder, placeholder, placeholder]

# Mask all but coordinates (indices: 0=coords, 1=SV, 2=RS, 3=OSM, 4=POI)
mask_indices = [1, 2, 3, 4]

# Load pretrained UrbanFusion model
ckpt = hf_hub_download("DominikM198/UrbanFusion", "UrbanFusion/UrbanFusion.ckpt")
model = get_urbanfusion(ckpt, device=device).eval()

# Encode inputs (output shape: [32, 768])
with torch.no_grad():
    embeddings = model(inputs, mask_indices=mask_indices, return_representations=True).cpu()
```
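
The resulting embeddings can serve as features for downstream prediction, for example with a simple linear probe. A minimal sketch, assuming you have one scalar target per location; the random target here is purely illustrative (see the tutorials below for real downstream tasks):

```python
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

X = embeddings.numpy()       # [32, 768] location representations from above
y = torch.randn(32).numpy()  # illustrative target, e.g. a per-location indicator

# Cross-validated ridge regression as a quick linear probe
probe = Ridge(alpha=1.0)
scores = cross_val_score(probe, X, y, cv=4, scoring="r2")
print(f"Linear-probe R^2: {scores.mean():.3f}")
```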
For a more comprehensive guide, covering downstream tasks and additional modalities (including options for downloading, preprocessing, and using contextual prompts with or without precomputed features), see the following tutorials:

- [`UrbanFusion_coordinates_only.ipynb`](https://github.com/DominikM198/UrbanFusion/blob/main/tutorials/UrbanFusion_coordinates_only.ipynb)  
- [`UrbanFusion_multimodal.ipynb`](https://github.com/DominikM198/UrbanFusion/blob/main/tutorials/UrbanFusion_multimodal.ipynb)  

---

## 📖 Citation

```bibtex
@article{muehlematter2025urbanfusion,
  title   = {UrbanFusion: Stochastic Multimodal Fusion for Contrastive Learning of Robust Spatial Representations},
  author  = {Dominik J. Mühlematter and Lin Che and Ye Hong and Martin Raubal and Nina Wiedemann},
  year    = {2025},
  journal = {arXiv preprint arXiv:2510.13774},
  eprint  = {2510.13774},
  archivePrefix = {arXiv},
  url     = {https://arxiv.org/abs/2510.13774},
}
```