---
base_model:
- openai/clip-vit-large-patch14
- BAAI/bge-small-en-v1.5
- torchgeo/vit_small_patch16_224_sentinel2_all_moco
- DominikM198/OSM-MAE
datasets:
- DominikM198/PP2-M
license: cc-by-4.0
pipeline_tag: feature-extraction
tags:
- SpatialRepresentationLearning
- GeoFoundationModel
- GeoFM
- ContrastiveLearning
- Multimodal
---
# UrbanFusion: Stochastic Multimodal Fusion for Contrastive Learning of Robust Spatial Representations
This repository provides the pretrained weights of the UrbanFusion model — a framework for learning robust spatial representations through stochastic multimodal fusion, as presented in the paper UrbanFusion: Stochastic Multimodal Fusion for Contrastive Learning of Robust Spatial Representations.
UrbanFusion can generate location encodings from any subset of the following modalities:
- 📍 Geographic coordinates
- 🏙️ Street-view imagery
- 🛰️ Remote sensing data
- 🗺️ OSM basemaps
- 🏬 Points of interest (POIs)
🔗 The full source code is available on GitHub.
## Minimal Usage Example
Using pretrained models for location encoding is straightforward. The example below demonstrates how to load the model and generate representations based solely on geographic coordinates (latitude and longitude), without requiring any additional input modalities.
```python
import torch
from huggingface_hub import hf_hub_download
from srl.multi_modal_encoder.load import get_urbanfusion

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Coordinates: batch of 32 (lat, lon) pairs
coords = torch.randn(32, 2).to(device)

# Placeholders for other modalities (SV, RS, OSM, POI)
placeholder = torch.empty(32).to(device)
inputs = [coords, placeholder, placeholder, placeholder, placeholder]

# Mask all but coordinates (indices: 0=coords, 1=SV, 2=RS, 3=OSM, 4=POI)
mask_indices = [1, 2, 3, 4]

# Load pretrained UrbanFusion model
ckpt = hf_hub_download("DominikM198/UrbanFusion", "UrbanFusion/UrbanFusion.ckpt")
model = get_urbanfusion(ckpt, device=device).eval()

# Encode inputs (output shape: [32, 768])
with torch.no_grad():
    embeddings = model(inputs, mask_indices=mask_indices, return_representations=True).cpu()
```
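To make the `mask_indices` convention concrete, here is a toy, pure-Python sketch. It is an illustration only, not the model's learned fusion: it simply averages the embeddings of the modalities that are *not* masked, using the same index order as above (0=coords, 1=SV, 2=RS, 3=OSM, 4=POI). The `fuse` helper and the 4-dimensional dummy embeddings are hypothetical.

```python
from typing import List

def fuse(modality_embeddings: List[List[float]], mask_indices: List[int]) -> List[float]:
    """Toy fusion: element-wise average of the unmasked modality embeddings.

    Illustrates the masking convention only; UrbanFusion's actual fusion
    is learned, not a plain average.
    """
    kept = [e for i, e in enumerate(modality_embeddings) if i not in mask_indices]
    dim = len(kept[0])
    return [sum(e[d] for e in kept) / len(kept) for d in range(dim)]

# Five dummy 4-dim "modality embeddings": coords, SV, RS, OSM, POI
emb = [[1.0] * 4, [2.0] * 4, [3.0] * 4, [4.0] * 4, [5.0] * 4]

# Mask everything except coordinates (index 0), as in the example above
print(fuse(emb, mask_indices=[1, 2, 3, 4]))  # -> [1.0, 1.0, 1.0, 1.0]

# Keep coordinates and street-view imagery (indices 0 and 1)
print(fuse(emb, mask_indices=[2, 3, 4]))     # -> [1.5, 1.5, 1.5, 1.5]
```

Because the model accepts any subset of modalities, the same pretrained checkpoint can be queried with coordinates alone or with richer inputs; only `mask_indices` changes.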
For a more comprehensive guide, including instructions on applying the model to downstream tasks and incorporating additional modalities (with options for downloading, preprocessing, and using contextual prompts with or without precomputed features), see the tutorials in the GitHub repository.
## 📖 Citation
```bibtex
@article{muehlematter2025urbanfusion,
  title         = {UrbanFusion: Stochastic Multimodal Fusion for Contrastive Learning of Robust Spatial Representations},
  author        = {Dominik J. Mühlematter and Lin Che and Ye Hong and Martin Raubal and Nina Wiedemann},
  year          = {2025},
  journal       = {arXiv preprint arXiv:2510.13774},
  eprint        = {2510.13774},
  archivePrefix = {arXiv},
  url           = {https://arxiv.org/abs/2510.13774},
}
```