RSICRC: Multimodal Remote Sensing Image Change Retrieval and Captioning
RSICRC is a multimodal foundation model designed for bi-temporal remote sensing images. It jointly performs change captioning (describing changes between two images) and text-image retrieval (finding image pairs that match a text description).
The model combines contrastive learning with a decoupled decoder architecture, so a single network handles both tasks.
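As a rough illustration of the retrieval task, the sketch below ranks candidate image pairs by cosine similarity to a text query in a shared embedding space. It is a minimal sketch, not the repository's code: the function names `cosine_similarity` and `rank_image_pairs` are hypothetical, and the vectors are toy values standing in for embeddings that the model would produce.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_image_pairs(text_embedding, pair_embeddings):
    """Return the indices of the image pairs, best match first."""
    scores = [cosine_similarity(text_embedding, p) for p in pair_embeddings]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Toy embeddings (hypothetical values, not model outputs).
query = [1.0, 0.0, 0.0]
pairs = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [0.9, 0.1, 0.0],  # close to the query
    [0.5, 0.5, 0.0],  # in between
]

ranking = rank_image_pairs(query, pairs)
print(ranking)
```

In practice the query vector would come from the text side of the model and each candidate vector from the fused bi-temporal image features.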
Paper
Towards a Multimodal Framework for Remote Sensing Image Change Retrieval and Captioning
Roger Ferrod, Luigi Di Caro, Dino Ienco
Published at Discovery Science 2024
Read the Paper | GitHub Repository
Model Architecture
The framework is inspired by CoCa but adapted for bi-temporal remote sensing data.
- Encoder: A Siamese network (ResNet-50 or ViT via OpenCLIP) that encodes "before" and "after" images. A Hierarchical Self-Attention (HSA) block and a residual block with a cosine mask fuse the bi-temporal features.
- Decoder: A decoupled Transformer decoder split into:
  - Unimodal Layers: Encode text only (used for contrastive alignment).
  - Multimodal Layers: Apply cross-attention between visual and textual features to generate captions.
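The contrastive alignment mentioned for the unimodal layers can be illustrated with a symmetric contrastive (InfoNCE-style) loss of the kind CoCa uses. This is a sketch under stated assumptions rather than the model's training code: the function names are hypothetical, the temperature of 0.07 is an assumed value, and the inputs are toy L2-normalized vectors where index `i` on the image side matches index `i` on the text side.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over a list of floats."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - log_z for v in row]


def symmetric_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Average of image-to-text and text-to-image cross-entropy.

    Assumes image_embs[i] and text_embs[i] form a matching pair and that
    all vectors are already L2-normalized. The temperature is an assumption.
    """
    n = len(image_embs)
    # Rows index image pairs, columns index captions.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in text_embs]
        for img in image_embs
    ]
    # Image-to-text direction: the correct caption sits on the diagonal.
    i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image direction: same computation on the transposed logits.
    transposed = [list(col) for col in zip(*logits)]
    t2i = -sum(log_softmax(transposed[i])[i] for i in range(n)) / n
    return 0.5 * (i2t + t2i)


# Toy, already-normalized embeddings (hypothetical values).
aligned = [[1.0, 0.0], [0.0, 1.0]]
loss_aligned = symmetric_contrastive_loss(aligned, aligned)
loss_swapped = symmetric_contrastive_loss(aligned, aligned[::-1])
print(loss_aligned < loss_swapped)
```

In a CoCa-style setup this term is combined with the captioning loss from the multimodal layers; how the two are weighted here is not stated in this card.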
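Usage
This model relies on custom source code: run the snippet below from a checkout of the GitHub repository so that src.model, which defines ICCModel, can be imported.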
Inference Code
import torch
import json
import open_clip
from huggingface_hub import hf_hub_download
from src.model import ICCModel
# 1. Download necessary files from Hugging Face
repo_id = "rogerferrod/RSICRC"
config_path = hf_hub_download(repo_id=repo_id, filename="config.json")
vocab_path = hf_hub_download(repo_id=repo_id, filename="levir_vocab.json")
weights_path = hf_hub_download(repo_id=repo_id, filename="pytorch_model.bin")
# 2. Load Configuration and Vocabulary
with open(config_path, 'r') as f:
    config = json.load(f)
with open(vocab_path, 'r') as f:
    vocab = json.load(f)
# 3. Setup Device and Backbone (OpenCLIP)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
clip_model, _, preprocess = open_clip.create_model_and_transforms(config['backbone'])
# 4. Initialize the Model
model = ICCModel(
    device=device,
    clip=clip_model,
    backbone=config['backbone'],
    d_model=config['d_model'],
    vocab_size=len(vocab),
    max_len=config['max_len'],
    num_heads=config['num_heads'],
    h_dim=config['h_dim'],
    a_dim=config['a_dim'],
    encoder_layers=config['encoder_layers'],
    decoder_layers=config['decoder_layers'],
    dropout=config['dropout'],
    learnable=config['learnable'],
    fine_tune=config['fine_tune'],
    tie_embeddings=config['tie_embeddings'],
    prenorm=config['prenorm']
)
# 5. Load Weights
model.load_state_dict(torch.load(weights_path, map_location=device))
model = model.to(device)
model.eval()
print("Model loaded successfully!")
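The constructor call above reads a fixed set of keys from config.json, so a missing key only surfaces as a bare `KeyError` partway through model construction. The optional sketch below checks the keys up front. The helper names `missing_config_keys` and `check_config` are hypothetical and not part of the repository; the key list is copied from the snippet above.

```python
# Keys that the inference snippet reads from config.json.
REQUIRED_KEYS = (
    "backbone", "d_model", "max_len", "num_heads", "h_dim", "a_dim",
    "encoder_layers", "decoder_layers", "dropout", "learnable",
    "fine_tune", "tie_embeddings", "prenorm",
)


def missing_config_keys(config):
    """Return the required keys that are absent from the config dict."""
    return [key for key in REQUIRED_KEYS if key not in config]


def check_config(config):
    """Raise a descriptive error if any required key is missing."""
    missing = missing_config_keys(config)
    if missing:
        raise KeyError("config.json is missing keys: " + ", ".join(missing))


# Toy config with placeholder values, used only to exercise the helpers.
complete = {key: None for key in REQUIRED_KEYS}
partial = {"backbone": None, "d_model": None}
print(missing_config_keys(complete))
print(len(missing_config_keys(partial)))
```

Calling `check_config(config)` right after step 2 would report all missing keys at once.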
Citation
If you use this model or code in your research, please cite our paper:
@InProceedings{10.1007/978-3-031-78980-9_15,
author = {Roger Ferrod and
Luigi Di Caro and
Dino Ienco},
editor = {Dino Pedreschi and
Anna Monreale and
Riccardo Guidotti and
Roberto Pellungrini and
Francesca Naretto},
title = {Towards a Multimodal Framework for Remote Sensing Image Change Retrieval
and Captioning},
booktitle = {Discovery Science - 27th International Conference, {DS} 2024, Pisa,
Italy, October 14-16, 2024, Proceedings, Part {II}},
series = {Lecture Notes in Computer Science},
volume = {15244},
pages = {231--245},
publisher = {Springer},
year = {2024},
url = {https://doi.org/10.1007/978-3-031-78980-9\_15},
doi = {10.1007/978-3-031-78980-9\_15}
}