---
language:
- en
license: mit
library_name: pytorch
tags:
- remote-sensing
- change-detection
- image-captioning
- multimodal
- retrieval
pipeline_tag: image-to-text
datasets:
- lcybuaa/LEVIR-CC
---

# RSICRC: Multimodal Remote Sensing Image Change Retrieval and Captioning

**RSICRC** is a multimodal foundation model designed for **bi-temporal remote sensing images**. It jointly performs **change captioning** (describing changes between two images) and **text-image retrieval** (finding image pairs that match a text description).

The model leverages contrastive learning and a decoupled decoder architecture to handle both tasks simultaneously.
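
The contrastive side of this objective can be illustrated with a generic CLIP-style symmetric InfoNCE loss between fused image-pair embeddings and caption embeddings. This is a minimal sketch of the general technique, not the repository's exact implementation; the function name and temperature value are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(pair_emb: torch.Tensor, text_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE: matched (image-pair, caption) rows are positives."""
    pair_emb = F.normalize(pair_emb, dim=-1)   # unit-norm embeddings
    text_emb = F.normalize(text_emb, dim=-1)
    logits = pair_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0))     # diagonal entries are positives
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

# Toy check: scalar loss on random embeddings for a batch of 4 pairs
loss = contrastive_loss(torch.randn(4, 512), torch.randn(4, 512))
print(loss.item())
```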

## Paper

**Towards a Multimodal Framework for Remote Sensing Image Change Retrieval and Captioning**
*Roger Ferrod, Luigi Di Caro, Dino Ienco*
Published at **Discovery Science 2024**

[**Read the Paper**](https://doi.org/10.1007/978-3-031-78980-9_15) | [**GitHub Repository**](https://github.com/rogerferrod/RSICRC)

## Model Architecture

The framework is inspired by **CoCa** but adapted for bi-temporal remote sensing data.

* **Encoder:** A Siamese network (ResNet-50 or ViT via OpenCLIP) encodes the "before" and "after" images. A Hierarchical Self-Attention (HSA) block and a residual block with a cosine mask fuse the bi-temporal features.
* **Decoder:** A decoupled Transformer decoder split into:
  * **Unimodal layers:** encode text only (used for contrastive alignment).
  * **Multimodal layers:** apply cross-attention between visual and textual features to generate captions.
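
The cosine-mask fusion idea can be sketched as follows. The class name, the projection layer, and the exact mask formulation here are illustrative assumptions, simplified from the paper's actual encoder block; it only demonstrates the general pattern of weighting a residual update by bi-temporal dissimilarity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CosineMaskFusion(nn.Module):
    """Illustrative bi-temporal fusion: a residual update weighted by a
    cosine-similarity mask (low similarity = likely change). A simplified
    stand-in, not the paper's exact encoder block."""

    def __init__(self, d_model: int):
        super().__init__()
        self.proj = nn.Linear(2 * d_model, d_model)

    def forward(self, before: torch.Tensor, after: torch.Tensor) -> torch.Tensor:
        # Per-token cosine similarity between the two time steps
        cos = F.cosine_similarity(before, after, dim=-1)
        change_mask = (1.0 - cos).unsqueeze(-1)        # emphasise changed tokens
        fused = self.proj(torch.cat([before, after], dim=-1))
        return fused + change_mask * (after - before)  # masked residual

# Shape check on random token features: (batch=2, tokens=49, dim=512)
fusion = CosineMaskFusion(512)
print(fusion(torch.randn(2, 49, 512), torch.randn(2, 49, 512)).shape)
```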

## Usage

To use this model, clone the [GitHub repository](https://github.com/rogerferrod/RSICRC) and load the checkpoint with the custom source code (the `src` package).

### Inference Code

```python
import torch
import json
import open_clip
from huggingface_hub import hf_hub_download

from src.model import ICCModel

# 1. Download the necessary files from Hugging Face
repo_id = "rogerferrod/RSICRC"
config_path = hf_hub_download(repo_id=repo_id, filename="config.json")
vocab_path = hf_hub_download(repo_id=repo_id, filename="levir_vocab.json")
weights_path = hf_hub_download(repo_id=repo_id, filename="pytorch_model.bin")

# 2. Load the configuration and vocabulary
with open(config_path, 'r') as f:
    config = json.load(f)

with open(vocab_path, 'r') as f:
    vocab = json.load(f)

# 3. Set up the device and the OpenCLIP backbone
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
clip_model, _, preprocess = open_clip.create_model_and_transforms(config['backbone'])

# 4. Initialize the model
model = ICCModel(
    device=device,
    clip=clip_model,
    backbone=config['backbone'],
    d_model=config['d_model'],
    vocab_size=len(vocab),
    max_len=config['max_len'],
    num_heads=config['num_heads'],
    h_dim=config['h_dim'],
    a_dim=config['a_dim'],
    encoder_layers=config['encoder_layers'],
    decoder_layers=config['decoder_layers'],
    dropout=config['dropout'],
    learnable=config['learnable'],
    fine_tune=config['fine_tune'],
    tie_embeddings=config['tie_embeddings'],
    prenorm=config['prenorm']
)

# 5. Load the weights
model.load_state_dict(torch.load(weights_path, map_location=device))
model = model.to(device)
model.eval()

print("Model loaded successfully!")
```
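
Once the model produces embeddings for image pairs and for a text query (the exact forward API is defined by `ICCModel` in the repository), text-to-image retrieval reduces to cosine-similarity ranking. The sketch below uses hypothetical precomputed embeddings to show only that ranking step.

```python
import torch
import torch.nn.functional as F

def rank_image_pairs(query_emb: torch.Tensor, pair_embs: torch.Tensor) -> torch.Tensor:
    """Return indices of image-pair embeddings sorted by cosine similarity
    to a single text-query embedding (most similar first)."""
    sims = F.normalize(pair_embs, dim=-1) @ F.normalize(query_emb, dim=-1)
    return torch.argsort(sims, descending=True)

# Toy example: pair #2 is identical to the query, so it should rank first
query = torch.randn(512)
pairs = torch.randn(5, 512)
pairs[2] = query
print(rank_image_pairs(query, pairs)[0].item())  # 2
```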

## Citation

If you use this model or code in your research, please cite our paper:

```bibtex
@InProceedings{10.1007/978-3-031-78980-9_15,
  author    = {Roger Ferrod and
               Luigi Di Caro and
               Dino Ienco},
  editor    = {Dino Pedreschi and
               Anna Monreale and
               Riccardo Guidotti and
               Roberto Pellungrini and
               Francesca Naretto},
  title     = {Towards a Multimodal Framework for Remote Sensing Image Change Retrieval
               and Captioning},
  booktitle = {Discovery Science - 27th International Conference, {DS} 2024, Pisa,
               Italy, October 14-16, 2024, Proceedings, Part {II}},
  series    = {Lecture Notes in Computer Science},
  volume    = {15244},
  pages     = {231--245},
  publisher = {Springer},
  year      = {2024},
  url       = {https://doi.org/10.1007/978-3-031-78980-9\_15},
  doi       = {10.1007/978-3-031-78980-9\_15}
}
```