---
language:
- en
license: mit
library_name: pytorch
tags:
- remote-sensing
- change-detection
- image-captioning
- multimodal
- retrieval
pipeline_tag: image-to-text
datasets:
- lcybuaa/LEVIR-CC
---

# RSICRC: Multimodal Remote Sensing Image Change Retrieval and Captioning

**RSICRC** is a multimodal foundation model designed for **bi-temporal remote sensing images**. It jointly performs **change captioning** (describing changes between two images) and **text-image retrieval** (finding image pairs that match a text description).

The model leverages Contrastive Learning and a decoupled decoder architecture to handle both tasks simultaneously.
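The contrastive alignment between image-pair embeddings and caption embeddings can be sketched as a standard CLIP-style symmetric InfoNCE objective. This is a minimal illustration, not the repository's exact loss: the function name, temperature value, and weighting here are assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb: torch.Tensor, text_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss aligning image-pair and caption embeddings.

    Matching (image pair, caption) embeddings sit on the diagonal of the
    similarity matrix; all other batch entries act as negatives.
    Shapes: (batch, d_model) for both inputs.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(logits.size(0))            # diagonal = positive pairs
    loss_i2t = F.cross_entropy(logits, targets)       # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)   # text -> image direction
    return 0.5 * (loss_i2t + loss_t2i)
```

At inference time the same shared embedding space enables retrieval: a text query is embedded once and ranked against precomputed image-pair embeddings by cosine similarity.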

## ๐Ÿ“„ Paper
**Towards a Multimodal Framework for Remote Sensing Image Change Retrieval and Captioning**  
*Roger Ferrod, Luigi Di Caro, Dino Ienco*  
Published at **Discovery Science 2024**

[**Read the Paper**](https://doi.org/10.1007/978-3-031-78980-9_15) | [**GitHub Repository**](https://github.com/rogerferrod/RSICRC)

## ๐Ÿ—๏ธ Model Architecture
The framework is inspired by **CoCa** but adapted for bi-temporal remote sensing data.
* **Encoder:** A Siamese network (ResNet-50 or ViT via OpenCLIP) that encodes "before" and "after" images. A Hierarchical Self-Attention (HSA) block and a residual block with a cosine mask fuse the bi-temporal features.
* **Decoder:** A decoupled Transformer decoder split into:
    * **Unimodal Layers:** Encode text only (used for contrastive alignment).
    * **Multimodal Layers:** Apply cross-attention between visual and textual features to generate captions.
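The cosine-mask fusion of the bi-temporal features can be illustrated with a simplified sketch. Note that the actual HSA block and residual layers in the repository are more elaborate; the function name `cosine_mask_fusion` and its exact weighting are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def cosine_mask_fusion(feat_before: torch.Tensor,
                       feat_after: torch.Tensor) -> torch.Tensor:
    """Fuse bi-temporal features with a cosine-similarity mask.

    Tokens where the two feature maps disagree (low cosine similarity)
    are emphasized, since they are the most likely to contain changes.
    Shapes: (batch, num_tokens, d_model).
    """
    # Cosine similarity per token: 1 = unchanged, -1 = maximally different.
    sim = F.cosine_similarity(feat_before, feat_after, dim=-1)  # (B, N)
    mask = (1.0 - sim).unsqueeze(-1)                            # (B, N, 1)
    diff = feat_after - feat_before
    # Residual fusion: keep the "after" features, add the masked difference.
    return feat_after + mask * diff
```

With identical inputs the mask is zero everywhere and the features pass through unchanged, so only genuinely changed regions contribute extra signal to the decoder.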

## ๐Ÿ’ป Usage
This model requires the custom source code from the [GitHub repository](https://github.com/rogerferrod/RSICRC); clone it first so that `src.model.ICCModel` can be imported.

### Inference Code

```python 
import torch
import json
import open_clip
from huggingface_hub import hf_hub_download

from src.model import ICCModel 

# 1. Download necessary files from Hugging Face
repo_id = "rogerferrod/RSICRC"
config_path = hf_hub_download(repo_id=repo_id, filename="config.json")
vocab_path = hf_hub_download(repo_id=repo_id, filename="levir_vocab.json") 
weights_path = hf_hub_download(repo_id=repo_id, filename="pytorch_model.bin")

# 2. Load Configuration and Vocabulary
with open(config_path, 'r') as f:
    config = json.load(f)

with open(vocab_path, 'r') as f:
    vocab = json.load(f)

# 3. Setup Device and Backbone (OpenCLIP)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
clip_model, _, preprocess = open_clip.create_model_and_transforms(config['backbone'])

# 4. Initialize the Model
model = ICCModel(
    device=device,
    clip=clip_model,
    backbone=config['backbone'],
    d_model=config['d_model'],
    vocab_size=len(vocab),
    max_len=config['max_len'],
    num_heads=config['num_heads'],
    h_dim=config['h_dim'],
    a_dim=config['a_dim'],
    encoder_layers=config['encoder_layers'],
    decoder_layers=config['decoder_layers'],
    dropout=config['dropout'],
    learnable=config['learnable'],
    fine_tune=config['fine_tune'],
    tie_embeddings=config['tie_embeddings'],
    prenorm=config['prenorm']
)

# 5. Load Weights
model.load_state_dict(torch.load(weights_path, map_location=device))
model = model.to(device)
model.eval()

print("Model loaded successfully!")
```

## ๐Ÿ“š Citation

If you use this model or code in your research, please cite our paper:

```bibtex
@InProceedings{10.1007/978-3-031-78980-9_15,
author       = {Roger Ferrod and
                  Luigi Di Caro and
                  Dino Ienco},
editor       = {Dino Pedreschi and
                Anna Monreale and
                Riccardo Guidotti and
                Roberto Pellungrini and
                Francesca Naretto},
title        = {Towards a Multimodal Framework for Remote Sensing Image Change Retrieval
                and Captioning},
booktitle    = {Discovery Science - 27th International Conference, {DS} 2024, Pisa,
                Italy, October 14-16, 2024, Proceedings, Part {II}},
series       = {Lecture Notes in Computer Science},
volume       = {15244},
pages        = {231--245},
publisher    = {Springer},
year         = {2024},
url          = {https://doi.org/10.1007/978-3-031-78980-9\_15},
doi          = {10.1007/978-3-031-78980-9\_15}
}
```