upload model
- README.md +119 -3
- config.json +18 -0
- levir_vocab.json +1 -0
- pytorch_model.bin +3 -0
README.md
CHANGED
|
@@ -1,3 +1,119 @@
|
|
| 1 |
-
---
|
| 2 |
-
|
| 3 |
-
|
| 1 |
+
---
|
| 2 |
+
language:
|
| 3 |
+
- en
|
| 4 |
+
license: mit
|
| 5 |
+
library_name: pytorch
|
| 6 |
+
tags:
|
| 7 |
+
- remote-sensing
|
| 8 |
+
- change-detection
|
| 9 |
+
- image-captioning
|
| 10 |
+
- multimodal
|
| 11 |
+
- retrieval
|
| 12 |
+
datasets:
|
| 13 |
+
- levir-cc
|
| 14 |
+
pipeline_tag: image-to-text
|
| 15 |
+
---
|
| 16 |
+
|
| 17 |
+
# RSICRC: Multimodal Remote Sensing Image Change Retrieval and Captioning
|
| 18 |
+
|
| 19 |
+
**RSICRC** is a multimodal foundation model designed for **bi-temporal remote sensing images**. It jointly performs **change captioning** (describing changes between two images) and **text-image retrieval** (finding image pairs that match a text description).
|
| 20 |
+
|
| 21 |
+
The model combines contrastive learning with a decoupled decoder architecture so that a single network handles both tasks.
|
| 22 |
+
|
| 23 |
+
## 📄 Paper
|
| 24 |
+
**Towards a Multimodal Framework for Remote Sensing Image Change Retrieval and Captioning** *Roger Ferrod, Luigi Di Caro, Dino Ienco* Published at **Discovery Science 2024**
|
| 25 |
+
|
| 26 |
+
[**Read the Paper**](https://doi.org/10.1007/978-3-031-78980-9_15) | [**GitHub Repository**](https://github.com/rogerferrod/RSICRC)
|
| 27 |
+
|
| 28 |
+
## 🏗️ Model Architecture
|
| 29 |
+
The framework is inspired by **CoCa** but adapted for bi-temporal remote sensing data.
|
| 30 |
+
* **Encoder:** A Siamese network (ResNet-50 or ViT via OpenCLIP) that encodes "before" and "after" images. A Hierarchical Self-Attention (HSA) block and a residual block with a cosine mask fuse the bi-temporal features.
|
| 31 |
+
* **Decoder:** A decoupled Transformer decoder split into:
|
| 32 |
+
* **Unimodal Layers:** Encode text only (used for contrastive alignment).
|
| 33 |
+
* **Multimodal Layers:** Apply cross-attention between visual and textual features to generate captions.
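
As a rough illustration of how contrastive alignment supports retrieval, the sketch below ranks candidate image-pair embeddings by cosine similarity to a text embedding. The helper names (`cosine`, `rank_pairs`) and the toy 3-dimensional vectors are hypothetical and are not taken from the RSICRC code base; how the model actually produces and compares embeddings is defined in the repository.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_pairs(text_emb, pair_embs):
    """Return indices of image-pair embeddings, best match first."""
    scores = [cosine(text_emb, p) for p in pair_embs]
    return sorted(range(len(pair_embs)), key=lambda i: scores[i], reverse=True)


# Toy embeddings, made up for illustration (not model outputs).
text_emb = [1.0, 0.0, 0.0]
pair_embs = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [0.9, 0.1, 0.0],  # close to the query
    [0.5, 0.5, 0.0],  # partially aligned
]
ranking = rank_pairs(text_emb, pair_embs)
print(ranking)
```

In a contrastively trained model, matching text and image-pair embeddings are pushed together during training, which is what makes a similarity ranking of this kind meaningful at inference time.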
|
| 34 |
+
|
| 35 |
+
## 💻 Usage
|
| 36 |
+
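This model is not loadable through a stock library class. To use it, you need the custom source code from the GitHub repository linked above, which provides the `ICCModel` class imported below.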
|
| 37 |
+
|
| 38 |
+
### Inference Code
|
| 39 |
+
|
| 40 |
+
```python
|
| 41 |
+
import torch
|
| 42 |
+
import json
|
| 43 |
+
import open_clip
|
| 44 |
+
from huggingface_hub import hf_hub_download
|
| 45 |
+
|
| 46 |
+
from src.model import ICCModel
|
| 47 |
+
|
| 48 |
+
# 1. Download necessary files from Hugging Face
|
| 49 |
+
repo_id = "rogerferrod/RSICRC"
|
| 50 |
+
config_path = hf_hub_download(repo_id=repo_id, filename="config.json")
|
| 51 |
+
vocab_path = hf_hub_download(repo_id=repo_id, filename="levir_vocab.json")
|
| 52 |
+
weights_path = hf_hub_download(repo_id=repo_id, filename="pytorch_model.bin")
|
| 53 |
+
|
| 54 |
+
# 2. Load Configuration and Vocabulary
|
| 55 |
+
with open(config_path, 'r') as f:
|
| 56 |
+
    config = json.load(f)
|
| 57 |
+
|
| 58 |
+
with open(vocab_path, 'r') as f:
|
| 59 |
+
    vocab = json.load(f)
|
| 60 |
+
|
| 61 |
+
# 3. Setup Device and Backbone (OpenCLIP)
|
| 62 |
+
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
|
| 63 |
+
clip_model, _, preprocess = open_clip.create_model_and_transforms(config['backbone'])
|
| 64 |
+
|
| 65 |
+
# 4. Initialize the Model
|
| 66 |
+
model = ICCModel(
|
| 67 |
+
    device=device,
|
| 68 |
+
    clip=clip_model,
|
| 69 |
+
    backbone=config['backbone'],
|
| 70 |
+
    d_model=config['d_model'],
|
| 71 |
+
    vocab_size=len(vocab),
|
| 72 |
+
    max_len=config['max_len'],
|
| 73 |
+
    num_heads=config['num_heads'],
|
| 74 |
+
    h_dim=config['h_dim'],
|
| 75 |
+
    a_dim=config['a_dim'],
|
| 76 |
+
    encoder_layers=config['encoder_layers'],
|
| 77 |
+
    decoder_layers=config['decoder_layers'],
|
| 78 |
+
    dropout=config['dropout'],
|
| 79 |
+
    learnable=config['learnable'],
|
| 80 |
+
    fine_tune=config['fine_tune'],
|
| 81 |
+
    tie_embeddings=config['tie_embeddings'],
|
| 82 |
+
    prenorm=config['prenorm']
|
| 83 |
+
)
|
| 84 |
+
|
| 85 |
+
# 5. Load Weights
|
| 86 |
+
model.load_state_dict(torch.load(weights_path, map_location=device))
|
| 87 |
+
model = model.to(device)
|
| 88 |
+
model.eval()
|
| 89 |
+
|
| 90 |
+
print("Model loaded successfully!")
|
| 91 |
+
```
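
The snippet above stops once the weights are loaded. To work with captions you also need to map between text and the integer ids in `levir_vocab.json`. The following is a minimal sketch of that mapping, assuming lowercase whitespace tokenization and padding to a fixed length. The `encode` and `decode` helpers are hypothetical; the repository's own preprocessing may differ (for example, in whether `max_len` counts the `START` and `END` tokens).

```python
# Small excerpt of levir_vocab.json, including its four special tokens.
vocab = {
    "PAD": 0, "START": 1, "UNK": 2, "END": 3,
    "the": 4, "a": 5, "is": 6, "has": 7, "road": 8, "built": 10, "been": 40,
}


def encode(caption, vocab, max_len):
    """Map a caption to ids: START + tokens + END, truncated/padded to max_len."""
    tokens = caption.lower().split()  # assumption: whitespace tokenization
    ids = [vocab["START"]]
    ids += [vocab.get(tok, vocab["UNK"]) for tok in tokens]
    ids.append(vocab["END"])
    ids = ids[:max_len]  # note: truncation may drop the END token
    return ids + [vocab["PAD"]] * (max_len - len(ids))


def decode(ids, vocab):
    """Map ids back to text, skipping PAD/START and stopping at END."""
    inverse = {idx: word for word, idx in vocab.items()}
    words = []
    for idx in ids:
        if idx == vocab["END"]:
            break
        if idx in (vocab["PAD"], vocab["START"]):
            continue
        words.append(inverse.get(idx, "UNK"))
    return " ".join(words)


ids = encode("a road has been built", vocab, max_len=10)
text = decode(ids, vocab)
print(ids)
print(text)
```

With the full vocabulary loaded from `vocab_path`, the same `decode` helper could be applied to the id sequence produced by the captioning decoder.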
|
| 92 |
+
|
| 93 |
+
## 📚 Citation
|
| 94 |
+
|
| 95 |
+
If you use this model or code in your research, please cite our paper:
|
| 96 |
+
|
| 97 |
+
```bibtex
|
| 98 |
+
@InProceedings{10.1007/978-3-031-78980-9_15,
|
| 99 |
+
author = {Roger Ferrod and
|
| 100 |
+
Luigi Di Caro and
|
| 101 |
+
Dino Ienco},
|
| 102 |
+
editor = {Dino Pedreschi and
|
| 103 |
+
Anna Monreale and
|
| 104 |
+
Riccardo Guidotti and
|
| 105 |
+
Roberto Pellungrini and
|
| 106 |
+
Francesca Naretto},
|
| 107 |
+
title = {Towards a Multimodal Framework for Remote Sensing Image Change Retrieval
|
| 108 |
+
and Captioning},
|
| 109 |
+
booktitle = {Discovery Science - 27th International Conference, {DS} 2024, Pisa,
|
| 110 |
+
Italy, October 14-16, 2024, Proceedings, Part {II}},
|
| 111 |
+
series = {Lecture Notes in Computer Science},
|
| 112 |
+
volume = {15244},
|
| 113 |
+
pages = {231--245},
|
| 114 |
+
publisher = {Springer},
|
| 115 |
+
year = {2024},
|
| 116 |
+
url = {https://doi.org/10.1007/978-3-031-78980-9\_15},
|
| 117 |
+
doi = {10.1007/978-3-031-78980-9\_15}
|
| 118 |
+
}
|
| 119 |
+
```
|
config.json
ADDED
|
@@ -0,0 +1,18 @@
|
| 1 |
+
{
|
| 2 |
+
"backbone": "RN50",
|
| 3 |
+
"d_model": 2048,
|
| 4 |
+
"max_len": 41,
|
| 5 |
+
"encoder_layers": 3,
|
| 6 |
+
"decoder_layers": 1,
|
| 7 |
+
"num_heads": 8,
|
| 8 |
+
"h_dim": 512,
|
| 9 |
+
"a_dim": 2048,
|
| 10 |
+
"dropout": 0.1,
|
| 11 |
+
"learnable": false,
|
| 12 |
+
"fine_tune": true,
|
| 13 |
+
"tie_embeddings": true,
|
| 14 |
+
"prenorm": false,
|
| 15 |
+
"s-transformers": "sentence-transformers/msmarco-distilbert-cos-v5",
|
| 16 |
+
"s-threshold": 1.0,
|
| 17 |
+
"fna": true
|
| 18 |
+
}
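
A quick way to sanity-check this configuration against the README loading snippet is sketched below. It compares the keys in `config.json` with the keys the README passes to `ICCModel` and computes the per-head width under the usual multi-head attention convention of `d_model / num_heads`. Whether `ICCModel` splits heads this way is an assumption, and the names `CONFIG_JSON` and `USED_BY_README` are illustrative only.

```python
import json

# Contents of config.json as added in this commit.
CONFIG_JSON = """
{
  "backbone": "RN50", "d_model": 2048, "max_len": 41,
  "encoder_layers": 3, "decoder_layers": 1, "num_heads": 8,
  "h_dim": 512, "a_dim": 2048, "dropout": 0.1,
  "learnable": false, "fine_tune": true, "tie_embeddings": true,
  "prenorm": false,
  "s-transformers": "sentence-transformers/msmarco-distilbert-cos-v5",
  "s-threshold": 1.0, "fna": true
}
"""

# Keys that the README snippet reads from the config when building ICCModel.
USED_BY_README = {
    "backbone", "d_model", "max_len", "num_heads", "h_dim", "a_dim",
    "encoder_layers", "decoder_layers", "dropout", "learnable",
    "fine_tune", "tie_embeddings", "prenorm",
}

config = json.loads(CONFIG_JSON)

# Keys the README needs but the config lacks (should be empty).
missing = sorted(USED_BY_README - set(config))
# Keys present in the config that the README snippet never reads.
extra = sorted(set(config) - USED_BY_README)

# Assumption: standard multi-head attention, so d_model must divide evenly.
assert config["d_model"] % config["num_heads"] == 0
head_dim = config["d_model"] // config["num_heads"]

print(missing, extra, head_dim)
```

The three keys that the README snippet does not read (`s-transformers`, `s-threshold`, `fna`) are presumably training-time settings; their exact role is documented in the repository rather than in this commit.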
|
levir_vocab.json
ADDED
|
@@ -0,0 +1 @@
|
| 1 |
+
{"the": 4, "a": 5, "is": 6, "has": 7, "road": 8, "and": 9, "built": 10, "are": 11, "no": 12, "houses": 13, "on": 14, "of": 15, "two": 16, "scene": 17, "in": 18, "some": 19, "roads": 20, "appear": 21, "as": 22, "almost": 23, "there": 24, "changed": 25, "change": 26, "same": 27, "before": 28, "occurred": 29, "identical": 30, "seem": 31, "scenes": 32, "difference": 33, "nothing": 34, "buildings": 35, "bareland": 36, "many": 37, "with": 38, "appears": 39, "been": 40, "at": 41, "trees": 42, "along": 43, "constructed": 44, "villas": 45, "several": 46, "have": 47, "building": 48, "corner": 49, "removed": 50, "sides": 51, "both": 52, "house": 53, "top": 54, "right": 55, "left": 56, "bottom": 57, "beside": 58, "to": 59, "parking": 60, "woods": 61, "rows": 62, "lot": 63, "by": 64, "area": 65, "forest": 66, "side": 67, "near": 68, "around": 69, "plants": 70, "original": 71, "residential": 72, "desert": 73, "new": 74, "replace": 75, "large": 76, "row": 77, "three": 78, "small": 79, "vegetation": 80, "replaced": 81, "disappear": 82, "grass": 83, "neatly": 84, "upper": 85, "one": 86, "up": 87, "more": 88, "arranged": 89, "appeared": 90, "into": 91, "next": 92, "detached": 93, "open": 94, "space": 95, "disappears": 96, "lots": 97, "wasteland": 98, "villa": 99, "clearing": 100, "it": 101, "path": 102, "meadow": 103, "massive": 104, "crossroad": 105, "across": 106, "show": 107, "center": 108, "between": 109, "middle": 110, "cement": 111, "edge": 112, "lower-right": 113, "four": 114, "becomes": 115, "among": 116, "concrete": 117, "lower-left": 118, "developed": 119, "winding": 120, "former": 121, "grassland": 122, "few": 123, "straight": 124, "completed": 125, "vertical": 126, "an": 127, "rebuilt": 128, "shows": 129, "alongside": 130, "square": 131, "replaces": 132, "t-shaped": 133, "newly": 134, "them": 135, "cross": 136, "erected": 137, "most": 138, "roadside": 139, "connected": 140, "end": 141, "cars": 142, "main": 143, "ground": 144, "big": 145, "crossing": 146, "neat": 147, 
"turning": 148, "lake": 149, "part": 150, "lines": 151, "pool": 152, "another": 153, "added": 154, "this": 155, "old": 156, "bushes": 157, "woodland": 158, "half": 159, "all": 160, "become": 161, "ones": 162, "scattered": 163, "disappeared": 164, "construction": 165, "lower": 166, "other": 167, "above": 168, "replacing": 169, "line": 170, "branch": 171, "through": 172, "paths": 173, "located": 174, "existing": 175, "place": 176, "below": 177, "group": 178, "wide": 179, "street": 180, "surrounded": 181, "lush": 182, "sparse": 183, "dirt": 184, "front": 185, "turned": 186, "field": 187, "huge": 188, "dense": 189, "bare": 190, "surrounding": 191, "extended": 192, "reconstructed": 193, "swimming": 194, "down": 195, "ring": 196, "cut": 197, "track": 198, "runs": 199, "cleared": 200, "giant": 201, "land": 202, "room": 203, "intersecting": 204, "curved": 205, "parallel": 206, "blocks": 207, "widened": 208, "site": 209, "constructions": 210, "situated": 211, "join": 212, "circular": 213, "white": 214, "from": 215, "complex": 216, "parked": 217, "bypass": 218, "five": 219, "while": 220, "narrow": 221, "vanished": 222, "screen": 223, "playground": 224, "turn": 225, "distributed": 226, "long": 227, "wood": 228, "lusher": 229, "grow": 230, "planted": 231, "areas": 232, "turns": 233, "rooms": 234, "green": 235, "crossroads": 236, "nearby": 237, "extends": 238, "vehicles": 239, "either": 240, "water": 241, "single": 242, "round": 243, "grown": 244, "arc": 245, "decreases": 246, "separate": 247, "mansion": 248, "circle": 249, "run": 250, "storage": 251, "pools": 252, "shrubs": 253, "staggered": 254, "withered": 255, "for": 256, "expanded": 257, "grows": 258, "forests": 259, "bungalow": 260, "others": 261, "quantity": 262, "finished": 263, "places": 264, "mansions": 265, "warehouse": 266, "factory": 267, "asphalt": 268, "tanks": 269, "medium": 270, "angle": 271, "intersection": 272, "surround": 273, "picture": 274, "sites": 275, "parts": 276, "corners": 277, "renovated": 278, 
"level": 279, "leading": 280, "bigger": 281, "higher": 282, "filled": 283, "floor": 284, "structures": 285, "bungalows": 286, "roadsides": 287, "covered": 288, "certain": 289, "river": 290, "park": 291, "roundabout": 292, "lined": 293, "pond": 294, "densely": 295, "increased": 296, "branches": 297, "extend": 298, "its": 299, "facilities": 300, "fields": 301, "mall": 302, "greatly": 303, "trucks": 304, "running": 305, "stand": 306, "attached": 307, "yards": 308, "connecting": 309, "larger": 310, "vacant": 311, "wider": 312, "restored": 313, "formed": 314, "unsurfaced": 315, "extra": 316, "reservoir": 317, "number": 318, "under": 319, "basketball": 320, "central": 321, "out": 322, "jungle": 323, "decreased": 324, "vanishes": 325, "much": 326, "risen": 327, "behind": 328, "court": 329, "fill": 330, "converge": 331, "well": 332, "village": 333, "red": 334, "longer": 335, "yard": 336, "block": 337, "playgrounds": 338, "horizontal": 339, "each": 340, "bypasses": 341, "closely": 342, "piece": 343, "greener": 344, "tower": 345, "that": 346, "divided": 347, "tracks": 348, "meadows": 349, "where": 350, "spaced": 351, "build": 352, "tank": 353, "reshaped": 354, "widens": 355, "abandoned": 356, "warehouses": 357, "demolished": 358, "malls": 359, "trails": 360, "moorland": 361, "connects": 362, "squares": 363, "rugged": 364, "t-junction": 365, "only": 366, "containers": 367, "full": 368, "less": 369, "make": 370, "dry": 371, "going": 372, "leveled": 373, "groups": 374, "were": 375, "reduced": 376, "broadened": 377, "uneven": 378, "image": 379, "transformed": 380, "respective": 381, "but": 382, "streets": 383, "yellow": 384, "joint": 385, "viaducts": 386, "flat": 387, "enlarged": 388, "orderly": 389, "foundation": 390, "thicker": 391, "smaller": 392, "trail": 393, "mounds": 394, "realized": 395, "which": 396, "increases": 397, "pasture": 398, "growing": 399, "was": 400, "structure": 401, "courts": 402, "similar": 403, "car": 404, "converted": 405, "surrounds": 406, "form": 407, 
"consisting": 408, "blue": 409, "complexes": 410, "opposite": 411, "shaped": 412, "besides": 413, "depots": 414, "remaining": 415, "brushwoods": 416, "divide": 417, "lane": 418, "intersect": 419, "bank": 420, "roof": 421, "fewer": 422, "u-shaped": 423, "empty": 424, "t": 425, "ponds": 426, "cottage": 427, "sundries": 428, "overpasses": 429, "additional": 430, "denser": 431, "fades": 432, "width": 433, "pulled": 434, "directions": 435, "platform": 436, "luxuriant": 437, "stretches": 438, "vibrant": 439, "rectangular": 440, "vitals": 441, "stadium": 442, "flattened": 443, "playing": 444, "spaces": 445, "stuff": 446, "hole": 447, "viaduct": 448, "raw": 449, "take": 450, "barelands": 451, "stretch": 452, "continues": 453, "tree": 454, "woodlands": 455, "face": 456, "comes": 457, "six": 458, "cluster": 459, "takes": 460, "uncompleted": 461, "moved": 462, "boxes": 463, "missing": 464, "curve": 465, "bridges": 466, "different": 467, "PAD": 0, "START": 1, "UNK": 2, "END": 3}
|
pytorch_model.bin
ADDED
|
@@ -0,0 +1,3 @@
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:1b55e57f2281a6ba81a6bfef363905e06ab6eaf3b0d11152ab70083cf9c7c731
|
| 3 |
+
size 1404020822
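
The three lines above are a Git LFS pointer, not the weights themselves: they record the SHA-256 digest and byte size (about 1.4 GB) of the real `pytorch_model.bin`. As a sketch, a downloaded file could be checked against the pointer as follows. `parse_lfs_pointer` and `matches_pointer` are hypothetical helper names, not part of any library.

```python
import hashlib

POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:1b55e57f2281a6ba81a6bfef363905e06ab6eaf3b0d11152ab70083cf9c7c731
size 1404020822
"""


def parse_lfs_pointer(text):
    """Parse the 'key value' lines of a Git LFS pointer file."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(path, pointer, chunk_size=1 << 20):
    """Stream the file at `path` and compare its size and hash to the pointer."""
    hasher = hashlib.new(pointer["algo"])
    size = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
            size += len(chunk)
    return size == pointer["size"] and hasher.hexdigest() == pointer["digest"]


pointer = parse_lfs_pointer(POINTER)
print(pointer["algo"], pointer["size"])
```

Calling `matches_pointer(weights_path, pointer)` on the path returned by `hf_hub_download` would then confirm that the download is complete and uncorrupted before `torch.load` is invoked.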
|