---
language:
- en
license: mit
library_name: pytorch
tags:
- remote-sensing
- change-detection
- image-captioning
- multimodal
- retrieval
pipeline_tag: image-to-text
datasets:
- lcybuaa/LEVIR-CC
---

# RSICRC: Multimodal Remote Sensing Image Change Retrieval and Captioning

**RSICRC** is a multimodal foundation model designed for **bi-temporal remote sensing images**. It jointly performs **change captioning** (describing changes between two images) and **text-image retrieval** (finding image pairs that match a text description).

The model leverages contrastive learning and a decoupled decoder architecture to handle both tasks simultaneously.
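
The contrastive side of this objective can be illustrated with a generic CLIP-style symmetric InfoNCE loss between fused image-pair embeddings and caption embeddings. This is a minimal sketch of the general technique, not the repository's exact implementation; the function name and temperature value are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(pair_emb: torch.Tensor, text_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE: matched (image-pair, caption) rows are positives."""
    pair_emb = F.normalize(pair_emb, dim=-1)   # unit-norm embeddings
    text_emb = F.normalize(text_emb, dim=-1)
    logits = pair_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0))     # diagonal entries are positives
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

# Toy check: scalar loss on random embeddings for a batch of 4 pairs
loss = contrastive_loss(torch.randn(4, 512), torch.randn(4, 512))
print(loss.item())
```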

## Paper

**Towards a Multimodal Framework for Remote Sensing Image Change Retrieval and Captioning**
*Roger Ferrod, Luigi Di Caro, Dino Ienco*
Published at **Discovery Science 2024**

[**Read the Paper**](https://doi.org/10.1007/978-3-031-78980-9_15) | [**GitHub Repository**](https://github.com/rogerferrod/RSICRC)

## Model Architecture

The framework is inspired by **CoCa** but adapted for bi-temporal remote sensing data.

* **Encoder:** A Siamese network (ResNet-50 or ViT via OpenCLIP) encodes the "before" and "after" images. A Hierarchical Self-Attention (HSA) block and a residual block with a cosine mask fuse the bi-temporal features.
* **Decoder:** A decoupled Transformer decoder split into:
  * **Unimodal layers:** encode text only (used for contrastive alignment).
  * **Multimodal layers:** apply cross-attention between visual and textual features to generate captions.
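
The cosine-mask fusion idea can be sketched as follows. The class name, the projection layer, and the exact mask formulation here are illustrative assumptions, simplified from the paper's actual encoder block; it only demonstrates the general pattern of weighting a residual update by bi-temporal dissimilarity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CosineMaskFusion(nn.Module):
    """Illustrative bi-temporal fusion: a residual update weighted by a
    cosine-similarity mask (low similarity = likely change). A simplified
    stand-in, not the paper's exact encoder block."""

    def __init__(self, d_model: int):
        super().__init__()
        self.proj = nn.Linear(2 * d_model, d_model)

    def forward(self, before: torch.Tensor, after: torch.Tensor) -> torch.Tensor:
        # Per-token cosine similarity between the two time steps
        cos = F.cosine_similarity(before, after, dim=-1)
        change_mask = (1.0 - cos).unsqueeze(-1)        # emphasise changed tokens
        fused = self.proj(torch.cat([before, after], dim=-1))
        return fused + change_mask * (after - before)  # masked residual

# Shape check on random token features: (batch=2, tokens=49, dim=512)
fusion = CosineMaskFusion(512)
print(fusion(torch.randn(2, 49, 512), torch.randn(2, 49, 512)).shape)
```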

## Usage

To use this model, clone the [GitHub repository](https://github.com/rogerferrod/RSICRC) and load the checkpoint with the custom source code (the `src` package).

### Inference Code

```python
import torch
import json
import open_clip
from huggingface_hub import hf_hub_download

from src.model import ICCModel

# 1. Download the necessary files from Hugging Face
repo_id = "rogerferrod/RSICRC"
config_path = hf_hub_download(repo_id=repo_id, filename="config.json")
vocab_path = hf_hub_download(repo_id=repo_id, filename="levir_vocab.json")
weights_path = hf_hub_download(repo_id=repo_id, filename="pytorch_model.bin")

# 2. Load the configuration and vocabulary
with open(config_path, 'r') as f:
    config = json.load(f)

with open(vocab_path, 'r') as f:
    vocab = json.load(f)

# 3. Set up the device and the OpenCLIP backbone
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
clip_model, _, preprocess = open_clip.create_model_and_transforms(config['backbone'])

# 4. Initialize the model
model = ICCModel(
    device=device,
    clip=clip_model,
    backbone=config['backbone'],
    d_model=config['d_model'],
    vocab_size=len(vocab),
    max_len=config['max_len'],
    num_heads=config['num_heads'],
    h_dim=config['h_dim'],
    a_dim=config['a_dim'],
    encoder_layers=config['encoder_layers'],
    decoder_layers=config['decoder_layers'],
    dropout=config['dropout'],
    learnable=config['learnable'],
    fine_tune=config['fine_tune'],
    tie_embeddings=config['tie_embeddings'],
    prenorm=config['prenorm']
)

# 5. Load the weights
model.load_state_dict(torch.load(weights_path, map_location=device))
model = model.to(device)
model.eval()

print("Model loaded successfully!")
```
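
Once the model produces embeddings for image pairs and for a text query (the exact forward API is defined by `ICCModel` in the repository), text-to-image retrieval reduces to cosine-similarity ranking. The sketch below uses hypothetical precomputed embeddings to show only that ranking step.

```python
import torch
import torch.nn.functional as F

def rank_image_pairs(query_emb: torch.Tensor, pair_embs: torch.Tensor) -> torch.Tensor:
    """Return indices of image-pair embeddings sorted by cosine similarity
    to a single text-query embedding (most similar first)."""
    sims = F.normalize(pair_embs, dim=-1) @ F.normalize(query_emb, dim=-1)
    return torch.argsort(sims, descending=True)

# Toy example: pair #2 is identical to the query, so it should rank first
query = torch.randn(512)
pairs = torch.randn(5, 512)
pairs[2] = query
print(rank_image_pairs(query, pairs)[0].item())  # 2
```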

## Citation

If you use this model or code in your research, please cite our paper:

```bibtex
@InProceedings{10.1007/978-3-031-78980-9_15,
  author    = {Roger Ferrod and
               Luigi Di Caro and
               Dino Ienco},
  editor    = {Dino Pedreschi and
               Anna Monreale and
               Riccardo Guidotti and
               Roberto Pellungrini and
               Francesca Naretto},
  title     = {Towards a Multimodal Framework for Remote Sensing Image Change Retrieval
               and Captioning},
  booktitle = {Discovery Science - 27th International Conference, {DS} 2024, Pisa,
               Italy, October 14-16, 2024, Proceedings, Part {II}},
  series    = {Lecture Notes in Computer Science},
  volume    = {15244},
  pages     = {231--245},
  publisher = {Springer},
  year      = {2024},
  url       = {https://doi.org/10.1007/978-3-031-78980-9\_15},
  doi       = {10.1007/978-3-031-78980-9\_15}
}
```