---
language:
- en
license: mit
library_name: pytorch
tags:
- remote-sensing
- change-detection
- image-captioning
- multimodal
- retrieval
pipeline_tag: image-to-text
datasets:
- lcybuaa/LEVIR-CC
---

# RSICRC: Multimodal Remote Sensing Image Change Retrieval and Captioning

**RSICRC** is a multimodal foundation model designed for **bi-temporal remote sensing images**. It jointly performs **change captioning** (describing the changes between two images) and **text-image retrieval** (finding image pairs that match a text description). The model uses contrastive learning and a decoupled decoder architecture to handle both tasks simultaneously.

## 📄 Paper

**Towards a Multimodal Framework for Remote Sensing Image Change Retrieval and Captioning**
*Roger Ferrod, Luigi Di Caro, Dino Ienco*
Published at **Discovery Science 2024**

[**Read the Paper**](https://doi.org/10.1007/978-3-031-78980-9_15) | [**GitHub Repository**](https://github.com/rogerferrod/RSICRC)

## 🏗️ Model Architecture

The framework is inspired by **CoCa** but adapted for bi-temporal remote sensing data.

* **Encoder:** A Siamese network (ResNet-50 or ViT via OpenCLIP) that encodes the "before" and "after" images. A Hierarchical Self-Attention (HSA) block and a residual block with a cosine mask fuse the bi-temporal features.
* **Decoder:** A decoupled Transformer decoder split into:
    * **Unimodal Layers:** Encode text only (used for contrastive alignment).
    * **Multimodal Layers:** Apply cross-attention between visual and textual features to generate captions.

## 💻 Usage

Using this model requires the project's custom source code (see the GitHub repository linked above); the snippet below imports `ICCModel` from `src.model`.

### Inference Code

```python
import torch
import json
import open_clip
from huggingface_hub import hf_hub_download
from src.model import ICCModel

# 1. Download necessary files from Hugging Face
repo_id = "rogerferrod/RSICRC"
config_path = hf_hub_download(repo_id=repo_id, filename="config.json")
vocab_path = hf_hub_download(repo_id=repo_id, filename="levir_vocab.json")
weights_path = hf_hub_download(repo_id=repo_id, filename="pytorch_model.bin")

# 2. Load Configuration and Vocabulary
with open(config_path, 'r') as f:
    config = json.load(f)

with open(vocab_path, 'r') as f:
    vocab = json.load(f)

# 3. Setup Device and Backbone (OpenCLIP)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
clip_model, _, preprocess = open_clip.create_model_and_transforms(config['backbone'])

# 4. Initialize the Model
model = ICCModel(
    device=device,
    clip=clip_model,
    backbone=config['backbone'],
    d_model=config['d_model'],
    vocab_size=len(vocab),
    max_len=config['max_len'],
    num_heads=config['num_heads'],
    h_dim=config['h_dim'],
    a_dim=config['a_dim'],
    encoder_layers=config['encoder_layers'],
    decoder_layers=config['decoder_layers'],
    dropout=config['dropout'],
    learnable=config['learnable'],
    fine_tune=config['fine_tune'],
    tie_embeddings=config['tie_embeddings'],
    prenorm=config['prenorm']
)

# 5. Load Weights
model.load_state_dict(torch.load(weights_path, map_location=device))
model = model.to(device)
model.eval()

print("Model loaded successfully!")
```

## 📚 Citation

If you use this model or code in your research, please cite our paper:

```bibtex
@InProceedings{10.1007/978-3-031-78980-9_15,
  author    = {Roger Ferrod and Luigi Di Caro and Dino Ienco},
  editor    = {Dino Pedreschi and Anna Monreale and Riccardo Guidotti and Roberto Pellungrini and Francesca Naretto},
  title     = {Towards a Multimodal Framework for Remote Sensing Image Change Retrieval and Captioning},
  booktitle = {Discovery Science - 27th International Conference, {DS} 2024, Pisa, Italy, October 14-16, 2024, Proceedings, Part {II}},
  series    = {Lecture Notes in Computer Science},
  volume    = {15244},
  pages     = {231--245},
  publisher = {Springer},
  year      = {2024},
  url       = {https://doi.org/10.1007/978-3-031-78980-9\_15},
  doi       = {10.1007/978-3-031-78980-9\_15}
}
```
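
## 🧪 Configuration Check (Sketch)

The inference code in the Usage section reads thirteen keys from `config.json` when constructing `ICCModel`. A missing key would otherwise surface as a bare `KeyError` in the middle of the constructor call, so it can be convenient to check the file up front. The following is a minimal sketch, not part of the released code: the helper names `missing_config_keys` and `load_checked_config` are hypothetical, and the key list is simply copied from the constructor call shown in this card.

```python
import json

# Keys that the inference snippet in this card reads from config.json.
# (Copied from the ICCModel(...) call above; vocab_size comes from the vocabulary.)
REQUIRED_CONFIG_KEYS = (
    "backbone", "d_model", "max_len", "num_heads", "h_dim", "a_dim",
    "encoder_layers", "decoder_layers", "dropout", "learnable",
    "fine_tune", "tie_embeddings", "prenorm",
)


def missing_config_keys(config: dict) -> list:
    """Return the required keys absent from `config`, in declaration order."""
    return [key for key in REQUIRED_CONFIG_KEYS if key not in config]


def load_checked_config(path: str) -> dict:
    """Load a JSON config file and raise KeyError if a required key is missing."""
    with open(path, "r", encoding="utf-8") as f:
        config = json.load(f)
    missing = missing_config_keys(config)
    if missing:
        raise KeyError(f"config is missing keys: {', '.join(missing)}")
    return config
```

In the inference snippet, `load_checked_config(config_path)` could stand in for the plain `json.load` call in step 2, so that an incomplete configuration is reported with all missing keys at once.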