CLIP-SatIR

Model Description

CLIP-SatIR is a CLIP (ViT-B/32) model fine-tuned on the RSICD remote sensing dataset for cross-modal satellite image retrieval. It aligns satellite imagery and natural-language captions in a shared embedding space using contrastive learning.

The model supports:

  • Text-to-image retrieval
  • Image-to-image similarity search
  • Semantic satellite search

Base Model

  • openai/clip-vit-base-patch32

Dataset

  • RSICD (Remote Sensing Image Caption Dataset)

The dataset contains satellite imagery paired with descriptive captions covering:

  • airports
  • farmland
  • industrial zones
  • residential areas
  • rivers
  • urban layouts

Training Details

Objective

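Contrastive CLIP loss: L = CrossEntropy(sim(image_i, text_j)), computed over the image-text similarity matrix of each batch, with the matching pair (i = j) as the target class.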

The model learns to:

  • maximize similarity for matching pairs
  • minimize similarity for non-matching pairs
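The objective above can be sketched in dependency-free Python. This is a minimal illustration, not the training code used for this model: it assumes the symmetric form of the CLIP loss (image-to-text and text-to-image cross-entropy averaged) and a fixed temperature of 0.07, neither of which is stated in this card. The function names are hypothetical.

```python
import math

def cross_entropy_row(logits, target):
    """Cross-entropy of one row of logits against the index of the correct class."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]

def clip_contrastive_loss(sim, temperature=0.07):
    """Symmetric contrastive loss over an N x N image-text similarity matrix.

    sim[i][j] is the similarity between image i and caption j;
    matching pairs sit on the diagonal (i == j).
    The temperature value is an assumption made for this sketch.
    """
    n = len(sim)
    logits = [[s / temperature for s in row] for row in sim]
    # Image-to-text: each row should pick out its own caption.
    loss_i2t = sum(cross_entropy_row(logits[i], i) for i in range(n)) / n
    # Text-to-image: each column should pick out its own image.
    cols = [list(col) for col in zip(*logits)]
    loss_t2i = sum(cross_entropy_row(cols[j], j) for j in range(n)) / n
    return (loss_i2t + loss_t2i) / 2

# Toy similarity matrices: matching pairs on the diagonal vs. two pairs swapped.
aligned = [[1.0, 0.1, 0.0], [0.1, 1.0, 0.2], [0.0, 0.2, 1.0]]
shuffled = [[0.1, 1.0, 0.0], [1.0, 0.1, 0.2], [0.0, 0.2, 1.0]]
print(clip_contrastive_loss(aligned))
print(clip_contrastive_loss(shuffled))
```

The aligned matrix yields a much lower loss than the shuffled one, which is the behavior the two bullets describe: high similarity on matching pairs and low similarity elsewhere.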

Hyperparameters

| Parameter     | Value    |
|---------------|----------|
| Batch Size    | 32       |
| Learning Rate | 1e-5     |
| Optimizer     | AdamW    |
| Epochs        | 5        |
| Hardware      | CUDA GPU |
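To give a sense of scale for these settings, the sketch below collects them in a small config object and derives the number of optimizer steps. It assumes no gradient accumulation and that the last partial batch of each epoch is kept. The class and function names are hypothetical, and the training-set size of 10,000 pairs is a placeholder, not a figure from this card.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainConfig:
    # Values taken from the hyperparameter table above.
    batch_size: int = 32
    learning_rate: float = 1e-5
    optimizer: str = "AdamW"
    epochs: int = 5

def total_optimizer_steps(num_pairs, cfg):
    """Optimizer steps for the full run, assuming one step per batch."""
    steps_per_epoch = math.ceil(num_pairs / cfg.batch_size)
    return steps_per_epoch * cfg.epochs

cfg = TrainConfig()
# 10_000 is a placeholder for the number of image-caption training pairs.
print(total_optimizer_steps(10_000, cfg))
```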

Architecture

  • Vision Encoder → ViT-B/32
  • Text Encoder → Transformer
  • Shared 512-dimensional embedding space

Intended Use

This model is designed for:

  • Satellite image retrieval
  • Geospatial intelligence
  • Semantic satellite search
  • Urban planning analysis
  • Disaster monitoring

Limitations

  • Trained only on RSICD
  • Limited geographic diversity
  • Not suitable for critical surveillance decisions
  • Retrieval quality depends on caption semantics

Inference Example

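from transformers import CLIPModel, CLIPProcessor
import torch

# Load the fine-tuned model and its matching processor from the Hub.
model = CLIPModel.from_pretrained("rishii100/clip-rsicd-finetuned")
processor = CLIPProcessor.from_pretrained("rishii100/clip-rsicd-finetuned")

# One or more natural-language queries.
text = ["an airport with multiple airplanes and runway"]

# Tokenize the queries; padding lets queries of different lengths share a batch.
inputs = processor(text=text, return_tensors="pt", padding=True)

# Encode without tracking gradients, since this is inference only.
with torch.no_grad():
    text_features = model.get_text_features(**inputs)

# Inspect the shape of the resulting text embeddings.
print(text_features.shape)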
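
Once embeddings are available, retrieval reduces to ranking by cosine similarity in the shared space. The sketch below illustrates that step in dependency-free Python with toy 4-dimensional vectors standing in for the 512-dimensional embeddings. It assumes image embeddings were computed separately (for example with `model.get_image_features`) and that no vector is all zeros. The helper names are hypothetical.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def rank_by_cosine(query, gallery, top_k=3):
    """Return (index, score) pairs for the top_k gallery vectors most similar to query."""
    q = l2_normalize(query)
    scores = []
    for idx, vec in enumerate(gallery):
        g = l2_normalize(vec)
        scores.append((idx, sum(a * b for a, b in zip(q, g))))
    # Highest cosine similarity first.
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_k]

# Toy stand-ins for one text embedding and three image embeddings.
text_embedding = [0.9, 0.1, 0.0, 0.2]
image_embeddings = [
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.1],
    [0.5, 0.5, 0.5, 0.5],
]
for idx, score in rank_by_cosine(text_embedding, image_embeddings, top_k=2):
    print(idx, round(score, 3))
```

Passing an image embedding as the query instead of a text embedding gives image-to-image similarity search with the same function.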