AlignEarth-SAR-ViT-B-16

CLIP-style vision-language model adapted for Synthetic Aperture Radar (SAR) imagery via knowledge distillation from optical VLMs. Enables open-vocabulary semantic segmentation for SAR remote sensing without building SAR foundation models from scratch.

This repository provides the model in Hugging Face Transformers format, converted from the original OpenCLIP-style checkpoint released by the SegEarth-OV-2 authors.

Model Details

  • Architecture: CLIP (ViT-B/16 vision encoder + text encoder)
  • Vision: 12-layer ViT, 768 hidden, 16×16 patches, 224×224 input
  • Text: 12-layer transformer, 512 hidden, vocab 49408, max length 77
  • Projection: 512-dim shared embedding space
  • Source: likyoo/AlignEarth-SAR-ViT-B-16 (OpenCLIP format)
  • Conversion: Mapped to transformers.CLIPModel for standard HF usage
  • SimFeatUp: Full upsampler suite from SegEarth-OV/simfeatup_dev:
    • jbu_one → simfeatup/xclip_jbu_one_million_aid.ckpt (default, remote-sensing)
    • jbu_stack → simfeatup/clip_jbu_stack_cocostuff.ckpt
    • jbu_stack_maskclip → simfeatup/maskclip_jbu_stack_cocostuff.ckpt
    • bilinear, resize_conv, ifa (no pretrained weights)

Usage

from transformers import CLIPModel, CLIPProcessor
from PIL import Image

model = CLIPModel.from_pretrained("BiliSakura/AlignEarth-SAR-ViT-B-16")
processor = CLIPProcessor.from_pretrained("BiliSakura/AlignEarth-SAR-ViT-B-16")

image = Image.open("sar_image.tif").convert("RGB")  # single-band SAR TIFFs need RGB conversion
texts = ["building", "road", "water body", "vegetation"]

inputs = processor(
    text=texts,
    images=image,
    return_tensors="pt",
    padding=True,
)

outputs = model(**inputs)
logits_per_image = outputs.logits_per_image
probs = logits_per_image.softmax(dim=1)
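To read off a predicted label, take the argmax over the softmaxed scores. A minimal stand-alone sketch, using a hypothetical probability row in place of the actual model output:

```python
import numpy as np

# Hypothetical stand-in for logits_per_image.softmax(dim=1) over the
# four text prompts above: one row per image, one column per prompt.
texts = ["building", "road", "water body", "vegetation"]
probs = np.array([[0.05, 0.15, 0.70, 0.10]])

best_idx = int(probs.argmax(axis=1)[0])
print(texts[best_idx])  # "water body"
```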

For dense features (e.g., segmentation with SegEarth-OV-2), use the vision encoder:

from transformers import CLIPVisionModelWithProjection, CLIPImageProcessor

vision_model = CLIPVisionModelWithProjection.from_pretrained("BiliSakura/AlignEarth-SAR-ViT-B-16")
processor = CLIPImageProcessor.from_pretrained("BiliSakura/AlignEarth-SAR-ViT-B-16")

inputs = processor(images=image, return_tensors="pt")
outputs = vision_model(**inputs)
image_embeds = outputs.image_embeds  # pooled
# Or use vision_model.vision_model for patch-level features
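The patch-level features mentioned in the comment come back as a flat token sequence; dense pipelines typically drop the CLS token and reshape the rest into a spatial grid. A sketch of that reshaping with a random stand-in of the standard ViT-B/16 shape (1 CLS token + 196 patch tokens at 224×224 input, hidden width 768):

```python
import numpy as np

# Stand-in for vision_model.vision_model(**inputs).last_hidden_state:
# batch of 1, one CLS token plus 14*14 = 196 patch tokens, width 768.
hidden = np.random.rand(1, 197, 768)

cls_token = hidden[:, 0]   # [1, 768] global token
patches = hidden[:, 1:]    # [1, 196, 768] patch tokens
# [1, 768, 14, 14] spatial grid for dense/segmentation use
grid = patches.reshape(1, 14, 14, 768).transpose(0, 3, 1, 2)
print(grid.shape)  # (1, 768, 14, 14)
```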

Full Pipeline (SegEarth-OV-2 Style)

For open-vocabulary SAR segmentation with SimFeatUp dense upsampling:

from pathlib import Path
from PIL import Image
from pipeline import SegEarthPipeline

pipe = SegEarthPipeline(Path("BiliSakura/AlignEarth-SAR-ViT-B-16"))
image = Image.open("your_sar_image.tif").convert("RGB")
seg_map = pipe(image)  # [H, W] class indices

The pipeline combines:

  • AlignEarth CLIP encoder (SAR-adapted)
  • SimFeatUp upsampler (choose jbu_one, jbu_stack, jbu_stack_maskclip, or bilinear)
  • Global Bias Alleviation (cls_token_lambda) – subtracts global context from patch features
  • Logit scaling and prob threshold for robust predictions
  • Sliding window for large images
  • OpenEarthMap SAR class names (customize via cls_openearthmap_sar.txt or configs/cls_*.txt)
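The Global Bias Alleviation step amounts to adding `cls_token_lambda` times the CLS token to every patch feature; with a negative lambda this subtracts global context. A minimal sketch with random stand-in features (not the pipeline's internal code):

```python
import numpy as np

# Global Bias Alleviation sketch: with cls_token_lambda < 0 (the pipeline
# default is -0.3), the global CLS token is subtracted from each patch
# feature, making patch features more locally discriminative.
cls_token_lambda = -0.3
cls_token = np.full(768, 0.5)             # hypothetical global token
patch_feats = np.random.rand(196, 768)    # hypothetical patch features

debiased = patch_feats + cls_token_lambda * cls_token  # broadcasts over patches
print(debiased.shape)  # (196, 768)
```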
Upsampler and inference options can be set at construction time:

# Use different featup models
pipe = SegEarthPipeline(Path("."), featup_model="jbu_one")       # default
pipe = SegEarthPipeline(Path("."), featup_model="jbu_stack")
pipe = SegEarthPipeline(Path("."), featup_model="jbu_stack_maskclip")
pipe = SegEarthPipeline(Path("."), featup_model="bilinear")       # no weights

# Full SegEarth-OV-2 options
pipe = SegEarthPipeline(Path("."), cls_token_lambda=-0.3, logit_scale=50, prob_thd=0)
pipe = SegEarthPipeline(Path("."), slide_crop=224, slide_stride=112)  # sliding window for large images
pipe = SegEarthPipeline(Path("."), class_names_path="configs/cls_whu_sar.txt")  # different dataset
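The sliding-window option can be pictured by enumerating crop origins along one axis; a minimal sketch (origins only, assuming per-crop logits would be averaged in the overlap regions):

```python
# Sliding-window cropping for large images: fixed-size crops (e.g. 224)
# advanced by a stride (e.g. 112), with the last crop clamped to the edge.
def slide_origins(size, crop, stride):
    origins = []
    pos = 0
    while True:
        if pos + crop >= size:
            origins.append(max(size - crop, 0))  # clamp final crop to the edge
            break
        origins.append(pos)
        pos += stride
    return origins

print(slide_origins(512, 224, 112))  # [0, 112, 224, 288]
```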

Demo / Test

A paired demo sample from YESeg-OPT-SAR is in demo_YESeg-OPT-SAR/: sar.png, rgb.png, label.png. Note: This model targets SAR imagery, not optical.

python test_demo.py                        # uses demo_YESeg-OPT-SAR, cls_yeseg_sar, prob_thd=0.3
python test_demo.py --featup jbu_stack     # try the jbu_stack upsampler
python test_demo.py --save out.png         # save the figure

The script displays a matplotlib image grid: RGB | SAR | Label (GT) | Prediction.
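Rendering a [H, W] class-index map for such a grid typically goes through a palette lookup; a minimal sketch with a hypothetical 3-class palette:

```python
import numpy as np

# Hypothetical palette turning a [H, W] class-index map into an RGB image
# for display alongside the SAR input and ground-truth label.
palette = np.array([[0, 0, 0], [255, 0, 0], [0, 0, 255]], dtype=np.uint8)
seg_map = np.array([[0, 1], [2, 1]])  # tiny stand-in segmentation map

rgb = palette[seg_map]  # integer-array indexing gives [H, W, 3]
print(rgb.shape)  # (2, 2, 3)
```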

Evaluation

Standalone evaluation (no mmseg) on image/label pairs:

python eval.py --img-dir data/OpenEarthMap_SAR/test/sar_images \
               --label-dir data/OpenEarthMap_SAR/test/labels \
               --config configs/cls_openearthmap_sar.txt

SAR class configs in configs/: cls_openearthmap_sar.txt, cls_whu_sar.txt, cls_hrsid.txt, cls_pie_sar.txt, cls_fusar.txt, cls_yeseg_sar.txt, cls_ddhrnet_xian_sar.txt.
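An image/label-pair evaluation of this kind typically reports mean IoU; a self-contained sketch of the per-class computation (an illustration, not the repository's `eval.py`):

```python
import numpy as np

def miou(pred, label, num_classes):
    """Mean IoU over classes that appear in pred or label."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, label == c).sum()
        union = np.logical_or(pred == c, label == c).sum()
        if union > 0:  # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))

pred = np.array([[0, 0], [1, 1]])
label = np.array([[0, 1], [1, 1]])
print(miou(pred, label, num_classes=2))  # (1/2 + 2/3) / 2
```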

Or from Python:

from pathlib import Path
from pipeline import SegEarthPipeline
from PIL import Image

pipe = SegEarthPipeline(Path("."))
image = Image.open("demo/sar.png").convert("RGB")
seg = pipe(image)

Citation

If you use this model, please cite the SegEarth-OV-2 paper:

@article{li2025segearthov2,
  title={Annotation-Free Open-Vocabulary Segmentation for Remote-Sensing Images},
  author={Li, Kaiyu and Cao, Xiangyong and Liu, Ruixun and Wang, Shihong and Jiang, Zixuan and Wang, Zhi and Meng, Deyu},
  journal={arXiv preprint arXiv:2508.18067},
  year={2025}
}

License

MIT License (inherited from the original AlignEarth release).

Dependencies

  • transformers, torch, torchvision, PIL
  • Optional: featup for CUDA-accelerated JBU (falls back to pure PyTorch)
  • Optional: mmcv for CarafeUpsampler, sapa for SAPAUpsampler
