# AlignEarth-SAR-ViT-B-16
CLIP-style vision-language model adapted for Synthetic Aperture Radar (SAR) imagery via knowledge distillation from optical VLMs. Enables open-vocabulary semantic segmentation for SAR remote sensing without building SAR foundation models from scratch.
This repository provides the model in Hugging Face Transformers format, converted from the original OpenCLIP-style checkpoint released by the SegEarth-OV-2 authors.
## Model Details
- Architecture: CLIP (ViT-B/16 vision encoder + text encoder)
- Vision: 12-layer ViT, 768 hidden, 16×16 patches, 224×224 input
- Text: 12-layer transformer, 512 hidden, vocab 49408, max length 77
- Projection: 512-dim shared embedding space
- Source: likyoo/AlignEarth-SAR-ViT-B-16 (OpenCLIP format)
- Conversion: Mapped to `transformers.CLIPModel` for standard HF usage
- SimFeatUp: Full upsampler suite from SegEarth-OV/simfeatup_dev:
  - `jbu_one` → `simfeatup/xclip_jbu_one_million_aid.ckpt` (default, remote sensing)
  - `jbu_stack` → `simfeatup/clip_jbu_stack_cocostuff.ckpt`
  - `jbu_stack_maskclip` → `simfeatup/maskclip_jbu_stack_cocostuff.ckpt`
  - `bilinear`, `resize_conv`, `ifa` (no pretrained weights)
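As a quick sanity check on the numbers above, the ViT-B/16 token geometry follows directly from the patch and input sizes (a plain-Python sketch, no model download needed):

```python
# Derive the ViT-B/16 token geometry from the Model Details above.
image_size = 224
patch_size = 16

grid = image_size // patch_size       # patches per side
num_patches = grid * grid             # patch tokens
num_tokens = num_patches + 1          # plus the [CLS] token

print(grid, num_patches, num_tokens)  # 14 196 197
```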
## Usage
```python
from transformers import CLIPModel, CLIPProcessor
from PIL import Image

model = CLIPModel.from_pretrained("BiliSakura/AlignEarth-SAR-ViT-B-16")
processor = CLIPProcessor.from_pretrained("BiliSakura/AlignEarth-SAR-ViT-B-16")

image = Image.open("sar_image.tif")
texts = ["building", "road", "water body", "vegetation"]

inputs = processor(
    text=texts,
    images=image,
    return_tensors="pt",
    padding=True,
)
outputs = model(**inputs)

logits_per_image = outputs.logits_per_image
probs = logits_per_image.softmax(dim=1)
```
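`probs` holds one row per image with a probability per prompt. The softmax step is numerically equivalent to the plain-Python sketch below (the logit values are made-up stand-ins, not real model outputs):

```python
import math

# Hypothetical per-prompt similarity logits for one SAR image.
texts = ["building", "road", "water body", "vegetation"]
logits = [18.2, 21.7, 25.3, 19.9]

# Softmax over the text dimension, as logits_per_image.softmax(dim=1) does.
m = max(logits)
exps = [math.exp(x - m) for x in logits]
probs = [e / sum(exps) for e in exps]

# The prompt with the largest logit gets the largest probability.
best = texts[probs.index(max(probs))]
print(best)  # "water body"
```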
For dense features (e.g., segmentation with SegEarth-OV-2), use the vision encoder:
```python
from transformers import CLIPVisionModelWithProjection, CLIPImageProcessor

vision_model = CLIPVisionModelWithProjection.from_pretrained("BiliSakura/AlignEarth-SAR-ViT-B-16")
processor = CLIPImageProcessor.from_pretrained("BiliSakura/AlignEarth-SAR-ViT-B-16")

inputs = processor(images=image, return_tensors="pt")
outputs = vision_model(**inputs)
image_embeds = outputs.image_embeds  # pooled
# Or use vision_model.vision_model for patch-level features
```
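For patch-level work, the encoder's hidden states come out as a `[batch, 197, 768]` token sequence with the CLS token first. A minimal sketch of splitting off CLS and reshaping patches into their 14×14 grid, using a dummy NumPy array in place of `vision_model.vision_model(**inputs).last_hidden_state`:

```python
import numpy as np

# Dummy stand-in for last_hidden_state of a ViT-B/16 at 224x224:
# 1 CLS token + 14*14 patch tokens, 768-dim each.
hidden = np.zeros((1, 197, 768), dtype=np.float32)

cls_token = hidden[:, 0]                      # [1, 768] global feature
patches = hidden[:, 1:]                       # [1, 196, 768] patch features
patch_grid = patches.reshape(1, 14, 14, 768)  # spatial layout for dense tasks

print(cls_token.shape, patch_grid.shape)
```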
## Full Pipeline (SegEarth-OV-2 Style)
For open-vocabulary SAR segmentation with SimFeatUp dense upsampling:
```python
from pathlib import Path
from PIL import Image
from pipeline import SegEarthPipeline

pipe = SegEarthPipeline(Path("BiliSakura/AlignEarth-SAR-ViT-B-16"))
image = Image.open("your_sar_image.tif").convert("RGB")
seg_map = pipe(image)  # [H, W] class indices
```
The pipeline combines:
- AlignEarth CLIP encoder (SAR-adapted)
- SimFeatUp upsampler (choose `jbu_one`, `jbu_stack`, `jbu_stack_maskclip`, or `bilinear`)
- Global Bias Alleviation (`cls_token_lambda`) → subtracts global context from patch features
- Logit scaling and probability thresholding for robust predictions
- Sliding window for large images
- OpenEarthMap SAR class names (customize via `cls_openearthmap_sar.txt` or `configs/cls_*.txt`)
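Global Bias Alleviation boils down to adding a scaled CLS feature to every patch feature; with a negative `cls_token_lambda` (the default is -0.3) this subtracts global context. A rough NumPy sketch of that one step, using dummy features and not the pipeline's exact code:

```python
import numpy as np

# Dummy features standing in for real CLIP outputs.
rng = np.random.default_rng(0)
patch_feats = rng.standard_normal((196, 512)).astype(np.float32)
cls_feat = rng.standard_normal((512,)).astype(np.float32)

cls_token_lambda = -0.3  # negative => subtract global context

# Scaled CLS feature broadcast across all 196 patch features.
debiased = patch_feats + cls_token_lambda * cls_feat

print(debiased.shape)  # (196, 512)
```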
```python
# Use different FeatUp models
pipe = SegEarthPipeline(Path("."), featup_model="jbu_one")   # default
pipe = SegEarthPipeline(Path("."), featup_model="jbu_stack")
pipe = SegEarthPipeline(Path("."), featup_model="jbu_stack_maskclip")
pipe = SegEarthPipeline(Path("."), featup_model="bilinear")  # no weights

# Full SegEarth-OV-2 options
pipe = SegEarthPipeline(Path("."), cls_token_lambda=-0.3, logit_scale=50, prob_thd=0)
pipe = SegEarthPipeline(Path("."), slide_crop=224, slide_stride=112)  # sliding window for large images
pipe = SegEarthPipeline(Path("."), class_names_path="configs/cls_whu_sar.txt")  # different dataset
```
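With `slide_crop=224, slide_stride=112`, a large image is covered by overlapping 224-pixel tiles whose predictions are merged. The window placement alone can be sketched with a hypothetical helper (not the pipeline's actual function):

```python
def window_starts(length, crop=224, stride=112):
    """Start offsets of overlapping windows covering `length` pixels."""
    if length <= crop:
        return [0]
    starts = list(range(0, length - crop + 1, stride))
    if starts[-1] != length - crop:  # ensure the last window reaches the edge
        starts.append(length - crop)
    return starts

print(window_starts(448))  # [0, 112, 224]
print(window_starts(500))  # [0, 112, 224, 276]
```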
## Demo / Test
A paired demo sample from YESeg-OPT-SAR is in demo_YESeg-OPT-SAR/: sar.png, rgb.png, label.png. Note: This model targets SAR imagery, not optical.
```shell
python test_demo.py                      # uses demo_YESeg-OPT-SAR, cls_yeseg_sar, prob_thd=0.3
python test_demo.py --featup jbu_stack   # try the jbu_stack upsampler
python test_demo.py --save out.png       # save the figure
```
The script displays a matplotlib image grid: RGB | SAR | Label (GT) | Prediction.
## Evaluation
Standalone evaluation (no mmseg) on image/label pairs:
```shell
python eval.py --img-dir data/OpenEarthMap_SAR/test/sar_images \
    --label-dir data/OpenEarthMap_SAR/test/labels \
    --config configs/cls_openearthmap_sar.txt
```
SAR class configs in configs/: cls_openearthmap_sar.txt, cls_whu_sar.txt, cls_hrsid.txt, cls_pie_sar.txt, cls_fusar.txt, cls_yeseg_sar.txt, cls_ddhrnet_xian_sar.txt.
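The class configs are plain text files. Assuming one class name per non-empty line (the on-disk format may differ, e.g. comma-separated aliases), loading one reduces to:

```python
def load_class_names(text):
    """Parse a cls_*.txt-style listing: one class per non-empty line."""
    return [line.strip() for line in text.splitlines() if line.strip()]

# Hypothetical contents of a SAR class config.
sample = """building
road
water body
vegetation
"""
print(load_class_names(sample))  # ['building', 'road', 'water body', 'vegetation']
```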
Or from Python:
```python
from pathlib import Path
from PIL import Image
from pipeline import SegEarthPipeline

pipe = SegEarthPipeline(Path("."))
image = Image.open("demo/sar.png").convert("RGB")
seg = pipe(image)
```
## Citation
If you use this model, please cite the SegEarth-OV-2 paper:
```bibtex
@article{li2025segearthov2,
  title={Annotation-Free Open-Vocabulary Segmentation for Remote-Sensing Images},
  author={Li, Kaiyu and Cao, Xiangyong and Liu, Ruixun and Wang, Shihong and Jiang, Zixuan and Wang, Zhi and Meng, Deyu},
  journal={arXiv preprint arXiv:2508.18067},
  year={2025}
}
```
## License
MIT License (inherited from the original AlignEarth release).
## Dependencies
- `transformers`, `torch`, `torchvision`, `PIL`
- Optional: `featup` for CUDA-accelerated JBU (falls back to pure PyTorch)
- Optional: `mmcv` for `CarafeUpsampler`, `sapa` for `SAPAUpsampler`
## Related
- Original weights: likyoo/AlignEarth-SAR-ViT-B-16
- Code: SegEarth-OV-2
- Paper: arXiv:2508.18067