AlignEarth-SAR-ViT-B-16

CLIP-style vision-language model adapted for Synthetic Aperture Radar (SAR) imagery via knowledge distillation from optical VLMs. Enables open-vocabulary semantic segmentation for SAR remote sensing without building SAR foundation models from scratch.

This repository provides the model in Hugging Face Transformers format, converted from the original OpenCLIP-style checkpoint released by the SegEarth-OV-2 authors.

Model Details

  • Architecture: CLIP (ViT-B/16 vision encoder + text encoder)
  • Vision: 12-layer ViT, 768 hidden, 16×16 patches, 224×224 input
  • Text: 12-layer transformer, 512 hidden, vocab 49408, max length 77
  • Projection: 512-dim shared embedding space
  • Source: likyoo/AlignEarth-SAR-ViT-B-16 (OpenCLIP format)
  • Conversion: Mapped to transformers.CLIPModel for standard HF usage
  • SimFeatUp: Full upsampler suite from SegEarth-OV/simfeatup_dev:
    • jbu_one → simfeatup/xclip_jbu_one_million_aid.ckpt (default, remote-sensing)
    • jbu_stack → simfeatup/clip_jbu_stack_cocostuff.ckpt
    • jbu_stack_maskclip → simfeatup/maskclip_jbu_stack_cocostuff.ckpt
    • bilinear, resize_conv, ifa (no pretrained weights)

Usage

from transformers import CLIPModel, CLIPProcessor
from PIL import Image

model = CLIPModel.from_pretrained("BiliSakura/AlignEarth-SAR-ViT-B-16")
processor = CLIPProcessor.from_pretrained("BiliSakura/AlignEarth-SAR-ViT-B-16")

image = Image.open("sar_image.tif").convert("RGB")  # single-band SAR TIFFs need RGB conversion
texts = ["building", "road", "water body", "vegetation"]

inputs = processor(
    text=texts,
    images=image,
    return_tensors="pt",
    padding=True,
)

outputs = model(**inputs)
logits_per_image = outputs.logits_per_image
probs = logits_per_image.softmax(dim=1)
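To read off a predicted label, take the argmax over the softmaxed scores. A minimal stand-alone sketch, using a hypothetical probability row in place of the actual model output:

```python
import numpy as np

# Hypothetical stand-in for logits_per_image.softmax(dim=1) over the
# four text prompts above: one row per image, one column per prompt.
texts = ["building", "road", "water body", "vegetation"]
probs = np.array([[0.05, 0.15, 0.70, 0.10]])

best_idx = int(probs.argmax(axis=1)[0])
print(texts[best_idx])  # "water body"
```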

For dense features (e.g., segmentation with SegEarth-OV-2), use the vision encoder:

from transformers import CLIPVisionModelWithProjection, CLIPImageProcessor

vision_model = CLIPVisionModelWithProjection.from_pretrained("BiliSakura/AlignEarth-SAR-ViT-B-16")
processor = CLIPImageProcessor.from_pretrained("BiliSakura/AlignEarth-SAR-ViT-B-16")

inputs = processor(images=image, return_tensors="pt")
outputs = vision_model(**inputs)
image_embeds = outputs.image_embeds  # pooled
# Or use vision_model.vision_model for patch-level features
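The patch-level features mentioned in the comment come back as a flat token sequence; dense pipelines typically drop the CLS token and reshape the rest into a spatial grid. A sketch of that reshaping with a random stand-in of the standard ViT-B/16 shape (1 CLS token + 196 patch tokens at 224×224 input, hidden width 768):

```python
import numpy as np

# Stand-in for vision_model.vision_model(**inputs).last_hidden_state:
# batch of 1, one CLS token plus 14*14 = 196 patch tokens, width 768.
hidden = np.random.rand(1, 197, 768)

cls_token = hidden[:, 0]   # [1, 768] global token
patches = hidden[:, 1:]    # [1, 196, 768] patch tokens
# [1, 768, 14, 14] spatial grid for dense/segmentation use
grid = patches.reshape(1, 14, 14, 768).transpose(0, 3, 1, 2)
print(grid.shape)  # (1, 768, 14, 14)
```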

Full Pipeline (SegEarth-OV-2 Style)

For open-vocabulary SAR segmentation with SimFeatUp dense upsampling:

from pathlib import Path
from PIL import Image
from pipeline import SegEarthPipeline

pipe = SegEarthPipeline(Path("BiliSakura/AlignEarth-SAR-ViT-B-16"))
image = Image.open("your_sar_image.tif").convert("RGB")
seg_map = pipe(image)  # [H, W] class indices

The pipeline combines:

  • AlignEarth CLIP encoder (SAR-adapted)
  • SimFeatUp upsampler (choose jbu_one, jbu_stack, jbu_stack_maskclip, or bilinear)
  • Global Bias Alleviation (cls_token_lambda) – subtracts global context from patch features
  • Logit scaling and prob threshold for robust predictions
  • Sliding window for large images
  • OpenEarthMap SAR class names (customize via cls_openearthmap_sar.txt or configs/cls_*.txt)
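The Global Bias Alleviation step amounts to adding `cls_token_lambda` times the CLS token to every patch feature; with a negative lambda this subtracts global context. A minimal sketch with random stand-in features (not the pipeline's internal code):

```python
import numpy as np

# Global Bias Alleviation sketch: with cls_token_lambda < 0 (the pipeline
# default is -0.3), the global CLS token is subtracted from each patch
# feature, making patch features more locally discriminative.
cls_token_lambda = -0.3
cls_token = np.full(768, 0.5)             # hypothetical global token
patch_feats = np.random.rand(196, 768)    # hypothetical patch features

debiased = patch_feats + cls_token_lambda * cls_token  # broadcasts over patches
print(debiased.shape)  # (196, 768)
```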
Upsampler and inference options can be set at construction time:

# Use different featup models
pipe = SegEarthPipeline(Path("."), featup_model="jbu_one")       # default
pipe = SegEarthPipeline(Path("."), featup_model="jbu_stack")
pipe = SegEarthPipeline(Path("."), featup_model="jbu_stack_maskclip")
pipe = SegEarthPipeline(Path("."), featup_model="bilinear")       # no weights

# Full SegEarth-OV-2 options
pipe = SegEarthPipeline(Path("."), cls_token_lambda=-0.3, logit_scale=50, prob_thd=0)
pipe = SegEarthPipeline(Path("."), slide_crop=224, slide_stride=112)  # sliding window for large images
pipe = SegEarthPipeline(Path("."), class_names_path="configs/cls_whu_sar.txt")  # different dataset
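The sliding-window option can be pictured by enumerating crop origins along one axis; a minimal sketch (origins only, assuming per-crop logits would be averaged in the overlap regions):

```python
# Sliding-window cropping for large images: fixed-size crops (e.g. 224)
# advanced by a stride (e.g. 112), with the last crop clamped to the edge.
def slide_origins(size, crop, stride):
    origins = []
    pos = 0
    while True:
        if pos + crop >= size:
            origins.append(max(size - crop, 0))  # clamp final crop to the edge
            break
        origins.append(pos)
        pos += stride
    return origins

print(slide_origins(512, 224, 112))  # [0, 112, 224, 288]
```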

Demo / Test

A paired demo sample from YESeg-OPT-SAR is in demo_YESeg-OPT-SAR/: sar.png, rgb.png, label.png. Note: This model targets SAR imagery, not optical.

python test_demo.py                        # uses demo_YESeg-OPT-SAR, cls_yeseg_sar, prob_thd=0.3
python test_demo.py --featup jbu_stack     # try the jbu_stack upsampler
python test_demo.py --save out.png         # save the figure

The script displays a matplotlib image grid: RGB | SAR | Label (GT) | Prediction.
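Rendering a [H, W] class-index map for such a grid typically goes through a palette lookup; a minimal sketch with a hypothetical 3-class palette:

```python
import numpy as np

# Hypothetical palette turning a [H, W] class-index map into an RGB image
# for display alongside the SAR input and ground-truth label.
palette = np.array([[0, 0, 0], [255, 0, 0], [0, 0, 255]], dtype=np.uint8)
seg_map = np.array([[0, 1], [2, 1]])  # tiny stand-in segmentation map

rgb = palette[seg_map]  # integer-array indexing gives [H, W, 3]
print(rgb.shape)  # (2, 2, 3)
```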

Evaluation

Standalone evaluation (no mmseg) on image/label pairs:

python eval.py --img-dir data/OpenEarthMap_SAR/test/sar_images \
               --label-dir data/OpenEarthMap_SAR/test/labels \
               --config configs/cls_openearthmap_sar.txt

SAR class configs in configs/: cls_openearthmap_sar.txt, cls_whu_sar.txt, cls_hrsid.txt, cls_pie_sar.txt, cls_fusar.txt, cls_yeseg_sar.txt, cls_ddhrnet_xian_sar.txt.
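An image/label-pair evaluation of this kind typically reports mean IoU; a self-contained sketch of the per-class computation (an illustration, not the repository's `eval.py`):

```python
import numpy as np

def miou(pred, label, num_classes):
    """Mean IoU over classes that appear in pred or label."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, label == c).sum()
        union = np.logical_or(pred == c, label == c).sum()
        if union > 0:  # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))

pred = np.array([[0, 0], [1, 1]])
label = np.array([[0, 1], [1, 1]])
print(miou(pred, label, num_classes=2))  # (1/2 + 2/3) / 2
```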

Or from Python:

from pathlib import Path
from pipeline import SegEarthPipeline
from PIL import Image

pipe = SegEarthPipeline(Path("."))
image = Image.open("demo/sar.png").convert("RGB")
seg = pipe(image)

Citation

If you use this model, please cite the SegEarth-OV-2 paper:

@article{li2025segearthov2,
  title={Annotation-Free Open-Vocabulary Segmentation for Remote-Sensing Images},
  author={Li, Kaiyu and Cao, Xiangyong and Liu, Ruixun and Wang, Shihong and Jiang, Zixuan and Wang, Zhi and Meng, Deyu},
  journal={arXiv preprint arXiv:2508.18067},
  year={2025}
}

License

MIT License (inherited from the original AlignEarth release).

Dependencies

  • transformers, torch, torchvision, PIL
  • Optional: featup for CUDA-accelerated JBU (falls back to pure PyTorch)
  • Optional: mmcv for CarafeUpsampler, sapa for SAPAUpsampler
