---
language: en
license: apache-2.0
datasets:
- zenodo
- huggingface
tags:
- earth-observation
- remote-sensing
- segmentation
- multi-modal
- optical
- sentinel-1
- sentinel-2
- computer-vision
pipeline_tag: image-segmentation
inference: true
---

# SkySense++

SkySense++ is a semantic-enhanced multi-modal remote sensing **foundation model** for Earth observation. It fuses high-resolution optical imagery (HR), Sentinel-2 (S2), and Sentinel-1 SAR (S1) through independent backbones, an optional modality-completion VAE, and a shared transformer fusion encoder.

**Primary use: representation extraction.** The pretrained backbones produce rich feature representations for downstream tasks (classification, segmentation, regression). Extract `features_hr`, `features_s2`, `features_s1`, or `features_fusion` and feed them to your task-specific head. Fine-tuning on your target dataset is required.

See the main [SkySensePlusPlus](https://github.com/zqcrafts/SkySense-O) repository for pretraining, 1-shot, and finetuning workflows.
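As a sketch of the representation-extraction workflow described above: pool an extracted feature map and train a small task head on top. The feature dimension (1024) and class count (65) come from the tables below; the random tensor and the linear probe are illustrative stand-ins, not part of the pretrained checkpoint.

```python
import torch
import torch.nn as nn

# Stand-in for out["features_fusion"] from the model: (B, C, H, W).
# Replace with real extracted features; shapes assumed from this card.
features_fusion = torch.randn(2, 1024, 16, 16)

pooled = features_fusion.mean(dim=(2, 3))  # global average pool -> (B, 1024)
probe = nn.Linear(1024, 65)                # task-specific head, trained by you
logits = probe(pooled)
print(logits.shape)  # torch.Size([2, 65])
```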
## Model Metadata

| Attribute | Value |
|-----------|-------|
| **Model type** | Multi-modal segmentation (HR + S2 + S1) |
| **Paper** | [SkySense++: A Semantic-Enhanced Multi-Modal Remote Sensing Foundation Model Beyond SkySense for Earth Observation](https://www.nature.com/articles/s42256-025-01078-8) |
| **Publication** | Nature Machine Intelligence, 2025 |
| **License** | Apache-2.0 |
| **Input modalities** | High-resolution optical, Sentinel-2, Sentinel-1 |
| **Output** | Semantic segmentation (65 classes) |
| **Checkpoint contents** | Backbone weights only; segmentation head not pretrained |
| **HR input size** | 512×512 |
| **S2/S1 patch size** | 16×16 |

## Model Variants

| Variant | Path | Sources | Use Modal VAE | Description |
|---------|------|---------|---------------|-------------|
| **full** (default) | `.` | hr, s2, s1 | Yes | All three modalities with VAE completion |
| **hr** | `hr/` | hr | No | High-resolution optical only |
| **s2** | `s2/` | s2 | No | Sentinel-2 only |
| **s1** | `s1/` | s1 | No | Sentinel-1 only |

### Repository structure (full variant, diffusers layout)

```
.
├── config.json
├── model.safetensors
├── modality_vae/                 # VAE subfolder (diffusers standard)
│   ├── config.json
│   └── diffusion_pytorch_model.safetensors
├── modeling_skysensepp.py
├── configuration_skysensepp.py
├── pipeline_skysensepp.py
├── sky_sensepp_impl/             # ModalityCompletionVAE, ModalityCompletionVAEPipeline in necks/
└── hr/, s2/, s1/                 # Single-modality variants
```

The VAE loads automatically from the `modality_vae/` subfolder. A legacy `modality_vae.safetensors` at the repository root is also supported. Migrate with:

```bash
python tools/split_vae_from_checkpoint.py --model-dir path/to/model --migrate
```

## Installation

```bash
pip install transformers torch safetensors diffusers
```

The modality VAE uses the diffusers `VQModel`. Legacy checkpoints (ConvVQVAEv2) load via a backward-compatible fallback.
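Before migrating, it can help to check which layout a local checkpoint directory uses. A minimal sketch with only the standard library, keyed off the two layouts documented above (the `vae_layout` helper name is our own, not part of the package):

```python
from pathlib import Path


def vae_layout(model_dir: str) -> str:
    """Classify how the modality VAE is stored in a local checkpoint dir."""
    root = Path(model_dir)
    sub = root / "modality_vae"
    # Diffusers-style subfolder: config.json + diffusion_pytorch_model.safetensors
    if (sub / "config.json").is_file() and (
        sub / "diffusion_pytorch_model.safetensors"
    ).is_file():
        return "diffusers"
    # Legacy single file at the repository root
    if (root / "modality_vae.safetensors").is_file():
        return "legacy"
    return "missing"
```

A result of `"legacy"` means the migration command above applies.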
## Usage

### Diffusers-style loading and inference

The VAE follows the [diffusers](https://huggingface.co/docs/diffusers) layout: its weights sit in a `modality_vae/` subfolder with `config.json` and `diffusion_pytorch_model.safetensors`. Load and run inference like this:

```python
import torch
from transformers import AutoModel

# Load full model (VAE auto-loads from modality_vae/ subfolder, diffusers-style)
model = AutoModel.from_pretrained("path/to/SkySensepp", trust_remote_code=True)
model = model.eval().to("cuda")

# Prepare inputs
hr_img = torch.randn(1, 3, 512, 512, device="cuda")
s2_img = torch.randn(1, 10, 2, 256, 256, device="cuda")  # B, 10 bands, S steps, H, W
s1_img = torch.randn(1, 2, 2, 256, 256, device="cuda")   # B, 2 bands, S steps, H, W
modalities = torch.ones(1, 3, dtype=torch.bool, device="cuda")  # [hr, s2, s1] present

# Inference
with torch.no_grad():
    out = model(
        hr_img=hr_img,
        s2_img=s2_img,
        s1_img=s1_img,
        modality_flag_hr=modalities[:, :1],
        modality_flag_s2=modalities[:, 1:2],
        modality_flag_s1=modalities[:, 2:],
        return_features=True,
    )
features_fusion = out["features_fusion"]
logits_hr = out.get("logits_hr")
```

### Load VAE component only (diffusers-style)

```python
import torch

from sky_sensepp_impl.necks import ModalityCompletionVAE

# Load VAE from subfolder (same pattern as the diffusers Stable Diffusion VAE)
vae = ModalityCompletionVAE.from_pretrained(
    "path/to/SkySensepp",
    subfolder="modality_vae",
)
vae = vae.eval().to("cuda")

# Run modality completion on backbone features (e.g. 2816-d, 16×16)
feat_hr = torch.randn(1, 2816, 16, 16, device="cuda")
feat_s2 = torch.randn(1, 2816, 16, 16, device="cuda")
feat_s1 = torch.randn(1, 2816, 16, 16, device="cuda")
modality_info = torch.ones(1, 3, dtype=torch.bool, device="cuda")

with torch.no_grad():
    out = vae(feat_hr, feat_s2, feat_s1, modality_info)
hr_out = out["hr_out"]
s2_out = out["s2_out"]
s1_out = out["s1_out"]
```

### ModalityCompletionVAEPipeline (modular, diffusers-style)

```python
from sky_sensepp_impl.necks import ModalityCompletionVAE, ModalityCompletionVAEPipeline

# Load pipeline (VAE from modality_vae/ subfolder)
pipe = ModalityCompletionVAEPipeline.from_pretrained(
    "path/to/SkySensepp",
    subfolder="modality_vae",
)
pipe = pipe.to("cuda")

# Inference on features
out = pipe(
    feat_hr=feat_hr,
    feat_s2=feat_s2,
    feat_s1=feat_s1,
    modality_info=modality_info,
)
hr_out, s2_out, s1_out = out["hr_out"], out["s2_out"], out["s1_out"]

# Modular: inject a custom VAE
custom_vae = ModalityCompletionVAE.from_pretrained("path/to/custom_vae")
pipe = ModalityCompletionVAEPipeline.from_pretrained("path/to/SkySensepp", vae=custom_vae)

# Or swap components after load
pipe.register_components(vae=custom_vae)
```

### Load model and attach VAE manually

```python
from transformers import AutoModel

model = AutoModel.from_pretrained("path/to/SkySensepp", trust_remote_code=True)
model.load_vae(
    pretrained_model_name_or_path="path/to/SkySensepp",
    subfolder="modality_vae",
)
```

### Variants (single-modality, no VAE)

```python
model = AutoModel.from_pretrained("path/to/SkySensepp/hr", trust_remote_code=True)
model = AutoModel.from_pretrained("path/to/SkySensepp/s2", trust_remote_code=True)
model = AutoModel.from_pretrained("path/to/SkySensepp/s1", trust_remote_code=True)
```

### Representation shapes (HR-only)

| Output | Shape | Description |
|--------|-------|-------------|
| `features_hr[i]` | multi-scale | Backbone features at 4 scales (stages 0–3) |
| `features_fusion` | `(B, 1024, H, W)` | Fused spatial representation for downstream head |
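Since the checkpoint ships backbone weights only, any segmentation head must be trained by the user. A minimal sketch of such a head, assuming the `features_fusion` shape from the table above and the 65-class output documented on this card (the 1×1 convolution and bilinear upsampling are our illustrative choices, not the pretrained head):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal user-trained head: fused features (B, 1024, H, W) -> 65 class logits,
# upsampled to the 512x512 HR input resolution.
head = nn.Conv2d(1024, 65, kernel_size=1)

features_fusion = torch.randn(1, 1024, 16, 16)  # stand-in for out["features_fusion"]
logits = head(features_fusion)                  # (1, 65, 16, 16)
seg = F.interpolate(logits, size=(512, 512), mode="bilinear", align_corners=False)
print(seg.shape)  # torch.Size([1, 65, 512, 512])
```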
## Input Formats

| Modality | Shape | Description |
|----------|-------|-------------|
| **hr_img** | `(B, 3, H, W)` | RGB high-res, H = W = 512 typical |
| **s2_img** | `(B, 10, S, H, W)` | Sentinel-2, 10 bands, S time steps |
| **s1_img** | `(B, 2, S, H, W)` | Sentinel-1 VV/VH, S time steps |

## Citation

```bibtex
@article{skysensepp2025,
  title={SkySense++: A Semantic-Enhanced Multi-Modal Remote Sensing Foundation Model Beyond SkySense for Earth Observation},
  journal={Nature Machine Intelligence},
  year={2025},
  url={https://www.nature.com/articles/s42256-025-01078-8}
}
```

## References

- [Project Page](https://zqcrafts.github.io/SkySense-O/project.html)
- [Zenodo Datasets](https://zenodo.org/records/15010418)
- [Hugging Face SkySense](https://huggingface.co/KKKKKKang/JL-16)
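For reference, the input shapes documented on this card can be assembled from per-acquisition frames like this (a sketch with random data; the number of time steps S = 4 and any band ordering or normalization are up to your preprocessing pipeline):

```python
import torch

# Four acquisitions per sensor; each frame is (bands, H, W).
s2_frames = [torch.randn(10, 256, 256) for _ in range(4)]  # Sentinel-2, 10 bands
s1_frames = [torch.randn(2, 256, 256) for _ in range(4)]   # Sentinel-1 VV/VH

# Stack along a new time axis after the band axis, then add the batch axis.
s2_img = torch.stack(s2_frames, dim=1).unsqueeze(0)  # (1, 10, 4, 256, 256)
s1_img = torch.stack(s1_frames, dim=1).unsqueeze(0)  # (1, 2, 4, 256, 256)
hr_img = torch.randn(1, 3, 512, 512)                 # (B, 3, H, W)
```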