---
language: en
license: apache-2.0
datasets:
- zenodo
- huggingface
tags:
- earth-observation
- remote-sensing
- segmentation
- multi-modal
- optical
- sentinel-1
- sentinel-2
- computer-vision
pipeline_tag: image-segmentation
inference: true
---

# SkySense++

SkySense++ is a semantic-enhanced multi-modal remote sensing **foundation model** for Earth observation. It fuses high-resolution optical imagery (HR), Sentinel-2 (S2), and Sentinel-1 SAR (S1) through independent backbones, an optional modality-completion VAE, and a shared transformer fusion encoder.

**Primary use: representation extraction.** The pretrained backbones produce rich feature representations for downstream tasks (classification, segmentation, regression). Extract `features_hr`, `features_s2`, `features_s1`, or `features_fusion` and feed them to your task-specific head. Fine-tuning on your target dataset is required. See the main [SkySensePlusPlus](https://github.com/zqcrafts/SkySense-O) repository for pretraining, 1-shot, and fine-tuning workflows.
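
To make the representation-extraction workflow concrete, here is a minimal sketch of a task-specific head. The `SegmentationHead` class is ours, not part of the checkpoint; the 1024-channel fused features and the 65-class count come from this card, and a dummy tensor stands in for the real model output:

```python
import torch
import torch.nn as nn

class SegmentationHead(nn.Module):
    """Hypothetical downstream head on top of frozen SkySense++ features."""

    def __init__(self, in_channels: int = 1024, num_classes: int = 65):
        super().__init__()
        self.classifier = nn.Conv2d(in_channels, num_classes, kernel_size=1)

    def forward(self, features: torch.Tensor, out_size: int = 512) -> torch.Tensor:
        logits = self.classifier(features)  # (B, num_classes, h, w)
        # Upsample coarse logits back to the input resolution
        return nn.functional.interpolate(
            logits, size=(out_size, out_size), mode="bilinear", align_corners=False
        )

# Dummy tensor standing in for out["features_fusion"]
features_fusion = torch.randn(2, 1024, 16, 16)
head = SegmentationHead()
logits = head(features_fusion)
print(logits.shape)  # torch.Size([2, 65, 512, 512])
```

A 1×1 convolution plus bilinear upsampling is only the simplest possible head; any decoder that accepts `(B, 1024, H, W)` features will do.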

## Model Metadata

| | Attribute | Value | |
| |-----------|-------| |
| | **Model type** | Multi-modal segmentation (HR + S2 + S1) | |
| | **Paper** | [SkySense++: A Semantic-Enhanced Multi-Modal Remote Sensing Foundation Model Beyond SkySense for Earth Observation](https://www.nature.com/articles/s42256-025-01078-8) | |
| | **Publication** | Nature Machine Intelligence, 2025 | |
| | **License** | Apache-2.0 | |
| | **Input modalities** | High-resolution optical, Sentinel-2, Sentinel-1 | |
| | **Output** | Semantic segmentation (65 classes) | |
| | **Checkpoint contents** | Backbone weights only; segmentation head not pretrained | |
| | **HR input size** | 512×512 | |
| | **S2/S1 patch size** | 16×16 | |

## Model Variants

| | Variant | Path | Sources | Use Modal VAE | Description | |
| |---------|------|---------|---------------|-------------| |
| | **full** (default) | `.` | hr, s2, s1 | Yes | All three modalities with VAE completion | |
| | **hr** | `hr/` | hr | No | High-resolution optical only | |
| | **s2** | `s2/` | s2 | No | Sentinel-2 only | |
| | **s1** | `s1/` | s1 | No | Sentinel-1 only | |

### Repository structure (full variant, diffusers layout)

```
.
├── config.json
├── model.safetensors
├── modality_vae/                 # VAE subfolder (diffusers standard)
│   ├── config.json
│   └── diffusion_pytorch_model.safetensors
├── modeling_skysensepp.py
├── configuration_skysensepp.py
├── pipeline_skysensepp.py
├── sky_sensepp_impl/             # ModalityCompletionVAE, ModalityCompletionVAEPipeline in necks/
└── hr/, s2/, s1/                 # Single-modality variants
```

The VAE loads automatically from the `modality_vae/` subfolder. A legacy `modality_vae.safetensors` at the repository root is also supported; migrate it with:
`python tools/split_vae_from_checkpoint.py --model-dir path/to/model --migrate`

## Installation

```bash
pip install transformers torch safetensors diffusers
```

The modality VAE uses diffusers `VQModel`. Legacy checkpoints (ConvVQVAEv2) load via a backward-compatible fallback.

## Usage

### Diffusers-style loading and inference

The VAE follows the [diffusers](https://huggingface.co/docs/diffusers) layout: the model sits in a `modality_vae/` subfolder with `config.json` and `diffusion_pytorch_model.safetensors`. Load and run inference like this:

```python
import torch
from transformers import AutoModel

# Load full model (VAE auto-loads from modality_vae/ subfolder, diffusers-style)
model = AutoModel.from_pretrained("path/to/SkySensepp", trust_remote_code=True)
model = model.eval().to("cuda")

# Prepare inputs
hr_img = torch.randn(1, 3, 512, 512, device="cuda")
s2_img = torch.randn(1, 10, 2, 256, 256, device="cuda")  # B, 10 bands, S steps, H, W
s1_img = torch.randn(1, 2, 2, 256, 256, device="cuda")   # B, 2 bands, S steps, H, W
modalities = torch.ones(1, 3, dtype=torch.bool, device="cuda")  # [hr, s2, s1] present

# Inference
with torch.no_grad():
    out = model(
        hr_img=hr_img,
        s2_img=s2_img,
        s1_img=s1_img,
        modality_flag_hr=modalities[:, :1],
        modality_flag_s2=modalities[:, 1:2],
        modality_flag_s1=modalities[:, 2:],
        return_features=True,
    )

features_fusion = out["features_fusion"]
logits_hr = out.get("logits_hr")
```
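
The boolean flags also let the full variant run when a source is unavailable, with the modality-completion VAE reconstructing features for the absent input. Below is a minimal sketch of the flag tensors for a batch without Sentinel-1; passing a zero placeholder for the missing image is our assumption here, not a documented contract, so check `modeling_skysensepp.py` for the exact convention:

```python
import torch

# Flag layout follows the example above: [hr, s2, s1]
modalities = torch.tensor([[True, True, False]])  # S1 missing for this batch

# Assumption: a zero tensor stands in for the absent modality; the VAE
# completes its features from the HR and S2 branches.
s1_placeholder = torch.zeros(1, 2, 2, 256, 256)

flag_hr = modalities[:, :1]   # tensor([[True]])
flag_s2 = modalities[:, 1:2]  # tensor([[True]])
flag_s1 = modalities[:, 2:]   # tensor([[False]])
```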

### Load VAE component only (diffusers-style)

```python
import torch

from sky_sensepp_impl.necks import ModalityCompletionVAE

# Load VAE from subfolder (same pattern as diffusers Stable Diffusion VAE)
vae = ModalityCompletionVAE.from_pretrained(
    "path/to/SkySensepp",
    subfolder="modality_vae",
)
vae = vae.eval().to("cuda")

# Run modality completion on backbone features (e.g. 2816-d, 16×16)
feat_hr = torch.randn(1, 2816, 16, 16, device="cuda")
feat_s2 = torch.randn(1, 2816, 16, 16, device="cuda")
feat_s1 = torch.randn(1, 2816, 16, 16, device="cuda")
modality_info = torch.ones(1, 3, dtype=torch.bool, device="cuda")

with torch.no_grad():
    out = vae(feat_hr, feat_s2, feat_s1, modality_info)

hr_out = out["hr_out"]
s2_out = out["s2_out"]
s1_out = out["s1_out"]
```

### ModalityCompletionVAEPipeline (modular, diffusers-style)

```python
import torch

from sky_sensepp_impl.necks import ModalityCompletionVAE, ModalityCompletionVAEPipeline

# Load pipeline (VAE from modality_vae/ subfolder)
pipe = ModalityCompletionVAEPipeline.from_pretrained(
    "path/to/SkySensepp",
    subfolder="modality_vae",
)
pipe = pipe.to("cuda")

# Backbone features (same shapes as in the VAE example above)
feat_hr = torch.randn(1, 2816, 16, 16, device="cuda")
feat_s2 = torch.randn(1, 2816, 16, 16, device="cuda")
feat_s1 = torch.randn(1, 2816, 16, 16, device="cuda")
modality_info = torch.ones(1, 3, dtype=torch.bool, device="cuda")

# Inference on features
out = pipe(
    feat_hr=feat_hr,
    feat_s2=feat_s2,
    feat_s1=feat_s1,
    modality_info=modality_info,
)
hr_out, s2_out, s1_out = out["hr_out"], out["s2_out"], out["s1_out"]

# Modular: inject a custom VAE
custom_vae = ModalityCompletionVAE.from_pretrained("path/to/custom_vae")
pipe = ModalityCompletionVAEPipeline.from_pretrained("path/to/SkySensepp", vae=custom_vae)

# Or swap components after load
pipe.register_components(vae=custom_vae)
```

### Load model and attach VAE manually

```python
from transformers import AutoModel

model = AutoModel.from_pretrained("path/to/SkySensepp", trust_remote_code=True)
model.load_vae(
    pretrained_model_name_or_path="path/to/SkySensepp",
    subfolder="modality_vae",
)
```

### Variants (single-modality, no VAE)

```python
from transformers import AutoModel

model_hr = AutoModel.from_pretrained("path/to/SkySensepp/hr", trust_remote_code=True)
model_s2 = AutoModel.from_pretrained("path/to/SkySensepp/s2", trust_remote_code=True)
model_s1 = AutoModel.from_pretrained("path/to/SkySensepp/s1", trust_remote_code=True)
```

### Representation shapes (HR-only)

| | Output | Shape | Description | |
| |--------|-------|-------------| |
| `features_hr[i]` | list of 4 maps | Backbone features at 4 scales (stages 0–3) |
| | `features_fusion` | `(B, 1024, H, W)` | Fused spatial representation for downstream head | |
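
Since `features_hr` is a list of four feature maps, one simple way to consume it is to resize every scale to a shared grid and concatenate. The channel counts and spatial sizes below are invented for illustration only; inspect your actual outputs:

```python
import torch
import torch.nn as nn

# Placeholder multi-scale features -- NOT the model's real dimensions
features_hr = [
    torch.randn(1, 128, 128, 128),  # stage 0
    torch.randn(1, 256, 64, 64),    # stage 1
    torch.randn(1, 512, 32, 32),    # stage 2
    torch.randn(1, 1024, 16, 16),   # stage 3
]

# Resize all stages to the finest grid and concatenate along channels
target = features_hr[0].shape[-2:]
merged = torch.cat(
    [nn.functional.interpolate(f, size=target, mode="bilinear", align_corners=False)
     for f in features_hr],
    dim=1,
)
print(merged.shape)  # torch.Size([1, 1920, 128, 128])
```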

## Input Formats

| | Modality | Shape | Description | |
| |----------|-------|-------------| |
| | **hr_img** | `(B, 3, H, W)` | RGB high-res, H=W=512 typical | |
| | **s2_img** | `(B, 10, S, H, W)` | Sentinel-2, 10 bands, S time steps | |
| | **s1_img** | `(B, 2, S, H, W)` | Sentinel-1 VV/VH, S time steps | |
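
The table above can be turned into a quick sanity check before calling the model. `check_inputs` is a hypothetical helper of ours, not part of the package:

```python
import torch

def check_inputs(hr_img: torch.Tensor, s2_img: torch.Tensor, s1_img: torch.Tensor) -> None:
    """Validate tensor shapes against the Input Formats table."""
    b = hr_img.shape[0]
    assert hr_img.ndim == 4 and hr_img.shape[1] == 3, "hr_img must be (B, 3, H, W)"
    assert s2_img.ndim == 5 and s2_img.shape[:2] == (b, 10), "s2_img must be (B, 10, S, H, W)"
    assert s1_img.ndim == 5 and s1_img.shape[:2] == (b, 2), "s1_img must be (B, 2, S, H, W)"
    assert s2_img.shape[2] == s1_img.shape[2], "S2 and S1 need the same number of time steps"

# Passes silently for shapes matching the table
check_inputs(
    torch.randn(1, 3, 512, 512),
    torch.randn(1, 10, 2, 256, 256),
    torch.randn(1, 2, 2, 256, 256),
)
```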

## Citation

```bibtex
@article{skysensepp2025,
  title={SkySense++: A Semantic-Enhanced Multi-Modal Remote Sensing Foundation Model Beyond SkySense for Earth Observation},
  journal={Nature Machine Intelligence},
  year={2025},
  url={https://www.nature.com/articles/s42256-025-01078-8}
}
```

## References

- [Project Page](https://zqcrafts.github.io/SkySense-O/project.html)
- [Zenodo Datasets](https://zenodo.org/records/15010418)
- [Hugging Face SkySense](https://huggingface.co/KKKKKKang/JL-16)