---
language: en
license: apache-2.0
datasets:
- zenodo
- huggingface
tags:
- earth-observation
- remote-sensing
- segmentation
- multi-modal
- optical
- sentinel-1
- sentinel-2
- computer-vision
pipeline_tag: image-segmentation
inference: true
---

# SkySense++

SkySense++ is a semantic-enhanced multi-modal remote sensing **foundation model** for Earth observation. It fuses high-resolution optical imagery (HR), Sentinel-2 (S2), and Sentinel-1 SAR (S1) through independent backbones, an optional modality-completion VAE, and a shared transformer fusion encoder.

**Primary use: representation extraction.** The pretrained backbones produce rich feature representations for downstream tasks (classification, segmentation, regression). Extract `features_hr`, `features_s2`, `features_s1`, or `features_fusion` and feed them to your task-specific head. Fine-tuning on your target dataset is required.

See the main [SkySensePlusPlus](https://github.com/zqcrafts/SkySense-O) repository for pretraining, 1-shot, and finetuning workflows.
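As a sketch of the representation-extraction workflow described above: pool an extracted feature map and train a small task head on top. The feature dimension (1024) and class count (65) come from the tables below; the random tensor and the linear probe are illustrative stand-ins, not part of the pretrained checkpoint.

```python
import torch
import torch.nn as nn

# Stand-in for out["features_fusion"] from the model: (B, C, H, W).
# Replace with real extracted features; shapes assumed from this card.
features_fusion = torch.randn(2, 1024, 16, 16)

pooled = features_fusion.mean(dim=(2, 3))  # global average pool -> (B, 1024)
probe = nn.Linear(1024, 65)                # task-specific head, trained by you
logits = probe(pooled)
print(logits.shape)  # torch.Size([2, 65])
```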
## Model Metadata

| Attribute | Value |
|-----------|-------|
| **Model type** | Multi-modal segmentation (HR + S2 + S1) |
| **Paper** | [SkySense++: A Semantic-Enhanced Multi-Modal Remote Sensing Foundation Model Beyond SkySense for Earth Observation](https://www.nature.com/articles/s42256-025-01078-8) |
| **Publication** | Nature Machine Intelligence, 2025 |
| **License** | Apache-2.0 |
| **Input modalities** | High-resolution optical, Sentinel-2, Sentinel-1 |
| **Output** | Semantic segmentation (65 classes) |
| **Checkpoint contents** | Backbone weights only; segmentation head not pretrained |
| **HR input size** | 512×512 |
| **S2/S1 patch size** | 16×16 |

## Model Variants

| Variant | Path | Sources | Use Modal VAE | Description |
|---------|------|---------|---------------|-------------|
| **full** (default) | `.` | hr, s2, s1 | Yes | All three modalities with VAE completion |
| **hr** | `hr/` | hr | No | High-resolution optical only |
| **s2** | `s2/` | s2 | No | Sentinel-2 only |
| **s1** | `s1/` | s1 | No | Sentinel-1 only |

### Repository structure (full variant, diffusers layout)

```
.
├── config.json
├── model.safetensors
├── modality_vae/                 # VAE subfolder (diffusers standard)
│   ├── config.json
│   └── diffusion_pytorch_model.safetensors
├── modeling_skysensepp.py
├── configuration_skysensepp.py
├── pipeline_skysensepp.py
├── sky_sensepp_impl/             # ModalityCompletionVAE, ModalityCompletionVAEPipeline in necks/
└── hr/, s2/, s1/                 # Single-modality variants
```

The VAE loads automatically from the `modality_vae/` subfolder. A legacy `modality_vae.safetensors` at the repository root is also supported. Migrate with:

```bash
python tools/split_vae_from_checkpoint.py --model-dir path/to/model --migrate
```

## Installation

```bash
pip install transformers torch safetensors diffusers
```

The modality VAE uses the diffusers `VQModel`. Legacy checkpoints (ConvVQVAEv2) load via a backward-compatible fallback.
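Before migrating, it can help to check which layout a local checkpoint directory uses. A minimal sketch with only the standard library, keyed off the two layouts documented above (the `vae_layout` helper name is our own, not part of the package):

```python
from pathlib import Path


def vae_layout(model_dir: str) -> str:
    """Classify how the modality VAE is stored in a local checkpoint dir."""
    root = Path(model_dir)
    sub = root / "modality_vae"
    # Diffusers-style subfolder: config.json + diffusion_pytorch_model.safetensors
    if (sub / "config.json").is_file() and (
        sub / "diffusion_pytorch_model.safetensors"
    ).is_file():
        return "diffusers"
    # Legacy single file at the repository root
    if (root / "modality_vae.safetensors").is_file():
        return "legacy"
    return "missing"
```

A result of `"legacy"` means the migration command above applies.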
## Usage

### Diffusers-style loading and inference

The VAE follows the [diffusers](https://huggingface.co/docs/diffusers) layout: its weights sit in a `modality_vae/` subfolder with `config.json` and `diffusion_pytorch_model.safetensors`. Load and run inference like this:

```python
import torch
from transformers import AutoModel

# Load full model (VAE auto-loads from modality_vae/ subfolder, diffusers-style)
model = AutoModel.from_pretrained("path/to/SkySensepp", trust_remote_code=True)
model = model.eval().to("cuda")

# Prepare inputs
hr_img = torch.randn(1, 3, 512, 512, device="cuda")
s2_img = torch.randn(1, 10, 2, 256, 256, device="cuda")  # B, 10 bands, S steps, H, W
s1_img = torch.randn(1, 2, 2, 256, 256, device="cuda")   # B, 2 bands, S steps, H, W
modalities = torch.ones(1, 3, dtype=torch.bool, device="cuda")  # [hr, s2, s1] present

# Inference
with torch.no_grad():
    out = model(
        hr_img=hr_img,
        s2_img=s2_img,
        s1_img=s1_img,
        modality_flag_hr=modalities[:, :1],
        modality_flag_s2=modalities[:, 1:2],
        modality_flag_s1=modalities[:, 2:],
        return_features=True,
    )
features_fusion = out["features_fusion"]
logits_hr = out.get("logits_hr")
```

### Load VAE component only (diffusers-style)

```python
import torch

from sky_sensepp_impl.necks import ModalityCompletionVAE

# Load VAE from subfolder (same pattern as the diffusers Stable Diffusion VAE)
vae = ModalityCompletionVAE.from_pretrained(
    "path/to/SkySensepp",
    subfolder="modality_vae",
)
vae = vae.eval().to("cuda")

# Run modality completion on backbone features (e.g. 2816-d, 16×16)
feat_hr = torch.randn(1, 2816, 16, 16, device="cuda")
feat_s2 = torch.randn(1, 2816, 16, 16, device="cuda")
feat_s1 = torch.randn(1, 2816, 16, 16, device="cuda")
modality_info = torch.ones(1, 3, dtype=torch.bool, device="cuda")

with torch.no_grad():
    out = vae(feat_hr, feat_s2, feat_s1, modality_info)
hr_out = out["hr_out"]
s2_out = out["s2_out"]
s1_out = out["s1_out"]
```

### ModalityCompletionVAEPipeline (modular, diffusers-style)

```python
from sky_sensepp_impl.necks import ModalityCompletionVAE, ModalityCompletionVAEPipeline

# Load pipeline (VAE from modality_vae/ subfolder)
pipe = ModalityCompletionVAEPipeline.from_pretrained(
    "path/to/SkySensepp",
    subfolder="modality_vae",
)
pipe = pipe.to("cuda")

# Inference on features
out = pipe(
    feat_hr=feat_hr,
    feat_s2=feat_s2,
    feat_s1=feat_s1,
    modality_info=modality_info,
)
hr_out, s2_out, s1_out = out["hr_out"], out["s2_out"], out["s1_out"]

# Modular: inject a custom VAE
custom_vae = ModalityCompletionVAE.from_pretrained("path/to/custom_vae")
pipe = ModalityCompletionVAEPipeline.from_pretrained("path/to/SkySensepp", vae=custom_vae)

# Or swap components after load
pipe.register_components(vae=custom_vae)
```

### Load model and attach VAE manually

```python
from transformers import AutoModel

model = AutoModel.from_pretrained("path/to/SkySensepp", trust_remote_code=True)
model.load_vae(
    pretrained_model_name_or_path="path/to/SkySensepp",
    subfolder="modality_vae",
)
```

### Variants (single-modality, no VAE)

```python
model = AutoModel.from_pretrained("path/to/SkySensepp/hr", trust_remote_code=True)
model = AutoModel.from_pretrained("path/to/SkySensepp/s2", trust_remote_code=True)
model = AutoModel.from_pretrained("path/to/SkySensepp/s1", trust_remote_code=True)
```

### Representation shapes (HR-only)

| Output | Shape | Description |
|--------|-------|-------------|
| `features_hr[i]` | multi-scale | Backbone features at 4 scales (stages 0–3) |
| `features_fusion` | `(B, 1024, H, W)` | Fused spatial representation for downstream head |
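Since the checkpoint ships backbone weights only, any segmentation head must be trained by the user. A minimal sketch of such a head, assuming the `features_fusion` shape from the table above and the 65-class output documented on this card (the 1×1 convolution and bilinear upsampling are our illustrative choices, not the pretrained head):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal user-trained head: fused features (B, 1024, H, W) -> 65 class logits,
# upsampled to the 512x512 HR input resolution.
head = nn.Conv2d(1024, 65, kernel_size=1)

features_fusion = torch.randn(1, 1024, 16, 16)  # stand-in for out["features_fusion"]
logits = head(features_fusion)                  # (1, 65, 16, 16)
seg = F.interpolate(logits, size=(512, 512), mode="bilinear", align_corners=False)
print(seg.shape)  # torch.Size([1, 65, 512, 512])
```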
## Input Formats

| Modality | Shape | Description |
|----------|-------|-------------|
| **hr_img** | `(B, 3, H, W)` | RGB high-res, H = W = 512 typical |
| **s2_img** | `(B, 10, S, H, W)` | Sentinel-2, 10 bands, S time steps |
| **s1_img** | `(B, 2, S, H, W)` | Sentinel-1 VV/VH, S time steps |

## Citation

```bibtex
@article{skysensepp2025,
  title={SkySense++: A Semantic-Enhanced Multi-Modal Remote Sensing Foundation Model Beyond SkySense for Earth Observation},
  journal={Nature Machine Intelligence},
  year={2025},
  url={https://www.nature.com/articles/s42256-025-01078-8}
}
```

## References

- [Project Page](https://zqcrafts.github.io/SkySense-O/project.html)
- [Zenodo Datasets](https://zenodo.org/records/15010418)
- [Hugging Face SkySense](https://huggingface.co/KKKKKKang/JL-16)
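For reference, the input shapes documented on this card can be assembled from per-acquisition frames like this (a sketch with random data; the number of time steps S = 4 and any band ordering or normalization are up to your preprocessing pipeline):

```python
import torch

# Four acquisitions per sensor; each frame is (bands, H, W).
s2_frames = [torch.randn(10, 256, 256) for _ in range(4)]  # Sentinel-2, 10 bands
s1_frames = [torch.randn(2, 256, 256) for _ in range(4)]   # Sentinel-1 VV/VH

# Stack along a new time axis after the band axis, then add the batch axis.
s2_img = torch.stack(s2_frames, dim=1).unsqueeze(0)  # (1, 10, 4, 256, 256)
s1_img = torch.stack(s1_frames, dim=1).unsqueeze(0)  # (1, 2, 4, 256, 256)
hr_img = torch.randn(1, 3, 512, 512)                 # (B, 3, H, W)
```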