---
language: en
license: apache-2.0
datasets:
- zenodo
- huggingface
tags:
- earth-observation
- remote-sensing
- segmentation
- multi-modal
- optical
- sentinel-1
- sentinel-2
- computer-vision
pipeline_tag: image-segmentation
inference: true
---
# SkySense++
SkySense++ is a semantic-enhanced multi-modal remote sensing **foundation model** for Earth observation. It fuses high-resolution optical imagery (HR), Sentinel-2 (S2), and Sentinel-1 SAR (S1) through independent backbones, an optional modality-completion VAE, and a shared transformer fusion encoder.
**Primary use: representation extraction.** The pretrained backbones produce rich feature representations for downstream tasks (classification, segmentation, regression). Extract `features_hr`, `features_s2`, `features_s1`, or `features_fusion` and feed them to your task-specific head. Fine-tuning on your target dataset is required. See the main [SkySensePlusPlus](https://github.com/zqcrafts/SkySense-O) repository for pretraining, 1-shot, and finetuning workflows.
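As an illustration of this workflow, here is a minimal sketch of attaching a task-specific segmentation head to the fused representation. The model itself is not loaded; a random tensor stands in for an extracted `features_fusion` map, and the 1024-channel shape and 65-class count are taken from the tables in this card.

```python
import torch
import torch.nn as nn

# Hypothetical downstream head: a 1x1 conv acting as a linear probe over the
# fused (B, 1024, H, W) representation. The tensor below is a placeholder for
# features extracted from the pretrained model, not real model output.
num_classes = 65
head = nn.Conv2d(1024, num_classes, kernel_size=1)

features_fusion = torch.randn(2, 1024, 32, 32)  # stand-in for extracted features
logits = head(features_fusion)                  # (2, 65, 32, 32)
```

In practice the head would be trained on your target dataset while the backbones stay frozen or are fine-tuned, per the repository's finetuning workflows.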
## Model Metadata
| Attribute | Value |
|-----------|-------|
| **Model type** | Multi-modal segmentation (HR + S2 + S1) |
| **Paper** | [SkySense++: A Semantic-Enhanced Multi-Modal Remote Sensing Foundation Model Beyond SkySense for Earth Observation](https://www.nature.com/articles/s42256-025-01078-8) |
| **Publication** | Nature Machine Intelligence, 2025 |
| **License** | Apache-2.0 |
| **Input modalities** | High-resolution optical, Sentinel-2, Sentinel-1 |
| **Output** | Semantic segmentation (65 classes) |
| **Checkpoint contents** | Backbone weights only; segmentation head not pretrained |
| **HR input size** | 512×512 |
| **S2/S1 patch size** | 16×16 |
## Model Variants
| Variant | Path | Sources | Use Modal VAE | Description |
|---------|------|---------|---------------|-------------|
| **full** (default) | `.` | hr, s2, s1 | Yes | All three modalities with VAE completion |
| **hr** | `hr/` | hr | No | High-resolution optical only |
| **s2** | `s2/` | s2 | No | Sentinel-2 only |
| **s1** | `s1/` | s1 | No | Sentinel-1 only |
### Repository structure (full variant, diffusers layout)
```
.
├── config.json
├── model.safetensors
├── modality_vae/ # VAE subfolder (diffusers standard)
│ ├── config.json
│ └── diffusion_pytorch_model.safetensors
├── modeling_skysensepp.py
├── configuration_skysensepp.py
├── pipeline_skysensepp.py
├── sky_sensepp_impl/ # ModalityCompletionVAE, ModalityCompletionVAEPipeline in necks/
├── hr/, s2/, s1/ # Single-modality variants
```
The VAE weights load automatically from the `modality_vae/` subfolder. A legacy `modality_vae.safetensors` at the repository root is also supported; migrate it with:
`python tools/split_vae_from_checkpoint.py --model-dir path/to/model --migrate`
## Installation
```bash
pip install transformers torch safetensors diffusers
```
The modality VAE uses the diffusers `VQModel`. Legacy checkpoints (ConvVQVAEv2) load via a backward-compatible fallback.
## Usage
### Diffusers-style loading and inference
The VAE follows the [diffusers](https://huggingface.co/docs/diffusers) layout: model in a `modality_vae/` subfolder with `config.json` and `diffusion_pytorch_model.safetensors`. Load and run inference like this:
```python
import torch
from transformers import AutoModel
# Load full model (VAE auto-loads from modality_vae/ subfolder, diffusers-style)
model = AutoModel.from_pretrained("path/to/SkySensepp", trust_remote_code=True)
model = model.eval().to("cuda")
# Prepare inputs
hr_img = torch.randn(1, 3, 512, 512, device="cuda")
s2_img = torch.randn(1, 10, 2, 256, 256, device="cuda") # B, 10 bands, S steps, H, W
s1_img = torch.randn(1, 2, 2, 256, 256, device="cuda") # B, 2 bands, S steps, H, W
modalities = torch.ones(1, 3, dtype=torch.bool, device="cuda") # [hr, s2, s1] present
# Inference
with torch.no_grad():
    out = model(
        hr_img=hr_img,
        s2_img=s2_img,
        s1_img=s1_img,
        modality_flag_hr=modalities[:, :1],
        modality_flag_s2=modalities[:, 1:2],
        modality_flag_s1=modalities[:, 2:],
        return_features=True,
    )
features_fusion = out["features_fusion"]
logits_hr = out.get("logits_hr")
```
### Load VAE component only (diffusers-style)
```python
import torch

from sky_sensepp_impl.necks import ModalityCompletionVAE
# Load VAE from subfolder (same pattern as diffusers Stable Diffusion VAE)
vae = ModalityCompletionVAE.from_pretrained(
    "path/to/SkySensepp",
    subfolder="modality_vae",
)
vae = vae.eval().to("cuda")
# Run modality completion on backbone features (e.g. 2816-d, 16×16)
feat_hr = torch.randn(1, 2816, 16, 16, device="cuda")
feat_s2 = torch.randn(1, 2816, 16, 16, device="cuda")
feat_s1 = torch.randn(1, 2816, 16, 16, device="cuda")
modality_info = torch.ones(1, 3, dtype=torch.bool, device="cuda")
with torch.no_grad():
    out = vae(feat_hr, feat_s2, feat_s1, modality_info)
hr_out = out["hr_out"]
s2_out = out["s2_out"]
s1_out = out["s1_out"]
```
### ModalityCompletionVAEPipeline (modular, diffusers-style)
```python
from sky_sensepp_impl.necks import ModalityCompletionVAE, ModalityCompletionVAEPipeline
# Load pipeline (VAE from modality_vae/ subfolder)
pipe = ModalityCompletionVAEPipeline.from_pretrained(
    "path/to/SkySensepp",
    subfolder="modality_vae",
)
pipe = pipe.to("cuda")
# Inference on features (feat_hr, feat_s2, feat_s1, modality_info as in the previous example)
out = pipe(
    feat_hr=feat_hr,
    feat_s2=feat_s2,
    feat_s1=feat_s1,
    modality_info=modality_info,
)
hr_out, s2_out, s1_out = out["hr_out"], out["s2_out"], out["s1_out"]
# Modular: inject custom VAE
custom_vae = ModalityCompletionVAE.from_pretrained("path/to/custom_vae")
pipe = ModalityCompletionVAEPipeline.from_pretrained("path/to/SkySensepp", vae=custom_vae)
# Or swap components after load
pipe.register_components(vae=custom_vae)
```
### Load model and attach VAE manually
```python
from transformers import AutoModel

model = AutoModel.from_pretrained("path/to/SkySensepp", trust_remote_code=True)
model.load_vae(
    pretrained_model_name_or_path="path/to/SkySensepp",
    subfolder="modality_vae",
)
```
### Variants (single-modality, no VAE)
```python
model = AutoModel.from_pretrained("path/to/SkySensepp/hr", trust_remote_code=True)
model = AutoModel.from_pretrained("path/to/SkySensepp/s2", trust_remote_code=True)
model = AutoModel.from_pretrained("path/to/SkySensepp/s1", trust_remote_code=True)
```
### Representation shapes (HR-only)
| Output | Shape | Description |
|--------|-------|-------------|
| `features_hr[i]` | multi-scale | Backbone features at 4 scales (stage 0–3) |
| `features_fusion` | `(B, 1024, H, W)` | Fused spatial representation for downstream head |
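For image-level tasks, the spatial `features_fusion` map can also be pooled into a single vector. A hedged sketch (the `(B, 1024, H, W)` shape follows the table above; the tensor and the 10-class head are illustrative stand-ins, not real model output):

```python
import torch
import torch.nn as nn

# Global-average-pool the fused map into a vector, then apply a linear
# classifier. The class count (10) is an arbitrary downstream choice.
pool = nn.AdaptiveAvgPool2d(1)
classifier = nn.Linear(1024, 10)

features_fusion = torch.randn(2, 1024, 16, 16)  # stand-in for extracted features
pooled = pool(features_fusion).flatten(1)       # (2, 1024)
logits = classifier(pooled)                     # (2, 10)
```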
## Input Formats
| Modality | Shape | Description |
|----------|-------|-------------|
| **hr_img** | `(B, 3, H, W)` | RGB high-res, H=W=512 typical |
| **s2_img** | `(B, 10, S, H, W)` | Sentinel-2, 10 bands, S time steps |
| **s1_img** | `(B, 2, S, H, W)` | Sentinel-1 VV/VH, S time steps |
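When a modality is unavailable, the per-modality boolean flags from the usage example above mark it as absent. A minimal sketch of building the flags with Sentinel-1 missing (the `[hr, s2, s1]` ordering and `(B, 3)` bool layout follow that example; how the missing modality's input tensor should be filled is not specified here):

```python
import torch

# Modality-presence flags for a batch of 1 where S1 is missing.
modalities = torch.tensor([[True, True, False]])  # [hr, s2, s1]
flag_hr = modalities[:, :1]   # (1, 1) -> True
flag_s2 = modalities[:, 1:2]  # (1, 1) -> True
flag_s1 = modalities[:, 2:]   # (1, 1) -> False
```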
## Citation
```bibtex
@article{skysensepp2025,
  title={SkySense++: A Semantic-Enhanced Multi-Modal Remote Sensing Foundation Model Beyond SkySense for Earth Observation},
  journal={Nature Machine Intelligence},
  year={2025},
  url={https://www.nature.com/articles/s42256-025-01078-8}
}
```
## References
- [Project Page](https://zqcrafts.github.io/SkySense-O/project.html)
- [Zenodo Datasets](https://zenodo.org/records/15010418)
- [Hugging Face SkySense](https://huggingface.co/KKKKKKang/JL-16)