---
license: cc-by-4.0
pipeline_tag: image-feature-extraction
library_name: pytorch
---

# DOFA: Neural Plasticity-Inspired Multimodal Foundation Model for Earth Observation

[📄 Paper](https://arxiv.org/abs/2403.15356) - [💻 Code](https://github.com/zhu-xlab/DOFA)

**DOFA** (Dynamic One-For-All) is a unified, multimodal foundation framework designed for diverse vision tasks in Earth observation (EO). Inspired by neural plasticity, DOFA utilizes a wavelength-conditioned dynamic hypernetwork to process inputs from five distinct satellite sensors flexibly. By continually pretraining on five EO modalities, DOFA achieves state-of-the-art performance across multiple downstream tasks and generalizes well to unseen modalities.

## Model Details

What is DOFA: DOFA is a unified multimodal foundation model that handles diverse data modalities in remote sensing and Earth observation.

How DOFA differs from existing foundation models: DOFA is pre-trained on five different remote sensing and Earth observation modalities, and it can handle images with any number of input channels.

DOFA is inspired by neuroplasticity, the brain's mechanism for adapting to new experiences and environmental shifts. DOFA emulates this mechanism to process multimodal EO data within a single network.

For more details, please refer to the paper [Neural Plasticity-Inspired Foundation Model for Observing the Earth Crossing Modalities](https://arxiv.org/abs/2403.15356).

### Model Description

**Why develop DOFA**

Existing EO foundation models face several limitations:

-   Representations learned from a single sensor may not effectively capture relationships across sensors.
-   Performance degrades when downstream tasks require data from unseen sensors with different numbers of spectral bands, spatial resolutions, or wavelength regimes.
-   Developing individual, customized foundation models for each sensor requires considerably more computing resources and human effort.
-   The increasing number of specialized foundation models makes it difficult to select the most appropriate one for a specific downstream task.

DOFA supports input images with any number of channels using our pre-trained foundation models.
The examples in the GitHub repo [DOFA](https://github.com/zhu-xlab/DOFA) show how to use DOFA for Sentinel-1 (SAR), Sentinel-2, and NAIP RGB data.
We will add example usage for Gaofen multispectral and hyperspectral data soon.
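The idea behind this flexibility can be illustrated with a minimal sketch: a small hypernetwork maps each band's central wavelength to that channel's patch-embedding weights, so a single module can embed images with any number of bands. All class names, layer sizes, and wavelength values below are hypothetical, for illustration only; they are not DOFA's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WavelengthConditionedEmbed(nn.Module):
    """Hypothetical sketch: generates per-band patch-projection weights
    from each band's central wavelength, so one module can embed images
    with any channel count."""

    def __init__(self, embed_dim=64, patch=16):
        super().__init__()
        self.patch = patch
        self.embed_dim = embed_dim
        # hypernetwork: scalar wavelength -> one channel's projection kernel
        self.hyper = nn.Sequential(
            nn.Linear(1, 128),
            nn.ReLU(),
            nn.Linear(128, embed_dim * patch * patch),
        )

    def forward(self, x, wavelengths):
        num_bands = x.shape[1]
        w = torch.tensor(wavelengths, dtype=torch.float32).view(num_bands, 1)
        # build a (embed_dim, num_bands, patch, patch) conv kernel on the fly
        kernel = self.hyper(w).view(num_bands, self.embed_dim, self.patch, self.patch)
        kernel = kernel.permute(1, 0, 2, 3)
        return F.conv2d(x, kernel, stride=self.patch)

embed = WavelengthConditionedEmbed()
optical = torch.rand(1, 4, 224, 224)  # e.g. a 4-band optical image
sar = torch.rand(1, 2, 224, 224)      # e.g. 2-band SAR (placeholder wavelengths)
tokens_optical = embed(optical, [0.48, 0.56, 0.64, 0.81])
tokens_sar = embed(sar, [5.405, 5.405])
print(tokens_optical.shape, tokens_sar.shape)  # both torch.Size([1, 64, 14, 14])
```

The same module handles 4-band and 2-band inputs because the projection kernel is generated per band rather than fixed at construction time.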

---

-   **Developed by:** Technical University of Munich, [Chair of Data Science in Earth Observation](https://www.asg.ed.tum.de/en/sipeo/home/)
-   **Funded by:** Ekapex, ML4Earth
-   **Model type:** Multimodal Foundation Model for Remote Sensing and Earth Observation

### Model Sources

-   **Repository:** https://github.com/zhu-xlab/DOFA
-   **Paper:** https://arxiv.org/abs/2403.15356
-   **Demo:** https://github.com/ShadowXZT/DOFA-pytorch/blob/master/demo.ipynb

---
Table 1: Linear probing results on six classification tasks. All models are trained
for 50 epochs. The reported numbers are top-1 overall accuracy (OA). Missing values
are due to the inability of the model to adapt to this domain.

| Method             | Backbone    | m-bigearthnet | m-forestnet | m-brick-kiln | m-pv4ger | m-so2sat | m-eurosat |
|--------------------|-------------|---------------|-------------|--------------|----------|----------|-----------|
| **Fully Trained**  | ViT-S       | 66.0          | 53.8        | 98.1         | 97.6     | 57.5     | 97.3      |
| **Fully Trained**  | SwinV2-T    | 70.0          | 58.0        | 98.7         | 98.0     | 56.1     | 97.4      |
| **Fully Trained**  | ConvNext-B  | 69.1          | 56.8        | 98.9         | 98.0     | 58.1     | 97.7      |
| **rand. init.**    | ViT-B       | 52.9          | 41.5        | 84.5         | 91.3     | 38.3     | 85.7      |
| **MAE_Single [44]**| ViT-B       | 63.6          | -           | 88.9         | 92.2     | 50.0     | 88.9      |
| **OFA-Net [43]**   | ViT-B       | 65.0          | -           | 94.7         | 93.2     | 49.4     | 91.9      |
| **SatMAE [25]**    | ViT-B       | 62.1          | -           | 93.9         | -        | 46.9     | 86.4      |
| **Scale-MAE [22]** | ViT-L       | -             | -           | -            | 96.9     | -        | -         |
| **GFM [21]**       | Swin-B      | -             | -           | -            | 96.8     | -        | -         |
| **Cross-Scale MAE [23]** | ViT-B | -             | -           | -            | 93.1     | -        | -         |
| **FG-MAE [24]**    | ViT-B       | 63.0          | -           | 94.7         | -        | 51.4     | 87.0      |
| **CROMA [27]**     | ViT-B       | 67.4          | -           | 91.0         | -        | 49.2     | 90.1      |
| **DOFA**           | ViT-B       | 65.7          | 50.9        | 95.8         | 96.9     | 55.1     | 93.9      |
| **DOFA**           | ViT-L       | **67.5**          | **54.6**        | **96.9**         | **97.3**     | **60.1**     | **97.1**      |



Table 2: Partial fine-tuning results on six segmentation tasks. All models are
trained with a frozen backbone for 20 epochs. Reported numbers are mean intersection
over union (mIoU). Missing values are due to the inability of the model to adapt to
this domain.

| Method             | Backbone    | m-pv4ger-seg | m-nz-cattle | m-NeonTree | m-cashew-plant | m-SA-crop | m-chesapeake |
|--------------------|-------------|--------------|-------------|------------|----------------|-----------|--------------|
| **DeepLabv3**      | ResNet101   | 93.4         | 67.6        | 53.9       | 48.6           | 30.4      | 62.1         |
| **U-Net**          | ResNet101   | 94.1         | 80.5        | 56.6       | 46.6           | 29.9      | 70.8         |
| **rand. init.**    | ViT-B       | 81.7         | 74.1        | 51.7       | 32.4           | 29.0      | 47.1         |
| **MAE_Single [44]**| ViT-B       | 88.4         | 76.4        | 53.0       | 40.7           | 30.7      | 51.9         |
| **OFA-Net [43]**   | ViT-B       | 89.4         | 77.6        | 53.3       | 47.9           | 31.9      | 54.5         |
| **Scale-MAE [22]** | ViT-L       | 83.5         | 76.5        | 51.0       | -              | -         | 61.0         |
| **GFM [21]**       | Swin-B      | 92.0         | 75.0        | 51.1       | -              | -         | 63.8         |
| **Cross-Scale MAE [23]** | ViT-B | 83.2         | 77.9        | 52.1       | -              | -         | 52.3         |
| **CROMA [27]**     | ViT-B       | -            | -           | -          | 30.1           | 31.4      | -            |
| **FG-MAE [24]**    | ViT-B       | -            | -           | -          | 40.8           | 30.6      | -            |
| **DOFA**           | ViT-B       | 94.5         | 81.4        | 58.8       | 51.5           | **33.0**  | 65.3         |
| **DOFA**           | ViT-L       | **95.0**     | **81.8**    | **59.4**   | **56.9**       | 32.1      | **66.3**     |

---

## How to Use

Please refer to the GitHub repo [DOFA](https://github.com/zhu-xlab/DOFA) for more details, including `demo.ipynb` with extensive usage examples.

### Using `torch.hub` to Load the DOFA ViT Base Model

This snippet shows how to load the **DOFA ViT Base** model from a GitHub repository that provides a `hubconf.py` entry point. The model weights are hosted on Hugging Face and fetched via a direct download URL, so **no additional dependencies** beyond PyTorch are required.

```python
import torch

model = torch.hub.load(
    'zhu-xlab/DOFA',
    'vit_base_dofa',  # The entry point defined in hubconf.py
    pretrained=True,
)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)
model.eval()
```

Now the model is ready for inference or further fine-tuning.
If you would like to fine-tune DOFA on different downstream tasks, please refer to [DOFA-pytorch](https://github.com/xiong-zhitong/DOFA-pytorch).

### TorchGeo

Alternatively, DOFA can be used via the [TorchGeo](https://github.com/microsoft/torchgeo) library:

```python
import torch
from torchgeo.models import DOFABase16_Weights, dofa_base_patch16_224

# Example NAIP image (band center wavelengths in μm)
x = torch.rand(2, 4, 224, 224)
wavelengths = [0.48, 0.56, 0.64, 0.81]

# Use pre-trained model weights
model = dofa_base_patch16_224(weights=DOFABase16_Weights.DOFA_MAE)

# Make a prediction (model may need to be fine-tuned first)
y = model(x, wavelengths)
```
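The linear-probing protocol behind Table 1 freezes the backbone and trains only a linear classification head. Below is a minimal sketch of that setup; a dummy stand-in module replaces the backbone so the snippet runs without downloading weights, and the class count and feature dimension are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Linear-probing sketch: freeze the pretrained backbone, train only a
# linear head. `backbone` is a dummy stand-in here; in practice it would
# be a loaded DOFA model whose forward pass returns pooled features.
backbone = nn.Sequential(nn.Flatten(), nn.Linear(4 * 224 * 224, 768))
for p in backbone.parameters():
    p.requires_grad = False  # backbone stays frozen during linear probing

head = nn.Linear(768, 10)  # e.g. 10 classes, as in m-eurosat
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)

x = torch.rand(2, 4, 224, 224)
labels = torch.tensor([1, 7])
with torch.no_grad():
    feats = backbone(x)  # no gradients flow through the frozen backbone
logits = head(feats)
loss = nn.functional.cross_entropy(logits, labels)
loss.backward()
optimizer.step()
print(logits.shape)  # torch.Size([2, 10])
```

Only the head's parameters receive gradient updates, which is what makes the reported numbers a probe of the frozen representation rather than of full fine-tuning.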

## Citation

If you find DOFA useful in your research, please cite our paper:
```bibtex
@article{xiong2024neural,
  title={Neural Plasticity-Inspired Foundation Model for Observing the {Earth} Crossing Modalities},
  author={Xiong, Zhitong and Wang, Yi and Zhang, Fahong and Stewart, Adam J and Hanna, Jo{\"e}lle and Borth, Damian and Papoutsis, Ioannis and Saux, Bertrand Le and Camps-Valls, Gustau and Zhu, Xiao Xiang},
  journal={arXiv preprint arXiv:2403.15356},
  year={2024}
}
```