tags:
- satellite-imagery
- remote-sensing
---

# SATtxt - Spectrally Distilled Representations Aligned with Instruction-Augmented LLMs for Satellite Imagery

<p align="center">
  <img src="https://i.imgur.com/waxVImv.png" alt="SATtxt">
</p>

<p align="center">
  <a href="https://github.com/ikhado/sattxt"><img src="https://img.shields.io/badge/Project-Page-green" alt="Project Page"></a>
</p>

---

## 📰 News

| Date | Update |
|------|--------|
| **Mar 9, 2026** | We have released the model code and weights. |
| **Feb 23, 2026** | SATtxt has been accepted at **CVPR 2026**. We appreciate the reviewers and ACs. |

---

## Overview
SATtxt is a vision-language foundation model for satellite imagery. We train **only the projection heads**, keeping both encoders frozen.

<table>
<tr><th>Component</th><th>Backbone</th><th>Parameters</th></tr>
<tr><td>Vision Encoder</td><td><a href="https://github.com/facebookresearch/dinov3">DINOv3</a> ViT-L/16</td><td>Frozen</td></tr>
<tr><td>Text Encoder</td><td>LLM2Vec Meta-Llama-3-8B-Instruct</td><td>Frozen</td></tr>
<tr><td>Vision Head</td><td>Transformer Projection</td><td>Trained</td></tr>
<tr><td>Text Head</td><td>Linear Projection</td><td>Trained</td></tr>
</table>

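Since only the projection heads receive gradients, the training setup amounts to freezing the encoder parameters and handing only the head parameters to the optimizer. A minimal PyTorch sketch of that pattern, using small hypothetical stand-in modules rather than the repository's actual encoders or training code:

```python
import torch
from torch import nn

# Hypothetical stand-ins for a frozen encoder and a trained projection head.
encoder = nn.Linear(16, 8)
head = nn.Linear(8, 4)

# Freeze the encoder: its parameters receive no gradient updates.
for p in encoder.parameters():
    p.requires_grad = False

# Only the head's parameters are given to the optimizer.
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-4)

x = torch.randn(2, 16)
with torch.no_grad():          # encoder runs in inference mode
    feats = encoder(x)
loss = head(feats).pow(2).mean()  # placeholder loss for illustration
loss.backward()
optimizer.step()

trainable = sum(p.numel() for p in head.parameters() if p.requires_grad)
frozen = sum(p.numel() for p in encoder.parameters() if not p.requires_grad)
```

Only `trainable` parameters change during training; the `frozen` encoder weights stay at their pretrained values.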
---
## Installation

```bash
git clone https://github.com/ikhado/sattxt.git
cd sattxt
pip install -r requirements.txt
pip install flash-attn --no-build-isolation  # Required for LLM2Vec
```

---

## Model Weights

Download the required weights:

| Component | Source |
|-----------|--------|
| DINOv3 ViT-L/16 | [facebookresearch/dinov3](https://github.com/facebookresearch/dinov3) → `dinov3_vitl16_pretrain_sat493m.pth` |
| Vision Head | [sattxt_vision_head.pt](https://huggingface.co/ikhado/sattxt/blob/main/sattxt_vision_head.pt) |
| Text Head | [sattxt_text_head.pt](https://huggingface.co/ikhado/sattxt/blob/main/sattxt_text_head.pt) |

Clone DINOv3 into the `thirdparty` folder:

```bash
cd thirdparty && git clone https://github.com/facebookresearch/dinov3.git
```

---
## Quick Start

```python
import sys
from pathlib import Path

import torch

sys.path.insert(0, str(Path(__file__).resolve().parent / "thirdparty" / "dinov3"))

from sattxt.model import SATtxt
from sattxt.utils import image_loader, get_preprocess, zero_shot_classify

device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Load the model: frozen encoders, pretrained projection heads
model = SATtxt(
    dinov3_weights_path="/PATH/TO/dinov3_vitl16_pretrain_sat493m-eadcf0ff.pth",
    sattxt_vision_head_pretrain_weights="/PATH/TO/sattxt_vision_head.pt",
    text_encoder_id="McGill-NLP/LLM2Vec-Meta-Llama-3-8B-Instruct-mntp",
    sattxt_text_head_pretrain_weights="/PATH/TO/sattxt_text_head.pt",
).to(device).eval()

# EuroSAT land-use categories for zero-shot classification
categories = [
    "AnnualCrop", "Forest", "HerbaceousVegetation", "Highway", "Industrial",
    "Pasture", "PermanentCrop", "Residential", "River", "SeaLake",
]

image = image_loader("./asset/Residential_167.jpg")
image_tensor = get_preprocess(is_ms=False, all_bands=False)(image).unsqueeze(0).to(device)

logits, pred_idx = zero_shot_classify(model, image_tensor, categories)
print(f"Predicted: {categories[pred_idx.item()]}")  # Output: Residential
```

<details>
<summary><b>Expected Output</b></summary>

```
Image: ./asset/Residential_167.jpg
Predicted: Residential
Confidence scores:
  AnnualCrop: -0.0075
  Forest: -0.0633
  HerbaceousVegetation: -0.0219
  Highway: 0.0283
  Industrial: 0.0887
  Pasture: 0.0178
  PermanentCrop: -0.0197
  Residential: 0.0908
  River: -0.0487
  SeaLake: -0.0441
```

</details>

Please check [demo.py](./demo.py) for more details.

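Zero-shot classification of this kind typically scores one image embedding against one text embedding per category, usually by cosine similarity, and picks the best match. A self-contained sketch of that scoring step with made-up toy embeddings (the repository's `zero_shot_classify` may differ in details):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def zero_shot_scores(image_emb, text_embs):
    """Score an image embedding against one text embedding per category.

    Returns (scores, index of the best-matching category)."""
    scores = [cosine(image_emb, t) for t in text_embs]
    return scores, max(range(len(scores)), key=scores.__getitem__)

# Toy 3-d embeddings: the image aligns best with the second category.
image_emb = [0.1, 0.9, 0.2]
text_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.1], [0.0, 0.0, 1.0]]
scores, best = zero_shot_scores(image_emb, text_embs)  # best == 1
```

In the real model the embeddings come from the projection heads on top of the frozen encoders, but the scoring logic is this simple.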
---

## Citation

```bibtex
@misc{do2026sattxt,
      title={Spectrally Distilled Representations Aligned with Instruction-Augmented LLMs for Satellite Imagery},
      url={https://arxiv.org/abs/2602.22613},
}
```

---

## Acknowledgements

We use evaluation scripts from:
[MS-CLIP](https://github.com/IBM/MS-CLIP) and [Pangaea-Bench](https://github.com/VMarsocci/pangaea-bench)

We also use LLMs (such as ChatGPT and Claude) for code refactoring.

---
<p>
We welcome contributions and issues to further improve SATtxt.
</p>