Update README.md
README.md
CHANGED
|
@@ -12,4 +12,162 @@ tags:
|
|
| 12 |
- earth-observation
|
| 13 |
- satellite-imagery
|
| 14 |
- remote-sensing
|
| 15 |
-
---
| 12 |
- earth-observation
|
| 13 |
- satellite-imagery
|
| 14 |
- remote-sensing
|
| 15 |
+
---
|
| 16 |
+
|
| 17 |
+
|
| 18 |
+
# SATtxt - Spectrally Distilled Representations Aligned with Instruction-Augmented LLMs for Satellite Imagery
|
| 19 |
+
<p align="center">
|
| 20 |
+
<img src="https://i.imgur.com/waxVImv.png" alt="SATtxt">
|
| 21 |
+
</p>
|
| 22 |
+
|
| 23 |
+
<p>
|
| 24 |
+
<b> Minh Kha Do, Wei Xiang, Kang Han, Di Wu, Khoa Phan, Yi-Ping Phoebe Chen, Gaowen Liu, Ramana Rao Kompella </b>
|
| 25 |
+
</p>
|
| 26 |
+
<p>
|
| 27 |
+
La Trobe University, Cisco Research
|
| 28 |
+
</p>
|
| 29 |
+
|
| 30 |
+
<p>
|
| 31 |
+
<a href="https://arxiv.org/abs/2602.22613"><img src="https://img.shields.io/badge/arXiv-2602.22613-b31b1b.svg" alt="arXiv"></a>
|
| 32 |
+
<a href="https://huggingface.co/ikhado/sattxt"><img src="https://img.shields.io/badge/🤗%20Hugging%20Face-Model-yellow" alt="Hugging Face"></a>
|
| 33 |
+
<a href="https://github.com/ikhado/sattxt"><img src="https://img.shields.io/badge/Project-Page-green" alt="Project Page"></a>
|
| 34 |
+
</p>
|
| 35 |
+
|
| 36 |
+
|
| 37 |
+
---
|
| 38 |
+
|
| 39 |
+
## 📰 News
|
| 40 |
+
|
| 41 |
+
| Date | Update |
|
| 42 |
+
|------|--------|
|
| 43 |
+
| **Mar 9, 2026** | We have released model code and weights. |
|
| 44 |
+
| **Feb 23, 2026** | SATtxt has been accepted to **CVPR 2026**. We thank the reviewers and area chairs. |
|
| 45 |
+
|
| 46 |
+
---
|
| 47 |
+
|
| 48 |
+
## Overview
|
| 49 |
+
|
| 50 |
+
SATtxt is a vision-language foundation model for satellite imagery. We train **only the projection heads**, keeping both encoders frozen.
|
| 51 |
+
|
| 52 |
+
<table>
|
| 53 |
+
<tr><th>Component</th><th>Backbone</th><th>Training Status</th></tr>
|
| 54 |
+
<tr><td>Vision Encoder</td><td><a href="https://github.com/facebookresearch/dinov3">DINOv3</a> ViT-L/16</td><td>Frozen</td></tr>
|
| 55 |
+
<tr><td>Text Encoder</td><td><a href="https://github.com/McGill-NLP/llm2vec">LLM2Vec</a> Llama-3-8B</td><td>Frozen</td></tr>
|
| 56 |
+
<tr><td>Vision Head</td><td>Transformer Projection</td><td>Trained</td></tr>
|
| 57 |
+
<tr><td>Text Head</td><td>Linear Projection</td><td>Trained</td></tr>
|
| 58 |
+
</table>
|
| 59 |
+
|
| 60 |
+
---
|
| 61 |
+
|
| 62 |
+
## Installation
|
| 63 |
+
|
| 64 |
+
```bash
|
| 65 |
+
git clone https://github.com/ikhado/sattxt.git
|
| 66 |
+
cd sattxt
|
| 67 |
+
pip install -r requirements.txt
|
| 68 |
+
pip install flash-attn --no-build-isolation # Required for LLM2Vec
|
| 69 |
+
```
|
| 70 |
+
|
| 71 |
+
---
|
| 72 |
+
|
| 73 |
+
## Model Weights
|
| 74 |
+
|
| 75 |
+
Download the required weights:
|
| 76 |
+
|
| 77 |
+
| Component | Source |
|
| 78 |
+
|-----------|--------|
|
| 79 |
+
| DINOv3 ViT-L/16 | [facebookresearch/dinov3](https://github.com/facebookresearch/dinov3) → `dinov3_vitl16_pretrain_sat493m.pth` |
|
| 80 |
+
| LLM2Vec | [McGill-NLP/LLM2Vec-Meta-Llama-3-8B-Instruct-mntp-unsup-simcse](https://huggingface.co/McGill-NLP/LLM2Vec-Meta-Llama-3-8B-Instruct-mntp-unsup-simcse) |
|
| 81 |
+
| Vision Head | [sattxt_vision_head.pt](https://huggingface.co/ikhado/sattxt/blob/main/sattxt_vision_head.pt) |
|
| 82 |
+
| Text Head | [sattxt_text_head.pt](https://huggingface.co/ikhado/sattxt/blob/main/sattxt_text_head.pt) |
|
| 83 |
+
|
| 84 |
+
Clone DINOv3 into the `thirdparty` folder:
|
| 85 |
+
|
| 86 |
+
```bash
|
| 87 |
+
cd thirdparty && git clone https://github.com/facebookresearch/dinov3.git
|
| 88 |
+
```
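Before loading the model, it can help to confirm that everything listed above is in place. The following is a minimal sketch, not part of the SATtxt codebase: the `weights/` directory and the `find_missing` helper are hypothetical names chosen for illustration, so adjust the paths to wherever you saved the files.

```python
from pathlib import Path


def find_missing(paths):
    """Return the entries of `paths` that do not exist on disk."""
    return [str(p) for p in map(Path, paths) if not p.exists()]


# Hypothetical layout: the `weights/` directory is an assumption, not a
# location mandated by the repository.
required = [
    "thirdparty/dinov3",                           # cloned DINOv3 repository
    "weights/dinov3_vitl16_pretrain_sat493m.pth",  # DINOv3 ViT-L/16 weights
    "weights/sattxt_vision_head.pt",               # SATtxt vision head
    "weights/sattxt_text_head.pt",                 # SATtxt text head
]

missing = find_missing(required)
if missing:
    print("Missing files or folders:")
    for path in missing:
        print(f"  {path}")
else:
    print("All required files found.")
```

The LLM2Vec text encoder is referenced by its Hugging Face model ID rather than a local path, so it is not included in this check.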
|
| 89 |
+
|
| 90 |
+
---
|
| 91 |
+
|
| 92 |
+
## Quick Start
|
| 93 |
+
|
| 94 |
+
```python
|
| 95 |
+
import sys
|
| 96 |
+
from pathlib import Path
|
| 97 |
+
|
| 98 |
+
sys.path.insert(0, str(Path(__file__).resolve().parent / "thirdparty" / "dinov3"))
|
| 99 |
+
|
| 100 |
+
from sattxt.model import SATtxt
|
| 101 |
+
from sattxt.utils import image_loader, get_preprocess, zero_shot_classify
|
| 102 |
+
|
| 103 |
+
# Load model
|
| 104 |
+
model = SATtxt(
|
| 105 |
+
dinov3_weights_path='PATH/TO/dinov3_vitl16_pretrain_sat493m.pth',
|
| 106 |
+
sattxt_vision_head_pretrain_weights='PATH/TO/sattxt_vision_head.pt',
|
| 107 |
+
text_encoder_id='McGill-NLP/LLM2Vec-Meta-Llama-3-8B-Instruct-mntp',
|
| 108 |
+
sattxt_text_head_pretrain_weights='PATH/TO/sattxt_text_head.pt'
|
| 109 |
+
).to('cuda').eval()
|
| 110 |
+
|
| 111 |
+
# Zero-shot classification
|
| 112 |
+
categories = ["AnnualCrop", "Forest", "HerbaceousVegetation", "Highway",
|
| 113 |
+
"Industrial", "Pasture", "PermanentCrop", "Residential", "River", "SeaLake"]
|
| 114 |
+
|
| 115 |
+
image = image_loader('./asset/Residential_167.jpg')
|
| 116 |
+
image_tensor = get_preprocess(is_ms=False, all_bands=False)(image).unsqueeze(0).to('cuda')
|
| 117 |
+
|
| 118 |
+
logits, pred_idx = zero_shot_classify(model, image_tensor, categories)
|
| 119 |
+
print(f"Predicted: {categories[pred_idx.item()]}") # Output: Residential
|
| 120 |
+
```
|
| 121 |
+
|
| 122 |
+
<details>
|
| 123 |
+
<summary><b>Expected Output</b></summary>
|
| 124 |
+
|
| 125 |
+
```
|
| 126 |
+
Image: ./asset/Residential_167.jpg
|
| 127 |
+
Predicted: Residential
|
| 128 |
+
Confidence scores:
|
| 129 |
+
AnnualCrop: -0.0075
|
| 130 |
+
Forest: -0.0633
|
| 131 |
+
HerbaceousVegetation: -0.0219
|
| 132 |
+
Highway: 0.0283
|
| 133 |
+
Industrial: 0.0887
|
| 134 |
+
Pasture: 0.0178
|
| 135 |
+
PermanentCrop: -0.0197
|
| 136 |
+
Residential: 0.0908
|
| 137 |
+
River: -0.0487
|
| 138 |
+
SeaLake: -0.0441
|
| 139 |
+
```
|
| 140 |
+
|
| 141 |
+
</details>
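The scores in the expected output are small and can be negative, so they read more like raw similarity scores than probabilities. As a hedged illustration, the sketch below ranks the categories by score and converts the scores to a probability distribution with a temperature-scaled softmax. The `top_k` and `softmax` helpers and the temperature value of `0.01` are assumptions made for this example; they are not taken from the SATtxt code or paper.

```python
import math

# Scores copied from the expected output above.
scores = {
    "AnnualCrop": -0.0075,
    "Forest": -0.0633,
    "HerbaceousVegetation": -0.0219,
    "Highway": 0.0283,
    "Industrial": 0.0887,
    "Pasture": 0.0178,
    "PermanentCrop": -0.0197,
    "Residential": 0.0908,
    "River": -0.0487,
    "SeaLake": -0.0441,
}


def top_k(scores, k=3):
    """Return the k (category, score) pairs with the highest scores."""
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]


def softmax(scores, temperature=0.01):
    """Temperature-scaled softmax; the temperature is an assumed value."""
    scaled = {name: s / temperature for name, s in scores.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {name: math.exp(v - peak) for name, v in scaled.items()}
    total = sum(exps.values())
    return {name: e / total for name, e in exps.items()}


probs = softmax(scores)
for name, score in top_k(scores):
    print(f"{name}: score={score:+.4f}, prob={probs[name]:.3f}")
```

Without temperature scaling, a softmax over scores this close together would be nearly uniform, which is why a small temperature is used here. The ranking itself does not depend on the temperature.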
|
| 142 |
+
|
| 143 |
+
---
|
| 144 |
+
|
| 145 |
+
## Citation
|
| 146 |
+
|
| 147 |
+
```bibtex
|
| 148 |
+
@misc{do2026sattxt,
|
| 149 |
+
title={Spectrally Distilled Representations Aligned with Instruction-Augmented LLMs for Satellite Imagery},
|
| 150 |
+
author={Minh Kha Do and Wei Xiang and Kang Han and Di Wu and Khoa Phan and Yi-Ping Phoebe Chen and Gaowen Liu and Ramana Rao Kompella},
|
| 151 |
+
year={2026},
|
| 152 |
+
eprint={2602.22613},
|
| 153 |
+
archivePrefix={arXiv},
|
| 154 |
+
primaryClass={cs.CV},
|
| 155 |
+
url={https://arxiv.org/abs/2602.22613},
|
| 156 |
+
}
|
| 157 |
+
```
|
| 158 |
+
|
| 159 |
+
---
|
| 160 |
+
|
| 161 |
+
## Acknowledgements
|
| 162 |
+
We pretrained the model with:
|
| 163 |
+
[Lightning-Hydra-Template](https://github.com/ashleve/lightning-hydra-template)
|
| 164 |
+
|
| 165 |
+
We use evaluation scripts from:
|
| 166 |
+
[MS-CLIP](https://github.com/IBM/MS-CLIP) and [Pangaea-Bench](https://github.com/VMarsocci/pangaea-bench)
|
| 167 |
+
|
| 168 |
+
---
|
| 169 |
+
<p>
|
| 170 |
+
We welcome contributions and issues to further improve SATtxt.
|
| 171 |
+
</p>
|
| 172 |
+
|
| 173 |
+
|