# Methane Benchmark Dataset (PINEAPPLE + Clean)

This folder contains the **Methane Benchmark Dataset** in two variants:
- **balanced**: a balanced mix of methane and non-methane patches
- **clean**: **no-methane only** (negative patches)

The dataset combines multiple modalities (HSI and RGB), **simulated Sentinel-2 BOA reflectance (S2 BOA refl)** derived from HSI, **TerraMind TiM-generated products** (including **S2L2A** and **LULC**), text captions, and labels produced by different sources (LLM, human, and TiM/TerraMind). The clean split additionally contains **Intuition-1 simulated data**.

---

## 1. Dataset overview

### 1.1 balanced (PINEAPPLE: methane + non-methane)
- **178 patches**, **27 flights**
- **HSI**: AVIRIS-NG
- **RGB**: RGB renderings / visualizations aligned with the patches
- **Simulated Sentinel-2 (BOA reflectance)**: derived from HSI and stored under `simulated_s2_boarefl_balanced/`
- **TerraMind TiM products** (derived from simulated S2 BOA reflectance; stored under `tim_generation_balanced/`):
  - **S2L2A** (TiM-generated)
  - **LULC** (TiM-generated, pixel-level)
  - Plots and auxiliary outputs
- **Annotations**
  - Urban vs. non-urban (image-level): **LLM**
  - Urban vs. non-urban (image-level): **human**
  - Textual description: **LLM**

### 1.2 clean (no-methane only)
- **261 patches** (neighboring patches; center patch excluded), **20 flights**
- **HSI**: AVIRIS-NG
- **RGB**: RGB renderings / visualizations aligned with the patches
- **Simulated Sentinel-2 (BOA reflectance)**: derived from HSI and stored under `simulated_s2_boareflclean/` (folder name preserved as exported)
- **TerraMind TiM products** (derived from simulated S2 BOA reflectance; stored under `tim_generation_clean/`):
  - **S2L2A** (TiM-generated)
  - **LULC** (TiM-generated, pixel-level)
  - Plots and auxiliary outputs
- **Intuition-1 simulated data (clean only)**: additional simulated modality for extended ablations and robustness checks (see notes in Section 2)
- **Annotations**
  - Urban vs. non-urban (image-level): **LLM**
  - Urban vs. non-urban (image-level): **human**
  - Textual description: **LLM**

---

## 2. Folder structure

Top-level directories:
- `aviris_hsi_balanced/`  
  AVIRIS-NG hyperspectral patches for the balanced split.
- `aviris_hsi_clean/`  
  AVIRIS-NG hyperspectral patches for the clean (no-methane) split.

- `rgb_balanced/`  
  RGB images for the balanced split (aligned to patches).
- `rgb_clean/`  
  RGB images for the clean split (aligned to patches).

- `captions_balanced/`  
  LLM-generated text captions/descriptions for the balanced split.
- `captions_clean/`  
  LLM-generated text captions/descriptions for the clean split.

- `simulated_s2_boarefl_balanced/`  
  Simulated Sentinel-2 BOA reflectance images for the balanced split (simulated from HSI).
- `simulated_s2_boareflclean/`  
  Simulated Sentinel-2 BOA reflectance images for the clean split (simulated from HSI; folder name preserved as exported).

- `tim_generation_balanced/`  
  TerraMind TiM outputs generated from simulated S2 BOA reflectance (balanced split).  
  Contains (at least): `s2l2a/`, `lulc/`, `classes/`, `plots/`, and auxiliary files (e.g., a legend script).
- `tim_generation_clean/`  
  TerraMind TiM outputs generated from simulated S2 BOA reflectance (clean split).  
  Contains the same product types as the balanced split.

- `I1_simulation/`  
  Additional Intuition-1 simulated data aligned with clean split patches.

Other files:
- `truth_false_labels.xlsx`  
  A compact label file (yes/no style) aggregating selected annotations (LLM, human, TiM classes), depending on your export.
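
To confirm that a local copy matches the layout listed above, a minimal sketch along these lines may help. The folder names are copied from this README; the helper name `missing_dirs` and the `EXPECTED_DIRS` table are illustrative and not part of the dataset.

```python
from pathlib import Path

# Folder names copied from this README. Note that the clean S2 folder has
# no underscore before "clean" (name preserved as exported).
EXPECTED_DIRS = {
    "balanced": [
        "aviris_hsi_balanced",
        "rgb_balanced",
        "captions_balanced",
        "simulated_s2_boarefl_balanced",
        "tim_generation_balanced",
    ],
    "clean": [
        "aviris_hsi_clean",
        "rgb_clean",
        "captions_clean",
        "simulated_s2_boareflclean",
        "tim_generation_clean",
        "I1_simulation",
    ],
}


def missing_dirs(root, split):
    """Return the expected top-level folders for `split` that are absent under `root`."""
    root = Path(root)
    return [name for name in EXPECTED_DIRS[split] if not (root / name).is_dir()]
```

Calling `missing_dirs("path/to/dataset", "clean")` returns an empty list when every expected clean-split folder is present.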


---

## 3. Labels and annotation sources

The dataset provides yes/no labels and/or categorical classes from the following sources:

### 3.1 LLM labels (image-level)
- Urban vs. non-urban classification at image/patch level
- Stored in the exported label file and/or per-sample metadata (depending on your pipeline)

### 3.2 Human labels (image-level)
- Urban vs. non-urban classification at image/patch level
- Available for at least the clean split (and optionally balanced, depending on the export)
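
Since both the LLM and human sources label the same urban vs. non-urban task, one simple check is their agreement rate. The sketch below assumes the labels have already been read into two equal-length lists of yes/no strings in matching patch order; the function name `agreement_rate` is hypothetical, and how the labels are extracted from the label file depends on your export.

```python
def agreement_rate(llm_labels, human_labels):
    """Fraction of patches where the LLM and human urban/non-urban labels match."""
    if len(llm_labels) != len(human_labels):
        raise ValueError("label lists must have the same length")
    if not llm_labels:
        raise ValueError("label lists must not be empty")
    # Compare case-insensitively and ignore surrounding whitespace.
    matches = sum(
        a.strip().lower() == b.strip().lower()
        for a, b in zip(llm_labels, human_labels)
    )
    return matches / len(llm_labels)
```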

### 3.3 TerraMind TiM products (pixel-level and per-image products)
- **S2L2A** generated by TerraMind TiM from simulated S2 BOA reflectance
- **LULC** (pixel-level) generated by TerraMind TiM from simulated S2 BOA reflectance
- Stored under `tim_generation_*` (subfolders `s2l2a/`, `lulc/`, and `classes/`)

---

## 4. Modality relationships

- **HSI (AVIRIS-NG)** is the primary observation modality.
- **RGB** is a visualization or derived view aligned to the same patch footprint.
- **Simulated Sentinel-2 BOA reflectance (S2 BOA refl)** is simulated from HSI and used as input to TiM/TerraMind.
- **S2L2A** is not stored as a standalone simulation at the root; it is produced by **TerraMind TiM** from the simulated S2 BOA reflectance and stored inside `tim_generation_*`.
- **LULC** is produced by **TerraMind TiM** (pixel-level) and stored inside `tim_generation_*`.
- **Captions** provide text descriptions for multimodal experiments (retrieval, captioning, instruction-following, VLM/LLM alignment).
- **Intuition-1 simulated data** (clean only) provides an extra modality for robustness and domain-shift experiments.
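
The derivation chain described above can be written down as a small lookup, which is handy when tracing which inputs a product depends on. This is only a sketch: the short keys (`hsi`, `s2_boa_refl`, `s2l2a`, `lulc`) are illustrative names, and RGB, captions, and Intuition-1 are left out because this README does not state exactly what they are derived from.

```python
# Each modality maps to the modality it is derived from (None = primary).
# Keys are illustrative short names, not identifiers used by the dataset.
DERIVED_FROM = {
    "hsi": None,              # AVIRIS-NG, primary observation
    "s2_boa_refl": "hsi",     # simulated from HSI
    "s2l2a": "s2_boa_refl",   # TerraMind TiM output
    "lulc": "s2_boa_refl",    # TerraMind TiM output (pixel-level)
}


def lineage(modality):
    """Return the chain from `modality` back to the primary observation."""
    chain = [modality]
    while DERIVED_FROM[chain[-1]] is not None:
        chain.append(DERIVED_FROM[chain[-1]])
    return chain
```

For example, `lineage("lulc")` yields `["lulc", "s2_boa_refl", "hsi"]`.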

---

## 5. Warning

Before using the dataset, check the dataset class for any changes to the file naming convention.
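
One quick way to spot naming drift is to compare file names across two modality folders. The sketch below assumes that matching patches share the same file stem (name without extension) in each folder; that is an assumption, not a documented convention, so adapt the comparison to the naming your export actually uses. The helper names are hypothetical.

```python
from pathlib import Path


def stems(folder):
    """Set of file stems (names without extension) directly inside `folder`."""
    return {p.stem for p in Path(folder).iterdir() if p.is_file()}


def unmatched_stems(folder_a, folder_b):
    """Stems present in only one of the two folders, as (only_in_a, only_in_b)."""
    a, b = stems(folder_a), stems(folder_b)
    return sorted(a - b), sorted(b - a)
```

For instance, `unmatched_stems("aviris_hsi_clean", "rgb_clean")` returns two empty lists when every HSI patch has an RGB counterpart with the same stem.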