# Clean Data-Scaling Study

> **Why this experiment exists**: a previous data-scaling study ([experiments/data_scaling_study/](../data_scaling_study/)) was run on a dataset with a small but real **train↔val image leak**. This experiment redoes everything from scratch on a deduplicated dataset, with all four baseline architectures, no shortcuts.

---

## TL;DR

- **What we found**: **14 of 1,331 validation images (1.05%)** had byte-identical pixel content somewhere in the training set under different filenames. Two of those 14 val images had **two** byte-identical copies in train each, so removing the leak required dropping **16 train files**. Corresponding *masks* differed for every pair: the same image was annotated twice with slightly different labels.
- **Plus**: 41 within-train duplicate groups (41 redundant train files dropped, one canonical kept per group), and 1 within-val duplicate group (1 redundant val file dropped).
- **Final cleaned dataset**: train 5,325 → 5,268 (-57), val 1,331 → 1,330 (-1).
- **Why it matters**: small effect on absolute val numbers (upper bound of roughly 1%), zero effect on cross-architecture comparisons at the same data share, but it biases the data-scaling slope (the 100% point is exposed to ~5× more leaked images than the 25% point).
- **What we did**:
  1. Built a deduplicated `final_data_clean/`: cross-leaks dropped from train (preserving val), within-train and within-val duplicates collapsed to one canonical copy each.
  2. Recomputed the nested 25 / 50 / 100% subsets from the cleaned train list (same `seed=42`, so 25 ⊂ 50 ⊂ 100 still holds).
  3. Retrained all 4 baselines (SegNet, U-Net, SegFormer-B0, SegFormer-B5) at all 3 data shares = **12 trainings from scratch**, no bootstrapping, both best- and final-epoch checkpoints saved.
  4. Same global confusion-matrix metric code across every model, so all numbers are directly comparable.

---

## 1 · The leakage, in detail

### How we found it

A simple integrity audit on `final_data/`:

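```python
import hashlib
from pathlib import Path

train_files = sorted(Path("final_data/train/images").glob("*.jpg"))
val_files = sorted(Path("final_data/val/images").glob("*.jpg"))

def md5(path):
    return hashlib.md5(path.read_bytes()).hexdigest()

# 1. Filename overlap? -> empty set: 0 collisions, looked clean by name
name_overlap = {p.name for p in train_files} & {p.name for p in val_files}

# 2. Content overlap by md5? -> 14 unique val images (16 train counterparts)
hash_overlap = {md5(p) for p in train_files} & {md5(p) for p in val_files}
```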

So the dataset *passed* a naïve filename-only check but **failed** the content-hash check. That's how this kind of leakage typically slips past curation.
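
A minimal sketch of how such an audit could also report *which* train files collide with each val image, rather than only how many hashes overlap. The names `md5_by_file` and `find_cross_leaks` are illustrative and not taken from the repository's scripts; the sketch assumes flat directories of `.jpg` files.

```python
import hashlib
from pathlib import Path


def md5_by_file(image_dir):
    """Map each .jpg filename in image_dir to the md5 of its raw bytes."""
    return {
        p.name: hashlib.md5(p.read_bytes()).hexdigest()
        for p in sorted(Path(image_dir).glob("*.jpg"))
    }


def find_cross_leaks(train_dir, val_dir):
    """Return (shared filenames, {val file: [train files with identical bytes]})."""
    train = md5_by_file(train_dir)
    val = md5_by_file(val_dir)

    # Check 1: filename overlap.
    name_overlap = sorted(set(train) & set(val))

    # Check 2: content overlap. Group train filenames by hash first.
    train_by_hash = {}
    for name, digest in train.items():
        train_by_hash.setdefault(digest, []).append(name)

    leaks = {
        name: train_by_hash[digest]
        for name, digest in val.items()
        if digest in train_by_hash
    }
    return name_overlap, leaks
```

With this shape, `len(leaks)` is the number of affected val images and the summed list lengths give the number of train files to drop, which is how a 14 vs. 16 split can arise.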

### The 16 leaked train→val pairs

The script flagged 16 train files whose md5 matches a val image's md5. They map to 14 unique val images: two val images have *two* byte-identical train copies each (effectively triple-counted: 1 in val + 2 in train). All 16 train files are dropped:

| val image (kept) | train image (dropped) |
|---|---|
| `073102811.jpg` | `073110312.jpg` |
| `073131323.jpg` | `073156472.jpg` |
| `073134783.jpg` | `073135322.jpg` |
| `073135207.jpg` | `073211885.jpg` |
| `073140318.jpg` | `073160106.jpg` |
| `073223706.jpg` | `07313935.jpg` |
| `073237437.jpg` | `073164539.jpg` |
| `07325333.jpg` | `07350841.jpg` |
| `073255665.jpg` | `073131044.jpg` |
| `073264660.jpg` | `073248381.jpg` |
| `07331160.jpg` | `073106425.jpg` |
| `07373455.jpg` | `073108773.jpg` |
| *(plus 2 val images with 2 train copies each)* | |

Full list in `final_data_clean/dedup_manifest.json` under `category_A_cross_leak`.

### Same image, different masks

Curiously, the corresponding `_mask.png` files are **not** identical for any of the 14 pairs. The same source image was annotated twice with slightly different labels. So this is **image-content leakage, not label leakage**:

- During training, the network saw the pixel pattern under one annotation.
- During validation, it was scored against a different annotation of the same pixels.
- Net effect: the network has a head start on those 14 images (it has memorized the visual features) but the val mask is held-out, so accuracy on them is a mix of memorization and generalization.

### Cross-share exposure

Because the train-side copies of the 14 leaked val images are scattered throughout the train set, each data share saw a different number of them:

| Data share | Leaked train copies seen |
|---|---:|
| 25% | 3 / 14 |
| 50% | 7 / 14 |
| 100% | 14 / 14 |

So in the **leaked** study the 100% model had ~5× more "seen-during-training" val examples than the 25% model (14 vs. 3). This biases the data-scaling slope upward. Removing the leakage gives a cleaner read on what data volume actually buys you.
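
Exposure counts like these can be tallied by intersecting each subset list with the set of leaked train files. The sketch below uses made-up filenames arranged to reproduce the 3 / 7 / 14 counts from the table; `leak_exposure` is a hypothetical helper, not part of the repository.

```python
def leak_exposure(subsets, leaked_train_files):
    """Count, per data share, how many leaked train files the subset contains."""
    leaked = set(leaked_train_files)
    return {share: len(leaked & set(files)) for share, files in subsets.items()}


# Toy data shaped like the table above (all filenames are invented).
leaked = [f"leak_{i:02d}.jpg" for i in range(14)]
subsets = {
    25: leaked[:3] + ["a.jpg"],
    50: leaked[:7] + ["a.jpg", "b.jpg"],
    100: leaked + ["a.jpg", "b.jpg", "c.jpg"],
}

exposure = leak_exposure(subsets, leaked)
ratio = exposure[100] / exposure[25]  # 14 / 3, which the text rounds to ~5x
```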

### Within-set duplication (extra cleanup)

The audit also found:

- **Within train**: 41 hash-groups, 41 redundant files dropped (one canonical kept per group). These don't cause leakage but inflate the effective dataset size and over-weight the duplicate images during training.
- **Within val**: 1 hash-group, 1 redundant file dropped. This over-weights one image during evaluation otherwise.

We deduplicated all three categories. Full manifest in `final_data_clean/dedup_manifest.json`.

---

## 2 · Methodology

### Dataset

**Source**: `final_data/` (the original, leaky dataset, left untouched as a historical record).
**Cleaned copy**: `final_data_clean/`, built by [dedupe_dataset.py](dedupe_dataset.py).

Three categories of removal:

| Category | Action | Rationale |
|---|---|---|
| **A: cross-leak** (val image's bytes appear in train) | drop from **train** | Preserves val set integrity. Standard practice: the val set is sacred. |
| **B: within-train dupes** | keep first (alphabetical), drop rest | Keeps one canonical copy per unique image. |
| **C: within-val dupes** | keep first, drop rest | Same. |

For every dropped file the `dedup_manifest.json` records: filename, side (train / val), reason (`cross_leak` / `train_dup` / `val_dup`), and the kept alias.

After cleaning, a sanity check confirms `train_hashes ∩ val_hashes = ∅`.
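
The three removal rules can be sketched as a single pass over filename-to-hash maps. This is an illustration under assumptions, not the code in `dedupe_dataset.py`: the function name `plan_dedup`, the manifest key names (`file`, `side`, `reason`, `kept`), and the choice to check category A before category B are all hypothetical.

```python
def plan_dedup(train_hashes, val_hashes):
    """Return manifest entries for files to drop.

    train_hashes / val_hashes map filename -> content hash (e.g. md5).
    """
    manifest = []

    # Category C: within-val duplicates, keep the alphabetically first file.
    val_kept = {}
    for name in sorted(val_hashes):
        digest = val_hashes[name]
        if digest in val_kept:
            manifest.append({"file": name, "side": "val",
                             "reason": "val_dup", "kept": val_kept[digest]})
        else:
            val_kept[digest] = name

    train_kept = {}
    for name in sorted(train_hashes):
        digest = train_hashes[name]
        if digest in val_kept:
            # Category A: same bytes exist in val, so the train copy goes.
            manifest.append({"file": name, "side": "train",
                             "reason": "cross_leak", "kept": val_kept[digest]})
        elif digest in train_kept:
            # Category B: within-train duplicate of an earlier file.
            manifest.append({"file": name, "side": "train",
                             "reason": "train_dup", "kept": train_kept[digest]})
        else:
            train_kept[digest] = name
    return manifest
```

Because every train file whose hash appears in val is dropped, the surviving train and val hash sets are disjoint by construction, which is the property the sanity check verifies.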

### Architectures

All 4 baselines from [pv_panel_models/](../../pv_panel_models/), trained from scratch on the cleaned data:

| ID | Model | Source class | Notes |
|---|---|---|---|
| `segnet` | SegNet (CNN) | [`pv_panel_models/cnn_model/cnn_segmenter.py`](../../pv_panel_models/cnn_model/cnn_segmenter.py) | encoder/decoder w/ MaxPool indices for unpooling. **forward applies sigmoid**. |
| `unet` | U-Net | [`pv_panel_models/unet_model/unet_model.py`](../../pv_panel_models/unet_model/unet_model.py) | classic skip-concatenation. |
| `segformer_b0` | SegFormer mit-b0 | [`pv_panel_models/vit_model/segformer_model.py`](../../pv_panel_models/vit_model/segformer_model.py) | HuggingFace small. |
| `segformer_b5` | SegFormer mit-b5 | [`pv_panel_models/segformer_b5_model/segformer_model.py`](../../pv_panel_models/segformer_b5_model/segformer_model.py) | HuggingFace large. |
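
The note that SegNet's forward pass already applies a sigmoid matters wherever outputs are thresholded: applying a second sigmoid would push every probability above 0.5. A minimal sketch of the guard, assuming the other three models return raw logits and that a `0.5` threshold is used by default (both are assumptions; `to_binary_mask` is a hypothetical helper operating on flat Python lists rather than tensors):

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def to_binary_mask(outputs, output_is_prob, threshold=0.5):
    """Turn flat model outputs into 0/1 labels.

    output_is_prob=True  -> outputs are already probabilities (SegNet here).
    output_is_prob=False -> outputs are logits and need a sigmoid first.
    """
    probs = outputs if output_is_prob else [sigmoid(x) for x in outputs]
    return [1 if p >= threshold else 0 for p in probs]
```

For example, a SegNet probability of `0.2` is background when treated as a probability, but would be labeled foreground if it were mistakenly passed through the sigmoid again.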

### Hyperparameters (identical across models, identical to original baselines)

| Hyperparameter | Value |
|---|---|
| Image size | 128 × 128 |
| Optimizer | Adam, lr = 1e-4 |
| Scheduler | `ReduceLROnPlateau(mode='max', patience=5, factor=0.5)` on val Dice |
| Loss | `0.5 · BCE + 0.5 · Dice` (`CombinedLoss`) |
| Augmentations | `RandomHorizontalFlip(p=0.5)`, `RandomVerticalFlip(p=0.5)`, `RandomRotation(15)` |
| Epochs | 50 |
| Batch size | 16 |
| Random seed | 42 |
| Subset selection seed | 42 (same as the leaky run, so the 25/50/100% nesting structure is preserved across studies modulo the cleanup) |

The point of holding hyperparameters fixed is that the **only intentional differences** between this study and the original baselines are:
1. Training set is deduplicated.
2. Metrics use a global confusion matrix (instead of the per-batch averaging the originals did).
3. Reproducible seed.
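
For readers unfamiliar with the `0.5 · BCE + 0.5 · Dice` loss listed in the hyperparameters, here is a plain-Python sketch of the idea. It is not the repository's `CombinedLoss`: the soft-Dice form, the `smooth` term, and the `eps` clamp are assumptions, and the function works on flat lists of probabilities rather than tensors.

```python
import math


def combined_loss(probs, targets, bce_weight=0.5, dice_weight=0.5,
                  smooth=1.0, eps=1e-7):
    """Weighted sum of binary cross-entropy and soft Dice loss."""
    n = len(probs)

    # Binary cross-entropy, averaged over pixels (probabilities clamped at eps).
    bce = -sum(
        t * math.log(max(p, eps)) + (1 - t) * math.log(max(1 - p, eps))
        for p, t in zip(probs, targets)
    ) / n

    # Soft Dice coefficient on probabilities; the loss is 1 - Dice.
    intersection = sum(p * t for p, t in zip(probs, targets))
    dice = (2 * intersection + smooth) / (sum(probs) + sum(targets) + smooth)

    return bce_weight * bce + dice_weight * (1 - dice)
```

The BCE term rewards per-pixel calibration while the Dice term rewards overlap with the foreground region, which is a common motivation for combining them on imbalanced masks.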

### Metrics (standardized)

We accumulate TP / FP / FN / TN over each entire epoch and compute:

| Metric | Formula |
|---|---|
| `iou` (foreground) | `TP / (TP + FP + FN)` |
| `miou` | `mean(foreground IoU, background IoU)` |
| `dice` | `2·TP / (2·TP + FP + FN)` |
| `pixel_acc` | `(TP + TN) / total` |

This matches PASCAL/Cityscapes-style mIoU reporting. The per-batch averaging used in the original baselines differs slightly (especially when batches are imbalanced); here every model is evaluated identically.
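
To make the difference between global accumulation and per-batch averaging concrete, here is a small self-contained sketch using the formulas from the table. It is not the code in `metrics.py`; the helper names are hypothetical, and returning `0.0` for an empty denominator is an assumption.

```python
def confusion_counts(pred, target):
    """Count TP, FP, FN, TN for flat sequences of 0/1 labels."""
    tp = fp = fn = tn = 0
    for p, t in zip(pred, target):
        if p and t:
            tp += 1
        elif p:
            fp += 1
        elif t:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def safe_div(num, den):
    return num / den if den else 0.0


def metrics(tp, fp, fn, tn):
    iou_fg = safe_div(tp, tp + fp + fn)
    iou_bg = safe_div(tn, tn + fp + fn)  # background: TN plays the role of TP
    return {
        "iou": iou_fg,
        "miou": (iou_fg + iou_bg) / 2,
        "dice": safe_div(2 * tp, 2 * tp + fp + fn),
        "pixel_acc": safe_div(tp + tn, tp + fp + fn + tn),
    }


batches = [
    ([1, 1, 0, 0], [1, 1, 0, 0]),  # perfect batch
    ([1, 0, 0, 0], [0, 0, 0, 0]),  # one false positive, no foreground
]

# Global: sum the counts over all batches, then compute the metric once.
totals = [sum(c) for c in zip(*(confusion_counts(p, t) for p, t in batches))]
global_iou = metrics(*totals)["iou"]

# Per-batch: compute the metric for each batch, then average.
per_batch_iou = sum(
    metrics(*confusion_counts(p, t))["iou"] for p, t in batches
) / len(batches)
```

In this toy case the global foreground IoU is 2/3 while the per-batch average is 0.5, because the batch with no foreground contributes an IoU of 0 with the same weight as the perfect batch.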

### Subset construction

[subsets/make_subsets.py](subsets/make_subsets.py) reads `final_data_clean/train/images/`, sorts filenames, shuffles once with `random.Random(42)`, and writes:

- `subset_25.txt`: first 25% of the shuffled list
- `subset_50.txt`: first 50%
- `subset_100.txt`: full list

The script asserts `25 ⊂ 50 ⊂ 100`. The files are plaintext, one filename per line; both the trainer and the dashboard read them as the single source of truth.
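
The sort, shuffle-once, take-prefix recipe described above guarantees nesting, since each smaller subset is a prefix of the same shuffled list. A minimal sketch, assuming floor division for the 25% and 50% cut points (the rounding rule in `make_subsets.py` is not stated here; `make_nested_subsets` is a hypothetical name):

```python
import random


def make_nested_subsets(filenames, seed=42):
    """Return {25: [...], 50: [...], 100: [...]} with 25 ⊂ 50 ⊂ 100."""
    files = sorted(filenames)            # fixed starting order
    random.Random(seed).shuffle(files)   # one reproducible shuffle
    n = len(files)
    subsets = {25: files[: n // 4], 50: files[: n // 2], 100: files}
    assert set(subsets[25]) <= set(subsets[50]) <= set(subsets[100])
    return subsets
```

Sorting before shuffling makes the result independent of directory listing order. With 5,268 files the cut points fall on whole numbers (1,317 and 2,634), so the rounding rule does not affect the subset sizes for this dataset.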

### What we save per run

For every (model, share) pair:

- `checkpoints/{model}_{share}_best.pth`: state dict at the highest val Dice across all 50 epochs (plus epoch number, val metrics, model name, share, and `output_is_prob` flag for SegNet)
- `checkpoints/{model}_{share}_final.pth`: state dict at epoch 50
- `logs/{model}_{share}.json`: per-epoch JSON with `train_*` / `val_*` for `{loss, dice, iou, miou, pixel_acc}`, plus `epoch_seconds`, `train_seconds`, `val_seconds`, plus top-level wall-clock totals and ISO timestamps
- `logs/{model}_{share}.stdout.log`: captured stdout from `run_all.sh`

Logs are written incrementally, so it is safe to interrupt a run and inspect them mid-training.
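
One way to get interrupt-safe incremental logs is to rewrite the JSON file after every epoch via a temporary file and an atomic rename. The sketch below illustrates that pattern and the best-epoch selection by val Dice; it is not the code in `train.py`, and the top-level `epochs` key and the helper names are assumptions.

```python
import json
import os


def append_epoch(log_path, record):
    """Append one epoch record and rewrite the log atomically."""
    log = {"epochs": []}
    if os.path.exists(log_path):
        with open(log_path) as f:
            log = json.load(f)
    log["epochs"].append(record)

    tmp_path = f"{log_path}.tmp"
    with open(tmp_path, "w") as f:
        json.dump(log, f, indent=2)
    os.replace(tmp_path, log_path)  # readers never see a half-written file
    return log


def best_epoch(log, key="val_dice"):
    """Return the epoch record with the highest value of `key`."""
    return max(log["epochs"], key=lambda r: r[key])
```

A reader such as a dashboard can then load the file at any time and always parse a complete JSON document, even while training is still running.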

---

## 3 · Repo layout

```
experiments/clean_data_scaling_study/
├── README.md                     # this file
├── requirements.txt
├── dedupe_dataset.py             # builds final_data_clean/ + dedup_manifest.json
├── subsets/
│   ├── make_subsets.py           # builds nested subsets from the cleaned train list
│   ├── subset_25.txt             # written by make_subsets.py
│   ├── subset_50.txt
│   └── subset_100.txt
├── dataset.py                    # SubsetSolarPanelDataset
├── metrics.py                    # global confusion-matrix mIoU/IoU/Dice/PixelAcc
├── models.py                     # 4 builders: segnet/unet/segformer_b0/segformer_b5
├── train.py                      # unified trainer (--model, --share)
├── run_all.sh                    # 12 trainings sequentially
├── checkpoints/                  # populated during training (24 files: 12 best + 12 final)
├── logs/                         # populated during training (12 JSONs + 12 stdout logs)
└── dashboard/
    └── app.py                    # Streamlit dashboard
```

---

## 4 · How to run

```bash
# 0. Install (system python on this machine; adjust if you use a venv)
pip install --user --break-system-packages -r requirements.txt

# 1. Build the deduplicated dataset (writes ../../final_data_clean/)
python dedupe_dataset.py            # interactive
python dedupe_dataset.py --dry-run  # report only, no copy
python dedupe_dataset.py --force    # overwrite if final_data_clean/ exists

# 2. Build the nested subsets (writes subsets/subset_*.txt)
python subsets/make_subsets.py

# 3. Train. Each run takes 5–75 min on a single GPU depending on model + share.
PYTHON=/usr/bin/python3 ./run_all.sh                 # all 12 runs (~6 hours)
PYTHON=/usr/bin/python3 ./run_all.sh segformer_b0    # one model × 3 shares
python train.py --model unet --share 25              # single run

# 4. Dashboard
streamlit run dashboard/app.py
# → http://localhost:8501
```

---

## 5 · Reading the results

The dashboard's three tabs:

1. **Learning curves**: switchable metric (Dice / mIoU / IoU / PixelAcc / Loss), train+val toggle, one chart per architecture with the three data shares overlaid.
2. **Data share vs final**: best- vs final-epoch toggle, four charts (mIoU, Dice, IoU, PixelAcc) by data share with the four architectures as separate lines / bars, plus a per-run wall-clock and seconds-per-epoch breakdown.
3. **Inference**: drop in any image and see the 4×3 = 12-panel grid of predictions side by side. Toggle threshold, view (`mask` / `overlay` / `heatmap`), and best vs final.

---

## 6 · Caveats

- **128×128 resolution**. SegFormer architectures generally benefit from higher resolution; the comparison is fair across architectures here, but absolute SegFormer numbers would likely improve at 256+ input sizes.
- **Single seed**. Each (model, share) is one training run. Multiple seeds would tighten error bars; we did not do that to keep the GPU budget reasonable.
- **Mask inconsistency for the 14 leaked pairs**. We dropped the train copy and kept the val copy, but the val mask was annotated separately from the dropped train mask, so we *did* lose some training signal. The trade-off favors evaluation cleanliness.
- **Comparison to the leaked run**. The previous [experiments/data_scaling_study/](../data_scaling_study/) used per-batch averaged metrics and 2 architectures (U-Net + SegFormer-B0); this run uses global metrics and 4 architectures. So absolute numbers are not directly comparable; only the *trends* in the leaked run can be cross-referenced with the trends here.