# Clean Data-Scaling Study
> **Why this experiment exists**: a previous data-scaling study ([experiments/data_scaling_study/](../data_scaling_study/)) was run on a dataset with a small but real **train↔val image leak**. This experiment redoes everything from scratch on a deduplicated dataset, with all four baseline architectures and no shortcuts.
---
## TL;DR
- **What we found**: **14 of 1,331 validation images (1.05%)** had byte-identical pixel content somewhere in the training set under different filenames. Two of those 14 val images had **two** byte-identical copies in train each, so removing the leak required dropping **16 train files**. The corresponding *masks* differed for every pair: the same image was annotated twice with slightly different labels.
- **Plus**: 41 within-train duplicate groups (41 redundant train files dropped, one canonical kept per group), and 1 within-val duplicate group (1 redundant val file dropped).
- **Final cleaned dataset**: train 5,325 → 5,268 (-57), val 1,331 → 1,330 (-1).
- **Why it matters**: small effect on absolute val numbers (≤ ~1% upper bound), zero effect on cross-architecture comparisons at the same data share, but it biases the data-scaling slope (the 100% point is exposed to ~5× more leaked images than the 25% point).
- **What we did**:
1. Built a deduplicated `final_data_clean/`: cross-leaks dropped from train (preserving val), within-train and within-val duplicates collapsed to one canonical copy each.
2. Recomputed the nested 25 / 50 / 100% subsets from the cleaned train list (same `seed=42`, so 25 ⊂ 50 ⊂ 100 still holds).
3. Retrained all 4 baselines (SegNet, U-Net, SegFormer-B0, SegFormer-B5) at all 3 data shares = **12 trainings from scratch**, no bootstrapping, both best- and final-epoch checkpoints saved.
4. Same global confusion-matrix metric code across every model, so all numbers are directly comparable.
---
## 1 · The leakage, in detail
### How we found it
A simple integrity audit on `final_data/`:
```python
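import hashlib
from pathlib import Path

def md5_by_name(folder):
    """Map each *.jpg filename in `folder` to the md5 of its file bytes."""
    files = sorted(Path(folder).glob("*.jpg"))
    return {p.name: hashlib.md5(p.read_bytes()).hexdigest() for p in files}

train, val = md5_by_name("final_data/train/images"), md5_by_name("final_data/val/images")
# 1. Filename overlap? Empty set: 0 collisions, looked clean by name.
print(train.keys() & val.keys())
# 2. Content overlap by md5? 14 unique val images (16 train counterparts).
leaked = set(train.values()) & set(val.values())
print(len(leaked), sum(h in leaked for h in train.values()))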
```
So the dataset *passed* a naïve filename-only check but **failed** the content-hash check. That's how this kind of leakage typically slips past curation.
### The 16 leaked train↔val pairs
The script flagged 16 train files whose md5 matches a val image's md5. They map to 14 unique val images: two val images have *two* byte-identical train copies each (effectively triple-counted: 1 in val + 2 in train). All 16 train files are dropped:
| val image (kept) | train image (dropped) |
|---|---|
| `073102811.jpg` | `073110312.jpg` |
| `073131323.jpg` | `073156472.jpg` |
| `073134783.jpg` | `073135322.jpg` |
| `073135207.jpg` | `073211885.jpg` |
| `073140318.jpg` | `073160106.jpg` |
| `073223706.jpg` | `07313935.jpg` |
| `073237437.jpg` | `073164539.jpg` |
| `07325333.jpg` | `07350841.jpg` |
| `073255665.jpg` | `073131044.jpg` |
| `073264660.jpg` | `073248381.jpg` |
| `07331160.jpg` | `073106425.jpg` |
| `07373455.jpg` | `073108773.jpg` |
| *(plus 2 val images with 2 train copies each)* | |
Full list in `final_data_clean/dedup_manifest.json` under `category_A_cross_leak`.
### Same image, different masks
Curiously, the corresponding `_mask.png` files are **not** identical for any of the 14 pairs. The same source image was annotated twice with slightly different labels. So this is **image-content leakage, not label leakage**:
- During training, the network saw the pixel pattern under one annotation.
- During validation, it was scored against a different annotation of the same pixels.
- Net effect: the network has a head start on those 14 images (it has memorized the visual features) but the val mask is held-out, so accuracy on them is a mix of memorization and generalization.
### Cross-share exposure
Because the 14 leaked train images are mixed throughout the train set, each data share saw a different number of them:
| Data share | Leaked train copies seen |
|---|---:|
| 25% | 3 / 14 |
| 50% | 7 / 14 |
| 100% | 14 / 14 |
So in the **leaked** study the 100% model had ~5× more "seen-during-training" val examples than the 25% model. This biases the data-scaling slope upward. Removing the leakage gives a cleaner read on what data volume actually buys you.
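The exposure ratio can be checked directly from the table. The snippet below is a small arithmetic sketch that uses only the counts reported in this README; the variable names are illustrative.

```python
# Counts copied from the table above and from the TL;DR (1,331 val images).
leaked_seen = {25: 3, 50: 7, 100: 14}
val_size = 1331

# How many more leaked val images the 100% model saw than the 25% model.
ratio = leaked_seen[100] / leaked_seen[25]
print(f"100% vs 25% exposure: {ratio:.2f}x")

# Fraction of the val set whose pixels were seen in training, per share.
for share, seen in leaked_seen.items():
    print(f"{share:>3}% share: {seen / val_size:.2%} of val images seen in training")
```

The ratio is 14 / 3 ≈ 4.67, which is where the "~5×" figure comes from.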
### Within-set duplication (extra cleanup)
The audit also found:
- **Within train**: 41 hash-groups, 41 redundant files dropped (one canonical kept per group). These don't cause leakage but inflate the effective dataset size and over-weight the duplicate images during training.
- **Within val**: 1 hash-group, 1 redundant file dropped. This over-weights one image during evaluation otherwise.
We deduplicated all three categories. Full manifest in `final_data_clean/dedup_manifest.json`.
---
## 2 · Methodology
### Dataset
**Source**: `final_data/` (the original, leaky dataset, left untouched as a historical record).
**Cleaned copy**: `final_data_clean/`, built by [dedupe_dataset.py](dedupe_dataset.py).
Three categories of removal:
| Category | Side dropped | Rationale |
|---|---|---|
| **A: cross-leak** (val image's bytes appear in train) | drop from **train** | Preserves val set integrity. Standard practice: the val set is sacred. |
| **B: within-train dupes** | keep first (alphabetical), drop rest | Keeps one canonical copy per unique image. |
| **C: within-val dupes** | keep first, drop rest | Same. |
For every dropped file the `dedup_manifest.json` records: filename, side (train / val), reason (`cross_leak` / `train_dup` / `val_dup`), and the kept alias.
After cleaning, a sanity check confirms `train_hashes ∩ val_hashes = ∅`.
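The three removal rules can be sketched as a single pass over filename-to-hash maps. This is a minimal illustration, not the contents of `dedupe_dataset.py`: the function name `plan_dedup`, the entry keys (`filename`, `side`, `reason`, `kept_alias`), and the toy hashes are all assumptions made for the example.

```python
from collections import defaultdict


def group_by_hash(hashes):
    """Group filenames by content hash, alphabetically sorted within each group."""
    groups = defaultdict(list)
    for name in sorted(hashes):
        groups[hashes[name]].append(name)
    return groups


def plan_dedup(train_hashes, val_hashes):
    """Return hypothetical manifest entries for every file that should be dropped."""
    manifest = []
    val_groups = group_by_hash(val_hashes)
    train_groups = group_by_hash(train_hashes)

    # Category C: within-val duplicates, keep the first and drop the rest.
    for names in val_groups.values():
        for dup in names[1:]:
            manifest.append({"filename": dup, "side": "val",
                             "reason": "val_dup", "kept_alias": names[0]})

    for content_hash, names in train_groups.items():
        if content_hash in val_groups:
            # Category A: the bytes also appear in val, so drop every train copy.
            kept = val_groups[content_hash][0]
            dropped = names
            reason = "cross_leak"
        else:
            # Category B: within-train duplicates, keep the first and drop the rest.
            kept = names[0]
            dropped = names[1:]
            reason = "train_dup"
        for dup in dropped:
            manifest.append({"filename": dup, "side": "train",
                             "reason": reason, "kept_alias": kept})
    return manifest


# Toy example with made-up hashes.
train = {"a.jpg": "h1", "b.jpg": "h1", "c.jpg": "h2", "d.jpg": "h3", "e.jpg": "h3"}
val = {"v1.jpg": "h3", "v2.jpg": "h4", "v3.jpg": "h4"}
manifest = plan_dedup(train, val)

# Sanity check: after dropping, no content hash is shared between train and val.
dropped_train = {m["filename"] for m in manifest if m["side"] == "train"}
kept_train_hashes = {h for n, h in train.items() if n not in dropped_train}
assert not kept_train_hashes & set(val.values())
```

In the toy data, `d.jpg` and `e.jpg` are both removed as cross-leaks of `v1.jpg`, which mirrors the case above where one val image has two train copies.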
### Architectures
All 4 baselines from [pv_panel_models/](../../pv_panel_models/), trained from scratch on the cleaned data:
| ID | Model | Source class | Notes |
|---|---|---|---|
| `segnet` | SegNet (CNN) | [`pv_panel_models/cnn_model/cnn_segmenter.py`](../../pv_panel_models/cnn_model/cnn_segmenter.py) | encoder/decoder w/ MaxPool indices for unpooling. **forward applies sigmoid**. |
| `unet` | U-Net | [`pv_panel_models/unet_model/unet_model.py`](../../pv_panel_models/unet_model/unet_model.py) | classic skip-concatenation. |
| `segformer_b0` | SegFormer mit-b0 | [`pv_panel_models/vit_model/segformer_model.py`](../../pv_panel_models/vit_model/segformer_model.py) | HuggingFace small. |
| `segformer_b5` | SegFormer mit-b5 | [`pv_panel_models/segformer_b5_model/segformer_model.py`](../../pv_panel_models/segformer_b5_model/segformer_model.py) | HuggingFace large. |
### Hyperparameters (identical across models, identical to original baselines)
| Setting | Value |
|---|---|
| Image size | 128 × 128 |
| Optimizer | Adam, lr = 1e-4 |
| Scheduler | `ReduceLROnPlateau(mode='max', patience=5, factor=0.5)` on val Dice |
| Loss | `0.5 · BCE + 0.5 · Dice` (`CombinedLoss`) |
| Augmentations | `RandomHorizontalFlip(p=0.5)`, `RandomVerticalFlip(p=0.5)`, `RandomRotation(15)` |
| Epochs | 50 |
| Batch size | 16 |
| Random seed | 42 |
| Subset selection seed | 42 (same as the leaky run, so the 25/50/100% nesting structure is preserved across studies modulo the cleanup) |
The point of holding hyperparameters fixed is that the **only intentional differences** between this study and the original baselines are:
1. Training set is deduplicated.
2. Metrics use a global confusion matrix (instead of the per-batch averaging the originals did).
3. Reproducible seed.
### Metrics (standardized)
We accumulate TP / FP / FN / TN over each entire epoch and compute:
| Metric | Formula |
|---|---|
| `iou` (foreground) | `TP / (TP + FP + FN)` |
| `miou` | `mean(foreground IoU, background IoU)` |
| `dice` | `2·TP / (2·TP + FP + FN)` |
| `pixel_acc` | `(TP + TN) / total` |
This matches PASCAL/Cityscapes-style mIoU reporting. The per-batch averaging used in the original baselines slightly differs (especially when batches are imbalanced); here every model is evaluated identically.
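To illustrate why the two conventions differ, here is a minimal pure-Python sketch of the formulas in the table, applied once per batch and once over accumulated counts. It is not the code in `metrics.py`; the function names, the toy batches, and the choice to return 0.0 on an empty denominator are assumptions for the example.

```python
def confusion_counts(pred, target):
    """Count TP, FP, FN, TN for two equal-length sequences of 0/1 pixel labels."""
    tp = fp = fn = tn = 0
    for p, t in zip(pred, target):
        if p and t:
            tp += 1
        elif p:
            fp += 1
        elif t:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def metrics_from_counts(tp, fp, fn, tn):
    """Apply the formulas from the table to one set of confusion counts."""
    def ratio(num, den):
        # Assumption: an empty denominator yields 0.0 rather than an error.
        return num / den if den else 0.0

    iou_fg = ratio(tp, tp + fp + fn)
    # For the background class, TN plays the role of TP.
    iou_bg = ratio(tn, tn + fp + fn)
    return {
        "iou": iou_fg,
        "miou": (iou_fg + iou_bg) / 2,
        "dice": ratio(2 * tp, 2 * tp + fp + fn),
        "pixel_acc": ratio(tp + tn, tp + fp + fn + tn),
    }


# Two toy "batches" of (prediction, target) pixels.
batches = [
    ([1, 1, 0, 0], [1, 0, 0, 0]),
    ([1, 1, 1, 1], [1, 1, 1, 0]),
]

totals = [0, 0, 0, 0]
per_batch_iou = []
for pred, target in batches:
    counts = confusion_counts(pred, target)
    per_batch_iou.append(metrics_from_counts(*counts)["iou"])
    totals = [a + b for a, b in zip(totals, counts)]

global_metrics = metrics_from_counts(*totals)
per_batch_mean_iou = sum(per_batch_iou) / len(per_batch_iou)
print(global_metrics["iou"], per_batch_mean_iou)
```

On this toy input the global foreground IoU is 4/6 ≈ 0.667 while the per-batch mean is 0.625, so the two conventions can disagree even when batches have the same size.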
### Subset construction
[subsets/make_subsets.py](subsets/make_subsets.py) reads `final_data_clean/train/images/`, sorts filenames, shuffles once with `random.Random(42)`, and writes:
- `subset_25.txt`: first 25% of the shuffled list
- `subset_50.txt`: first 50%
- `subset_100.txt`: full list
Asserts `25 ⊂ 50 ⊂ 100`. Plaintext, one filename per line; both the trainer and the dashboard read these as the single source of truth.
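The sort, shuffle, and prefix steps described above can be sketched as follows. This is an illustration rather than the contents of `make_subsets.py`: the function name, the floor rounding of subset sizes, and the example filenames are assumptions.

```python
import random


def make_nested_subsets(filenames, seed=42, shares=(25, 50, 100)):
    """Sort, shuffle once with a fixed seed, and take prefixes of the shuffled list."""
    ordered = sorted(filenames)            # sorting makes the result independent of listing order
    random.Random(seed).shuffle(ordered)   # a private RNG leaves the global random state untouched
    # Assumption: subset sizes are rounded down.
    return {share: ordered[: len(ordered) * share // 100] for share in shares}


# Hypothetical filenames for the demo.
names = [f"img_{i:03d}.jpg" for i in range(20)]
subsets = make_nested_subsets(names)

# Prefixes of one shuffled list are nested by construction.
assert set(subsets[25]) <= set(subsets[50]) <= set(subsets[100])
print({share: len(files) for share, files in subsets.items()})
```

Because every subset is a prefix of the same shuffled list, the nesting holds without any extra bookkeeping, and re-running with the same seed and file list reproduces the same subsets.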
### What we save per run
For every (model, share) pair:
- `checkpoints/{model}_{share}_best.pth`: state dict at the highest val Dice across all 50 epochs (plus epoch number, val metrics, model name, share, and `output_is_prob` flag for SegNet)
- `checkpoints/{model}_{share}_final.pth`: state dict at epoch 50
- `logs/{model}_{share}.json`: per-epoch JSON with `train_*` / `val_*` for `{loss, dice, iou, miou, pixel_acc}`, plus `epoch_seconds`, `train_seconds`, `val_seconds`, plus top-level wall-clock totals and ISO timestamps
- `logs/{model}_{share}.stdout.log`: captured stdout from `run_all.sh`
Logs are written incrementally, so it is safe to interrupt and inspect them mid-training.
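One way to get interrupt-safe incremental logs is to rewrite the JSON file after every epoch through a temporary file and an atomic rename. The sketch below shows that pattern only; it is not taken from `train.py`, and the `append_epoch` name, the top-level `epochs` key, and the record fields are assumptions.

```python
import json
import os
import tempfile


def append_epoch(log_path, record):
    """Append one epoch record and rewrite the log so readers never see a partial file."""
    log = {"epochs": []}
    if os.path.exists(log_path):
        with open(log_path) as f:
            log = json.load(f)
    log["epochs"].append(record)

    # Write to a sibling temp file, then swap it in with an atomic rename.
    tmp_path = log_path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(log, f, indent=2)
    os.replace(tmp_path, log_path)
    return log


# Demo in a throwaway directory with made-up metric values.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "unet_25.json")
    append_epoch(demo_path, {"epoch": 1, "val_dice": 0.5})
    demo_log = append_epoch(demo_path, {"epoch": 2, "val_dice": 0.6})
    with open(demo_path) as f:
        demo_on_disk = json.load(f)

print(len(demo_on_disk["epochs"]))
```

With this pattern, a run killed between epochs leaves a valid JSON file containing every completed epoch.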
---
## 3 · Repo layout
```
experiments/clean_data_scaling_study/
├── README.md                  # this file
├── requirements.txt
├── dedupe_dataset.py          # builds final_data_clean/ + dedup_manifest.json
├── subsets/
│   ├── make_subsets.py        # builds nested subsets from the cleaned train list
│   ├── subset_25.txt          # written by make_subsets.py
│   ├── subset_50.txt
│   └── subset_100.txt
├── dataset.py                 # SubsetSolarPanelDataset
├── metrics.py                 # global confusion-matrix mIoU/IoU/Dice/PixelAcc
├── models.py                  # 4 builders: segnet/unet/segformer_b0/segformer_b5
├── train.py                   # unified trainer (--model, --share)
├── run_all.sh                 # 12 trainings sequentially
├── checkpoints/               # populated during training (24 files: 12 best + 12 final)
├── logs/                      # populated during training (12 JSONs + 12 stdout logs)
└── dashboard/
    └── app.py                 # Streamlit dashboard
```
---
## 4 · How to run
```bash
# 0. Install (system python on this machine; adjust if you use a venv)
pip install --user --break-system-packages -r requirements.txt
# 1. Build the deduplicated dataset (writes ../../final_data_clean/)
python dedupe_dataset.py # interactive
python dedupe_dataset.py --dry-run # report only, no copy
python dedupe_dataset.py --force # overwrite if final_data_clean/ exists
# 2. Build the nested subsets (writes subsets/subset_*.txt)
python subsets/make_subsets.py
# 3. Train. Each run takes 5–75 min on a single GPU depending on model + share.
PYTHON=/usr/bin/python3 ./run_all.sh # all 12 runs (~6 hours)
PYTHON=/usr/bin/python3 ./run_all.sh segformer_b0 # one model × 3 shares
python train.py --model unet --share 25 # single run
# 4. Dashboard
streamlit run dashboard/app.py
# → http://localhost:8501
```
---
## 5 · Reading the results
The dashboard's three tabs:
1. **Learning curves**: switchable metric (Dice / mIoU / IoU / PixelAcc / Loss), train+val toggle, one chart per architecture with the three data shares overlaid.
2. **Data share vs final**: best- vs final-epoch toggle, four charts (mIoU, Dice, IoU, PixelAcc) by data share with the four architectures as separate lines / bars. Plus per-run wall-clock and seconds-per-epoch breakdown.
3. **Inference**: drop in any image, see the 4×3 = 12-panel grid of predictions, side-by-side. Toggle threshold, view (`mask` / `overlay` / `heatmap`), and best vs final.
---
## 6 · Caveats
- **128×128 resolution**. SegFormer architectures generally benefit from higher resolution; the comparison is fair across architectures here, but absolute SegFormer numbers would likely improve at 256+ input sizes.
- **Single seed**. Each (model, share) is one training run. Multiple seeds would tighten error bars; we did not do that to keep the GPU budget reasonable.
- **Mask inconsistency for the 14 leaked pairs**. We dropped the train copy and kept the val copy, but the val mask was annotated separately from the dropped train mask, so we *did* lose some training signal. The trade-off favors evaluation cleanliness.
- **Comparison to the leaked run**. The previous [experiments/data_scaling_study/](../data_scaling_study/) used per-batch averaged metrics and 2 architectures (U-Net + SegFormer-B0); this run uses global metrics and 4 architectures. So absolute numbers are not directly comparable; only the *trends* in the leaked run can be cross-referenced with the trends here.