# Clean Data-Scaling Study
> **Why this experiment exists**: a previous data-scaling study ([experiments/data_scaling_study/](../data_scaling_study/)) was run on a dataset with a small but real **train↔val image leak**. This experiment redoes everything from scratch on a deduplicated dataset, with all four baseline architectures, no shortcuts.
---
## TL;DR
- **What we found**: **14 of 1,331 validation images (1.05%)** had byte-identical pixel content somewhere in the training set under different filenames. Two of those 14 val images had **two** byte-identical copies in train each, so removing the leak required dropping **16 train files**. The corresponding *masks* differed for every pair: the same image was annotated twice with slightly different labels.
- **Plus**: 41 within-train duplicate groups (41 redundant train files dropped, one canonical kept per group), and 1 within-val duplicate group (1 redundant val file dropped).
- **Final cleaned dataset**: train 5,325 → 5,268 (-57), val 1,331 → 1,330 (-1).
- **Why it matters**: small effect on absolute val numbers (upper bound of roughly 1%), zero effect on cross-architecture comparisons at the same data share, but it biases the data-scaling slope (the 100% point is exposed to ~5× more leaked images than the 25% point).
- **What we did**:
1. Built a deduplicated `final_data_clean/`: cross-leaks dropped from train (preserving val), within-train and within-val duplicates collapsed to one canonical copy each.
2. Recomputed the nested 25 / 50 / 100% subsets from the cleaned train list (same `seed=42`, so 25 ⊂ 50 ⊂ 100 still holds).
3. Retrained all 4 baselines (SegNet, U-Net, SegFormer-B0, SegFormer-B5) at all 3 data shares = **12 trainings from scratch**, no bootstrapping, both best- and final-epoch checkpoints saved.
4. Same global confusion-matrix metric code across every model, so all numbers are directly comparable.
---
## 1 · The leakage, in detail
### How we found it
A simple integrity audit on `final_data/`:
```python
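import hashlib
from pathlib import Path

train_files = sorted(Path("final_data/train/images").glob("*.jpg"))
val_files = sorted(Path("final_data/val/images").glob("*.jpg"))

def md5(path):
    return hashlib.md5(path.read_bytes()).hexdigest()  # hash of the raw file bytes

# 1. Filename overlap? The audit found none: 0 collisions, looked clean by name.
print({p.name for p in train_files} & {p.name for p in val_files})
# 2. Content overlap by md5? The audit found 14 unique val images (16 train counterparts).
print(len({md5(p) for p in train_files} & {md5(p) for p in val_files}))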
```
So the dataset *passed* a naïve filename-only check but **failed** the content-hash check. That's how this kind of leakage typically slips past curation.
### The 16 leaked train→val pairs
The script flagged 16 train files whose md5 matches a val image's md5. They map to 14 unique val images, because two val images have *two* byte-identical train copies each (effectively triple-counted: 1 in val + 2 in train). All 16 train files are dropped:
| val image (kept) | train image (dropped) |
|---|---|
| `073102811.jpg` | `073110312.jpg` |
| `073131323.jpg` | `073156472.jpg` |
| `073134783.jpg` | `073135322.jpg` |
| `073135207.jpg` | `073211885.jpg` |
| `073140318.jpg` | `073160106.jpg` |
| `073223706.jpg` | `07313935.jpg` |
| `073237437.jpg` | `073164539.jpg` |
| `07325333.jpg` | `07350841.jpg` |
| `073255665.jpg` | `073131044.jpg` |
| `073264660.jpg` | `073248381.jpg` |
| `07331160.jpg` | `073106425.jpg` |
| `07373455.jpg` | `073108773.jpg` |
| *(plus 2 val images with 2 train copies each)* | |
Full list in `final_data_clean/dedup_manifest.json` under `category_A_cross_leak`.
### Same image, different masks
Curiously, the corresponding `_mask.png` files are **not** identical for any of the 14 pairs. The same source image was annotated twice with slightly different labels. So this is **image-content leakage, not label leakage**:
- During training, the network saw the pixel pattern under one annotation.
- During validation, it was scored against a different annotation of the same pixels.
- Net effect: the network has a head start on those 14 images (it has memorized the visual features) but the val mask is held-out, so accuracy on them is a mix of memorization and generalization.
### Cross-share exposure
Because the 14 leaked train images are mixed throughout the train set, each data share saw a different number of them:
| Data share | Leaked train copies seen |
|---|---:|
| 25% | 3 / 14 |
| 50% | 7 / 14 |
| 100% | 14 / 14 |
So in the **leaked** study the 100% model had ~5× more "seen-during-training" val examples than the 25% model (14 vs. 3). This biases the data-scaling slope upward. Removing the leakage gives a cleaner read on what data volume actually buys you.
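Per-share exposure of this kind comes down to set intersection between each subset list and the leaked train filenames. The following is a minimal sketch on made-up filenames, not the audit script itself; `leaked_exposure` is a hypothetical helper, and in practice the inputs would be the lines of each subset file and the leaked train filenames recorded in the manifest.

```python
def leaked_exposure(subset_names, leaked_train_names):
    """Count how many leaked train files are present in one subset."""
    return len(set(subset_names) & set(leaked_train_names))


# Toy, made-up filenames (not the real dataset); the subsets are nested prefixes.
subset_25 = ["a.jpg", "b.jpg"]
subset_50 = subset_25 + ["c.jpg", "d.jpg"]
subset_100 = subset_50 + ["e.jpg", "f.jpg", "g.jpg", "h.jpg"]
leaked = {"b.jpg", "d.jpg", "g.jpg", "h.jpg"}

exposure = {
    share: leaked_exposure(names, leaked)
    for share, names in [(25, subset_25), (50, subset_50), (100, subset_100)]
}
print(exposure)  # {25: 1, 50: 2, 100: 4}
```

Because the subsets are nested, the count can only grow with the data share, which is why the leak pushes the scaling slope in one direction.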
### Within-set duplication (extra cleanup)
The audit also found:
- **Within train**: 41 hash-groups, 41 redundant files dropped (one canonical kept per group). These don't cause leakage but inflate the effective dataset size and over-weight the duplicate images during training.
- **Within val**: 1 hash-group, 1 redundant file dropped. Left in place, it would over-weight one image during evaluation.
We deduplicated all three categories. Full manifest in `final_data_clean/dedup_manifest.json`.
---
## 2 · Methodology
### Dataset
**Source**: `final_data/` (the original, leaky dataset, left untouched as a historical record).
**Cleaned copy**: `final_data_clean/`, built by [dedupe_dataset.py](dedupe_dataset.py).
Three categories of removal:
| Category | Side dropped | Rationale |
|---|---|---|
| **A: cross-leak** (val image's bytes appear in train) | drop from **train** | Preserves val set integrity. Standard practice: the val set is sacred. |
| **B: within-train dupes** | keep first (alphabetical), drop rest | Keeps one canonical copy per unique image. |
| **C: within-val dupes** | keep first, drop rest | Same. |
For every dropped file the `dedup_manifest.json` records: filename, side (train / val), reason (`cross_leak` / `train_dup` / `val_dup`), and the kept alias.
After cleaning, a sanity check confirms `train_hashes ∩ val_hashes = ∅`.
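The three removal rules can be sketched as a single pass over md5 groups. This is an illustrative reconstruction, not the contents of `dedupe_dataset.py`: it works on in-memory bytes instead of files, the manifest key names (`filename`, `side`, `reason`, `kept_alias`) are guesses at the schema described above, and it assumes that a train group matching a val image is dropped entirely.

```python
import hashlib
from collections import defaultdict


def group_by_md5(files):
    """Map md5 digest -> alphabetically sorted filenames sharing those bytes."""
    groups = defaultdict(list)
    for name in sorted(files):
        groups[hashlib.md5(files[name]).hexdigest()].append(name)
    return groups


def plan_dedup(train, val):
    """train / val map filename -> raw bytes; returns manifest entries for dropped files."""
    train_groups, val_groups = group_by_md5(train), group_by_md5(val)
    manifest = []

    def drop(name, side, reason, kept):
        manifest.append({"filename": name, "side": side, "reason": reason, "kept_alias": kept})

    for names in val_groups.values():      # C: within-val dupes, keep first
        for name in names[1:]:
            drop(name, "val", "val_dup", names[0])
    for digest, names in train_groups.items():
        if digest in val_groups:           # A: cross-leak, drop every train copy
            for name in names:
                drop(name, "train", "cross_leak", val_groups[digest][0])
        else:                              # B: within-train dupes, keep first
            for name in names[1:]:
                drop(name, "train", "train_dup", names[0])
    return manifest


# Toy data: short byte strings stand in for image files (made-up names).
train = {"t1.jpg": b"A", "t2.jpg": b"A", "t3.jpg": b"B", "t4.jpg": b"C", "t5.jpg": b"C"}
val = {"v1.jpg": b"A", "v2.jpg": b"D", "v3.jpg": b"D"}
manifest = plan_dedup(train, val)
dropped = {entry["filename"] for entry in manifest}

# Sanity check after cleaning: no hash is shared between train and val.
clean_train = {n: b for n, b in train.items() if n not in dropped}
clean_val = {n: b for n, b in val.items() if n not in dropped}
assert not set(group_by_md5(clean_train)) & set(group_by_md5(clean_val))
print(sorted(dropped))  # ['t1.jpg', 't2.jpg', 't5.jpg', 'v3.jpg']
```

Sorting filenames before grouping makes "keep first" deterministic, so rerunning the pass on the same inputs yields the same manifest.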
### Architectures
All 4 baselines from [pv_panel_models/](../../pv_panel_models/), trained from scratch on the cleaned data:
| ID | Model | Source class | Notes |
|---|---|---|---|
| `segnet` | SegNet (CNN) | [`pv_panel_models/cnn_model/cnn_segmenter.py`](../../pv_panel_models/cnn_model/cnn_segmenter.py) | encoder/decoder w/ MaxPool indices for unpooling. **forward applies sigmoid**. |
| `unet` | U-Net | [`pv_panel_models/unet_model/unet_model.py`](../../pv_panel_models/unet_model/unet_model.py) | classic skip-concatenation. |
| `segformer_b0` | SegFormer mit-b0 | [`pv_panel_models/vit_model/segformer_model.py`](../../pv_panel_models/vit_model/segformer_model.py) | HuggingFace small. |
| `segformer_b5` | SegFormer mit-b5 | [`pv_panel_models/segformer_b5_model/segformer_model.py`](../../pv_panel_models/segformer_b5_model/segformer_model.py) | HuggingFace large. |
### Hyperparameters (identical across models, identical to original baselines)
| Setting | Value |
|---|---|
| Image size | 128 × 128 |
| Optimizer | Adam, lr = 1e-4 |
| Scheduler | `ReduceLROnPlateau(mode='max', patience=5, factor=0.5)` on val Dice |
| Loss | `0.5 · BCE + 0.5 · Dice` (`CombinedLoss`) |
| Augmentations | `RandomHorizontalFlip(p=0.5)`, `RandomVerticalFlip(p=0.5)`, `RandomRotation(15)` |
| Epochs | 50 |
| Batch size | 16 |
| Random seed | 42 |
| Subset selection seed | 42 (same as the leaky run, so the 25/50/100% nesting structure is preserved across studies modulo the cleanup) |
The point of holding hyperparameters fixed is that the **only intentional differences** between this study and the original baselines are:
1. Training set is deduplicated.
2. Metrics use a global confusion matrix (instead of the per-batch averaging the originals did).
3. Reproducible seed.
### Metrics (standardized)
We accumulate TP / FP / FN / TN over each entire epoch and compute:
| Metric | Formula |
|---|---|
| `iou` (foreground) | `TP / (TP + FP + FN)` |
| `miou` | `mean(foreground IoU, background IoU)` |
| `dice` | `2·TP / (2·TP + FP + FN)` |
| `pixel_acc` | `(TP + TN) / total` |
This matches PASCAL/Cityscapes-style mIoU reporting. The per-batch averaging used in the original baselines gives slightly different values (especially when batches are imbalanced); here every model is evaluated identically.
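To make the difference between global accumulation and per-batch averaging concrete, here is a small pure-Python sketch of the formulas in the table. It is not the code in `metrics.py`; the helper names `confusion`, `ratio`, and `metrics` are invented for this example, the two toy batches are made up, and returning 0.0 on an empty denominator is an assumption.

```python
def confusion(pred, target):
    """Count (tp, fp, fn, tn) for two equal-length sequences of 0/1 pixel labels."""
    tp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, target) if p == 0 and t == 1)
    tn = sum(1 for p, t in zip(pred, target) if p == 0 and t == 0)
    return tp, fp, fn, tn


def ratio(num, den):
    """Divide, returning 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def metrics(tp, fp, fn, tn):
    fg_iou = ratio(tp, tp + fp + fn)
    bg_iou = ratio(tn, tn + fp + fn)  # background IoU: TN plays the role of TP
    return {
        "iou": fg_iou,
        "miou": (fg_iou + bg_iou) / 2,
        "dice": ratio(2 * tp, 2 * tp + fp + fn),
        "pixel_acc": ratio(tp + tn, tp + fp + fn + tn),
    }


# Two toy batches of (prediction, target) pixels.
batches = [
    ([1, 1, 0, 0], [1, 0, 0, 0]),
    ([1, 1, 1, 1], [1, 1, 1, 0]),
]
totals = [0, 0, 0, 0]
per_batch_iou = []
for pred, target in batches:
    counts = confusion(pred, target)
    totals = [a + b for a, b in zip(totals, counts)]  # accumulate over the epoch
    per_batch_iou.append(metrics(*counts)["iou"])

global_metrics = metrics(*totals)
mean_batch_iou = sum(per_batch_iou) / len(per_batch_iou)
print(round(global_metrics["iou"], 4), mean_batch_iou)  # 0.6667 0.625
```

The two numbers differ because the second batch contains more foreground pixels, so it carries more weight in the global counts than in a plain mean over batches.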
### Subset construction
[subsets/make_subsets.py](subsets/make_subsets.py) reads `final_data_clean/train/images/`, sorts filenames, shuffles once with `random.Random(42)`, and writes:
- `subset_25.txt`: first 25% of the shuffled list
- `subset_50.txt`: first 50%
- `subset_100.txt`: full list
The script asserts `25 ⊂ 50 ⊂ 100`. The files are plaintext, one filename per line; both the trainer and the dashboard read them as the single source of truth.
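The nesting property follows directly from taking prefixes of a single shuffled list. A minimal sketch of that idea, assuming floor rounding for the subset sizes (the actual rounding in `make_subsets.py` is not stated here) and using made-up filenames; `make_nested_subsets` is a hypothetical name:

```python
import random


def make_nested_subsets(filenames, seed=42, shares=(25, 50, 100)):
    """Return {share: prefix of one seeded shuffle}, so smaller shares nest in larger ones."""
    names = sorted(filenames)            # fixed starting order, independent of directory listing
    random.Random(seed).shuffle(names)   # one shuffle, shared by all shares
    return {share: names[: len(names) * share // 100] for share in shares}


names = [f"img_{i:03d}.jpg" for i in range(20)]  # made-up filenames
subsets = make_nested_subsets(names)

# Each smaller subset is contained in the next larger one.
assert set(subsets[25]) <= set(subsets[50]) <= set(subsets[100])
print({share: len(files) for share, files in subsets.items()})  # {25: 5, 50: 10, 100: 20}
```

Using a dedicated `random.Random(seed)` instance keeps the shuffle independent of any other use of the global random state.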
### What we save per run
For every (model, share) pair:
- `checkpoints/{model}_{share}_best.pth`: state dict at the highest val Dice across all 50 epochs (plus epoch number, val metrics, model name, share, and an `output_is_prob` flag for SegNet)
- `checkpoints/{model}_{share}_final.pth`: state dict at epoch 50
- `logs/{model}_{share}.json`: per-epoch JSON with `train_*` / `val_*` for `{loss, dice, iou, miou, pixel_acc}`, plus `epoch_seconds`, `train_seconds`, `val_seconds`, and top-level wall-clock totals and ISO timestamps
- `logs/{model}_{share}.stdout.log`: captured stdout from `run_all.sh`
Logs are written incrementally, so it is safe to interrupt a run and inspect them mid-training.
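One way to get interrupt-safe incremental logs alongside best-epoch tracking is to rewrite the JSON file after every epoch through a temporary file and an atomic rename. The sketch below is an assumption about how such a loop could look, not an excerpt from `train.py`: the log schema is simplified, the val Dice values are made up, and `write_log` is a hypothetical helper.

```python
import json
import os
import tempfile


def write_log(path, log):
    """Write the whole log to a temp file, then atomically swap it into place."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(log, f, indent=2)
    os.replace(tmp_path, path)  # readers never see a half-written file


log_path = os.path.join(tempfile.mkdtemp(), "unet_25.json")
log = {"model": "unet", "share": 25, "epochs": []}
best = {"epoch": None, "val_dice": float("-inf")}

fake_val_dice = [0.61, 0.70, 0.68]  # made-up numbers standing in for real validation
for epoch, val_dice in enumerate(fake_val_dice, start=1):
    log["epochs"].append({"epoch": epoch, "val_dice": val_dice})
    if val_dice > best["val_dice"]:
        best = {"epoch": epoch, "val_dice": val_dice}  # a trainer would save the best checkpoint here
    write_log(log_path, log)  # flushed every epoch, so an interrupted run keeps its history

with open(log_path) as f:
    on_disk = json.load(f)
print(best["epoch"], len(on_disk["epochs"]))  # 2 3
```

Rewriting the full file each epoch is cheap at 50 epochs and keeps the log valid JSON at every point in time.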
---
## 3 · Repo layout
```
experiments/clean_data_scaling_study/
├── README.md                  # this file
├── requirements.txt
├── dedupe_dataset.py          # builds final_data_clean/ + dedup_manifest.json
├── subsets/
│   ├── make_subsets.py        # builds nested subsets from the cleaned train list
│   ├── subset_25.txt          # written by make_subsets.py
│   ├── subset_50.txt
│   └── subset_100.txt
├── dataset.py                 # SubsetSolarPanelDataset
├── metrics.py                 # global confusion-matrix mIoU/IoU/Dice/PixelAcc
├── models.py                  # 4 builders: segnet/unet/segformer_b0/segformer_b5
├── train.py                   # unified trainer (--model, --share)
├── run_all.sh                 # 12 trainings sequentially
├── checkpoints/               # populated during training (24 files: 12 best + 12 final)
├── logs/                      # populated during training (12 JSONs + 12 stdout logs)
└── dashboard/
    └── app.py                 # Streamlit dashboard
```
---
## 4 · How to run
```bash
# 0. Install (system python on this machine; adjust if you use a venv)
pip install --user --break-system-packages -r requirements.txt
# 1. Build the deduplicated dataset (writes ../../final_data_clean/)
python dedupe_dataset.py # interactive
python dedupe_dataset.py --dry-run # report only, no copy
python dedupe_dataset.py --force # overwrite if final_data_clean/ exists
# 2. Build the nested subsets (writes subsets/subset_*.txt)
python subsets/make_subsets.py
# 3. Train. Each run takes 5–75 min on a single GPU depending on model + share.
PYTHON=/usr/bin/python3 ./run_all.sh # all 12 runs (~6 hours)
PYTHON=/usr/bin/python3 ./run_all.sh segformer_b0 # one model × 3 shares
python train.py --model unet --share 25 # single run
# 4. Dashboard
streamlit run dashboard/app.py
# → http://localhost:8501
```
---
## 5 · Reading the results
The dashboard's three tabs:
1. **Learning curves**: switchable metric (Dice / mIoU / IoU / PixelAcc / Loss), train+val toggle, one chart per architecture with the three data shares overlaid.
2. **Data share vs final**: best- vs final-epoch toggle, four charts (mIoU, Dice, IoU, PixelAcc) by data share with the four architectures as separate lines / bars. Also shows the per-run wall-clock and seconds-per-epoch breakdown.
3. **Inference**: drop in any image and see the 4×3 = 12-panel grid of predictions side by side. Toggle threshold, view (`mask` / `overlay` / `heatmap`), and best vs final.
---
## 6 · Caveats
- **128×128 resolution**. SegFormer architectures generally benefit from higher resolution; the comparison is fair across architectures here, but absolute SegFormer numbers would likely improve at 256+ input sizes.
- **Single seed**. Each (model, share) is one training run. Multiple seeds would tighten error bars; we did not do that to keep the GPU budget reasonable.
- **Mask inconsistency for the 14 leaked pairs**. We dropped the train copy and kept the val copy, but the val mask was annotated separately from the dropped train mask, so we *did* lose some training signal. The trade-off favors evaluation cleanliness.
- **Comparison to the leaked run**. The previous [experiments/data_scaling_study/](../data_scaling_study/) used per-batch averaged metrics and 2 architectures (U-Net + SegFormer-B0); this run uses global metrics and 4 architectures. So absolute numbers are not directly comparable; only the *trends* in the leaked run can be cross-referenced with the trends here.