Mohamed-ENNHIRI

Solar Panel Segmentation app for HF Spaces

52efd90 25 days ago

	# Clean Data-Scaling Study

	> Why this experiment exists: a previous data-scaling study ([experiments/data_scaling_study/](../data_scaling_study/)) was run on a dataset with a small but real train↔val image leak. This experiment redoes everything from scratch on a deduplicated dataset, with all four baseline architectures, no shortcuts.

	---

	## TL;DR

	- What we found: 14 of 1,331 validation images (1.05%) had byte-identical pixel content somewhere in the training set under different filenames. Two of those 14 val images had two byte-identical copies in train each, so removing the leak required dropping 16 train files. Corresponding masks differed for every pair — same image annotated twice with slightly different labels.
	- Plus: 41 within-train duplicate groups (41 redundant train files dropped, one canonical kept per group), and 1 within-val duplicate group (1 redundant val file dropped).
	- Final cleaned dataset: train 5,325 → 5,268 (-57), val 1,331 → 1,330 (-1).
	- Why it matters: the effect on absolute val numbers is small (upper bound of roughly 1%), and cross-architecture comparisons at the same data share are unaffected because every architecture sees the same leaked images. The leak does bias the data-scaling slope, though: the 100% run is exposed to 14 leaked images versus 3 at 25%, roughly 4.7× as many.
	- What we did:
	   1. Built a deduplicated `final_data_clean/`: cross-leaks dropped from train (preserving val), within-train and within-val duplicates collapsed to one canonical copy each.
	   2. Recomputed the nested 25 / 50 / 100% subsets from the cleaned train list (same `seed=42`, so 25 ⊂ 50 ⊂ 100 still holds).
	   3. Retrained all 4 baselines (SegNet, U-Net, SegFormer-B0, SegFormer-B5) at all 3 data shares = 12 trainings from scratch, no bootstrapping, both best- and final-epoch checkpoints saved.
	   4. Used the same global confusion-matrix metric code for every model, so all numbers are directly comparable.

	---

	## 1 · The leakage, in detail

	### How we found it

	A simple integrity audit on `final_data/`:

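	```python
	import hashlib
	from pathlib import Path

	def md5_by_name(folder):  # filename -> md5 of file bytes, for every *.jpg
	    return {p.name: hashlib.md5(p.read_bytes()).hexdigest()
	            for p in sorted(Path(folder).glob("*.jpg"))}

	train = md5_by_name("final_data/train/images")
	val = md5_by_name("final_data/val/images")

	# 1. Filename overlap? -> 0 collisions, looked clean by name
	print(len(train.keys() & val.keys()))
	# 2. Content overlap by md5? -> 14 unique val images (16 train counterparts)
	shared = set(train.values()) & set(val.values())
	print(sum(h in shared for h in val.values()), sum(h in shared for h in train.values()))
	```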

	So the dataset passed a naïve filename-only check but failed the content-hash check. That's how this kind of leakage typically slips past curation.

	### The 16 leaked train→val pairs

	The script flagged 16 train files whose md5 matches a val image's md5. They map to 14 unique val images — two val images have two byte-identical train copies each (effectively triple-counted: 1 in val + 2 in train). All 16 train files are dropped:

	| val image (kept) | train image (dropped) |
	|---|---|
	| `073102811.jpg` | `073110312.jpg` |
	| `073131323.jpg` | `073156472.jpg` |
	| `073134783.jpg` | `073135322.jpg` |
	| `073135207.jpg` | `073211885.jpg` |
	| `073140318.jpg` | `073160106.jpg` |
	| `073223706.jpg` | `07313935.jpg` |
	| `073237437.jpg` | `073164539.jpg` |
	| `07325333.jpg` | `07350841.jpg` |
	| `073255665.jpg` | `073131044.jpg` |
	| `073264660.jpg` | `073248381.jpg` |
	| `07331160.jpg` | `073106425.jpg` |
	| `07373455.jpg` | `073108773.jpg` |
	| (plus 2 val images with 2 train copies each) | |

	Full list in `final_data_clean/dedup_manifest.json` under `category_A_cross_leak`.

	### Same image, different masks

	Curiously, the corresponding `_mask.png` files are not identical for any of the 14 pairs. The same source image was annotated twice with slightly different labels. So this is image-content leakage, not label leakage:

	- During training, the network saw the pixel pattern under one annotation.
	- During validation, it was scored against a different annotation of the same pixels.
	- Net effect: the network has a head start on those 14 images (it has memorized the visual features) but the val mask is held-out, so accuracy on them is a mix of memorization and generalization.

	### Cross-share exposure

	Because the train copies of the 14 leaked val images are spread throughout the train set, each data share exposed the model to a different number of those val images:

	| Data share | Leaked val images seen in training |
	|---|---:|
	| 25% | 3 / 14 |
	| 50% | 7 / 14 |
	| 100% | 14 / 14 |

	So in the leaked study the 100% model had seen roughly 4.7× as many val images during training as the 25% model (14 vs 3). This biases the data-scaling slope upward. Removing the leakage gives a cleaner read on what data volume actually buys you.
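
The per-share exposure can be recounted by intersecting each subset list with the dropped cross-leak files. The sketch below is a minimal illustration, assuming the counts refer to distinct val images that have at least one train copy in the subset; the function name and the toy filenames are hypothetical and not taken from the dataset.

```python
def exposed_val_images(subset_files, leak_pairs):
    """Count val images with at least one byte-identical train copy in the subset.

    leak_pairs: iterable of (train_filename, val_filename) tuples.
    """
    subset = set(subset_files)
    return len({val for train, val in leak_pairs if train in subset})

# Hypothetical toy data: t3 and t4 are both copies of the same val image v3.
leak_pairs = [("t1.jpg", "v1.jpg"), ("t2.jpg", "v2.jpg"),
              ("t3.jpg", "v3.jpg"), ("t4.jpg", "v3.jpg")]
subset_25 = ["t1.jpg", "a.jpg"]
subset_100 = ["t1.jpg", "a.jpg", "t2.jpg", "t3.jpg", "t4.jpg", "b.jpg"]

print(exposed_val_images(subset_25, leak_pairs))   # 1
print(exposed_val_images(subset_100, leak_pairs))  # 3
```

Counting val images rather than train files keeps the denominator at 14 even though 16 train files were involved.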

	### Within-set duplication (extra cleanup)

	The audit also found:

	- Within train: 41 hash-groups, 41 redundant files dropped (one canonical kept per group). These don't cause leakage but inflate the effective dataset size and over-weight the duplicate images during training.
	- Within val: 1 hash-group, 1 redundant file dropped. Left in place, it would over-weight one image during evaluation.

	We deduplicated all three categories. Full manifest in `final_data_clean/dedup_manifest.json`.

	---

	## 2 · Methodology

	### Dataset

	Source: `final_data/` (the original, leaky dataset — left untouched as historical record).
	Cleaned copy: `final_data_clean/`, built by [dedupe_dataset.py](dedupe_dataset.py).

	Three categories of removal:

	| Category | Side dropped | Rationale |
	|---|---|---|
	| A: cross-leak (val image's bytes appear in train) | drop from train | Preserves val set integrity. Standard practice: the val set is sacred. |
	| B: within-train dupes | keep first (alphabetical), drop rest | Keeps one canonical copy per unique image. |
	| C: within-val dupes | keep first, drop rest | Same. |

	For every dropped file the `dedup_manifest.json` records: filename, side (train / val), reason (`cross_leak` / `train_dup` / `val_dup`), and the kept alias.

	After cleaning, a sanity check confirms `train_hashes ∩ val_hashes = ∅`.
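
The three removal rules can be sketched as follows. This is not the code in `dedupe_dataset.py`; it is a minimal illustration that assumes images are grouped by md5, and the function names and record keys (`filename`, `side`, `reason`, `kept`) are hypothetical stand-ins for whatever the manifest actually uses.

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def md5_groups(folder):
    """Group the .jpg filenames in `folder` by the md5 of their bytes."""
    groups = defaultdict(list)
    for path in sorted(Path(folder).glob("*.jpg")):
        groups[hashlib.md5(path.read_bytes()).hexdigest()].append(path.name)
    return groups

def plan_dedup(train_groups, val_groups):
    """Return manifest-style records for every file that should be dropped."""
    dropped = []
    for digest, names in train_groups.items():
        if digest in val_groups:
            # Category A: every train copy of a val image is dropped.
            kept = val_groups[digest][0]
            dropped += [dict(filename=n, side="train", reason="cross_leak", kept=kept)
                        for n in names]
        else:
            # Category B: keep the alphabetically first copy, drop the rest.
            dropped += [dict(filename=n, side="train", reason="train_dup", kept=names[0])
                        for n in names[1:]]
    for names in val_groups.values():
        # Category C: keep the first val copy, drop the rest.
        dropped += [dict(filename=n, side="val", reason="val_dup", kept=names[0])
                    for n in names[1:]]
    return dropped

# Toy hash groups (hypothetical): h1 is a cross-leak with two train copies.
train_groups = {"h1": ["a.jpg", "b.jpg"], "h2": ["c.jpg"], "h3": ["d.jpg", "e.jpg"]}
val_groups = {"h1": ["v.jpg"], "h4": ["w.jpg", "x.jpg"]}
plan = plan_dedup(train_groups, val_groups)
print([(r["filename"], r["reason"]) for r in plan])
```

On real folders, `md5_groups("final_data/train/images")` and `md5_groups("final_data/val/images")` would supply the two arguments.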

	### Architectures

	All 4 baselines from [pv_panel_models/](../../pv_panel_models/), trained from scratch on the cleaned data:

	| ID | Model | Source class | Notes |
	|---|---|---|---|
	| `segnet` | SegNet (CNN) | [`pv_panel_models/cnn_model/cnn_segmenter.py`](../../pv_panel_models/cnn_model/cnn_segmenter.py) | Encoder/decoder with MaxPool indices for unpooling; `forward` applies sigmoid. |
	| `unet` | U-Net | [`pv_panel_models/unet_model/unet_model.py`](../../pv_panel_models/unet_model/unet_model.py) | Classic skip-concatenation. |
	| `segformer_b0` | SegFormer mit-b0 | [`pv_panel_models/vit_model/segformer_model.py`](../../pv_panel_models/vit_model/segformer_model.py) | HuggingFace small. |
	| `segformer_b5` | SegFormer mit-b5 | [`pv_panel_models/segformer_b5_model/segformer_model.py`](../../pv_panel_models/segformer_b5_model/segformer_model.py) | HuggingFace large. |

	### Hyperparameters (identical across models, identical to original baselines)

	| Setting | Value |
	|---|---|
	| Image size | 128 × 128 |
	| Optimizer | Adam, lr = 1e-4 |
	| Scheduler | `ReduceLROnPlateau(mode='max', patience=5, factor=0.5)` on val Dice |
	| Loss | `0.5 · BCE + 0.5 · Dice` (`CombinedLoss`) |
	| Augmentations | `RandomHorizontalFlip(p=0.5)`, `RandomVerticalFlip(p=0.5)`, `RandomRotation(15)` |
	| Epochs | 50 |
	| Batch size | 16 |
	| Random seed | 42 |
	| Subset selection seed | 42 (same as the leaky run, so the 25/50/100% nesting structure is preserved across studies modulo the cleanup) |

	The point of holding hyperparameters fixed is that the only intentional differences between this study and the original baselines are:
	1. Training set is deduplicated.
	2. Metrics use a global confusion matrix (instead of the per-batch averaging the originals did).
	3. Reproducible seed.
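
For readers unfamiliar with the `0.5 · BCE + 0.5 · Dice` objective, here is a plain-Python sketch of the formula on flat lists of probabilities and 0/1 targets. It is not the repository's `CombinedLoss`: the clamping constant `eps`, the `smooth` term, and the mean reduction are assumptions chosen for illustration.

```python
import math

def combined_loss(probs, targets, eps=1e-7, smooth=1.0):
    """0.5 * BCE + 0.5 * soft-Dice loss over flat probabilities and 0/1 targets."""
    n = len(probs)
    # Binary cross-entropy, with probabilities clamped away from 0 and 1.
    bce = -sum(t * math.log(max(p, eps)) + (1 - t) * math.log(max(1 - p, eps))
               for p, t in zip(probs, targets)) / n
    # Soft Dice coefficient; `smooth` avoids 0/0 on empty masks (assumed value).
    intersection = sum(p * t for p, t in zip(probs, targets))
    dice = (2 * intersection + smooth) / (sum(probs) + sum(targets) + smooth)
    return 0.5 * bce + 0.5 * (1 - dice)

perfect = combined_loss([1.0, 0.0, 1.0], [1, 0, 1])
poor = combined_loss([0.1, 0.9, 0.2], [1, 0, 1])
print(perfect, poor)  # the perfect prediction scores (close to) zero
```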

	### Metrics (standardized)

	We accumulate TP / FP / FN / TN over each entire epoch and compute:

	| Metric | Formula |
	|---|---|
	| `iou` (foreground) | `TP / (TP + FP + FN)` |
	| `miou` | `mean(foreground IoU, background IoU)` |
	| `dice` | `2·TP / (2·TP + FP + FN)` |
	| `pixel_acc` | `(TP + TN) / total` |

	This matches PASCAL/Cityscapes-style mIoU reporting. The per-batch averaging used in the original baselines gives slightly different values, especially when batches are imbalanced, because each batch counts equally regardless of how many foreground pixels it contains. Here every model is evaluated identically.
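
A small worked example shows why the two conventions disagree. The sketch below applies the table's formulas to accumulated counts and compares the result with a per-batch average; the helper names and the toy batches are hypothetical, and zero-denominator handling is omitted for brevity.

```python
def confusion(pred, target):
    """Return (TP, FP, FN, TN) for two flat lists of 0/1 labels."""
    tp = sum(p == 1 and t == 1 for p, t in zip(pred, target))
    fp = sum(p == 1 and t == 0 for p, t in zip(pred, target))
    fn = sum(p == 0 and t == 1 for p, t in zip(pred, target))
    tn = sum(p == 0 and t == 0 for p, t in zip(pred, target))
    return tp, fp, fn, tn

def metrics(tp, fp, fn, tn):
    """Apply the formulas from the table to a set of counts."""
    iou_fg = tp / (tp + fp + fn)
    iou_bg = tn / (tn + fn + fp)  # background IoU: TN plays the role of TP
    return {
        "iou": iou_fg,
        "miou": (iou_fg + iou_bg) / 2,
        "dice": 2 * tp / (2 * tp + fp + fn),
        "pixel_acc": (tp + tn) / (tp + fp + fn + tn),
    }

# Two toy "batches" as (prediction, target): one panel-heavy, one nearly empty.
batches = [
    ([1, 1, 1, 0], [1, 1, 1, 1]),
    ([1, 0, 0, 0], [0, 0, 0, 1]),
]

# Global: sum the counts over all batches, then compute each metric once.
totals = [sum(c) for c in zip(*(confusion(p, t) for p, t in batches))]
global_iou = metrics(*totals)["iou"]

# Per-batch: compute the metric per batch, then average the results.
per_batch_iou = sum(metrics(*confusion(p, t))["iou"] for p, t in batches) / len(batches)

print(global_iou, per_batch_iou)  # 0.5 0.375
```

The nearly empty second batch drags the per-batch average down even though it holds only one foreground pixel, which is the imbalance effect described above.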

	### Subset construction

	[subsets/make_subsets.py](subsets/make_subsets.py) reads `final_data_clean/train/images/`, sorts filenames, shuffles once with `random.Random(42)`, and writes:

	- `subset_25.txt` — first 25% of the shuffled list
	- `subset_50.txt` — first 50%
	- `subset_100.txt` — full list

	The script asserts `25 ⊂ 50 ⊂ 100`. The subset files are plaintext, one filename per line; both the trainer and the dashboard read them as the single source of truth.
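
The nesting falls out of taking prefixes of a single shuffled list. A minimal sketch of that idea, not the contents of `make_subsets.py`: the function name, the placeholder filenames, and the floor rounding of the subset sizes are assumptions.

```python
import random

def make_nested_subsets(filenames, seed=42, shares=(25, 50, 100)):
    """Sort, shuffle once with a seeded RNG, and take prefixes of the shuffled list."""
    shuffled = sorted(filenames)
    random.Random(seed).shuffle(shuffled)
    # Floor rounding is an assumption; the real script may round differently.
    return {s: shuffled[: len(shuffled) * s // 100] for s in shares}

# Placeholder filenames standing in for final_data_clean/train/images/.
names = [f"img_{i:04d}.jpg" for i in range(200)]
subsets = make_nested_subsets(names)

# Prefixes of one list are nested by construction.
assert set(subsets[25]) <= set(subsets[50]) <= set(subsets[100])
print({s: len(v) for s, v in subsets.items()})  # {25: 50, 50: 100, 100: 200}
```

Sorting before shuffling makes the result independent of the order in which the filesystem lists the files.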

	### What we save per run

	For every (model, share) pair:

	- `checkpoints/{model}_{share}_best.pth` — state dict at the highest val Dice across all 50 epochs (plus epoch number, val metrics, model name, share, and `output_is_prob` flag for SegNet)
	- `checkpoints/{model}_{share}_final.pth` — state dict at epoch 50
	- `logs/{model}_{share}.json` — per-epoch JSON with `train_` / `val_` for `{loss, dice, iou, miou, pixel_acc}`, plus `epoch_seconds`, `train_seconds`, `val_seconds`, plus top-level wall-clock totals and ISO timestamps
	- `logs/{model}_{share}.stdout.log` — captured stdout from `run_all.sh`

	Logs are written incrementally — safe to interrupt and inspect mid-training.
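
One common way to keep a JSON log readable while training is still running is to rewrite it after every epoch and swap it into place atomically. The sketch below illustrates that pattern and how a best epoch could be read back. It is an assumption about the approach, not the trainer's actual code, and the `epochs` / `val_dice` keys and the numbers are made up for the demo.

```python
import json
import os
import tempfile

def append_epoch(log_path, record):
    """Add one epoch record to the JSON log, rewriting the file atomically."""
    log = {"epochs": []}
    if os.path.exists(log_path):
        with open(log_path) as f:
            log = json.load(f)
    log["epochs"].append(record)
    tmp_path = log_path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(log, f, indent=2)
    os.replace(tmp_path, log_path)  # readers never see a half-written file

def best_epoch(log_path, key="val_dice"):
    """Return the epoch record with the highest value of `key`."""
    with open(log_path) as f:
        return max(json.load(f)["epochs"], key=lambda r: r[key])

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "unet_25.json")
    # Made-up values, only to exercise the two helpers.
    for epoch, dice in enumerate([0.61, 0.74, 0.70], start=1):
        append_epoch(path, {"epoch": epoch, "val_dice": dice})
    best = best_epoch(path)

print(best)  # {'epoch': 2, 'val_dice': 0.74}
```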

	---

	## 3 · Repo layout

	```
	experiments/clean_data_scaling_study/
	├── README.md            # this file
	├── requirements.txt
	├── dedupe_dataset.py    # builds final_data_clean/ + dedup_manifest.json
	├── subsets/
	│   ├── make_subsets.py  # builds nested subsets from the cleaned train list
	│   ├── subset_25.txt    # written by make_subsets.py
	│   ├── subset_50.txt
	│   └── subset_100.txt
	├── dataset.py           # SubsetSolarPanelDataset
	├── metrics.py           # global confusion-matrix mIoU/IoU/Dice/PixelAcc
	├── models.py            # 4 builders: segnet/unet/segformer_b0/segformer_b5
	├── train.py             # unified trainer (--model, --share)
	├── run_all.sh           # 12 trainings sequentially
	├── checkpoints/         # populated during training (24 files: 12 best + 12 final)
	├── logs/                # populated during training (12 JSONs + 12 stdout logs)
	└── dashboard/
	    └── app.py           # Streamlit dashboard
	```

	---

	## 4 · How to run

	```bash
	# 0. Install (system python on this machine; adjust if you use a venv)
	pip install --user --break-system-packages -r requirements.txt

	# 1. Build the deduplicated dataset (writes ../../final_data_clean/)
	python dedupe_dataset.py # interactive
	python dedupe_dataset.py --dry-run # report only, no copy
	python dedupe_dataset.py --force # overwrite if final_data_clean/ exists

	# 2. Build the nested subsets (writes subsets/subset_*.txt)
	python subsets/make_subsets.py

	# 3. Train. Each run takes 5–75 min on a single GPU depending on model + share.
	PYTHON=/usr/bin/python3 ./run_all.sh # all 12 runs (~6 hours)
	PYTHON=/usr/bin/python3 ./run_all.sh segformer_b0 # one model × 3 shares
	python train.py --model unet --share 25 # single run

	# 4. Dashboard
	streamlit run dashboard/app.py
	# → http://localhost:8501
	```
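
The 12-run matrix is the product of the four model IDs and the three data shares. A short Python sketch that enumerates the equivalent `train.py` invocations, using the `--model` and `--share` flags shown above; the ordering is an assumption and may not match what `run_all.sh` does.

```python
from itertools import product

MODELS = ["segnet", "unet", "segformer_b0", "segformer_b5"]
SHARES = [25, 50, 100]

def build_commands(python="python", models=MODELS, shares=SHARES):
    """Return one argv list per (model, share) pair."""
    return [[python, "train.py", "--model", m, "--share", str(s)]
            for m, s in product(models, shares)]

commands = build_commands()
print(len(commands))          # 12
print(" ".join(commands[0]))  # python train.py --model segnet --share 25
```

Each list could be passed to `subprocess.run(cmd, check=True)` to launch the runs one after another.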

	---

	## 5 · Reading the results

	The dashboard's three tabs:

	1. Learning curves — switchable metric (Dice / mIoU / IoU / PixelAcc / Loss), train+val toggle, one chart per architecture with the three data shares overlaid.
	2. Data share vs final — best- vs final-epoch toggle, four charts (mIoU, Dice, IoU, PixelAcc) by data share with the four architectures as separate lines / bars. Plus per-run wall-clock and seconds-per-epoch breakdown.
	3. Inference — drop in any image, see the 4×3 = 12-panel grid of predictions, side-by-side. Toggle threshold, view (`mask` / `overlay` / `heatmap`), and best vs final.

	---

	## 6 · Caveats

	- 128×128 resolution. SegFormer architectures generally benefit from higher resolution; the comparison is fair across architectures here, but absolute SegFormer numbers would likely improve at 256+ input sizes.
	- Single seed. Each (model, share) is one training run. Multiple seeds would tighten error bars; we did not do that to keep the GPU budget reasonable.
	- Mask inconsistency for the 14 leaked pairs. We dropped the train copy and kept the val copy, but the val mask was annotated separately from the dropped train mask — so we did lose some training signal. The trade-off favors evaluation cleanliness.
	- Comparison to the leaked run. The previous [experiments/data_scaling_study/](../data_scaling_study/) used per-batch averaged metrics and 2 architectures (U-Net + SegFormer-B0); this run uses global metrics and 4 architectures. So absolute numbers are not directly comparable — only the trends in the leaked run can be cross-referenced with the trends here.