Clean Data-Scaling Study

Why this experiment exists: a previous data-scaling study (experiments/data_scaling_study/) was run on a dataset with a small but real train↔val image leak. This experiment redoes everything from scratch on a deduplicated dataset, with all four baseline architectures, no shortcuts.

TL;DR

What we found: 14 of 1,331 validation images (1.05%) had byte-identical pixel content somewhere in the training set under different filenames. Two of those 14 val images had two byte-identical copies in train each, so removing the leak required dropping 16 train files. Corresponding masks differed for every pair — same image annotated twice with slightly different labels.
Plus: 41 within-train duplicate groups (41 redundant train files dropped, one canonical kept per group), and 1 within-val duplicate group (1 redundant val file dropped).
Final cleaned dataset: train 5,325 → 5,268 (-57), val 1,331 → 1,330 (-1).
Why it matters: the effect on absolute val numbers is small (at most about 1%), and cross-architecture comparisons at the same data share are unaffected because every architecture sees the same leaked images. The leak does bias the data-scaling slope, though: the 100% point is exposed to ~5× more leaked images than the 25% point.
What we did:
1. Built a deduplicated final_data_clean/: cross-leaks dropped from train (preserving val), within-train and within-val duplicates collapsed to one canonical copy each.
2. Recomputed the nested 25 / 50 / 100% subsets from the cleaned train list (same seed=42, so 25 ⊂ 50 ⊂ 100 still holds).
3. Retrained all 4 baselines (SegNet, U-Net, SegFormer-B0, SegFormer-B5) at all 3 data shares = 12 trainings from scratch, no bootstrapping, both best- and final-epoch checkpoints saved.
4. Same global confusion-matrix metric code across every model, so all numbers are directly comparable.

1 · The leakage, in detail

How we found it

A simple integrity audit on final_data/:

# train_files = sorted(*.jpg in final_data/train/images)
# val_files   = sorted(*.jpg in final_data/val/images)

# 1. Filename overlap?
train_names ∩ val_names  →  ∅       # 0 collisions, looked clean by name

# 2. Content overlap by md5?
train_hashes ∩ val_hashes  →  14 unique val images (16 train counterparts)

So the dataset passed a naïve filename-only check but failed the content-hash check. That's how this kind of leakage typically slips past curation.
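The two checks above can be sketched in runnable form as follows. This is a minimal illustration, not the audit script itself: the function names `md5_of` and `audit` are hypothetical, and it assumes that "content" means the raw bytes of each `.jpg` file.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def md5_of(path: Path) -> str:
    """Return the md5 hex digest of a file's raw bytes."""
    return hashlib.md5(path.read_bytes()).hexdigest()


def audit(train_dir: Path, val_dir: Path) -> dict:
    """Report filename collisions and byte-identical train/val files."""
    train_files = sorted(train_dir.glob("*.jpg"))
    val_files = sorted(val_dir.glob("*.jpg"))

    # 1. Filename overlap: the naive check that the dataset passed.
    name_overlap = {p.name for p in train_files} & {p.name for p in val_files}

    # 2. Content overlap: group train files by hash, then look up each val file.
    train_by_hash = defaultdict(list)
    for p in train_files:
        train_by_hash[md5_of(p)].append(p.name)

    cross_leaks = {}  # val filename -> list of byte-identical train filenames
    for p in val_files:
        digest = md5_of(p)
        if digest in train_by_hash:
            cross_leaks[p.name] = train_by_hash[digest]

    return {"name_overlap": sorted(name_overlap), "cross_leaks": cross_leaks}
```

Mapping each val file to a list of train files keeps the one-to-many cases visible, such as a val image with two byte-identical train copies.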

The 16 leaked train→val pairs

The audit flagged 16 train files whose md5 matches that of a val image. They map to 14 unique val images: two val images have two byte-identical train copies each (effectively triple-counted: 1 in val + 2 in train). All 16 train files are dropped:

| val image (kept) | train image (dropped) |
|---|---|
| `073102811.jpg` | `073110312.jpg` |
| `073131323.jpg` | `073156472.jpg` |
| `073134783.jpg` | `073135322.jpg` |
| `073135207.jpg` | `073211885.jpg` |
| `073140318.jpg` | `073160106.jpg` |
| `073223706.jpg` | `07313935.jpg` |
| `073237437.jpg` | `073164539.jpg` |
| `07325333.jpg` | `07350841.jpg` |
| `073255665.jpg` | `073131044.jpg` |
| `073264660.jpg` | `073248381.jpg` |
| `07331160.jpg` | `073106425.jpg` |
| `07373455.jpg` | `073108773.jpg` |

(plus 2 val images with 2 train copies each)

Full list in final_data_clean/dedup_manifest.json under category_A_cross_leak.

Same image, different masks

Curiously, the corresponding _mask.png files are not identical for any of the 14 pairs. The same source image was annotated twice with slightly different labels. So this is image-content leakage, not label leakage:

During training, the network saw the pixel pattern under one annotation.
During validation, it was scored against a different annotation of the same pixels.
Net effect: the network has a head start on those 14 images (it has memorized the visual features) but the val mask is held-out, so accuracy on them is a mix of memorization and generalization.

Cross-share exposure

Because the leaked train copies are scattered throughout the train set, each data share saw a different number of them (counts are out of the 14 leaked val images):

| Data share | Leaked train copies seen |
|---|---|
| 25% | 3 / 14 |
| 50% | 7 / 14 |
| 100% | 14 / 14 |

So in the leaked study the 100% model had ~5× more "seen-during-training" val examples than the 25% model. This biases the data-scaling slope upward. Removing the leakage gives a cleaner read on what data volume actually buys you.

Within-set duplication (extra cleanup)

The audit also found:

Within train: 41 hash-groups, 41 redundant files dropped (one canonical kept per group). These don't cause leakage but inflate the effective dataset size and over-weight the duplicate images during training.
Within val: 1 hash-group, 1 redundant file dropped. Left in place, it would over-weight one image during evaluation.

We deduplicated all three categories. Full manifest in final_data_clean/dedup_manifest.json.

2 · Methodology

Dataset

Source: final_data/ (the original, leaky dataset — left untouched as historical record). Cleaned copy: final_data_clean/, built by dedupe_dataset.py.

Three categories of removal:

| Category | Side dropped | Rationale |
|---|---|---|
| A: cross-leak (val image's bytes appear in train) | drop from train | Preserves val set integrity. Standard practice: the val set is sacred. |
| B: within-train dupes | keep first (alphabetical), drop rest | Keeps one canonical copy per unique image. |
| C: within-val dupes | keep first, drop rest | Same. |

For every dropped file, dedup_manifest.json records the filename, side (train / val), reason (cross_leak / train_dup / val_dup), and the kept alias.

After cleaning, a sanity check confirms that train_hashes ∩ val_hashes = ∅.
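The three removal rules can be sketched as a single pass over filename-to-hash maps. This is an illustration under stated assumptions rather than the contents of dedupe_dataset.py: the function name `plan_dedup`, the flat list layout, and the key name `kept` are hypothetical (the real manifest groups entries under keys such as `category_A_cross_leak`), and it assumes cross-leaks are resolved before within-train duplicates.

```python
def plan_dedup(train_hashes: dict, val_hashes: dict) -> list:
    """Return one manifest entry per file to drop.

    train_hashes / val_hashes map filename -> content hash.
    """
    manifest = []

    # Category C: within val, keep the first filename (alphabetical) per hash.
    val_keep = {}  # hash -> kept val filename
    for name in sorted(val_hashes):
        digest = val_hashes[name]
        if digest in val_keep:
            manifest.append({"filename": name, "side": "val",
                             "reason": "val_dup", "kept": val_keep[digest]})
        else:
            val_keep[digest] = name

    train_keep = {}  # hash -> kept train filename
    for name in sorted(train_hashes):
        digest = train_hashes[name]
        if digest in val_keep:
            # Category A: the bytes also appear in val, so drop the train copy.
            manifest.append({"filename": name, "side": "train",
                             "reason": "cross_leak", "kept": val_keep[digest]})
        elif digest in train_keep:
            # Category B: duplicate of an earlier (alphabetical) train file.
            manifest.append({"filename": name, "side": "train",
                             "reason": "train_dup", "kept": train_keep[digest]})
        else:
            train_keep[digest] = name

    return manifest
```

Checking the cross-leak condition first means that two identical train files matching one val image are both recorded as `cross_leak`, which is consistent with 16 dropped train files mapping to 14 val images.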

Architectures

All 4 baselines from pv_panel_models/, trained from scratch on the cleaned data:

| ID | Model | Source class | Notes |
|---|---|---|---|
| `segnet` | SegNet (CNN) | `pv_panel_models/cnn_model/cnn_segmenter.py` | encoder/decoder w/ MaxPool indices for unpooling. forward applies sigmoid. |
| `unet` | U-Net | `pv_panel_models/unet_model/unet_model.py` | classic skip-concatenation. |
| `segformer_b0` | SegFormer mit-b0 | `pv_panel_models/vit_model/segformer_model.py` | HuggingFace small. |
| `segformer_b5` | SegFormer mit-b5 | `pv_panel_models/segformer_b5_model/segformer_model.py` | HuggingFace large. |

Hyperparameters (identical across models, identical to original baselines)


| Setting | Value |
|---|---|
| Image size | 128 × 128 |
| Optimizer | Adam, lr = 1e-4 |
| Scheduler | `ReduceLROnPlateau(mode='max', patience=5, factor=0.5)` on val Dice |
| Loss | `0.5 · BCE + 0.5 · Dice` (`CombinedLoss`) |
| Augmentations | `RandomHorizontalFlip(p=0.5)`, `RandomVerticalFlip(p=0.5)`, `RandomRotation(15)` |
| Epochs | 50 |
| Batch size | 16 |
| Random seed | 42 |
| Subset selection seed | 42 (same as the leaky run, so the 25/50/100% nesting structure is preserved across studies modulo the cleanup) |

The point of holding hyperparameters fixed is that the only intentional differences between this study and the original baselines are:

Training set is deduplicated.
Metrics use a global confusion matrix (instead of the per-batch averaging the originals did).
Reproducible seed.
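To make the `0.5 · BCE + 0.5 · Dice` loss from the hyperparameter list concrete, here is a dependency-free sketch on flat lists of per-pixel probabilities. It is not the repository's `CombinedLoss`: the function names, the soft-Dice form, and the `smooth` and `eps` values are assumptions chosen for illustration.

```python
import math


def bce(probs, targets, eps=1e-7):
    """Mean binary cross-entropy over pixels; probs are clamped to avoid log(0)."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)
        total -= t * math.log(p) + (1 - t) * math.log(1 - p)
    return total / len(probs)


def dice_loss(probs, targets, smooth=1.0):
    """Soft Dice loss: 1 - (2*sum(p*t) + smooth) / (sum(p) + sum(t) + smooth)."""
    intersection = sum(p * t for p, t in zip(probs, targets))
    denominator = sum(probs) + sum(targets)
    return 1.0 - (2.0 * intersection + smooth) / (denominator + smooth)


def combined_loss(probs, targets, w_bce=0.5, w_dice=0.5):
    """Weighted sum of BCE and Dice, with the 0.5 / 0.5 weights as defaults."""
    return w_bce * bce(probs, targets) + w_dice * dice_loss(probs, targets)
```

BCE scores each pixel independently, while Dice scores the overlap of the whole predicted region, so the sum balances per-pixel calibration against region-level agreement on images where panels cover few pixels.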

Metrics (standardized)

We accumulate TP / FP / FN / TN over each entire epoch and compute:

| Metric | Formula |
|---|---|
| `iou` (foreground) | `TP / (TP + FP + FN)` |
| `miou` | `mean(foreground IoU, background IoU)` |
| `dice` | `2·TP / (2·TP + FP + FN)` |
| `pixel_acc` | `(TP + TN) / total` |

This matches PASCAL/Cityscapes-style mIoU reporting. The per-batch averaging used in the original baselines differs slightly (especially when batches are imbalanced), because the mean of per-batch ratios is not the same as the ratio of the summed counts. Here every model is evaluated identically.
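The accumulation described above can be sketched as a small counter class. The class name `GlobalConfusion`, the flat 0/1 input format, and returning 0.0 for an empty denominator are assumptions for illustration; metrics.py may differ. Background IoU is computed by treating background as the positive class, which gives `TN / (TN + FP + FN)`.

```python
class GlobalConfusion:
    """Accumulates binary TP / FP / FN / TN over a whole epoch."""

    def __init__(self):
        self.tp = self.fp = self.fn = self.tn = 0

    def update(self, pred, target):
        """Add one batch; pred and target are flat iterables of 0/1 pixel labels."""
        for p, t in zip(pred, target):
            if p and t:
                self.tp += 1
            elif p:
                self.fp += 1
            elif t:
                self.fn += 1
            else:
                self.tn += 1

    def compute(self):
        """Compute all metrics once, from the epoch-level counts."""
        def ratio(num, den):
            # Assumption: an empty denominator is reported as 0.0.
            return num / den if den else 0.0

        tp, fp, fn, tn = self.tp, self.fp, self.fn, self.tn
        iou = ratio(tp, tp + fp + fn)
        background_iou = ratio(tn, tn + fp + fn)
        return {
            "iou": iou,
            "miou": (iou + background_iou) / 2,
            "dice": ratio(2 * tp, 2 * tp + fp + fn),
            "pixel_acc": ratio(tp + tn, tp + fp + fn + tn),
        }
```

As a worked example, take two batches with (TP, FP, FN, TN) = (1, 1, 0, 2) and (0, 0, 1, 3). Averaging the per-batch foreground IoU gives (0.5 + 0) / 2 = 0.25, while the global counts (1, 1, 1, 5) give 1 / 3.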

Subset construction

subsets/make_subsets.py reads final_data_clean/train/images/, sorts filenames, shuffles once with random.Random(42), and writes:

subset_25.txt — first 25% of the shuffled list
subset_50.txt — first 50%
subset_100.txt — full list

The script asserts 25 ⊂ 50 ⊂ 100. The files are plaintext, one filename per line, and both the trainer and the dashboard read them as the single source of truth.
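A minimal sketch of the sort, shuffle-once, take-prefix construction follows. The function name `make_nested_subsets` and the floor rounding of the subset size are assumptions; the source only states that the list is sorted, shuffled with `random.Random(42)`, and cut at 25%, 50%, and 100%.

```python
import random


def make_nested_subsets(filenames, seed=42, shares=(25, 50, 100)):
    """Return {share: filenames}, where each share is a prefix of one shuffled list."""
    ordered = sorted(filenames)           # fixed starting order, independent of input order
    random.Random(seed).shuffle(ordered)  # single shuffle with a private, seeded RNG

    subsets = {}
    for share in shares:
        count = len(ordered) * share // 100  # assumption: round down
        subsets[share] = ordered[:count]
    return subsets
```

Because every subset is a prefix of the same shuffled list, the nesting holds by construction, and sorting first makes the result independent of directory listing order.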

What we save per run

For every (model, share) pair:

checkpoints/{model}_{share}_best.pth — state dict at the highest val Dice across all 50 epochs (plus epoch number, val metrics, model name, share, and output_is_prob flag for SegNet)
checkpoints/{model}_{share}_final.pth — state dict at epoch 50
logs/{model}_{share}.json — per-epoch JSON with train_* / val_* for {loss, dice, iou, miou, pixel_acc}, plus epoch_seconds, train_seconds, val_seconds, plus top-level wall-clock totals and ISO timestamps
logs/{model}_{share}.stdout.log — captured stdout from run_all.sh

Logs are written incrementally — safe to interrupt and inspect mid-training.
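One way to get logs that are safe to inspect mid-training is to rewrite the JSON file after every epoch through a temporary file. The sketch below also shows best-epoch tracking by val Dice. It is a hypothetical illustration, not train.py: the function names, the log keys, and the atomic-replace strategy are all assumptions, and the checkpoint save is only marked by a comment.

```python
import json
import os
from pathlib import Path


def write_log(path: Path, log: dict) -> None:
    """Rewrite the whole log via a temp file so readers never see a partial JSON."""
    tmp = path.parent / (path.name + ".tmp")
    tmp.write_text(json.dumps(log, indent=2))
    os.replace(tmp, path)  # atomic rename on the same filesystem


def run_epochs(val_dice_per_epoch, log_path: Path) -> dict:
    """Track the best epoch by val Dice and flush the log after every epoch."""
    log = {"epochs": []}
    best = {"epoch": None, "val_dice": float("-inf")}

    for epoch, val_dice in enumerate(val_dice_per_epoch, start=1):
        log["epochs"].append({"epoch": epoch, "val_dice": val_dice})
        if val_dice > best["val_dice"]:
            best = {"epoch": epoch, "val_dice": val_dice}
            # A real trainer would save the *_best.pth checkpoint here.
        log["best"] = best
        write_log(log_path, log)

    # A real trainer would save the *_final.pth checkpoint here.
    return best
```

The strict `>` comparison keeps the earliest epoch when two epochs tie on val Dice; that tie-breaking choice is also an assumption.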

3 · Repo layout

experiments/clean_data_scaling_study/
├── README.md                     # this file
├── requirements.txt
├── dedupe_dataset.py             # builds final_data_clean/ + dedup_manifest.json
├── subsets/
│   ├── make_subsets.py           # builds nested subsets from the cleaned train list
│   ├── subset_25.txt             # written by make_subsets.py
│   ├── subset_50.txt
│   └── subset_100.txt
├── dataset.py                    # SubsetSolarPanelDataset
├── metrics.py                    # global confusion-matrix mIoU/IoU/Dice/PixelAcc
├── models.py                     # 4 builders: segnet/unet/segformer_b0/segformer_b5
├── train.py                      # unified trainer (--model, --share)
├── run_all.sh                    # 12 trainings sequentially
├── checkpoints/                  # populated during training (24 files: 12 best + 12 final)
├── logs/                         # populated during training (12 JSONs + 12 stdout logs)
└── dashboard/
    └── app.py                    # Streamlit dashboard

4 · How to run

# 0. Install (system python on this machine; adjust if you use a venv)
pip install --user --break-system-packages -r requirements.txt

# 1. Build the deduplicated dataset (writes ../../final_data_clean/)
python dedupe_dataset.py            # interactive
python dedupe_dataset.py --dry-run  # report only, no copy
python dedupe_dataset.py --force    # overwrite if final_data_clean/ exists

# 2. Build the nested subsets (writes subsets/subset_*.txt)
python subsets/make_subsets.py

# 3. Train. Each run takes 5–75 min on a single GPU depending on model + share.
PYTHON=/usr/bin/python3 ./run_all.sh                 # all 12 runs (~6 hours)
PYTHON=/usr/bin/python3 ./run_all.sh segformer_b0    # one model × 3 shares
python train.py --model unet --share 25              # single run

# 4. Dashboard
streamlit run dashboard/app.py
# → http://localhost:8501

5 · Reading the results

The dashboard's three tabs:

Learning curves — switchable metric (Dice / mIoU / IoU / PixelAcc / Loss), train+val toggle, one chart per architecture with the three data shares overlaid.
Data share vs final — best- vs final-epoch toggle, four charts (mIoU, Dice, IoU, PixelAcc) by data share with the four architectures as separate lines / bars. Plus per-run wall-clock and seconds-per-epoch breakdown.
Inference — drop in any image, see the 4×3 = 12-panel grid of predictions, side-by-side. Toggle threshold, view (mask / overlay / heatmap), and best vs final.

6 · Caveats

128×128 resolution. SegFormer architectures generally benefit from higher resolution; the comparison is fair across architectures here, but absolute SegFormer numbers would likely improve at 256+ input sizes.
Single seed. Each (model, share) is one training run. Multiple seeds would tighten error bars; we did not do that to keep the GPU budget reasonable.
Mask inconsistency for the 14 leaked pairs. We dropped the train copy and kept the val copy, but the val mask was annotated separately from the dropped train mask — so we did lose some training signal. The trade-off favors evaluation cleanliness.
Comparison to the leaked run. The previous experiments/data_scaling_study/ used per-batch averaged metrics and 2 architectures (U-Net + SegFormer-B0); this run uses global metrics and 4 architectures. So absolute numbers are not directly comparable — only the trends in the leaked run can be cross-referenced with the trends here.