Mohamed-ENNHIRI
Solar Panel Segmentation app for HF Spaces

Clean Data-Scaling Study

Why this experiment exists: a previous data-scaling study (experiments/data_scaling_study/) was run on a dataset with a small but real train↔val image leak. This experiment redoes everything from scratch on a deduplicated dataset, with all four baseline architectures, no shortcuts.


TL;DR

  • What we found: 14 of 1,331 validation images (1.05%) had byte-identical pixel content somewhere in the training set under different filenames. Two of those 14 val images had two byte-identical copies in train each, so removing the leak required dropping 16 train files. Corresponding masks differed for every pair: the same image was annotated twice with slightly different labels.
  • Plus: 41 within-train duplicate groups (41 redundant train files dropped, one canonical kept per group), and 1 within-val duplicate group (1 redundant val file dropped).
  • Final cleaned dataset: train 5,325 → 5,268 (-57), val 1,331 → 1,330 (-1).
  • Why it matters: the leak has a small effect on absolute val numbers (upper bound of roughly 1%) and zero effect on cross-architecture comparisons at the same data share, but it biases the data-scaling slope (the 100% point is exposed to ~5× more leaked images than the 25% point).
  • What we did:
    1. Built a deduplicated final_data_clean/: cross-leaks dropped from train (preserving val), within-train and within-val duplicates collapsed to one canonical copy each.
    2. Recomputed the nested 25 / 50 / 100% subsets from the cleaned train list (same seed=42, so 25 ⊂ 50 ⊂ 100 still holds).
    3. Retrained all 4 baselines (SegNet, U-Net, SegFormer-B0, SegFormer-B5) at all 3 data shares = 12 trainings from scratch, no bootstrapping, both best- and final-epoch checkpoints saved.
    4. Same global confusion-matrix metric code across every model, so all numbers are directly comparable.

1 · The leakage, in detail

How we found it

A simple integrity audit on final_data/:

# train_files = sorted(*.jpg in final_data/train/images)
# val_files   = sorted(*.jpg in final_data/val/images)

# 1. Filename overlap?
train_names ∩ val_names  →  ∅       # 0 collisions, looked clean by name

# 2. Content overlap by md5?
train_hashes ∩ val_hashes  →  14 unique val images (16 train counterparts)

So the dataset passed a naïve filename-only check but failed the content-hash check. That's how this kind of leakage typically slips past curation.
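A minimal sketch of such an audit is shown here. It assumes the md5 is taken over each file's raw bytes and that images are `*.jpg` files in flat folders; `md5_of` and `audit` are illustrative names, not the project's actual script.

```python
import hashlib
from pathlib import Path


def md5_of(path):
    """Return the hex md5 digest of a file's raw bytes (assumption: bytes, not decoded pixels)."""
    h = hashlib.md5()
    with Path(path).open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def audit(train_dir, val_dir):
    """Compare two image folders by filename and by content hash."""
    train_files = sorted(Path(train_dir).glob("*.jpg"))
    val_files = sorted(Path(val_dir).glob("*.jpg"))

    # Check 1: filename overlap.
    name_overlap = {p.name for p in train_files} & {p.name for p in val_files}

    # Check 2: content overlap. Map each train hash to the files that carry it.
    train_by_hash = {}
    for p in train_files:
        train_by_hash.setdefault(md5_of(p), []).append(p.name)

    leaks = {}  # val filename -> byte-identical train filenames
    for p in val_files:
        twins = train_by_hash.get(md5_of(p))
        if twins:
            leaks[p.name] = twins

    return {"name_overlap": sorted(name_overlap), "leaks": leaks}
```

Called on final_data/train/images and final_data/val/images, a report with an empty `name_overlap` and a non-empty `leaks` map corresponds to the situation described above: clean by name, leaky by content.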

The 16 leaked train→val pairs

The script flagged 16 train files whose md5 matches a val image's md5. They map to 14 unique val images: two val images have two byte-identical train copies each (effectively triple-counted: 1 in val + 2 in train). All 16 train files are dropped:

| val image (kept) | train image (dropped) |
|---|---|
| 073102811.jpg | 073110312.jpg |
| 073131323.jpg | 073156472.jpg |
| 073134783.jpg | 073135322.jpg |
| 073135207.jpg | 073211885.jpg |
| 073140318.jpg | 073160106.jpg |
| 073223706.jpg | 07313935.jpg |
| 073237437.jpg | 073164539.jpg |
| 07325333.jpg | 07350841.jpg |
| 073255665.jpg | 073131044.jpg |
| 073264660.jpg | 073248381.jpg |
| 07331160.jpg | 073106425.jpg |
| 07373455.jpg | 073108773.jpg |

(plus 2 val images with 2 train copies each)

Full list in final_data_clean/dedup_manifest.json under category_A_cross_leak.

Same image, different masks

Curiously, the corresponding _mask.png files are not identical for any of the 14 pairs. The same source image was annotated twice with slightly different labels. So this is image-content leakage, not label leakage:

  • During training, the network saw the pixel pattern under one annotation.
  • During validation, it was scored against a different annotation of the same pixels.
  • Net effect: the network has a head start on those 14 images (it has memorized the visual features) but the val mask is held-out, so accuracy on them is a mix of memorization and generalization.

Cross-share exposure

Because the train-side copies of the 14 leaked val images are scattered throughout the train set, each data share was exposed to a different number of them:

| Data share | Leaked train copies seen |
|---|---|
| 25% | 3 / 14 |
| 50% | 7 / 14 |
| 100% | 14 / 14 |

So in the leaked study the 100% model had ~5× more "seen-during-training" val examples than the 25% model. This biases the data-scaling slope upward. Removing the leakage gives a cleaner read on what data volume actually buys you.
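A count like the one in the table could be produced along these lines. This is only a sketch: `leaked_exposure` is a hypothetical helper, and the filenames in the toy example are made-up placeholders, not files from the study.

```python
def leaked_exposure(subsets, leaked_train):
    """Count, per data share, how many leaked train files the share contains.

    subsets:      {share_label: list of train filenames in that share}
    leaked_train: set of train filenames known to duplicate a val image
    """
    return {
        share: sum(1 for name in names if name in leaked_train)
        for share, names in subsets.items()
    }


# Toy example with placeholder filenames (not the real dataset).
train = [f"img_{i:03d}.jpg" for i in range(20)]
subsets = {"25": train[:5], "50": train[:10], "100": train}
leaked = {"img_002.jpg", "img_007.jpg", "img_013.jpg", "img_019.jpg"}

exposure = leaked_exposure(subsets, leaked)
print(exposure)  # {'25': 1, '50': 2, '100': 4}
```

With the counts reported for the leaked study, the ratio between the 100% and 25% shares is 14 / 3 ≈ 4.7, which is where the ~5× figure comes from.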

Within-set duplication (extra cleanup)

The audit also found:

  • Within train: 41 hash-groups, 41 redundant files dropped (one canonical kept per group). These don't cause leakage but inflate the effective dataset size and over-weight the duplicate images during training.
  • Within val: 1 hash-group, 1 redundant file dropped. Left in place, it would over-weight one image during evaluation.

We deduplicated all three categories. Full manifest in final_data_clean/dedup_manifest.json.


2 · Methodology

Dataset

Source: final_data/ (the original, leaky dataset, left untouched as a historical record). Cleaned copy: final_data_clean/, built by dedupe_dataset.py.

Three categories of removal:

| Category | Side dropped | Rationale |
|---|---|---|
| A: cross-leak (val image's bytes appear in train) | drop from train | Preserves val set integrity. Standard practice: the val set is sacred. |
| B: within-train dupes | keep first (alphabetical), drop rest | Keeps one canonical copy per unique image. |
| C: within-val dupes | keep first, drop rest | Same. |

For every dropped file, dedup_manifest.json records: filename, side (train / val), reason (cross_leak / train_dup / val_dup), and the kept alias.

After cleaning, a sanity check confirms train_hashes ∩ val_hashes = ∅.
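The three removal categories can be sketched as a single pass over per-split hash maps. This is an illustration under assumptions, not the contents of dedupe_dataset.py: `plan_dedup` is a hypothetical name, the manifest is returned as a flat list, and the `kept` key stands in for whatever field the real manifest uses for the kept alias.

```python
from collections import defaultdict


def plan_dedup(train_hashes, val_hashes):
    """Return manifest entries for every file that should be dropped.

    train_hashes / val_hashes: {filename: md5 hex digest} for each split.
    """
    manifest = []

    def group(hashes):
        groups = defaultdict(list)
        for name in sorted(hashes):  # alphabetical, so names[0] is the canonical copy
            groups[hashes[name]].append(name)
        return groups

    train_groups, val_groups = group(train_hashes), group(val_hashes)

    for digest, names in train_groups.items():
        if digest in val_groups:
            # Category A: every train copy of a val image is dropped.
            kept, dropped, reason = val_groups[digest][0], names, "cross_leak"
        else:
            # Category B: keep the first train copy, drop the rest.
            kept, dropped, reason = names[0], names[1:], "train_dup"
        for name in dropped:
            manifest.append(
                {"filename": name, "side": "train", "reason": reason, "kept": kept}
            )

    # Category C: keep the first val copy, drop the rest.
    for names in val_groups.values():
        for name in names[1:]:
            manifest.append(
                {"filename": name, "side": "val", "reason": "val_dup", "kept": names[0]}
            )

    return manifest
```

Because a train hash that also appears in val is handled entirely by category A, a val image with two train copies contributes two cross_leak entries and no train_dup entry, which matches the 16 + 41 = 57 train files dropped.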

Architectures

All 4 baselines from pv_panel_models/, trained from scratch on the cleaned data:

| ID | Model | Source file | Notes |
|---|---|---|---|
| segnet | SegNet (CNN) | pv_panel_models/cnn_model/cnn_segmenter.py | Encoder/decoder with MaxPool indices for unpooling. forward applies sigmoid. |
| unet | U-Net | pv_panel_models/unet_model/unet_model.py | Classic skip-concatenation. |
| segformer_b0 | SegFormer mit-b0 | pv_panel_models/vit_model/segformer_model.py | Hugging Face, small. |
| segformer_b5 | SegFormer mit-b5 | pv_panel_models/segformer_b5_model/segformer_model.py | Hugging Face, large. |

Hyperparameters (identical across models, identical to original baselines)

| Setting | Value |
|---|---|
| Image size | 128 × 128 |
| Optimizer | Adam, lr = 1e-4 |
| Scheduler | ReduceLROnPlateau(mode='max', patience=5, factor=0.5) on val Dice |
| Loss | 0.5 · BCE + 0.5 · Dice (CombinedLoss) |
| Augmentations | RandomHorizontalFlip(p=0.5), RandomVerticalFlip(p=0.5), RandomRotation(15) |
| Epochs | 50 |
| Batch size | 16 |
| Random seed | 42 |
| Subset selection seed | 42 (same as the leaky run, so the 25/50/100% nesting structure is preserved across studies modulo the cleanup) |
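For reference, the combined loss can be written out as follows. The soft-Dice form and the smoothing constant ε are assumptions made for illustration; the exact definition is whatever CombinedLoss implements. Here p_i is the predicted foreground probability and y_i the ground-truth label of pixel i.

```latex
\mathcal{L} = 0.5\,\mathcal{L}_{\mathrm{BCE}} + 0.5\,\mathcal{L}_{\mathrm{Dice}},
\qquad
\mathcal{L}_{\mathrm{BCE}} = -\frac{1}{N}\sum_{i=1}^{N}\bigl[y_i \log p_i + (1 - y_i)\log(1 - p_i)\bigr],
\qquad
\mathcal{L}_{\mathrm{Dice}} = 1 - \frac{2\sum_{i} p_i y_i + \epsilon}{\sum_{i} p_i + \sum_{i} y_i + \epsilon}
```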

The point of holding hyperparameters fixed is that the only intentional differences between this study and the original baselines are:

  1. Training set is deduplicated.
  2. Metrics use a global confusion matrix (instead of the per-batch averaging the originals did).
  3. Reproducible seed.

Metrics (standardized)

We accumulate TP / FP / FN / TN over each entire epoch and compute:

| Metric | Formula |
|---|---|
| iou (foreground) | TP / (TP + FP + FN) |
| miou | mean(foreground IoU, background IoU) |
| dice | 2·TP / (2·TP + FP + FN) |
| pixel_acc | (TP + TN) / total |

This matches PASCAL/Cityscapes-style mIoU reporting. The per-batch averaging used in the original baselines gives slightly different values (especially when batches are imbalanced); here every model is evaluated identically.
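The difference between global accumulation and per-batch averaging can be seen on a toy case. The sketch below is pure Python over flat 0/1 label lists and is not the project's metrics.py; the function names and the choice to return 0.0 for an empty denominator are assumptions.

```python
def confusion_counts(pred, target):
    """Count TP, FP, FN, TN for two equal-length sequences of 0/1 labels."""
    tp = fp = fn = tn = 0
    for p, t in zip(pred, target):
        if p and t:
            tp += 1
        elif p:
            fp += 1
        elif t:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def metrics_from_counts(tp, fp, fn, tn):
    """Apply the formulas from the table to accumulated counts."""
    def safe(num, den):
        return num / den if den else 0.0  # assumed convention for empty classes

    iou_fg = safe(tp, tp + fp + fn)
    iou_bg = safe(tn, tn + fp + fn)  # background IoU: roles of TP and TN swapped
    return {
        "iou": iou_fg,
        "miou": (iou_fg + iou_bg) / 2,
        "dice": safe(2 * tp, 2 * tp + fp + fn),
        "pixel_acc": safe(tp + tn, tp + fp + fn + tn),
    }


def global_metrics(batches):
    """Sum the confusion counts over all batches, then compute metrics once."""
    totals = [0, 0, 0, 0]
    for pred, target in batches:
        for i, count in enumerate(confusion_counts(pred, target)):
            totals[i] += count
    return metrics_from_counts(*totals)


def per_batch_mean_iou(batches):
    """Compute IoU per batch and average the results (the older approach)."""
    ious = [metrics_from_counts(*confusion_counts(p, t))["iou"] for p, t in batches]
    return sum(ious) / len(ious)


batches = [
    ([1, 1, 0, 0], [1, 0, 0, 0]),  # IoU 0.5 on its own
    ([1, 1, 1, 1], [1, 1, 1, 1]),  # IoU 1.0 on its own
]
print(round(global_metrics(batches)["iou"], 4))  # 0.8333
print(per_batch_mean_iou(batches))               # 0.75
```

The second batch contains more foreground pixels, so it carries more weight in the global count; per-batch averaging weights both batches equally, which is why the two numbers diverge.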

Subset construction

subsets/make_subsets.py reads final_data_clean/train/images/, sorts filenames, shuffles once with random.Random(42), and writes:

  • subset_25.txt: first 25% of the shuffled list
  • subset_50.txt: first 50%
  • subset_100.txt: full list

The script asserts 25 ⊂ 50 ⊂ 100. The files are plaintext, one filename per line; both the trainer and the dashboard read these as the single source of truth.
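The nesting follows from taking prefixes of one shuffled list. A minimal sketch of that idea, assuming floor rounding for the subset sizes (the real make_subsets.py may round differently) and using the hypothetical name `make_nested_subsets`:

```python
import random


def make_nested_subsets(filenames, seed=42, shares=(25, 50, 100)):
    """Return {share: filenames} where each share is a prefix of one shuffled list.

    shares must be in ascending order for the nesting check below.
    """
    ordered = sorted(filenames)            # deterministic starting order
    random.Random(seed).shuffle(ordered)   # one shuffle, shared by all shares
    subsets = {s: ordered[: len(ordered) * s // 100] for s in shares}

    # Each smaller share must be contained in the next larger one.
    for small, large in zip(shares, shares[1:]):
        assert set(subsets[small]) <= set(subsets[large]), "nesting violated"
    return subsets
```

Writing each list to subset_{share}.txt with one filename per line would then give the plaintext files described above. Because the shuffle uses a private `random.Random(seed)` instance, the result does not depend on any other use of the global random state.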

What we save per run

For every (model, share) pair:

  • checkpoints/{model}_{share}_best.pth: state dict at the highest val Dice across all 50 epochs (plus epoch number, val metrics, model name, share, and output_is_prob flag for SegNet)
  • checkpoints/{model}_{share}_final.pth: state dict at epoch 50
  • logs/{model}_{share}.json: per-epoch JSON with train_* / val_* for {loss, dice, iou, miou, pixel_acc}, plus epoch_seconds, train_seconds, val_seconds, plus top-level wall-clock totals and ISO timestamps
  • logs/{model}_{share}.stdout.log: captured stdout from run_all.sh

Logs are written incrementally, so it is safe to interrupt and inspect them mid-training.
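One way to get an incrementally written log that is always readable is to rewrite the JSON file atomically after each epoch. The sketch below shows that pattern only; the `epochs` list layout and the helper names are assumptions, and train.py may do this differently. The `val_dice` key follows the val_* naming described above.

```python
import json
import os
import tempfile


def append_epoch(log_path, record):
    """Add one epoch record to a JSON log and rewrite the file atomically."""
    if os.path.exists(log_path):
        with open(log_path) as f:
            log = json.load(f)
    else:
        log = {"epochs": []}  # assumed layout: one dict per epoch
    log["epochs"].append(record)

    # Write to a temp file in the same directory, then swap it into place,
    # so a reader never sees a half-written log.
    directory = os.path.dirname(os.path.abspath(log_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(log, f, indent=2)
    os.replace(tmp_path, log_path)
    return log


def best_epoch(log_path, key="val_dice"):
    """Return the epoch record with the highest value of `key`."""
    with open(log_path) as f:
        log = json.load(f)
    return max(log["epochs"], key=lambda e: e[key])
```

Reading the log while training is still running then just means calling `best_epoch` on the current file; the record it returns is the epoch a best checkpoint selected by val Dice would correspond to.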


3 · Repo layout

experiments/clean_data_scaling_study/
├── README.md                     # this file
├── requirements.txt
├── dedupe_dataset.py             # builds final_data_clean/ + dedup_manifest.json
├── subsets/
│   ├── make_subsets.py           # builds nested subsets from the cleaned train list
│   ├── subset_25.txt             # written by make_subsets.py
│   ├── subset_50.txt
│   └── subset_100.txt
├── dataset.py                    # SubsetSolarPanelDataset
├── metrics.py                    # global confusion-matrix mIoU/IoU/Dice/PixelAcc
├── models.py                     # 4 builders: segnet/unet/segformer_b0/segformer_b5
├── train.py                      # unified trainer (--model, --share)
├── run_all.sh                    # 12 trainings sequentially
├── checkpoints/                  # populated during training (24 files: 12 best + 12 final)
├── logs/                         # populated during training (12 JSONs + 12 stdout logs)
└── dashboard/
    └── app.py                    # Streamlit dashboard

4 · How to run

# 0. Install (system python on this machine; adjust if you use a venv)
pip install --user --break-system-packages -r requirements.txt

# 1. Build the deduplicated dataset (writes ../../final_data_clean/)
python dedupe_dataset.py            # interactive
python dedupe_dataset.py --dry-run  # report only, no copy
python dedupe_dataset.py --force    # overwrite if final_data_clean/ exists

# 2. Build the nested subsets (writes subsets/subset_*.txt)
python subsets/make_subsets.py

# 3. Train. Each run takes 5–75 min on a single GPU depending on model + share.
PYTHON=/usr/bin/python3 ./run_all.sh                 # all 12 runs (~6 hours)
PYTHON=/usr/bin/python3 ./run_all.sh segformer_b0    # one model × 3 shares
python train.py --model unet --share 25              # single run

# 4. Dashboard
streamlit run dashboard/app.py
# → http://localhost:8501

5 · Reading the results

The dashboard's three tabs:

  1. Learning curves: switchable metric (Dice / mIoU / IoU / PixelAcc / Loss), train+val toggle, one chart per architecture with the three data shares overlaid.
  2. Data share vs final: best- vs final-epoch toggle, four charts (mIoU, Dice, IoU, PixelAcc) by data share with the four architectures as separate lines / bars. Plus per-run wall-clock and seconds-per-epoch breakdown.
  3. Inference: drop in any image, see the 4×3 = 12-panel grid of predictions, side-by-side. Toggle threshold, view (mask / overlay / heatmap), and best vs final.

6 · Caveats

  • 128×128 resolution. SegFormer architectures generally benefit from higher resolution; the comparison is fair across architectures here, but absolute SegFormer numbers would likely improve at 256+ input sizes.
  • Single seed. Each (model, share) is one training run. Multiple seeds would tighten error bars; we did not do that to keep the GPU budget reasonable.
  • Mask inconsistency for the 14 leaked pairs. We dropped the train copy and kept the val copy, but the val mask was annotated separately from the dropped train mask, so we did lose some training signal. The trade-off favors evaluation cleanliness.
  • Comparison to the leaked run. The previous experiments/data_scaling_study/ used per-batch averaged metrics and 2 architectures (U-Net + SegFormer-B0); this run uses global metrics and 4 architectures. So absolute numbers are not directly comparable; only the trends in the leaked run can be cross-referenced with the trends here.