Clean Data-Scaling Study
Why this experiment exists: a previous data-scaling study (experiments/data_scaling_study/) was run on a dataset with a small but real train↔val image leak. This experiment redoes everything from scratch on a deduplicated dataset, with all four baseline architectures and no shortcuts.
TL;DR
- What we found: 14 of 1,331 validation images (1.05%) had byte-identical pixel content somewhere in the training set under different filenames. Two of those 14 val images had two byte-identical copies in train each, so removing the leak required dropping 16 train files. Corresponding masks differed for every pair: the same image was annotated twice with slightly different labels.
- Plus: 41 within-train duplicate groups (41 redundant train files dropped, one canonical kept per group), and 1 within-val duplicate group (1 redundant val file dropped).
- Final cleaned dataset: train 5,325 → 5,268 (-57), val 1,331 → 1,330 (-1).
- Why it matters: small effect on absolute val numbers (≤ ~1%, an upper bound), zero effect on cross-architecture comparisons at the same data share, but biases the data-scaling slope (the 100% point is exposed to ~5× more leaked images than 25%).
- What we did:
  - Built a deduplicated final_data_clean/: cross-leaks dropped from train (preserving val), within-train and within-val duplicates collapsed to one canonical copy each.
  - Recomputed the nested 25 / 50 / 100% subsets from the cleaned train list (same seed=42, so 25 ⊂ 50 ⊂ 100 still holds).
  - Retrained all 4 baselines (SegNet, U-Net, SegFormer-B0, SegFormer-B5) at all 3 data shares = 12 trainings from scratch, no bootstrapping, both best- and final-epoch checkpoints saved.
  - Same global confusion-matrix metric code across every model, so all numbers are directly comparable.
1 · The leakage, in detail
How we found it
A simple integrity audit on final_data/:
# train_files = sorted(*.jpg in final_data/train/images)
# val_files = sorted(*.jpg in final_data/val/images)
# 1. Filename overlap?
train_names ∩ val_names → ∅    # 0 collisions, looked clean by name
# 2. Content overlap by md5?
train_hashes ∩ val_hashes → 14 unique val images (16 train counterparts)
So the dataset passed a naïve filename-only check but failed the content-hash check. That's how this kind of leakage typically slips past curation.
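The two checks above can be sketched in a few lines of Python. This is a minimal illustration, not the audit script itself: the function names (`md5_of_file`, `hash_index`, `audit`) are hypothetical, and it assumes the hash is taken over the raw file bytes of each `*.jpg`.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex md5 digest of a file's raw bytes, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hash_index(image_dir, pattern="*.jpg"):
    """Map md5 digest -> list of filenames (in sorted order) sharing that content."""
    index = {}
    for path in sorted(Path(image_dir).glob(pattern)):
        index.setdefault(md5_of_file(path), []).append(path.name)
    return index


def audit(train_dir, val_dir):
    """Report filename collisions and content collisions between two splits."""
    train_index = hash_index(train_dir)
    val_index = hash_index(val_dir)
    train_names = {name for names in train_index.values() for name in names}
    val_names = {name for names in val_index.values() for name in names}
    shared_hashes = set(train_index) & set(val_index)
    return {
        "name_collisions": sorted(train_names & val_names),
        "leaked_val": sorted(n for h in shared_hashes for n in val_index[h]),
        "leaked_train": sorted(n for h in shared_hashes for n in train_index[h]),
    }
```

Calling `audit("final_data/train/images", "final_data/val/images")` would return the three lists; an empty `name_collisions` alongside non-empty `leaked_*` lists is the pattern described above.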
The 16 leaked train↔val pairs
The script flagged 16 train files whose md5 matches a val image's md5. They map to 14 unique val images: two val images have two byte-identical train copies each (effectively triple-counted: 1 in val + 2 in train). All 16 train files are dropped:
| val image (kept) | train image (dropped) |
|---|---|
| 073102811.jpg | 073110312.jpg |
| 073131323.jpg | 073156472.jpg |
| 073134783.jpg | 073135322.jpg |
| 073135207.jpg | 073211885.jpg |
| 073140318.jpg | 073160106.jpg |
| 073223706.jpg | 07313935.jpg |
| 073237437.jpg | 073164539.jpg |
| 07325333.jpg | 07350841.jpg |
| 073255665.jpg | 073131044.jpg |
| 073264660.jpg | 073248381.jpg |
| 07331160.jpg | 073106425.jpg |
| 07373455.jpg | 073108773.jpg |
| (plus 2 val images with 2 train copies each) | |
Full list in final_data_clean/dedup_manifest.json under category_A_cross_leak.
Same image, different masks
Curiously, the corresponding _mask.png files are not identical for any of the 14 pairs. The same source image was annotated twice with slightly different labels. So this is image-content leakage, not label leakage:
- During training, the network saw the pixel pattern under one annotation.
- During validation, it was scored against a different annotation of the same pixels.
- Net effect: the network has a head start on those 14 images (it has memorized the visual features) but the val mask is held-out, so accuracy on them is a mix of memorization and generalization.
Cross-share exposure
Because the 14 leaked train images are mixed throughout the train set, each data share saw a different number of them:
| Data share | Leaked train copies seen |
|---|---|
| 25% | 3 / 14 |
| 50% | 7 / 14 |
| 100% | 14 / 14 |
So in the leaked study the 100% model had ~5× more "seen-during-training" val examples than the 25% model. This biases the data-scaling slope upward. Removing the leakage gives a cleaner read on what data volume actually buys you.
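The per-share counts in the table can be recomputed from the subset lists, and the "~5×" figure is simply the ratio 14 / 3 ≈ 4.7. A small sketch, assuming each subset is available as a list of filenames and the leaked train files as a set; the function name `leaked_exposure` and the toy filenames are hypothetical:

```python
def leaked_exposure(subsets, leaked_train_files):
    """Count how many leaked train files each data share contains.

    subsets: dict mapping share label -> iterable of train filenames.
    leaked_train_files: train filenames whose content also appears in val.
    """
    leaked = set(leaked_train_files)
    return {share: len(leaked.intersection(files)) for share, files in subsets.items()}


# Toy example with made-up filenames (not the real dataset).
toy_subsets = {
    "25": ["a.jpg", "b.jpg"],
    "50": ["a.jpg", "b.jpg", "c.jpg", "d.jpg"],
    "100": ["a.jpg", "b.jpg", "c.jpg", "d.jpg", "e.jpg", "f.jpg"],
}
toy_counts = leaked_exposure(toy_subsets, {"b.jpg", "d.jpg", "f.jpg"})

# Ratio behind the "~5x" statement, using the counts from the table above.
exposure_ratio = 14 / 3
```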
Within-set duplication (extra cleanup)
The audit also found:
- Within train: 41 hash-groups, 41 redundant files dropped (one canonical kept per group). These don't cause leakage but inflate the effective dataset size and over-weight the duplicate images during training.
- Within val: 1 hash-group, 1 redundant file dropped. This over-weights one image during evaluation otherwise.
We deduplicated all three categories. Full manifest in final_data_clean/dedup_manifest.json.
2 · Methodology
Dataset
Source: final_data/ (the original, leaky dataset, left untouched as a historical record).
Cleaned copy: final_data_clean/, built by dedupe_dataset.py.
Three categories of removal:
| Category | Side dropped | Rationale |
|---|---|---|
| A: cross-leak (val image's bytes appear in train) | drop from train | Preserves val set integrity. Standard practice: the val set is sacred. |
| B: within-train dupes | keep first (alphabetical), drop rest | Keeps one canonical copy per unique image. |
| C: within-val dupes | keep first, drop rest | Same. |
For every dropped file the dedup_manifest.json records: filename, side (train / val), reason (cross_leak / train_dup / val_dup), and the kept alias.
After cleaning, a sanity check confirms train_hashes ∩ val_hashes = ∅.
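The three removal rules can be expressed as a small planning function over hash indexes. This is a sketch under stated assumptions rather than the contents of dedupe_dataset.py: `plan_dedup` is a hypothetical name, the inputs are assumed to be dicts mapping a content hash to its filenames, and the manifest key `kept_alias` is a guess at how "the kept alias" might be stored.

```python
def plan_dedup(train_index, val_index):
    """Decide which files to keep and which to drop.

    train_index / val_index: dict mapping content hash -> non-empty list of filenames.
    Returns (manifest, kept_train, kept_val).
    """
    manifest = []
    kept_train, kept_val = [], []

    def record(name, side, reason, alias):
        # "kept_alias" is a hypothetical key name for the surviving copy.
        manifest.append(
            {"filename": name, "side": side, "reason": reason, "kept_alias": alias}
        )

    # Category C: within-val duplicates, keep the alphabetically first file.
    for names in val_index.values():
        first, *rest = sorted(names)
        kept_val.append(first)
        for name in rest:
            record(name, "val", "val_dup", first)

    for digest, names in train_index.items():
        names = sorted(names)
        if digest in val_index:
            # Category A: every train copy of a val image is dropped.
            alias = min(val_index[digest])
            for name in names:
                record(name, "train", "cross_leak", alias)
            continue
        # Category B: within-train duplicates, keep the alphabetically first file.
        first, *rest = names
        kept_train.append(first)
        for name in rest:
            record(name, "train", "train_dup", first)

    return manifest, sorted(kept_train), sorted(kept_val)
```

Because a train hash that also occurs in val is dropped entirely, the hashes of the kept train files and the kept val files are disjoint by construction, which is what the sanity check verifies.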
Architectures
All 4 baselines from pv_panel_models/, trained from scratch on the cleaned data:
| ID | Model | Source class | Notes |
|---|---|---|---|
| segnet | SegNet (CNN) | pv_panel_models/cnn_model/cnn_segmenter.py | encoder/decoder w/ MaxPool indices for unpooling. forward applies sigmoid. |
| unet | U-Net | pv_panel_models/unet_model/unet_model.py | classic skip-concatenation. |
| segformer_b0 | SegFormer mit-b0 | pv_panel_models/vit_model/segformer_model.py | HuggingFace small. |
| segformer_b5 | SegFormer mit-b5 | pv_panel_models/segformer_b5_model/segformer_model.py | HuggingFace large. |
Hyperparameters (identical across models, identical to original baselines)
| Setting | Value |
|---|---|
| Image size | 128 × 128 |
| Optimizer | Adam, lr = 1e-4 |
| Scheduler | ReduceLROnPlateau(mode='max', patience=5, factor=0.5) on val Dice |
| Loss | 0.5 · BCE + 0.5 · Dice (CombinedLoss) |
| Augmentations | RandomHorizontalFlip(p=0.5), RandomVerticalFlip(p=0.5), RandomRotation(15) |
| Epochs | 50 |
| Batch size | 16 |
| Random seed | 42 |
| Subset selection seed | 42 (same as the leaky run, so the 25/50/100% nesting structure is preserved across studies modulo the cleanup) |
The point of holding hyperparameters fixed is that the only intentional differences between this study and the original baselines are:
- Training set is deduplicated.
- Metrics use a global confusion matrix (instead of the per-batch averaging the originals did).
- Reproducible seed.
Metrics (standardized)
We accumulate TP / FP / FN / TN over each entire epoch and compute:
| Metric | Formula |
|---|---|
| iou (foreground) | TP / (TP + FP + FN) |
| miou | mean(foreground IoU, background IoU) |
| dice | 2·TP / (2·TP + FP + FN) |
| pixel_acc | (TP + TN) / total |
This matches PASCAL/Cityscapes-style mIoU reporting. The per-batch averaging used in the original baselines differs slightly (especially when batches are imbalanced); here every model is evaluated identically.
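To make the difference concrete, here is a small pure-Python sketch of both aggregation styles. It is not the code in metrics.py: the function names are hypothetical, background IoU is assumed to be TN / (TN + FP + FN) (the usual definition with the classes swapped), and an empty denominator is assumed to yield 0.0.

```python
def metrics_from_counts(tp, fp, fn, tn):
    """Compute the four metrics from one set of confusion-matrix counts."""

    def safe_div(num, den):
        # Assumption: an empty denominator yields 0.0.
        return num / den if den else 0.0

    fg_iou = safe_div(tp, tp + fp + fn)
    bg_iou = safe_div(tn, tn + fp + fn)  # background treated as the positive class
    return {
        "iou": fg_iou,
        "miou": (fg_iou + bg_iou) / 2,
        "dice": safe_div(2 * tp, 2 * tp + fp + fn),
        "pixel_acc": safe_div(tp + tn, tp + fp + fn + tn),
    }


def global_metrics(batch_counts):
    """Sum (tp, fp, fn, tn) over all batches first, then compute metrics once."""
    totals = [sum(column) for column in zip(*batch_counts)]
    return metrics_from_counts(*totals)


def per_batch_mean_iou(batch_counts):
    """Compute foreground IoU per batch, then average the per-batch values."""
    values = [metrics_from_counts(*counts)["iou"] for counts in batch_counts]
    return sum(values) / len(values)


# Two deliberately imbalanced toy batches, as (tp, fp, fn, tn).
batches = [(80, 10, 10, 0), (0, 10, 0, 90)]
global_result = global_metrics(batches)
batch_mean = per_batch_mean_iou(batches)
```

On these toy batches the global foreground IoU is 80 / 110 ≈ 0.727, while the per-batch mean is (0.8 + 0.0) / 2 = 0.4. The example is exaggerated on purpose; on real batches the gap is typically much smaller.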
Subset construction
subsets/make_subsets.py reads final_data_clean/train/images/, sorts filenames, shuffles once with random.Random(42), and writes:
- subset_25.txt: first 25% of the shuffled list
- subset_50.txt: first 50%
- subset_100.txt: full list
Asserts 25 ⊂ 50 ⊂ 100. Plaintext, one filename per line; both the trainer and the dashboard read these as the single source of truth.
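The nesting follows directly from taking prefixes of one shuffled list. A minimal sketch of that idea, assuming floor rounding for the subset sizes (the rounding rule and the function name `make_nested_subsets` are assumptions, not taken from make_subsets.py):

```python
import random


def make_nested_subsets(filenames, seed=42, shares=(25, 50, 100)):
    """Return {share: filenames}, where each share is a prefix of one shuffled list."""
    shuffled = sorted(filenames)            # sort first so the result ignores input order
    random.Random(seed).shuffle(shuffled)   # one deterministic shuffle
    subsets = {}
    for share in shares:
        count = len(shuffled) * share // 100  # assumption: floor rounding
        subsets[share] = shuffled[:count]
    return subsets


names = [f"img_{i:04d}.jpg" for i in range(200)]
subsets = make_nested_subsets(names)

# Prefixes of the same list are nested by construction.
assert set(subsets[25]) <= set(subsets[50]) <= set(subsets[100])
```

Sorting before shuffling means the subsets depend only on the set of filenames and the seed, not on the order in which the directory listing is returned.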
What we save per run
For every (model, share) pair:
- checkpoints/{model}_{share}_best.pth: state dict at the highest val Dice across all 50 epochs (plus epoch number, val metrics, model name, share, and output_is_prob flag for SegNet)
- checkpoints/{model}_{share}_final.pth: state dict at epoch 50
- logs/{model}_{share}.json: per-epoch JSON with train_* / val_* for {loss, dice, iou, miou, pixel_acc}, plus epoch_seconds, train_seconds, val_seconds, plus top-level wall-clock totals and ISO timestamps
- logs/{model}_{share}.stdout.log: captured stdout from run_all.sh
Logs are written incrementally, so it is safe to interrupt and inspect mid-training.
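One common way to get interrupt-safe incremental logs is to rewrite the whole JSON file after every epoch through a temporary file and an atomic rename. The sketch below illustrates that pattern only; it is an assumption about how such logging could work, not a description of train.py, and the `epochs` key and function names are hypothetical.

```python
import json
import os
import tempfile


def write_log_atomically(log_path, payload):
    """Write JSON to a temp file in the same directory, then rename over the target."""
    directory = os.path.dirname(os.path.abspath(log_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(payload, handle, indent=2)
    os.replace(tmp_path, log_path)  # readers see either the old or the new file


def append_epoch(log_path, epoch_record):
    """Load the existing log (if any), append one epoch record, and rewrite it."""
    if os.path.exists(log_path):
        with open(log_path) as handle:
            payload = json.load(handle)
    else:
        payload = {"epochs": []}  # hypothetical top-level layout
    payload["epochs"].append(epoch_record)
    write_log_atomically(log_path, payload)
    return payload
```

With this pattern, a reader such as the dashboard always finds a complete, parseable JSON file, even if training is killed mid-epoch.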
3 · Repo layout
experiments/clean_data_scaling_study/
├── README.md                  # this file
├── requirements.txt
├── dedupe_dataset.py          # builds final_data_clean/ + dedup_manifest.json
├── subsets/
│   ├── make_subsets.py        # builds nested subsets from the cleaned train list
│   ├── subset_25.txt          # written by make_subsets.py
│   ├── subset_50.txt
│   └── subset_100.txt
├── dataset.py                 # SubsetSolarPanelDataset
├── metrics.py                 # global confusion-matrix mIoU/IoU/Dice/PixelAcc
├── models.py                  # 4 builders: segnet/unet/segformer_b0/segformer_b5
├── train.py                   # unified trainer (--model, --share)
├── run_all.sh                 # 12 trainings sequentially
├── checkpoints/               # populated during training (24 files: 12 best + 12 final)
├── logs/                      # populated during training (12 JSONs + 12 stdout logs)
└── dashboard/
    └── app.py                 # Streamlit dashboard
4 · How to run
# 0. Install (system python on this machine; adjust if you use a venv)
pip install --user --break-system-packages -r requirements.txt
# 1. Build the deduplicated dataset (writes ../../final_data_clean/)
python dedupe_dataset.py # interactive
python dedupe_dataset.py --dry-run # report only, no copy
python dedupe_dataset.py --force # overwrite if final_data_clean/ exists
# 2. Build the nested subsets (writes subsets/subset_*.txt)
python subsets/make_subsets.py
# 3. Train. Each run takes 5-75 min on a single GPU depending on model + share.
PYTHON=/usr/bin/python3 ./run_all.sh # all 12 runs (~6 hours)
PYTHON=/usr/bin/python3 ./run_all.sh segformer_b0 # one model × 3 shares
python train.py --model unet --share 25 # single run
# 4. Dashboard
streamlit run dashboard/app.py
# → http://localhost:8501
5 · Reading the results
The dashboard's three tabs:
- Learning curves: switchable metric (Dice / mIoU / IoU / PixelAcc / Loss), train+val toggle, one chart per architecture with the three data shares overlaid.
- Data share vs final: best- vs final-epoch toggle, four charts (mIoU, Dice, IoU, PixelAcc) by data share with the four architectures as separate lines / bars. Plus per-run wall-clock and seconds-per-epoch breakdown.
- Inference: drop in any image, see the 4×3 = 12-panel grid of predictions, side-by-side. Toggle threshold, view (mask/overlay/heatmap), and best vs final.
6 · Caveats
- 128×128 resolution. SegFormer architectures generally benefit from higher resolution; the comparison is fair across architectures here, but absolute SegFormer numbers would likely improve at 256+ input sizes.
- Single seed. Each (model, share) is one training run. Multiple seeds would tighten error bars; we did not do that to keep the GPU budget reasonable.
- Mask inconsistency for the 14 leaked pairs. We dropped the train copy and kept the val copy, but the val mask was annotated separately from the dropped train mask, so we did lose some training signal. The trade-off favors evaluation cleanliness.
- Comparison to the leaked run. The previous experiments/data_scaling_study/ used per-batch averaged metrics and 2 architectures (U-Net + SegFormer-B0); this run uses global metrics and 4 architectures. So absolute numbers are not directly comparable; only the trends in the leaked run can be cross-referenced with the trends here.