# Clean Data-Scaling Study

> **Why this experiment exists**: a previous data-scaling study ([experiments/data_scaling_study/](../data_scaling_study/)) was run on a dataset with a small but real **train↔val image leak**. This experiment redoes everything from scratch on a deduplicated dataset, with all four baseline architectures, no shortcuts.

---

## TL;DR

- **What we found**: **14 of 1,331 validation images (1.05%)** had byte-identical pixel content somewhere in the training set under different filenames. Two of those 14 val images had **two** byte-identical copies in train each, so removing the leak required dropping **16 train files**. The corresponding *masks* differed for every pair: the same image was annotated twice with slightly different labels.
- **Plus**: 41 within-train duplicate groups (41 redundant train files dropped, one canonical kept per group), and 1 within-val duplicate group (1 redundant val file dropped).
- **Final cleaned dataset**: train 5,325 → 5,268 (-57), val 1,331 → 1,330 (-1).
- **Why it matters**: the leak has a small effect on absolute val numbers (at most about 1%) and zero effect on cross-architecture comparisons at the same data share, but it biases the data-scaling slope (the 100% point is exposed to ~5× more leaked images than the 25% point).
- **What we did**:
  1. Built a deduplicated `final_data_clean/`: cross-leaks dropped from train (preserving val), within-train and within-val duplicates collapsed to one canonical copy each.
  2. Recomputed the nested 25 / 50 / 100% subsets from the cleaned train list (same `seed=42`, so 25 ⊂ 50 ⊂ 100 still holds).
  3. Retrained all 4 baselines (SegNet, U-Net, SegFormer-B0, SegFormer-B5) at all 3 data shares = **12 trainings from scratch**, no bootstrapping, both best- and final-epoch checkpoints saved.
  4. Used the same global confusion-matrix metric code for every model, so all numbers are directly comparable.

---

## 1 · The leakage, in detail

### How we found it

A simple integrity audit on `final_data/`:

```python
import hashlib
from pathlib import Path

train_files = sorted(Path("final_data/train/images").glob("*.jpg"))
val_files = sorted(Path("final_data/val/images").glob("*.jpg"))

def md5(path):
    return hashlib.md5(path.read_bytes()).hexdigest()

# 1. Filename overlap? Result: empty set (0 collisions, looked clean by name).
name_overlap = {p.name for p in train_files} & {p.name for p in val_files}

# 2. Content overlap by md5? Result: 14 unique val images (16 train counterparts).
hash_overlap = {md5(p) for p in train_files} & {md5(p) for p in val_files}
```

So the dataset *passed* a naïve filename-only check but **failed** the content-hash check. That's how this kind of leakage typically slips past curation.

### The 16 leaked train→val pairs

The script flagged 16 train files whose md5 matches a val image's md5. They map to 14 unique val images: two val images have *two* byte-identical train copies each (effectively triple-counted: 1 in val + 2 in train). All 16 train files are dropped:

| val image (kept) | train image (dropped) |
|---|---|
| `073102811.jpg` | `073110312.jpg` |
| `073131323.jpg` | `073156472.jpg` |
| `073134783.jpg` | `073135322.jpg` |
| `073135207.jpg` | `073211885.jpg` |
| `073140318.jpg` | `073160106.jpg` |
| `073223706.jpg` | `07313935.jpg` |
| `073237437.jpg` | `073164539.jpg` |
| `07325333.jpg` | `07350841.jpg` |
| `073255665.jpg` | `073131044.jpg` |
| `073264660.jpg` | `073248381.jpg` |
| `07331160.jpg` | `073106425.jpg` |
| `07373455.jpg` | `073108773.jpg` |
| *(plus 2 val images with 2 train copies each)* | |

Full list in `final_data_clean/dedup_manifest.json` under `category_A_cross_leak`.

### Same image, different masks

Curiously, the corresponding `_mask.png` files are **not** identical for any of the leaked pairs. The same source image was annotated twice with slightly different labels. So this is **image-content leakage, not label leakage**:

- During training, the network saw the pixel pattern under one annotation.
- During validation, it was scored against a different annotation of the same pixels.
- Net effect: the network has a head start on those 14 val images (it has memorized the visual features), but the val mask is held out, so accuracy on them is a mix of memorization and generalization.

### Cross-share exposure

Because the leaked train images are mixed throughout the train set, each data share saw a different number of them:

| Data share | Leaked train copies seen |
|---|---:|
| 25% | 3 / 14 |
| 50% | 7 / 14 |
| 100% | 14 / 14 |

So in the **leaked** study the 100% model had ~5× more "seen-during-training" val examples than the 25% model. This biases the data-scaling slope upward. Removing the leakage gives a cleaner read on what data volume actually buys you.

### Within-set duplication (extra cleanup)

The audit also found:

- **Within train**: 41 hash-groups, 41 redundant files dropped (one canonical kept per group). These don't cause leakage, but they inflate the effective dataset size and over-weight the duplicated images during training.
- **Within val**: 1 hash-group, 1 redundant file dropped. Left in place, it would over-weight one image during evaluation.

We deduplicated all three categories. Full manifest in `final_data_clean/dedup_manifest.json`.

---

## 2 · Methodology

### Dataset

**Source**: `final_data/` (the original, leaky dataset, left untouched as historical record).

**Cleaned copy**: `final_data_clean/`, built by [dedupe_dataset.py](dedupe_dataset.py). Three categories of removal:

| Category | Side dropped | Rationale |
|---|---|---|
| **A: cross-leak** (val image's bytes appear in train) | drop from **train** | Preserves val set integrity. Standard practice: the val set is sacred. |
| **B: within-train dupes** | keep first (alphabetical), drop rest | Keeps one canonical copy per unique image. |
| **C: within-val dupes** | keep first, drop rest | Same. |

For every dropped file the `dedup_manifest.json` records: filename, side (train / val), reason (`cross_leak` / `train_dup` / `val_dup`), and the kept alias.
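
The three removal categories above can be sketched in a few lines. This is a minimal illustration, not the contents of `dedupe_dataset.py`: the function names, the record keys (`filename`, `side`, `reason`, `kept_alias`), and the choice to test for a cross-leak before a within-train duplicate are all assumptions made for the example.

```python
import hashlib
from pathlib import Path


def md5_by_name(image_dir):
    """Map each .jpg filename in image_dir to the md5 hex digest of its bytes."""
    return {
        p.name: hashlib.md5(p.read_bytes()).hexdigest()
        for p in sorted(Path(image_dir).glob("*.jpg"))
    }


def plan_dedup(train_hashes, val_hashes):
    """Return one manifest-style record per file to drop.

    Both arguments map filename -> content hash. Record keys are hypothetical.
    """
    records = []

    def drop(filename, side, reason, kept_alias):
        records.append({"filename": filename, "side": side,
                        "reason": reason, "kept_alias": kept_alias})

    # Category C: within val, keep the alphabetically first file per hash.
    val_owner = {}
    for name in sorted(val_hashes):
        digest = val_hashes[name]
        if digest in val_owner:
            drop(name, "val", "val_dup", val_owner[digest])
        else:
            val_owner[digest] = name

    train_owner = {}
    for name in sorted(train_hashes):
        digest = train_hashes[name]
        if digest in val_owner:
            # Category A: the bytes also appear in val, so the train copy goes.
            drop(name, "train", "cross_leak", val_owner[digest])
        elif digest in train_owner:
            # Category B: a later duplicate within train.
            drop(name, "train", "train_dup", train_owner[digest])
        else:
            train_owner[digest] = name
    return records
```

A call such as `plan_dedup(md5_by_name("final_data/train/images"), md5_by_name("final_data/val/images"))` would return the drop list. Under these rules every train file whose hash appears in val is dropped, so the surviving train and val hash sets are disjoint by construction.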

After cleaning, a sanity check confirms `train_hashes ∩ val_hashes = ∅`.

### Architectures

All 4 baselines from [pv_panel_models/](../../pv_panel_models/), trained from scratch on the cleaned data:

| ID | Model | Source class | Notes |
|---|---|---|---|
| `segnet` | SegNet (CNN) | [`pv_panel_models/cnn_model/cnn_segmenter.py`](../../pv_panel_models/cnn_model/cnn_segmenter.py) | encoder/decoder w/ MaxPool indices for unpooling. **forward applies sigmoid**. |
| `unet` | U-Net | [`pv_panel_models/unet_model/unet_model.py`](../../pv_panel_models/unet_model/unet_model.py) | classic skip-concatenation. |
| `segformer_b0` | SegFormer mit-b0 | [`pv_panel_models/vit_model/segformer_model.py`](../../pv_panel_models/vit_model/segformer_model.py) | HuggingFace small. |
| `segformer_b5` | SegFormer mit-b5 | [`pv_panel_models/segformer_b5_model/segformer_model.py`](../../pv_panel_models/segformer_b5_model/segformer_model.py) | HuggingFace large. |

### Hyperparameters (identical across models, identical to original baselines)

| Setting | Value |
|---|---|
| Image size | 128 × 128 |
| Optimizer | Adam, lr = 1e-4 |
| Scheduler | `ReduceLROnPlateau(mode='max', patience=5, factor=0.5)` on val Dice |
| Loss | `0.5 · BCE + 0.5 · Dice` (`CombinedLoss`) |
| Augmentations | `RandomHorizontalFlip(p=0.5)`, `RandomVerticalFlip(p=0.5)`, `RandomRotation(15)` |
| Epochs | 50 |
| Batch size | 16 |
| Random seed | 42 |
| Subset selection seed | 42 (same as the leaky run, so the 25/50/100% nesting structure is preserved across studies modulo the cleanup) |

The point of holding hyperparameters fixed is that the **only intentional differences** between this study and the original baselines are:

1. Training set is deduplicated.
2. Metrics use a global confusion matrix (instead of the per-batch averaging the originals did).
3. Reproducible seed.
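
To make difference 2 concrete, here is a small pure-Python sketch contrasting the two aggregation styles on a toy pair of batches. It is an illustration only, not the project's `metrics.py`: the function names are hypothetical, labels are assumed to be already thresholded to 0/1, and returning 0.0 for an empty denominator is an assumption.

```python
def confusion(pred, target):
    """Return (TP, FP, FN, TN) for two equal-length sequences of 0/1 labels."""
    tp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, target) if p == 0 and t == 1)
    tn = sum(1 for p, t in zip(pred, target) if p == 0 and t == 0)
    return tp, fp, fn, tn


def _ratio(num, den):
    # Assumption: report 0.0 when the denominator is zero.
    return num / den if den else 0.0


def metrics_from_counts(tp, fp, fn, tn):
    iou_fg = _ratio(tp, tp + fp + fn)
    iou_bg = _ratio(tn, tn + fn + fp)  # for background, TN plays the role of TP
    return {
        "iou": iou_fg,
        "miou": (iou_fg + iou_bg) / 2,
        "dice": _ratio(2 * tp, 2 * tp + fp + fn),
        "pixel_acc": _ratio(tp + tn, tp + fp + fn + tn),
    }


def global_metrics(batches):
    """Sum the confusion counts over all batches, then compute metrics once."""
    totals = [0, 0, 0, 0]
    for pred, target in batches:
        for i, count in enumerate(confusion(pred, target)):
            totals[i] += count
    return metrics_from_counts(*totals)


def per_batch_mean_iou(batches):
    """Compute foreground IoU per batch and average the per-batch values."""
    scores = [metrics_from_counts(*confusion(p, t))["iou"] for p, t in batches]
    return sum(scores) / len(scores)


batches = [
    ([1, 1, 1, 1], [1, 1, 1, 1]),  # foreground-heavy batch, predicted perfectly
    ([1, 0, 0, 0], [0, 1, 0, 0]),  # background-heavy batch, foreground missed
]
print(per_batch_mean_iou(batches), global_metrics(batches)["iou"])
```

On this toy input the per-batch mean IoU is 0.5, while the global IoU is 4/6 ≈ 0.667, because the global count weights each pixel equally rather than each batch. The gap grows as batches become more imbalanced.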

### Metrics (standardized)

We accumulate TP / FP / FN / TN over each entire epoch and compute:

| Metric | Formula |
|---|---|
| `iou` (foreground) | `TP / (TP + FP + FN)` |
| `miou` | `mean(foreground IoU, background IoU)` |
| `dice` | `2·TP / (2·TP + FP + FN)` |
| `pixel_acc` | `(TP + TN) / total` |

This matches PASCAL/Cityscapes-style mIoU reporting. The per-batch averaging used in the original baselines gives slightly different values (especially when batches are imbalanced); here every model is evaluated identically.

### Subset construction

[subsets/make_subsets.py](subsets/make_subsets.py) reads `final_data_clean/train/images/`, sorts filenames, shuffles once with `random.Random(42)`, and writes:

- `subset_25.txt`: first 25% of the shuffled list
- `subset_50.txt`: first 50%
- `subset_100.txt`: full list

The script asserts `25 ⊂ 50 ⊂ 100`. The files are plaintext, one filename per line; both the trainer and the dashboard read them as the single source of truth.

### What we save per run

For every (model, share) pair:

- `checkpoints/{model}_{share}_best.pth`: state dict at the highest val Dice across all 50 epochs (plus epoch number, val metrics, model name, share, and `output_is_prob` flag for SegNet)
- `checkpoints/{model}_{share}_final.pth`: state dict at epoch 50
- `logs/{model}_{share}.json`: per-epoch JSON with `train_*` / `val_*` for `{loss, dice, iou, miou, pixel_acc}`, plus `epoch_seconds`, `train_seconds`, `val_seconds`, plus top-level wall-clock totals and ISO timestamps
- `logs/{model}_{share}.stdout.log`: captured stdout from `run_all.sh`

Logs are written incrementally, so it is safe to interrupt and inspect them mid-training.
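
Returning to the subset construction described earlier in this section, the sort, single seeded shuffle, and prefix slicing can be sketched as follows. This is a hedged illustration rather than the code in `subsets/make_subsets.py`: the function name and the floor rounding of the subset sizes are assumptions.

```python
import random


def make_nested_subsets(filenames, seed=42, shares=(25, 50, 100)):
    """Return {share: list of filenames}, where smaller shares are prefixes."""
    # Sort first so the result does not depend on directory listing order.
    ordered = sorted(filenames)
    # One shuffle with a private RNG; every share slices the same ordering.
    random.Random(seed).shuffle(ordered)

    subsets = {}
    for share in shares:
        count = len(ordered) * share // 100  # assumption: round down
        subsets[share] = ordered[:count]

    # Nesting check: each smaller share must be contained in the next larger.
    ranked = sorted(subsets)
    for small, large in zip(ranked, ranked[1:]):
        assert set(subsets[small]) <= set(subsets[large])
    return subsets
```

Because every share is a prefix of one shuffled list, the nesting holds by construction and the assertion only guards against later edits. Writing each list with `"\n".join(...)` would produce the one-filename-per-line format the trainer and dashboard read.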

---

## 3 · Repo layout

```
experiments/clean_data_scaling_study/
├── README.md                  # this file
├── requirements.txt
├── dedupe_dataset.py          # builds final_data_clean/ + dedup_manifest.json
├── subsets/
│   ├── make_subsets.py        # builds nested subsets from the cleaned train list
│   ├── subset_25.txt          # written by make_subsets.py
│   ├── subset_50.txt
│   └── subset_100.txt
├── dataset.py                 # SubsetSolarPanelDataset
├── metrics.py                 # global confusion-matrix mIoU/IoU/Dice/PixelAcc
├── models.py                  # 4 builders: segnet/unet/segformer_b0/segformer_b5
├── train.py                   # unified trainer (--model, --share)
├── run_all.sh                 # 12 trainings sequentially
├── checkpoints/               # populated during training (24 files: 12 best + 12 final)
├── logs/                      # populated during training (12 JSONs + 12 stdout logs)
└── dashboard/
    └── app.py                 # Streamlit dashboard
```

---

## 4 · How to run

```bash
# 0. Install (system python on this machine; adjust if you use a venv)
pip install --user --break-system-packages -r requirements.txt

# 1. Build the deduplicated dataset (writes ../../final_data_clean/)
python dedupe_dataset.py            # interactive
python dedupe_dataset.py --dry-run  # report only, no copy
python dedupe_dataset.py --force    # overwrite if final_data_clean/ exists

# 2. Build the nested subsets (writes subsets/subset_*.txt)
python subsets/make_subsets.py

# 3. Train. Each run takes 5–75 min on a single GPU depending on model + share.
PYTHON=/usr/bin/python3 ./run_all.sh               # all 12 runs (~6 hours)
PYTHON=/usr/bin/python3 ./run_all.sh segformer_b0  # one model × 3 shares
python train.py --model unet --share 25            # single run

# 4. Dashboard
streamlit run dashboard/app.py      # → http://localhost:8501
```

---

## 5 · Reading the results

The dashboard's three tabs:

1. **Learning curves**: switchable metric (Dice / mIoU / IoU / PixelAcc / Loss), train+val toggle, one chart per architecture with the three data shares overlaid.
2. **Data share vs final**: best- vs final-epoch toggle, four charts (mIoU, Dice, IoU, PixelAcc) by data share with the four architectures as separate lines / bars, plus a per-run wall-clock and seconds-per-epoch breakdown.
3. **Inference**: drop in any image and see the 4×3 = 12-panel grid of predictions side by side. Toggle threshold, view (`mask` / `overlay` / `heatmap`), and best vs final.

---

## 6 · Caveats

- **128×128 resolution**. SegFormer architectures generally benefit from higher resolution; the comparison is fair across architectures here, but absolute SegFormer numbers would likely improve at 256+ input sizes.
- **Single seed**. Each (model, share) is one training run. Multiple seeds would tighten error bars; we did not do that to keep the GPU budget reasonable.
- **Mask inconsistency for the leaked pairs**. We dropped the train copy and kept the val copy, but the val mask was annotated separately from the dropped train mask, so we *did* lose some training signal. The trade-off favors evaluation cleanliness.
- **Comparison to the leaked run**. The previous [experiments/data_scaling_study/](../data_scaling_study/) used per-batch averaged metrics and 2 architectures (U-Net + SegFormer-B0); this run uses global metrics and 4 architectures. So absolute numbers are not directly comparable; only the *trends* in the leaked run can be cross-referenced with the trends here.