# Clean Data-Scaling Study

> **Why this experiment exists**: a previous data-scaling study ([experiments/data_scaling_study/](../data_scaling_study/)) was run on a dataset with a small but real **train↔val image leak**. This experiment redoes everything from scratch on a deduplicated dataset, with all four baseline architectures, no shortcuts.

---

## TL;DR
- **What we found**: **14 of 1,331 validation images (1.05%)** had byte-identical pixel content somewhere in the training set under different filenames. Two of those 14 val images had **two** byte-identical copies in train each, so removing the leak required dropping **16 train files**. The corresponding *masks* differed for every pair: the same image was annotated twice with slightly different labels.
- **Plus**: 41 within-train duplicate groups (41 redundant train files dropped, one canonical kept per group), and 1 within-val duplicate group (1 redundant val file dropped).
- **Final cleaned dataset**: train 5,325 → 5,268 (-57), val 1,331 → 1,330 (-1).
- **Why it matters**: the effect on absolute val numbers is small (upper bound of roughly 1%) and there is no effect on cross-architecture comparisons at the same data share, but the leak biases the data-scaling slope (the 100% point is exposed to ~5× more leaked images than the 25% point).
- **What we did**:
  1. Built a deduplicated `final_data_clean/`: cross-leaks dropped from train (preserving val), within-train and within-val duplicates collapsed to one canonical copy each.
  2. Recomputed the nested 25 / 50 / 100% subsets from the cleaned train list (same `seed=42`, so 25 ⊂ 50 ⊂ 100 still holds).
  3. Retrained all 4 baselines (SegNet, U-Net, SegFormer-B0, SegFormer-B5) at all 3 data shares = **12 trainings from scratch**, no bootstrapping, with both best- and final-epoch checkpoints saved.
  4. Used the same global confusion-matrix metric code for every model, so all numbers are directly comparable.

---

## 1 · The leakage, in detail

### How we found it

A simple integrity audit on `final_data/`:
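```python
# Sketch of the audit; observed results are noted in the comments.
import hashlib
from pathlib import Path

def md5(path):
    return hashlib.md5(path.read_bytes()).hexdigest()  # hash of the raw file bytes

train_files = sorted(Path("final_data/train/images").glob("*.jpg"))
val_files = sorted(Path("final_data/val/images").glob("*.jpg"))

# 1. Filename overlap? -> 0 collisions, looked clean by name
name_overlap = {p.name for p in train_files} & {p.name for p in val_files}

# 2. Content overlap by md5? -> 14 unique val images (16 train counterparts)
val_hashes = {md5(p) for p in val_files}
leaked_train = [p for p in train_files if md5(p) in val_hashes]
```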
So the dataset *passed* a naïve filename-only check but **failed** the content-hash check. That's how this kind of leakage typically slips past curation.

### The 16 leaked train↔val pairs

The script flagged 16 train files whose md5 matches a val image's md5. They map to 14 unique val images: two val images have *two* byte-identical train copies each (effectively triple-counted: 1 in val + 2 in train). All 16 train files are dropped:

| val image (kept) | train image (dropped) |
|---|---|
| `073102811.jpg` | `073110312.jpg` |
| `073131323.jpg` | `073156472.jpg` |
| `073134783.jpg` | `073135322.jpg` |
| `073135207.jpg` | `073211885.jpg` |
| `073140318.jpg` | `073160106.jpg` |
| `073223706.jpg` | `07313935.jpg` |
| `073237437.jpg` | `073164539.jpg` |
| `07325333.jpg` | `07350841.jpg` |
| `073255665.jpg` | `073131044.jpg` |
| `073264660.jpg` | `073248381.jpg` |
| `07331160.jpg` | `073106425.jpg` |
| `07373455.jpg` | `073108773.jpg` |
| *(plus 2 val images with 2 train copies each)* | |

Full list in `final_data_clean/dedup_manifest.json` under `category_A_cross_leak`.
### Same image, different masks

Curiously, the corresponding `_mask.png` files are **not** identical for any of the leaked pairs. The same source image was annotated twice with slightly different labels. So this is **image-content leakage, not label leakage**:

- During training, the network saw the pixel pattern under one annotation.
- During validation, it was scored against a different annotation of the same pixels.
- Net effect: the network has a head start on those 14 val images (it has memorized the visual features), but the val mask is held out, so accuracy on them is a mix of memorization and generalization.
### Cross-share exposure

Because the 14 leaked train images are mixed throughout the train set, each data share saw a different number of them:

| Data share | Leaked train copies seen |
|---|---:|
| 25% | 3 / 14 |
| 50% | 7 / 14 |
| 100% | 14 / 14 |

So in the **leaked** study the 100% model had ~5× more "seen-during-training" val examples than the 25% model (14 vs. 3). This biases the data-scaling slope upward. Removing the leakage gives a cleaner read on what data volume actually buys you.
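
The per-share exposure counts boil down to a set intersection between each subset list and the list of leaked train files. The following is a minimal sketch of that bookkeeping, not the script used for the table: the function name `leaked_exposure` and the toy filenames are hypothetical.

```python
def leaked_exposure(subsets, leaked_train):
    """Count how many leaked train files each data share contains.

    subsets maps a share (e.g. 25) to its list of train filenames.
    """
    leaked = set(leaked_train)
    return {share: len(leaked & set(files)) for share, files in subsets.items()}

# Toy data with hypothetical filenames (not the real dataset).
train = [f"img_{i:02d}.jpg" for i in range(20)]
leaked = ["img_01.jpg", "img_07.jpg", "img_12.jpg", "img_18.jpg"]
subsets = {25: train[:5], 50: train[:10], 100: train}

exposure = leaked_exposure(subsets, leaked)
print(exposure)
```

With the counts in the table, the ratio between the 100% and 25% shares is 14 / 3 ≈ 4.7, which is where the "~5×" figure comes from.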
### Within-set duplication (extra cleanup)

The audit also found:

- **Within train**: 41 hash-groups, 41 redundant files dropped (one canonical kept per group). These don't cause leakage but inflate the effective dataset size and over-weight the duplicate images during training.
- **Within val**: 1 hash-group, 1 redundant file dropped. Left in place, it would over-weight one image during evaluation.

We deduplicated all three categories. Full manifest in `final_data_clean/dedup_manifest.json`.

---

## 2 · Methodology

### Dataset

**Source**: `final_data/` (the original, leaky dataset, left untouched as a historical record).

**Cleaned copy**: `final_data_clean/`, built by [dedupe_dataset.py](dedupe_dataset.py).

Three categories of removal:

| Category | Side dropped | Rationale |
|---|---|---|
| **A: cross-leak** (val image's bytes appear in train) | drop from **train** | Preserves val set integrity. Standard practice: the val set is sacred. |
| **B: within-train dupes** | keep first (alphabetical), drop rest | Keeps one canonical copy per unique image. |
| **C: within-val dupes** | keep first, drop rest | Same. |

For every dropped file the `dedup_manifest.json` records: filename, side (train / val), reason (`cross_leak` / `train_dup` / `val_dup`), and the kept alias.

After cleaning, a sanity check confirms `train_hashes ∩ val_hashes = ∅`.
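
The three removal rules can be sketched as a single pass over hash groups. This is an illustrative reconstruction under stated assumptions, not the contents of `dedupe_dataset.py`: the function names, the dictionary keys in each manifest entry, and the toy hashes are all hypothetical, and the inputs are assumed to be `filename -> content hash` mappings.

```python
from collections import defaultdict

def group_by_hash(hashes):
    """Map content hash -> filenames, listed in alphabetical order."""
    groups = defaultdict(list)
    for name in sorted(hashes):
        groups[hashes[name]].append(name)
    return groups

def plan_dedup(train_hashes, val_hashes):
    """Return one manifest entry per file to drop (key names are assumptions)."""
    train_groups = group_by_hash(train_hashes)
    val_groups = group_by_hash(val_hashes)
    manifest = []

    def drop(name, side, reason, kept):
        manifest.append({"filename": name, "side": side, "reason": reason, "kept": kept})

    # C: within-val duplicates, keep the first file and drop the rest.
    for names in val_groups.values():
        for name in names[1:]:
            drop(name, "val", "val_dup", names[0])

    for digest, names in train_groups.items():
        if digest in val_groups:
            # A: cross-leak, drop every train copy and keep the val image.
            for name in names:
                drop(name, "train", "cross_leak", val_groups[digest][0])
        else:
            # B: within-train duplicates, keep the first file and drop the rest.
            for name in names[1:]:
                drop(name, "train", "train_dup", names[0])
    return manifest

# Toy hashes (hypothetical). "h3" is a val image with two train copies.
train = {"a.jpg": "h1", "b.jpg": "h1", "c.jpg": "h2", "d.jpg": "h3", "e.jpg": "h3"}
val = {"v1.jpg": "h3", "v2.jpg": "h4", "v3.jpg": "h4"}
manifest = plan_dedup(train, val)

# Post-clean sanity check: no content hash is shared between train and val.
dropped = {(m["side"], m["filename"]) for m in manifest}
clean_train = {h for n, h in train.items() if ("train", n) not in dropped}
clean_val = {h for n, h in val.items() if ("val", n) not in dropped}
assert not (clean_train & clean_val)
```

Checking category A before category B matters in this sketch: a train duplicate group whose hash also appears in val loses all of its copies, which matches the "two train copies each" case described earlier.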
### Architectures

All 4 baselines from [pv_panel_models/](../../pv_panel_models/), trained from scratch on the cleaned data:

| ID | Model | Source class | Notes |
|---|---|---|---|
| `segnet` | SegNet (CNN) | [`pv_panel_models/cnn_model/cnn_segmenter.py`](../../pv_panel_models/cnn_model/cnn_segmenter.py) | Encoder/decoder with MaxPool indices for unpooling. **`forward` applies sigmoid**. |
| `unet` | U-Net | [`pv_panel_models/unet_model/unet_model.py`](../../pv_panel_models/unet_model/unet_model.py) | Classic skip-concatenation. |
| `segformer_b0` | SegFormer mit-b0 | [`pv_panel_models/vit_model/segformer_model.py`](../../pv_panel_models/vit_model/segformer_model.py) | HuggingFace small. |
| `segformer_b5` | SegFormer mit-b5 | [`pv_panel_models/segformer_b5_model/segformer_model.py`](../../pv_panel_models/segformer_b5_model/segformer_model.py) | HuggingFace large. |
### Hyperparameters (identical across models, identical to original baselines)

| Setting | Value |
|---|---|
| Image size | 128 × 128 |
| Optimizer | Adam, lr = 1e-4 |
| Scheduler | `ReduceLROnPlateau(mode='max', patience=5, factor=0.5)` on val Dice |
| Loss | `0.5 · BCE + 0.5 · Dice` (`CombinedLoss`) |
| Augmentations | `RandomHorizontalFlip(p=0.5)`, `RandomVerticalFlip(p=0.5)`, `RandomRotation(15)` |
| Epochs | 50 |
| Batch size | 16 |
| Random seed | 42 |
| Subset selection seed | 42 (same as the leaky run, so the 25/50/100% nesting structure is preserved across studies modulo the cleanup) |

The point of holding hyperparameters fixed is that the **only intentional differences** between this study and the original baselines are:

1. The training set is deduplicated.
2. Metrics use a global confusion matrix (instead of the per-batch averaging the originals did).
3. The seed is fixed for reproducibility.
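
To make the loss row of the hyperparameter table concrete, here is a small pure-Python sketch of `0.5 · BCE + 0.5 · Dice` on flat lists of probabilities and 0/1 targets. It is not the repository's `CombinedLoss`: the function name `combined_loss`, the clamping constant `eps`, and the Dice smoothing term `smooth` are assumptions chosen for illustration.

```python
import math

def combined_loss(probs, targets, eps=1e-7, smooth=1.0):
    """Return 0.5 * BCE + 0.5 * (1 - soft Dice).

    probs are probabilities in [0, 1]; targets are 0/1 labels.
    eps and smooth are assumed values, not taken from the original code.
    """
    n = len(probs)
    # Binary cross-entropy, with probabilities clamped away from log(0).
    bce = -sum(
        t * math.log(max(p, eps)) + (1 - t) * math.log(max(1 - p, eps))
        for p, t in zip(probs, targets)
    ) / n
    # Soft Dice coefficient computed on probabilities.
    intersection = sum(p * t for p, t in zip(probs, targets))
    dice = (2 * intersection + smooth) / (sum(probs) + sum(targets) + smooth)
    return 0.5 * bce + 0.5 * (1 - dice)

targets = [1, 0, 1]
perfect = combined_loss([1.0, 0.0, 1.0], targets)
uncertain = combined_loss([0.5, 0.5, 0.5], targets)
print(perfect, uncertain)
```

A perfect prediction drives both terms to zero, while an uninformative prediction of 0.5 everywhere is penalized by both the BCE and the Dice term.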
### Metrics (standardized)

We accumulate TP / FP / FN / TN over each entire epoch and compute:

| Metric | Formula |
|---|---|
| `iou` (foreground) | `TP / (TP + FP + FN)` |
| `miou` | `mean(foreground IoU, background IoU)` |
| `dice` | `2·TP / (2·TP + FP + FN)` |
| `pixel_acc` | `(TP + TN) / total` |

This matches PASCAL/Cityscapes-style mIoU reporting. The per-batch averaging used in the original baselines differs slightly (especially when batches are imbalanced); here every model is evaluated identically.
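
The difference between global accumulation and per-batch averaging is easiest to see on a tiny imbalanced example. The sketch below implements the four formulas from the table on flat 0/1 lists; it is not the code in `metrics.py`, the helper names are hypothetical, and returning 0.0 for an empty denominator is an assumed convention.

```python
def confusion_counts(pred, target):
    """Return (TP, FP, FN, TN) for flat lists of 0/1 predictions and labels."""
    tp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, target) if p == 0 and t == 1)
    tn = sum(1 for p, t in zip(pred, target) if p == 0 and t == 0)
    return tp, fp, fn, tn

def safe_div(num, den):
    # Assumed convention: an empty denominator yields 0.0.
    return num / den if den else 0.0

def metrics_from_counts(tp, fp, fn, tn):
    fg_iou = safe_div(tp, tp + fp + fn)
    bg_iou = safe_div(tn, tn + fp + fn)  # background: TN plays the role of TP
    return {
        "iou": fg_iou,
        "miou": (fg_iou + bg_iou) / 2,
        "dice": safe_div(2 * tp, 2 * tp + fp + fn),
        "pixel_acc": safe_div(tp + tn, tp + fp + fn + tn),
    }

# Two imbalanced toy batches: one all-foreground, one all-background.
batches = [
    ([1, 1, 1, 1], [1, 1, 1, 1]),
    ([1, 0, 0, 0], [0, 0, 0, 0]),
]

# Global: sum the counts over all batches, then compute metrics once.
totals = [0, 0, 0, 0]
for pred, target in batches:
    for i, count in enumerate(confusion_counts(pred, target)):
        totals[i] += count
global_metrics = metrics_from_counts(*totals)

# Per-batch: compute IoU on each batch, then average the results.
per_batch_iou = sum(
    metrics_from_counts(*confusion_counts(p, t))["iou"] for p, t in batches
) / len(batches)

print(global_metrics["iou"], per_batch_iou)
```

On this toy input the global foreground IoU is 0.8 while the per-batch average is 0.5, because a single false-positive pixel zeroes out the IoU of the all-background batch.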
### Subset construction

[subsets/make_subsets.py](subsets/make_subsets.py) reads `final_data_clean/train/images/`, sorts filenames, shuffles once with `random.Random(42)`, and writes:

- `subset_25.txt`: first 25% of the shuffled list
- `subset_50.txt`: first 50%
- `subset_100.txt`: full list

Asserts `25 ⊂ 50 ⊂ 100`. Plaintext, one filename per line; both the trainer and the dashboard read these as the single source of truth.
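
The sort, shuffle, and prefix steps can be sketched as follows. This is an approximation of what `make_subsets.py` is described as doing, not its actual source: the function name `make_nested_subsets` is hypothetical, and rounding the subset size down with integer division is an assumption.

```python
import random

def make_nested_subsets(filenames, seed=42, shares=(25, 50, 100)):
    """Build nested subsets as prefixes of one seeded shuffle."""
    order = sorted(filenames)             # sort first so input order is irrelevant
    random.Random(seed).shuffle(order)    # one shuffle with a private RNG
    # Assumption: subset sizes are rounded down.
    subsets = {share: order[: len(order) * share // 100] for share in shares}

    # Nesting check: each smaller share must be contained in the next larger one.
    ordered = sorted(shares)
    for small, large in zip(ordered, ordered[1:]):
        assert set(subsets[small]) <= set(subsets[large])
    return subsets

# Toy filenames (hypothetical).
files = [f"{i:03d}.jpg" for i in range(40)]
subsets = make_nested_subsets(files)
print({share: len(names) for share, names in subsets.items()})
```

Because every subset is a prefix of the same shuffled list, nesting holds by construction, and the assertion only guards against later changes to the slicing logic.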
### What we save per run

For every (model, share) pair:

- `checkpoints/{model}_{share}_best.pth`: state dict at the highest val Dice across all 50 epochs (plus epoch number, val metrics, model name, share, and `output_is_prob` flag for SegNet)
- `checkpoints/{model}_{share}_final.pth`: state dict at epoch 50
- `logs/{model}_{share}.json`: per-epoch JSON with `train_*` / `val_*` for `{loss, dice, iou, miou, pixel_acc}`, plus `epoch_seconds`, `train_seconds`, `val_seconds`, plus top-level wall-clock totals and ISO timestamps
- `logs/{model}_{share}.stdout.log`: captured stdout from `run_all.sh`

Logs are written incrementally, so it is safe to interrupt and inspect them mid-training.
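
One way to get the "safe to interrupt" property is to rewrite the whole JSON log atomically after every epoch, so a reader never sees a half-written file. The sketch below shows that technique together with the file naming scheme from the list above. It is an assumption about how this could be done, not the logic in `train.py`: the function names, the log keys, and the placeholder metric values are hypothetical, and `{share}` is assumed to be the bare number passed to `--share`.

```python
import json
import os
import tempfile

def run_paths(model, share):
    """Artifact paths for one (model, share) run, following the naming scheme."""
    stem = f"{model}_{share}"
    return {
        "best": f"checkpoints/{stem}_best.pth",
        "final": f"checkpoints/{stem}_final.pth",
        "log": f"logs/{stem}.json",
        "stdout": f"logs/{stem}.stdout.log",
    }

def write_log(path, log):
    """Write to a temp file in the same directory, then atomically replace."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(log, handle, indent=2)
    os.replace(tmp_path, path)

# Demo in a throwaway directory; metric values are placeholders, not results.
with tempfile.TemporaryDirectory() as workdir:
    log_path = os.path.join(workdir, "unet_25.json")
    log = {"model": "unet", "share": 25, "epochs": []}
    for epoch in (1, 2):
        log["epochs"].append({"epoch": epoch, "val_dice": None})
        write_log(log_path, log)  # the file is complete after every epoch
    with open(log_path) as handle:
        reloaded = json.load(handle)

print(len(reloaded["epochs"]))
```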

---

## 3 · Repo layout

```
experiments/clean_data_scaling_study/
├── README.md                 # this file
├── requirements.txt
├── dedupe_dataset.py         # builds final_data_clean/ + dedup_manifest.json
├── subsets/
│   ├── make_subsets.py       # builds nested subsets from the cleaned train list
│   ├── subset_25.txt         # written by make_subsets.py
│   ├── subset_50.txt
│   └── subset_100.txt
├── dataset.py                # SubsetSolarPanelDataset
├── metrics.py                # global confusion-matrix mIoU/IoU/Dice/PixelAcc
├── models.py                 # 4 builders: segnet/unet/segformer_b0/segformer_b5
├── train.py                  # unified trainer (--model, --share)
├── run_all.sh                # 12 trainings sequentially
├── checkpoints/              # populated during training (24 files: 12 best + 12 final)
├── logs/                     # populated during training (12 JSONs + 12 stdout logs)
└── dashboard/
    └── app.py                # Streamlit dashboard
```

---

## 4 · How to run

```bash
# 0. Install (system python on this machine; adjust if you use a venv)
pip install --user --break-system-packages -r requirements.txt

# 1. Build the deduplicated dataset (writes ../../final_data_clean/)
python dedupe_dataset.py              # interactive
python dedupe_dataset.py --dry-run    # report only, no copy
python dedupe_dataset.py --force      # overwrite if final_data_clean/ exists

# 2. Build the nested subsets (writes subsets/subset_*.txt)
python subsets/make_subsets.py

# 3. Train. Each run takes 5–75 min on a single GPU depending on model + share.
PYTHON=/usr/bin/python3 ./run_all.sh                # all 12 runs (~6 hours)
PYTHON=/usr/bin/python3 ./run_all.sh segformer_b0   # one model × 3 shares
python train.py --model unet --share 25             # single run

# 4. Dashboard
streamlit run dashboard/app.py
# → http://localhost:8501
```

---

## 5 · Reading the results

The dashboard's three tabs:

1. **Learning curves**: switchable metric (Dice / mIoU / IoU / PixelAcc / Loss), train+val toggle, one chart per architecture with the three data shares overlaid.
2. **Data share vs final**: best- vs final-epoch toggle, four charts (mIoU, Dice, IoU, PixelAcc) by data share with the four architectures as separate lines / bars. Plus per-run wall-clock and seconds-per-epoch breakdown.
3. **Inference**: drop in any image and see the 4×3 = 12-panel grid of predictions side by side. Toggle the threshold, the view (`mask` / `overlay` / `heatmap`), and best vs final.
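
The threshold toggle in the inference tab turns per-pixel scores into a binary mask. A minimal sketch of that step is shown here, under the assumption that models other than SegNet emit raw logits (the source only states that SegNet's `forward` applies sigmoid and that checkpoints carry an `output_is_prob` flag). The function names are hypothetical and this is not the code in `dashboard/app.py`.

```python
import math

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def to_mask(outputs, threshold=0.5, output_is_prob=False):
    """Binarize model outputs given as a flat list of per-pixel scores.

    If output_is_prob is True the scores are already probabilities;
    otherwise they are treated as logits (an assumption) and squashed first.
    """
    probs = outputs if output_is_prob else [sigmoid(x) for x in outputs]
    return [1 if p >= threshold else 0 for p in probs]

logit_mask = to_mask([0.0, 2.0, -2.0])                     # logits
prob_mask = to_mask([0.2, 0.7], output_is_prob=True)       # probabilities
strict_mask = to_mask([0.7], threshold=0.8, output_is_prob=True)
print(logit_mask, prob_mask, strict_mask)
```

Applying sigmoid twice to an output that is already a probability would compress all scores into roughly the 0.5 to 0.73 range and distort any threshold, which is why a flag of this kind is worth storing with the checkpoint.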

---

## 6 · Caveats

- **128×128 resolution**. SegFormer architectures generally benefit from higher resolution; the comparison is fair across architectures here, but absolute SegFormer numbers would likely improve at 256+ input sizes.
- **Single seed**. Each (model, share) is one training run. Multiple seeds would tighten error bars; we did not do that to keep the GPU budget reasonable.
- **Mask inconsistency for the 14 leaked pairs**. We dropped the train copy and kept the val copy, but the val mask was annotated separately from the dropped train mask, so we *did* lose some training signal. The trade-off favors evaluation cleanliness.
- **Comparison to the leaked run**. The previous [experiments/data_scaling_study/](../data_scaling_study/) used per-batch averaged metrics and 2 architectures (U-Net + SegFormer-B0); this run uses global metrics and 4 architectures. So absolute numbers are not directly comparable; only the *trends* in the leaked run can be cross-referenced with the trends here.