# Clean Data-Scaling Study

> **Why this experiment exists**: a previous data-scaling study ([experiments/data_scaling_study/](../data_scaling_study/)) was run on a dataset with a small but real **train↔val image leak**. This experiment redoes everything from scratch on a deduplicated dataset, with all four baseline architectures, no shortcuts.

---

## TL;DR

- **What we found**: **14 of 1,331 validation images (1.05%)** had byte-identical pixel content somewhere in the training set under different filenames. Two of those 14 val images had **two** byte-identical copies in train each, so removing the leak required dropping **16 train files**. Corresponding *masks* differed for every pair — same image annotated twice with slightly different labels.
- **Plus**: 41 within-train duplicate groups (41 redundant train files dropped, one canonical kept per group), and 1 within-val duplicate group (1 redundant val file dropped).
- **Final cleaned dataset**: train 5,325 → 5,268 (-57), val 1,331 → 1,330 (-1).
- **Why it matters**: the effect on absolute val numbers is small (an upper bound of roughly 1%) and cross-architecture comparisons at the same data share are unaffected, but the leak biases the data-scaling slope (the 100% point is exposed to ~5× more leaked images than the 25% point).
- **What we did**:
  1. Built a deduplicated `final_data_clean/`: cross-leaks dropped from train (preserving val), within-train and within-val duplicates collapsed to one canonical copy each.
  2. Recomputed the nested 25 / 50 / 100% subsets from the cleaned train list (same `seed=42`, so 25 ⊂ 50 ⊂ 100 still holds).
  3. Retrained all 4 baselines (SegNet, U-Net, SegFormer-B0, SegFormer-B5) at all 3 data shares = **12 trainings from scratch**, no bootstrapping, both best- and final-epoch checkpoints saved.
  4. Same global confusion-matrix metric code across every model, so all numbers are directly comparable.

---

## 1 · The leakage, in detail

### How we found it

A simple integrity audit on `final_data/`:

```python
# train_files = sorted(*.jpg in final_data/train/images)
# val_files   = sorted(*.jpg in final_data/val/images)

# 1. Filename overlap?
train_names ∩ val_names  →  ∅       # 0 collisions, looked clean by name

# 2. Content overlap by md5?
train_hashes ∩ val_hashes  →  14 unique val images (16 train counterparts)
```

So the dataset *passed* a naïve filename-only check but **failed** the content-hash check. That's how this kind of leakage typically slips past curation.
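
The two checks can be sketched in a few lines of runnable Python. This is a minimal illustration, not the audit script itself: the helper names `md5_by_name` and `audit` are hypothetical, and it assumes the images are `.jpg` files in flat directories as in the listing above.

```python
import hashlib
from pathlib import Path


def md5_by_name(folder):
    """Map each .jpg filename in `folder` to the md5 hex digest of its bytes."""
    return {
        p.name: hashlib.md5(p.read_bytes()).hexdigest()
        for p in sorted(Path(folder).glob("*.jpg"))
    }


def audit(train, val):
    """Compare two {filename: md5} maps by name and by content."""
    shared_hashes = set(train.values()) & set(val.values())
    return {
        # Check 1: same filename on both sides.
        "name_collisions": sorted(train.keys() & val.keys()),
        # Check 2: same bytes on both sides, whatever the filename.
        "leaked_val": sorted(n for n, h in val.items() if h in shared_hashes),
        "leaked_train": sorted(n for n, h in train.items() if h in shared_hashes),
    }
```

Calling `audit(md5_by_name("final_data/train/images"), md5_by_name("final_data/val/images"))` would then return the three lists whose lengths correspond to the counts reported above.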

### The 16 leaked train→val pairs

The script flagged 16 train files whose md5 matches a val image's md5. They map to 14 unique val images — two val images have *two* byte-identical train copies each (effectively triple-counted: 1 in val + 2 in train). All 16 train files are dropped:

| val image (kept) | train image (dropped) |
|---|---|
| `073102811.jpg` | `073110312.jpg` |
| `073131323.jpg` | `073156472.jpg` |
| `073134783.jpg` | `073135322.jpg` |
| `073135207.jpg` | `073211885.jpg` |
| `073140318.jpg` | `073160106.jpg` |
| `073223706.jpg` | `07313935.jpg` |
| `073237437.jpg` | `073164539.jpg` |
| `07325333.jpg` | `07350841.jpg` |
| `073255665.jpg` | `073131044.jpg` |
| `073264660.jpg` | `073248381.jpg` |
| `07331160.jpg` | `073106425.jpg` |
| `07373455.jpg` | `073108773.jpg` |
| *(plus 2 val images with 2 train copies each)* | |

Full list in `final_data_clean/dedup_manifest.json` under `category_A_cross_leak`.

### Same image, different masks

Curiously, the corresponding `_mask.png` files are **not** identical for any of the leaked pairs. The same source image was annotated twice with slightly different labels. So this is **image-content leakage, not label leakage**:

- During training, the network saw the pixel pattern under one annotation.
- During validation, it was scored against a different annotation of the same pixels.
- Net effect: the network has a head start on those 14 images (it has memorized the visual features) but the val mask is held-out, so accuracy on them is a mix of memorization and generalization.

### Cross-share exposure

Because the leaked train copies are scattered throughout the train set, each data share saw a different number of them:

| Data share | Leaked train copies seen |
|---|---:|
| 25% | 3 / 14 |
| 50% | 7 / 14 |
| 100% | 14 / 14 |

So in the **leaked** study the 100% model had ~5× more "seen-during-training" val examples than the 25% model. This biases the data-scaling slope upward. Removing the leakage gives a cleaner read on what data volume actually buys you.
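
Counting exposure per share is a set intersection between each subset list and the list of leaked train files. A minimal sketch, assuming a hypothetical helper `leak_exposure` and made-up filenames (the real counts are the ones in the table above, where 14 / 3 ≈ 4.7 gives the "~5×" figure):

```python
def leak_exposure(subsets, leaked_train):
    """Count how many leaked train files each data share contains.

    subsets: {share_label: iterable of train filenames}
    leaked_train: iterable of train filenames flagged as cross-leaks
    """
    leaked = set(leaked_train)
    return {share: len(leaked & set(names)) for share, names in subsets.items()}


# Toy example with made-up filenames (not the real dataset).
subsets = {
    "25": ["a.jpg", "b.jpg"],
    "50": ["a.jpg", "b.jpg", "c.jpg", "d.jpg"],
    "100": ["a.jpg", "b.jpg", "c.jpg", "d.jpg", "e.jpg", "f.jpg", "g.jpg", "h.jpg"],
}
exposure = leak_exposure(subsets, leaked_train=["b.jpg", "d.jpg", "g.jpg", "h.jpg"])
print(exposure)  # {'25': 1, '50': 2, '100': 4}
```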

### Within-set duplication (extra cleanup)

The audit also found:

- **Within train**: 41 hash-groups, 41 redundant files dropped (one canonical kept per group). These don't cause leakage but inflate the effective dataset size and over-weight the duplicate images during training.
- **Within val**: 1 hash-group, 1 redundant file dropped. Left in place, it would over-weight one image during evaluation.

We deduplicated all three categories. Full manifest in `final_data_clean/dedup_manifest.json`.

---

## 2 · Methodology

### Dataset

- **Source**: `final_data/` (the original, leaky dataset, left untouched as a historical record).
- **Cleaned copy**: `final_data_clean/`, built by [dedupe_dataset.py](dedupe_dataset.py).

Three categories of removal:

| Category | Side dropped | Rationale |
|---|---|---|
| **A — cross-leak** (val image's bytes appear in train) | drop from **train** | Preserves val set integrity. Standard practice — the val set is sacred. |
| **B — within-train dupes** | keep first (alphabetical), drop rest | Keeps one canonical copy per unique image. |
| **C — within-val dupes** | keep first, drop rest | Same. |

For every dropped file the `dedup_manifest.json` records: filename, side (train / val), reason (`cross_leak` / `train_dup` / `val_dup`), and the kept alias.

After cleaning, a sanity check confirms `train_hashes ∩ val_hashes = ∅`.
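
The three-category policy can be sketched as a function over `{filename: md5}` maps. This is an illustration under assumptions, not the contents of `dedupe_dataset.py`: the names `group_by_hash` and `plan_dedup`, the manifest key names, and the order in which the categories are applied are all hypothetical.

```python
from collections import defaultdict


def group_by_hash(files):
    """Group filenames by content hash, each group in alphabetical order."""
    groups = defaultdict(list)
    for name in sorted(files):
        groups[files[name]].append(name)
    return groups


def plan_dedup(train, val):
    """Build drop entries from {filename: md5} maps for train and val."""
    manifest = []

    def drop(name, side, reason, kept):
        # Key names are assumed; the source only lists the recorded fields.
        manifest.append(
            {"filename": name, "side": side, "reason": reason, "kept_alias": kept}
        )

    # C: within-val duplicates, keep the alphabetically first file per hash.
    val_groups = group_by_hash(val)
    for names in val_groups.values():
        for dup in names[1:]:
            drop(dup, "val", "val_dup", names[0])

    # A: cross-leaks are dropped from train; the val copy is kept.
    remaining = {}
    for name, digest in train.items():
        if digest in val_groups:
            drop(name, "train", "cross_leak", val_groups[digest][0])
        else:
            remaining[name] = digest

    # B: within-train duplicates among the files that survived step A.
    for names in group_by_hash(remaining).values():
        for dup in names[1:]:
            drop(dup, "train", "train_dup", names[0])
    return manifest
```

Applying category A before B means a train file that is both a cross-leak and a within-train duplicate is recorded once, as a cross-leak.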

### Architectures

All 4 baselines from [pv_panel_models/](../../pv_panel_models/), trained from scratch on the cleaned data:

| ID | Model | Source class | Notes |
|---|---|---|---|
| `segnet` | SegNet (CNN) | [`pv_panel_models/cnn_model/cnn_segmenter.py`](../../pv_panel_models/cnn_model/cnn_segmenter.py) | encoder/decoder w/ MaxPool indices for unpooling. **forward applies sigmoid**. |
| `unet` | U-Net | [`pv_panel_models/unet_model/unet_model.py`](../../pv_panel_models/unet_model/unet_model.py) | classic skip-concatenation. |
| `segformer_b0` | SegFormer mit-b0 | [`pv_panel_models/vit_model/segformer_model.py`](../../pv_panel_models/vit_model/segformer_model.py) | HuggingFace small. |
| `segformer_b5` | SegFormer mit-b5 | [`pv_panel_models/segformer_b5_model/segformer_model.py`](../../pv_panel_models/segformer_b5_model/segformer_model.py) | HuggingFace large. |
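
Because SegNet's forward pass already applies a sigmoid while the other models presumably return raw logits, any shared inference code has to avoid applying the sigmoid twice. A minimal sketch of that branch on a single output value; the function names are hypothetical, and the flag name follows the `output_is_prob` field stored in the checkpoints:

```python
import math


def to_probability(raw, output_is_prob):
    """Turn one raw model output value into a foreground probability."""
    if output_is_prob:
        return raw  # the model already applied a sigmoid in forward()
    return 1.0 / (1.0 + math.exp(-raw))  # logits still need the sigmoid


def binarize(raw, output_is_prob, threshold=0.5):
    """Threshold the probability to get a foreground / background decision."""
    return to_probability(raw, output_is_prob) >= threshold
```

The same raw value can land on different sides of the threshold depending on the flag: `binarize(0.3, True)` is `False`, while `binarize(0.3, False)` is `True` because a logit of 0.3 maps to a probability above 0.5.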

### Hyperparameters (identical across models, identical to original baselines)

| Setting | Value |
|---|---|
| Image size | 128 × 128 |
| Optimizer | Adam, lr = 1e-4 |
| Scheduler | `ReduceLROnPlateau(mode='max', patience=5, factor=0.5)` on val Dice |
| Loss | `0.5 · BCE + 0.5 · Dice` (`CombinedLoss`) |
| Augmentations | `RandomHorizontalFlip(p=0.5)`, `RandomVerticalFlip(p=0.5)`, `RandomRotation(15)` |
| Epochs | 50 |
| Batch size | 16 |
| Random seed | 42 |
| Subset selection seed | 42 (same as the leaky run, so the 25/50/100% nesting structure is preserved across studies modulo the cleanup) |
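
For reference, the `0.5 · BCE + 0.5 · Dice` combination can be written out on plain Python lists. This is a sketch of the formula only, not the repo's `CombinedLoss`: it assumes the inputs are already probabilities, and the clamping `eps` and the Dice `smooth` term are assumed values that the real class may handle differently.

```python
import math


def bce(probs, targets, eps=1e-7):
    """Mean binary cross-entropy over flattened probabilities and 0/1 targets."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to keep log() finite
        total -= t * math.log(p) + (1 - t) * math.log(1 - p)
    return total / len(probs)


def dice_loss(probs, targets, smooth=1.0):
    """1 - soft Dice coefficient; `smooth` avoids division by zero."""
    intersection = sum(p * t for p, t in zip(probs, targets))
    return 1.0 - (2.0 * intersection + smooth) / (sum(probs) + sum(targets) + smooth)


def combined_loss(probs, targets, w_bce=0.5, w_dice=0.5):
    """Weighted sum of the two terms, with the weights from the table above."""
    return w_bce * bce(probs, targets) + w_dice * dice_loss(probs, targets)
```

BCE scores each pixel independently, while the Dice term scores overlap over the whole mask, so the sum is less dominated by the majority class than BCE alone.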

The point of holding hyperparameters fixed is that the **only intentional differences** between this study and the original baselines are:
1. The dataset is deduplicated (train and val).
2. Metrics use a global confusion matrix (instead of the per-batch averaging used by the originals).
3. The random seed is fixed for reproducibility.

### Metrics (standardized)

We accumulate TP / FP / FN / TN over each entire epoch and compute:

| Metric | Formula |
|---|---|
| `iou` (foreground) | `TP / (TP + FP + FN)` |
| `miou` | `mean(foreground IoU, background IoU)` |
| `dice` | `2·TP / (2·TP + FP + FN)` |
| `pixel_acc` | `(TP + TN) / total` |

This matches PASCAL/Cityscapes-style mIoU reporting. The per-batch averaging used in the original baselines gives slightly different values (especially when batches are imbalanced); here every model is evaluated identically.
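
The difference between the two aggregation schemes is easy to see on a toy example. The sketch below is not the contents of `metrics.py`; the function names, the zero-denominator convention, and the mask values are assumptions chosen for illustration.

```python
def confusion_counts(pred, target):
    """Return (TP, FP, FN, TN) for two equal-length sequences of 0/1 labels."""
    tp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, target) if p == 0 and t == 1)
    tn = sum(1 for p, t in zip(pred, target) if p == 0 and t == 0)
    return tp, fp, fn, tn


def ratio(num, den):
    """Divide, returning 0.0 for an empty denominator (an assumed convention)."""
    return num / den if den else 0.0


def global_metrics(batches):
    """Accumulate counts over every batch, then compute each metric once."""
    tp = fp = fn = tn = 0
    for pred, target in batches:
        b_tp, b_fp, b_fn, b_tn = confusion_counts(pred, target)
        tp, fp, fn, tn = tp + b_tp, fp + b_fp, fn + b_fn, tn + b_tn
    fg_iou = ratio(tp, tp + fp + fn)
    bg_iou = ratio(tn, tn + fp + fn)  # for background, TN plays the role of TP
    return {
        "iou": fg_iou,
        "miou": (fg_iou + bg_iou) / 2,
        "dice": ratio(2 * tp, 2 * tp + fp + fn),
        "pixel_acc": ratio(tp + tn, tp + fp + fn + tn),
    }


def per_batch_mean_iou(batches):
    """The alternative: compute foreground IoU per batch, then average."""
    ious = []
    for pred, target in batches:
        tp, fp, fn, _ = confusion_counts(pred, target)
        ious.append(ratio(tp, tp + fp + fn))
    return sum(ious) / len(ious)


# Two toy "batches" of flattened masks (made-up values).
batches = [
    ([1, 0, 0, 0], [0, 0, 0, 0]),  # one false positive, no foreground in target
    ([1, 1, 1, 1], [1, 1, 1, 0]),  # mostly foreground
]
print(global_metrics(batches)["iou"])  # 0.6
print(per_batch_mean_iou(batches))     # 0.375
```

The batch with no foreground contributes an IoU of 0 with full weight under per-batch averaging, but only a single false-positive pixel under global accumulation.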

### Subset construction

[subsets/make_subsets.py](subsets/make_subsets.py) reads `final_data_clean/train/images/`, sorts filenames, shuffles once with `random.Random(42)`, and writes:

- `subset_25.txt` — first 25% of the shuffled list
- `subset_50.txt` — first 50%
- `subset_100.txt` — full list

The script asserts `25 ⊂ 50 ⊂ 100`. The lists are plaintext, one filename per line; both the trainer and the dashboard read them as the single source of truth.
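
The sort, shuffle, and prefix steps can be sketched as follows. This is not the contents of `make_subsets.py`: the function name and the floor-division cut points are assumptions, and the filenames are made up.

```python
import random


def make_nested_subsets(filenames, seed=42, shares=(25, 50, 100)):
    """Sort, shuffle once with a fixed seed, and take prefixes of the result."""
    names = sorted(filenames)  # sorting first makes the result order-independent
    random.Random(seed).shuffle(names)
    subsets = {share: names[: len(names) * share // 100] for share in shares}
    # Prefixes of one shuffled list are nested by construction; check anyway.
    ordered = [set(subsets[s]) for s in sorted(shares)]
    assert all(small <= big for small, big in zip(ordered, ordered[1:]))
    return subsets


subsets = make_nested_subsets(f"img_{i:03d}.jpg" for i in range(8))
print({share: len(names) for share, names in subsets.items()})  # {25: 2, 50: 4, 100: 8}
```

With 5,268 cleaned train files the 25% and 50% cut points are exactly 1,317 and 2,634 files, so the rounding rule does not matter for this dataset.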

### What we save per run

For every (model, share) pair:

- `checkpoints/{model}_{share}_best.pth` — state dict at the highest val Dice across all 50 epochs (plus epoch number, val metrics, model name, share, and `output_is_prob` flag for SegNet)
- `checkpoints/{model}_{share}_final.pth` — state dict at epoch 50
- `logs/{model}_{share}.json` — per-epoch JSON with `train_*` / `val_*` for `{loss, dice, iou, miou, pixel_acc}`, plus `epoch_seconds`, `train_seconds`, `val_seconds`, plus top-level wall-clock totals and ISO timestamps
- `logs/{model}_{share}.stdout.log` — captured stdout from `run_all.sh`

Logs are written incrementally — safe to interrupt and inspect mid-training.
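
Since the best checkpoint is selected by val Dice, a partially written log is enough to see which epoch currently holds it. A minimal sketch, assuming the per-epoch entries have been loaded into a list of dicts with a `val_dice` key; the function name, the list layout, and the numbers are made up for illustration.

```python
def best_epoch(records, key="val_dice"):
    """Return (1-based epoch number, record) for the highest value of `key`."""
    number, record = max(enumerate(records, start=1), key=lambda item: item[1][key])
    return number, record


# Made-up per-epoch records, as if read from a partially written log.
records = [
    {"val_dice": 0.61, "val_miou": 0.70},
    {"val_dice": 0.74, "val_miou": 0.78},
    {"val_dice": 0.72, "val_miou": 0.77},
]
number, record = best_epoch(records)
print(number, record["val_dice"])  # 2 0.74
```

On ties, `max` returns the earliest epoch; whether the trainer breaks ties the same way is not specified here.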

---

## 3 · Repo layout

```
experiments/clean_data_scaling_study/
├── README.md                     # this file
├── requirements.txt
├── dedupe_dataset.py             # builds final_data_clean/ + dedup_manifest.json
├── subsets/
│   ├── make_subsets.py           # builds nested subsets from the cleaned train list
│   ├── subset_25.txt             # written by make_subsets.py
│   ├── subset_50.txt
│   └── subset_100.txt
├── dataset.py                    # SubsetSolarPanelDataset
├── metrics.py                    # global confusion-matrix mIoU/IoU/Dice/PixelAcc
├── models.py                     # 4 builders: segnet/unet/segformer_b0/segformer_b5
├── train.py                      # unified trainer (--model, --share)
├── run_all.sh                    # 12 trainings sequentially
├── checkpoints/                  # populated during training (24 files: 12 best + 12 final)
├── logs/                         # populated during training (12 JSONs + 12 stdout logs)
└── dashboard/
    └── app.py                    # Streamlit dashboard
```

---

## 4 · How to run

```bash
# 0. Install (system python on this machine; adjust if you use a venv)
pip install --user --break-system-packages -r requirements.txt

# 1. Build the deduplicated dataset (writes ../../final_data_clean/)
python dedupe_dataset.py            # interactive
python dedupe_dataset.py --dry-run  # report only, no copy
python dedupe_dataset.py --force    # overwrite if final_data_clean/ exists

# 2. Build the nested subsets (writes subsets/subset_*.txt)
python subsets/make_subsets.py

# 3. Train. Each run takes 5–75 min on a single GPU depending on model + share.
PYTHON=/usr/bin/python3 ./run_all.sh                 # all 12 runs (~6 hours)
PYTHON=/usr/bin/python3 ./run_all.sh segformer_b0    # one model × 3 shares
python train.py --model unet --share 25              # single run

# 4. Dashboard
streamlit run dashboard/app.py
# → http://localhost:8501
```

---

## 5 · Reading the results

The dashboard's three tabs:

1. **Learning curves** — switchable metric (Dice / mIoU / IoU / PixelAcc / Loss), train+val toggle, one chart per architecture with the three data shares overlaid.
2. **Data share vs final** — best- vs final-epoch toggle, four charts (mIoU, Dice, IoU, PixelAcc) by data share with the four architectures as separate lines / bars. Plus per-run wall-clock and seconds-per-epoch breakdown.
3. **Inference** — drop in any image, see the 4×3 = 12-panel grid of predictions, side-by-side. Toggle threshold, view (`mask` / `overlay` / `heatmap`), and best vs final.

---

## 6 · Caveats

- **128×128 resolution**. SegFormer architectures generally benefit from higher resolution; the comparison is fair across architectures here, but absolute SegFormer numbers would likely improve at 256+ input sizes.
- **Single seed**. Each (model, share) is one training run. Multiple seeds would tighten error bars; we did not do that to keep the GPU budget reasonable.
- **Mask inconsistency for the leaked pairs**. We dropped the train copies and kept the val copies, but each val mask was annotated separately from the dropped train mask, so we *did* lose some training signal. The trade-off favors evaluation cleanliness.
- **Comparison to the leaked run**. The previous [experiments/data_scaling_study/](../data_scaling_study/) used per-batch averaged metrics and 2 architectures (U-Net + SegFormer-B0); this run uses global metrics and 4 architectures. So absolute numbers are not directly comparable — only the *trends* in the leaked run can be cross-referenced with the trends here.