# Experiment Log — Agentic Thyroid ResNet-18

Chronological, decision-by-decision record for reproducibility and journal review.

## Provenance

- **Experiment date (UTC):** 2026-06-05
- **Dataset:** `Johnyquest7/TN5000-thyroid-nodule-classification`
  - Commit SHA: `73d6c0713a89c8c07125fe6ffc5956be60e9853d`
  - Loaded **directly from the Train/Valid/Test folder structure** (NOT the
    datasets-viewer flattened `train` config, which merges all 5,000 rows),
    so the predefined splits are respected.
- **Compute:** Hugging Face GPU sandbox, NVIDIA A10G (24 GB), CUDA 13.0, cuDNN 9.2.
- **Key packages:** torch 2.12.0+cu130, torchvision 0.27.0+cu130, timm 1.0.27,
  scikit-learn 1.9.0, numpy 2.4.6, trackio 0.26.0 (see `configs/env_info.json`).
- **Global seed:** 42. Strict determinism: `torch.use_deterministic_algorithms(True)`,
  cuDNN deterministic, `CUBLAS_WORKSPACE_CONFIG=:4096:8`, seeded DataLoader workers.
- **Positive class:** Malignant (label 1).
- **Experiment tracking:** Trackio project `agentic_thyroid_resnet18`, dashboard
  Space `Johnyquest7/Trakio_agentic_thyroid`, storage dataset
  `Johnyquest7/Trakio_agentic_thyroid_dataset`.

## Exact split usage

| Split | n | Use |
|-------|--:|-----|
| Train | 3,500 | **Training only.** |
| Valid | 500 | **Model selection (val AUROC), calibration, threshold selection.** |
| Test | 1,000 | **Final locked evaluation, exactly once**, after model+calibration+threshold were frozen. |

## Class distribution

| Split | Benign | Malignant | Malignant % | Ratio (M:B) |
|-------|-------:|----------:|------------:|------------:|
| Train | 1,032 | 2,468 | 70.5% | 2.39 : 1 |
| Valid |   125 |   375 | 75.0% | 3.00 : 1 |
| Test  |   269 |   731 | 73.1% | 2.72 : 1 |

Data audit (`data_exploration_report.md`): **0 corrupt images**, all 224×224 RGB
PNG; **0 cross-split exact-pixel duplicates**, **0 filename-ID overlaps**, **0
label conflicts**; per-split mean intensity ≈ 81.4 (std ≈ 19.4), i.e. no evident
intensity shift between splits. Conclusion: **no detectable leakage; splits are
clean and separate.**
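
The cross-split exact-pixel check can be illustrated with a short sketch. This is not the audit script itself: the function name `find_cross_split_duplicates`, the in-memory `splits` layout, and the use of SHA-256 are assumptions made for the example.

```python
import hashlib
from collections import defaultdict


def find_cross_split_duplicates(splits):
    """Return {digest: [(split, image_id), ...]} for pixel content seen in >1 split.

    `splits` maps split name -> {image_id: raw pixel bytes} (hypothetical layout).
    """
    seen = defaultdict(set)  # digest -> set of (split, image_id)
    for split, images in splits.items():
        for image_id, pixels in images.items():
            digest = hashlib.sha256(pixels).hexdigest()
            seen[digest].add((split, image_id))
    # Keep only digests whose members come from at least two different splits.
    return {
        digest: sorted(members)
        for digest, members in seen.items()
        if len({split for split, _ in members}) > 1
    }


# Toy data: "t1" and "x1" share identical pixel bytes across Train and Test.
toy = {
    "Train": {"t1": b"\x00\x01\x02", "t2": b"\x09\x09"},
    "Valid": {"v1": b"\x05"},
    "Test": {"x1": b"\x00\x01\x02"},
}
duplicates = find_cross_split_duplicates(toy)
print(len(duplicates), "cross-split duplicate group(s)")
```

Hashing decoded pixel arrays rather than file bytes would keep metadata differences between otherwise identical images from hiding a match; whether the audit did this is not stated here.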

## Literature-informed augmentation rationale

Augmentations were restricted to medically plausible B-mode ultrasound transforms
(MediAug arXiv:2504.18983 + thyroid-US practice):
- **Kept:** horizontal flip (thyroid is bilaterally symmetric), small rotation
  (≤10°), mild affine translate (5%) / scale (0.9–1.1), mild brightness/contrast
  (±15%, simulates gain/TGC), light Gaussian blur, and (in the `medical_strong`
  ablation) narrow random-resized-crop (0.8–1.0) + mild speckle noise.
- **Explicitly avoided:** vertical flip (US depth axis is physically meaningful),
  large rotation/shear (distorts taller-than-wide shape and margin morphology,
  both TI-RADS malignancy cues), aggressive crop (<0.8, can remove the nodule),
  and any color/HSV jitter (B-mode content is grayscale, even though the files
  are stored as RGB PNG).

Ablation result (val AUROC): `medical_default` **0.9712** at the center config
(**0.9756** in the selected focal-loss run) > `flip_only` **0.9637** >
`medical_strong` **0.9609**. The literature-default policy won.

## Model variants tried

- `torchvision` ResNet-18 (ImageNet1K_V1, bilinear/256→224 preprocessing).
- `timm:resnet18.a1_in1k` (A1 recipe, bicubic, crop_pct 0.95) — **selected**.
- `timm:resnet18.a2_in1k` (A2 recipe).
- Fine-tune depth: full fine-tune vs freeze stem+layer1 (`freeze_stage=1`).

## Hyperparameter sweep (14 trials, one-factor-at-a-time around a literature-informed center)

Center: timm a1, lr 2e-4, wd 1e-4, bs 32, `medical_default`, `pos_weight`,
full fine-tune, BCE, AdamW, cosine, ≤40 epochs, early-stop(8). All runs logged to
Trackio. **Selection metric: validation AUROC.** All 14 trials completed (rc=0).

| Rank | Run | Change vs center | Val AUROC | Best epoch |
|-----:|-----|------------------|----------:|-----------:|
| 1 | **c12_loss_focal** | **focal γ=1.0, imbalance=none** | **0.9756** | 6 |
| 2 | c09_imb_none | imbalance=none (BCE) | 0.9739 | 6 |
| 3 | c01_backbone_torchvision | torchvision backbone | 0.9731 | 7 |
| 4 | c03_lr_1e-4 | lr 1e-4 | 0.9721 | 11 |
| 5 | c06_bs_64 | batch size 64 | 0.9717 | 8 |
| 6 | c05_wd_1e-3 | weight decay 1e-3 | 0.9712 | 9 |
| 6 | c00_center_a1 | center config | 0.9712 | 6 |
| 8 | c10_imb_sampler | weighted sampler | 0.9697 | 13 |
| 9 | c02_backbone_a2 | a2 backbone | 0.9693 | 6 |
| 10 | c04_lr_5e-4 | lr 5e-4 | 0.9675 | 9 |
| 11 | c11_freeze1 | freeze stem+layer1 | 0.9672 | 6 |
| 12 | c13_lr1e-4_wd1e-3_drop | lr1e-4+wd1e-3+dropout0.2 | 0.9657 | 11 |
| 13 | c07_aug_flip_only | flip-only aug | 0.9637 | 6 |
| 14 | c08_aug_strong | strong aug | 0.9609 | 8 |

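Findings: (1) For this mild (~70/30) imbalance, **no extra reweighting beat
class-weighted BCE and weighted sampling**: focal loss (γ=1.0) reached 0.9756 and
unweighted BCE 0.9739, versus 0.9712 with `pos_weight` and 0.9697 with a weighted
sampler. Reweighting slightly hurt, consistent with the literature.
(2) `medical_default` augmentation outperformed both the lighter (`flip_only`)
and the stronger (`medical_strong`) policy. (3) **Full fine-tune > freezing**
(0.9712 vs 0.9672). (4) Backbones were close (torchvision 0.9731, a1 0.9712,
a2 0.9693). Full per-run details: `results/tables/sweep_leaderboard.json`.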

The sweep was deliberately capped at 14 trials (the center config, 12 variants
around it, and one combined-change run, `c13`) to limit the risk of overfitting
the 500-image validation set.
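
For reference, the selection metric used throughout the sweep above can be computed from ranks alone. The sketch below is a plain-Python illustration of AUROC via the Mann-Whitney rank statistic, with tied scores given their average rank; it is not the evaluation code used in the runs, and the function name `auroc` is chosen for the example.

```python
def auroc(y_true, scores):
    """AUROC as the normalized Mann-Whitney U statistic (ties get average ranks)."""
    pairs = sorted(zip(scores, y_true))  # ascending by score
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    rank_sum_pos = 0.0
    i = 0
    while i < len(pairs):
        # Find the block [i, j) of items sharing the same score.
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of the 1-indexed ranks i+1 .. j
        rank_sum_pos += avg_rank * sum(label for _, label in pairs[i:j])
        i = j
    u_statistic = rank_sum_pos - n_pos * (n_pos + 1) / 2
    return u_statistic / (n_pos * n_neg)


# Toy check: 3 of the 4 (positive, negative) pairs are ordered correctly.
print(auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

Because the statistic depends only on the ordering of scores, any strictly increasing transform of the scores leaves it unchanged.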

## Selected run

**c12_loss_focal** — `timm:resnet18.a1_in1k`, focal loss (γ=1.0, α=0.5),
imbalance=none, AdamW lr 2e-4 / wd 1e-4, batch 32, `medical_default` aug, full
fine-tune, cosine schedule, best epoch 6, **validation AUROC 0.9756**. Selected
**purely on validation AUROC**, before any test access. Config:
`configs/final_config.yaml`; weights: `final_model.pt`.
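
To make the loss setting concrete, here is a scalar sketch of binary focal loss in its commonly used form, `-α_t (1 - p_t)^γ log(p_t)`. It assumes the run used this standard formulation; the function name and the `eps` clamp are illustrative, and the actual training code operates on batched logits.

```python
import math


def binary_focal_loss(p, y, gamma=1.0, alpha=0.5, eps=1e-12):
    """Focal loss for one example; p is the predicted malignancy probability."""
    p = min(max(p, eps), 1.0 - eps)           # clamp to avoid log(0)
    p_t = p if y == 1 else 1.0 - p            # probability assigned to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# With alpha = 0.5 both classes get the same weight; gamma = 1 down-weights
# confident, correct predictions relative to uncertain ones.
easy = binary_focal_loss(0.9, 1)   # well-classified malignant example
hard = binary_focal_loss(0.6, 1)   # uncertain malignant example
print(f"easy={easy:.5f} hard={hard:.5f}")
```

With α=0.5 the class weighting is symmetric, which matches the `imbalance=none` setting of the selected run.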

## Calibration decision

Assessed on validation. **Temperature scaling** (single parameter, LBFGS on NLL)
gave **T = 0.5646**. Validation ECE 0.0833 → **0.0308**, Brier 0.0592 → 0.0525,
AUROC unchanged (0.9756; temperature scaling is monotonic ⇒ discrimination
preserved). **Decision: use calibrated probabilities** for thresholding and test
reporting. Parameters + before/after metrics: `configs/calibration.json`.
Reliability diagrams: `results/figures/{valid,test}_calibration.png`.
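
The idea behind single-parameter temperature scaling can be sketched without any deep-learning dependency. The example assumes the usual convention that the logit is divided by T before the sigmoid, and it swaps the LBFGS optimizer for a golden-section search over log T to stay in plain Python; function names and search bounds are illustrative.

```python
import math


def nll(logits, labels, temperature):
    """Mean binary cross-entropy of sigmoid(logit / T), computed stably."""
    total = 0.0
    for z, y in zip(logits, labels):
        s = z / temperature
        softplus = max(s, 0.0) + math.log1p(math.exp(-abs(s)))  # log(1 + e^s)
        total += softplus - y * s
    return total / len(logits)


def fit_temperature(logits, labels, lo=0.05, hi=20.0, iters=80):
    """Golden-section search for the T in [lo, hi] that minimizes validation NLL."""
    a, b = math.log(lo), math.log(hi)
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    fc = nll(logits, labels, math.exp(c))
    fd = nll(logits, labels, math.exp(d))
    for _ in range(iters):
        if fc < fd:                      # minimum lies in [a, d]
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = nll(logits, labels, math.exp(c))
        else:                            # minimum lies in [c, b]
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = nll(logits, labels, math.exp(d))
    return math.exp((a + b) / 2.0)


# Toy case: every logit is 1.0 but 9 of 10 labels are positive, so the raw
# probability sigmoid(1) ≈ 0.73 is under-confident; the optimum is T = 1/ln(9).
toy_T = fit_temperature([1.0] * 10, [1] * 9 + [0])
print(f"fitted T = {toy_T:.4f}")
```

Under this convention a fitted T below 1, as reported above, sharpens the probabilities, and because dividing by a positive constant preserves score ordering, AUROC is unaffected.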

## Threshold selection decision

On the validation set, using calibrated probabilities, the primary threshold was
the **highest-specificity threshold achieving sensitivity ≥ 0.95**
(sensitivity-prioritized, clinically motivated). Target was achievable.

- **Locked threshold = 0.7113** → validation sensitivity **0.952**, specificity **0.896**.
- Secondary reference (Youden's J): coincided at 0.7113 here.

Threshold **locked before** the test set was evaluated. Config: `configs/threshold.json`.
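
The selection rule can be written down in a few lines. This sketch assumes a case is called malignant when its calibrated probability is greater than or equal to the threshold and that candidate thresholds are the observed scores; `select_threshold` is a name chosen for the example, not the project's function.

```python
def select_threshold(probs, labels, min_sensitivity=0.95):
    """Return (threshold, sensitivity, specificity) for the highest-specificity
    threshold whose sensitivity is at least `min_sensitivity`."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best = None
    for t in sorted(set(probs)):  # ascending candidate thresholds
        tp = sum(1 for p, y in zip(probs, labels) if y == 1 and p >= t)
        tn = sum(1 for p, y in zip(probs, labels) if y == 0 and p < t)
        sens, spec = tp / n_pos, tn / n_neg
        if sens >= min_sensitivity:
            # Specificity never decreases as t grows, so the last qualifying
            # threshold is also the most specific one.
            best = (t, sens, spec)
    return best


# Toy validation scores (1 = malignant), with a relaxed 0.75 target.
toy_probs = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2]
toy_labels = [1, 1, 1, 0, 1, 0, 0]
print(select_threshold(toy_probs, toy_labels, min_sensitivity=0.75))
```

The quadratic scan is adequate for a 500-image validation set; a sorted cumulative count would be the natural choice for larger sets.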

## Final locked threshold

**0.7113139** (on calibrated malignancy probability).

## Final test results with 95% CIs

Test split (n=1000), calibrated probabilities + locked threshold. CIs: stratified
bootstrap, 2000 resamples, seed=42.
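
A stratified bootstrap of this kind can be sketched as follows. The percentile method for the interval is an assumption (the interval type is not stated here), and the labels and predictions are rebuilt from the reported test confusion matrix purely for illustration. Since the random number generator differs from the original script, the output approximates the reported sensitivity interval rather than reproducing it exactly.

```python
import random


def sensitivity(y_true, y_pred):
    tp = fn = 0
    for t, p in zip(y_true, y_pred):
        if t == 1:
            if p == 1:
                tp += 1
            else:
                fn += 1
    return tp / (tp + fn)


def stratified_bootstrap_ci(y_true, y_pred, metric, n_boot=2000, seed=42, alpha=0.05):
    """Percentile CI; each resample draws with replacement within each class,
    so the benign/malignant counts stay fixed."""
    rng = random.Random(seed)
    pos = [i for i, t in enumerate(y_true) if t == 1]
    neg = [i for i, t in enumerate(y_true) if t == 0]
    stats = []
    for _ in range(n_boot):
        idx = rng.choices(pos, k=len(pos)) + rng.choices(neg, k=len(neg))
        stats.append(metric([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Rebuild per-image outcomes from TN=214, FP=55, FN=70, TP=661.
y_true = [1] * 731 + [0] * 269
y_pred = [1] * 661 + [0] * 70 + [0] * 214 + [1] * 55
ci_low, ci_high = stratified_bootstrap_ci(y_true, y_pred, sensitivity)
print(f"sensitivity 95% CI: [{ci_low:.4f}, {ci_high:.4f}]")
```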

| Metric | Point | 95% CI |
|--------|------:|:------:|
| AUROC | 0.9371 | [0.9202, 0.9528] |
| Sensitivity | 0.9042 | [0.8824, 0.9248] |
| Specificity | 0.7955 | [0.7435, 0.8439] |
| PPV | 0.9232 | [0.9054, 0.9401] |
| NPV | 0.7535 | [0.7123, 0.7979] |
| Accuracy | 0.8750 | [0.8540, 0.8950] |
| F1 | 0.9136 | [0.8991, 0.9278] |
| Brier | 0.0823 | — |
| ECE | 0.0314 | — |

Confusion matrix (Test): TN=214, FP=55, FN=70, TP=661.
Tables: `results/tables/test_metrics_with_ci.{md,csv}`; per-image predictions:
`results/{valid,test}_predictions.csv`.
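
The threshold-dependent point estimates in the table follow directly from the confusion matrix, which gives a quick consistency check. The helper name `confusion_metrics` is illustrative.

```python
def confusion_metrics(tn, fp, fn, tp):
    """Threshold-dependent metrics with Malignant as the positive class."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "f1": 2 * tp / (2 * tp + fp + fn),
    }


test_metrics = confusion_metrics(tn=214, fp=55, fn=70, tp=661)
for name, value in test_metrics.items():
    print(f"{name}: {value:.4f}")
```

Rounded to four decimals, these values agree with the Point column above.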

## Failed / weaker runs

No run errored (all rc=0). Weakest configurations: `medical_strong` augmentation
(0.9609) and `flip_only` (0.9637), both below the `medical_default` baseline
(0.9712), which supports the augmentation policy choice. Earlier Trackio smoke
runs (`smoke_trackio_test*`, `connectivity_check`, `dataset_pin_check`) are
infrastructure-validation runs, not experiments. Early Trackio Space creation
produced transient 401 `/volumes` warnings until a persistent `dataset_id`
(`Johnyquest7/Trakio_agentic_thyroid_dataset`) was pinned; resolved thereafter.

## Limitations

- Single-source dataset; cropped-ROI inputs; mild class imbalance.
- The ≥0.95-sensitivity operating point set on validation yielded **0.904**
  sensitivity on test: the operating point does not transfer perfectly, and 70 of
  731 malignant nodules (9.6%) are missed at the locked threshold. Test AUROC
  (0.9371) is likewise below validation AUROC (0.9756). Local threshold
  re-calibration is advisable before any use.
- Leakage checks (exact-pixel hash + filename-ID overlap) are exhaustive for the
  available signal but cannot exclude same-patient/near-duplicate leakage if such
  structure exists upstream in TN5000.

## External validation — NOT yet performed

No external/independent dataset has been evaluated. `evaluate_external.py` is
provided to run the locked model (same preprocessing, calibration T, and locked
threshold) on a future external set (folder or CSV format). External, ideally
prospective and multi-site, validation is **required** before any clinical use.
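
As an illustration of what applying the locked model to new data involves, the sketch below turns a raw logit into a decision using the reported temperature and threshold. It is not the contents of `evaluate_external.py`: the function names, the logit/T convention, and the `>=` comparison are assumptions.

```python
import math

TEMPERATURE = 0.5646     # calibration T reported above
THRESHOLD = 0.7113139    # locked threshold on calibrated probability


def calibrated_probability(logit, temperature=TEMPERATURE):
    """Sigmoid of the temperature-scaled logit, computed in a stable way."""
    z = logit / temperature
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_malignant(logit, temperature=TEMPERATURE, threshold=THRESHOLD):
    """True if the calibrated malignancy probability reaches the locked threshold."""
    return calibrated_probability(logit, temperature) >= threshold


# The same rule expressed as a cutoff on the raw logit.
logit_cutoff = TEMPERATURE * math.log(THRESHOLD / (1.0 - THRESHOLD))
print(f"equivalent raw-logit cutoff: {logit_cutoff:.4f}")
print(predict_malignant(0.0), predict_malignant(1.0))
```

Keeping T and the threshold as fixed constants, rather than refitting them on the external set, is what makes such an evaluation a test of the locked model.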

## Test-set integrity statement

> The Test split was evaluated **exactly once**, and **only after** the model was
> selected (by validation AUROC), calibrated (temperature scaling on validation),
> and the decision threshold was locked (on validation). No hyperparameter,
> calibration, or threshold decision used the test set.