# Experiment Log: Agentic Thyroid ResNet-18

Chronological, decision-by-decision record for reproducibility and journal review.

## Provenance

- **Experiment date (UTC):** 2026-06-05
- **Dataset:** `Johnyquest7/TN5000-thyroid-nodule-classification`
  - Commit SHA: `73d6c0713a89c8c07125fe6ffc5956be60e9853d`
  - Loaded **directly from the Train/Valid/Test folder structure** (NOT the datasets-viewer flattened `train` config, which merges all 5,000 rows), so the predefined splits are respected.
- **Compute:** Hugging Face GPU sandbox, NVIDIA A10G (24 GB), CUDA 13.0, cuDNN 9.2.
- **Key packages:** torch 2.12.0+cu130, torchvision 0.27.0+cu130, timm 1.0.27, scikit-learn 1.9.0, numpy 2.4.6, trackio 0.26.0 (see `configs/env_info.json`).
- **Global seed:** 42. Strict determinism: `torch.use_deterministic_algorithms(True)`, cuDNN deterministic, `CUBLAS_WORKSPACE_CONFIG=:4096:8`, seeded DataLoader workers.
- **Positive class:** Malignant (label 1).
- **Experiment tracking:** Trackio project `agentic_thyroid_resnet18`, dashboard Space `Johnyquest7/Trakio_agentic_thyroid`, storage dataset `Johnyquest7/Trakio_agentic_thyroid_dataset`.

## Exact split usage

| Split | n | Use |
|-------|--:|-----|
| Train | 3,500 | **Training only.** |
| Valid | 500 | **Model selection (val AUROC), calibration, threshold selection.** |
| Test | 1,000 | **Final locked evaluation, exactly once**, after model+calibration+threshold were frozen. |

## Class distribution

| Split | Benign | Malignant | Malignant % | Ratio (M:B) |
|-------|-------:|----------:|------------:|------------:|
| Train | 1,032 | 2,468 | 70.5% | 2.39 : 1 |
| Valid | 125 | 375 | 75.0% | 3.00 : 1 |
| Test | 269 | 731 | 73.1% | 2.72 : 1 |

Data audit (`data_exploration_report.md`): **0 corrupt images**, all 224×224 RGB PNG; **0 cross-split exact-pixel duplicates**, **0 filename-ID overlaps**, **0 label conflicts**; per-split mean intensity ≈ 81.4 (std ≈ 19.4), i.e. no intensity shift between splits was detected.
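
The exact-pixel duplicate check from the audit can be sketched as follows. This is an illustrative reconstruction, not the audit script: the function name `cross_split_duplicates`, the input layout (split name → image ID → raw pixel bytes), and the choice of SHA-256 are assumptions, since the source does not state which hash was used. Decoding each PNG into a pixel buffer is assumed to happen upstream.

```python
import hashlib
from collections import defaultdict


def cross_split_duplicates(splits):
    """Return groups of images whose pixel buffers are identical across splits.

    `splits` is a hypothetical layout: {split_name: {image_id: pixel_bytes}}.
    The result maps each offending digest to a sorted list of
    (split_name, image_id) pairs that share it.
    """
    seen = defaultdict(set)  # digest -> {(split_name, image_id), ...}
    for split_name, images in splits.items():
        for image_id, pixel_bytes in images.items():
            digest = hashlib.sha256(pixel_bytes).hexdigest()
            seen[digest].add((split_name, image_id))

    # Only digests that occur in more than one split count as leakage;
    # duplicates inside a single split are ignored here.
    return {
        digest: sorted(members)
        for digest, members in seen.items()
        if len({split for split, _ in members}) > 1
    }


# Toy example: "a" (Train) and "c" (Test) carry identical pixel bytes.
toy = {
    "Train": {"a": b"\x00\x01", "b": b"\x02"},
    "Test": {"c": b"\x00\x01"},
}
leaks = cross_split_duplicates(toy)
```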
Conclusion: **no detectable leakage; splits are clean and separate.**

## Literature-informed augmentation rationale

Augmentations were restricted to medically plausible B-mode ultrasound transforms (MediAug arXiv:2504.18983 + thyroid-US practice):

- **Kept:** horizontal flip (thyroid is bilaterally symmetric), small rotation (≤10°), mild affine translate (5%) / scale (0.9–1.1), mild brightness/contrast (±15%, simulates gain/TGC), light Gaussian blur, and (in the `medical_strong` ablation) narrow random-resized-crop (0.8–1.0) + mild speckle noise.
- **Explicitly avoided:** vertical flip (US depth axis is physically meaningful), large rotation/shear (distorts taller-than-wide / margin morphology, which are TI-RADS malignancy cues), aggressive crop (<0.8, can remove the nodule), and any color/HSV jitter (images are grayscale).

Ablation result (val AUROC): `medical_default` **0.9712–0.9756** > `flip_only` **0.9637** > `medical_strong` **0.9609**. The literature-default policy won.

## Model variants tried

- `torchvision` ResNet-18 (ImageNet1K_V1, bilinear/256→224 preprocessing).
- `timm:resnet18.a1_in1k` (A1 recipe, bicubic, crop_pct 0.95): **selected**.
- `timm:resnet18.a2_in1k` (A2 recipe).
- Fine-tune depth: full fine-tune vs freeze stem+layer1 (`freeze_stage=1`).

## Hyperparameter sweep (14 trials, one-factor-at-a-time around a literature-informed center)

Center: timm a1, lr 2e-4, wd 1e-4, bs 32, `medical_default`, `pos_weight`, full fine-tune, BCE, AdamW, cosine, ≤40 epochs, early-stop(8). All runs logged to Trackio. **Selection metric: validation AUROC.** All 14 trials completed (rc=0).

| Rank | Run | Change vs center | Val AUROC | Best epoch |
|-----:|-----|------------------|----------:|-----------:|
| 1 | **c12_loss_focal** | **focal γ=1.0, imbalance=none** | **0.9756** | 6 |
| 2 | c09_imb_none | imbalance=none (BCE) | 0.9739 | 6 |
| 3 | c01_backbone_torchvision | torchvision backbone | 0.9731 | 7 |
| 4 | c03_lr_1e-4 | lr 1e-4 | 0.9721 | 11 |
| 5 | c06_bs_64 | batch size 64 | 0.9717 | 8 |
| 6 | c05_wd_1e-3 | weight decay 1e-3 | 0.9712 | 9 |
| 6 | c00_center_a1 | center config | 0.9712 | 6 |
| 8 | c10_imb_sampler | weighted sampler | 0.9697 | 13 |
| 9 | c02_backbone_a2 | a2 backbone | 0.9693 | 6 |
| 10 | c04_lr_5e-4 | lr 5e-4 | 0.9675 | 9 |
| 11 | c11_freeze1 | freeze stem+layer1 | 0.9672 | 6 |
| 12 | c13_lr1e-4_wd1e-3_drop | lr1e-4+wd1e-3+dropout0.2 | 0.9657 | 11 |
| 13 | c07_aug_flip_only | flip-only aug | 0.9637 | 6 |
| 14 | c08_aug_strong | strong aug | 0.9609 | 8 |

Findings: (1) For this mild (~70/30) imbalance, **focal loss (γ=1.0) and no extra reweighting beat class-weighted BCE and weighted sampling**; heavy reweighting slightly hurt, consistent with the literature. (2) `medical_default` augmentation is the sweet spot. (3) **Full fine-tune > freezing.** (4) Backbones were close (a1 ≈ torchvision ≈ a2). Full per-run details: `results/tables/sweep_leaderboard.json`. The trial count was deliberately kept small (14 one-factor trials) to limit overfitting to the 500-image validation set.

## Selected run

**c12_loss_focal**: `timm:resnet18.a1_in1k`, focal loss (γ=1.0, α=0.5), imbalance=none, AdamW lr 2e-4 / wd 1e-4, batch 32, `medical_default` aug, full fine-tune, cosine schedule, best epoch 6, **validation AUROC 0.9756**. Selected **purely on validation AUROC**, before any test access. Config: `configs/final_config.yaml`; weights: `final_model.pt`.

## Calibration decision

Assessed on validation. **Temperature scaling** (single parameter, LBFGS on NLL) gave **T = 0.5646**.
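
A minimal sketch of single-parameter temperature scaling for a binary logit is given here, where the calibrated probability is sigmoid(logit / T) and T minimizes the negative log-likelihood (NLL) on validation. It is not the project's code: it uses only the standard library and swaps the LBFGS optimizer for a golden-section search over log T, which is adequate because the NLL is unimodal in T. The names `nll` and `fit_temperature`, the search bounds, and the toy data are assumptions made for illustration.

```python
import math


def nll(logits, labels, temperature):
    """Mean binary NLL of sigmoid(logit / T), computed in a stable form."""
    total = 0.0
    for z, y in zip(logits, labels):
        s = z / temperature
        # softplus(s) - y * s equals -[y*log(p) + (1 - y)*log(1 - p)]
        total += max(s, 0.0) + math.log1p(math.exp(-abs(s))) - y * s
    return total / len(logits)


def fit_temperature(logits, labels, log_lo=-3.0, log_hi=3.0, iters=80):
    """Golden-section search for the T > 0 that minimizes the NLL.

    Searching over u = log(T) keeps T positive. This replaces the LBFGS
    optimizer named in the log; both target the same 1-D objective.
    """
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = log_lo, log_hi
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    fc = nll(logits, labels, math.exp(c))
    fd = nll(logits, labels, math.exp(d))
    for _ in range(iters):
        if fc < fd:  # minimum lies in [a, d]
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = nll(logits, labels, math.exp(c))
        else:        # minimum lies in [c, b]
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = nll(logits, labels, math.exp(d))
    return math.exp((a + b) / 2.0)


# Toy data: the model's logits are half the "true" logits, so the model is
# under-confident and the fitted temperature should be close to 0.5.
toy_logits, toy_labels = [], []
for k in range(-8, 9):
    true_logit = 0.5 * k
    n_pos = round(100 / (1 + math.exp(-true_logit)))
    toy_logits += [true_logit / 2] * 100
    toy_labels += [1] * n_pos + [0] * (100 - n_pos)

toy_T = fit_temperature(toy_logits, toy_labels)
```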
Validation ECE 0.0833 → **0.0308**, Brier 0.0592 → 0.0525, AUROC unchanged (0.9756; temperature scaling is monotonic ⇒ discrimination preserved). **Decision: use calibrated probabilities** for thresholding and test reporting. Parameters + before/after metrics: `configs/calibration.json`. Reliability diagrams: `results/figures/{valid,test}_calibration.png`.

## Threshold selection decision

On the validation set, using calibrated probabilities, the primary threshold was the **highest-specificity threshold achieving sensitivity ≥ 0.95** (sensitivity-prioritized, clinically motivated). The target was achievable.

- **Locked threshold = 0.7113** → validation sensitivity **0.952**, specificity **0.896**.
- Secondary reference (Youden's J): coincided at 0.7113 here.

Threshold **locked before** the test set was evaluated. Config: `configs/threshold.json`.

## Final locked threshold

**0.7113139** (on calibrated malignancy probability).

## Final test results with 95% CIs

Test split (n=1000), calibrated probabilities + locked threshold. CIs: stratified bootstrap, 2000 resamples, seed=42.

| Metric | Point | 95% CI |
|--------|------:|:------:|
| AUROC | 0.9371 | [0.9202, 0.9528] |
| Sensitivity | 0.9042 | [0.8824, 0.9248] |
| Specificity | 0.7955 | [0.7435, 0.8439] |
| PPV | 0.9232 | [0.9054, 0.9401] |
| NPV | 0.7535 | [0.7123, 0.7979] |
| Accuracy | 0.8750 | [0.8540, 0.8950] |
| F1 | 0.9136 | [0.8991, 0.9278] |
| Brier | 0.0823 | n/a |
| ECE | 0.0314 | n/a |

Confusion matrix (Test): TN=214, FP=55, FN=70, TP=661. Tables: `results/tables/test_metrics_with_ci.{md,csv}`; per-image predictions: `results/{valid,test}_predictions.csv`.

## Failed / weaker runs

No run errored (all rc=0). Weakest configurations: `medical_strong` augmentation (0.9609) and `flip_only` (0.9637), both under the `medical_default` baseline, confirming the augmentation policy choice.
Earlier Trackio smoke runs (`smoke_trackio_test*`, `connectivity_check`, `dataset_pin_check`) are infrastructure-validation runs, not experiments. Early Trackio Space creation produced transient 401 `/volumes` warnings until a persistent `dataset_id` (`Johnyquest7/Trakio_agentic_thyroid_dataset`) was pinned; the warnings did not recur afterwards.

## Limitations

- Single-source dataset; cropped-ROI inputs; mild class imbalance.
- The ≥0.95-sensitivity operating point set on validation yielded **0.904** sensitivity on test: the operating point does not transfer perfectly, and ~10% of malignant nodules are missed at the locked threshold. Local threshold re-calibration is advisable before any use.
- Leakage checks (exact-pixel hash + filename-ID overlap) are exhaustive for the available signal but cannot exclude same-patient/near-duplicate leakage if such structure exists upstream in TN5000.

## External validation: NOT yet performed

No external/independent dataset has been evaluated. `evaluate_external.py` is provided to run the locked model (same preprocessing, calibration T, and locked threshold) on a future external set (folder or CSV format). External, ideally prospective and multi-site, validation is **required** before any clinical use.

## Test-set integrity statement

> The Test split was evaluated **exactly once**, and **only after** the model was
> selected (by validation AUROC), calibrated (temperature scaling on validation),
> and the decision threshold was locked (on validation). No hyperparameter,
> calibration, or threshold decision used the test set.
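
The threshold-locking step referred to above (the highest-specificity validation threshold whose sensitivity is at least 0.95) can be sketched as follows. This is a hedged illustration rather than the project's implementation: the function name `select_threshold`, the convention that a case is called malignant when its calibrated probability is `>=` the threshold, and the use of observed probabilities as candidate thresholds are all assumptions.

```python
def select_threshold(probs, labels, min_sensitivity=0.95):
    """Pick the threshold with the highest specificity subject to a
    sensitivity floor, using validation probabilities only.

    Returns (threshold, sensitivity, specificity), or None if no candidate
    meets the floor. Label 1 is the positive (malignant) class.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")

    best = None
    # Candidates are the distinct observed probabilities, in ascending order.
    for t in sorted(set(probs)):
        tp = sum(1 for p, y in zip(probs, labels) if y == 1 and p >= t)
        tn = sum(1 for p, y in zip(probs, labels) if y == 0 and p < t)
        sensitivity = tp / n_pos
        specificity = tn / n_neg
        # Strict '>' keeps the lowest threshold among specificity ties,
        # which is the one with the highest sensitivity.
        if sensitivity >= min_sensitivity and (best is None or specificity > best[2]):
            best = (t, sensitivity, specificity)
    return best


# Toy validation set: 5 positives, 3 negatives, sensitivity floor of 0.8.
toy_probs = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
toy_labels = [0, 0, 1, 0, 1, 1, 1, 1]
locked = select_threshold(toy_probs, toy_labels, min_sensitivity=0.8)
# locked == (0.6, 0.8, 1.0): four of five positives kept, all negatives rejected
```

Once chosen this way on validation, the threshold is applied unchanged to the test probabilities, which is what keeps the test evaluation free of selection effects.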