# Experiment Log: Agentic Thyroid ResNet-18

Chronological, decision-by-decision record for reproducibility and journal review.

## Provenance

- **Experiment date (UTC):** 2026-06-05
- **Dataset:** `Johnyquest7/TN5000-thyroid-nodule-classification`
  - Commit SHA: `73d6c0713a89c8c07125fe6ffc5956be60e9853d`
  - Loaded **directly from the Train/Valid/Test folder structure** (NOT the
    datasets-viewer flattened `train` config, which merges all 5,000 rows),
    so the predefined splits are respected.
- **Compute:** Hugging Face GPU sandbox, NVIDIA A10G (24 GB), CUDA 13.0, cuDNN 9.2.
- **Key packages:** torch 2.12.0+cu130, torchvision 0.27.0+cu130, timm 1.0.27,
  scikit-learn 1.9.0, numpy 2.4.6, trackio 0.26.0 (see `configs/env_info.json`).
- **Global seed:** 42. Strict determinism: `torch.use_deterministic_algorithms(True)`,
  cuDNN deterministic, `CUBLAS_WORKSPACE_CONFIG=:4096:8`, seeded DataLoader workers.
- **Positive class:** Malignant (label 1).
- **Experiment tracking:** Trackio project `agentic_thyroid_resnet18`, dashboard
  Space `Johnyquest7/Trakio_agentic_thyroid`, storage dataset
  `Johnyquest7/Trakio_agentic_thyroid_dataset`.

## Exact split usage

| Split | n | Use |
|-------|--:|-----|
| Train | 3,500 | **Training only.** |
| Valid | 500 | **Model selection (val AUROC), calibration, threshold selection.** |
| Test | 1,000 | **Final locked evaluation, exactly once**, after model+calibration+threshold were frozen. |

## Class distribution

| Split | Benign | Malignant | Malignant % | Ratio (M:B) |
|-------|-------:|----------:|------------:|------------:|
| Train | 1,032 | 2,468 | 70.5% | 2.39 : 1 |
| Valid | 125 | 375 | 75.0% | 3.00 : 1 |
| Test | 269 | 731 | 73.1% | 2.72 : 1 |
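
The percentages and ratios in the class distribution table can be rechecked from the raw counts. The sketch below does that arithmetic. It also shows one common way a `pos_weight` for class-weighted BCE is derived (negatives divided by positives, the PyTorch `BCEWithLogitsLoss` convention). This log does not state how `pos_weight` was actually computed, so that helper is an assumption.

```python
# Counts copied from the class distribution table in this log.
SPLITS = {
    "Train": {"benign": 1032, "malignant": 2468},
    "Valid": {"benign": 125, "malignant": 375},
    "Test": {"benign": 269, "malignant": 731},
}


def class_balance(benign: int, malignant: int) -> dict:
    """Return split size, malignant percentage, and malignant:benign ratio."""
    total = benign + malignant
    return {
        "n": total,
        "malignant_pct": 100.0 * malignant / total,
        "ratio_m_to_b": malignant / benign,
    }


def bce_pos_weight(benign: int, malignant: int) -> float:
    """ASSUMED definition: n_negative / n_positive, with malignant as positive."""
    return benign / malignant


if __name__ == "__main__":
    for name, counts in SPLITS.items():
        s = class_balance(**counts)
        print(f"{name}: n={s['n']}, malignant={s['malignant_pct']:.1f}%, "
              f"M:B={s['ratio_m_to_b']:.2f}:1")
    print(f"Train pos_weight (assumed definition): "
          f"{bce_pos_weight(**SPLITS['Train']):.3f}")
```

With malignant as the majority positive class, this definition gives a `pos_weight` below 1, so it down-weights malignant examples instead of up-weighting them.
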

Data audit (`data_exploration_report.md`): **0 corrupt images**, all 224×224 RGB
PNG; **0 cross-split exact-pixel duplicates**, **0 filename-ID overlaps**, **0
label conflicts**; per-split mean intensity ≈ 81.4 (std ≈ 19.4), indicating no
intensity shift between splits. Conclusion: **no detectable leakage; splits are
clean and separate.**
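
A cross-split exact-duplicate check of the kind summarized in the data audit can be sketched with content hashing. This is a minimal illustration and not the audit script itself. The function names are hypothetical, and each image is assumed to be already available as a `bytes` buffer (for example, its decoded pixel array serialized to bytes).

```python
import hashlib
from collections import defaultdict


def content_hash(data: bytes) -> str:
    """SHA-256 of an image's byte buffer (assumed to hold the decoded pixels)."""
    return hashlib.sha256(data).hexdigest()


def cross_split_duplicates(splits):
    """Find identical buffers that occur in more than one split.

    `splits` maps split name -> {image name -> bytes}. Returns a dict of
    hash -> list of (split, image name) for hashes seen in 2+ distinct splits.
    """
    seen = defaultdict(list)
    for split, items in splits.items():
        for name, data in items.items():
            seen[content_hash(data)].append((split, name))
    return {
        h: occurrences
        for h, occurrences in seen.items()
        if len({split for split, _ in occurrences}) > 1
    }


if __name__ == "__main__":
    # Toy buffers, not real images: "b.png" and "c.png" share identical content.
    toy = {
        "Train": {"a.png": b"\x00\x01", "b.png": b"\x02\x03"},
        "Test": {"c.png": b"\x02\x03"},
    }
    for h, occurrences in cross_split_duplicates(toy).items():
        print(h[:12], occurrences)
```

Exact hashing only catches byte-identical content. Near-duplicates such as re-encoded or slightly cropped frames would need a perceptual hash or patient-level metadata.
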

## Literature-informed augmentation rationale

Augmentations were restricted to medically plausible B-mode ultrasound transforms
(MediAug arXiv:2504.18983 + thyroid-US practice):

- **Kept:** horizontal flip (thyroid is bilaterally symmetric), small rotation
  (≤10°), mild affine translate (5%) / scale (0.9–1.1), mild brightness/contrast
  (±15%, simulates gain/TGC), light Gaussian blur, and (in the `medical_strong`
  ablation) narrow random-resized-crop (0.8–1.0) + mild speckle noise.
- **Explicitly avoided:** vertical flip (US depth axis is physically meaningful),
  large rotation/shear (distorts taller-than-wide shape and margin morphology,
  both TI-RADS malignancy cues), aggressive crop (<0.8, can remove the nodule),
  and any color/HSV jitter (images are grayscale).

Ablation result (val AUROC): `medical_default` **0.9712–0.9756** > `flip_only`
**0.9637** > `medical_strong` **0.9609**. The literature-default policy won.

## Model variants tried

- `torchvision` ResNet-18 (ImageNet1K_V1, bilinear/256→224 preprocessing).
- `timm:resnet18.a1_in1k` (A1 recipe, bicubic, crop_pct 0.95): **selected**.
- `timm:resnet18.a2_in1k` (A2 recipe).
- Fine-tune depth: full fine-tune vs freeze stem+layer1 (`freeze_stage=1`).

## Hyperparameter sweep (14 trials, one-factor-at-a-time around a literature-informed center)

Center: timm a1, lr 2e-4, wd 1e-4, bs 32, `medical_default`, `pos_weight`,
full fine-tune, BCE, AdamW, cosine, ≤40 epochs, early-stop(8). All runs logged to
Trackio. **Selection metric: validation AUROC.** All 14 trials completed (rc=0).

| Rank | Run | Change vs center | Val AUROC | Best epoch |
|-----:|-----|------------------|----------:|-----------:|
| 1 | **c12_loss_focal** | **focal γ=1.0, imbalance=none** | **0.9756** | 6 |
| 2 | c09_imb_none | imbalance=none (BCE) | 0.9739 | 6 |
| 3 | c01_backbone_torchvision | torchvision backbone | 0.9731 | 7 |
| 4 | c03_lr_1e-4 | lr 1e-4 | 0.9721 | 11 |
| 5 | c06_bs_64 | batch size 64 | 0.9717 | 8 |
| 6 | c05_wd_1e-3 | weight decay 1e-3 | 0.9712 | 9 |
| 6 | c00_center_a1 | center config | 0.9712 | 6 |
| 8 | c10_imb_sampler | weighted sampler | 0.9697 | 13 |
| 9 | c02_backbone_a2 | a2 backbone | 0.9693 | 6 |
| 10 | c04_lr_5e-4 | lr 5e-4 | 0.9675 | 9 |
| 11 | c11_freeze1 | freeze stem+layer1 | 0.9672 | 6 |
| 12 | c13_lr1e-4_wd1e-3_drop | lr1e-4+wd1e-3+dropout0.2 | 0.9657 | 11 |
| 13 | c07_aug_flip_only | flip-only aug | 0.9637 | 6 |
| 14 | c08_aug_strong | strong aug | 0.9609 | 8 |

Findings: (1) For this mild (~70/30) imbalance, **focal loss (γ=1.0) and no extra
reweighting beat class-weighted BCE and weighted sampling**; heavy reweighting
slightly hurt, consistent with the literature. (2) `medical_default` augmentation
is the sweet spot. (3) **Full fine-tune > freezing.** (4) Backbones were close
(a1 ≈ torchvision ≈ a2). Full per-run details: `results/tables/sweep_leaderboard.json`.
The trial count was deliberately kept small (14 trials) to limit overfitting to
the 500-image validation set.
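
The center configuration's `early-stop(8)` setting is read here as a patience of 8 epochs without improvement in validation AUROC. That reading is an assumption, as is everything in the sketch below: the class name, the `min_delta` parameter, and the synthetic AUROC curve are illustrative and not taken from the training script.

```python
class EarlyStopper:
    """Track the best validation score; signal a stop after `patience` stale epochs."""

    def __init__(self, patience: int = 8, min_delta: float = 0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_score = float("-inf")
        self.best_epoch = None
        self.bad_epochs = 0

    def step(self, epoch: int, val_auroc: float) -> bool:
        """Record one epoch's validation AUROC; return True when training should stop."""
        if val_auroc > self.best_score + self.min_delta:
            # New best: remember it (a checkpoint would be saved here) and reset.
            self.best_score = val_auroc
            self.best_epoch = epoch
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


if __name__ == "__main__":
    # Synthetic curve (NOT from this experiment): peaks at epoch 6, then plateaus lower.
    curve = [0.90, 0.93, 0.95, 0.96, 0.965, 0.97] + [0.96] * 34  # 40 epochs max
    stopper = EarlyStopper(patience=8)
    for epoch, auroc in enumerate(curve, start=1):
        if stopper.step(epoch, auroc):
            print(f"stopped at epoch {epoch}, best epoch {stopper.best_epoch}")
            break
```

Under this reading, a run whose best epoch is 6 would stop at epoch 14, well before the 40-epoch cap.
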

## Selected run

**c12_loss_focal**: `timm:resnet18.a1_in1k`, focal loss (γ=1.0, α=0.5),
imbalance=none, AdamW lr 2e-4 / wd 1e-4, batch 32, `medical_default` aug, full
fine-tune, cosine schedule, best epoch 6, **validation AUROC 0.9756**. Selected
**purely on validation AUROC**, before any test access. Config:
`configs/final_config.yaml`; weights: `final_model.pt`.
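
For reference, the widely used binary focal loss is FL = -α_t (1 - p_t)^γ log(p_t), where p_t is the probability assigned to the true class. The sketch below evaluates it for a single logit with the selected run's γ=1.0 and α=0.5. It assumes this standard formulation; the log does not show the training code, so the function names and the per-class meaning of α are illustrative. The code is not hardened against extreme logits.

```python
import math


def binary_cross_entropy(logit: float, target: int) -> float:
    """Plain BCE for one example: -log(p_t)."""
    p = 1.0 / (1.0 + math.exp(-logit))
    p_t = p if target == 1 else 1.0 - p
    return -math.log(p_t)


def binary_focal_loss(logit: float, target: int,
                      gamma: float = 1.0, alpha: float = 0.5) -> float:
    """Focal loss for one example: -alpha_t * (1 - p_t)**gamma * log(p_t).

    ASSUMPTION: alpha weights the positive class and (1 - alpha) the negative
    class, so alpha=0.5 treats both classes equally.
    """
    p = 1.0 / (1.0 + math.exp(-logit))
    p_t = p if target == 1 else 1.0 - p
    alpha_t = alpha if target == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


if __name__ == "__main__":
    for logit in (3.0, 0.0, -3.0):  # confident-correct, uncertain, confident-wrong
        bce = binary_cross_entropy(logit, 1)
        fl = binary_focal_loss(logit, 1)
        print(f"logit={logit:+.1f}  BCE={bce:.4f}  focal={fl:.4f}")
```

With α=0.5 the loss applies no class reweighting, matching `imbalance=none`. The (1 - p_t)^γ factor only shrinks the contribution of examples the model already classifies confidently.
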

## Calibration decision

Assessed on validation. **Temperature scaling** (single parameter, LBFGS on NLL)
gave **T = 0.5646**. Validation ECE 0.0833 → **0.0308**, Brier 0.0592 → 0.0525,
AUROC unchanged (0.9756; temperature scaling is monotonic, so discrimination is
preserved). **Decision: use calibrated probabilities** for thresholding and test
reporting. Parameters + before/after metrics: `configs/calibration.json`.
Reliability diagrams: `results/figures/{valid,test}_calibration.png`.
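
Temperature scaling replaces a logit z with z / T before the sigmoid. The sketch below applies the reported T = 0.5646 and computes a binary expected calibration error. The ECE variant (10 equal-width bins on the malignant probability) is an assumption, since this log does not state the binning used, and the function names are illustrative.

```python
import math

T_REPORTED = 0.5646  # temperature reported in this log


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def calibrate(logit: float, temperature: float = T_REPORTED) -> float:
    """Temperature-scaled malignancy probability: sigmoid(logit / T)."""
    return sigmoid(logit / temperature)


def expected_calibration_error(probs, labels, n_bins: int = 10) -> float:
    """Binary ECE with equal-width bins (ASSUMED variant).

    Each bin contributes |observed positive rate - mean predicted probability|,
    weighted by the fraction of samples falling in the bin.
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    n = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(p for p, _ in members) / len(members)
        pos_rate = sum(y for _, y in members) / len(members)
        ece += (len(members) / n) * abs(pos_rate - mean_conf)
    return ece


if __name__ == "__main__":
    for z in (-2.0, -0.5, 0.5, 2.0):
        print(f"logit={z:+.1f}  raw={sigmoid(z):.4f}  calibrated={calibrate(z):.4f}")
```

Because T is below 1, scaling pushes probabilities away from 0.5, which suggests the uncalibrated model was under-confident on validation. Dividing by a positive constant keeps the ranking of logits, which is why AUROC does not change.
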

## Threshold selection decision

On the validation set, using calibrated probabilities, the primary threshold was
the **highest-specificity threshold achieving sensitivity ≥ 0.95**
(sensitivity-prioritized, clinically motivated). Target was achievable.

- **Locked threshold = 0.7113**: validation sensitivity **0.952**, specificity **0.896**.
- Secondary reference (Youden's J): coincided at 0.7113 here.

Threshold **locked before** the test set was evaluated. Config: `configs/threshold.json`.
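
The selection rule above can be sketched as a scan over candidate thresholds. This is a minimal illustration under stated assumptions: a case is called malignant when its probability is at or above the threshold, candidates are the observed probabilities, and the function names are hypothetical. Both classes must be present in the data.

```python
def sensitivity_specificity(probs, labels, threshold):
    """Sensitivity and specificity when predicting malignant for p >= threshold."""
    tp = sum(1 for p, y in zip(probs, labels) if y == 1 and p >= threshold)
    fn = sum(1 for p, y in zip(probs, labels) if y == 1 and p < threshold)
    tn = sum(1 for p, y in zip(probs, labels) if y == 0 and p < threshold)
    fp = sum(1 for p, y in zip(probs, labels) if y == 0 and p >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def pick_threshold(probs, labels, min_sensitivity=0.95):
    """Return (threshold, sensitivity, specificity) maximizing specificity
    subject to sensitivity >= min_sensitivity, or None if no candidate qualifies.

    Candidates are scanned in ascending order, so among equal-specificity
    candidates the lowest threshold (highest sensitivity) is kept.
    """
    best = None
    for t in sorted(set(probs)):
        sens, spec = sensitivity_specificity(probs, labels, t)
        if sens >= min_sensitivity and (best is None or spec > best[2]):
            best = (t, sens, spec)
    return best


if __name__ == "__main__":
    # Toy validation scores (NOT from this experiment).
    probs = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
    labels = [0, 0, 1, 0, 1, 1, 1, 1]
    print(pick_threshold(probs, labels, min_sensitivity=0.8))
```

Running the same function on the test set would amount to re-selecting the threshold, so a locked-threshold protocol applies it to validation data only.
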

## Final locked threshold

**0.7113139** (on calibrated malignancy probability).

## Final test results with 95% CIs

Test split (n=1000), calibrated probabilities + locked threshold. CIs: stratified
bootstrap, 2000 resamples, seed=42.

| Metric | Point | 95% CI |
|--------|------:|:------:|
| AUROC | 0.9371 | [0.9202, 0.9528] |
| Sensitivity | 0.9042 | [0.8824, 0.9248] |
| Specificity | 0.7955 | [0.7435, 0.8439] |
| PPV | 0.9232 | [0.9054, 0.9401] |
| NPV | 0.7535 | [0.7123, 0.7979] |
| Accuracy | 0.8750 | [0.8540, 0.8950] |
| F1 | 0.9136 | [0.8991, 0.9278] |
| Brier | 0.0823 | n/a |
| ECE | 0.0314 | n/a |

Confusion matrix (Test): TN=214, FP=55, FN=70, TP=661.
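
The threshold-dependent point estimates in the table follow directly from the confusion matrix. The short sketch below recomputes them as a consistency check; the function name is illustrative.

```python
def metrics_from_confusion(tn: int, fp: int, fn: int, tp: int) -> dict:
    """Threshold-dependent metrics, with malignant as the positive class."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "f1": 2 * tp / (2 * tp + fp + fn),
    }


if __name__ == "__main__":
    # Test-set confusion matrix reported in this log.
    m = metrics_from_confusion(tn=214, fp=55, fn=70, tp=661)
    for name, value in m.items():
        print(f"{name:12s} {value:.4f}")
```

AUROC, Brier, and ECE are not recoverable this way because they depend on the full per-image probabilities.
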

Tables: `results/tables/test_metrics_with_ci.{md,csv}`; per-image predictions:
`results/{valid,test}_predictions.csv`.
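
A stratified bootstrap resamples benign and malignant cases separately, with replacement, so every resample keeps the original class counts. The sketch below is one way to do this for a thresholded metric. It assumes percentile intervals with a simple rank-based cutoff, which this log does not specify. It uses Python's `random` module, so it will not reproduce the reported bounds exactly, and the function names are hypothetical.

```python
import random


def sensitivity(labels, preds):
    """Fraction of malignant cases (label 1) predicted malignant."""
    tp = sum(1 for y, yhat in zip(labels, preds) if y == 1 and yhat == 1)
    pos = sum(1 for y in labels if y == 1)
    return tp / pos


def stratified_bootstrap_ci(labels, preds, metric_fn,
                            n_resamples=2000, seed=42, level=0.95):
    """Percentile CI (ASSUMED method) from class-stratified resampling."""
    rng = random.Random(seed)
    pos_idx = [i for i, y in enumerate(labels) if y == 1]
    neg_idx = [i for i, y in enumerate(labels) if y == 0]
    stats = []
    for _ in range(n_resamples):
        # Resample within each class so the class counts stay fixed.
        idx = rng.choices(pos_idx, k=len(pos_idx)) + rng.choices(neg_idx, k=len(neg_idx))
        stats.append(metric_fn([labels[i] for i in idx], [preds[i] for i in idx]))
    stats.sort()
    tail = (1.0 - level) / 2.0
    lower = stats[int(tail * (n_resamples - 1))]
    upper = stats[int((1.0 - tail) * (n_resamples - 1))]
    return lower, upper


if __name__ == "__main__":
    # Labels and predictions rebuilt from the reported test confusion matrix
    # (TP=661, FN=70, FP=55, TN=214); per-image order does not matter here.
    labels = [1] * 731 + [0] * 269
    preds = [1] * 661 + [0] * 70 + [1] * 55 + [0] * 214
    lo, hi = stratified_bootstrap_ci(labels, preds, sensitivity)
    print(f"sensitivity 95% CI: [{lo:.4f}, {hi:.4f}]")
```

Fixing the seed makes the interval repeatable for a given implementation, but different random number generators or percentile conventions can shift the bounds in the third or fourth decimal.
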

## Failed / weaker runs

No run errored (all rc=0). Weakest configurations: `medical_strong` augmentation
(0.9609) and `flip_only` (0.9637), both under the `medical_default` baseline,
confirming the augmentation policy choice. Earlier trackio smoke runs
(`smoke_trackio_test*`, `connectivity_check`, `dataset_pin_check`) are
infrastructure-validation runs, not experiments. Early Trackio Space creation
produced transient 401 `/volumes` warnings until a persistent `dataset_id`
(`Johnyquest7/Trakio_agentic_thyroid_dataset`) was pinned; resolved thereafter.

## Limitations

- Single-source dataset; cropped-ROI inputs; mild class imbalance.
- The ≥0.95-sensitivity operating point set on validation yielded **0.904**
  sensitivity on test, so the operating point does not transfer perfectly; ~10% of
  malignant nodules are missed at the locked threshold. Local threshold
  re-calibration is advisable before any use.
- Leakage checks (exact-pixel hash + filename-ID overlap) are exhaustive for the
  available signal but cannot exclude same-patient/near-duplicate leakage if such
  structure exists upstream in TN5000.

## External validation: NOT yet performed

No external/independent dataset has been evaluated. `evaluate_external.py` is
provided to run the locked model (same preprocessing, calibration T, and locked
threshold) on a future external set (folder or CSV format). External, ideally
prospective and multi-site, validation is **required** before any clinical use.

## Test-set integrity statement

> The Test split was evaluated **exactly once**, and **only after** the model was
> selected (by validation AUROC), calibrated (temperature scaling on validation),
> and the decision threshold was locked (on validation). No hyperparameter,
> calibration, or threshold decision used the test set.