# Experiment Log: Agentic Thyroid ResNet-18
Chronological, decision-by-decision record for reproducibility and journal review.
## Provenance
- **Experiment date (UTC):** 2026-06-05
- **Dataset:** `Johnyquest7/TN5000-thyroid-nodule-classification`
- Commit SHA: `73d6c0713a89c8c07125fe6ffc5956be60e9853d`
- Loaded **directly from the Train/Valid/Test folder structure** (NOT the
datasets-viewer flattened `train` config, which merges all 5,000 rows),
so the predefined splits are respected.
- **Compute:** Hugging Face GPU sandbox, NVIDIA A10G (24 GB), CUDA 13.0, cuDNN 9.2.
- **Key packages:** torch 2.12.0+cu130, torchvision 0.27.0+cu130, timm 1.0.27,
scikit-learn 1.9.0, numpy 2.4.6, trackio 0.26.0 (see `configs/env_info.json`).
- **Global seed:** 42. Strict determinism: `torch.use_deterministic_algorithms(True)`,
cuDNN deterministic, `CUBLAS_WORKSPACE_CONFIG=:4096:8`, seeded DataLoader workers.
- **Positive class:** Malignant (label 1).
- **Experiment tracking:** Trackio project `agentic_thyroid_resnet18`, dashboard
Space `Johnyquest7/Trakio_agentic_thyroid`, storage dataset
`Johnyquest7/Trakio_agentic_thyroid_dataset`.
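The determinism settings listed above could be applied along the following lines. This is a minimal sketch, not the project's actual setup code: the helper names `seed_everything`, `seed_worker`, and `make_loader_generator` are hypothetical, and only the seed value, the `CUBLAS_WORKSPACE_CONFIG` value, and the deterministic flags come from the log.

```python
import os
import random

# Set before the first cuBLAS call so deterministic GEMM workspaces are used.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

import torch

SEED = 42  # global seed stated in the log


def seed_everything(seed: int = SEED) -> None:
    """Seed Python and PyTorch RNGs and request deterministic kernels."""
    random.seed(seed)
    torch.manual_seed(seed)  # seeds the CPU generator and all CUDA devices
    torch.use_deterministic_algorithms(True)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False


def seed_worker(worker_id: int) -> None:
    """Hypothetical worker_init_fn: derive each worker's seed from the loader seed."""
    random.seed(torch.initial_seed() % 2**32)


def make_loader_generator(seed: int = SEED) -> torch.Generator:
    """Hypothetical helper: a seeded generator for DataLoader shuffling."""
    generator = torch.Generator()
    generator.manual_seed(seed)
    return generator
```

With these helpers, a loader would be built with `worker_init_fn=seed_worker` and `generator=make_loader_generator()` passed to `torch.utils.data.DataLoader`.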
## Exact split usage
| Split | n | Use |
|-------|--:|-----|
| Train | 3,500 | **Training only.** |
| Valid | 500 | **Model selection (val AUROC), calibration, threshold selection.** |
| Test | 1,000 | **Final locked evaluation, exactly once**, after model+calibration+threshold were frozen. |
## Class distribution
| Split | Benign | Malignant | Malignant % | Ratio (M:B) |
|-------|-------:|----------:|------------:|------------:|
| Train | 1,032 | 2,468 | 70.5% | 2.39 : 1 |
| Valid | 125 | 375 | 75.0% | 3.00 : 1 |
| Test | 269 | 731 | 73.1% | 2.72 : 1 |
Data audit (`data_exploration_report.md`): **0 corrupt images**, all 224×224 RGB
PNG; **0 cross-split exact-pixel duplicates**, **0 filename-ID overlaps**, **0
label conflicts**; per-split mean intensity ≈ 81.4 (std ≈ 19.4), indicating no gross
intensity shift between splits. Conclusion: **no detectable leakage; splits are clean and separate.**
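The two leakage checks named in the audit (exact-pixel duplicates and filename-ID overlaps across splits) can be sketched as follows. This is an illustration only: the function name `cross_split_overlaps` and the input layout are assumptions, and the audit script's actual hashing scheme is not given in this log (SHA-256 over the decoded pixel buffer is used here as a stand-in).

```python
import hashlib
from itertools import combinations


def cross_split_overlaps(splits):
    """Count exact-pixel and ID overlaps between every pair of splits.

    `splits` maps a split name to a list of (image_id, pixel_bytes) pairs,
    where `pixel_bytes` is the decoded pixel buffer of one image.
    """
    hashes = {}
    ids = {}
    for name, items in splits.items():
        hashes[name] = {hashlib.sha256(px).hexdigest() for _, px in items}
        ids[name] = {image_id for image_id, _ in items}

    report = {}
    for a, b in combinations(sorted(splits), 2):
        report[(a, b)] = {
            "pixel_duplicates": len(hashes[a] & hashes[b]),
            "id_overlaps": len(ids[a] & ids[b]),
        }
    return report
```

A clean dataset yields zero counts for every split pair. Near-duplicates (re-encoded or slightly cropped copies) would not be caught by an exact hash.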
## Literature-informed augmentation rationale
Augmentations were restricted to medically plausible B-mode ultrasound transforms
(MediAug arXiv:2504.18983 + thyroid-US practice):
- **Kept:** horizontal flip (thyroid is bilaterally symmetric), small rotation
(≤10°), mild affine translate (5%) / scale (0.9–1.1), mild brightness/contrast
(±15%, simulates gain/TGC), light Gaussian blur, and (in the `medical_strong`
ablation) narrow random-resized-crop (0.8–1.0) + mild speckle noise.
- **Explicitly avoided:** vertical flip (US depth axis is physically meaningful),
large rotation/shear (distorts taller-than-wide / margin morphology, which are TI-RADS
malignancy cues), aggressive crop (<0.8, can remove the nodule), and any
color/HSV jitter (images are grayscale).
Ablation result (val AUROC): `medical_default` **0.9712** (center run) to **0.9756**
(selected run) > `flip_only` **0.9637** > `medical_strong` **0.9609**. The
literature-default policy won.
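One way to keep the kept/avoided rules from drifting is to encode the policy as data and check it before training. The sketch below is hypothetical: the key names, the `policy_violations` helper, and the rule that any shear is rejected (stricter than the log's "large shear") are assumptions. Only the numeric limits come from the rationale above, and the mapping to an actual transform library is left out.

```python
# Hypothetical encoding of the `medical_default` policy; key names are illustrative.
MEDICAL_DEFAULT = {
    "horizontal_flip": True,
    "vertical_flip": False,
    "rotation_deg": 10.0,         # small rotation, at most 10 degrees
    "shear_deg": 0.0,
    "translate_frac": 0.05,       # 5% translation
    "scale_range": (0.9, 1.1),
    "brightness_contrast": 0.15,  # +/-15%, mimics gain/TGC changes
    "gaussian_blur": True,
    "min_crop_scale": 1.0,        # 1.0 means no random-resized-crop
    "color_jitter": False,
}


def policy_violations(policy, max_rotation_deg=10.0, min_crop_scale=0.8):
    """Return a list of reasons why a policy breaks the 'avoided' rules."""
    problems = []
    if policy.get("vertical_flip", False):
        problems.append("vertical flip inverts the depth axis")
    if policy.get("rotation_deg", 0.0) > max_rotation_deg:
        problems.append("rotation exceeds the small-rotation limit")
    if policy.get("shear_deg", 0.0) > 0.0:
        problems.append("shear distorts nodule shape and margins")
    if policy.get("min_crop_scale", 1.0) < min_crop_scale:
        problems.append("crop is aggressive enough to remove the nodule")
    if policy.get("color_jitter", False):
        problems.append("color/HSV jitter is meaningless on grayscale images")
    return problems
```

An empty list means the policy stays within the medically plausible range; a `medical_strong` variant with `min_crop_scale` of 0.8 would still pass, while a crop scale of 0.5 would not.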
## Model variants tried
- `torchvision` ResNet-18 (ImageNet1K_V1, bilinear/256→224 preprocessing).
- `timm:resnet18.a1_in1k` (A1 recipe, bicubic, crop_pct 0.95): **selected**.
- `timm:resnet18.a2_in1k` (A2 recipe).
- Fine-tune depth: full fine-tune vs freeze stem+layer1 (`freeze_stage=1`).
## Hyperparameter sweep (14 trials, one-factor-at-a-time around a literature-informed center)
Center: timm a1, lr 2e-4, wd 1e-4, bs 32, `medical_default`, `pos_weight`,
full fine-tune, BCE, AdamW, cosine, ≤40 epochs, early-stop(8). All runs logged to
Trackio. **Selection metric: validation AUROC.** All 14 trials completed (rc=0).
| Rank | Run | Change vs center | Val AUROC | Best epoch |
|-----:|-----|------------------|----------:|-----------:|
| 1 | **c12_loss_focal** | **focal γ=1.0, imbalance=none** | **0.9756** | 6 |
| 2 | c09_imb_none | imbalance=none (BCE) | 0.9739 | 6 |
| 3 | c01_backbone_torchvision | torchvision backbone | 0.9731 | 7 |
| 4 | c03_lr_1e-4 | lr 1e-4 | 0.9721 | 11 |
| 5 | c06_bs_64 | batch size 64 | 0.9717 | 8 |
| 6 | c05_wd_1e-3 | weight decay 1e-3 | 0.9712 | 9 |
| 6 | c00_center_a1 | center config | 0.9712 | 6 |
| 8 | c10_imb_sampler | weighted sampler | 0.9697 | 13 |
| 9 | c02_backbone_a2 | a2 backbone | 0.9693 | 6 |
| 10 | c04_lr_5e-4 | lr 5e-4 | 0.9675 | 9 |
| 11 | c11_freeze1 | freeze stem+layer1 | 0.9672 | 6 |
| 12 | c13_lr1e-4_wd1e-3_drop | lr1e-4+wd1e-3+dropout0.2 | 0.9657 | 11 |
| 13 | c07_aug_flip_only | flip-only aug | 0.9637 | 6 |
| 14 | c08_aug_strong | strong aug | 0.9609 | 8 |
Findings: (1) For this mild (~70/30) imbalance, **focal loss (γ=1.0) and no extra
reweighting beat class-weighted BCE and weighted sampling**; heavy reweighting
slightly hurt, consistent with the literature. (2) `medical_default` augmentation
is the sweet spot. (3) **Full fine-tune > freezing.** (4) Backbones were close
(a1 ≈ torchvision ≈ a2). Full per-run details: `results/tables/sweep_leaderboard.json`.
The trial count was deliberately kept small (14 trials) to limit overfitting to the
500-image validation set.
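The sweep design (a center config plus small overrides per trial) can be expressed compactly as below. The run names and changed values follow the leaderboard table; the config key names and the strings `torchvision:resnet18` and `weighted_sampler` are hypothetical, as is `freeze_stage=0` standing for full fine-tuning.

```python
# Center configuration; key names are illustrative, values are from the log.
CENTER = {
    "backbone": "timm:resnet18.a1_in1k",
    "lr": 2e-4,
    "weight_decay": 1e-4,
    "batch_size": 32,
    "augmentation": "medical_default",
    "imbalance": "pos_weight",
    "freeze_stage": 0,  # assumed encoding of "full fine-tune"
    "loss": "bce",
}

# One entry per trial: only the factors that differ from the center.
OVERRIDES = {
    "c00_center_a1": {},
    "c01_backbone_torchvision": {"backbone": "torchvision:resnet18"},
    "c02_backbone_a2": {"backbone": "timm:resnet18.a2_in1k"},
    "c03_lr_1e-4": {"lr": 1e-4},
    "c04_lr_5e-4": {"lr": 5e-4},
    "c05_wd_1e-3": {"weight_decay": 1e-3},
    "c06_bs_64": {"batch_size": 64},
    "c07_aug_flip_only": {"augmentation": "flip_only"},
    "c08_aug_strong": {"augmentation": "medical_strong"},
    "c09_imb_none": {"imbalance": "none"},
    "c10_imb_sampler": {"imbalance": "weighted_sampler"},
    "c11_freeze1": {"freeze_stage": 1},
    "c12_loss_focal": {"loss": "focal", "focal_gamma": 1.0, "imbalance": "none"},
    "c13_lr1e-4_wd1e-3_drop": {"lr": 1e-4, "weight_decay": 1e-3, "dropout": 0.2},
}


def build_trials(center, overrides):
    """Merge each override onto a copy of the center config."""
    return {name: {**center, **delta} for name, delta in overrides.items()}


trials = build_trials(CENTER, OVERRIDES)
```

Note that `c12_loss_focal` and `c13_lr1e-4_wd1e-3_drop` change more than one factor, so the sweep is one-factor-at-a-time for most trials but not all of them.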
## Selected run
**c12_loss_focal**: `timm:resnet18.a1_in1k`, focal loss (γ=1.0, α=0.5),
imbalance=none, AdamW lr 2e-4 / wd 1e-4, batch 32, `medical_default` aug, full
fine-tune, cosine schedule, best epoch 6, **validation AUROC 0.9756**. Selected
**purely on validation AUROC**, before any test access. Config:
`configs/final_config.yaml`; weights: `final_model.pt`.
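For reference, the binary focal loss with the selected hyperparameters can be written per sample as below. This is a sketch of the standard formulation, FL = -α_t (1 - p_t)^γ log(p_t); whether the training code uses exactly this form (for example, how α is applied to each class) is an assumption, and the function name is hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def binary_focal_loss(logit, target, gamma=1.0, alpha=0.5):
    """Focal loss for one sample; target is 1 (malignant) or 0 (benign)."""
    p = sigmoid(logit)
    p_t = p if target == 1 else 1.0 - p              # probability of the true class
    alpha_t = alpha if target == 1 else 1.0 - alpha  # class weight
    p_t = max(p_t, 1e-12)                            # guard against log(0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

With α=0.5 both classes receive the same weight, so α only rescales the loss; the (1 - p_t)^γ factor is what down-weights easy, confidently correct samples.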
## Calibration decision
Assessed on validation. **Temperature scaling** (single parameter, LBFGS on NLL)
gave **T = 0.5646**. Validation ECE 0.0833 → **0.0308**, Brier 0.0592 → 0.0525,
AUROC unchanged (0.9756; temperature scaling is monotonic, so discrimination is
preserved). **Decision: use calibrated probabilities** for thresholding and test
reporting. Parameters + before/after metrics: `configs/calibration.json`.
Reliability diagrams: `results/figures/{valid,test}_calibration.png`.
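Temperature scaling divides each logit by a single scalar T before the sigmoid and picks T to minimize validation NLL. The sketch below illustrates the idea in pure Python. It swaps the LBFGS optimizer mentioned above for a golden-section search over log T (a stand-in, not the method used in this experiment), and the function names and search bounds are assumptions.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def nll(logits, labels, temperature):
    """Mean binary negative log-likelihood of sigmoid(logit / T)."""
    eps = 1e-12
    total = 0.0
    for z, y in zip(logits, labels):
        p = min(max(sigmoid(z / temperature), eps), 1.0 - eps)
        total -= math.log(p) if y == 1 else math.log(1.0 - p)
    return total / len(logits)


def fit_temperature(logits, labels, lo=0.05, hi=20.0, iters=100):
    """Golden-section search for the T that minimizes NLL (searched in log T)."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = math.log(lo), math.log(hi)
    for _ in range(iters):
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if nll(logits, labels, math.exp(c)) < nll(logits, labels, math.exp(d)):
            b = d  # minimum lies in [a, d]
        else:
            a = c  # minimum lies in [c, b]
    return math.exp((a + b) / 2.0)


def calibrate(logits, temperature):
    """Apply a fitted temperature to raw logits."""
    return [sigmoid(z / temperature) for z in logits]
```

Because dividing by a positive T never changes the ordering of the logits, AUROC is unaffected. A fitted T below 1, as reported here, pushes probabilities away from 0.5, which corresponds to an uncalibrated model that was underconfident on validation.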
## Threshold selection decision
On the validation set, using calibrated probabilities, the primary threshold was
the **highest-specificity threshold achieving sensitivity ≥ 0.95**
(sensitivity-prioritized, clinically motivated). The target was achievable.
- **Locked threshold = 0.7113**: validation sensitivity **0.952**, specificity **0.896**.
- Secondary reference (Youden's J): coincided at 0.7113 here.
Threshold **locked before** the test set was evaluated. Config: `configs/threshold.json`.
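The selection rule can be sketched as a scan over candidate thresholds. This is an illustration under stated assumptions: a case is called malignant when its calibrated probability is greater than or equal to the threshold, candidates are the observed probabilities, and the function name is hypothetical.

```python
def select_threshold(probs, labels, min_sensitivity=0.95):
    """Return (threshold, sensitivity, specificity) with the highest
    specificity among thresholds whose sensitivity meets the target.

    `labels` use 1 for malignant and 0 for benign.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best = None
    for t in sorted(set(probs)):  # ascending, so specificity never decreases
        tp = sum(1 for p, y in zip(probs, labels) if y == 1 and p >= t)
        tn = sum(1 for p, y in zip(probs, labels) if y == 0 and p < t)
        sens, spec = tp / n_pos, tn / n_neg
        if sens >= min_sensitivity and (best is None or spec >= best[2]):
            best = (t, sens, spec)
    return best
```

Since sensitivity can only fall and specificity can only rise as the threshold increases, the rule is equivalent to taking the largest threshold that still meets the sensitivity target.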
## Final locked threshold
**0.7113139** (on calibrated malignancy probability).
## Final test results with 95% CIs
Test split (n=1000), calibrated probabilities + locked threshold. CIs: stratified
bootstrap, 2000 resamples, seed=42.
| Metric | Point | 95% CI |
|--------|------:|:------:|
| AUROC | 0.9371 | [0.9202, 0.9528] |
| Sensitivity | 0.9042 | [0.8824, 0.9248] |
| Specificity | 0.7955 | [0.7435, 0.8439] |
| PPV | 0.9232 | [0.9054, 0.9401] |
| NPV | 0.7535 | [0.7123, 0.7979] |
| Accuracy | 0.8750 | [0.8540, 0.8950] |
| F1 | 0.9136 | [0.8991, 0.9278] |
| Brier | 0.0823 | n/a |
| ECE | 0.0314 | n/a |
Confusion matrix (Test): TN=214, FP=55, FN=70, TP=661.
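The threshold-based point estimates in the table follow directly from these four counts. A short check (the function name is illustrative):

```python
def confusion_metrics(tp, fp, fn, tn):
    """Standard binary metrics with malignant as the positive class."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "f1": 2 * tp / (2 * tp + fp + fn),
    }


metrics = confusion_metrics(tp=661, fp=55, fn=70, tn=214)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

AUROC, Brier, and ECE cannot be recovered from the confusion matrix; they require the per-image probabilities.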
Tables: `results/tables/test_metrics_with_ci.{md,csv}`; per-image predictions:
`results/{valid,test}_predictions.csv`.
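A stratified percentile bootstrap of the kind described above resamples malignant and benign cases separately, so every resample keeps the test set's class counts. The sketch below is an assumption-laden illustration, not the evaluation script: it works from per-class hit/miss outcomes rebuilt from the confusion matrix, uses Python's `random` module, and will not reproduce the reported intervals digit for digit.

```python
import random


def stratified_bootstrap_ci(pos_hits, neg_hits, metric,
                            n_resamples=2000, seed=42, level=0.95):
    """Percentile CI for `metric(tp, fp, fn, tn)`.

    pos_hits: 1 if a malignant case was flagged (TP), 0 if missed (FN).
    neg_hits: 1 if a benign case was cleared (TN), 0 if flagged (FP).
    """
    rng = random.Random(seed)
    stats = []
    for _ in range(n_resamples):
        pos = rng.choices(pos_hits, k=len(pos_hits))  # resample within class
        neg = rng.choices(neg_hits, k=len(neg_hits))
        tp, tn = sum(pos), sum(neg)
        stats.append(metric(tp, len(neg) - tn, len(pos) - tp, tn))
    stats.sort()
    tail = (1.0 - level) / 2.0
    lo = stats[int(tail * (n_resamples - 1))]
    hi = stats[int(round((1.0 - tail) * (n_resamples - 1)))]
    return lo, hi


def sensitivity(tp, fp, fn, tn):
    return tp / (tp + fn)


# Outcomes rebuilt from the test confusion matrix: TP=661, FN=70, TN=214, FP=55.
pos_hits = [1] * 661 + [0] * 70
neg_hits = [1] * 214 + [0] * 55
ci_low, ci_high = stratified_bootstrap_ci(pos_hits, neg_hits, sensitivity)
```

The same function works for specificity, PPV, NPV, accuracy, and F1 by swapping the `metric` argument. AUROC would need the per-image probabilities instead of hit/miss outcomes.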
## Failed / weaker runs
No run errored (all rc=0). Weakest configurations: `medical_strong` augmentation
(0.9609) and `flip_only` (0.9637), both under the `medical_default` baseline,
confirming the augmentation policy choice. Earlier Trackio smoke runs
(`smoke_trackio_test*`, `connectivity_check`, `dataset_pin_check`) are
infrastructure-validation runs, not experiments. Early Trackio Space creation
produced transient 401 `/volumes` warnings until a persistent `dataset_id`
(`Johnyquest7/Trakio_agentic_thyroid_dataset`) was pinned; resolved thereafter.
## Limitations
- Single-source dataset; cropped-ROI inputs; mild class imbalance.
- The ≥0.95-sensitivity operating point set on validation yielded **0.904**
sensitivity on test, so the operating point does not transfer perfectly: 70 of 731
malignant nodules (~10%) are missed at the locked threshold. Local threshold
re-calibration is advisable before any use.
- Leakage checks (exact-pixel hash + filename-ID overlap) are exhaustive for the
available signal but cannot exclude same-patient/near-duplicate leakage if such
structure exists upstream in TN5000.
## External validation: NOT yet performed
No external/independent dataset has been evaluated. `evaluate_external.py` is
provided to run the locked model (same preprocessing, calibration T, and locked
threshold) on a future external set (folder or CSV format). External, ideally
prospective and multi-site, validation is **required** before any clinical use.
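Applying the locked model to a new case comes down to two frozen constants. The sketch below shows that decision rule only and is not the interface of `evaluate_external.py`, which is not documented here. It assumes the model emits a single malignancy logit, that calibration is `sigmoid(logit / T)`, and that a probability equal to the threshold counts as malignant; the function names are hypothetical.

```python
import math

TEMPERATURE = 0.5646   # temperature fitted on validation
THRESHOLD = 0.7113139  # locked threshold on calibrated probability


def calibrated_probability(logit, temperature=TEMPERATURE):
    """Temperature-scaled malignancy probability for one raw logit."""
    z = logit / temperature
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_malignant(logit):
    """Apply the locked calibration and threshold to one raw logit."""
    return calibrated_probability(logit) >= THRESHOLD


def logit_cutoff():
    """Equivalent cutoff on the raw (uncalibrated) logit."""
    return TEMPERATURE * math.log(THRESHOLD / (1.0 - THRESHOLD))
```

Because both steps are monotonic, the rule reduces to comparing the raw logit against a single cutoff, which makes it easy to confirm that an external run uses the same operating point.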
## Test-set integrity statement
> The Test split was evaluated **exactly once**, and **only after** the model was
> selected (by validation AUROC), calibrated (temperature scaling on validation),
> and the decision threshold was locked (on validation). No hyperparameter,
> calibration, or threshold decision used the test set.