# Experiment Log: Agentic Thyroid ResNet-18
Chronological, decision-by-decision record for reproducibility and journal review.
## Provenance
- **Experiment date (UTC):** 2026-06-05
- **Dataset:** `Johnyquest7/TN5000-thyroid-nodule-classification`
- Commit SHA: `73d6c0713a89c8c07125fe6ffc5956be60e9853d`
- Loaded **directly from the Train/Valid/Test folder structure** (NOT the
datasets-viewer flattened `train` config, which merges all 5,000 rows),
so the predefined splits are respected.
- **Compute:** Hugging Face GPU sandbox, NVIDIA A10G (24 GB), CUDA 13.0, cuDNN 9.2.
- **Key packages:** torch 2.12.0+cu130, torchvision 0.27.0+cu130, timm 1.0.27,
scikit-learn 1.9.0, numpy 2.4.6, trackio 0.26.0 (see `configs/env_info.json`).
- **Global seed:** 42. Strict determinism: `torch.use_deterministic_algorithms(True)`,
cuDNN deterministic, `CUBLAS_WORKSPACE_CONFIG=:4096:8`, seeded DataLoader workers.
- **Positive class:** Malignant (label 1).
- **Experiment tracking:** Trackio project `agentic_thyroid_resnet18`, dashboard
Space `Johnyquest7/Trakio_agentic_thyroid`, storage dataset
`Johnyquest7/Trakio_agentic_thyroid_dataset`.
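The determinism settings listed above could be applied roughly as follows. This is a minimal sketch, not the repository's actual code: the function name `seed_everything` is hypothetical, and the guarded imports are only there so the snippet also runs where NumPy or PyTorch is not installed.

```python
import os
import random


def seed_everything(seed: int = 42) -> None:
    """Seed the Python, NumPy and PyTorch RNGs and request deterministic kernels."""
    # cuBLAS reads this variable at initialisation, so it has to be set
    # before the first CUDA call for deterministic GEMMs.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed: nothing to seed
    try:
        import torch
    except ImportError:
        return  # PyTorch not installed: only the stdlib RNG was seeded
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)          # silently ignored without a GPU
    torch.use_deterministic_algorithms(True)  # raise on non-deterministic ops
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False    # disable autotuned kernel choice


seed_everything(42)
```

Seeded DataLoader workers would additionally need a `worker_init_fn` and a seeded `torch.Generator` passed to the `DataLoader`; that part is omitted here.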
## Exact split usage
| Split | n | Use |
|-------|--:|-----|
| Train | 3,500 | **Training only.** |
| Valid | 500 | **Model selection (val AUROC), calibration, threshold selection.** |
| Test | 1,000 | **Final locked evaluation, exactly once**, after model+calibration+threshold were frozen. |
## Class distribution
| Split | Benign | Malignant | Malignant % | Ratio (M:B) |
|-------|-------:|----------:|------------:|------------:|
| Train | 1,032 | 2,468 | 70.5% | 2.39 : 1 |
| Valid | 125 | 375 | 75.0% | 3.00 : 1 |
| Test | 269 | 731 | 73.1% | 2.72 : 1 |
Data audit (`data_exploration_report.md`): **0 corrupt images**, all 224×224 RGB
PNG; **0 cross-split exact-pixel duplicates**, **0 filename-ID overlaps**, **0
label conflicts**; per-split mean intensity ≈ 81.4 (std ≈ 19.4), with no distribution
shift. Conclusion: **no detectable leakage; splits are clean and separate.**
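A cross-split duplicate check of this kind can be sketched with file hashing. The version below is an illustration under stated assumptions, not the audit script itself: it hashes raw file bytes (a simpler stand-in for the exact-pixel hash, which would decode each image first), and the function names are hypothetical. Only the `Train`/`Valid`/`Test` folder names come from this log.

```python
import hashlib
from pathlib import Path


def hash_split(split_dir: Path) -> dict[str, list[Path]]:
    """Map SHA-256 digest of file bytes -> PNG files in one split with that digest."""
    digests: dict[str, list[Path]] = {}
    for path in sorted(split_dir.rglob("*.png")):
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        digests.setdefault(digest, []).append(path)
    return digests


def cross_split_duplicates(root: Path, splits=("Train", "Valid", "Test")):
    """Return (split_a, split_b, digest) for every digest present in two splits."""
    hashed = {name: hash_split(root / name) for name in splits}
    hits = []
    for i, a in enumerate(splits):
        for b in splits[i + 1:]:
            # Dict key views support set intersection directly.
            for digest in sorted(hashed[a].keys() & hashed[b].keys()):
                hits.append((a, b, digest))
    return hits
```

An empty list from `cross_split_duplicates(Path("path/to/dataset"))` corresponds to the "0 cross-split duplicates" finding. Byte hashing misses re-encoded copies of the same pixels, which is why a pixel-level hash is the stricter check.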
## Literature-informed augmentation rationale
Augmentations were restricted to medically plausible B-mode ultrasound transforms
(MediAug arXiv:2504.18983 + thyroid-US practice):
- **Kept:** horizontal flip (thyroid is bilaterally symmetric), small rotation
(≤10°), mild affine translate (5%) / scale (0.9–1.1), mild brightness/contrast
(±15%, simulates gain/TGC), light Gaussian blur, and (in the `medical_strong`
ablation) narrow random-resized-crop (0.8–1.0) + mild speckle noise.
- **Explicitly avoided:** vertical flip (US depth axis is physically meaningful),
large rotation/shear (distorts taller-than-wide / margin morphology, both TI-RADS
malignancy cues), aggressive crop (<0.8, can remove the nodule), and any
color/HSV jitter (images are grayscale).
Ablation result (val AUROC): `medical_default` **0.9712–0.9756** > `flip_only`
**0.9637** > `medical_strong` **0.9609**. The literature-default policy won.
## Model variants tried
- `torchvision` ResNet-18 (ImageNet1K_V1, bilinear/256→224 preprocessing).
- `timm:resnet18.a1_in1k` (A1 recipe, bicubic, crop_pct 0.95), **selected**.
- `timm:resnet18.a2_in1k` (A2 recipe).
- Fine-tune depth: full fine-tune vs freeze stem+layer1 (`freeze_stage=1`).
## Hyperparameter sweep (14 trials, one-factor-at-a-time around a literature-informed center)
Center: timm a1, lr 2e-4, wd 1e-4, bs 32, `medical_default`, `pos_weight`,
full fine-tune, BCE, AdamW, cosine, ≤40 epochs, early-stop(8). All runs logged to
Trackio. **Selection metric: validation AUROC.** All 14 trials completed (rc=0).
| Rank | Run | Change vs center | Val AUROC | Best epoch |
|-----:|-----|------------------|----------:|-----------:|
| 1 | **c12_loss_focal** | **focal γ=1.0, imbalance=none** | **0.9756** | 6 |
| 2 | c09_imb_none | imbalance=none (BCE) | 0.9739 | 6 |
| 3 | c01_backbone_torchvision | torchvision backbone | 0.9731 | 7 |
| 4 | c03_lr_1e-4 | lr 1e-4 | 0.9721 | 11 |
| 5 | c06_bs_64 | batch size 64 | 0.9717 | 8 |
| 6 | c05_wd_1e-3 | weight decay 1e-3 | 0.9712 | 9 |
| 6 | c00_center_a1 | center config | 0.9712 | 6 |
| 8 | c10_imb_sampler | weighted sampler | 0.9697 | 13 |
| 9 | c02_backbone_a2 | a2 backbone | 0.9693 | 6 |
| 10 | c04_lr_5e-4 | lr 5e-4 | 0.9675 | 9 |
| 11 | c11_freeze1 | freeze stem+layer1 | 0.9672 | 6 |
| 12 | c13_lr1e-4_wd1e-3_drop | lr1e-4+wd1e-3+dropout0.2 | 0.9657 | 11 |
| 13 | c07_aug_flip_only | flip-only aug | 0.9637 | 6 |
| 14 | c08_aug_strong | strong aug | 0.9609 | 8 |
Findings: (1) For this mild (~70/30) imbalance, **focal loss (γ=1.0) and no extra
reweighting beat class-weighted BCE and weighted sampling**; heavy reweighting
slightly hurt, consistent with the literature. (2) `medical_default` augmentation
is the sweet spot. (3) **Full fine-tune > freezing.** (4) Backbones were close
(a1 ≈ torchvision ≈ a2). Full per-run details: `results/tables/sweep_leaderboard.json`.
The trial count was deliberately kept small (14 one-factor trials) to limit
overfitting to the 500-image validation set.
## Selected run
**c12_loss_focal**: `timm:resnet18.a1_in1k`, focal loss (γ=1.0, α=0.5),
imbalance=none, AdamW lr 2e-4 / wd 1e-4, batch 32, `medical_default` aug, full
fine-tune, cosine schedule, best epoch 6, **validation AUROC 0.9756**. Selected
**purely on validation AUROC**, before any test access. Config:
`configs/final_config.yaml`; weights: `final_model.pt`.
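For reference, the per-sample binary focal loss with the selected settings can be written as below. This is a sketch assuming the standard form, FL = -α_t (1 - p_t)^γ log(p_t). The training script may differ in details such as reduction or where α is applied, and the function name is hypothetical.

```python
import math


def binary_focal_loss(logit: float, label: int,
                      gamma: float = 1.0, alpha: float = 0.5) -> float:
    """Focal loss for one sample, computed from the raw logit (label is 0 or 1)."""
    # Signed margin toward the true class: p_t = sigmoid(signed).
    signed = logit if label == 1 else -logit
    # Stable log(sigmoid(signed)) = -softplus(-signed).
    log_p_t = -(max(-signed, 0.0) + math.log1p(math.exp(-abs(signed))))
    p_t = math.exp(log_p_t)
    alpha_t = alpha if label == 1 else 1.0 - alpha
    # (1 - p_t)^gamma shrinks the loss of confidently correct samples.
    return -alpha_t * (1.0 - p_t) ** gamma * log_p_t


easy = binary_focal_loss(4.0, 1)   # confident and correct: small loss
hard = binary_focal_loss(-4.0, 1)  # confident and wrong: large loss
```

With α=0.5 both classes receive the same weight, which matches the `imbalance=none` setting; only the γ term changes the loss relative to (half of) plain BCE.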
## Calibration decision
Assessed on validation. **Temperature scaling** (single parameter, LBFGS on NLL)
gave **T = 0.5646**. Validation ECE 0.0833 → **0.0308**, Brier 0.0592 → 0.0525,
AUROC unchanged (0.9756; temperature scaling is monotonic ⇒ discrimination
preserved). **Decision: use calibrated probabilities** for thresholding and test
reporting. Parameters + before/after metrics: `configs/calibration.json`.
Reliability diagrams: `results/figures/{valid,test}_calibration.png`.
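Temperature scaling fits one scalar T so that `sigmoid(logit / T)` minimises NLL on the validation set. The sketch below is illustrative only: it swaps the LBFGS optimiser for a golden-section search over log T so that it needs nothing beyond the standard library, and all function names are hypothetical.

```python
import math


def nll(logits, labels, temperature):
    """Mean binary negative log-likelihood of sigmoid(logit / T)."""
    total = 0.0
    for z, y in zip(logits, labels):
        s = z / temperature if y == 1 else -z / temperature
        total += max(-s, 0.0) + math.log1p(math.exp(-abs(s)))  # softplus(-s)
    return total / len(labels)


def fit_temperature(logits, labels, lo=0.05, hi=20.0, iters=80):
    """Golden-section search for the T in [lo, hi] that minimises the NLL."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = math.log(lo), math.log(hi)
    for _ in range(iters):
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if nll(logits, labels, math.exp(c)) < nll(logits, labels, math.exp(d)):
            b = d  # minimum lies in [a, d]
        else:
            a = c  # minimum lies in [c, b]
    return math.exp((a + b) / 2.0)


def calibrated_probability(logit, temperature):
    """Apply a fitted temperature to one raw logit."""
    return 1.0 / (1.0 + math.exp(-logit / temperature))
```

A fitted T below 1, as here, means the raw logits were under-confident: dividing by T pushes probabilities away from 0.5. Because the mapping is strictly increasing in the logit, the ranking of cases, and hence AUROC, is unchanged.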
## Threshold selection decision
On the validation set, using calibrated probabilities, the primary threshold was
the **highest-specificity threshold achieving sensitivity ≥ 0.95**
(sensitivity-prioritized, clinically motivated). Target was achievable.
- **Locked threshold = 0.7113** → validation sensitivity **0.952**, specificity **0.896**.
- Secondary reference (Youden's J): coincided at 0.7113 here.
Threshold **locked before** the test set was evaluated. Config: `configs/threshold.json`.
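The selection rule can be sketched as a scan over candidate thresholds. This is a minimal illustration, assuming a case is called malignant when its calibrated probability is at or above the threshold and that both classes are present; the function name is hypothetical.

```python
def pick_threshold(probs, labels, min_sensitivity=0.95):
    """Return (threshold, sensitivity, specificity) with the highest specificity
    among thresholds whose sensitivity is >= min_sensitivity, or None."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best = None
    for t in sorted(set(probs)):  # ascending candidate thresholds
        tp = sum(1 for p, y in zip(probs, labels) if y == 1 and p >= t)
        tn = sum(1 for p, y in zip(probs, labels) if y == 0 and p < t)
        sens, spec = tp / n_pos, tn / n_neg
        # Specificity never decreases as t rises, so ">=" keeps the largest
        # qualifying threshold when specificities tie.
        if sens >= min_sensitivity and (best is None or spec >= best[2]):
            best = (t, sens, spec)
    return best


probs = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 1, 0, 1, 1, 1, 1]
choice = pick_threshold(probs, labels, min_sensitivity=0.8)
```

In this toy example the rule returns a threshold of 0.6, with sensitivity 0.8 and specificity 1.0. The quadratic scan is fine for a 500-image validation set; a sorted cumulative count would be the choice for larger sets.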
## Final locked threshold
**0.7113139** (on calibrated malignancy probability).
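Put together, the locked decision rule is "calibrate, then compare". The sketch below uses the rounded temperature reported in this log (0.5646), so results can differ in the last digits from the values stored in `configs/calibration.json`; the function names are hypothetical.

```python
import math

TEMPERATURE = 0.5646   # rounded value from the calibration section
THRESHOLD = 0.7113139  # locked threshold on the calibrated probability


def predict_malignant(raw_logit: float) -> bool:
    """Apply temperature scaling, then the locked threshold."""
    prob = 1.0 / (1.0 + math.exp(-raw_logit / TEMPERATURE))
    return prob >= THRESHOLD


def equivalent_logit_cutoff() -> float:
    """Raw-logit value at which the calibrated probability equals the threshold."""
    return TEMPERATURE * math.log(THRESHOLD / (1.0 - THRESHOLD))
```

Since calibration is monotonic, thresholding the calibrated probability is equivalent to thresholding the raw logit at `equivalent_logit_cutoff()`, roughly 0.51 with these rounded constants.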
## Final test results with 95% CIs
Test split (n=1000), calibrated probabilities + locked threshold. CIs: stratified
bootstrap, 2000 resamples, seed=42.
| Metric | Point | 95% CI |
|--------|------:|:------:|
| AUROC | 0.9371 | [0.9202, 0.9528] |
| Sensitivity | 0.9042 | [0.8824, 0.9248] |
| Specificity | 0.7955 | [0.7435, 0.8439] |
| PPV | 0.9232 | [0.9054, 0.9401] |
| NPV | 0.7535 | [0.7123, 0.7979] |
| Accuracy | 0.8750 | [0.8540, 0.8950] |
| F1 | 0.9136 | [0.8991, 0.9278] |
| Brier | 0.0823 | n/a |
| ECE | 0.0314 | n/a |
Confusion matrix (Test): TN=214, FP=55, FN=70, TP=661.
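The threshold-dependent point estimates in the table follow directly from these four counts. A short check, with a hypothetical helper name:

```python
def confusion_metrics(tn: int, fp: int, fn: int, tp: int) -> dict[str, float]:
    """Point estimates derived from a binary confusion matrix."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "f1": 2 * tp / (2 * tp + fp + fn),
    }


metrics = confusion_metrics(tn=214, fp=55, fn=70, tp=661)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Rounded to four decimals, these give sensitivity 0.9042, specificity 0.7955, PPV 0.9232, NPV 0.7535, accuracy 0.8750 and F1 0.9136, matching the table. AUROC, Brier and ECE need the per-image probabilities and cannot be recovered from the counts alone.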
Tables: `results/tables/test_metrics_with_ci.{md,csv}`; per-image predictions:
`results/{valid,test}_predictions.csv`.
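A stratified bootstrap of the kind used for the CIs resamples malignant and benign cases separately, so every resample keeps the original class counts. The sketch below assumes the percentile method and Python's `random` module; the actual script may use a different RNG or interval type, so exact bounds will not match digit for digit. The function names are hypothetical.

```python
import random


def stratified_bootstrap_ci(labels, preds, metric,
                            n_resamples=2000, seed=42, level=0.95):
    """Percentile CI for metric(labels, preds), resampling within each class."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    stats = []
    for _ in range(n_resamples):
        # Draw with replacement inside each class to preserve prevalence.
        idx = rng.choices(pos, k=len(pos)) + rng.choices(neg, k=len(neg))
        stats.append(metric([labels[i] for i in idx], [preds[i] for i in idx]))
    stats.sort()
    tail = (1.0 - level) / 2.0
    lower = stats[int(tail * (n_resamples - 1))]
    upper = stats[int(round((1.0 - tail) * (n_resamples - 1)))]
    return lower, upper


def sensitivity(labels, preds):
    """Fraction of positive cases that were predicted positive."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    return tp / sum(labels)
```

Feeding it binary predictions rebuilt from the test confusion matrix (661 TP, 70 FN, 214 TN, 55 FP) gives a sensitivity interval of similar width to the reported [0.8824, 0.9248].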
## Failed / weaker runs
No run errored (all rc=0). Weakest configurations: `medical_strong` augmentation
(0.9609) and `flip_only` (0.9637), both under the `medical_default` baseline,
confirming the augmentation policy choice. Earlier Trackio smoke runs
(`smoke_trackio_test*`, `connectivity_check`, `dataset_pin_check`) are
infrastructure-validation runs, not experiments. Early Trackio Space creation
produced transient 401 `/volumes` warnings until a persistent `dataset_id`
(`Johnyquest7/Trakio_agentic_thyroid_dataset`) was pinned; resolved thereafter.
## Limitations
- Single-source dataset; cropped-ROI inputs; mild class imbalance.
- The ≥0.95-sensitivity operating point set on validation yielded **0.904**
sensitivity on test: the operating point does not transfer perfectly, and ~10%
(70 of 731) of malignant nodules are missed at the locked threshold. Local threshold
re-calibration is advisable before any use.
- Leakage checks (exact-pixel hash + filename-ID overlap) are exhaustive for the
available signal but cannot exclude same-patient/near-duplicate leakage if such
structure exists upstream in TN5000.
## External validation: NOT yet performed
No external/independent dataset has been evaluated. `evaluate_external.py` is
provided to run the locked model (same preprocessing, calibration T, and locked
threshold) on a future external set (folder or CSV format). External, ideally
prospective and multi-site, validation is **required** before any clinical use.
## Test-set integrity statement
> The Test split was evaluated **exactly once**, and **only after** the model was
> selected (by validation AUROC), calibrated (temperature scaling on validation),
> and the decision threshold was locked (on validation). No hyperparameter,
> calibration, or threshold decision used the test set.