
Experiment Log: Agentic Thyroid ResNet-18

Chronological, decision-by-decision record for reproducibility and journal review.

Provenance

  • Experiment date (UTC): 2026-06-05
  • Dataset: Johnyquest7/TN5000-thyroid-nodule-classification
    • Commit SHA: 73d6c0713a89c8c07125fe6ffc5956be60e9853d
    • Loaded directly from the Train/Valid/Test folder structure (NOT the datasets-viewer flattened train config, which merges all 5,000 rows), so the predefined splits are respected.
  • Compute: Hugging Face GPU sandbox, NVIDIA A10G (24 GB), CUDA 13.0, cuDNN 9.2.
  • Key packages: torch 2.12.0+cu130, torchvision 0.27.0+cu130, timm 1.0.27, scikit-learn 1.9.0, numpy 2.4.6, trackio 0.26.0 (see configs/env_info.json).
  • Global seed: 42. Strict determinism: torch.use_deterministic_algorithms(True), cuDNN deterministic, CUBLAS_WORKSPACE_CONFIG=:4096:8, seeded DataLoader workers.
  • Positive class: Malignant (label 1).
  • Experiment tracking: Trackio project agentic_thyroid_resnet18, dashboard Space Johnyquest7/Trakio_agentic_thyroid, storage dataset Johnyquest7/Trakio_agentic_thyroid_dataset.
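
The determinism settings listed above can be sketched as follows. This is a minimal illustration rather than the repository's actual setup code: the helper names `seed_everything`, `seed_worker`, and `make_generator` are hypothetical, and only the seed value and the flags come from this log.

```python
import os
import random

import numpy as np
import torch


def seed_everything(seed: int = 42) -> None:
    """Seed all RNGs and request deterministic kernels (helper name is hypothetical)."""
    # Must be set before the first cuBLAS call for deterministic GEMMs on CUDA.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # safe to call on CPU-only machines
    torch.use_deterministic_algorithms(True)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False


def seed_worker(worker_id: int) -> None:
    """Give each DataLoader worker a seed derived from the base seed."""
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)


def make_generator(seed: int = 42) -> torch.Generator:
    """Generator that fixes the DataLoader shuffling order."""
    g = torch.Generator()
    g.manual_seed(seed)
    return g
```

A loader would then be built with `DataLoader(dataset, batch_size=32, shuffle=True, worker_init_fn=seed_worker, generator=make_generator(42))`, where `dataset` stands for whatever dataset object the training script defines.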

Exact split usage

| Split | n | Use |
|---|---|---|
| Train | 3,500 | Training only. |
| Valid | 500 | Model selection (val AUROC), calibration, threshold selection. |
| Test | 1,000 | Final locked evaluation, exactly once, after model+calibration+threshold were frozen. |

Class distribution

| Split | Benign | Malignant | Malignant % | Ratio (M:B) |
|---|---|---|---|---|
| Train | 1,032 | 2,468 | 70.5% | 2.39 : 1 |
| Valid | 125 | 375 | 75.0% | 3.00 : 1 |
| Test | 269 | 731 | 73.1% | 2.72 : 1 |

Data audit (data_exploration_report.md): 0 corrupt images; all images are 224×224 RGB PNG; 0 cross-split exact-pixel duplicates, 0 filename-ID overlaps, 0 label conflicts; per-split mean intensity ≈ 81.4 (std ≈ 19.4), i.e. no shift in mean intensity between splits. Conclusion: no detectable leakage; splits are clean and separate.
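
A cross-split duplicate check of this kind can be sketched with content hashes. The sketch below is an assumption about how such a check might look, not the audit script itself: it hashes raw file bytes, which catches byte-identical files but would miss images with identical pixels and different encodings, and the function name and input layout are hypothetical.

```python
import hashlib
from collections import defaultdict


def cross_split_duplicates(splits):
    """Find content hashes that appear in more than one split.

    splits: {split_name: {filename: raw_image_bytes}} (hypothetical layout).
    Returns {sha256_hex: sorted list of (split_name, filename)}.
    """
    seen = defaultdict(set)
    for split, files in splits.items():
        for name, data in files.items():
            seen[hashlib.sha256(data).hexdigest()].add((split, name))
    # Keep only digests whose locations span at least two different splits.
    return {
        digest: sorted(locations)
        for digest, locations in seen.items()
        if len({split for split, _ in locations}) > 1
    }
```

An empty result corresponds to the "0 cross-split duplicates" finding; hashing decoded pixel arrays instead of file bytes would be the stricter variant.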

Literature-informed augmentation rationale

Augmentations were restricted to medically plausible B-mode ultrasound transforms (MediAug arXiv:2504.18983 + thyroid-US practice):

  • Kept: horizontal flip (thyroid is bilaterally symmetric), small rotation (≤10°), mild affine translate (5%) / scale (0.9–1.1), mild brightness/contrast (±15%, simulates gain/TGC), light Gaussian blur, and (in the medical_strong ablation) narrow random-resized-crop (0.8–1.0) + mild speckle noise.
  • Explicitly avoided: vertical flip (US depth axis is physically meaningful), large rotation/shear (distorts taller-than-wide shape and margin morphology, which are TI-RADS malignancy cues), aggressive crop (<0.8, can remove the nodule), and any color/HSV jitter (images are grayscale).

Ablation result (val AUROC): medical_default 0.9712–0.9756 > flip_only 0.9637 > medical_strong 0.9609. The literature-default policy won.

Model variants tried

  • torchvision ResNet-18 (ImageNet1K_V1, bilinear/256→224 preprocessing).
  • timm:resnet18.a1_in1k (A1 recipe, bicubic, crop_pct 0.95): selected.
  • timm:resnet18.a2_in1k (A2 recipe).
  • Fine-tune depth: full fine-tune vs freeze stem+layer1 (freeze_stage=1).

Hyperparameter sweep (14 trials, one-factor-at-a-time around a literature-informed center)

Center: timm a1, lr 2e-4, wd 1e-4, bs 32, medical_default, pos_weight, full fine-tune, BCE, AdamW, cosine, ≤40 epochs, early-stop(8). All runs logged to Trackio. Selection metric: validation AUROC. All 14 trials completed (rc=0).

| Rank | Run | Change vs center | Val AUROC | Best epoch |
|---|---|---|---|---|
| 1 | c12_loss_focal | focal γ=1.0, imbalance=none | 0.9756 | 6 |
| 2 | c09_imb_none | imbalance=none (BCE) | 0.9739 | 6 |
| 3 | c01_backbone_torchvision | torchvision backbone | 0.9731 | 7 |
| 4 | c03_lr_1e-4 | lr 1e-4 | 0.9721 | 11 |
| 5 | c06_bs_64 | batch size 64 | 0.9717 | 8 |
| 6 | c05_wd_1e-3 | weight decay 1e-3 | 0.9712 | 9 |
| 6 | c00_center_a1 | center config | 0.9712 | 6 |
| 8 | c10_imb_sampler | weighted sampler | 0.9697 | 13 |
| 9 | c02_backbone_a2 | a2 backbone | 0.9693 | 6 |
| 10 | c04_lr_5e-4 | lr 5e-4 | 0.9675 | 9 |
| 11 | c11_freeze1 | freeze stem+layer1 | 0.9672 | 6 |
| 12 | c13_lr1e-4_wd1e-3_drop | lr1e-4+wd1e-3+dropout0.2 | 0.9657 | 11 |
| 13 | c07_aug_flip_only | flip-only aug | 0.9637 | 6 |
| 14 | c08_aug_strong | strong aug | 0.9609 | 8 |

Findings: (1) For this mild (~70/30) imbalance, focal loss (γ=1.0) with no extra reweighting beat class-weighted BCE and weighted sampling; heavy reweighting slightly hurt, consistent with the literature. (2) medical_default augmentation is the sweet spot. (3) Full fine-tune > freezing. (4) Backbones were close (a1 ≈ torchvision ≈ a2). Full per-run details: results/tables/sweep_leaderboard.json.

The trial count was deliberately kept small (14 one-factor trials) to limit overfitting to the 500-image validation set.
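
A one-factor-at-a-time sweep of this shape can be generated by overlaying small overrides on the center config. The sketch below is illustrative only: the dictionary keys and `build_trials` are hypothetical, `freeze_stage=0` is assumed to mean full fine-tuning, and only four of the 14 runs are shown.

```python
# Center configuration; key names are hypothetical, values come from the log.
CENTER = {
    "backbone": "timm:resnet18.a1_in1k",
    "lr": 2e-4,
    "weight_decay": 1e-4,
    "batch_size": 32,
    "aug": "medical_default",
    "imbalance": "pos_weight",
    "loss": "bce",
    "freeze_stage": 0,  # assumption: 0 = full fine-tune
}

# Run name -> the settings that differ from the center.
OVERRIDES = {
    "c00_center_a1": {},
    "c03_lr_1e-4": {"lr": 1e-4},
    "c06_bs_64": {"batch_size": 64},
    "c12_loss_focal": {"loss": "focal", "imbalance": "none"},
}


def build_trials(center, overrides):
    """Return {run_name: full_config}, leaving the center dict untouched."""
    return {name: {**center, **delta} for name, delta in overrides.items()}


trials = build_trials(CENTER, OVERRIDES)
```

Keeping each run as a small delta makes the "Change vs center" column of the leaderboard directly recoverable from the sweep definition.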

Selected run

c12_loss_focal: timm:resnet18.a1_in1k, focal loss (γ=1.0, α=0.5), imbalance=none, AdamW lr 2e-4 / wd 1e-4, batch 32, medical_default aug, full fine-tune, cosine schedule, best epoch 6, validation AUROC 0.9756. Selected purely on validation AUROC, before any test access. Config: configs/final_config.yaml; weights: final_model.pt.
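
For reference, the standard binary focal loss is FL = −α_t (1 − p_t)^γ log(p_t), where p_t is the predicted probability of the true class. The sketch below implements that textbook form for a single example; whether the training script uses exactly this variant (for example its reduction or α convention) is an assumption.

```python
import math


def binary_focal_loss(logit, target, gamma=1.0, alpha=0.5):
    """Focal loss for one example; target is 1 (malignant) or 0 (benign).

    Plain sigmoid for readability; very negative logits (below about -700)
    would overflow math.exp and need a numerically stable form.
    """
    p = 1.0 / (1.0 + math.exp(-logit))
    p_t = p if target == 1 else 1.0 - p               # probability of the true class
    alpha_t = alpha if target == 1 else 1.0 - alpha   # class weight
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

With α=0.5 both classes receive the same weight, which matches the imbalance=none setting; γ=1.0 only down-weights examples the model already classifies confidently.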

Calibration decision

Assessed on validation. Temperature scaling (single parameter, LBFGS on NLL) gave T = 0.5646. Validation ECE 0.0833 → 0.0308, Brier 0.0592 → 0.0525, AUROC unchanged (0.9756; temperature scaling is monotonic ⇒ discrimination preserved). Decision: use calibrated probabilities for thresholding and test reporting. Parameters + before/after metrics: configs/calibration.json. Reliability diagrams: results/figures/{valid,test}_calibration.png.
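
Temperature scaling divides each logit by a scalar T before the sigmoid and picks the T that minimizes validation NLL. The log states that LBFGS was used; the sketch below swaps in a golden-section search over log T so that it needs only the standard library. The function names, search bounds, and the 10-bin equal-width ECE definition are assumptions.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def nll(logits, labels, T):
    """Mean binary negative log-likelihood of sigmoid(logit / T)."""
    eps = 1e-12
    total = 0.0
    for z, y in zip(logits, labels):
        p = min(max(sigmoid(z / T), eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(logits)


def fit_temperature(logits, labels, lo=0.05, hi=20.0, iters=80):
    """Golden-section search over log T (stand-in for the LBFGS fit)."""
    a, b = math.log(lo), math.log(hi)
    phi = (math.sqrt(5) - 1) / 2
    for _ in range(iters):
        c, d = b - phi * (b - a), a + phi * (b - a)
        if nll(logits, labels, math.exp(c)) < nll(logits, labels, math.exp(d)):
            b = d
        else:
            a = c
    return math.exp((a + b) / 2)


def ece(probs, labels, n_bins=10):
    """Expected calibration error of the positive-class probability."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        bins[min(int(p * n_bins), n_bins - 1)].append((p, y))
    total = 0.0
    for members in bins:
        if members:
            conf = sum(p for p, _ in members) / len(members)
            freq = sum(y for _, y in members) / len(members)
            total += len(members) / len(probs) * abs(freq - conf)
    return total
```

A fitted T below 1, as reported here, sharpens the probabilities, which indicates the uncalibrated model was under-confident on validation. Because dividing by a positive T preserves the ordering of logits, AUROC cannot change.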

Threshold selection decision

On the validation set, using calibrated probabilities, the primary threshold was the highest-specificity threshold achieving sensitivity ≥ 0.95 (sensitivity-prioritized, clinically motivated). Target was achievable.

  • Locked threshold = 0.7113 → validation sensitivity 0.952, specificity 0.896.
  • Secondary reference (Youden's J): coincided at 0.7113 here.

Threshold locked before the test set was evaluated. Config: configs/threshold.json.
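
The selection rule can be sketched as a scan over candidate thresholds from high to low, stopping at the first one that reaches the sensitivity target. This is an illustrative implementation, not the repository's script: the function name is hypothetical, and a prediction is assumed positive when the probability is greater than or equal to the threshold.

```python
def select_threshold(probs, labels, min_sensitivity=0.95):
    """Return (threshold, sensitivity, specificity), or None if unreachable.

    Specificity never increases as the threshold is lowered, so the largest
    threshold meeting the sensitivity target also has the highest specificity.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    for t in sorted(set(probs), reverse=True):
        tp = sum(1 for p, y in zip(probs, labels) if y == 1 and p >= t)
        tn = sum(1 for p, y in zip(probs, labels) if y == 0 and p < t)
        sensitivity = tp / n_pos
        if sensitivity >= min_sensitivity:
            return t, sensitivity, tn / n_neg
    return None


# Toy example with four malignant (1) and four benign (0) cases.
toy_probs = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
toy_labels = [1, 1, 1, 0, 1, 0, 0, 0]
print(select_threshold(toy_probs, toy_labels, min_sensitivity=1.0))
```

In the toy data the malignant case scored 0.4 forces the threshold down to 0.4, which costs one benign case (0.6) and gives specificity 0.75.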

Final locked threshold

0.7113139 (on calibrated malignancy probability).

Final test results with 95% CIs

Test split (n=1000), calibrated probabilities + locked threshold. CIs: stratified bootstrap, 2000 resamples, seed=42.

| Metric | Point | 95% CI |
|---|---|---|
| AUROC | 0.9371 | [0.9202, 0.9528] |
| Sensitivity | 0.9042 | [0.8824, 0.9248] |
| Specificity | 0.7955 | [0.7435, 0.8439] |
| PPV | 0.9232 | [0.9054, 0.9401] |
| NPV | 0.7535 | [0.7123, 0.7979] |
| Accuracy | 0.8750 | [0.8540, 0.8950] |
| F1 | 0.9136 | [0.8991, 0.9278] |
| Brier | 0.0823 | n/a |
| ECE | 0.0314 | n/a |
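
A stratified bootstrap resamples malignant and benign cases separately, so every resample keeps the original class counts. The sketch below assumes a percentile interval, which the log does not specify, and uses Python's `random` module rather than whatever RNG the evaluation script uses, so it will not reproduce the exact intervals in the table. Function names are hypothetical.

```python
import random


def stratified_bootstrap_ci(y_true, y_pred, metric, n_resamples=2000,
                            seed=42, alpha=0.05):
    """Percentile CI for metric(y_true, y_pred), resampling within each class."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(y_true) if y == 1]
    neg = [i for i, y in enumerate(y_true) if y == 0]
    stats = []
    for _ in range(n_resamples):
        # Sample with replacement, keeping the class counts fixed.
        idx = rng.choices(pos, k=len(pos)) + rng.choices(neg, k=len(neg))
        stats.append(metric([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    stats.sort()
    lo = stats[round((alpha / 2) * (n_resamples - 1))]
    hi = stats[round((1 - alpha / 2) * (n_resamples - 1))]
    return lo, hi


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

Stratifying matters most for sensitivity and specificity, whose denominators are the per-class counts; without it those denominators would vary from resample to resample.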

Confusion matrix (Test): TN=214, FP=55, FN=70, TP=661. Tables: results/tables/test_metrics_with_ci.{md,csv}; per-image predictions: results/{valid,test}_predictions.csv.
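
The threshold-dependent point estimates in the table follow directly from the confusion matrix. The short check below recomputes them; the function name is hypothetical.

```python
def metrics_from_confusion(tn, fp, fn, tp):
    """Threshold-dependent metrics from binary confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "f1": 2 * tp / (2 * tp + fp + fn),
    }


m = metrics_from_confusion(tn=214, fp=55, fn=70, tp=661)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

For example, sensitivity is 661 / (661 + 70) = 0.9042 and specificity is 214 / (214 + 55) = 0.7955, matching the table.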

Failed / weaker runs

No run errored (all rc=0). Weakest configurations: medical_strong augmentation (0.9609) and flip_only (0.9637), both below the medical_default baseline, which supports the augmentation policy choice. Earlier trackio smoke runs (smoke_trackio_test*, connectivity_check, dataset_pin_check) are infrastructure-validation runs, not experiments. Early Trackio Space creation produced transient 401 /volumes warnings until a persistent dataset_id (Johnyquest7/Trakio_agentic_thyroid_dataset) was pinned; resolved thereafter.

Limitations

  • Single-source dataset; cropped-ROI inputs; mild class imbalance.
  • The ≥0.95-sensitivity operating point set on validation yielded 0.904 sensitivity on test: the operating point does not transfer perfectly, and ~10% of malignant nodules are missed at the locked threshold. Local threshold re-calibration is advisable before any use.
  • Leakage checks (exact-pixel hash + filename-ID overlap) are exhaustive for the available signal but cannot exclude same-patient/near-duplicate leakage if such structure exists upstream in TN5000.

External validation: NOT yet performed

No external/independent dataset has been evaluated. evaluate_external.py is provided to run the locked model (same preprocessing, calibration T, and locked threshold) on a future external set (folder or CSV format). External, ideally prospective and multi-site, validation is required before any clinical use.
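
Applying the locked pipeline to a new case reduces to two frozen numbers. The sketch below shows that final step only and is not the contents of evaluate_external.py: it assumes the model emits a single malignancy logit, and the function name is hypothetical.

```python
import math

TEMPERATURE = 0.5646        # calibration T reported in this log
LOCKED_THRESHOLD = 0.7113139  # locked threshold reported in this log


def predict_malignant(logit):
    """Return (calibrated probability, decision) for one raw model logit."""
    prob = 1.0 / (1.0 + math.exp(-logit / TEMPERATURE))
    return prob, prob >= LOCKED_THRESHOLD


for raw_logit in (0.0, 2.0):
    print(raw_logit, predict_malignant(raw_logit))
```

Neither constant should be refitted on the external set if the goal is to measure how the locked model transfers.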

Test-set integrity statement

The Test split was evaluated exactly once, and only after the model was selected (by validation AUROC), calibrated (temperature scaling on validation), and the decision threshold was locked (on validation). No hyperparameter, calibration, or threshold decision used the test set.