Experiment Log: Agentic Thyroid ResNet-18
Chronological, decision-by-decision record for reproducibility and journal review.
Provenance
- Experiment date (UTC): 2026-06-05
- Dataset: Johnyquest7/TN5000-thyroid-nodule-classification
  - Commit SHA: 73d6c0713a89c8c07125fe6ffc5956be60e9853d
  - Loaded directly from the Train/Valid/Test folder structure (NOT the datasets-viewer flattened train config, which merges all 5,000 rows), so the predefined splits are respected.
- Compute: Hugging Face GPU sandbox, NVIDIA A10G (24 GB), CUDA 13.0, cuDNN 9.2.
- Key packages: torch 2.12.0+cu130, torchvision 0.27.0+cu130, timm 1.0.27, scikit-learn 1.9.0, numpy 2.4.6, trackio 0.26.0 (see configs/env_info.json).
- Global seed: 42. Strict determinism: torch.use_deterministic_algorithms(True), cuDNN deterministic, CUBLAS_WORKSPACE_CONFIG=:4096:8, seeded DataLoader workers.
- Positive class: Malignant (label 1).
- Experiment tracking: Trackio project agentic_thyroid_resnet18, dashboard Space Johnyquest7/Trakio_agentic_thyroid, storage dataset Johnyquest7/Trakio_agentic_thyroid_dataset.
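The determinism settings listed above could be applied along the following lines. This is a minimal sketch, not the project's actual setup code: the helper names `seed_everything` and `seed_worker` are hypothetical, disabling `cudnn.benchmark` is an added assumption, and the PyTorch import is guarded only so the sketch also runs where PyTorch is not installed.

```python
import os
import random

try:  # PyTorch is optional here so the sketch runs without it
    import torch
except ImportError:
    torch = None


def seed_everything(seed: int = 42) -> None:
    """Seed Python (and PyTorch, if present) and request deterministic kernels."""
    # cuBLAS reads this variable; it has to be set before the first CUDA matmul.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
    random.seed(seed)
    if torch is not None:
        torch.manual_seed(seed)
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False  # assumption: autotuner disabled
        torch.use_deterministic_algorithms(True)


def seed_worker(worker_id: int) -> None:
    """Candidate DataLoader worker_init_fn: reseed Python's RNG per worker."""
    if torch is not None:
        random.seed(torch.initial_seed() % 2**32)


seed_everything(42)
```

A function like `seed_worker` would typically be passed as `worker_init_fn` to the DataLoader, together with a seeded `torch.Generator`.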
Exact split usage
| Split | n | Use |
|---|---|---|
| Train | 3,500 | Training only. |
| Valid | 500 | Model selection (val AUROC), calibration, threshold selection. |
| Test | 1,000 | Final locked evaluation, exactly once, after model+calibration+threshold were frozen. |
Class distribution
| Split | Benign | Malignant | Malignant % | Ratio (M:B) |
|---|---|---|---|---|
| Train | 1,032 | 2,468 | 70.5% | 2.39 : 1 |
| Valid | 125 | 375 | 75.0% | 3.00 : 1 |
| Test | 269 | 731 | 73.1% | 2.72 : 1 |
Data audit (data_exploration_report.md): 0 corrupt images, all 224×224 RGB
PNG; 0 cross-split exact-pixel duplicates, 0 filename-ID overlaps, 0
label conflicts; per-split mean intensity ≈ 81.4 (std ≈ 19.4), indicating no
distribution shift. Conclusion: no detectable leakage; splits are clean and separate.
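The two leakage checks named in the audit (exact-pixel duplicates and filename-ID overlaps across splits) can be sketched as below. The function names, the choice of SHA-256, and the toy byte strings standing in for decoded pixel buffers are all illustrative assumptions; the audit's own implementation is not shown in this log.

```python
import hashlib
from collections import defaultdict


def cross_split_duplicates(splits):
    """Return {content_hash: split_names} for pixel content seen in 2+ splits.

    `splits` maps a split name to {image_id: raw_pixel_bytes}.
    """
    seen = defaultdict(set)
    for split_name, images in splits.items():
        for pixel_bytes in images.values():
            seen[hashlib.sha256(pixel_bytes).hexdigest()].add(split_name)
    return {h: names for h, names in seen.items() if len(names) > 1}


def id_overlaps(splits):
    """Return {image_id: split_names} for IDs that appear in 2+ splits."""
    owners = defaultdict(set)
    for split_name, images in splits.items():
        for image_id in images:
            owners[image_id].add(split_name)
    return {i: names for i, names in owners.items() if len(names) > 1}


# Toy data (hypothetical IDs and bytes) with one deliberate duplicate.
toy = {
    "Train": {"a001": b"\x00\x01\x02", "a002": b"\x03\x04\x05"},
    "Valid": {"b001": b"\x06\x07\x08"},
    "Test": {"c001": b"\x03\x04\x05"},  # same pixels as Train/a002
}
dupes = cross_split_duplicates(toy)
overlaps = id_overlaps(toy)
print(len(dupes), overlaps)
```

Exact hashing only catches byte-identical pixels; resized or re-encoded copies of the same image would need a perceptual hash or patient-level metadata.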
Literature-informed augmentation rationale
Augmentations were restricted to medically plausible B-mode ultrasound transforms (MediAug arXiv:2504.18983 + thyroid-US practice):
- Kept: horizontal flip (thyroid is bilaterally symmetric), small rotation (≤10°), mild affine translate (5%) / scale (0.9–1.1), mild brightness/contrast (±15%, simulates gain/TGC), light Gaussian blur, and (in the medical_strong ablation) narrow random-resized-crop (0.8–1.0) + mild speckle noise.
- Explicitly avoided: vertical flip (US depth axis is physically meaningful), large rotation/shear (distorts taller-than-wide shape and margin morphology, which are TI-RADS malignancy cues), aggressive crop (<0.8, can remove the nodule), and any color/HSV jitter (images are grayscale).
Ablation result (val AUROC): medical_default 0.9712–0.9756 > flip_only
0.9637 > medical_strong 0.9609. The literature-default policy won.
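To make the medical_default bounds concrete, the sketch below samples one set of augmentation parameters inside the ranges listed under "Kept". It is a library-free illustration rather than the training pipeline: the dictionary keys, the function name, the 0.5 flip probability, and the reading of ±15% as a multiplicative factor in [0.85, 1.15] are assumptions. Gaussian blur is omitted because its parameters are not given in this log.

```python
import random

# Bounds from the "Kept" list; values marked "assumed" are not in the log.
MEDICAL_DEFAULT = {
    "hflip_prob": 0.5,  # assumed
    "max_rotation_deg": 10.0,
    "max_translate_frac": 0.05,
    "scale_range": (0.9, 1.1),
    "brightness_contrast_delta": 0.15,  # assumed to be a multiplicative factor
}


def sample_augmentation(policy, rng):
    """Draw one set of augmentation parameters within the policy's bounds."""
    rot = policy["max_rotation_deg"]
    shift = policy["max_translate_frac"]
    delta = policy["brightness_contrast_delta"]
    return {
        "hflip": rng.random() < policy["hflip_prob"],
        "vflip": False,  # never applied: the depth axis is physically meaningful
        "rotation_deg": rng.uniform(-rot, rot),
        "translate_frac": (rng.uniform(-shift, shift), rng.uniform(-shift, shift)),
        "scale": rng.uniform(*policy["scale_range"]),
        "brightness": rng.uniform(1.0 - delta, 1.0 + delta),
        "contrast": rng.uniform(1.0 - delta, 1.0 + delta),
    }


params = sample_augmentation(MEDICAL_DEFAULT, random.Random(42))
print(params)
```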
Model variants tried
- torchvision ResNet-18 (ImageNet1K_V1, bilinear/256→224 preprocessing).
- timm:resnet18.a1_in1k (A1 recipe, bicubic, crop_pct 0.95), selected.
- timm:resnet18.a2_in1k (A2 recipe).
- Fine-tune depth: full fine-tune vs freeze stem+layer1 (freeze_stage=1).
Hyperparameter sweep (14 trials, one-factor-at-a-time around a literature-informed center)
Center: timm a1, lr 2e-4, wd 1e-4, bs 32, medical_default, pos_weight,
full fine-tune, BCE, AdamW, cosine, ≤40 epochs, early-stop(8). All runs logged to
Trackio. Selection metric: validation AUROC. All 14 trials completed (rc=0).
| Rank | Run | Change vs center | Val AUROC | Best epoch |
|---|---|---|---|---|
| 1 | c12_loss_focal | focal γ=1.0, imbalance=none | 0.9756 | 6 |
| 2 | c09_imb_none | imbalance=none (BCE) | 0.9739 | 6 |
| 3 | c01_backbone_torchvision | torchvision backbone | 0.9731 | 7 |
| 4 | c03_lr_1e-4 | lr 1e-4 | 0.9721 | 11 |
| 5 | c06_bs_64 | batch size 64 | 0.9717 | 8 |
| 6 | c05_wd_1e-3 | weight decay 1e-3 | 0.9712 | 9 |
| 6 | c00_center_a1 | center config | 0.9712 | 6 |
| 8 | c10_imb_sampler | weighted sampler | 0.9697 | 13 |
| 9 | c02_backbone_a2 | a2 backbone | 0.9693 | 6 |
| 10 | c04_lr_5e-4 | lr 5e-4 | 0.9675 | 9 |
| 11 | c11_freeze1 | freeze stem+layer1 | 0.9672 | 6 |
| 12 | c13_lr1e-4_wd1e-3_drop | lr1e-4+wd1e-3+dropout0.2 | 0.9657 | 11 |
| 13 | c07_aug_flip_only | flip-only aug | 0.9637 | 6 |
| 14 | c08_aug_strong | strong aug | 0.9609 | 8 |
Findings: (1) For this mild (~70/30) imbalance, focal loss (γ=1.0) and no extra
reweighting beat class-weighted BCE and weighted sampling; heavy reweighting
slightly hurt, consistent with the literature. (2) medical_default augmentation
is the sweet spot. (3) Full fine-tune > freezing. (4) Backbones were close
(a1 ≈ torchvision ≈ a2). Full per-run details: results/tables/sweep_leaderboard.json.
The trial count was deliberately kept small (14 one-factor trials) to avoid overfitting the 500-image validation set.
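A one-factor-at-a-time sweep of this kind can be generated by overlaying a small override on the center configuration for each trial. The sketch below uses a subset of the run names from the leaderboard; the configuration keys, the `build_trials` helper, and `freeze_stage=0` as the encoding of a full fine-tune are assumptions rather than the contents of the project's config files.

```python
# Center configuration, with hypothetical key names.
CENTER = {
    "backbone": "timm:resnet18.a1_in1k",
    "lr": 2e-4,
    "weight_decay": 1e-4,
    "batch_size": 32,
    "augmentation": "medical_default",
    "imbalance": "pos_weight",
    "loss": "bce",
    "freeze_stage": 0,  # assumed encoding of "full fine-tune"
}

# Each trial changes exactly one factor relative to the center.
OVERRIDES = {
    "c00_center_a1": {},
    "c03_lr_1e-4": {"lr": 1e-4},
    "c04_lr_5e-4": {"lr": 5e-4},
    "c05_wd_1e-3": {"weight_decay": 1e-3},
    "c06_bs_64": {"batch_size": 64},
    "c07_aug_flip_only": {"augmentation": "flip_only"},
    "c11_freeze1": {"freeze_stage": 1},
}


def build_trials(center, overrides):
    """Return {run_name: full_config}; overrides win over center values."""
    return {name: {**center, **delta} for name, delta in overrides.items()}


trials = build_trials(CENTER, OVERRIDES)
for name, cfg in trials.items():
    changed = {k: v for k, v in cfg.items() if CENTER[k] != v}
    print(name, changed)
```

Keeping each trial to a single change makes every leaderboard row directly attributable to one factor, at the cost of not exploring interactions.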
Selected run
c12_loss_focal: timm:resnet18.a1_in1k, focal loss (γ=1.0, α=0.5),
imbalance=none, AdamW lr 2e-4 / wd 1e-4, batch 32, medical_default aug, full
fine-tune, cosine schedule, best epoch 6, validation AUROC 0.9756. Selected
purely on validation AUROC, before any test access. Config:
configs/final_config.yaml; weights: final_model.pt.
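For reference, the per-sample binary focal loss in its commonly used form is sketched below with the selected run's γ=1.0 and α=0.5. Whether the training code applies α as a class-dependent weight in exactly this way is an assumption; the function name is illustrative.

```python
import math


def binary_focal_loss(logit, label, gamma=1.0, alpha=0.5):
    """Focal loss for one sample: -alpha_t * (1 - p_t)**gamma * log(p_t)."""
    z = logit if label == 1 else -logit  # logit of the true class
    # Stable log(sigmoid(z)) = -softplus(-z)
    log_p_t = -(max(-z, 0.0) + math.log1p(math.exp(-abs(z))))
    p_t = math.exp(log_p_t)
    alpha_t = alpha if label == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * log_p_t


# A confidently correct sample contributes far less than a confidently wrong one.
easy = binary_focal_loss(3.0, 1)
hard = binary_focal_loss(-3.0, 1)
print(f"easy={easy:.4f} hard={hard:.4f}")
```

With α=0.5 both classes receive the same weight, so the only rebalancing comes from the (1 - p_t)^γ factor, which down-weights well-classified samples. Setting γ=0 reduces the expression to 0.5 times ordinary binary cross-entropy.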
Calibration decision
Assessed on validation. Temperature scaling (single parameter, LBFGS on NLL)
gave T = 0.5646. Validation ECE 0.0833 → 0.0308, Brier 0.0592 → 0.0525,
AUROC unchanged (0.9756; temperature scaling is monotonic, so discrimination
is preserved). Decision: use calibrated probabilities for thresholding and test
reporting. Parameters + before/after metrics: configs/calibration.json.
Reliability diagrams: results/figures/{valid,test}_calibration.png.
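The calibration step can be illustrated with a dependency-free sketch: fit a single temperature T by minimizing validation NLL of sigmoid(logit / T), then compare ECE before and after. Two things differ from the run described above and are assumptions of this sketch: the optimizer is a golden-section search over log T rather than LBFGS, and ECE uses 15 equal-width bins on the malignancy probability (the bin count is not stated in this log). The data are synthetic and under-confident by construction, so a fitted T below 1 is expected.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def nll(logits, labels, temperature):
    """Mean binary negative log-likelihood of sigmoid(logit / T)."""
    eps = 1e-12
    total = 0.0
    for z, y in zip(logits, labels):
        p = min(max(sigmoid(z / temperature), eps), 1.0 - eps)
        total -= math.log(p) if y == 1 else math.log(1.0 - p)
    return total / len(labels)


def fit_temperature(logits, labels, lo=0.05, hi=20.0, iters=60):
    """Golden-section search over log T (a stand-in for LBFGS)."""
    a, b = math.log(lo), math.log(hi)
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    for _ in range(iters):
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if nll(logits, labels, math.exp(c)) < nll(logits, labels, math.exp(d)):
            b = d  # minimum lies in [a, d]
        else:
            a = c  # minimum lies in [c, b]
    return math.exp((a + b) / 2.0)


def expected_calibration_error(probs, labels, n_bins=15):
    """ECE over equal-width probability bins (bin count is an assumption)."""
    sums = [[0.0, 0.0] for _ in range(n_bins)]  # per bin: sum of probs, sum of labels
    for p, y in zip(probs, labels):
        i = min(int(p * n_bins), n_bins - 1)
        sums[i][0] += p
        sums[i][1] += y
    return sum(abs(sp - sy) for sp, sy in sums) / len(labels)


# Synthetic validation set: labels follow sigmoid(true_logit), while the
# "model" emits shrunken logits, i.e. it is under-confident.
rng = random.Random(0)
true_logits = [rng.gauss(0.0, 2.0) for _ in range(4000)]
labels = [1 if rng.random() < sigmoid(z) else 0 for z in true_logits]
model_logits = [0.5646 * z for z in true_logits]

T = fit_temperature(model_logits, labels)
ece_before = expected_calibration_error([sigmoid(z) for z in model_logits], labels)
ece_after = expected_calibration_error([sigmoid(z / T) for z in model_logits], labels)
print(f"T={T:.3f} ECE before={ece_before:.4f} after={ece_after:.4f}")
```

Because dividing logits by a positive constant preserves their ordering, ranking metrics such as AUROC are unaffected by this step.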
Threshold selection decision
On the validation set, using calibrated probabilities, the primary threshold was the highest-specificity threshold achieving sensitivity ≥ 0.95 (sensitivity-prioritized, clinically motivated). The target was achievable.
- Locked threshold = 0.7113, giving validation sensitivity 0.952 and specificity 0.896.
- Secondary reference (Youden's J): coincided at 0.7113 here.
Threshold locked before the test set was evaluated. Config: configs/threshold.json.
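The selection rule described above (highest specificity subject to a sensitivity floor) can be sketched as a scan over candidate thresholds from high to low. The function name and the toy inputs are illustrative, and the sketch assumes that a case is called malignant when its probability is at or above the threshold and that both classes are present.

```python
def pick_threshold(probs, labels, min_sensitivity=0.95):
    """Return (threshold, sensitivity, specificity) for the highest threshold
    whose sensitivity meets the floor.

    Lowering the threshold can only raise sensitivity and lower specificity,
    so the first qualifying threshold in descending order is also the one
    with the highest specificity.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    for t in sorted(set(probs), reverse=True):
        tp = sum(1 for p, y in zip(probs, labels) if y == 1 and p >= t)
        tn = sum(1 for p, y in zip(probs, labels) if y == 0 and p < t)
        if tp / n_pos >= min_sensitivity:
            return t, tp / n_pos, tn / n_neg
    raise ValueError("no positive samples, so sensitivity is undefined")


# Toy example with a 0.80 sensitivity floor (hypothetical values).
probs = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20]
labels = [1, 1, 1, 0, 1, 0, 1, 0]
threshold, sens, spec = pick_threshold(probs, labels, min_sensitivity=0.80)
print(threshold, sens, round(spec, 4))
```

The quadratic scan is acceptable for a 500-image validation set; a sorted cumulative count would be the natural replacement for larger sets.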
Final locked threshold
0.7113139 (on calibrated malignancy probability).
Final test results with 95% CIs
Test split (n=1000), calibrated probabilities + locked threshold. CIs: stratified bootstrap, 2000 resamples, seed=42.
| Metric | Point | 95% CI |
|---|---|---|
| AUROC | 0.9371 | [0.9202, 0.9528] |
| Sensitivity | 0.9042 | [0.8824, 0.9248] |
| Specificity | 0.7955 | [0.7435, 0.8439] |
| PPV | 0.9232 | [0.9054, 0.9401] |
| NPV | 0.7535 | [0.7123, 0.7979] |
| Accuracy | 0.8750 | [0.8540, 0.8950] |
| F1 | 0.9136 | [0.8991, 0.9278] |
| Brier | 0.0823 | n/a |
| ECE | 0.0314 | n/a |
Confusion matrix (Test): TN=214, FP=55, FN=70, TP=661.
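The threshold-dependent point estimates in the table follow directly from the confusion matrix, which gives a quick consistency check:

```python
tn, fp, fn, tp = 214, 55, 70, 661

metrics = {
    "sensitivity": tp / (tp + fn),
    "specificity": tn / (tn + fp),
    "ppv": tp / (tp + fp),
    "npv": tn / (tn + fn),
    "accuracy": (tp + tn) / (tp + tn + fp + fn),
    "f1": 2 * tp / (2 * tp + fp + fn),
}
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
# sensitivity 0.9042, specificity 0.7955, ppv 0.9232,
# npv 0.7535, accuracy 0.8750, f1 0.9136
```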
Tables: results/tables/test_metrics_with_ci.{md,csv}; per-image predictions:
results/{valid,test}_predictions.csv.
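A stratified percentile bootstrap of the kind described above resamples malignant and benign cases separately, so every resample keeps the test set's class counts. The sketch below uses Python's standard `random` module and rebuilds hard predictions from the reported confusion matrix, so its interval will not reproduce the table's values digit for digit; the function names are illustrative.

```python
import random


def accuracy(labels, preds):
    """Fraction of cases where the hard prediction matches the label."""
    return sum(1 for y, p in zip(labels, preds) if y == p) / len(labels)


def stratified_bootstrap_ci(labels, preds, metric, n_resamples=2000, seed=42, level=0.95):
    """Percentile CI; positives and negatives are resampled separately."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    stats = []
    for _ in range(n_resamples):
        idx = rng.choices(pos, k=len(pos)) + rng.choices(neg, k=len(neg))
        stats.append(metric([labels[i] for i in idx], [preds[i] for i in idx]))
    stats.sort()
    tail = (1.0 - level) / 2.0
    lo = stats[int(tail * n_resamples)]
    hi = stats[min(int((1.0 - tail) * n_resamples), n_resamples - 1)]
    return lo, hi


# Per-image labels and hard predictions rebuilt from TN=214, FP=55, FN=70, TP=661.
labels = [1] * 731 + [0] * 269
preds = [1] * 661 + [0] * 70 + [1] * 55 + [0] * 214
lo, hi = stratified_bootstrap_ci(labels, preds, accuracy)
print(f"accuracy 95% CI: [{lo:.4f}, {hi:.4f}]")
```

Any metric with the signature `metric(labels, preds)` can be passed in; for AUROC the predictions would be the calibrated probabilities rather than hard labels.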
Failed / weaker runs
No run errored (all rc=0). Weakest configurations: medical_strong augmentation
(0.9609) and flip_only (0.9637), both under the medical_default baseline,
confirming the augmentation policy choice. Earlier Trackio smoke runs
(smoke_trackio_test*, connectivity_check, dataset_pin_check) are
infrastructure-validation runs, not experiments. Early Trackio Space creation
produced transient 401 /volumes warnings until a persistent dataset_id
(Johnyquest7/Trakio_agentic_thyroid_dataset) was pinned; the warnings stopped thereafter.
Limitations
- Single-source dataset; cropped-ROI inputs; mild class imbalance.
- The ≥0.95-sensitivity operating point set on validation yielded 0.904 sensitivity on test: the operating point does not transfer perfectly, and ~10% of malignant nodules are missed at the locked threshold. Local threshold re-calibration is advisable before any use.
- Leakage checks (exact-pixel hash + filename-ID overlap) are exhaustive for the available signal but cannot exclude same-patient/near-duplicate leakage if such structure exists upstream in TN5000.
External validation: NOT yet performed
No external/independent dataset has been evaluated. evaluate_external.py is
provided to run the locked model (same preprocessing, calibration T, and locked
threshold) on a future external set (folder or CSV format). External, ideally
prospective and multi-site, validation is required before any clinical use.
Test-set integrity statement
The Test split was evaluated exactly once, and only after the model was selected (by validation AUROC), calibrated (temperature scaling on validation), and the decision threshold was locked (on validation). No hyperparameter, calibration, or threshold decision used the test set.