Add full reproducible thyroid ResNet-18 experiment: weights, scripts, configs, calibration, locked threshold, test eval w/ CIs, figures, data exploration, README, LOG

45af8e1 verified 22 days ago

Experiment Log — Agentic Thyroid ResNet-18

Chronological, decision-by-decision record for reproducibility and journal review.

Provenance

Experiment date (UTC): 2026-06-05
Dataset: Johnyquest7/TN5000-thyroid-nodule-classification
- Commit SHA: 73d6c0713a89c8c07125fe6ffc5956be60e9853d
- Loaded directly from the Train/Valid/Test folder structure (NOT the datasets-viewer flattened train config, which merges all 5,000 rows), so the predefined splits are respected.
Compute: Hugging Face GPU sandbox, NVIDIA A10G (24 GB), CUDA 13.0, cuDNN 9.2.
Key packages: torch 2.12.0+cu130, torchvision 0.27.0+cu130, timm 1.0.27, scikit-learn 1.9.0, numpy 2.4.6, trackio 0.26.0 (see configs/env_info.json).
Global seed: 42. Strict determinism: torch.use_deterministic_algorithms(True), cuDNN deterministic, CUBLAS_WORKSPACE_CONFIG=:4096:8, seeded DataLoader workers.
Positive class: Malignant (label 1).
Experiment tracking: Trackio project agentic_thyroid_resnet18, dashboard Space Johnyquest7/Trakio_agentic_thyroid, storage dataset Johnyquest7/Trakio_agentic_thyroid_dataset.
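The determinism settings listed above can be sketched as follows. This is a minimal illustration of the standard PyTorch reproducibility recipe, not the repository's actual training script; the helper names seed_everything and seed_worker are hypothetical, and the optional import only exists so the sketch can be read without the ML stack installed.

```python
import os
import random

# Must be set before the first cuBLAS call so CUDA matmuls can run deterministically.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

try:
    import numpy as np
    import torch
except ImportError:  # sketch stays importable without numpy/torch
    np = torch = None

SEED = 42  # global seed stated in the log


def seed_everything(seed: int = SEED) -> None:
    """Seed Python, NumPy and PyTorch and enable strict determinism."""
    random.seed(seed)
    if torch is None:
        return
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    torch.use_deterministic_algorithms(True)


def seed_worker(worker_id: int) -> None:
    """DataLoader worker_init_fn: derive NumPy/Python seeds from the worker's torch seed."""
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)


seed_everything()
```

With PyTorch, seed_worker would typically be passed as worker_init_fn to the DataLoader together with a torch.Generator seeded with the same value.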

Exact split usage

Split	n	Use
Train	3,500	Training only.
Valid	500	Model selection (val AUROC), calibration, threshold selection.
Test	1,000	Final locked evaluation, exactly once, after model+calibration+threshold were frozen.

Class distribution

Split	Benign	Malignant	Malignant %	Ratio (M:B)
Train	1,032	2,468	70.5%	2.39 : 1
Valid	125	375	75.0%	3.00 : 1
Test	269	731	73.1%	2.72 : 1
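The derived columns in the table above follow directly from the per-class counts. A short check, using only the counts from the table (the function name summarize is illustrative):

```python
# Benign / malignant counts per split, copied from the table above.
counts = {
    "Train": (1032, 2468),
    "Valid": (125, 375),
    "Test": (269, 731),
}


def summarize(benign: int, malignant: int):
    """Return (n, malignant percentage, malignant:benign ratio)."""
    n = benign + malignant
    return n, round(100 * malignant / n, 1), round(malignant / benign, 2)


for split, (benign, malignant) in counts.items():
    n, pct, ratio = summarize(benign, malignant)
    print(f"{split}: n={n}, malignant={pct}%, M:B={ratio}:1")
```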

Data audit (data_exploration_report.md): 0 corrupt images; all images are 224×224 RGB PNG; 0 cross-split exact-pixel duplicates, 0 filename-ID overlaps, 0 label conflicts; per-split mean intensity ≈ 81.4 (std ≈ 19.4), indicating no intensity shift between splits. Conclusion: no detectable leakage; splits are clean and separate.
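A cross-split leakage check of the kind described above could look like the sketch below. It is an assumption about the approach, not the audit script itself: images are represented as raw pixel bytes keyed by a filename-derived ID, SHA-256 stands in for whatever hash was actually used, and the toy inputs are made up.

```python
import hashlib
from itertools import combinations


def pixel_hash(pixels: bytes) -> str:
    """Hash decoded pixel bytes so identical images collide regardless of filename."""
    return hashlib.sha256(pixels).hexdigest()


def cross_split_overlaps(splits):
    """splits: {split_name: {image_id: pixel_bytes}} -> overlap counts per split pair."""
    report = {}
    for a, b in combinations(sorted(splits), 2):
        ids_a, ids_b = set(splits[a]), set(splits[b])
        hashes_a = {pixel_hash(p) for p in splits[a].values()}
        hashes_b = {pixel_hash(p) for p in splits[b].values()}
        report[(a, b)] = {
            "id_overlap": len(ids_a & ids_b),
            "pixel_duplicates": len(hashes_a & hashes_b),
        }
    return report


# Toy stand-ins for decoded images (hypothetical IDs and bytes).
toy = {
    "Train": {"img_001": b"\x00\x01", "img_002": b"\x02\x03"},
    "Valid": {"img_003": b"\x04\x05"},
    "Test": {"img_004": b"\x06\x07"},
}
print(cross_split_overlaps(toy))
```

Exact hashing only catches byte-identical pixels; near-duplicates and same-patient images need a different signal.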

Literature-informed augmentation rationale

Augmentations were restricted to medically plausible B-mode ultrasound transforms (MediAug arXiv:2504.18983 + thyroid-US practice):

Kept: horizontal flip (thyroid is bilaterally symmetric), small rotation (≤10°), mild affine translate (5%) / scale (0.9–1.1), mild brightness/contrast (±15%, simulates gain/TGC), light Gaussian blur, and (in the medical_strong ablation) narrow random-resized-crop (0.8–1.0) + mild speckle noise.
Explicitly avoided: vertical flip (US depth axis is physically meaningful), large rotation/shear (distorts taller-than-wide / margin morphology — TI-RADS malignancy cues), aggressive crop (<0.8, can remove the nodule), and any color/HSV jitter (images are grayscale).

Ablation result (val AUROC): medical_default 0.9712–0.9756 > flip_only 0.9637 > medical_strong 0.9609. The literature-default policy won.

Model variants tried

torchvision ResNet-18 (ImageNet1K_V1, bilinear/256→224 preprocessing).
timm:resnet18.a1_in1k (A1 recipe, bicubic, crop_pct 0.95) — selected.
timm:resnet18.a2_in1k (A2 recipe).
Fine-tune depth: full fine-tune vs freeze stem+layer1 (freeze_stage=1).

Hyperparameter sweep (14 trials, one-factor-at-a-time around a literature-informed center)

Center: timm a1, lr 2e-4, wd 1e-4, bs 32, medical_default, pos_weight, full fine-tune, BCE, AdamW, cosine, ≤40 epochs, early-stop(8). All runs logged to Trackio. Selection metric: validation AUROC. All 14 trials completed (rc=0).

Rank	Run	Change vs center	Val AUROC	Best epoch
1	c12_loss_focal	focal γ=1.0, imbalance=none	0.9756	6
2	c09_imb_none	imbalance=none (BCE)	0.9739	6
3	c01_backbone_torchvision	torchvision backbone	0.9731	7
4	c03_lr_1e-4	lr 1e-4	0.9721	11
5	c06_bs_64	batch size 64	0.9717	8
6	c05_wd_1e-3	weight decay 1e-3	0.9712	9
6	c00_center_a1	center config	0.9712	6
8	c10_imb_sampler	weighted sampler	0.9697	13
9	c02_backbone_a2	a2 backbone	0.9693	6
10	c04_lr_5e-4	lr 5e-4	0.9675	9
11	c11_freeze1	freeze stem+layer1	0.9672	6
12	c13_lr1e-4_wd1e-3_drop	lr1e-4+wd1e-3+dropout0.2	0.9657	11
13	c07_aug_flip_only	flip-only aug	0.9637	6
14	c08_aug_strong	strong aug	0.9609	8

Findings: (1) For this mild (~70/30) imbalance, the two runs without extra reweighting, focal loss with γ=1.0 (0.9756) and plain BCE (0.9739), beat class-weighted BCE (0.9712) and weighted sampling (0.9697); heavier reweighting slightly hurt, consistent with the literature. (2) medical_default augmentation is the sweet spot. (3) Full fine-tune > freezing. (4) Backbones were close (a1 ≈ torchvision ≈ a2). Full per-run details: results/tables/sweep_leaderboard.json.

The sweep was deliberately limited to 14 one-factor trials to reduce the risk of overfitting the 500-image validation set.
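The one-factor-at-a-time design can be expressed as a center config plus a small override per trial. The sketch below reconstructs the 14 trials from the leaderboard; the dictionary keys and string values (for example freeze_stage 0 for full fine-tune, or "torchvision:resnet18") are illustrative assumptions, not the schema of the repository's config files.

```python
# Center configuration as described in the log (key names are hypothetical).
CENTER = {
    "backbone": "timm:resnet18.a1_in1k",
    "lr": 2e-4,
    "weight_decay": 1e-4,
    "batch_size": 32,
    "augmentation": "medical_default",
    "imbalance": "pos_weight",
    "freeze_stage": 0,  # assumed encoding of "full fine-tune"
    "loss": "bce",
}

# One override per trial; c12 and c13 change more than one factor at once.
OVERRIDES = {
    "c00_center_a1": {},
    "c01_backbone_torchvision": {"backbone": "torchvision:resnet18"},
    "c02_backbone_a2": {"backbone": "timm:resnet18.a2_in1k"},
    "c03_lr_1e-4": {"lr": 1e-4},
    "c04_lr_5e-4": {"lr": 5e-4},
    "c05_wd_1e-3": {"weight_decay": 1e-3},
    "c06_bs_64": {"batch_size": 64},
    "c07_aug_flip_only": {"augmentation": "flip_only"},
    "c08_aug_strong": {"augmentation": "medical_strong"},
    "c09_imb_none": {"imbalance": "none"},
    "c10_imb_sampler": {"imbalance": "sampler"},
    "c11_freeze1": {"freeze_stage": 1},
    "c12_loss_focal": {"loss": "focal", "focal_gamma": 1.0, "imbalance": "none"},
    "c13_lr1e-4_wd1e-3_drop": {"lr": 1e-4, "weight_decay": 1e-3, "dropout": 0.2},
}


def build_trials(center, overrides):
    """Merge each override into a copy of the center config."""
    return {name: {**center, **delta} for name, delta in overrides.items()}


trials = build_trials(CENTER, OVERRIDES)
print(len(trials), "trials;", trials["c12_loss_focal"])
```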

Selected run

c12_loss_focal — timm:resnet18.a1_in1k, focal loss (γ=1.0, α=0.5), imbalance=none, AdamW lr 2e-4 / wd 1e-4, batch 32, medical_default aug, full fine-tune, cosine schedule, best epoch 6, validation AUROC 0.9756. Selected purely on validation AUROC, before any test access. Config: configs/final_config.yaml; weights: final_model.pt.
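For reference, a per-sample binary focal loss with the selected γ and α can be written as below. This assumes the common formulation FL = -α_t (1 - p_t)^γ log(p_t); the log does not state which implementation the training script used, and the sketch is not numerically hardened for extreme logits.

```python
import math


def binary_focal_loss(logit: float, target: int, gamma: float = 1.0, alpha: float = 0.5) -> float:
    """Focal loss for one sample; target is 1 (malignant) or 0 (benign)."""
    p = 1.0 / (1.0 + math.exp(-logit))            # predicted P(malignant)
    p_t = p if target == 1 else 1.0 - p           # probability assigned to the true class
    alpha_t = alpha if target == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# A confidently correct sample is down-weighted relative to a confidently wrong one.
print(binary_focal_loss(3.0, 1), binary_focal_loss(-3.0, 1))
```

With α=0.5 both classes get the same weight, which matches the "imbalance=none" setting; with γ=0 the expression reduces to half the usual binary cross-entropy.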

Calibration decision

Assessed on validation. Temperature scaling (single parameter, LBFGS on NLL) gave T = 0.5646. Validation ECE 0.0833 → 0.0308, Brier 0.0592 → 0.0525, AUROC unchanged (0.9756; temperature scaling is monotonic ⇒ discrimination preserved). Decision: use calibrated probabilities for thresholding and test reporting. Parameters + before/after metrics: configs/calibration.json. Reliability diagrams: results/figures/{valid,test}_calibration.png.
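Temperature scaling divides each logit by a single scalar T before the sigmoid and chooses T to minimize validation NLL. The sketch below illustrates the idea on made-up logits; it swaps in a coarse grid search for the LBFGS optimizer mentioned above, and all function names are illustrative.

```python
import math


def calibrated_prob(logit: float, temperature: float) -> float:
    """Sigmoid of the temperature-scaled logit."""
    return 1.0 / (1.0 + math.exp(-logit / temperature))


def mean_nll(logits, labels, temperature: float) -> float:
    """Average binary negative log-likelihood at a given temperature."""
    eps = 1e-12
    total = 0.0
    for z, y in zip(logits, labels):
        p = min(max(calibrated_prob(z, temperature), eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(logits)


def fit_temperature(logits, labels, grid=None) -> float:
    """Grid-search stand-in for LBFGS: pick the T with the lowest mean NLL."""
    grid = grid or [0.05 * k for k in range(2, 101)]  # 0.10 .. 5.00
    return min(grid, key=lambda t: mean_nll(logits, labels, t))


# Toy validation logits (underconfident model with one misclassified positive).
toy_logits = [0.5, 0.8, 1.0, -0.6, -0.9, 1.2, -0.4, 0.7, -0.3]
toy_labels = [1, 1, 1, 0, 0, 1, 0, 1, 1]
T = fit_temperature(toy_logits, toy_labels)
print("fitted T:", T)
```

A fitted T below 1, as with the reported T = 0.5646, sharpens probabilities toward 0 and 1. Because z/T is strictly increasing in z for T > 0, the ranking of cases, and therefore AUROC, is unchanged.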

Threshold selection decision

On the validation set, using calibrated probabilities, the primary threshold was the highest-specificity threshold achieving sensitivity ≥ 0.95 (sensitivity-prioritized, clinically motivated). The target was achievable.

Locked threshold = 0.7113 → validation sensitivity 0.952, specificity 0.896.
Secondary reference (Youden's J): coincided at 0.7113 here.

Threshold locked before the test set was evaluated. Config: configs/threshold.json.
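The selection rule can be sketched as a scan over candidate thresholds. Raising the threshold never increases sensitivity and never decreases specificity, so the highest threshold that still meets the sensitivity target is also the highest-specificity one. The sketch assumes a case is called malignant when its probability is at or above the threshold; the toy probabilities are made up.

```python
def sens_spec(probs, labels, threshold):
    """Sensitivity and specificity when predicting malignant for prob >= threshold."""
    tp = sum(1 for p, y in zip(probs, labels) if y == 1 and p >= threshold)
    fn = sum(1 for p, y in zip(probs, labels) if y == 1 and p < threshold)
    tn = sum(1 for p, y in zip(probs, labels) if y == 0 and p < threshold)
    fp = sum(1 for p, y in zip(probs, labels) if y == 0 and p >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def pick_threshold(probs, labels, min_sensitivity=0.95):
    """Return (threshold, sensitivity, specificity), or None if the target is unreachable."""
    for t in sorted(set(probs), reverse=True):  # scan from the strictest threshold down
        sens, spec = sens_spec(probs, labels, t)
        if sens >= min_sensitivity:
            return t, sens, spec
    return None


# Toy validation set: 20 malignant cases (one scored low) and 5 benign cases.
pos = [0.30] + [round(0.60 + 0.02 * k, 2) for k in range(19)]
neg = [0.10, 0.20, 0.40, 0.65, 0.80]
probs = pos + neg
labels = [1] * len(pos) + [0] * len(neg)
print(pick_threshold(probs, labels))
```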

Final locked threshold

0.7113139 (on calibrated malignancy probability).

Final test results with 95% CIs

Test split (n=1000), calibrated probabilities + locked threshold. CIs: stratified bootstrap, 2000 resamples, seed=42.

Metric	Point	95% CI
AUROC	0.9371	[0.9202, 0.9528]
Sensitivity	0.9042	[0.8824, 0.9248]
Specificity	0.7955	[0.7435, 0.8439]
PPV	0.9232	[0.9054, 0.9401]
NPV	0.7535	[0.7123, 0.7979]
Accuracy	0.8750	[0.8540, 0.8950]
F1	0.9136	[0.8991, 0.9278]
Brier	0.0823	—
ECE	0.0314	—

Confusion matrix (Test): TN=214, FP=55, FN=70, TP=661. Tables: results/tables/test_metrics_with_ci.{md,csv}; per-image predictions: results/{valid,test}_predictions.csv.
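The threshold-dependent point estimates in the table can be re-derived from the confusion matrix, and a stratified bootstrap can be sketched on top of it. The percentile interval used here is an assumption (the log specifies only stratified resampling, 2000 resamples, seed 42), and Python's random module will not reproduce the exact reported bounds.

```python
import random


def point_metrics(tn, fp, fn, tp):
    """Threshold-dependent metrics from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "f1": 2 * tp / (2 * tp + fp + fn),
    }


def stratified_bootstrap_ci(preds_pos, preds_neg, metric, n_boot=2000, seed=42, level=0.95):
    """Percentile CI; malignant and benign cases are resampled separately."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        bp = rng.choices(preds_pos, k=len(preds_pos))  # resample truly malignant cases
        bn = rng.choices(preds_neg, k=len(preds_neg))  # resample truly benign cases
        tp, fp = sum(bp), sum(bn)
        fn, tn = len(bp) - tp, len(bn) - fp
        stats.append(point_metrics(tn, fp, fn, tp)[metric])
    stats.sort()
    lo = stats[int((1 - level) / 2 * n_boot)]
    hi = stats[int((1 + level) / 2 * n_boot) - 1]
    return lo, hi


# Counts from the test confusion matrix above.
TN, FP, FN, TP = 214, 55, 70, 661
metrics = point_metrics(TN, FP, FN, TP)
print({k: round(v, 4) for k, v in metrics.items()})

# Rebuild 0/1 predictions per true class from the counts, then bootstrap sensitivity.
preds_pos = [1] * TP + [0] * FN
preds_neg = [1] * FP + [0] * TN
print(stratified_bootstrap_ci(preds_pos, preds_neg, "sensitivity"))
```

Stratifying keeps the 731/269 class split fixed in every resample, so the interval reflects uncertainty in the predictions rather than in the prevalence.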

Failed / weaker runs

No run errored (all rc=0). Weakest configurations: medical_strong augmentation (0.9609) and flip_only (0.9637) — both under the medical_default baseline, confirming the augmentation policy choice. Earlier trackio smoke runs (smoke_trackio_test*, connectivity_check, dataset_pin_check) are infrastructure-validation runs, not experiments. Early Trackio Space creation produced transient 401 /volumes warnings until a persistent dataset_id (Johnyquest7/Trakio_agentic_thyroid_dataset) was pinned; resolved thereafter.

Limitations

Single-source dataset; cropped-ROI inputs; mild class imbalance.
The ≥0.95-sensitivity operating point set on validation yielded 0.904 sensitivity on test, so the operating point does not transfer perfectly: 70 of 731 malignant nodules (9.6%) are missed at the locked threshold. Local threshold re-calibration is advisable before any use.
Leakage checks (exact-pixel hash + filename-ID overlap) are exhaustive for the available signal but cannot exclude same-patient/near-duplicate leakage if such structure exists upstream in TN5000.

External validation — NOT yet performed

No external/independent dataset has been evaluated. evaluate_external.py is provided to run the locked model (same preprocessing, calibration T, and locked threshold) on a future external set (folder or CSV format). External, ideally prospective and multi-site, validation is required before any clinical use.
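Applying the locked pipeline to a new case amounts to scaling the logit by the calibration temperature and comparing the resulting probability with the locked threshold. The sketch below is not evaluate_external.py; it assumes the model emits a single malignancy logit, that a probability at or above the threshold is called malignant, and it uses the rounded T reported in this log rather than the full-precision value in configs/calibration.json.

```python
import math

TEMPERATURE = 0.5646          # calibration temperature reported in this log (rounded)
LOCKED_THRESHOLD = 0.7113139  # locked threshold on calibrated malignancy probability


def predict_locked(logit: float):
    """Return (calibrated probability, 0/1 malignant call) for one raw logit."""
    prob = 1.0 / (1.0 + math.exp(-logit / TEMPERATURE))
    return prob, int(prob >= LOCKED_THRESHOLD)


# Equivalent cutoff on the raw logit: T * logit(threshold).
logit_cutoff = TEMPERATURE * math.log(LOCKED_THRESHOLD / (1.0 - LOCKED_THRESHOLD))

for z in (-1.0, 0.0, 2.0):
    print(z, predict_locked(z))
```

Because calibration is monotonic, thresholding the calibrated probability is equivalent to thresholding the raw logit at logit_cutoff, which can be a convenient sanity check when porting the model.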

Test-set integrity statement

The Test split was evaluated exactly once, and only after the model was selected (by validation AUROC), calibrated (temperature scaling on validation), and the decision threshold was locked (on validation). No hyperparameter, calibration, or threshold decision used the test set.