Johnyquest7

Add full reproducible thyroid ResNet-18 experiment: weights, scripts, configs, calibration, locked threshold, test eval w/ CIs, figures, data exploration, README, LOG

45af8e1 verified 23 days ago


	# Experiment Log — Agentic Thyroid ResNet-18

	Chronological, decision-by-decision record for reproducibility and journal review.

	## Provenance

	- Experiment date (UTC): 2026-06-05
	- Dataset: `Johnyquest7/TN5000-thyroid-nodule-classification`
	- Commit SHA: `73d6c0713a89c8c07125fe6ffc5956be60e9853d`
	- Loaded directly from the Train/Valid/Test folder structure (NOT the
	datasets-viewer flattened `train` config, which merges all 5,000 rows),
	so the predefined splits are respected.
	- Compute: Hugging Face GPU sandbox, NVIDIA A10G (24 GB), CUDA 13.0, cuDNN 9.2.
	- Key packages: torch 2.12.0+cu130, torchvision 0.27.0+cu130, timm 1.0.27,
	scikit-learn 1.9.0, numpy 2.4.6, trackio 0.26.0 (see `configs/env_info.json`).
	- Global seed: 42. Strict determinism: `torch.use_deterministic_algorithms(True)`,
	cuDNN deterministic, `CUBLAS_WORKSPACE_CONFIG=:4096:8`, seeded DataLoader workers.
	- Positive class: Malignant (label 1).
	- Experiment tracking: Trackio project `agentic_thyroid_resnet18`, dashboard
	Space `Johnyquest7/Trakio_agentic_thyroid`, storage dataset
	`Johnyquest7/Trakio_agentic_thyroid_dataset`.

	## Exact split usage

	| Split | n | Use |
	|-------|--:|-----|
	| Train | 3,500 | Training only. |
	| Valid | 500 | Model selection (val AUROC), calibration, threshold selection. |
	| Test | 1,000 | Final locked evaluation, exactly once, after model, calibration, and threshold were frozen. |

	## Class distribution

	| Split | Benign | Malignant | Malignant % | Ratio (M:B) |
	|-------|-------:|----------:|------------:|------------:|
	| Train | 1,032 | 2,468 | 70.5% | 2.39 : 1 |
	| Valid | 125 | 375 | 75.0% | 3.00 : 1 |
	| Test | 269 | 731 | 73.1% | 2.72 : 1 |
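The percentages and ratios in the table follow directly from the per-split counts. The sketch below reproduces that arithmetic. The final `pos_weight` value is an assumption: it uses the common negatives/positives convention (as used with `torch.nn.BCEWithLogitsLoss`), which this log does not confirm for the center config.

```python
# Per-split class counts copied from the table: (benign, malignant).
counts = {
    "Train": (1032, 2468),
    "Valid": (125, 375),
    "Test": (269, 731),
}


def summarize(benign, malignant):
    """Return (n, malignant fraction, malignant:benign ratio)."""
    n = benign + malignant
    return n, malignant / n, malignant / benign


for split, (benign, malignant) in counts.items():
    n, frac, ratio = summarize(benign, malignant)
    print(f"{split}: n={n}, malignant={frac:.1%}, M:B={ratio:.2f}:1")

# Assumption: pos_weight = negatives / positives on the Train split.
# Malignant is the positive (and majority) class, so the weight is below 1.
train_benign, train_malignant = counts["Train"]
assumed_pos_weight = train_benign / train_malignant
print(f"assumed pos_weight = {assumed_pos_weight:.3f}")
```

Because the positive class is the majority here, a negatives/positives weight would down-weight malignant examples rather than up-weight them.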

	Data audit (`data_exploration_report.md`): 0 corrupt images, all 224×224 RGB
	PNG; 0 cross-split exact-pixel duplicates, 0 filename-ID overlaps, **0
	label conflicts**; per-split mean intensity ≈ 81.4 (std ≈ 19.4), so no
	intensity shift between splits was detected. Conclusion: no detectable
	leakage; splits are clean and separate.
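The two leakage checks named in the audit (exact-pixel duplicates and filename-ID overlaps across splits) can be sketched as follows. This is an illustrative version, not the repository's audit script: the in-memory `splits` layout, the function names, and the use of SHA-256 over raw pixel bytes are all assumptions.

```python
import hashlib
from collections import defaultdict


def pixel_hash(pixel_bytes: bytes) -> str:
    """Hash decoded pixel data so identical images collide regardless of filename."""
    return hashlib.sha256(pixel_bytes).hexdigest()


def cross_split_overlaps(splits):
    """Find content hashes and image IDs that occur in more than one split.

    `splits` maps split name -> {image_id: pixel_bytes} (hypothetical layout).
    Returns (duplicate_hashes, shared_ids), each mapping key -> set of splits.
    """
    hash_to_splits = defaultdict(set)
    id_to_splits = defaultdict(set)
    for split_name, images in splits.items():
        for image_id, pixel_bytes in images.items():
            hash_to_splits[pixel_hash(pixel_bytes)].add(split_name)
            id_to_splits[image_id].add(split_name)
    duplicate_hashes = {h: s for h, s in hash_to_splits.items() if len(s) > 1}
    shared_ids = {i: s for i, s in id_to_splits.items() if len(s) > 1}
    return duplicate_hashes, shared_ids


# Toy data: "b.png" in Test has the same pixels as "a.png" in Train.
toy = {
    "Train": {"a.png": b"\x00\x01\x02", "c.png": b"\x09\x09\x09"},
    "Valid": {"d.png": b"\x05\x05\x05"},
    "Test": {"b.png": b"\x00\x01\x02"},
}
dups, shared = cross_split_overlaps(toy)
print(f"cross-split pixel duplicates: {len(dups)}, shared IDs: {len(shared)}")
```

An exact hash only catches byte-identical pixel arrays; re-encoded or slightly cropped copies of the same frame would pass this check.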

	## Literature-informed augmentation rationale

	Augmentations were restricted to medically plausible B-mode ultrasound transforms
	(MediAug arXiv:2504.18983 + thyroid-US practice):
	- Kept: horizontal flip (thyroid is bilaterally symmetric), small rotation
	(≤10°), mild affine translate (5%) / scale (0.9–1.1), mild brightness/contrast
	(±15%, simulates gain/TGC), light Gaussian blur, and (in the `medical_strong`
	ablation) narrow random-resized-crop (0.8–1.0) + mild speckle noise.
	- Explicitly avoided: vertical flip (US depth axis is physically meaningful),
	large rotation/shear (distorts taller-than-wide / margin morphology — TI-RADS
	malignancy cues), aggressive crop (<0.8, can remove the nodule), and any
	color/HSV jitter (images are grayscale).

	Ablation result (val AUROC): `medical_default` 0.9712–0.9756 > `flip_only`
	0.9637 > `medical_strong` 0.9609. The literature-default policy won.

	## Model variants tried

	- `torchvision` ResNet-18 (ImageNet1K_V1, bilinear/256→224 preprocessing).
	- `timm:resnet18.a1_in1k` (A1 recipe, bicubic, crop_pct 0.95) — selected.
	- `timm:resnet18.a2_in1k` (A2 recipe).
	- Fine-tune depth: full fine-tune vs freeze stem+layer1 (`freeze_stage=1`).

	## Hyperparameter sweep (14 trials, one-factor-at-a-time around a literature-informed center)

	Center: timm a1, lr 2e-4, wd 1e-4, bs 32, `medical_default`, `pos_weight`,
	full fine-tune, BCE, AdamW, cosine, ≤40 epochs, early-stop(8). All runs logged to
	Trackio. Selection metric: validation AUROC. All 14 trials completed (rc=0).
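A one-factor-at-a-time sweep can be generated by copying the center config and overriding a single key per trial. The sketch below shows the idea under stated assumptions: the key names, the `freeze_stage=0` encoding of full fine-tuning, and the helper functions are hypothetical, and only a subset of the runs from the sweep table is listed.

```python
import copy

# Center configuration as described in this section; key names are hypothetical.
CENTER = {
    "backbone": "timm:resnet18.a1_in1k",
    "lr": 2e-4,
    "weight_decay": 1e-4,
    "batch_size": 32,
    "augmentation": "medical_default",
    "imbalance": "pos_weight",
    "freeze_stage": 0,  # assumed encoding of "full fine-tune"
    "loss": "bce",
}

# Single-factor overrides taken from the sweep table (subset, for brevity).
ONE_FACTOR_CHANGES = [
    ("c03_lr_1e-4", {"lr": 1e-4}),
    ("c04_lr_5e-4", {"lr": 5e-4}),
    ("c05_wd_1e-3", {"weight_decay": 1e-3}),
    ("c06_bs_64", {"batch_size": 64}),
    ("c07_aug_flip_only", {"augmentation": "flip_only"}),
    ("c11_freeze1", {"freeze_stage": 1}),
]


def build_trials(center, changes):
    """Return {run_name: config}, starting with the unmodified center."""
    trials = {"c00_center_a1": copy.deepcopy(center)}
    for name, override in changes:
        unknown = set(override) - set(center)
        if unknown:
            raise KeyError(f"{name}: unknown config keys {sorted(unknown)}")
        cfg = copy.deepcopy(center)
        cfg.update(override)
        trials[name] = cfg
    return trials


def changed_keys(center, cfg):
    """List the keys on which a trial differs from the center."""
    return sorted(k for k in center if cfg[k] != center[k])


trials = build_trials(CENTER, ONE_FACTOR_CHANGES)
for name, cfg in trials.items():
    print(name, changed_keys(CENTER, cfg))
```

Rejecting unknown keys keeps a typo in an override from silently producing a duplicate of the center run.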

	| Rank | Run | Change vs center | Val AUROC | Best epoch |
	|-----:|-----|------------------|----------:|-----------:|
	| 1 | c12_loss_focal | focal γ=1.0, imbalance=none | 0.9756 | 6 |
	| 2 | c09_imb_none | imbalance=none (BCE) | 0.9739 | 6 |
	| 3 | c01_backbone_torchvision | torchvision backbone | 0.9731 | 7 |
	| 4 | c03_lr_1e-4 | lr 1e-4 | 0.9721 | 11 |
	| 5 | c06_bs_64 | batch size 64 | 0.9717 | 8 |
	| 6 | c05_wd_1e-3 | weight decay 1e-3 | 0.9712 | 9 |
	| 6 | c00_center_a1 | center config | 0.9712 | 6 |
	| 8 | c10_imb_sampler | weighted sampler | 0.9697 | 13 |
	| 9 | c02_backbone_a2 | a2 backbone | 0.9693 | 6 |
	| 10 | c04_lr_5e-4 | lr 5e-4 | 0.9675 | 9 |
	| 11 | c11_freeze1 | freeze stem+layer1 | 0.9672 | 6 |
	| 12 | c13_lr1e-4_wd1e-3_drop | lr 1e-4 + wd 1e-3 + dropout 0.2 | 0.9657 | 11 |
	| 13 | c07_aug_flip_only | flip-only aug | 0.9637 | 6 |
	| 14 | c08_aug_strong | strong aug | 0.9609 | 8 |

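	Findings: (1) For this mild (~70/30) imbalance, **focal loss (γ=1.0) and no extra
	reweighting beat class-weighted BCE and weighted sampling**: focal 0.9756 and
	unweighted BCE 0.9739, versus `pos_weight` BCE 0.9712 (center) and weighted
	sampler 0.9697. Reweighting slightly hurt, consistent with the literature.
	(2) `medical_default` augmentation is the sweet spot (0.9712, versus 0.9637
	for flip-only and 0.9609 for strong). (3) Full fine-tune beat freezing
	stem+layer1 (0.9712 versus 0.9672). (4) Backbones were close (torchvision
	0.9731, a1 0.9712, a2 0.9693). Full per-run details:
	`results/tables/sweep_leaderboard.json`.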

	The sweep was deliberately kept small (14 trials) to limit overfitting to the
	500-image validation set.

	## Selected run

	c12_loss_focal — `timm:resnet18.a1_in1k`, focal loss (γ=1.0, α=0.5),
	imbalance=none, AdamW lr 2e-4 / wd 1e-4, batch 32, `medical_default` aug, full
	fine-tune, cosine schedule, best epoch 6, validation AUROC 0.9756. Selected
	purely on validation AUROC, before any test access. Config:
	`configs/final_config.yaml`; weights: `final_model.pt`.
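For reference, the standard binary focal loss with the selected settings (γ=1.0, α=0.5) is sketched below for a single example. This follows the usual formulation, -α_t (1 - p_t)^γ log(p_t); the repository's own loss implementation is not shown in this log, so treat the function name and details as illustrative.

```python
import math


def binary_focal_loss(logit, label, gamma=1.0, alpha=0.5):
    """Focal loss for one example: -alpha_t * (1 - p_t)**gamma * log(p_t).

    `logit` is the raw malignancy score; `label` is 1 (malignant) or 0 (benign).
    """
    # Numerically stable log(sigmoid(logit)).
    if logit >= 0:
        log_p = -math.log1p(math.exp(-logit))
    else:
        log_p = logit - math.log1p(math.exp(logit))
    # log(1 - sigmoid(z)) = log(sigmoid(z)) - z
    log_1mp = log_p - logit

    log_pt = log_p if label == 1 else log_1mp
    alpha_t = alpha if label == 1 else 1.0 - alpha
    p_t = math.exp(log_pt)
    return -alpha_t * (1.0 - p_t) ** gamma * log_pt


# A confidently correct example contributes far less than a confidently wrong one.
print(f"easy positive: {binary_focal_loss(4.0, 1):.5f}")
print(f"hard positive: {binary_focal_loss(-4.0, 1):.5f}")
```

With α=0.5 both classes receive the same weight, so the only departure from (half-scaled) BCE is the (1 - p_t)^γ factor that down-weights easy examples.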

	## Calibration decision

	Assessed on validation. Temperature scaling (single parameter, LBFGS on NLL)
	gave T = 0.5646. Validation ECE 0.0833 → 0.0308, Brier 0.0592 → 0.0525,
	AUROC unchanged (0.9756; temperature scaling is monotonic ⇒ discrimination
	preserved). Decision: use calibrated probabilities for thresholding and test
	reporting. Parameters + before/after metrics: `configs/calibration.json`.
	Reliability diagrams: `results/figures/{valid,test}_calibration.png`.
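Temperature scaling fits a single scalar T so that `sigmoid(logit / T)` minimizes validation NLL. The sketch below illustrates the fit in pure Python on synthetic data. Note the substitution: it uses a golden-section search over log T rather than the LBFGS optimizer named above (the NLL is convex in 1/T, so a one-dimensional search suffices). The function names and the synthetic data are assumptions, not the repository's code.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def mean_nll(logits, labels, temperature):
    """Mean binary negative log-likelihood of sigmoid(logit / T)."""
    eps = 1e-12
    total = 0.0
    for z, y in zip(logits, labels):
        p = min(max(sigmoid(z / temperature), eps), 1.0 - eps)
        total -= math.log(p) if y == 1 else math.log(1.0 - p)
    return total / len(logits)


def fit_temperature(logits, labels, lo=0.05, hi=20.0, iters=60):
    """Golden-section search for T in [lo, hi], searching over log(T)."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = math.log(lo), math.log(hi)
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    fc = mean_nll(logits, labels, math.exp(c))
    fd = mean_nll(logits, labels, math.exp(d))
    for _ in range(iters):
        if fc < fd:  # minimum lies in [a, d]
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = mean_nll(logits, labels, math.exp(c))
        else:  # minimum lies in [c, b]
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = mean_nll(logits, labels, math.exp(d))
    return math.exp((a + b) / 2.0)


# Synthetic under-confident model: true probabilities are sigmoid(2 * logit),
# so the fitted temperature should land near 0.5.
rng = random.Random(0)
logits = [rng.gauss(0.0, 1.5) for _ in range(4000)]
labels = [1 if rng.random() < sigmoid(2.0 * z) else 0 for z in logits]
T_fit = fit_temperature(logits, labels)
print(f"fitted T = {T_fit:.3f}")
```

A fitted T below 1, as with the reported T = 0.5646, sharpens the probabilities; because dividing by a positive constant preserves the ranking of logits, AUROC is unaffected.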

	## Threshold selection decision

	On the validation set, using calibrated probabilities, the primary threshold was
	the highest-specificity threshold achieving sensitivity ≥ 0.95
	(sensitivity-prioritized, clinically motivated). The target was achievable on
	validation.

	- Locked threshold = 0.7113 → validation sensitivity 0.952, specificity 0.896.
	- Secondary reference (Youden's J): coincided at 0.7113 here.

	Threshold locked before the test set was evaluated. Config: `configs/threshold.json`.
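The selection rule (highest specificity subject to sensitivity ≥ 0.95, with Youden's J as a secondary reference) can be sketched as a scan over candidate thresholds. This is an illustrative version on toy data; the function names and the "predict malignant when probability ≥ threshold" convention are assumptions.

```python
def sens_spec(probs, labels, threshold):
    """Sensitivity and specificity when predicting positive for prob >= threshold."""
    tp = sum(1 for p, y in zip(probs, labels) if y == 1 and p >= threshold)
    fn = sum(1 for p, y in zip(probs, labels) if y == 1 and p < threshold)
    tn = sum(1 for p, y in zip(probs, labels) if y == 0 and p < threshold)
    fp = sum(1 for p, y in zip(probs, labels) if y == 0 and p >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def select_threshold(probs, labels, min_sensitivity=0.95):
    """Return (threshold, sensitivity, specificity) with the highest specificity
    among thresholds meeting the sensitivity target, or None if none qualifies."""
    best = None
    for t in sorted(set(probs)):
        sens, spec = sens_spec(probs, labels, t)
        if sens >= min_sensitivity:
            if best is None or (spec, t) > (best[2], best[0]):
                best = (t, sens, spec)
    return best


def youden_threshold(probs, labels):
    """Threshold maximizing Youden's J = sensitivity + specificity - 1."""
    return max(sorted(set(probs)), key=lambda t: sum(sens_spec(probs, labels, t)) - 1.0)


# Toy validation set (1 = malignant).
probs = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 1, 0, 1, 1, 1, 1]
print(select_threshold(probs, labels, min_sensitivity=1.0))
print(youden_threshold(probs, labels))
```

Only observed probability values are scanned, since sensitivity and specificity change only at those points.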

	## Final locked threshold

	0.7113139, applied to the calibrated malignancy probability (rounded to 0.7113
	elsewhere in this log).

	## Final test results with 95% CIs

	Test split (n=1000), calibrated probabilities + locked threshold. CIs: stratified
	bootstrap, 2000 resamples, seed=42.

	| Metric | Point | 95% CI |
	|--------|------:|:------:|
	| AUROC | 0.9371 | [0.9202, 0.9528] |
	| Sensitivity | 0.9042 | [0.8824, 0.9248] |
	| Specificity | 0.7955 | [0.7435, 0.8439] |
	| PPV | 0.9232 | [0.9054, 0.9401] |
	| NPV | 0.7535 | [0.7123, 0.7979] |
	| Accuracy | 0.8750 | [0.8540, 0.8950] |
	| F1 | 0.9136 | [0.8991, 0.9278] |
	| Brier | 0.0823 | — |
	| ECE | 0.0314 | — |
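A stratified bootstrap resamples malignant and benign cases separately, so every resample keeps the original class counts. The sketch below illustrates this for sensitivity, using binary predictions rebuilt from the confusion matrix reported in this section. It assumes a percentile interval (the log does not state which interval type was used) and uses Python's `random` module, so it will not reproduce the table's bounds digit for digit.

```python
import random


def stratified_bootstrap_ci(labels, preds, metric, n_resamples=2000, seed=42, level=0.95):
    """Percentile CI for `metric`, resampling each class with replacement."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    stats = []
    for _ in range(n_resamples):
        idx = rng.choices(pos, k=len(pos)) + rng.choices(neg, k=len(neg))
        stats.append(metric([labels[i] for i in idx], [preds[i] for i in idx]))
    stats.sort()
    tail = (1.0 - level) / 2.0
    lower = stats[int(tail * (n_resamples - 1))]
    upper = stats[int(round((1.0 - tail) * (n_resamples - 1)))]
    return lower, upper


def sensitivity(labels, preds):
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    return tp / (tp + fn)


# Rebuild per-image binary outcomes from TP=661, FN=70, FP=55, TN=214.
labels = [1] * 731 + [0] * 269
preds = [1] * 661 + [0] * 70 + [1] * 55 + [0] * 214

ci_low, ci_high = stratified_bootstrap_ci(labels, preds, sensitivity)
print(f"sensitivity = {sensitivity(labels, preds):.4f}, 95% CI = [{ci_low:.4f}, {ci_high:.4f}]")
```

The same function works for any metric with the signature `metric(labels, preds)`; score-based metrics such as AUROC would need the per-image probabilities instead of binary predictions.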

	Confusion matrix (Test): TN=214, FP=55, FN=70, TP=661.
	Tables: `results/tables/test_metrics_with_ci.{md,csv}`; per-image predictions:
	`results/{valid,test}_predictions.csv`.
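The threshold-dependent point estimates in the table can be recomputed from the confusion matrix as a consistency check. A minimal sketch (the function name is illustrative):

```python
def metrics_from_confusion(tn, fp, fn, tp):
    """Threshold-dependent metrics, with malignant as the positive class."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    ppv = tp / (tp + fp)
    npv = tn / (tn + fn)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return {
        "Sensitivity": sensitivity,
        "Specificity": specificity,
        "PPV": ppv,
        "NPV": npv,
        "Accuracy": accuracy,
        "F1": f1,
    }


test_metrics = metrics_from_confusion(tn=214, fp=55, fn=70, tp=661)
for name, value in test_metrics.items():
    print(f"{name}: {value:.4f}")
```

AUROC, Brier, and ECE depend on the per-image probabilities and cannot be recovered from these four counts.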

	## Failed / weaker runs

	No run errored (all rc=0). Weakest configurations: `medical_strong` augmentation
	(0.9609) and `flip_only` (0.9637) — both under the `medical_default` baseline,
	confirming the augmentation policy choice. Earlier trackio smoke runs
	(`smoke_trackio_test*`, `connectivity_check`, `dataset_pin_check`) are
	infrastructure-validation runs, not experiments. Early Trackio Space creation
	produced transient 401 `/volumes` warnings until a persistent `dataset_id`
	(`Johnyquest7/Trakio_agentic_thyroid_dataset`) was pinned; resolved thereafter.

	## Limitations

	- Single-source dataset; cropped-ROI inputs; mild class imbalance.
	- The ≥0.95-sensitivity operating point set on validation yielded 0.904
	sensitivity on test (70 of 731 malignant nodules missed, about 10%), so the
	operating point does not transfer perfectly. Re-selecting the threshold on
	local data is advisable before any use.
	- Leakage checks (exact-pixel hash + filename-ID overlap) are exhaustive for the
	available signal but cannot exclude same-patient/near-duplicate leakage if such
	structure exists upstream in TN5000.

	## External validation — NOT yet performed

	No external/independent dataset has been evaluated. `evaluate_external.py` is
	provided to run the locked model (same preprocessing, calibration T, and locked
	threshold) on a future external set (folder or CSV format). External, ideally
	prospective and multi-site, validation is required before any clinical use.
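Applying the locked pipeline to a new image reduces to three steps: take the model's raw logit, divide by the calibration temperature, and compare the resulting probability with the locked threshold. The sketch below assumes a single-logit sigmoid head and the standard `logit / T` form of temperature scaling; it is not the contents of `evaluate_external.py`, and the function names are illustrative.

```python
import math

TEMPERATURE = 0.5646   # calibration temperature reported in this log
THRESHOLD = 0.7113139  # locked threshold on the calibrated probability


def calibrated_probability(logit, temperature=TEMPERATURE):
    """Malignancy probability after temperature scaling (assumed form: logit / T)."""
    z = logit / temperature
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_malignant(logit):
    """Apply the locked threshold to the calibrated probability."""
    return calibrated_probability(logit) >= THRESHOLD


# Equivalent cut-off expressed on the raw logit, useful as a sanity check.
logit_cutoff = TEMPERATURE * math.log(THRESHOLD / (1.0 - THRESHOLD))

for raw_logit in (0.3, 1.0):
    p = calibrated_probability(raw_logit)
    print(f"logit={raw_logit:+.2f} -> p={p:.4f} -> malignant={predict_malignant(raw_logit)}")
print(f"raw-logit cut-off = {logit_cutoff:.4f}")
```

Keeping T and the threshold as fixed constants mirrors the intent of the locked evaluation: nothing is re-fitted on the external data.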

	## Test-set integrity statement

	> The Test split was evaluated exactly once, and only after the model was
	> selected (by validation AUROC), calibrated (temperature scaling on validation),
	> and the decision threshold was locked (on validation). No hyperparameter,
	> calibration, or threshold decision used the test set.