# Experiment Log: Agentic Thyroid ResNet-18

Chronological, decision-by-decision record for reproducibility and journal review.

## Provenance

- **Experiment date (UTC):** 2026-06-05
- **Dataset:** `Johnyquest7/TN5000-thyroid-nodule-classification`
  - Commit SHA: `73d6c0713a89c8c07125fe6ffc5956be60e9853d`
  - Loaded **directly from the Train/Valid/Test folder structure** (NOT the
    datasets-viewer flattened `train` config, which merges all 5,000 rows),
    so the predefined splits are respected.
- **Compute:** Hugging Face GPU sandbox, NVIDIA A10G (24 GB), CUDA 13.0, cuDNN 9.2.
- **Key packages:** torch 2.12.0+cu130, torchvision 0.27.0+cu130, timm 1.0.27,
  scikit-learn 1.9.0, numpy 2.4.6, trackio 0.26.0 (see `configs/env_info.json`).
- **Global seed:** 42. Strict determinism: `torch.use_deterministic_algorithms(True)`,
  cuDNN deterministic, `CUBLAS_WORKSPACE_CONFIG=:4096:8`, seeded DataLoader workers.
- **Positive class:** Malignant (label 1).
- **Experiment tracking:** Trackio project `agentic_thyroid_resnet18`, dashboard
  Space `Johnyquest7/Trakio_agentic_thyroid`, storage dataset
  `Johnyquest7/Trakio_agentic_thyroid_dataset`.
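
The determinism settings listed above could be applied with a helper along these lines. This is a minimal sketch, not the project's actual code: `seed_everything` and `seed_worker` are hypothetical names, and NumPy and PyTorch are imported lazily so the standard-library part still runs where they are not installed.

```python
import os
import random


def seed_everything(seed: int = 42) -> None:
    """Seed the Python, NumPy and PyTorch RNGs and request deterministic kernels."""
    # Needs to be set before the first cuBLAS call to take effect on CUDA.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
    except ImportError:
        return
    torch.manual_seed(seed)                      # seeds CPU and CUDA generators
    torch.use_deterministic_algorithms(True)     # error out on nondeterministic ops
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False


def seed_worker(worker_id: int) -> None:
    """DataLoader worker_init_fn: derive the worker's Python/NumPy seed from torch."""
    import numpy as np
    import torch
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)
```

In a training script, `seed_worker` would typically be passed as `worker_init_fn` to the `DataLoader`, together with a `torch.Generator` seeded with the same global seed.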

## Exact split usage

| Split | n | Use |
|-------|--:|-----|
| Train | 3,500 | **Training only.** |
| Valid | 500 | **Model selection (val AUROC), calibration, threshold selection.** |
| Test | 1,000 | **Final locked evaluation, exactly once**, after model+calibration+threshold were frozen. |

## Class distribution

| Split | Benign | Malignant | Malignant % | Ratio (M:B) |
|-------|-------:|----------:|------------:|------------:|
| Train | 1,032 | 2,468 | 70.5% | 2.39 : 1 |
| Valid |   125 |   375 | 75.0% | 3.00 : 1 |
| Test  |   269 |   731 | 73.1% | 2.72 : 1 |
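
The derived columns in the table follow directly from the raw counts. A short sketch of the arithmetic (the `summarize` helper is illustrative, not part of the project code):

```python
# Benign / malignant counts per split, copied from the table above.
COUNTS = {
    "Train": (1032, 2468),
    "Valid": (125, 375),
    "Test": (269, 731),
}


def summarize(benign: int, malignant: int) -> tuple[int, float, float]:
    """Return (n, malignant percentage, malignant-to-benign ratio)."""
    n = benign + malignant
    return n, 100.0 * malignant / n, malignant / benign


for split, (benign, malignant) in COUNTS.items():
    n, pct, ratio = summarize(benign, malignant)
    print(f"{split}: n={n}, malignant={pct:.1f}%, M:B={ratio:.2f}:1")
```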

Data audit (`data_exploration_report.md`): **0 corrupt images**, all 224×224 RGB
PNG; **0 cross-split exact-pixel duplicates**, **0 filename-ID overlaps**, **0
label conflicts**; per-split mean intensity ≈ 81.4 (std ≈ 19.4), so no intensity
shift between splits was detected. Conclusion: **no detectable leakage; splits are clean and separate.**
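
A cross-split duplicate check of this kind can be sketched as follows. Two assumptions are baked in: the function hashes raw file bytes to stay within the standard library (the audit describes exact-pixel hashing, which would hash decoded pixel arrays instead), and the `root/{Train,Valid,Test}/...` layout with `.png` files is inferred from the provenance notes. The function names are hypothetical.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path: Path) -> str:
    """SHA-256 of the raw file bytes (a stand-in for hashing decoded pixels)."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def cross_split_duplicates(root: Path, splits=("Train", "Valid", "Test")) -> dict[str, list[Path]]:
    """Map digest -> files, keeping only digests that occur in more than one split."""
    by_digest: dict[str, list[tuple[str, Path]]] = defaultdict(list)
    for split in splits:
        for path in sorted((root / split).rglob("*.png")):
            by_digest[file_digest(path)].append((split, path))
    return {
        digest: [p for _, p in hits]
        for digest, hits in by_digest.items()
        if len({s for s, _ in hits}) > 1   # within-split repeats are not leakage
    }
```

An empty result corresponds to the "0 cross-split duplicates" finding; it says nothing about near-duplicates or same-patient images.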

## Literature-informed augmentation rationale

Augmentations were restricted to medically plausible B-mode ultrasound transforms
(MediAug arXiv:2504.18983 + thyroid-US practice):
- **Kept:** horizontal flip (thyroid is bilaterally symmetric), small rotation
  (≤10°), mild affine translate (5%) / scale (0.9–1.1), mild brightness/contrast
  (±15%, simulates gain/TGC), light Gaussian blur, and (in the `medical_strong`
  ablation) narrow random-resized-crop (0.8–1.0) + mild speckle noise.
- **Explicitly avoided:** vertical flip (US depth axis is physically meaningful),
  large rotation/shear (distorts taller-than-wide shape and margin morphology,
  both TI-RADS malignancy cues), aggressive crop (<0.8, can remove the nodule),
  and any color/HSV jitter (images are grayscale).

Ablation result (val AUROC): `medical_default` **0.9712–0.9756** > `flip_only`
**0.9637** > `medical_strong` **0.9609**. The literature-default policy won.

## Model variants tried

- `torchvision` ResNet-18 (ImageNet1K_V1, bilinear/256→224 preprocessing).
- `timm:resnet18.a1_in1k` (A1 recipe, bicubic, crop_pct 0.95): **selected**.
- `timm:resnet18.a2_in1k` (A2 recipe).
- Fine-tune depth: full fine-tune vs freeze stem+layer1 (`freeze_stage=1`).

## Hyperparameter sweep (14 trials, one-factor-at-a-time around a literature-informed center)

Center: timm a1, lr 2e-4, wd 1e-4, bs 32, `medical_default`, `pos_weight`,
full fine-tune, BCE, AdamW, cosine, ≤40 epochs, early-stop(8). All runs logged to
Trackio. **Selection metric: validation AUROC.** All 14 trials completed (rc=0).

| Rank | Run | Change vs center | Val AUROC | Best epoch |
|-----:|-----|------------------|----------:|-----------:|
| 1 | **c12_loss_focal** | **focal γ=1.0, imbalance=none** | **0.9756** | 6 |
| 2 | c09_imb_none | imbalance=none (BCE) | 0.9739 | 6 |
| 3 | c01_backbone_torchvision | torchvision backbone | 0.9731 | 7 |
| 4 | c03_lr_1e-4 | lr 1e-4 | 0.9721 | 11 |
| 5 | c06_bs_64 | batch size 64 | 0.9717 | 8 |
| 6 | c05_wd_1e-3 | weight decay 1e-3 | 0.9712 | 9 |
| 6 | c00_center_a1 | center config | 0.9712 | 6 |
| 8 | c10_imb_sampler | weighted sampler | 0.9697 | 13 |
| 9 | c02_backbone_a2 | a2 backbone | 0.9693 | 6 |
| 10 | c04_lr_5e-4 | lr 5e-4 | 0.9675 | 9 |
| 11 | c11_freeze1 | freeze stem+layer1 | 0.9672 | 6 |
| 12 | c13_lr1e-4_wd1e-3_drop | lr1e-4+wd1e-3+dropout0.2 | 0.9657 | 11 |
| 13 | c07_aug_flip_only | flip-only aug | 0.9637 | 6 |
| 14 | c08_aug_strong | strong aug | 0.9609 | 8 |

Findings: (1) For this mild (~70/30) imbalance, **focal loss (γ=1.0) and no extra
reweighting beat class-weighted BCE and weighted sampling**; heavy reweighting
slightly hurt, consistent with the literature. (2) `medical_default` augmentation
is the sweet spot. (3) **Full fine-tune > freezing.** (4) Backbones were close
(a1 ≈ torchvision ≈ a2). Full per-run details: `results/tables/sweep_leaderboard.json`.

The sweep was deliberately kept small (14 trials) to limit overfitting to the
500-image validation set.

## Selected run

**c12_loss_focal**: `timm:resnet18.a1_in1k`, focal loss (γ=1.0, α=0.5),
imbalance=none, AdamW lr 2e-4 / wd 1e-4, batch 32, `medical_default` aug, full
fine-tune, cosine schedule, best epoch 6, **validation AUROC 0.9756**. Selected
**purely on validation AUROC**, before any test access. Config:
`configs/final_config.yaml`; weights: `final_model.pt`.
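
For reference, the binary focal loss in its commonly used form is `-α_t (1 - p_t)^γ log(p_t)`, where `p_t` is the predicted probability of the true class. The sketch below evaluates it for a single sample. It assumes `α` weights the positive class and `1 - α` the negative class, which the log does not spell out, and it is not numerically hardened for extreme logits.

```python
import math


def binary_focal_loss(logit: float, target: int, gamma: float = 1.0, alpha: float = 0.5) -> float:
    """Focal loss for one sample: -alpha_t * (1 - p_t)**gamma * log(p_t)."""
    p = 1.0 / (1.0 + math.exp(-logit))               # predicted P(malignant)
    p_t = p if target == 1 else 1.0 - p              # probability of the true class
    alpha_t = alpha if target == 1 else 1.0 - alpha  # class weight (assumed convention)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

With `alpha=0.5` both classes receive the same weight, which is consistent with the `imbalance=none` setting, and with `gamma=0` the expression reduces to BCE scaled by 0.5. Under this convention, `gamma=1.0` simply down-weights samples the model already classifies confidently.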

## Calibration decision

Assessed on validation. **Temperature scaling** (single parameter, LBFGS on NLL)
gave **T = 0.5646** (T < 1 scales the logits up, indicating the raw outputs were
underconfident on validation). Validation ECE 0.0833 → **0.0308**, Brier 0.0592 → 0.0525,
AUROC unchanged (0.9756; temperature scaling is a monotonic map of the logit, so
discrimination is preserved). **Decision: use calibrated probabilities** for thresholding and test
reporting. Parameters + before/after metrics: `configs/calibration.json`.
Reliability diagrams: `results/figures/{valid,test}_calibration.png`.
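
Temperature scaling fits a single scalar `T` so that `sigmoid(logit / T)` minimizes NLL on the validation set. The sketch below illustrates the idea with a golden-section search over `log T` rather than LBFGS, purely to stay within the standard library; the function names and search bounds are assumptions, not the project's implementation.

```python
import math


def nll(logits, labels, temperature):
    """Mean binary negative log-likelihood of sigmoid(logit / T)."""
    total = 0.0
    for z, y in zip(logits, labels):
        s = z / temperature
        # Stable softplus(s) = log(1 + exp(s)); per-sample NLL = softplus(s) - y * s.
        softplus = max(s, 0.0) + math.log1p(math.exp(-abs(s)))
        total += softplus - y * s
    return total / len(logits)


def fit_temperature(logits, labels, lo=0.05, hi=20.0, tol=1e-6):
    """Golden-section search for the NLL-minimizing T in [lo, hi], in log-space."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = math.log(lo), math.log(hi)
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if nll(logits, labels, math.exp(c)) < nll(logits, labels, math.exp(d)):
            b = d   # minimum lies in [a, d]
        else:
            a = c   # minimum lies in [c, b]
    return math.exp((a + b) / 2.0)


def calibrated_probability(logit, temperature):
    """Apply the fitted temperature to one logit."""
    return 1.0 / (1.0 + math.exp(-logit / temperature))
```

Because dividing by a positive `T` never reorders logits, ranking metrics such as AUROC are unaffected; only the confidence of the probabilities changes.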

## Threshold selection decision

On the validation set, using calibrated probabilities, the primary threshold was
the **highest-specificity threshold achieving sensitivity ≥ 0.95**
(sensitivity-prioritized, clinically motivated). The target was achievable.

- **Locked threshold = 0.7113** → validation sensitivity **0.952**, specificity **0.896**.
- Secondary reference (Youden's J): coincided at 0.7113 here.

Threshold **locked before** the test set was evaluated. Config: `configs/threshold.json`.
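
The selection rule can be sketched as below, assuming a case is called malignant when its calibrated probability is at or above the threshold (the log does not state whether the comparison is inclusive). Raising the threshold can only lower sensitivity and raise specificity, so the highest threshold that still meets the sensitivity target is also the highest-specificity one. Function names are illustrative.

```python
import math


def select_threshold(probs, labels, min_sensitivity=0.95):
    """Largest threshold t such that calling p >= t positive keeps sensitivity >= target."""
    pos = sorted((p for p, y in zip(probs, labels) if y == 1), reverse=True)
    if not pos:
        raise ValueError("no positive samples")
    # Number of positives that must be caught; the epsilon guards against float round-up.
    k = max(1, math.ceil(min_sensitivity * len(pos) - 1e-9))
    return pos[k - 1]   # k-th highest positive score


def sensitivity_specificity(probs, labels, threshold):
    """Sensitivity and specificity when p >= threshold is called positive."""
    tp = sum(1 for p, y in zip(probs, labels) if y == 1 and p >= threshold)
    fn = sum(1 for p, y in zip(probs, labels) if y == 1 and p < threshold)
    tn = sum(1 for p, y in zip(probs, labels) if y == 0 and p < threshold)
    fp = sum(1 for p, y in zip(probs, labels) if y == 0 and p >= threshold)
    return tp / (tp + fn), tn / (tn + fp)
```

With 375 malignant validation cases, the rule needs at least ceil(0.95 × 375) = 357 true positives, and 357/375 = 0.952 matches the reported validation sensitivity.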

## Final locked threshold

**0.7113139** (on calibrated malignancy probability).

## Final test results with 95% CIs

Test split (n=1000), calibrated probabilities + locked threshold. CIs: stratified
bootstrap, 2000 resamples, seed=42.

| Metric | Point | 95% CI |
|--------|------:|:------:|
| AUROC | 0.9371 | [0.9202, 0.9528] |
| Sensitivity | 0.9042 | [0.8824, 0.9248] |
| Specificity | 0.7955 | [0.7435, 0.8439] |
| PPV | 0.9232 | [0.9054, 0.9401] |
| NPV | 0.7535 | [0.7123, 0.7979] |
| Accuracy | 0.8750 | [0.8540, 0.8950] |
| F1 | 0.9136 | [0.8991, 0.9278] |
| Brier | 0.0823 | n/a |
| ECE | 0.0314 | n/a |
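
A stratified bootstrap of the kind described above resamples malignant and benign cases separately, so each resample keeps the original class sizes. The sketch below assumes a percentile interval and Python's `random.Random` generator, neither of which the log specifies, so it is not expected to reproduce the table's bounds exactly. The demo rebuilds thresholded predictions from the test confusion matrix, which works for sensitivity and specificity but not for AUROC.

```python
import random


def stratified_bootstrap_ci(pos, neg, metric, n_resamples=2000, seed=42, level=0.95):
    """Percentile CI for metric(pos_sample, neg_sample), resampling each class separately."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_resamples):
        pos_sample = rng.choices(pos, k=len(pos))   # class sizes stay fixed
        neg_sample = rng.choices(neg, k=len(neg))
        stats.append(metric(pos_sample, neg_sample))
    stats.sort()
    tail = (1.0 - level) / 2.0
    lo = stats[int(tail * n_resamples)]
    hi = stats[min(n_resamples - 1, int((1.0 - tail) * n_resamples))]
    return lo, hi


# Thresholded predictions rebuilt from the test confusion matrix (1 = predicted malignant).
malignant_preds = [1] * 661 + [0] * 70   # TP, FN
benign_preds = [1] * 55 + [0] * 214      # FP, TN


def sensitivity(pos_preds, neg_preds):
    return sum(pos_preds) / len(pos_preds)


def specificity(pos_preds, neg_preds):
    return 1.0 - sum(neg_preds) / len(neg_preds)


sens_ci = stratified_bootstrap_ci(malignant_preds, benign_preds, sensitivity)
spec_ci = stratified_bootstrap_ci(malignant_preds, benign_preds, specificity)
print(f"sensitivity 95% CI: [{sens_ci[0]:.4f}, {sens_ci[1]:.4f}]")
print(f"specificity 95% CI: [{spec_ci[0]:.4f}, {spec_ci[1]:.4f}]")
```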

Confusion matrix (Test): TN=214, FP=55, FN=70, TP=661.
Tables: `results/tables/test_metrics_with_ci.{md,csv}`; per-image predictions:
`results/{valid,test}_predictions.csv`.
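
The threshold-based point estimates in the table can be re-derived from the confusion matrix alone. A small sketch (the `confusion_metrics` helper is illustrative):

```python
def confusion_metrics(tn: int, fp: int, fn: int, tp: int) -> dict[str, float]:
    """Standard threshold metrics with Malignant as the positive class."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "f1": 2 * tp / (2 * tp + fp + fn),
    }


metrics = confusion_metrics(tn=214, fp=55, fn=70, tp=661)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Rounded to four decimals, these values agree with the Sensitivity, Specificity, PPV, NPV, Accuracy, and F1 rows above.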

## Failed / weaker runs

No run errored (all rc=0). Weakest configurations: `medical_strong` augmentation
(0.9609) and `flip_only` (0.9637), both under the `medical_default` baseline,
confirming the augmentation policy choice. Earlier trackio smoke runs
(`smoke_trackio_test*`, `connectivity_check`, `dataset_pin_check`) are
infrastructure-validation runs, not experiments. Early Trackio Space creation
produced transient 401 `/volumes` warnings until a persistent `dataset_id`
(`Johnyquest7/Trakio_agentic_thyroid_dataset`) was pinned; resolved thereafter.

## Limitations

- Single-source dataset; cropped-ROI inputs; mild class imbalance.
- The ≥0.95-sensitivity operating point set on validation yielded **0.904**
  sensitivity on test, so the operating point does not transfer perfectly; ~10% of
  malignant nodules are missed at the locked threshold. Local threshold
  re-calibration is advisable before any use.
- Leakage checks (exact-pixel hash + filename-ID overlap) are exhaustive for the
  available signal but cannot exclude same-patient/near-duplicate leakage if such
  structure exists upstream in TN5000.

## External validation: NOT yet performed

No external/independent dataset has been evaluated. `evaluate_external.py` is
provided to run the locked model (same preprocessing, calibration T, and locked
threshold) on a future external set (folder or CSV format). External, ideally
prospective and multi-site, validation is **required** before any clinical use.

## Test-set integrity statement

> The Test split was evaluated **exactly once**, and **only after** the model was
> selected (by validation AUROC), calibrated (temperature scaling on validation),
> and the decision threshold was locked (on validation). No hyperparameter,
> calibration, or threshold decision used the test set.