Agentic Thyroid ResNet-18 — Ultrasound Nodule Malignancy Classifier

⚠️ RESEARCH USE ONLY — NOT FOR CLINICAL USE. This model is a research artifact trained on a single retrospective dataset. It has not been externally validated and must not be used for diagnosis, screening, or any clinical decision-making. External, prospective validation is required before any clinical consideration.

A ResNet-18 binary classifier that predicts the probability that a cropped thyroid ultrasound nodule image is malignant (positive class) vs benign. Built for a reproducible, publication-oriented experiment with post-hoc temperature-scaling calibration and a sensitivity-prioritized decision threshold that was locked on the validation set.

Backbone: timm ResNet-18, A1 ImageNet-1k recipe (resnet18.a1_in1k), full fine-tune
Selected by: validation AUROC (14-trial sweep; winner val AUROC 0.9756)
Calibration: temperature scaling (T = 0.5646), fit on validation
Locked threshold: 0.7113 (highest-specificity threshold with validation sensitivity ≥ 0.95)

Cite this model

Thomas J. agentic_thyroid_model [software]. Revision e4f94ea. Hugging Face; 2026. doi:10.57967/hf/9282

Intended use

Intended: methodological research, benchmarking, and as a baseline for thyroid ultrasound malignancy classification studies.
Out of scope: any clinical, diagnostic, triage, or screening use; use on images acquired/preprocessed differently from the training data without re-validation; use on non-thyroid or non-ultrasound images.

Dataset

Source: Johnyquest7/TN5000-thyroid-nodule-classification (derived from TN5000; nodule ROI cropped to 224×224 RGB PNG).
Splits (kept strictly separate): Train 3,500 · Valid 500 · Test 1,000.
Labels: 0 = Benign, 1 = Malignant (positive class = Malignant).
Class balance: ~70–75% malignant in every split (mild imbalance). See data_exploration_report.md: 0 corrupt images, 0 cross-split pixel duplicates, 0 filename-ID overlaps → no detectable leakage.
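
The two leakage checks named above (exact pixel duplicates across splits, and shared image IDs across splits) can be sketched as follows. This is a minimal illustration, not the code in explore_data.py: the function names and the in-memory `{split: {image_id: pixels}}` layout are assumptions made for the example.

```python
import hashlib
import numpy as np

def pixel_hash(image: np.ndarray) -> str:
    """SHA-256 over shape, dtype, and raw pixel bytes; identical arrays hash identically."""
    arr = np.ascontiguousarray(image)
    header = f"{arr.shape}|{arr.dtype}|".encode()
    return hashlib.sha256(header + arr.tobytes()).hexdigest()

def cross_split_duplicates(splits):
    """splits: {split_name: {image_id: pixel array}} (hypothetical layout).

    Returns (split_a, id_a, split_b, id_b) for every pair of pixel-identical
    images that live in different splits.
    """
    seen = {}  # hash -> list of (split, image_id)
    for split, images in splits.items():
        for image_id, pixels in images.items():
            seen.setdefault(pixel_hash(pixels), []).append((split, image_id))
    dupes = []
    for entries in seen.values():
        for i, (split_a, id_a) in enumerate(entries):
            for split_b, id_b in entries[i + 1:]:
                if split_a != split_b:
                    dupes.append((split_a, id_a, split_b, id_b))
    return dupes

def id_overlaps(splits):
    """Image IDs that appear in more than one split."""
    names = list(splits)
    overlaps = set()
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            overlaps |= set(splits[a]) & set(splits[b])
    return overlaps
```

Exact hashing only catches byte-identical pixels; re-encoded or slightly shifted copies of the same frame would pass this check.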

Label definitions

Label	Class	Meaning
0	Benign	Non-malignant thyroid nodule
1	Malignant	Malignant thyroid nodule (positive class)

Preprocessing (locked — `configs/preprocess.json`)

Deterministic eval/inference path (no augmentation):

Resize to 224×224 (bicubic; the timm A1 data config).
ToTensor().
Normalize with ImageNet mean [0.485, 0.456, 0.406], std [0.229, 0.224, 0.225].

Grayscale ultrasound images are loaded as 3-channel RGB.
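
As a rough illustration of the eval path above, the sketch below reproduces the grayscale-to-RGB, ToTensor scaling, and ImageNet normalization steps in NumPy. It assumes the input is already 224×224, as the dataset's pre-cropped PNGs are, so the bicubic resize is omitted; images of any other size would need that resize first. The function name is hypothetical and this is not the code in thyroid_lib.py.

```python
import numpy as np

IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def preprocess(image: np.ndarray) -> np.ndarray:
    """uint8 image of shape (H, W) or (H, W, 3) -> normalized float32 array (3, H, W)."""
    if image.dtype != np.uint8:
        raise ValueError("expected a uint8 image")
    if image.ndim == 2:
        # Grayscale input: replicate the single channel into 3 RGB channels.
        image = np.stack([image] * 3, axis=-1)
    if image.ndim != 3 or image.shape[-1] != 3:
        raise ValueError("expected shape (H, W) or (H, W, 3)")
    scaled = image.astype(np.float32) / 255.0           # same scaling as ToTensor()
    normalized = (scaled - IMAGENET_MEAN) / IMAGENET_STD  # per-channel normalization
    return np.transpose(normalized, (2, 0, 1))            # HWC -> CHW
```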

Model architecture

ResNet-18 with a single-logit binary head (num_classes=1); sigmoid → probability of malignancy. ~11.2M parameters. Initialized from ImageNet-1k pretrained weights, then fully fine-tuned.
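
To make the path from the single logit to a decision concrete, here is a small sketch using the temperature (0.5646) and locked threshold (0.7113) reported in this card. It assumes the threshold is applied to the temperature-scaled probability and that ties count as malignant (`>=`); both are assumptions, and the function names are illustrative.

```python
import math

TEMPERATURE = 0.5646  # from this card: temperature fit on validation
THRESHOLD = 0.7113    # from this card: locked decision threshold

def malignancy_probability(logit: float, temperature: float = TEMPERATURE) -> float:
    """Numerically stable sigmoid(logit / temperature)."""
    z = logit / temperature
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def classify(logit: float, threshold: float = THRESHOLD):
    """Return (calibrated probability, predicted label) with 1 = Malignant."""
    p = malignancy_probability(logit)
    return p, int(p >= threshold)
```

Because the temperature is below 1, dividing by it pushes probabilities away from 0.5, which is consistent with a raw model that was underconfident before calibration.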

Training procedure (final / winning config)

Setting	Value
Backbone	`timm:resnet18.a1_in1k`, full fine-tune
Loss	Focal loss (γ=1.0, α=0.5)
Class-imbalance handling	none beyond focal (mild imbalance; preserves calibration)
Augmentation	`medical_default`: hflip, mild affine (rot ≤10°, translate 5%, scale 0.9–1.1), mild brightness/contrast (±15%), light Gaussian blur (p=0.2)
Optimizer	AdamW, lr 2e-4, weight decay 1e-4
Scheduler	Cosine annealing, 2-epoch warmup
Batch size	32
Epochs	≤40, early stopping on val AUROC (patience 8)
Mixed precision	yes (fp16 autocast)
Seed	42 (strict determinism; `CUBLAS_WORKSPACE_CONFIG=:4096:8`, cuDNN deterministic)
Best epoch	6
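
For reference, the focal loss row in the table can be written out as below. This sketch follows the standard binary formulation, FL = -α_t (1 - p_t)^γ log(p_t), with mean reduction; whether train.py uses exactly this α convention and reduction is an assumption.

```python
import numpy as np

def binary_focal_loss(logits, targets, gamma=1.0, alpha=0.5):
    """Mean binary focal loss on raw logits; targets are 0 (benign) or 1 (malignant)."""
    logits = np.asarray(logits, dtype=np.float64)
    targets = np.asarray(targets, dtype=np.float64)
    # Stable log(sigmoid(z)) and log(1 - sigmoid(z)).
    log_p = -np.logaddexp(0.0, -logits)
    log_not_p = -np.logaddexp(0.0, logits)
    # Log-probability and probability assigned to the true class.
    log_pt = targets * log_p + (1.0 - targets) * log_not_p
    pt = np.exp(log_pt)
    # alpha weights the positive class, (1 - alpha) the negative class.
    alpha_t = targets * alpha + (1.0 - targets) * (1.0 - alpha)
    return float(np.mean(-alpha_t * (1.0 - pt) ** gamma * log_pt))
```

With α = 0.5 both classes receive the same weight, so the only effect relative to cross-entropy is the (1 - p_t)^γ factor, which down-weights examples the model already classifies confidently.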

Augmentations were chosen to be medically plausible for B-mode ultrasound (no vertical flip, no large rotation/shear, no aggressive crop, no color/HSV jitter), informed by MediAug (arXiv:2504.18983) and thyroid-US best practice. The augmentation ablation confirmed medical_default (0.9712–0.9756 val AUROC) outperforms both flip_only (0.9637) and medical_strong (0.9609).

Environment: torch 2.12.0+cu130, torchvision 0.27.0+cu130, timm 1.0.27, scikit-learn 1.9.0, numpy 2.4.6, NVIDIA A10G (CUDA 13.0, cuDNN 9.2). Full versions in configs/env_info.json.

Validation threshold strategy

After selecting the model by validation AUROC and calibrating on validation, the decision threshold was chosen on the validation set as the highest-specificity threshold achieving sensitivity ≥ 0.95 (clinically sensitivity-prioritized). Youden's J is reported as a secondary reference. The threshold (0.7113) was locked before the test set was touched.

Validation @ locked threshold: sensitivity 0.952, specificity 0.896.
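
The selection rule described above (highest specificity subject to sensitivity ≥ 0.95, with Youden's J as a secondary reference) can be sketched as follows. The function names, the use of observed probabilities as candidate thresholds, and the `>=` decision rule are assumptions for illustration, not the code in finalize.py.

```python
import numpy as np

def sens_spec(probs, labels, threshold):
    """Sensitivity and specificity when predicting malignant for probs >= threshold."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    pred = probs >= threshold
    pos = labels == 1
    neg = ~pos
    sensitivity = np.sum(pred & pos) / np.sum(pos)
    specificity = np.sum(~pred & neg) / np.sum(neg)
    return float(sensitivity), float(specificity)

def lock_threshold(probs, labels, min_sensitivity=0.95):
    """Return (threshold, sensitivity, specificity) maximizing specificity
    among thresholds whose sensitivity is at least min_sensitivity."""
    best = None
    for t in np.unique(probs):  # each observed probability is a candidate
        sens, spec = sens_spec(probs, labels, t)
        if sens >= min_sensitivity:
            if best is None or (spec, sens) > (best[2], best[1]):
                best = (float(t), sens, spec)
    if best is None:
        raise ValueError("no threshold reaches the requested sensitivity")
    return best

def youden_threshold(probs, labels):
    """Threshold maximizing Youden's J = sensitivity + specificity - 1."""
    candidates = np.unique(probs)
    j = [sum(sens_spec(probs, labels, t)) - 1.0 for t in candidates]
    return float(candidates[int(np.argmax(j))])
```

The function would be called once on calibrated validation probabilities, and the returned threshold reused unchanged on the test set.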

Calibration results

Temperature scaling (T=0.5646), fit on validation, reduced validation ECE from 0.0833 to 0.0308 and Brier score from 0.0592 to 0.0525. AUROC is unchanged because dividing the logit by a positive temperature is a monotonic transform and preserves the ranking of predictions. On the test set, calibrated ECE is 0.0314, close to the validation value. Reliability diagrams: results/figures/valid_calibration.png, results/figures/test_calibration.png.
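
A minimal sketch of how a single temperature can be fit and how ECE and Brier score can be measured is given below. The grid search, the 15 equal-width bins, and the positive-class-probability variant of ECE are assumptions; the card does not state which optimizer or binning thyroid_lib.py uses.

```python
import numpy as np

def sigmoid(z):
    """Numerically stable logistic function."""
    return np.exp(-np.logaddexp(0.0, -np.asarray(z, dtype=float)))

def nll(logits, labels, temperature):
    """Mean binary cross-entropy of sigmoid(logits / temperature)."""
    z = logits / temperature
    return float(np.mean(np.logaddexp(0.0, z) - labels * z))

def fit_temperature(logits, labels, grid=None):
    """Pick the temperature that minimizes validation NLL over a log-spaced grid."""
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels, dtype=float)
    if grid is None:
        grid = np.exp(np.linspace(np.log(0.05), np.log(20.0), 2000))
    losses = [nll(logits, labels, t) for t in grid]
    return float(grid[int(np.argmin(losses))])

def expected_calibration_error(probs, labels, n_bins=15):
    """Weighted mean gap between predicted and observed positive rate per bin."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.clip(np.digitize(probs, edges[1:-1]), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            ece += mask.mean() * abs(labels[mask].mean() - probs[mask].mean())
    return float(ece)

def brier_score(probs, labels):
    """Mean squared error between predicted probability and the 0/1 label."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    return float(np.mean((probs - labels) ** 2))
```

The temperature is fit on validation logits only; the same value is then applied to test logits before computing test ECE.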

Files in this repo

final_model.pt                 # locked weights + backbone/preprocess metadata
configs/final_config.yaml      # single-command training config
configs/preprocess.json        # locked preprocessing
configs/calibration.json       # temperature scaling parameter + before/after metrics
configs/threshold.json         # locked decision threshold + selection method
configs/env_info.json          # exact package/hardware/CUDA versions
train.py  evaluate.py  evaluate_external.py  explore_data.py  sweep.py  finalize.py
thyroid_lib.py                 # shared preprocessing/model/calibration/metrics
requirements.txt
data_exploration_report.md     # full data audit (counts, leakage, intensity, grids)
LOG.md                         # chronological experiment log
results/                       # figures, tables (incl. CIs + sweep leaderboard), per-image CSVs; no test-time augmentation (TTA)

Reproduce

pip install -r requirements.txt
python explore_data.py --dataset_id Johnyquest7/TN5000-thyroid-nodule-classification
python train.py    --config configs/final_config.yaml
python evaluate.py --split test --config configs/final_config.yaml

Evaluate a future external dataset with the same preprocessing/calibration/threshold:

python evaluate_external.py \
  --model_repo Johnyquest7/agentic_thyroid_model \
  --data_dir /path/to/external_dataset \
  --output_dir external_results
# external dataset = folder with Benign/ Malignant/ subfolders, OR add --csv labels.csv
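
The folder layout expected for an external dataset can be illustrated with a short sketch that walks the Benign/ and Malignant/ subfolders and assigns the card's labels (0 = Benign, 1 = Malignant). This is not the loader in evaluate_external.py; the function name and the accepted file extensions are assumptions.

```python
from pathlib import Path

LABELS = {"Benign": 0, "Malignant": 1}              # label mapping from this card
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".bmp"}  # assumed set of image types

def list_labeled_images(data_dir):
    """Return a sorted list of (path, label) pairs from Benign/ and Malignant/."""
    root = Path(data_dir)
    samples = []
    for class_name, label in LABELS.items():
        class_dir = root / class_name
        if not class_dir.is_dir():
            raise FileNotFoundError(f"missing class folder: {class_dir}")
        for path in sorted(class_dir.iterdir()):
            # Skip subfolders and non-image files such as notes or CSVs.
            if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES:
                samples.append((path, label))
    return samples
```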

Limitations, bias, and leakage concerns

Single-source dataset; no external validation. Performance on data from other scanners, institutions, or populations is unknown and likely lower.
Cropped-ROI inputs. The model expects nodule-cropped images like TN5000; whole-frame ultrasound will be out of distribution.
Sensitivity gap at the locked threshold. The threshold targets ≥0.95 sensitivity on validation; on the test set sensitivity was 0.904, so the operating point did not fully transfer and about 10% of malignant nodules were missed at this threshold. Re-selecting the threshold on local data is advisable before any use.
Label/selection bias. Labels and cohort composition reflect the source dataset's referral and pathology-confirmation process.
Leakage checks were exhaustive within the available signal (exact pixel hashing + filename-ID overlap, all zero) but cannot rule out near-duplicate or same-patient-different-image leakage if such structure exists in the source.

⚠️ External validation required before clinical use

This model requires independent, ideally prospective, multi-site external validation before any clinical consideration. It is released for research reproducibility only.

Citation

Dataset: TN5000 (Yu et al., Scientific Data, 2025). Augmentation guidance: MediAug, arXiv:2504.18983.

Generated by ML Intern

This model repository was generated by ML Intern, an agent for machine learning research and development on the Hugging Face Hub.

Try ML Intern: https://smolagents-ml-intern.hf.space
Source code: https://github.com/huggingface/ml-intern

Usage

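import torch
from huggingface_hub import hf_hub_download

model_id = 'Johnyquest7/agentic_thyroid_model'
# final_model.pt is a PyTorch checkpoint for a timm ResNet-18, not a transformers model
ckpt_path = hf_hub_download(repo_id=model_id, filename='final_model.pt')
checkpoint = torch.load(ckpt_path, map_location='cpu')

The transformers Auto* classes do not apply to this checkpoint. Its internal layout is not documented in this card; thyroid_lib.py holds the shared preprocessing, model, calibration, and metrics code, and evaluate_external.py (see Reproduce) applies the locked preprocessing, temperature, and threshold.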
