Agentic Thyroid ResNet-18 — Ultrasound Nodule Malignancy Classifier

⚠️ RESEARCH USE ONLY — NOT FOR CLINICAL USE. This model is a research artifact trained on a single retrospective dataset. It has not been externally validated and must not be used for diagnosis, screening, or any clinical decision-making. External, prospective validation is required before any clinical consideration.

A ResNet-18 binary classifier that predicts the probability that a cropped thyroid ultrasound nodule image is malignant (positive class) versus benign. Built for a reproducible, publication-oriented experiment with temperature-scaled probabilities and a sensitivity-prioritized decision threshold that was locked on validation data.

  • Backbone: timm ResNet-18, A1 ImageNet-1k recipe (resnet18.a1_in1k), full fine-tune
  • Selected by: validation AUROC (14-trial sweep; winner val AUROC 0.9756)
  • Calibration: temperature scaling (T = 0.5646), fit on validation
  • Locked threshold: 0.7113 (highest-specificity threshold with validation sensitivity ≥ 0.95)

Cite this model

Thomas J. agentic_thyroid_model [software]. Revision e4f94ea. Hugging Face; 2026. doi:10.57967/hf/9282

Intended use

  • Intended: methodological research, benchmarking, and as a baseline for thyroid ultrasound malignancy classification studies.
  • Out of scope: any clinical, diagnostic, triage, or screening use; use on images acquired/preprocessed differently from the training data without re-validation; use on non-thyroid or non-ultrasound images.

Dataset

  • Source: Johnyquest7/TN5000-thyroid-nodule-classification (derived from TN5000; nodule ROI cropped to 224×224 RGB PNG).
  • Splits (kept strictly separate): Train 3,500 · Valid 500 · Test 1,000.
  • Labels: 0 = Benign, 1 = Malignant (positive class = Malignant).
  • Class balance: ~70–75% malignant in every split (mild imbalance). See data_exploration_report.md: 0 corrupt images, 0 cross-split pixel duplicates, 0 filename-ID overlaps → no detectable leakage.
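The exact-duplicate check summarized above can be sketched as follows. This is a minimal illustration and not the code in explore_data.py: the function name `cross_split_duplicates` and the input layout (a dict mapping split name to raw pixel buffers as bytes) are assumptions, and decoding the PNG files into pixel buffers would need an image library that is left out here.

```python
import hashlib
from itertools import combinations


def cross_split_duplicates(splits):
    """Return hashes of pixel buffers that occur in more than one split.

    `splits` maps a split name to a list of raw pixel buffers (bytes).
    Hypothetical helper for illustration; not taken from explore_data.py.
    """
    # Hash every buffer once and collect the set of digests per split.
    digests = {
        name: {hashlib.sha256(buf).hexdigest() for buf in buffers}
        for name, buffers in splits.items()
    }
    overlaps = {}
    # Compare every pair of splits and keep only non-empty intersections.
    for a, b in combinations(sorted(digests), 2):
        shared = digests[a] & digests[b]
        if shared:
            overlaps[(a, b)] = shared
    return overlaps
```

An empty result corresponds to the "0 cross-split pixel duplicates" finding. Exact hashing of this kind cannot detect near-duplicates such as re-encoded or slightly shifted crops.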

Label definitions

| Label | Class | Meaning |
|---|---|---|
| 0 | Benign | Non-malignant thyroid nodule |
| 1 | Malignant | Malignant thyroid nodule (positive class) |

Preprocessing (locked — configs/preprocess.json)

Deterministic eval/inference path (no augmentation):

  1. Resize to 224×224 (bicubic; the timm A1 data config).
  2. ToTensor().
  3. Normalize with ImageNet mean [0.485, 0.456, 0.406], std [0.229, 0.224, 0.225].

Grayscale ultrasound images are loaded as 3-channel RGB.
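Steps 2 and 3 above reduce to simple per-channel arithmetic, which the sketch below works through for a single 8-bit grayscale intensity. The helper name `normalize_gray_pixel` is hypothetical, the resize step is omitted because it needs an image library, and the assumption is that grayscale values are replicated across the three channels before normalization.

```python
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def normalize_gray_pixel(value):
    """Map one 8-bit grayscale value to three normalized channel values.

    Hypothetical helper: replicates gray -> RGB, scales to [0, 1] as
    ToTensor() does for 8-bit images, then applies ImageNet normalization.
    """
    if not 0 <= value <= 255:
        raise ValueError("expected an 8-bit intensity in [0, 255]")
    scaled = value / 255.0  # ToTensor() scaling for uint8 input
    return tuple((scaled - m) / s for m, s in zip(IMAGENET_MEAN, IMAGENET_STD))


if __name__ == "__main__":
    print(normalize_gray_pixel(0))    # darkest pixel
    print(normalize_gray_pixel(255))  # brightest pixel
```

Because the three channels use different means and standard deviations, a gray pixel does not stay gray after normalization; the channels end up with slightly different values.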

Model architecture

ResNet-18 with a single-logit binary head (num_classes=1); a sigmoid maps the logit to the probability of malignancy. ~11.2M parameters. Initialized from ImageNet-1k pretrained weights and fully fine-tuned.
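To make the path from the single logit to a decision concrete, here is a minimal sketch using the temperature and threshold reported in this card. It assumes the locked threshold is applied to the temperature-scaled probability with a `>=` comparison; the function names are illustrative, and the repo's own scripts define the actual rule.

```python
import math

TEMPERATURE = 0.5646  # value reported in this card
THRESHOLD = 0.7113    # locked threshold reported in this card


def calibrated_probability(logit, temperature=TEMPERATURE):
    """Temperature scaling: divide the logit by T, then apply a sigmoid."""
    z = logit / temperature
    # Numerically stable sigmoid for large positive or negative z.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_label(logit, threshold=THRESHOLD):
    """Return 1 (Malignant) if the calibrated probability reaches the threshold.

    ASSUMPTION: threshold is compared against the calibrated probability.
    """
    return 1 if calibrated_probability(logit) >= threshold else 0
```

Since T is below 1, dividing by it enlarges the logit's magnitude, so calibrated probabilities move away from 0.5 relative to the raw sigmoid output.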

Training procedure (final / winning config)

| Setting | Value |
|---|---|
| Backbone | timm:resnet18.a1_in1k, full fine-tune |
| Loss | Focal loss (γ=1.0, α=0.5) |
| Class-imbalance handling | none beyond focal (mild imbalance; preserves calibration) |
| Augmentation | medical_default: hflip, mild affine (rot ≤10°, translate 5%, scale 0.9–1.1), mild brightness/contrast (±15%), light Gaussian blur (p=0.2) |
| Optimizer | AdamW, lr 2e-4, weight decay 1e-4 |
| Scheduler | Cosine annealing, 2-epoch warmup |
| Batch size | 32 |
| Epochs | ≤40, early stopping on val AUROC (patience 8) |
| Mixed precision | yes (fp16 autocast) |
| Seed | 42 (strict determinism; CUBLAS_WORKSPACE_CONFIG=:4096:8, cuDNN deterministic) |
| Best epoch | 6 |
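For reference, the loss setting above can be written out for a single example. The sketch uses the standard binary focal loss form, -α_t (1 - p_t)^γ log(p_t), with γ=1.0 and α=0.5 as listed; it is an assumption that train.py follows this exact convention (alpha weighting and reduction may differ), and `binary_focal_loss` is an illustrative name.

```python
import math


def binary_focal_loss(logit, target, gamma=1.0, alpha=0.5):
    """Focal loss for one example with a single-logit head.

    Standard form: -alpha_t * (1 - p_t)**gamma * log(p_t).
    ASSUMPTION: train.py may use a different alpha convention or reduction.
    """
    if target not in (0, 1):
        raise ValueError("target must be 0 (Benign) or 1 (Malignant)")

    def log_sigmoid(x):
        # log(sigmoid(x)) computed stably as -softplus(-x).
        return -(max(-x, 0.0) + math.log1p(math.exp(-abs(x))))

    signed = logit if target == 1 else -logit
    log_pt = log_sigmoid(signed)  # log-probability assigned to the true class
    pt = math.exp(log_pt)
    alpha_t = alpha if target == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - pt) ** gamma * log_pt
```

With α=0.5 both classes receive the same weight, so the only effect beyond cross-entropy is the (1 - p_t)^γ factor, which down-weights examples the model already classifies confidently.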

Augmentations were chosen to be medically plausible for B-mode ultrasound (no vertical flip, no large rotation/shear, no aggressive crop, no color/HSV jitter), informed by MediAug (arXiv:2504.18983) and thyroid-US best practice. The augmentation ablation confirmed medical_default (0.9712–0.9756 val AUROC) outperforms both flip_only (0.9637) and medical_strong (0.9609).

Environment: torch 2.12.0+cu130, torchvision 0.27.0+cu130, timm 1.0.27, scikit-learn 1.9.0, numpy 2.4.6, NVIDIA A10G (CUDA 13.0, cuDNN 9.2). Full versions in configs/env_info.json.
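The seed and determinism settings listed in the training table could be applied with a helper along these lines. This is a sketch, not the repo's code: `seed_everything` is a hypothetical name, and the NumPy and PyTorch steps run only when those packages are installed.

```python
import os
import random


def seed_everything(seed=42):
    """Set seeds and determinism flags; numpy/torch parts run only if installed."""
    # Needs to be set before the first cuBLAS call to take effect.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)  # seeds CPU and CUDA generators
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
        torch.use_deterministic_algorithms(True)
    except ImportError:
        pass
    return seed
```

Calling the helper once at the top of a training script, before any CUDA work, is the usual placement. Bitwise reproducibility across different GPUs or library versions is still not guaranteed.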

Validation threshold strategy

After selecting the model by validation AUROC and calibrating on validation, the decision threshold was chosen on the validation set as the highest-specificity threshold achieving sensitivity ≥ 0.95 (clinically sensitivity-prioritized). Youden's J is reported as a secondary reference. The threshold (0.7113) was locked before the test set was touched.

  • Validation @ locked threshold: sensitivity 0.952, specificity 0.896.
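The selection rule described above (highest specificity subject to a sensitivity floor) can be sketched in plain Python. The function name `select_threshold` is hypothetical, the rule "predict positive when probability >= threshold" is an assumption, and this is not the code in finalize.py.

```python
def select_threshold(labels, probs, min_sensitivity=0.95):
    """Pick the threshold with the highest specificity whose sensitivity is
    at least `min_sensitivity`, predicting positive when prob >= threshold.

    Returns (threshold, sensitivity, specificity), or None if no candidate
    qualifies. Hypothetical helper for illustration.
    """
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative label")

    best = None
    for t in sorted(set(probs)):  # every observed score is a candidate
        tp = sum(1 for y, p in zip(labels, probs) if y == 1 and p >= t)
        tn = sum(1 for y, p in zip(labels, probs) if y == 0 and p < t)
        sens, spec = tp / n_pos, tn / n_neg
        if sens >= min_sensitivity:
            # Prefer higher specificity; break ties with the higher threshold.
            if best is None or (spec, t) > (best[2], best[0]):
                best = (t, sens, spec)
    return best
```

Raising the threshold can only lower sensitivity and raise specificity, so the chosen point is the largest candidate that still satisfies the sensitivity floor. Running this on validation data only, and never on test data, is what keeps the threshold locked.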

Calibration results

Temperature scaling (T=0.5646), fit on validation, reduced validation ECE from 0.0833 to 0.0308 and the Brier score from 0.0592 to 0.0525. AUROC is unchanged because dividing the logit by a positive temperature is a monotonic transform and preserves the ranking of predictions. On the test set, calibrated ECE is 0.0314, close to the validation value (well calibrated). Reliability diagrams: results/figures/valid_calibration.png, results/figures/test_calibration.png.
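The two calibration metrics quoted above can be computed as in the sketch below. The card does not state the binning scheme, so the equal-width bins and the default of 10 bins are assumptions, and values produced this way need not match the reported numbers exactly.

```python
def brier_score(labels, probs):
    """Mean squared error between predicted probability and the 0/1 label."""
    return sum((p - y) ** 2 for y, p in zip(labels, probs)) / len(labels)


def expected_calibration_error(labels, probs, n_bins=10):
    """Binary ECE with equal-width bins over the positive-class probability.

    ASSUMPTION: bin count and binning scheme; the card does not state them.
    """
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(labels, probs):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((y, p))
    total = len(labels)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        observed = sum(y for y, _ in members) / len(members)   # fraction malignant
        predicted = sum(p for _, p in members) / len(members)  # mean predicted prob
        ece += (len(members) / total) * abs(observed - predicted)
    return ece
```

ECE weights each bin's gap between observed malignancy rate and mean predicted probability by the bin's share of samples, so a lower value means predicted probabilities track observed frequencies more closely.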

Files in this repo

final_model.pt                 # locked weights + backbone/preprocess metadata
configs/final_config.yaml      # single-command training config
configs/preprocess.json        # locked preprocessing
configs/calibration.json       # temperature scaling parameter + before/after metrics
configs/threshold.json         # locked decision threshold + selection method
configs/env_info.json          # exact package/hardware/CUDA versions
train.py  evaluate.py  evaluate_external.py  explore_data.py  sweep.py  finalize.py
thyroid_lib.py                 # shared preprocessing/model/calibration/metrics
requirements.txt
data_exploration_report.md     # full data audit (counts, leakage, intensity, grids)
LOG.md                         # chronological experiment log
results/                       # figures, tables (incl. CIs + sweep leaderboard), per-image CSVs; no test-time augmentation (TTA)

Reproduce

pip install -r requirements.txt
python explore_data.py --dataset_id Johnyquest7/TN5000-thyroid-nodule-classification
python train.py    --config configs/final_config.yaml
python evaluate.py --split test --config configs/final_config.yaml

Evaluate a future external dataset with the same preprocessing/calibration/threshold:

python evaluate_external.py \
  --model_repo Johnyquest7/agentic_thyroid_model \
  --data_dir /path/to/external_dataset \
  --output_dir external_results
# external dataset = folder with Benign/ Malignant/ subfolders, OR add --csv labels.csv
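For the folder-based layout mentioned in the comment above, labels can be derived from the subfolder names. The sketch below is illustrative and is not the implementation in evaluate_external.py; `list_labeled_images` and the accepted file extensions are assumptions.

```python
from pathlib import Path

CLASS_TO_LABEL = {"Benign": 0, "Malignant": 1}
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}  # ASSUMPTION: accepted extensions


def list_labeled_images(data_dir):
    """Return (path, label) pairs, Benign first, sorted by name within a class."""
    root = Path(data_dir)
    items = []
    for class_name, label in CLASS_TO_LABEL.items():
        class_dir = root / class_name
        if not class_dir.is_dir():
            raise FileNotFoundError(f"missing class folder: {class_dir}")
        for path in sorted(class_dir.iterdir()):
            # Skip non-image files such as notes or hidden system files.
            if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES:
                items.append((path, label))
    return items
```

Failing loudly on a missing class folder avoids silently evaluating on a single class, which would make sensitivity or specificity undefined.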

Limitations, bias, and leakage concerns

  • Single-source dataset; no external validation. Performance on data from other scanners, institutions, or populations is unknown and likely lower.
  • Cropped-ROI inputs. The model expects nodule-cropped images like TN5000; whole-frame ultrasound will be out of distribution.
  • Sensitivity gap at deployment threshold. The threshold targets ≥0.95 sensitivity on validation; on the test set sensitivity was 0.904 — i.e. the operating point does not perfectly transfer, and ~10% of malignant nodules were missed at this threshold. Threshold re-calibration on local data is advisable before any use.
  • Label/selection bias. Labels and cohort composition reflect the source dataset's referral and pathology-confirmation process.
  • Leakage checks were exhaustive within the available signal (exact pixel hashing + filename-ID overlap, all zero) but cannot rule out near-duplicate or same-patient-different-image leakage if such structure exists in the source.

⚠️ External validation required before clinical use

This model requires independent, ideally prospective, multi-site external validation before any clinical consideration. It is released for research reproducibility only.

Citation

Dataset: TN5000 (Yu et al., Scientific Data, 2025). Augmentation guidance: MediAug, arXiv:2504.18983.

Generated by ML Intern

This model repository was generated by ML Intern, an agent for machine learning research and development on the Hugging Face Hub.

Usage

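This repository holds a PyTorch/timm ResNet-18 checkpoint rather than a transformers language model, so AutoModelForCausalLM and AutoTokenizer do not apply. A minimal loading sketch follows; the checkpoint's internal layout is not documented in this card, so the 'state_dict' key is an assumption, and thyroid_lib.py and evaluate_external.py remain the reference loading path.

import timm
import torch
from huggingface_hub import hf_hub_download

model_id = 'Johnyquest7/agentic_thyroid_model'
ckpt = torch.load(hf_hub_download(model_id, 'final_model.pt'), map_location='cpu')
model = timm.create_model('resnet18.a1_in1k', pretrained=False, num_classes=1)
# ASSUMPTION: weights are stored under 'state_dict'; otherwise ckpt is treated as a bare state dict.
model.load_state_dict(ckpt.get('state_dict', ckpt))
model.eval()

Apply the locked preprocessing, temperature (T = 0.5646), and threshold (0.7113) described above to the model's single logit.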
