---
license: cc-by-nc-4.0
tags:
- medical-imaging
- thyroid
- ultrasound
- image-classification
- resnet18
- calibration
- ml-intern
pipeline_tag: image-classification
---

# Agentic Thyroid ResNet-18 — Ultrasound Nodule Malignancy Classifier

> ⚠️ **RESEARCH USE ONLY — NOT FOR CLINICAL USE.** This model is a research
> artifact trained on a single retrospective dataset. It has **not** been
> externally validated and **must not** be used for diagnosis, screening, or any
> clinical decision-making. External, prospective validation is required before
> any clinical consideration.

A ResNet-18 binary classifier that predicts the probability that a cropped
thyroid ultrasound nodule image is **malignant** (positive class) vs **benign**.
Built for a reproducible, publication-oriented experiment with proper
calibration and a sensitivity-prioritized, validation-locked decision threshold.

- **Backbone:** `timm` ResNet-18, A1 ImageNet-1k recipe (`resnet18.a1_in1k`), full fine-tune
- **Selected by:** validation AUROC (14-trial sweep; winner val AUROC **0.9756**)
- **Calibration:** temperature scaling (T = 0.5646), fit on validation
- **Locked threshold:** **0.7113** (highest-specificity threshold with validation sensitivity ≥ 0.95)

## Cite this model

Thomas J. agentic_thyroid_model [software]. Revision e4f94ea. Hugging Face; 2026. doi:10.57967/hf/9282

## Intended use

- **Intended:** methodological research, benchmarking, and as a baseline for
  thyroid ultrasound malignancy classification studies.
- **Out of scope:** any clinical, diagnostic, triage, or screening use; use on
  images acquired/preprocessed differently from the training data without
  re-validation; use on non-thyroid or non-ultrasound images.

## Dataset

- **Source:** [`Johnyquest7/TN5000-thyroid-nodule-classification`](https://huggingface.co/datasets/Johnyquest7/TN5000-thyroid-nodule-classification)
  (derived from TN5000; nodule ROI cropped to 224×224 RGB PNG).
- **Splits (kept strictly separate):** Train 3,500 · Valid 500 · Test 1,000.
- **Labels:** `0 = Benign`, `1 = Malignant` (positive class = Malignant).
- **Class balance:** ~70–75% malignant in every split (mild imbalance). See
  [`data_exploration_report.md`](data_exploration_report.md): **0 corrupt images,
  0 cross-split pixel duplicates, 0 filename-ID overlaps** → no detectable leakage.
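The exact-duplicate check described above can be sketched as follows. This is a minimal illustration, not the code in `explore_data.py`: the helper names `pixel_hash` and `cross_split_duplicates` are hypothetical, and images are assumed to be already decoded into NumPy arrays (for example with Pillow and `np.asarray`).

```python
import hashlib

import numpy as np


def pixel_hash(pixels):
    """SHA-256 over decoded pixel values (hypothetical helper, not from the repo)."""
    arr = np.ascontiguousarray(pixels)
    digest = hashlib.sha256()
    # Include shape and dtype so differently shaped images with equal bytes do not collide.
    digest.update(str(arr.shape).encode())
    digest.update(str(arr.dtype).encode())
    digest.update(arr.tobytes())
    return digest.hexdigest()


def cross_split_duplicates(split_a, split_b):
    """Count images in split_b whose pixels exactly match an image in split_a."""
    hashes_a = {pixel_hash(img) for img in split_a}
    return sum(pixel_hash(img) in hashes_a for img in split_b)


# Synthetic demo: one image from "train" is copied into "test".
rng = np.random.default_rng(0)
img_a = rng.integers(0, 256, size=(224, 224, 3), dtype=np.uint8)
img_b = rng.integers(0, 256, size=(224, 224, 3), dtype=np.uint8)
img_c = rng.integers(0, 256, size=(224, 224, 3), dtype=np.uint8)
n_dupes = cross_split_duplicates([img_a, img_b], [img_c, img_a.copy()])
```

Hashing decoded pixels rather than file bytes catches copies that were re-saved with different metadata, but it only detects exact matches.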

## Label definitions

| Label | Class | Meaning |
|------:|-------|---------|
| 0 | Benign | Non-malignant thyroid nodule |
| 1 | Malignant | Malignant thyroid nodule (positive class) |

## Preprocessing (locked — `configs/preprocess.json`)

Deterministic eval/inference path (no augmentation):
1. Resize to **224×224** (bicubic; the timm A1 data config).
2. `ToTensor()`.
3. Normalize with ImageNet mean `[0.485, 0.456, 0.406]`, std `[0.229, 0.224, 0.225]`.

Grayscale ultrasound images are loaded as 3-channel RGB.

## Model architecture

ResNet-18 with a single-logit binary head (`num_classes=1`); a sigmoid maps the
logit to the probability of malignancy. ~11.2M parameters. Initialized from
ImageNet-1k weights and fully fine-tuned.

## Training procedure (final / winning config)

| Setting | Value |
|---|---|
| Backbone | `timm:resnet18.a1_in1k`, full fine-tune |
| Loss | Focal loss (γ=1.0, α=0.5) |
| Class-imbalance handling | none beyond focal (mild imbalance; preserves calibration) |
| Augmentation | `medical_default`: hflip, mild affine (rot ≤10°, translate 5%, scale 0.9–1.1), mild brightness/contrast (±15%), light Gaussian blur (p=0.2) |
| Optimizer | AdamW, lr 2e-4, weight decay 1e-4 |
| Scheduler | Cosine annealing, 2-epoch warmup |
| Batch size | 32 |
| Epochs | ≤40, early stopping on val AUROC (patience 8) |
| Mixed precision | yes (fp16 autocast) |
| Seed | 42 (strict determinism; `CUBLAS_WORKSPACE_CONFIG=:4096:8`, cuDNN deterministic) |
| Best epoch | 6 |
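For reference, a binary focal loss with the γ and α values from the table can be written as below. This is a generic sketch of the standard formulation, assuming α weights the positive class and 1 − α the negative class; the function name `binary_focal_loss` is illustrative and may not match `train.py`.

```python
import torch
import torch.nn.functional as F


def binary_focal_loss(logits, targets, gamma=1.0, alpha=0.5):
    """Mean focal loss over a batch of single-logit predictions."""
    targets = targets.float()
    # Per-sample cross-entropy, equal to -log(p_t).
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    # p_t is the predicted probability of the true class.
    p_t = p * targets + (1 - p) * (1 - targets)
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    # Down-weight easy examples by (1 - p_t)^gamma.
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()


demo_loss = binary_focal_loss(torch.zeros(2), torch.tensor([1.0, 0.0]))
```

With α = 0.5 both classes receive the same weight, so the only departure from cross-entropy is the (1 − p_t)^γ factor.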

Augmentations were chosen to be medically plausible for B-mode ultrasound
(no vertical flip, no large rotation/shear, no aggressive crop, no color/HSV
jitter), informed by MediAug (arXiv:2504.18983) and thyroid-US best practice.
In the augmentation ablation, `medical_default` (0.9712–0.9756 val AUROC)
outperformed both `flip_only` (0.9637) and `medical_strong` (0.9609).

Environment: torch 2.12.0+cu130, torchvision 0.27.0+cu130, timm 1.0.27,
scikit-learn 1.9.0, numpy 2.4.6, NVIDIA A10G (CUDA 13.0, cuDNN 9.2).
Full versions in `configs/env_info.json`.

## Validation threshold strategy

After selecting the model by validation AUROC and calibrating on validation, the
decision threshold was chosen on the **validation set** as the **highest-specificity
threshold achieving sensitivity ≥ 0.95**, an operating point that prioritizes sensitivity.
Youden's J is reported as a secondary reference. The threshold (**0.7113**) was
**locked before** the test set was touched.

- Validation @ locked threshold: sensitivity **0.952**, specificity **0.896**.
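The selection rule can be sketched in a few lines of NumPy. This is an illustrative implementation under the assumption that an image is called malignant when its calibrated probability is at or above the threshold; `select_threshold` is a hypothetical name, not necessarily the function used in `finalize.py`.

```python
import numpy as np


def select_threshold(y_true, probs, min_sensitivity=0.95):
    """Return (threshold, sensitivity, specificity) for the highest-specificity
    threshold whose sensitivity is at least min_sensitivity."""
    y_true = np.asarray(y_true).astype(bool)
    probs = np.asarray(probs, dtype=float)
    best = None
    # np.unique returns ascending values; raising the threshold can only
    # lower sensitivity and raise specificity, so the last valid one wins.
    for t in np.unique(probs):
        pred = probs >= t
        sens = (pred & y_true).sum() / y_true.sum()
        spec = (~pred & ~y_true).sum() / (~y_true).sum()
        if sens >= min_sensitivity:
            best = (float(t), float(sens), float(spec))
    return best


# Toy example with a 0.75 sensitivity target.
toy_labels = [0, 0, 0, 1, 1, 1, 1]
toy_probs = [0.1, 0.4, 0.6, 0.5, 0.7, 0.8, 0.9]
toy_result = select_threshold(toy_labels, toy_probs, min_sensitivity=0.75)
```

In the toy example the rule picks 0.7: it still catches three of four positives while rejecting every negative, whereas 0.8 would drop sensitivity below the target.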


## Calibration results

Temperature scaling (T=0.5646), fit on validation, reduced **validation ECE from
0.0833 to 0.0308** and the Brier score from 0.0592 to 0.0525. AUROC is unchanged
because dividing logits by a positive temperature is monotonic and preserves the
ranking. On the test set, calibrated ECE is **0.0314**, close to the validation
value. Reliability diagrams:
`results/figures/valid_calibration.png`, `results/figures/test_calibration.png`.
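The sketch below shows how a fitted temperature is applied and one common way to compute ECE for a binary classifier. The binning scheme (10 equal-width bins over the predicted malignancy probability) is an assumption; the card does not state which ECE variant was used, so values from this function need not match the reported ones.

```python
import numpy as np


def apply_temperature(logits, temperature):
    """Calibrated probability: sigmoid of the logit divided by the temperature."""
    z = np.asarray(logits, dtype=float) / temperature
    return 1.0 / (1.0 + np.exp(-z))


def expected_calibration_error(y_true, probs, n_bins=10):
    """Weighted mean gap between observed positive rate and mean predicted
    probability, over equal-width probability bins (assumed variant)."""
    y_true = np.asarray(y_true, dtype=float)
    probs = np.asarray(probs, dtype=float)
    bin_ids = np.minimum((probs * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            ece += mask.mean() * abs(y_true[mask].mean() - probs[mask].mean())
    return float(ece)


demo_logits = np.array([-2.0, 0.0, 1.0, 3.0])
raw_probs = apply_temperature(demo_logits, 1.0)
calibrated_probs = apply_temperature(demo_logits, 0.5646)
```

Because T is below 1, calibrated probabilities move away from 0.5 while their order stays the same, which is why AUROC does not change.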


## Files in this repo

```
final_model.pt                 # locked weights + backbone/preprocess metadata
configs/final_config.yaml      # single-command training config
configs/preprocess.json        # locked preprocessing
configs/calibration.json       # temperature scaling parameter + before/after metrics
configs/threshold.json         # locked decision threshold + selection method
configs/env_info.json          # exact package/hardware/CUDA versions
train.py  evaluate.py  evaluate_external.py  explore_data.py  sweep.py  finalize.py
thyroid_lib.py                 # shared preprocessing/model/calibration/metrics
requirements.txt
data_exploration_report.md     # full data audit (counts, leakage, intensity, grids)
LOG.md                         # chronological experiment log
results/                       # figures, tables (incl. CIs + sweep leaderboard), per-image CSVs; computed without test-time augmentation (TTA)
```

## Reproduce

```bash
pip install -r requirements.txt
python explore_data.py --dataset_id Johnyquest7/TN5000-thyroid-nodule-classification
python train.py    --config configs/final_config.yaml
python evaluate.py --split test --config configs/final_config.yaml
```

Evaluate a future external dataset with the same preprocessing/calibration/threshold:

```bash
python evaluate_external.py \
  --model_repo Johnyquest7/agentic_thyroid_model \
  --data_dir /path/to/external_dataset \
  --output_dir external_results
# external dataset = folder with Benign/ Malignant/ subfolders, OR add --csv labels.csv
```

## Limitations, bias, and leakage concerns

- **Single-source dataset; no external validation.** Performance on data from
  other scanners, institutions, or populations is unknown and likely lower.
- **Cropped-ROI inputs.** The model expects nodule-cropped images like TN5000;
  whole-frame ultrasound will be out of distribution.
- **Sensitivity gap at deployment threshold.** The threshold targets ≥0.95
  sensitivity on validation, but test-set sensitivity was 0.904: the operating
  point did not fully transfer, and about 9.6% of malignant nodules were missed
  at this threshold. Re-selecting the threshold on local data is advisable
  before any use.
- **Label/selection bias.** Labels and cohort composition reflect the source
  dataset's referral and pathology-confirmation process.
- **Leakage checks cover exact matches only** (exact pixel hashing +
  filename-ID overlap, all zero); they cannot rule out near-duplicate or
  same-patient-different-image leakage if such structure exists in the source.

## ⚠️ External validation required before clinical use

This model **requires independent, ideally prospective, multi-site external
validation** before any clinical consideration. It is released for research
reproducibility only.

## Citation

Dataset: TN5000 (Yu et al., *Scientific Data*, 2025).
Augmentation guidance: MediAug, arXiv:2504.18983.

<!-- ml-intern-provenance -->
## Generated by ML Intern

This model repository was generated by [ML Intern](https://github.com/huggingface/ml-intern), an agent for machine learning research and development on the Hugging Face Hub.

- Try ML Intern: https://smolagents-ml-intern.hf.space
- Source code: https://github.com/huggingface/ml-intern

## Usage

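```python
import timm
import torch
from huggingface_hub import hf_hub_download

model_id = 'Johnyquest7/agentic_thyroid_model'
ckpt_path = hf_hub_download(repo_id=model_id, filename='final_model.pt')
# The checkpoint layout is not documented in this card; inspect its keys before loading weights.
ckpt = torch.load(ckpt_path, map_location='cpu')
model = timm.create_model('resnet18.a1_in1k', pretrained=False, num_classes=1)
```

This repository holds a `timm` ResNet-18 checkpoint rather than a `transformers` model, so the `AutoModel` and `AutoTokenizer` classes do not apply. For end-to-end inference with the locked preprocessing, temperature, and threshold, use `evaluate_external.py` as shown under Reproduce.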