| --- |
| license: cc-by-nc-4.0 |
| tags: |
| - medical-imaging |
| - thyroid |
| - ultrasound |
| - image-classification |
| - resnet18 |
| - calibration |
| - ml-intern |
| pipeline_tag: image-classification |
| --- |
| |
| # Agentic Thyroid ResNet-18 — Ultrasound Nodule Malignancy Classifier |
|
|
| > ⚠️ **RESEARCH USE ONLY — NOT FOR CLINICAL USE.** This model is a research |
| > artifact trained on a single retrospective dataset. It has **not** been |
| > externally validated and **must not** be used for diagnosis, screening, or any |
| > clinical decision-making. External, prospective validation is required before |
| > any clinical consideration. |
|
|
| A ResNet-18 binary classifier that predicts the probability that a cropped |
| thyroid ultrasound nodule image is **malignant** (positive class) vs **benign**. |
| Built for a reproducible, publication-oriented experiment with temperature-scaled |
| calibration and a sensitivity-prioritized, validation-locked decision threshold. |
|
|
| - **Backbone:** `timm` ResNet-18, A1 ImageNet-1k recipe (`resnet18.a1_in1k`), full fine-tune |
| - **Selected by:** validation AUROC (14-trial sweep; winner val AUROC **0.9756**) |
| - **Calibration:** temperature scaling (T = 0.5646), fit on validation |
| - **Locked threshold:** **0.7113** (highest-specificity threshold with validation sensitivity ≥ 0.95) |
|
|
| ## Cite this model |
|
|
| Thomas J. agentic_thyroid_model [software]. Revision e4f94ea. Hugging Face; 2026. doi:10.57967/hf/9282 |
|
|
| ## Intended use |
|
|
| - **Intended:** methodological research, benchmarking, and as a baseline for |
| thyroid ultrasound malignancy classification studies. |
| - **Out of scope:** any clinical, diagnostic, triage, or screening use; use on |
| images acquired/preprocessed differently from the training data without |
| re-validation; use on non-thyroid or non-ultrasound images. |
|
|
| ## Dataset |
|
|
| - **Source:** [`Johnyquest7/TN5000-thyroid-nodule-classification`](https://huggingface.co/datasets/Johnyquest7/TN5000-thyroid-nodule-classification) |
| (derived from TN5000; nodule ROI cropped to 224×224 RGB PNG). |
| - **Splits (kept strictly separate):** Train 3,500 · Valid 500 · Test 1,000. |
| - **Labels:** `0 = Benign`, `1 = Malignant` (positive class = Malignant). |
| - **Class balance:** ~70–75% malignant in every split (mild imbalance). |
| - **Data audit:** [`data_exploration_report.md`](data_exploration_report.md) reports **0 corrupt images, |
| 0 cross-split pixel duplicates, 0 filename-ID overlaps**, i.e. no detectable leakage. |
|
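The two leakage checks reported in the audit (exact pixel duplicates and filename-ID overlap across splits) can be sketched as follows. This is a minimal illustration, not the code in `explore_data.py`: the function names and the `{filename: pixel_bytes}` input layout are hypothetical, and in practice the bytes would come from each image's decoded pixel array.

```python
import hashlib


def pixel_hashes(images):
    """Map SHA-256 digest -> filename for a {filename: raw_pixel_bytes} dict."""
    return {hashlib.sha256(data).hexdigest(): name for name, data in images.items()}


def cross_split_overlap(split_a, split_b):
    """Return (duplicate pixel pairs, shared filename IDs) between two splits."""
    hashes_a, hashes_b = pixel_hashes(split_a), pixel_hashes(split_b)
    duplicates = sorted(
        (hashes_a[h], hashes_b[h]) for h in hashes_a.keys() & hashes_b.keys()
    )
    shared_ids = sorted(split_a.keys() & split_b.keys())
    return duplicates, shared_ids


# Toy data standing in for decoded images (hypothetical filenames and bytes).
train = {"a.png": b"\x00\x01\x02", "b.png": b"\x10\x11\x12"}
test = {"c.png": b"\x00\x01\x02", "d.png": b"\x20\x21\x22"}
duplicates, shared_ids = cross_split_overlap(train, test)
```

Exact hashing only flags byte-identical pixel content, so a resized or re-encoded copy of the same image would pass this check.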
|
| ## Label definitions |
|
|
| | Label | Class | Meaning | |
| |------:|-------|---------| |
| | 0 | Benign | Non-malignant thyroid nodule | |
| | 1 | Malignant | Malignant thyroid nodule (positive class) | |
|
|
| ## Preprocessing (locked — `configs/preprocess.json`) |
|
|
| Deterministic eval/inference path (no augmentation): |
| 1. Resize to **224×224** (bicubic; the timm A1 data config). |
| 2. `ToTensor()`. |
| 3. Normalize with ImageNet mean `[0.485, 0.456, 0.406]`, std `[0.229, 0.224, 0.225]`. |
|
|
| Grayscale ultrasound images are loaded as 3-channel RGB. |
|
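To make steps 2 and 3 concrete, the arithmetic applied to a single pixel is sketched below in plain Python. It assumes a grayscale value is replicated across the three channels when loaded as RGB (as PIL's `convert("RGB")` does); `normalize_gray_pixel` is a hypothetical helper for illustration, and a real pipeline would use the equivalent tensor operations.

```python
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def normalize_gray_pixel(value_8bit):
    """Map one 8-bit grayscale value to its three normalized channel values."""
    scaled = value_8bit / 255.0  # ToTensor() rescales uint8 to [0, 1]
    # Assumed: the same gray value is used for R, G, and B.
    return tuple((scaled - m) / s for m, s in zip(IMAGENET_MEAN, IMAGENET_STD))


black = normalize_gray_pixel(0)
white = normalize_gray_pixel(255)
```

Because the per-channel means and standard deviations differ, a single gray level maps to three slightly different channel values after normalization.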
|
| ## Model architecture |
|
|
| ResNet-18 with a single-logit binary head (`num_classes=1`); sigmoid → probability |
| of malignancy. ~11.2M parameters. Initialized from ImageNet-1k pretrained weights and fully fine-tuned. |
|
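Putting the single-logit head, the temperature, and the locked threshold together, one prediction could be computed as sketched below. The sketch assumes that temperature scaling divides the raw logit by T before the sigmoid (the standard formulation) and that the threshold is applied to the calibrated probability with a `>=` comparison; `classify_logit` is a hypothetical helper, not a function from `thyroid_lib.py`.

```python
import math

TEMPERATURE = 0.5646  # value reported in this model card (fit on validation)
THRESHOLD = 0.7113    # locked decision threshold reported in this model card


def calibrated_probability(logit, temperature=TEMPERATURE):
    """Numerically stable sigmoid of the temperature-scaled logit (logit / T)."""
    z = logit / temperature
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def classify_logit(logit, threshold=THRESHOLD):
    """Return (probability of malignancy, predicted label) for one raw logit."""
    prob = calibrated_probability(logit)
    return prob, int(prob >= threshold)


prob, label = classify_logit(1.0)
```

With T below 1, the scaled logits move away from zero, so calibrated probabilities are more extreme than the uncalibrated sigmoid outputs.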
|
| ## Training procedure (final / winning config) |
|
|
| | Setting | Value | |
| |---|---| |
| | Backbone | `timm:resnet18.a1_in1k`, full fine-tune | |
| | Loss | Focal loss (γ=1.0, α=0.5) | |
| | Class-imbalance handling | none beyond focal (mild imbalance; preserves calibration) | |
| | Augmentation | `medical_default`: hflip, mild affine (rot ≤10°, translate 5%, scale 0.9–1.1), mild brightness/contrast (±15%), light Gaussian blur (p=0.2) | |
| | Optimizer | AdamW, lr 2e-4, weight decay 1e-4 | |
| | Scheduler | Cosine annealing, 2-epoch warmup | |
| | Batch size | 32 | |
| | Epochs | ≤40, early stopping on val AUROC (patience 8) | |
| | Mixed precision | yes (fp16 autocast) | |
| | Seed | 42 (strict determinism; `CUBLAS_WORKSPACE_CONFIG=:4096:8`, cuDNN deterministic) | |
| | Best epoch | 6 | |
|
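For reference, the binary focal loss in the table has the form FL = -α_t (1 - p_t)^γ log(p_t), where p_t is the predicted probability of the true class. The sketch below is a plain-Python illustration of that formula with the card's γ and α as defaults; it assumes the common convention in which α weights the positive class and 1 - α the negative class, and it is not the implementation in `train.py`.

```python
import math


def binary_focal_loss(logit, target, gamma=1.0, alpha=0.5):
    """Focal loss for one example, given a raw logit and a 0/1 target."""
    # p_t = sigmoid(z), where z is the logit signed toward the true class.
    z = logit if target == 1 else -logit
    # log(sigmoid(z)) = -softplus(-z), computed without overflow.
    log_pt = -(max(-z, 0.0) + math.log1p(math.exp(-abs(z))))
    pt = math.exp(log_pt)
    alpha_t = alpha if target == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - pt) ** gamma * log_pt


easy = binary_focal_loss(3.0, 1)   # confident and correct
hard = binary_focal_loss(-3.0, 1)  # confident and wrong
```

With γ = 0 the expression reduces to α-weighted binary cross-entropy; raising γ shrinks the contribution of examples the model already classifies confidently.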
|
| Augmentations were chosen to be medically plausible for B-mode ultrasound |
| (no vertical flip, no large rotation/shear, no aggressive crop, no color/HSV |
| jitter), informed by MediAug (arXiv:2504.18983) and thyroid-US best practice. |
| The augmentation ablation confirmed `medical_default` (0.9712–0.9756 val AUROC) |
| outperforms both `flip_only` (0.9637) and `medical_strong` (0.9609). |
|
|
| Environment: torch 2.12.0+cu130, torchvision 0.27.0+cu130, timm 1.0.27, |
| scikit-learn 1.9.0, numpy 2.4.6, NVIDIA A10G (CUDA 13.0, cuDNN 9.2). |
| Full versions in `configs/env_info.json`. |
|
|
| ## Validation threshold strategy |
|
|
| After selecting the model by validation AUROC and calibrating on validation, the |
| decision threshold was chosen on the **validation set** as the **highest-specificity |
| threshold achieving sensitivity ≥ 0.95** (clinically sensitivity-prioritized). |
| Youden's J is reported as a secondary reference. The threshold (**0.7113**) was |
| **locked before** the test set was touched. |
|
|
| - Validation @ locked threshold: sensitivity **0.952**, specificity **0.896**. |
|
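The selection rule above can be sketched in a few lines. This is an illustrative implementation under stated assumptions: candidate thresholds are the observed validation scores, a case is called positive when its score is `>=` the threshold, and both classes are present. The function name is hypothetical, and `finalize.py` may differ in details such as tie handling.

```python
def select_threshold(scores, labels, min_sensitivity=0.95):
    """Highest-specificity threshold whose sensitivity is >= min_sensitivity.

    Returns (specificity, threshold, sensitivity), or None if no candidate
    threshold reaches the sensitivity target.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best = None
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= t)
        tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < t)
        sens, spec = tp / n_pos, tn / n_neg
        # Ascending scan with ">=" resolves specificity ties toward the higher threshold.
        if sens >= min_sensitivity and (best is None or spec >= best[0]):
            best = (spec, t, sens)
    return best


# Toy validation scores (calibrated probabilities) and labels, 1 = malignant.
scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 1, 0, 1, 1, 1]
strict = select_threshold(scores, labels)
relaxed = select_threshold(scores, labels, min_sensitivity=0.75)
```

In the toy data, keeping every malignant case forces the threshold down to 0.4 at the cost of one false positive, while relaxing the sensitivity target lets the threshold rise to 0.7 with no false positives.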
|
|
|
| ## Calibration results |
|
|
| Temperature scaling (T=0.5646), fit on validation, reduced **validation ECE from 0.0833 |
| → 0.0308** and the Brier score from 0.0592 → 0.0525. AUROC is unchanged because |
| temperature scaling is a monotonic transform of the logits and preserves their ranking. On |
| the test set, calibrated ECE is **0.0314** (well calibrated, in line with validation). Reliability diagrams: |
| `results/figures/valid_calibration.png`, `results/figures/test_calibration.png`. |
|
|
|
|
| ## Files in this repo |
|
|
| ``` |
| final_model.pt # locked weights + backbone/preprocess metadata |
| configs/final_config.yaml # single-command training config |
| configs/preprocess.json # locked preprocessing |
| configs/calibration.json # temperature scaling parameter + before/after metrics |
| configs/threshold.json # locked decision threshold + selection method |
| configs/env_info.json # exact package/hardware/CUDA versions |
| train.py evaluate.py evaluate_external.py explore_data.py sweep.py finalize.py |
| thyroid_lib.py # shared preprocessing/model/calibration/metrics |
| requirements.txt |
| data_exploration_report.md # full data audit (counts, leakage, intensity, grids) |
| LOG.md # chronological experiment log |
| results/ # figures, tables (incl. CIs + sweep leaderboard), per-image CSVs; computed without test-time augmentation (TTA) |
| ``` |
|
|
| ## Reproduce |
|
|
| ```bash |
| pip install -r requirements.txt |
| python explore_data.py --dataset_id Johnyquest7/TN5000-thyroid-nodule-classification |
| python train.py --config configs/final_config.yaml |
| python evaluate.py --split test --config configs/final_config.yaml |
| ``` |
|
|
| Evaluate a future external dataset with the same preprocessing/calibration/threshold: |
|
|
| ```bash |
| python evaluate_external.py \ |
| --model_repo Johnyquest7/agentic_thyroid_model \ |
| --data_dir /path/to/external_dataset \ |
| --output_dir external_results |
| # external dataset = folder with Benign/ Malignant/ subfolders, OR add --csv labels.csv |
| ``` |
|
|
| ## Limitations, bias, and leakage concerns |
|
|
| - **Single-source dataset; no external validation.** Performance on data from |
| other scanners, institutions, or populations is unknown and likely lower. |
| - **Cropped-ROI inputs.** The model expects nodule-cropped images like TN5000; |
| whole-frame ultrasound will be out of distribution. |
| - **Sensitivity gap at deployment threshold.** The threshold targets ≥0.95 |
| sensitivity on validation; on the test set sensitivity was 0.904 — i.e. the |
| operating point does not perfectly transfer, and ~10% of malignant nodules |
| were missed at this threshold. Threshold re-calibration on local data is |
| advisable before any use. |
| - **Label/selection bias.** Labels and cohort composition reflect the source |
| dataset's referral and pathology-confirmation process. |
| - **Leakage checks were exhaustive within the available signal** (exact pixel |
| hashing + filename-ID overlap, all zero) but cannot rule out near-duplicate or |
| same-patient-different-image leakage if such structure exists in the source. |
|
|
| ## ⚠️ External validation required before clinical use |
|
|
| This model **requires independent, ideally prospective, multi-site external |
| validation** before any clinical consideration. It is released for research |
| reproducibility only. |
|
|
| ## Citation |
|
|
| Dataset: TN5000 (Yu et al., *Scientific Data*, 2025). |
| Augmentation guidance: MediAug, arXiv:2504.18983. |
|
|
| <!-- ml-intern-provenance --> |
| ## Generated by ML Intern |
|
|
| This model repository was generated by [ML Intern](https://github.com/huggingface/ml-intern), an agent for machine learning research and development on the Hugging Face Hub. |
|
|
| - Try ML Intern: https://smolagents-ml-intern.hf.space |
| - Source code: https://github.com/huggingface/ml-intern |
|
|
| ## Usage |
|
|
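| ```python |
| import timm |
| import torch |
| from huggingface_hub import hf_hub_download |
| |
| # timm ResNet-18 classifier (not a transformers model); checkpoint key layout is an assumption |
| path = hf_hub_download('Johnyquest7/agentic_thyroid_model', 'final_model.pt') |
| ckpt = torch.load(path, map_location='cpu') |
| model = timm.create_model('resnet18.a1_in1k', pretrained=False, num_classes=1) |
| model.load_state_dict(ckpt.get('state_dict', ckpt))  # see thyroid_lib.py for the model code |
| model.eval() |
| ``` |
|
|
| The `transformers` Auto classes do not apply to this repository. Apply the locked preprocessing, temperature, and threshold before interpreting outputs; `evaluate_external.py` applies all three to a labeled dataset. |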
|
|