	---
	license: cc-by-nc-4.0
	tags:
	- medical-imaging
	- thyroid
	- ultrasound
	- image-classification
	- resnet18
	- calibration
	- ml-intern
	pipeline_tag: image-classification
	---

	# Agentic Thyroid ResNet-18 — Ultrasound Nodule Malignancy Classifier

	> ⚠️ RESEARCH USE ONLY — NOT FOR CLINICAL USE. This model is a research
	> artifact trained on a single retrospective dataset. It has not been
	> externally validated and must not be used for diagnosis, screening, or any
	> clinical decision-making. External, prospective validation is required before
	> any clinical consideration.

	A ResNet-18 binary classifier that predicts the probability that a cropped
	thyroid ultrasound nodule image is malignant (positive class) vs benign.
	Built for a reproducible, publication-oriented experiment with proper
	calibration and a sensitivity-prioritized, validation-locked decision threshold.

	- Backbone: `timm` ResNet-18, A1 ImageNet-1k recipe (`resnet18.a1_in1k`), full fine-tune
	- Selected by: validation AUROC (14-trial sweep; winner val AUROC 0.9756)
	- Calibration: temperature scaling (T = 0.5646), fit on validation
	- Locked threshold: 0.7113 (highest-specificity threshold with validation sensitivity ≥ 0.95)

	## Cite this model

	Thomas J. agentic_thyroid_model [software]. Revision e4f94ea. Hugging Face; 2026. doi:10.57967/hf/9282

	## Intended use

	- Intended: methodological research, benchmarking, and as a baseline for
	thyroid ultrasound malignancy classification studies.
	- Out of scope: any clinical, diagnostic, triage, or screening use; use on
	images acquired/preprocessed differently from the training data without
	re-validation; use on non-thyroid or non-ultrasound images.

	## Dataset

	- Source: [`Johnyquest7/TN5000-thyroid-nodule-classification`](https://huggingface.co/datasets/Johnyquest7/TN5000-thyroid-nodule-classification)
	(derived from TN5000; nodule ROI cropped to 224×224 RGB PNG).
	- Splits (kept strictly separate): Train 3,500 · Valid 500 · Test 1,000.
	- Labels: `0 = Benign`, `1 = Malignant` (positive class = Malignant).
	- Class balance: ~70–75% malignant in every split (mild imbalance). See
	[`data_exploration_report.md`](data_exploration_report.md): **0 corrupt images,
	0 cross-split pixel duplicates, 0 filename-ID overlaps** → no detectable leakage.

	## Label definitions

	| Label | Class | Meaning |
	|------:|-------|---------|
	| 0 | Benign | Non-malignant thyroid nodule |
	| 1 | Malignant | Malignant thyroid nodule (positive class) |

	## Preprocessing (locked — `configs/preprocess.json`)

	Deterministic eval/inference path (no augmentation):
	1. Resize to 224×224 (bicubic; the timm A1 data config).
	2. `ToTensor()`.
	3. Normalize with ImageNet mean `[0.485, 0.456, 0.406]`, std `[0.229, 0.224, 0.225]`.

	Grayscale ultrasound images are loaded as 3-channel RGB.
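
To make steps 2 and 3 concrete, here is a minimal, dependency-free sketch of what happens to a single 8-bit grayscale pixel once it has been replicated to three channels. The function name `normalize_gray_pixel` is hypothetical and this is not the repository's code; it only mirrors the arithmetic of `ToTensor()` (divide by 255) followed by the per-channel normalization.

```python
# Illustration only: per-pixel arithmetic of ToTensor() + Normalize.
# normalize_gray_pixel is a hypothetical helper, not part of thyroid_lib.py.
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def normalize_gray_pixel(value):
    """Replicate an 8-bit gray value to RGB, scale to [0, 1], then normalize."""
    scaled = value / 255.0  # ToTensor() maps uint8 [0, 255] to float [0, 1]
    return tuple((scaled - m) / s for m, s in zip(IMAGENET_MEAN, IMAGENET_STD))


r, g, b = normalize_gray_pixel(128)
print(round(r, 4), round(g, 4), round(b, 4))
```

Because the mean and std differ per channel, identical gray values end up as three different normalized values, which is expected when a grayscale image is fed to an ImageNet-pretrained backbone.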

	## Model architecture

	ResNet-18 with a single-logit binary head (`num_classes=1`); sigmoid → probability
	of malignancy. ~11.2M parameters. Trained from ImageNet-1k weights (full fine-tune).
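
The path from the single logit to a decision can be sketched as follows. The temperature and threshold values are the ones reported in this card; that the threshold is applied to the calibrated probability with a `>=` comparison is an assumption, and `predict` is a hypothetical helper rather than a function from `thyroid_lib.py`.

```python
import math

TEMPERATURE = 0.5646  # reported in this card (configs/calibration.json)
THRESHOLD = 0.7113    # reported in this card (configs/threshold.json)


def predict(logit):
    """Map a raw logit to (calibrated probability of malignancy, label)."""
    # Temperature scaling divides the logit by T before the sigmoid.
    prob = 1.0 / (1.0 + math.exp(-logit / TEMPERATURE))
    # Assumption: the locked threshold is compared against the calibrated
    # probability, and ties count as malignant.
    return prob, int(prob >= THRESHOLD)


for raw_logit in (0.0, 0.4, 0.6, 2.0):
    print(raw_logit, predict(raw_logit))
```

Under these assumptions a raw logit of roughly 0.51 sits at the decision boundary, since the threshold is well above 0.5.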

	## Training procedure (final / winning config)

	| Setting | Value |
	|---|---|
	| Backbone | `timm:resnet18.a1_in1k`, full fine-tune |
	| Loss | Focal loss (γ=1.0, α=0.5) |
	| Class-imbalance handling | none beyond focal (mild imbalance; preserves calibration) |
	| Augmentation | `medical_default`: hflip, mild affine (rot ≤10°, translate 5%, scale 0.9–1.1), mild brightness/contrast (±15%), light Gaussian blur (p=0.2) |
	| Optimizer | AdamW, lr 2e-4, weight decay 1e-4 |
	| Scheduler | Cosine annealing, 2-epoch warmup |
	| Batch size | 32 |
	| Epochs | ≤40, early stopping on val AUROC (patience 8) |
	| Mixed precision | yes (fp16 autocast) |
	| Seed | 42 (strict determinism; `CUBLAS_WORKSPACE_CONFIG=:4096:8`, cuDNN deterministic) |
	| Best epoch | 6 |
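
For readers unfamiliar with the loss in the table, the sketch below shows the common binary focal loss formulation for a single example with the listed γ and α. Whether `train.py` uses exactly this α convention is an assumption; `focal_loss` here is a hypothetical, dependency-free illustration.

```python
import math


def focal_loss(prob, label, gamma=1.0, alpha=0.5):
    """Binary focal loss for one example (common formulation, assumed here).

    prob  -- predicted probability of the positive (malignant) class
    label -- 1 for malignant, 0 for benign
    """
    p_t = prob if label == 1 else 1.0 - prob        # probability of the true class
    alpha_t = alpha if label == 1 else 1.0 - alpha  # class weight
    # (1 - p_t)^gamma down-weights examples the model already gets right.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


print(focal_loss(0.9, 1))  # confident and correct: small loss
print(focal_loss(0.1, 1))  # confident and wrong: large loss
```

With α = 0.5 both classes receive the same weight, so in this formulation the loss does not reweight benign against malignant; only the γ term changes the emphasis, toward hard examples.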

	Augmentations were chosen to be medically plausible for B-mode ultrasound
	(no vertical flip, no large rotation/shear, no aggressive crop, no color/HSV
	jitter), informed by MediAug (arXiv:2504.18983) and thyroid-US best practice.
	In the augmentation ablation, `medical_default` (0.9712–0.9756 val AUROC)
	outperformed both `flip_only` (0.9637) and `medical_strong` (0.9609).

	Environment: torch 2.12.0+cu130, torchvision 0.27.0+cu130, timm 1.0.27,
	scikit-learn 1.9.0, numpy 2.4.6, NVIDIA A10G (CUDA 13.0, cuDNN 9.2).
	Full versions in `configs/env_info.json`.

	## Validation threshold strategy

	After selecting the model by validation AUROC and calibrating on validation, the
	decision threshold was chosen on the validation set as the **highest-specificity
	threshold achieving sensitivity ≥ 0.95** (clinically sensitivity-prioritized).
	Youden's J is reported as a secondary reference. The threshold (0.7113) was
	locked before the test set was touched.

	- Validation @ locked threshold: sensitivity 0.952, specificity 0.896.
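
The selection rule described above can be sketched in plain Python. This is a minimal illustration on toy data, not the code in `finalize.py`; the function names and the choice to scan observed probabilities as candidate thresholds are assumptions.

```python
def sens_spec(probs, labels, threshold):
    """Sensitivity and specificity when predicting malignant for prob >= threshold."""
    tp = sum(p >= threshold and y == 1 for p, y in zip(probs, labels))
    fn = sum(p < threshold and y == 1 for p, y in zip(probs, labels))
    tn = sum(p < threshold and y == 0 for p, y in zip(probs, labels))
    fp = sum(p >= threshold and y == 0 for p, y in zip(probs, labels))
    return tp / (tp + fn), tn / (tn + fp)


def select_threshold(probs, labels, min_sensitivity=0.95):
    """Return (threshold, specificity, sensitivity) with the highest
    specificity among thresholds meeting the sensitivity floor."""
    best = None
    for t in sorted(set(probs)):  # candidate thresholds, ascending
        sens, spec = sens_spec(probs, labels, t)
        if sens >= min_sensitivity and (best is None or spec >= best[1]):
            best = (t, spec, sens)
    return best


# Toy validation set (made-up numbers, for illustration only).
val_probs = [0.05, 0.20, 0.35, 0.60, 0.40, 0.55, 0.70, 0.80, 0.90, 0.95]
val_labels = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
print(select_threshold(val_probs, val_labels))
```

The threshold is chosen once on validation and then reused unchanged on the test set, which is what keeps the test metrics an honest estimate.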


	## Calibration results

	Temperature scaling (T = 0.5646), fit on validation, reduced **validation ECE from
	0.0833 to 0.0308** and the Brier score from 0.0592 to 0.0525. AUROC is unchanged
	because dividing the logit by a positive constant is monotonic and preserves the
	ranking of predictions. On the test set, calibrated ECE is 0.0314, close to the
	validation value. Reliability diagrams:
	`results/figures/valid_calibration.png`, `results/figures/test_calibration.png`.
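
As a reference for how an ECE figure of this kind is typically computed, here is a minimal sketch using equal-width bins over the positive-class probability. The bin count (10) and this particular ECE variant are assumptions; the repository's `thyroid_lib.py` may define it differently.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Equal-width-bin ECE over the predicted probability of the positive class."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(p for p, _ in members) / len(members)
        frac_pos = sum(y for _, y in members) / len(members)
        # Weight each bin's |confidence - observed rate| gap by its share of samples.
        ece += len(members) / len(probs) * abs(mean_conf - frac_pos)
    return ece


# Toy example (made-up numbers): four predictions at 0.9, three of them positive.
print(expected_calibration_error([0.9, 0.9, 0.9, 0.9, 0.1], [1, 1, 1, 0, 0]))
```

An ECE of 0 means that, within every bin, the mean predicted probability equals the observed malignancy rate.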


	## Files in this repo

	```
	final_model.pt # locked weights + backbone/preprocess metadata
	configs/final_config.yaml # single-command training config
	configs/preprocess.json # locked preprocessing
	configs/calibration.json # temperature scaling parameter + before/after metrics
	configs/threshold.json # locked decision threshold + selection method
	configs/env_info.json # exact package/hardware/CUDA versions
	train.py evaluate.py evaluate_external.py explore_data.py sweep.py finalize.py
	thyroid_lib.py # shared preprocessing/model/calibration/metrics
	requirements.txt
	data_exploration_report.md # full data audit (counts, leakage, intensity, grids)
	LOG.md # chronological experiment log
	results/ # figures, tables (incl. CIs + sweep leaderboard), per-image CSVs; no test-time augmentation (TTA)
	```

	## Reproduce

	```bash
	pip install -r requirements.txt
	python explore_data.py --dataset_id Johnyquest7/TN5000-thyroid-nodule-classification
	python train.py --config configs/final_config.yaml
	python evaluate.py --split test --config configs/final_config.yaml
	```

	Evaluate a future external dataset with the same preprocessing/calibration/threshold:

	```bash
	python evaluate_external.py \
	--model_repo Johnyquest7/agentic_thyroid_model \
	--data_dir /path/to/external_dataset \
	--output_dir external_results
	# external dataset = folder with Benign/ Malignant/ subfolders, OR add --csv labels.csv
	```

	## Limitations, bias, and leakage concerns

	- Single-source dataset; no external validation. Performance on data from
	other scanners, institutions, or populations is unknown and likely lower.
	- Cropped-ROI inputs. The model expects nodule-cropped images like TN5000;
	whole-frame ultrasound will be out of distribution.
	- Sensitivity gap at the locked threshold. The threshold targets ≥ 0.95
	sensitivity on validation; on the test set sensitivity was 0.904, so the
	operating point did not fully transfer and about 9.6% of malignant nodules
	were missed at this threshold. Re-select the threshold on local data before
	any use.
	- Label/selection bias. Labels and cohort composition reflect the source
	dataset's referral and pathology-confirmation process.
	- Leakage checks were exhaustive within the available signal (exact pixel
	hashing + filename-ID overlap, all zero) but cannot rule out near-duplicate or
	same-patient-different-image leakage if such structure exists in the source.
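
To illustrate what an exact-pixel-hash leakage check covers, and why it cannot detect near-duplicates, here is a minimal sketch. It is not the code in `explore_data.py`; the function name, the input layout, and the use of SHA-256 are assumptions.

```python
import hashlib


def cross_split_duplicates(splits):
    """Find images whose raw pixel bytes are identical across different splits.

    splits -- dict mapping split name -> dict of {image_id: pixel bytes}
    Returns a list of ((split, id), (split, id)) pairs.
    """
    seen = {}  # digest -> (split, image_id) of the first occurrence
    duplicates = []
    for split_name, images in splits.items():
        for image_id, pixels in images.items():
            digest = hashlib.sha256(pixels).hexdigest()
            if digest in seen and seen[digest][0] != split_name:
                duplicates.append((seen[digest], (split_name, image_id)))
            seen.setdefault(digest, (split_name, image_id))
    return duplicates


# Toy data: "c" in test is byte-identical to "a" in train.
toy = {
    "train": {"a": b"\x00\x01", "b": b"\x02"},
    "test": {"c": b"\x00\x01", "d": b"\x00\x02"},
}
print(cross_split_duplicates(toy))
```

A single changed byte (as in image `d`) yields a different digest, so re-encoded, resized, or slightly shifted copies of the same nodule would pass this check unnoticed.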

	## ⚠️ External validation required before clinical use

	This model **requires independent, ideally prospective, multi-site external
	validation** before any clinical consideration. It is released for research
	reproducibility only.

	## Citation

	Dataset: TN5000 (Yu et al., Scientific Data, 2025).
	Augmentation guidance: MediAug, arXiv:2504.18983.

	<!-- ml-intern-provenance -->
	## Generated by ML Intern

	This model repository was generated by [ML Intern](https://github.com/huggingface/ml-intern), an agent for machine learning research and development on the Hugging Face Hub.

	- Try ML Intern: https://smolagents-ml-intern.hf.space
	- Source code: https://github.com/huggingface/ml-intern

	## Usage

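	```python
	import timm
	import torch
	from huggingface_hub import hf_hub_download

	model_id = 'Johnyquest7/agentic_thyroid_model'
	ckpt = torch.load(hf_hub_download(model_id, 'final_model.pt'), map_location='cpu')
	model = timm.create_model('resnet18.a1_in1k', pretrained=False, num_classes=1)
	# Assumption: the key holding the weights is not documented in this card; inspect
	# ckpt.keys() or use the loader in thyroid_lib.py before calling load_state_dict().
	```

	The weights are a plain PyTorch checkpoint for a `timm` ResNet-18, so the `transformers` `Auto*` classes do not apply. At inference, use the locked preprocessing, temperature, and threshold from `configs/`.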