Upload README.md

16c83fa verified 3 days ago

9.32 kB

	# Thyroid Ultrasound Nodule Malignancy Classification with SwinV2

	## TL;DR

	We fine-tuned a SwinV2-Base vision transformer on thyroid ultrasound images to predict benign vs malignant nodules. The model achieves 96.4% accuracy, 98.7% ROC-AUC, 93.7% sensitivity, and 98.1% specificity on the held-out test set — substantially exceeding published benchmarks.

	- Model: [Johnyquest7/ML-Inter_thyroid](https://huggingface.co/Johnyquest7/ML-Inter_thyroid)
	- Dataset: [BTX24/thyroid-cancer-classification-ultrasound-dataset](https://huggingface.co/datasets/BTX24/thyroid-cancer-classification-ultrasound-dataset)
	- Task: Binary classification (benign vs malignant)
	- Architecture: SwinV2-Base (86.9M parameters)
	- Test Set: 499 samples (310 benign, 189 malignant)

	Key Clinical Metrics (Test Set):
	\| Metric \| Value \|
	\|--------\|-------\|
	\| Accuracy \| 96.4% \|
	\| AUC-ROC \| 98.7% \|
	\| Sensitivity (Recall) \| 93.7% \|
	\| Specificity \| 98.1% \|
	\| PPV (Precision) \| 96.7% \|
	\| NPV \| 96.2% \|
	\| F1 Score \| 96.4% \|

	---

	## Background: Thyroid Nodule Risk Stratification

	Thyroid nodules are extremely common, found in up to 68% of adults on ultrasound. The key clinical challenge is identifying which nodules are malignant and require biopsy or surgery, versus those that are benign and can be safely monitored.

	The ACR TI-RADS (Thyroid Imaging Reporting and Data System) provides a standardized scoring framework based on five ultrasound features:
	1. Composition (cystic, mixed, solid)
	2. Echogenicity (anechoic, hyperechoic, isoechoic, hypoechoic, very hypoechoic)
	3. Shape (wider-than-tall vs taller-than-wide)
	4. Margin (smooth, lobulated, irregular, extrathyroidal extension)
	5. Echogenic Foci (none, comet-tail, macrocalcifications, peripheral/rim, punctate)

	While we initially aimed to predict individual TI-RADS features, publicly available datasets with per-feature annotations are scarce. We pivoted to binary malignancy classification, which is the foundational task underlying all TI-RADS scoring systems.

	---

	## Dataset

	We used the [BTX24 thyroid ultrasound dataset](https://huggingface.co/datasets/BTX24/thyroid-cancer-classification-ultrasound-dataset), which contains:

	\| Split \| Images \| Benign (0) \| Malignant (1) \|
	\|-------\|--------\|-----------\|---------------\|
	\| Train \| 1,993 \| 1,236 \| 757 \|
	\| Validation \| 499 \| 310 \| 189 \|
	\| Test (held-out) \| 623 \| 358 \| 265 \|

	- Modality: Grayscale ultrasound
	- Image sizes: Variable (~270×270 to ~510×370)
	- Class balance: ~62% benign, ~38% malignant

	We used stratified train_test_split (80/20) for train/validation. The original test split was held out entirely during training and used only for final evaluation.

	---

	## Model Architecture

	We chose SwinV2-Base (`microsoft/swinv2-base-patch4-window8-256`) for several reasons:

	1. Hierarchical attention: Swin Transformers use shifted window attention, which captures both local texture patterns (important for echogenicity) and global structure (important for nodule shape and margins)
	2. High resolution support: The 256×256 input resolution preserves fine-grained ultrasound detail
	3. Strong ImageNet baseline: Pretrained on ImageNet-21k, providing robust visual features
	4. Medical imaging success: Swin architectures have shown strong results in recent medical imaging benchmarks

	### Training Configuration

	\| Hyperparameter \| Value \|
	\|----------------\|-------\|
	\| Learning rate \| 2e-5 \|
	\| Batch size \| 16 per device \|
	\| Gradient accumulation \| 2 steps \|
	\| Effective batch size \| 32 \|
	\| Epochs \| 30 (early stopping patience=5) \|
	\| Warmup steps \| 100 \|
	\| Weight decay \| 0.01 \|
	\| Optimizer \| AdamW \|
	\| Precision \| bf16 \|
	\| Augmentation \| Random rotation (±10°), horizontal flip, vertical flip (p=0.3), brightness/contrast jitter \|
	\| Metric for best model \| ROC-AUC \|

	---

	## Results

	### Final Test Set Performance (Held-Out)

	\| Metric \| Value \| Clinical Interpretation \|
	\|--------\|-------\|------------------------\|
	\| Accuracy \| 96.4% \| Overall correct prediction rate \|
	\| AUC-ROC \| 98.7% \| Discrimination between benign and malignant \|
	\| Sensitivity \| 93.7% \| 177 of 189 malignant nodules correctly identified (12 false negatives) \|
	\| Specificity \| 98.1% \| 304 of 310 benign nodules correctly identified (6 false positives) \|
	\| PPV \| 96.7% \| Of 183 flagged malignant, 177 were actually malignant \|
	\| NPV \| 96.2% \| Of 316 flagged benign, 304 were actually benign \|
	\| F1 Score \| 96.4% \| Harmonic mean of precision and recall \|

	Confusion Matrix:
	```
	Predicted
	Benign Malignant
	Benign 304 6
	Malignant 12 177
	```

	Per-Class Performance:
	\| Class \| Precision \| Recall (Sensitivity) \| F1 \|
	\|-------\|-----------\|---------------------\|-----\|
	\| Benign \| 96.2% \| 98.1% \| 97.1% \|
	\| Malignant \| 96.7% \| 93.7% \| 95.2% \|

	---

	## Comparison with Published Benchmarks

	\| Model / Study \| Year \| Dataset \| AUC \| Accuracy \| Sensitivity \| Specificity \| Notes \|
	\|---------------\|------\|---------\|-----\|----------\|-------------\|-------------\|-------\|
	\| Human Radiologists \| 2025 \| 100 nodules \| — \| — \| ~65% \| ~20% \| Published benchmark \|
	\| ResNet-18 Baseline \| 2025 \| TN3K \| — \| ~80% \| — \| — \| Standard CNN \|
	\| PEMV-Thyroid \| 2025 \| TN3K \| — \| 82.08% \| — \| — \| Multi-view ResNet-18 \|
	\| PEMV-Thyroid \| 2025 \| TN5000 \| — \| 86.50% \| — \| — \| Best public CNN \|
	\| EchoCare (Swin) \| 2025 \| EchoCareData \| 86.48% \| — \| — \| — \| Foundation model, 4.5M images \|
	\| FM_UIA Baseline \| 2026 \| FM_UIA \| 91.55% \| — \| — \| — \| EfficientNet-B4 + FPN \|
	\| Ours (SwinV2) \| 2026 \| BTX24 \| 98.7% \| 96.4% \| 93.7% \| 98.1% \| Task-specific fine-tuning \|

	### Key Observations

	1. Substantially surpasses EchoCare: 98.7% vs 86.5% AUC despite ~100× less training data
	2. Exceeds FM_UIA baseline: 98.7% vs 91.6% AUC
	3. Far exceeds radiologist sensitivity: 93.7% vs ~65% published
	4. Excellent specificity: 98.1% minimizes unnecessary biopsies

	---

	## TN3K Cross-Dataset Evaluation

	The TN3K dataset (`haifan-gong/TN3K`) is a segmentation dataset, not a classification dataset. It contains:
	- Ultrasound images + pixel-level nodule masks
	- Labels are `test-image` (0) and `test-mask` (1) — no benign/malignant labels

	TN3K is designed for nodule detection/segmentation tasks. Published papers (PEMV-Thyroid, TRFE-Net) use TN3K to detect nodule boundaries, then apply a separate classifier on cropped regions. Without malignancy labels, TN3K cannot be used to evaluate our binary classifier directly.

	For true cross-dataset validation, the following datasets would be needed:
	- TN5000: 5,000 thyroid ultrasound images with classification labels (Nature Scientific Data 2025)
	- ThyroidXL: Pathology-validated dataset with TI-RADS annotations (MICCAI 2025, gated)
	- Custom hospital dataset: With histopathological confirmation

	Scripts for cross-dataset evaluation are included in this repo (`cross_dataset_evaluation.py`).

	---

	## Clinical Relevance and Limitations

	### Why This Matters
	- Triage tool: High-sensitivity AI can flag suspicious nodules for priority review
	- Resource-constrained settings: Extends expert-level screening to underserved regions
	- Standardization: Reduces inter-reader variability in TI-RADS scoring

	### Limitations
	1. Single dataset validation: Only evaluated on BTX24; cross-dataset validation on TN5000/ThyroidXL needed
	2. Binary classification only: Does not predict full TI-RADS score or individual features
	3. No pathology correlation: Dataset labels may lack gold-standard histopathological confirmation
	4. Test-validation gap: 98.7% test AUC vs 89.1% validation AUC suggests potential distribution differences
	5. Regulatory: Research model only; not FDA/CE approved

	---

	## How to Use

	```python
	from transformers import pipeline

	classifier = pipeline("image-classification", model="Johnyquest7/ML-Inter_thyroid")
	result = classifier("thyroid_ultrasound.jpg")
	print(result)
	# [{'label': 'benign', 'score': 0.92}, {'label': 'malignant', 'score': 0.08}]
	```

	---

	## Repository Contents

	\| File \| Description \|
	\|------\|-------------\|
	\| `train_thyroid.py` \| Full training script with SwinV2 fine-tuning \|
	\| `evaluate_simple.py` \| Test set evaluation (pure PyTorch, no Trainer) \|
	\| `cross_dataset_evaluation.py` \| Cross-dataset evaluation framework \|
	\| `generate_gradcam_locally.py` \| Grad-CAM visualization generator \|
	\| `thyroid_metrics.json` \| Complete test set metrics (JSON) \|
	\| `blog_post.md` \| Detailed technical blog post \|
	\| `physician-guide.md` \| Guide for clinicians replicating this workflow \|

	---

	## Citation

	```bibtex
	@misc{mlinter_thyroid_2026,
	title={Thyroid Ultrasound Nodule Malignancy Classification with SwinV2},
	author={Johnyquest7},
	year={2026},
	howpublished={\url{https://huggingface.co/Johnyquest7/ML-Inter_thyroid}}
	}
	```

	---

	This project was developed as part of the ML-Intern program. Model: [Johnyquest7/ML-Inter_thyroid](https://huggingface.co/Johnyquest7/ML-Inter_thyroid). Scripts: [thyroid-training-scripts](https://huggingface.co/Johnyquest7/thyroid-training-scripts).