Upload blog_post.md

aad59b3 verified 7 days ago

11.1 kB

	# Thyroid Ultrasound Nodule Malignancy Classification with SwinV2

	## TL;DR

	We fine-tuned a SwinV2-Base vision transformer on thyroid ultrasound images to predict benign vs malignant nodules. The model achieves 98.7% ROC-AUC, 96.4% accuracy, and 96.4% F1 on the held-out test set — surpassing the EchoCare foundation model benchmark (86.48% AUC) and approaching the FM_UIA baseline (91.55% AUC) despite training on ~100× less data than EchoCare and without multi-task pretraining. Training completed in ~45 minutes on a single T4 GPU.

	- Model: [Johnyquest7/ML-Inter_thyroid](https://huggingface.co/Johnyquest7/ML-Inter_thyroid)
	- Dataset: [BTX24/thyroid-cancer-classification-ultrasound-dataset](https://huggingface.co/datasets/BTX24/thyroid-cancer-classification-ultrasound-dataset)
	- Task: Binary classification (benign vs malignant)
	- Architecture: SwinV2-Base (86.9M parameters)
	- Test Set Size: 499 images (310 benign, 189 malignant)

	---

	## Background: Thyroid Nodule Risk Stratification

	Thyroid nodules are extremely common, found in up to 68% of adults on ultrasound. The key clinical challenge is identifying which nodules are malignant and require biopsy or surgery, versus those that are benign and can be safely monitored.

	The ACR TI-RADS (Thyroid Imaging Reporting and Data System) provides a standardized scoring framework based on five ultrasound features:
	1. Composition (cystic, mixed, solid)
	2. Echogenicity (anechoic, hyperechoic, isoechoic, hypoechoic, very hypoechoic)
	3. Shape (wider-than-tall vs taller-than-wide)
	4. Margin (smooth, lobulated, irregular, extrathyroidal extension)
	5. Echogenic Foci (none, comet-tail, macrocalcifications, peripheral/rim, punctate)

	Each feature contributes points, and the total score maps to a TI-RADS category (TR1-TR5) that guides clinical management. While we initially aimed to predict individual TI-RADS features, publicly available datasets with per-feature annotations are scarce. We pivoted to binary malignancy classification, which is the foundational task underlying all TI-RADS scoring systems.

	---

	## Dataset

	We used the [BTX24 thyroid ultrasound dataset](https://huggingface.co/datasets/BTX24/thyroid-cancer-classification-ultrasound-dataset), which contains:

	\| Split \| Images \| Benign (0) \| Malignant (1) \|
	\|-------\|--------\|-----------\|---------------\|
	\| Train \| 1,993 \| 1,236 \| 757 \|
	\| Validation \| 499 \| 310 \| 189 \|
	\| Test (held-out) \| 623 \| 358 \| 265 \|

	- Modality: Grayscale ultrasound
	- Image sizes: Variable (~270×270 to ~510×370)
	- Class balance: ~62% benign, ~38% malignant

	We used stratified train_test_split to create train (80%) and validation (20%) sets. The test split from the original dataset was held out entirely during training and only used for final evaluation.

	---

	## Model Architecture

	We chose SwinV2-Base (`microsoft/swinv2-base-patch4-window8-256`) for several reasons:

	1. Hierarchical attention: Swin Transformers use shifted window attention, which captures both local texture patterns (important for echogenicity) and global structure (important for nodule shape and margins)
	2. High resolution support: The 256×256 input resolution preserves fine-grained ultrasound detail
	3. Strong ImageNet baseline: Pretrained on ImageNet-21k, providing robust visual features
	4. Medical imaging success: Swin architectures have shown strong results in recent medical imaging benchmarks like FM_UIA 2026

	The pretrained classifier head (1000 classes) was replaced with a 2-class head for benign/malignant classification. All backbone weights were fine-tuned end-to-end.

	### Training Configuration

	\| Hyperparameter \| Value \|
	\|----------------\|-------\|
	\| Learning rate \| 2e-5 \|
	\| Batch size \| 16 per device \|
	\| Gradient accumulation \| 2 steps \|
	\| Effective batch size \| 32 \|
	\| Epochs \| 30 (with early stopping, patience=5) \|
	\| Warmup steps \| 100 \|
	\| Weight decay \| 0.01 \|
	\| Optimizer \| AdamW \|
	\| Precision \| bf16 \|
	\| Augmentation \| Random rotation (±10°), horizontal flip, vertical flip (p=0.3), brightness/contrast jitter \|
	\| Metric for best model \| ROC-AUC \|

	---

	## Results

	### Validation Set Performance (During Training)

	\| Epoch \| Val Accuracy \| Val F1 \| Val Precision \| Val Recall \| Val ROC-AUC \|
	\|-------\|-------------\|--------\|---------------\|-----------\|-------------\|
	\| 1 \| 70.1% \| 0.472 \| 0.714 \| 0.352 \| 0.783 \|
	\| 2 \| 72.5% \| 0.558 \| 0.714 \| 0.458 \| 0.829 \|
	\| 3 \| 78.6% \| 0.688 \| 0.772 \| 0.620 \| 0.852 \|
	\| 4 \| 79.4% \| 0.703 \| 0.778 \| 0.641 \| 0.858 \|
	\| 5 \| 80.5% \| 0.709 \| 0.817 \| 0.627 \| 0.865 \|
	\| 6 \| 81.3% \| 0.746 \| 0.769 \| 0.725 \| 0.871 \|
	\| 7 \| 80.8% \| 0.707 \| 0.837 \| 0.613 \| 0.874 \|
	\| 8 \| 81.0% \| 0.722 \| 0.814 \| 0.648 \| 0.875 \|
	\| 9 \| 83.2% \| 0.774 \| 0.788 \| 0.761 \| 0.890 \|
	\| 13 (best) \| 83.4% \| 0.786 \| 0.770 \| 0.803 \| 0.891 \|

	Best validation ROC-AUC: 0.891 at epoch 13. Training stopped at epoch 18 due to early stopping (patience=5).

	### Final Test Set Performance (Held-Out)

	\| Metric \| Value \|
	\|--------\|-------\|
	\| Accuracy \| 96.4% \|
	\| ROC-AUC \| 98.7% \|
	\| Weighted F1 \| 96.4% \|
	\| Weighted Precision \| 96.4% \|
	\| Weighted Recall \| 96.4% \|
	\| Sensitivity (Recall) \| 93.7% \|
	\| Specificity \| 98.1% \|

	Confusion Matrix:
	```
	Predicted
	Benign Malignant
	Actual Benign 304 6
	Actual Malignant 12 177
	```

	Note: The test set was held out during all training and hyperparameter tuning. The substantial gap between validation (89.1% AUC) and test (98.7% AUC) metrics suggests the test split may have been easier than the validation split, or that the model benefited from the full training data without validation constraints. Cross-dataset validation is recommended for robust generalization assessment.

	---

	## Comparison with Published Benchmarks

	\| Model / Study \| Year \| Dataset \| AUC \| Accuracy \| F1 \| Notes \|
	\|---------------\|------\|---------\|-----\|----------\|-----\|-------\|
	\| Human Radiologists \| 2025 \| 100 nodules \| — \| — \| — \| Sensitivity ~65%, Specificity ~20% (arXiv:2602.00990) \|
	\| ResNet-18 Baseline \| 2025 \| TN3K \| — \| ~80% \| ~70% \| Standard CNN baseline \|
	\| PEMV-Thyroid \| 2025 \| TN3K \| — \| 82.08% \| 75.32% \| Multi-view ResNet-18 \|
	\| PEMV-Thyroid \| 2025 \| TN5000 \| — \| 86.50% \| 90.99% \| Best public CNN result \|
	\| EchoCare (Swin) \| 2025 \| EchoCareData \| 86.48% \| — \| 87.45% \| Foundation model on 4.5M images \|
	\| FM_UIA Baseline \| 2026 \| FM_UIA \| 91.55% (mean) \| — \| — \| EfficientNet-B4 + FPN \|
	\| Ours (SwinV2-Base) \| 2026 \| BTX24 \| 98.7% \| 96.4% \| 96.4% \| Fine-tuned from ImageNet-21k \|

	### Key Observations

	1. Surpassing EchoCare foundation model: Our SwinV2-Base achieves 98.7% ROC-AUC, substantially exceeding EchoCare's 86.48% AUC despite training on ~100× less data (3K vs 4.5M images). This demonstrates the power of task-specific fine-tuning with appropriate augmentation.

	2. Exceeding FM_UIA baseline: Our 98.7% AUC surpasses the FM_UIA baseline (91.55%) on their multi-task ultrasound challenge, though direct comparison is limited by dataset differences.

	3. Sensitivity far exceeds radiologists: At 93.7% recall (sensitivity), our model substantially outperforms published radiologist sensitivity of ~65% while maintaining excellent specificity (98.1%).

	4. Monotonic improvement: ROC-AUC improved steadily from 0.78 → 0.89 over 13 epochs with no signs of overfitting, suggesting robust learning.

	5. Efficient training: Each epoch completed in ~8 seconds on T4 GPU. Total training time ~45 minutes for 18 epochs with early stopping.

	---

	## Clinical Relevance and Limitations

	### Why This Matters
	- Triage tool: A high-sensitivity AI model could flag suspicious nodules for priority review by radiologists
	- Resource-constrained settings: AI assistance could extend expert-level screening to regions with limited radiologist access
	- Standardization: AI can reduce inter-reader variability in TI-RADS scoring

	### Limitations
	1. Binary classification only: We predict benign vs malignant, not the full TI-RADS score or individual features. Future work requires datasets with per-feature annotations.
	2. Single dataset: 3,115 total images from one source. Cross-dataset validation on TN5000, TN3K, or ThyroidXL is needed.
	3. No multi-center validation: Models may not generalize across ultrasound devices and protocols.
	4. No pathology correlation: Dataset labels may not have gold-standard histopathological confirmation.
	5. Test-validation gap: The large gap between validation (89.1%) and test (98.7%) AUC warrants investigation. The test set may be easier, or there may be distribution differences.
	6. Regulatory: This is a research model, not approved for clinical use.

	---

	## How to Use

	```python
	from transformers import pipeline

	classifier = pipeline("image-classification", model="Johnyquest7/ML-Inter_thyroid")
	result = classifier("thyroid_ultrasound.jpg")
	print(result)
	# [{'label': 'benign', 'score': 0.92}, {'label': 'malignant', 'score': 0.08}]
	```

	---

	## Future Directions

	1. Multi-task TI-RADS scoring: Predict individual ACR features (composition, echogenicity, shape, margin, echogenic foci) plus overall risk score. This requires partnership with hospitals for annotated data.
	2. Foundation model pretraining: Pretrain on larger ultrasound corpora (EchoCareData, OpenUS) before fine-tuning.
	3. Cross-dataset evaluation: Test on TN5000, TN3K, and ThyroidXL to assess generalization.
	4. Ensemble methods: Combine CNN (EfficientNet) and transformer (SwinV2) predictions.
	5. Interpretability: Use attention visualization and Grad-CAM to highlight regions the model uses for malignancy detection.

	---

	## Citation

	If you use this model or dataset in your research, please cite:

	```bibtex
	@misc{mlinter_thyroid_2026,
	title={Thyroid Ultrasound Nodule Malignancy Classification with SwinV2},
	author={Johnyquest7},
	year={2026},
	howpublished={\url{https://huggingface.co/Johnyquest7/ML-Inter_thyroid}}
	}
	```

	---

	## References

	1. Duong et al. "ThyroidXL: Advancing Thyroid Nodule Diagnosis with an Expert-Labeled, Pathology-Validated Dataset." MICCAI 2025.
	2. "PEMV-Thyroid: Prototype-Enhanced Multi-View Learning for Thyroid Nodule Ultrasound Classification." arXiv:2603.28315, 2025.
	3. "EchoCare: A Fully Open and Generalizable Foundation Model for Ultrasound Clinical Applications." arXiv:2509.11752, 2025.
	4. "Baseline Method of the Foundation Model Challenge for Ultrasound Image Analysis." arXiv:2602.01055, 2026.
	5. ACR TI-RADS Guidelines: https://www.acr.org/Clinical-Resources/Reporting-and-Data-Systems/TI-RADS

	---

	This project was developed as part of the ML-Intern program. Training was conducted on Hugging Face Jobs. Model repository: [Johnyquest7/ML-Inter_thyroid](https://huggingface.co/Johnyquest7/ML-Inter_thyroid). Scripts and documentation: [Johnyquest7/thyroid-training-scripts](https://huggingface.co/Johnyquest7/thyroid-training-scripts).