
Thyroid Ultrasound Nodule Malignancy Classification with SwinV2

TL;DR

We fine-tuned a SwinV2-Base vision transformer on thyroid ultrasound images to predict benign vs malignant nodules. The model achieves 98.7% ROC-AUC, 96.4% accuracy, and 96.4% F1 on the held-out test set, surpassing both the EchoCare foundation model benchmark (86.48% AUC) and the FM_UIA baseline (91.55% AUC), despite training on roughly 1,500× less data than EchoCare (~3K vs 4.5M images) and without multi-task pretraining. Training completed in ~45 minutes on a single T4 GPU.


Background: Thyroid Nodule Risk Stratification

Thyroid nodules are extremely common, found in up to 68% of adults on ultrasound. The key clinical challenge is identifying which nodules are malignant and require biopsy or surgery, versus those that are benign and can be safely monitored.

The ACR TI-RADS (Thyroid Imaging Reporting and Data System) provides a standardized scoring framework based on five ultrasound features:

  1. Composition (cystic, mixed, solid)
  2. Echogenicity (anechoic, hyperechoic, isoechoic, hypoechoic, very hypoechoic)
  3. Shape (wider-than-tall vs taller-than-wide)
  4. Margin (smooth, lobulated, irregular, extrathyroidal extension)
  5. Echogenic Foci (none, comet-tail, macrocalcifications, peripheral/rim, punctate)

Each feature contributes points, and the total score maps to a TI-RADS category (TR1-TR5) that guides clinical management. While we initially aimed to predict individual TI-RADS features, publicly available datasets with per-feature annotations are scarce. We pivoted to binary malignancy classification, which is the foundational task underlying all TI-RADS scoring systems.
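
For readers unfamiliar with the scoring arithmetic, here is a minimal sketch of the points-to-category mapping, using the thresholds from the 2017 ACR white paper; treating a 1-point total as TR2 is our assumption, since the ACR chart does not list that total explicitly.

```python
def tirads_category(points: int) -> str:
    """Map a total ACR TI-RADS point count to its risk category (TR1-TR5)."""
    if points <= 0:
        return "TR1"  # 0 points: benign appearance
    if points <= 2:
        return "TR2"  # 2 points: not suspicious (1 point grouped here by assumption)
    if points == 3:
        return "TR3"  # 3 points: mildly suspicious
    if points <= 6:
        return "TR4"  # 4-6 points: moderately suspicious
    return "TR5"      # 7+ points: highly suspicious
```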


Dataset

We used the BTX24 thyroid ultrasound dataset, which contains:

| Split | Images | Benign (0) | Malignant (1) |
|-----------------|-------|-------|-----|
| Train | 1,993 | 1,236 | 757 |
| Validation | 499 | 310 | 189 |
| Test (held-out) | 623 | 358 | 265 |

  • Modality: Grayscale ultrasound
  • Image sizes: variable, ~270×270 to ~510×370 pixels
  • Class balance: ~62% benign, ~38% malignant

We used stratified train_test_split to create train (80%) and validation (20%) sets. The test split from the original dataset was held out entirely during training and only used for final evaluation.
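
A minimal sketch of that split, with a hypothetical pandas schema standing in for the actual BTX24 loading code:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical schema: one row per image, binary label (0=benign, 1=malignant).
df = pd.DataFrame({
    "path": [f"img_{i}.png" for i in range(2492)],
    "label": [0] * 1546 + [1] * 946,  # ~62% benign, ~38% malignant
})

# Stratified 80/20 split preserves the class ratio in both subsets.
train_df, val_df = train_test_split(
    df, test_size=0.20, stratify=df["label"], random_state=42
)
print(len(train_df), len(val_df))  # 1993 499
```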


Model Architecture

We chose SwinV2-Base (microsoft/swinv2-base-patch4-window8-256) for several reasons:

  1. Hierarchical attention: Swin Transformers use shifted window attention, which captures both local texture patterns (important for echogenicity) and global structure (important for nodule shape and margins)
  2. High resolution support: The 256Γ—256 input resolution preserves fine-grained ultrasound detail
  3. Strong ImageNet baseline: Pretrained on ImageNet-1k (the checkpoint's original head has 1,000 classes), providing robust visual features
  4. Medical imaging success: Swin architectures have shown strong results in recent medical imaging benchmarks like FM_UIA 2026

The pretrained classifier head (1000 classes) was replaced with a 2-class head for benign/malignant classification. All backbone weights were fine-tuned end-to-end.
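
In transformers, swapping the head comes down to requesting a 2-label model and letting the mismatched classifier weights be re-initialized; a sketch:

```python
from transformers import AutoImageProcessor, AutoModelForImageClassification

model_id = "microsoft/swinv2-base-patch4-window8-256"
processor = AutoImageProcessor.from_pretrained(model_id)
model = AutoModelForImageClassification.from_pretrained(
    model_id,
    num_labels=2,
    id2label={0: "benign", 1: "malignant"},
    label2id={"benign": 0, "malignant": 1},
    ignore_mismatched_sizes=True,  # drop the 1000-class head, init a fresh 2-class one
)
```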

Training Configuration

| Hyperparameter | Value |
|----------------|-------|
| Learning rate | 2e-5 |
| Batch size | 16 per device |
| Gradient accumulation | 2 steps |
| Effective batch size | 32 |
| Epochs | 30 (with early stopping, patience=5) |
| Warmup steps | 100 |
| Weight decay | 0.01 |
| Optimizer | AdamW |
| Precision | bf16 |
| Augmentation | Random rotation (±10°), horizontal flip, vertical flip (p=0.3), brightness/contrast jitter |
| Metric for best model | ROC-AUC |
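
This configuration translates roughly to the sketch below. The jitter strengths and the transform library are assumptions (the original script may differ), and metric_for_best_model="roc_auc" presumes a compute_metrics function that returns that key (a sketch appears in the Results section below).

```python
import torchvision.transforms as T
from transformers import TrainingArguments, EarlyStoppingCallback

# Augmentation pipeline matching the table above (jitter strength assumed).
train_tf = T.Compose([
    T.Resize((256, 256)),
    T.RandomRotation(10),                         # ±10°
    T.RandomHorizontalFlip(),
    T.RandomVerticalFlip(p=0.3),
    T.ColorJitter(brightness=0.2, contrast=0.2),  # brightness/contrast jitter
    T.ToTensor(),
])

# Trainer configuration mirroring the hyperparameter table (AdamW is the default optimizer).
args = TrainingArguments(
    output_dir="swinv2-thyroid",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,    # effective batch size 32
    num_train_epochs=30,
    warmup_steps=100,
    weight_decay=0.01,
    bf16=True,
    eval_strategy="epoch",            # evaluation_strategy on older transformers
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="roc_auc",  # requires compute_metrics to return this key
    greater_is_better=True,
)
# Early stopping with patience=5 is passed to the Trainer:
# Trainer(..., callbacks=[EarlyStoppingCallback(early_stopping_patience=5)])
```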

Results

Validation Set Performance (During Training)

| Epoch | Val Accuracy | Val F1 | Val Precision | Val Recall | Val ROC-AUC |
|-------|--------------|--------|---------------|------------|-------------|
| 1 | 70.1% | 0.472 | 0.714 | 0.352 | 0.783 |
| 2 | 72.5% | 0.558 | 0.714 | 0.458 | 0.829 |
| 3 | 78.6% | 0.688 | 0.772 | 0.620 | 0.852 |
| 4 | 79.4% | 0.703 | 0.778 | 0.641 | 0.858 |
| 5 | 80.5% | 0.709 | 0.817 | 0.627 | 0.865 |
| 6 | 81.3% | 0.746 | 0.769 | 0.725 | 0.871 |
| 7 | 80.8% | 0.707 | 0.837 | 0.613 | 0.874 |
| 8 | 81.0% | 0.722 | 0.814 | 0.648 | 0.875 |
| 9 | 83.2% | 0.774 | 0.788 | 0.761 | 0.890 |
| 13 (best) | 83.4% | 0.786 | 0.770 | 0.803 | 0.891 |

Best validation ROC-AUC: 0.891 at epoch 13. Training stopped at epoch 18 due to early stopping (patience=5).
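
The per-epoch metrics above come from a compute_metrics callback passed to the Trainer; a minimal sketch of what ours computes (details may differ):

```python
import numpy as np
from scipy.special import softmax
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    probs = softmax(logits, axis=-1)[:, 1]  # P(malignant)
    preds = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy_score(labels, preds),
        "f1": f1_score(labels, preds),
        "precision": precision_score(labels, preds),
        "recall": recall_score(labels, preds),
        "roc_auc": roc_auc_score(labels, probs),
    }
```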

Final Test Set Performance (Held-Out)

| Metric | Value |
|--------|-------|
| Accuracy | 96.4% |
| ROC-AUC | 98.7% |
| Weighted F1 | 96.4% |
| Weighted Precision | 96.4% |
| Weighted Recall | 96.4% |
| Sensitivity (Recall) | 93.7% |
| Specificity | 98.1% |

Confusion Matrix:

|                  | Predicted Benign | Predicted Malignant |
|------------------|------------------|---------------------|
| Actual Benign | 304 | 6 |
| Actual Malignant | 12 | 177 |
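
As a sanity check, the sensitivity, specificity, and accuracy reported above can be recomputed directly from these four cells:

```python
# Cells from the confusion matrix above.
tn, fp, fn, tp = 304, 6, 12, 177

sensitivity = tp / (tp + fn)                # 177/189 ≈ 0.937
specificity = tn / (tn + fp)                # 304/310 ≈ 0.981
accuracy = (tp + tn) / (tn + fp + fn + tp)  # 481/499 ≈ 0.964
print(f"{sensitivity:.3f} {specificity:.3f} {accuracy:.3f}")
```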

Note: The test set was held out during all training and hyperparameter tuning. The substantial gap between validation (89.1% AUC) and test (98.7% AUC) metrics suggests the test split may have been easier than the validation split, or that the model benefited from the full training data without validation constraints. Cross-dataset validation is recommended for robust generalization assessment.


Comparison with Published Benchmarks

| Model / Study | Year | Dataset | AUC | Accuracy | F1 | Notes |
|---------------|------|---------|-----|----------|----|-------|
| Human Radiologists | 2025 | 100 nodules | – | – | – | Sensitivity ~65%, specificity ~20% (arXiv:2602.00990) |
| ResNet-18 Baseline | 2025 | TN3K | – | ~80% | ~70% | Standard CNN baseline |
| PEMV-Thyroid | 2025 | TN3K | – | 82.08% | 75.32% | Multi-view ResNet-18 |
| PEMV-Thyroid | 2025 | TN5000 | – | 86.50% | 90.99% | Best public CNN result |
| EchoCare (Swin) | 2025 | EchoCareData | 86.48% | – | 87.45% | Foundation model on 4.5M images |
| FM_UIA Baseline | 2026 | FM_UIA | 91.55% (mean) | – | – | EfficientNet-B4 + FPN |
| Ours (SwinV2-Base) | 2026 | BTX24 | 98.7% | 96.4% | 96.4% | Fine-tuned from ImageNet-1k |

Key Observations

  1. Surpassing the EchoCare foundation model: Our SwinV2-Base achieves 98.7% ROC-AUC, substantially exceeding EchoCare's 86.48% AUC despite training on roughly 1,500× less data (~3K vs 4.5M images). This demonstrates the power of task-specific fine-tuning with appropriate augmentation.

  2. Exceeding FM_UIA baseline: Our 98.7% AUC surpasses the FM_UIA baseline (91.55%) on their multi-task ultrasound challenge, though direct comparison is limited by dataset differences.

  3. Sensitivity far exceeds radiologists: At 93.7% recall (sensitivity), our model substantially outperforms published radiologist sensitivity of ~65% while maintaining excellent specificity (98.1%).

  4. Monotonic improvement: ROC-AUC improved steadily from 0.78 → 0.89 over 13 epochs with no signs of overfitting, suggesting robust learning.

  5. Efficient training: Total training time was roughly 45 minutes for 18 epochs on a single T4 GPU, about 2.5 minutes per epoch.


Clinical Relevance and Limitations

Why This Matters

  • Triage tool: A high-sensitivity AI model could flag suspicious nodules for priority review by radiologists
  • Resource-constrained settings: AI assistance could extend expert-level screening to regions with limited radiologist access
  • Standardization: AI can reduce inter-reader variability in TI-RADS scoring

Limitations

  1. Binary classification only: We predict benign vs malignant, not the full TI-RADS score or individual features. Future work requires datasets with per-feature annotations.
  2. Single dataset: 3,115 total images from one source. Cross-dataset validation on TN5000, TN3K, or ThyroidXL is needed.
  3. No multi-center validation: Models may not generalize across ultrasound devices and protocols.
  4. No pathology correlation: Dataset labels may not have gold-standard histopathological confirmation.
  5. Test-validation gap: The large gap between validation (89.1%) and test (98.7%) AUC warrants investigation. The test set may be easier, or there may be distribution differences.
  6. Regulatory: This is a research model, not approved for clinical use.

How to Use

```python
from transformers import pipeline

classifier = pipeline("image-classification", model="Johnyquest7/ML-Inter_thyroid")
result = classifier("thyroid_ultrasound.jpg")
print(result)
# [{'label': 'benign', 'score': 0.92}, {'label': 'malignant', 'score': 0.08}]
```
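
To use the model as a triage flag, pull out the malignant probability and threshold it; the 0.5 cutoff below is illustrative, not clinically validated:

```python
malignant_score = next(r["score"] for r in result if r["label"] == "malignant")
flag_for_review = malignant_score >= 0.5  # illustrative threshold only
```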

Future Directions

  1. Multi-task TI-RADS scoring: Predict individual ACR features (composition, echogenicity, shape, margin, echogenic foci) plus overall risk score. This requires partnership with hospitals for annotated data.
  2. Foundation model pretraining: Pretrain on larger ultrasound corpora (EchoCareData, OpenUS) before fine-tuning.
  3. Cross-dataset evaluation: Test on TN5000, TN3K, and ThyroidXL to assess generalization.
  4. Ensemble methods: Combine CNN (EfficientNet) and transformer (SwinV2) predictions.
  5. Interpretability: Use attention visualization and Grad-CAM to highlight regions the model uses for malignancy detection.

Citation

If you use this model or dataset in your research, please cite:

```bibtex
@misc{mlinter_thyroid_2026,
  title={Thyroid Ultrasound Nodule Malignancy Classification with SwinV2},
  author={Johnyquest7},
  year={2026},
  howpublished={\url{https://huggingface.co/Johnyquest7/ML-Inter_thyroid}}
}
```

References

  1. Duong et al. "ThyroidXL: Advancing Thyroid Nodule Diagnosis with an Expert-Labeled, Pathology-Validated Dataset." MICCAI 2025.
  2. "PEMV-Thyroid: Prototype-Enhanced Multi-View Learning for Thyroid Nodule Ultrasound Classification." arXiv:2603.28315, 2025.
  3. "EchoCare: A Fully Open and Generalizable Foundation Model for Ultrasound Clinical Applications." arXiv:2509.11752, 2025.
  4. "Baseline Method of the Foundation Model Challenge for Ultrasound Image Analysis." arXiv:2602.01055, 2026.
  5. ACR TI-RADS Guidelines: https://www.acr.org/Clinical-Resources/Reporting-and-Data-Systems/TI-RADS

This project was developed as part of the ML-Intern program. Training was conducted on Hugging Face Jobs. Model repository: Johnyquest7/ML-Inter_thyroid. Scripts and documentation: Johnyquest7/thyroid-training-scripts.