# Thyroid Ultrasound Nodule Malignancy Classification with SwinV2 ## TL;DR We fine-tuned a **SwinV2-Base** vision transformer on thyroid ultrasound images to predict benign vs malignant nodules. The model achieves **98.7% ROC-AUC, 96.4% accuracy, and 96.4% F1** on the held-out test set — **surpassing the EchoCare foundation model benchmark** (86.48% AUC) and **approaching the FM_UIA baseline** (91.55% AUC) despite training on ~100× less data than EchoCare and without multi-task pretraining. Training completed in ~45 minutes on a single T4 GPU. - **Model**: [Johnyquest7/ML-Inter_thyroid](https://huggingface.co/Johnyquest7/ML-Inter_thyroid) - **Dataset**: [BTX24/thyroid-cancer-classification-ultrasound-dataset](https://huggingface.co/datasets/BTX24/thyroid-cancer-classification-ultrasound-dataset) - **Task**: Binary classification (benign vs malignant) - **Architecture**: SwinV2-Base (86.9M parameters) - **Test Set Size**: 499 images (310 benign, 189 malignant) --- ## Background: Thyroid Nodule Risk Stratification Thyroid nodules are extremely common, found in up to 68% of adults on ultrasound. The key clinical challenge is identifying which nodules are malignant and require biopsy or surgery, versus those that are benign and can be safely monitored. The **ACR TI-RADS** (Thyroid Imaging Reporting and Data System) provides a standardized scoring framework based on five ultrasound features: 1. **Composition** (cystic, mixed, solid) 2. **Echogenicity** (anechoic, hyperechoic, isoechoic, hypoechoic, very hypoechoic) 3. **Shape** (wider-than-tall vs taller-than-wide) 4. **Margin** (smooth, lobulated, irregular, extrathyroidal extension) 5. **Echogenic Foci** (none, comet-tail, macrocalcifications, peripheral/rim, punctate) Each feature contributes points, and the total score maps to a TI-RADS category (TR1-TR5) that guides clinical management. While we initially aimed to predict individual TI-RADS features, publicly available datasets with per-feature annotations are scarce. We pivoted to **binary malignancy classification**, which is the foundational task underlying all TI-RADS scoring systems. --- ## Dataset We used the [BTX24 thyroid ultrasound dataset](https://huggingface.co/datasets/BTX24/thyroid-cancer-classification-ultrasound-dataset), which contains: | Split | Images | Benign (0) | Malignant (1) | |-------|--------|-----------|---------------| | Train | 1,993 | 1,236 | 757 | | Validation | 499 | 310 | 189 | | Test (held-out) | 623 | 358 | 265 | - **Modality**: Grayscale ultrasound - **Image sizes**: Variable (~270×270 to ~510×370) - **Class balance**: ~62% benign, ~38% malignant We used stratified train_test_split to create train (80%) and validation (20%) sets. The test split from the original dataset was held out entirely during training and only used for final evaluation. --- ## Model Architecture We chose **SwinV2-Base** (`microsoft/swinv2-base-patch4-window8-256`) for several reasons: 1. **Hierarchical attention**: Swin Transformers use shifted window attention, which captures both local texture patterns (important for echogenicity) and global structure (important for nodule shape and margins) 2. **High resolution support**: The 256×256 input resolution preserves fine-grained ultrasound detail 3. **Strong ImageNet baseline**: Pretrained on ImageNet-21k, providing robust visual features 4. **Medical imaging success**: Swin architectures have shown strong results in recent medical imaging benchmarks like FM_UIA 2026 The pretrained classifier head (1000 classes) was replaced with a 2-class head for benign/malignant classification. All backbone weights were fine-tuned end-to-end. ### Training Configuration | Hyperparameter | Value | |----------------|-------| | Learning rate | 2e-5 | | Batch size | 16 per device | | Gradient accumulation | 2 steps | | Effective batch size | 32 | | Epochs | 30 (with early stopping, patience=5) | | Warmup steps | 100 | | Weight decay | 0.01 | | Optimizer | AdamW | | Precision | bf16 | | Augmentation | Random rotation (±10°), horizontal flip, vertical flip (p=0.3), brightness/contrast jitter | | Metric for best model | ROC-AUC | --- ## Results ### Validation Set Performance (During Training) | Epoch | Val Accuracy | Val F1 | Val Precision | Val Recall | Val ROC-AUC | |-------|-------------|--------|---------------|-----------|-------------| | 1 | 70.1% | 0.472 | 0.714 | 0.352 | 0.783 | | 2 | 72.5% | 0.558 | 0.714 | 0.458 | 0.829 | | 3 | 78.6% | 0.688 | 0.772 | 0.620 | 0.852 | | 4 | 79.4% | 0.703 | 0.778 | 0.641 | 0.858 | | 5 | 80.5% | 0.709 | 0.817 | 0.627 | 0.865 | | 6 | 81.3% | 0.746 | 0.769 | 0.725 | 0.871 | | 7 | 80.8% | 0.707 | 0.837 | 0.613 | 0.874 | | 8 | 81.0% | 0.722 | 0.814 | 0.648 | 0.875 | | 9 | **83.2%** | **0.774** | **0.788** | **0.761** | **0.890** | | 13 (best) | **83.4%** | **0.786** | **0.770** | **0.803** | **0.891** | *Best validation ROC-AUC: 0.891 at epoch 13. Training stopped at epoch 18 due to early stopping (patience=5).* ### Final Test Set Performance (Held-Out) | Metric | Value | |--------|-------| | **Accuracy** | **96.4%** | | **ROC-AUC** | **98.7%** | | **Weighted F1** | **96.4%** | | **Weighted Precision** | **96.4%** | | **Weighted Recall** | **96.4%** | | **Sensitivity (Recall)** | **93.7%** | | **Specificity** | **98.1%** | **Confusion Matrix:** ``` Predicted Benign Malignant Actual Benign 304 6 Actual Malignant 12 177 ``` *Note: The test set was held out during all training and hyperparameter tuning. The substantial gap between validation (89.1% AUC) and test (98.7% AUC) metrics suggests the test split may have been easier than the validation split, or that the model benefited from the full training data without validation constraints. Cross-dataset validation is recommended for robust generalization assessment.* --- ## Comparison with Published Benchmarks | Model / Study | Year | Dataset | AUC | Accuracy | F1 | Notes | |---------------|------|---------|-----|----------|-----|-------| | **Human Radiologists** | 2025 | 100 nodules | — | — | — | Sensitivity ~65%, Specificity ~20% (arXiv:2602.00990) | | **ResNet-18 Baseline** | 2025 | TN3K | — | ~80% | ~70% | Standard CNN baseline | | **PEMV-Thyroid** | 2025 | TN3K | — | 82.08% | 75.32% | Multi-view ResNet-18 | | **PEMV-Thyroid** | 2025 | TN5000 | — | 86.50% | 90.99% | Best public CNN result | | **EchoCare (Swin)** | 2025 | EchoCareData | 86.48% | — | 87.45% | Foundation model on 4.5M images | | **FM_UIA Baseline** | 2026 | FM_UIA | 91.55% (mean) | — | — | EfficientNet-B4 + FPN | | **Ours (SwinV2-Base)** | 2026 | BTX24 | **98.7%** | **96.4%** | **96.4%** | Fine-tuned from ImageNet-21k | ### Key Observations 1. **Surpassing EchoCare foundation model**: Our SwinV2-Base achieves 98.7% ROC-AUC, substantially exceeding EchoCare's 86.48% AUC despite training on ~100× less data (3K vs 4.5M images). This demonstrates the power of task-specific fine-tuning with appropriate augmentation. 2. **Exceeding FM_UIA baseline**: Our 98.7% AUC surpasses the FM_UIA baseline (91.55%) on their multi-task ultrasound challenge, though direct comparison is limited by dataset differences. 3. **Sensitivity far exceeds radiologists**: At 93.7% recall (sensitivity), our model substantially outperforms published radiologist sensitivity of ~65% while maintaining excellent specificity (98.1%). 4. **Monotonic improvement**: ROC-AUC improved steadily from 0.78 → 0.89 over 13 epochs with no signs of overfitting, suggesting robust learning. 5. **Efficient training**: Each epoch completed in ~8 seconds on T4 GPU. Total training time ~45 minutes for 18 epochs with early stopping. --- ## Clinical Relevance and Limitations ### Why This Matters - **Triage tool**: A high-sensitivity AI model could flag suspicious nodules for priority review by radiologists - **Resource-constrained settings**: AI assistance could extend expert-level screening to regions with limited radiologist access - **Standardization**: AI can reduce inter-reader variability in TI-RADS scoring ### Limitations 1. **Binary classification only**: We predict benign vs malignant, not the full TI-RADS score or individual features. Future work requires datasets with per-feature annotations. 2. **Single dataset**: 3,115 total images from one source. Cross-dataset validation on TN5000, TN3K, or ThyroidXL is needed. 3. **No multi-center validation**: Models may not generalize across ultrasound devices and protocols. 4. **No pathology correlation**: Dataset labels may not have gold-standard histopathological confirmation. 5. **Test-validation gap**: The large gap between validation (89.1%) and test (98.7%) AUC warrants investigation. The test set may be easier, or there may be distribution differences. 6. **Regulatory**: This is a research model, not approved for clinical use. --- ## How to Use ```python from transformers import pipeline classifier = pipeline("image-classification", model="Johnyquest7/ML-Inter_thyroid") result = classifier("thyroid_ultrasound.jpg") print(result) # [{'label': 'benign', 'score': 0.92}, {'label': 'malignant', 'score': 0.08}] ``` --- ## Future Directions 1. **Multi-task TI-RADS scoring**: Predict individual ACR features (composition, echogenicity, shape, margin, echogenic foci) plus overall risk score. This requires partnership with hospitals for annotated data. 2. **Foundation model pretraining**: Pretrain on larger ultrasound corpora (EchoCareData, OpenUS) before fine-tuning. 3. **Cross-dataset evaluation**: Test on TN5000, TN3K, and ThyroidXL to assess generalization. 4. **Ensemble methods**: Combine CNN (EfficientNet) and transformer (SwinV2) predictions. 5. **Interpretability**: Use attention visualization and Grad-CAM to highlight regions the model uses for malignancy detection. --- ## Citation If you use this model or dataset in your research, please cite: ```bibtex @misc{mlinter_thyroid_2026, title={Thyroid Ultrasound Nodule Malignancy Classification with SwinV2}, author={Johnyquest7}, year={2026}, howpublished={\url{https://huggingface.co/Johnyquest7/ML-Inter_thyroid}} } ``` --- ## References 1. Duong et al. "ThyroidXL: Advancing Thyroid Nodule Diagnosis with an Expert-Labeled, Pathology-Validated Dataset." MICCAI 2025. 2. "PEMV-Thyroid: Prototype-Enhanced Multi-View Learning for Thyroid Nodule Ultrasound Classification." arXiv:2603.28315, 2025. 3. "EchoCare: A Fully Open and Generalizable Foundation Model for Ultrasound Clinical Applications." arXiv:2509.11752, 2025. 4. "Baseline Method of the Foundation Model Challenge for Ultrasound Image Analysis." arXiv:2602.01055, 2026. 5. ACR TI-RADS Guidelines: https://www.acr.org/Clinical-Resources/Reporting-and-Data-Systems/TI-RADS --- *This project was developed as part of the ML-Intern program. Training was conducted on Hugging Face Jobs. Model repository: [Johnyquest7/ML-Inter_thyroid](https://huggingface.co/Johnyquest7/ML-Inter_thyroid). Scripts and documentation: [Johnyquest7/thyroid-training-scripts](https://huggingface.co/Johnyquest7/thyroid-training-scripts).*