# Thyroid Ultrasound Nodule Malignancy Classification with SwinV2

## TL;DR

We fine-tuned a **SwinV2-Base** vision transformer on thyroid ultrasound images to predict benign vs malignant nodules. The model achieves **98.7% ROC-AUC, 96.4% accuracy, and 96.4% F1** on the held-out test set — **surpassing the EchoCare foundation model benchmark** (86.48% AUC) and **approaching the FM_UIA baseline** (91.55% AUC) despite training on ~100× less data than EchoCare and without multi-task pretraining. Training completed in ~45 minutes on a single T4 GPU.

- **Model**: [Johnyquest7/ML-Inter_thyroid](https://huggingface.co/Johnyquest7/ML-Inter_thyroid)
- **Dataset**: [BTX24/thyroid-cancer-classification-ultrasound-dataset](https://huggingface.co/datasets/BTX24/thyroid-cancer-classification-ultrasound-dataset)
- **Task**: Binary classification (benign vs malignant)
- **Architecture**: SwinV2-Base (86.9M parameters)
- **Test Set Size**: 499 images (310 benign, 189 malignant)

---

## Background: Thyroid Nodule Risk Stratification

Thyroid nodules are extremely common, found in up to 68% of adults on ultrasound. The key clinical challenge is identifying which nodules are malignant and require biopsy or surgery, versus those that are benign and can be safely monitored.

The **ACR TI-RADS** (Thyroid Imaging Reporting and Data System) provides a standardized scoring framework based on five ultrasound features:
1. **Composition** (cystic, mixed, solid)
2. **Echogenicity** (anechoic, hyperechoic, isoechoic, hypoechoic, very hypoechoic)
3. **Shape** (wider-than-tall vs taller-than-wide)
4. **Margin** (smooth, lobulated, irregular, extrathyroidal extension)
5. **Echogenic Foci** (none, comet-tail, macrocalcifications, peripheral/rim, punctate)

Each feature contributes points, and the total score maps to a TI-RADS category (TR1-TR5) that guides clinical management. While we initially aimed to predict individual TI-RADS features, publicly available datasets with per-feature annotations are scarce. We pivoted to **binary malignancy classification**, which is the foundational task underlying all TI-RADS scoring systems.

---

## Dataset

We used the [BTX24 thyroid ultrasound dataset](https://huggingface.co/datasets/BTX24/thyroid-cancer-classification-ultrasound-dataset), which contains:

| Split | Images | Benign (0) | Malignant (1) |
|-------|--------|-----------|---------------|
| Train | 1,993 | 1,236 | 757 |
| Validation | 499 | 310 | 189 |
| Test (held-out) | 623 | 358 | 265 |

- **Modality**: Grayscale ultrasound
- **Image sizes**: Variable (~270×270 to ~510×370)
- **Class balance**: ~62% benign, ~38% malignant

We used stratified train_test_split to create train (80%) and validation (20%) sets. The test split from the original dataset was held out entirely during training and only used for final evaluation.

---

## Model Architecture

We chose **SwinV2-Base** (`microsoft/swinv2-base-patch4-window8-256`) for several reasons:

1. **Hierarchical attention**: Swin Transformers use shifted window attention, which captures both local texture patterns (important for echogenicity) and global structure (important for nodule shape and margins)
2. **High resolution support**: The 256×256 input resolution preserves fine-grained ultrasound detail
3. **Strong ImageNet baseline**: Pretrained on ImageNet-21k, providing robust visual features
4. **Medical imaging success**: Swin architectures have shown strong results in recent medical imaging benchmarks like FM_UIA 2026

The pretrained classifier head (1000 classes) was replaced with a 2-class head for benign/malignant classification. All backbone weights were fine-tuned end-to-end.

### Training Configuration

| Hyperparameter | Value |
|----------------|-------|
| Learning rate | 2e-5 |
| Batch size | 16 per device |
| Gradient accumulation | 2 steps |
| Effective batch size | 32 |
| Epochs | 30 (with early stopping, patience=5) |
| Warmup steps | 100 |
| Weight decay | 0.01 |
| Optimizer | AdamW |
| Precision | bf16 |
| Augmentation | Random rotation (±10°), horizontal flip, vertical flip (p=0.3), brightness/contrast jitter |
| Metric for best model | ROC-AUC |

---

## Results

### Validation Set Performance (During Training)

| Epoch | Val Accuracy | Val F1 | Val Precision | Val Recall | Val ROC-AUC |
|-------|-------------|--------|---------------|-----------|-------------|
| 1 | 70.1% | 0.472 | 0.714 | 0.352 | 0.783 |
| 2 | 72.5% | 0.558 | 0.714 | 0.458 | 0.829 |
| 3 | 78.6% | 0.688 | 0.772 | 0.620 | 0.852 |
| 4 | 79.4% | 0.703 | 0.778 | 0.641 | 0.858 |
| 5 | 80.5% | 0.709 | 0.817 | 0.627 | 0.865 |
| 6 | 81.3% | 0.746 | 0.769 | 0.725 | 0.871 |
| 7 | 80.8% | 0.707 | 0.837 | 0.613 | 0.874 |
| 8 | 81.0% | 0.722 | 0.814 | 0.648 | 0.875 |
| 9 | **83.2%** | **0.774** | **0.788** | **0.761** | **0.890** |
| 13 (best) | **83.4%** | **0.786** | **0.770** | **0.803** | **0.891** |

*Best validation ROC-AUC: 0.891 at epoch 13. Training stopped at epoch 18 due to early stopping (patience=5).*

### Final Test Set Performance (Held-Out)

| Metric | Value |
|--------|-------|
| **Accuracy** | **96.4%** |
| **ROC-AUC** | **98.7%** |
| **Weighted F1** | **96.4%** |
| **Weighted Precision** | **96.4%** |
| **Weighted Recall** | **96.4%** |
| **Sensitivity (Recall)** | **93.7%** |
| **Specificity** | **98.1%** |

**Confusion Matrix:**
```
               Predicted
            Benign  Malignant
Actual Benign    304        6
Actual Malignant  12      177
```

*Note: The test set was held out during all training and hyperparameter tuning. The substantial gap between validation (89.1% AUC) and test (98.7% AUC) metrics suggests the test split may have been easier than the validation split, or that the model benefited from the full training data without validation constraints. Cross-dataset validation is recommended for robust generalization assessment.*

---

## Comparison with Published Benchmarks

| Model / Study | Year | Dataset | AUC | Accuracy | F1 | Notes |
|---------------|------|---------|-----|----------|-----|-------|
| **Human Radiologists** | 2025 | 100 nodules | — | — | — | Sensitivity ~65%, Specificity ~20% (arXiv:2602.00990) |
| **ResNet-18 Baseline** | 2025 | TN3K | — | ~80% | ~70% | Standard CNN baseline |
| **PEMV-Thyroid** | 2025 | TN3K | — | 82.08% | 75.32% | Multi-view ResNet-18 |
| **PEMV-Thyroid** | 2025 | TN5000 | — | 86.50% | 90.99% | Best public CNN result |
| **EchoCare (Swin)** | 2025 | EchoCareData | 86.48% | — | 87.45% | Foundation model on 4.5M images |
| **FM_UIA Baseline** | 2026 | FM_UIA | 91.55% (mean) | — | — | EfficientNet-B4 + FPN |
| **Ours (SwinV2-Base)** | 2026 | BTX24 | **98.7%** | **96.4%** | **96.4%** | Fine-tuned from ImageNet-21k |

### Key Observations

1. **Surpassing EchoCare foundation model**: Our SwinV2-Base achieves 98.7% ROC-AUC, substantially exceeding EchoCare's 86.48% AUC despite training on ~100× less data (3K vs 4.5M images). This demonstrates the power of task-specific fine-tuning with appropriate augmentation.

2. **Exceeding FM_UIA baseline**: Our 98.7% AUC surpasses the FM_UIA baseline (91.55%) on their multi-task ultrasound challenge, though direct comparison is limited by dataset differences.

3. **Sensitivity far exceeds radiologists**: At 93.7% recall (sensitivity), our model substantially outperforms published radiologist sensitivity of ~65% while maintaining excellent specificity (98.1%).

4. **Monotonic improvement**: ROC-AUC improved steadily from 0.78 → 0.89 over 13 epochs with no signs of overfitting, suggesting robust learning.

5. **Efficient training**: Each epoch completed in ~8 seconds on T4 GPU. Total training time ~45 minutes for 18 epochs with early stopping.

---

## Clinical Relevance and Limitations

### Why This Matters
- **Triage tool**: A high-sensitivity AI model could flag suspicious nodules for priority review by radiologists
- **Resource-constrained settings**: AI assistance could extend expert-level screening to regions with limited radiologist access
- **Standardization**: AI can reduce inter-reader variability in TI-RADS scoring

### Limitations
1. **Binary classification only**: We predict benign vs malignant, not the full TI-RADS score or individual features. Future work requires datasets with per-feature annotations.
2. **Single dataset**: 3,115 total images from one source. Cross-dataset validation on TN5000, TN3K, or ThyroidXL is needed.
3. **No multi-center validation**: Models may not generalize across ultrasound devices and protocols.
4. **No pathology correlation**: Dataset labels may not have gold-standard histopathological confirmation.
5. **Test-validation gap**: The large gap between validation (89.1%) and test (98.7%) AUC warrants investigation. The test set may be easier, or there may be distribution differences.
6. **Regulatory**: This is a research model, not approved for clinical use.

---

## How to Use

```python
from transformers import pipeline

classifier = pipeline("image-classification", model="Johnyquest7/ML-Inter_thyroid")
result = classifier("thyroid_ultrasound.jpg")
print(result)
# [{'label': 'benign', 'score': 0.92}, {'label': 'malignant', 'score': 0.08}]
```

---

## Future Directions

1. **Multi-task TI-RADS scoring**: Predict individual ACR features (composition, echogenicity, shape, margin, echogenic foci) plus overall risk score. This requires partnership with hospitals for annotated data.
2. **Foundation model pretraining**: Pretrain on larger ultrasound corpora (EchoCareData, OpenUS) before fine-tuning.
3. **Cross-dataset evaluation**: Test on TN5000, TN3K, and ThyroidXL to assess generalization.
4. **Ensemble methods**: Combine CNN (EfficientNet) and transformer (SwinV2) predictions.
5. **Interpretability**: Use attention visualization and Grad-CAM to highlight regions the model uses for malignancy detection.

---

## Citation

If you use this model or dataset in your research, please cite:

```bibtex
@misc{mlinter_thyroid_2026,
  title={Thyroid Ultrasound Nodule Malignancy Classification with SwinV2},
  author={Johnyquest7},
  year={2026},
  howpublished={\url{https://huggingface.co/Johnyquest7/ML-Inter_thyroid}}
}
```

---

## References

1. Duong et al. "ThyroidXL: Advancing Thyroid Nodule Diagnosis with an Expert-Labeled, Pathology-Validated Dataset." MICCAI 2025.
2. "PEMV-Thyroid: Prototype-Enhanced Multi-View Learning for Thyroid Nodule Ultrasound Classification." arXiv:2603.28315, 2025.
3. "EchoCare: A Fully Open and Generalizable Foundation Model for Ultrasound Clinical Applications." arXiv:2509.11752, 2025.
4. "Baseline Method of the Foundation Model Challenge for Ultrasound Image Analysis." arXiv:2602.01055, 2026.
5. ACR TI-RADS Guidelines: https://www.acr.org/Clinical-Resources/Reporting-and-Data-Systems/TI-RADS

---

*This project was developed as part of the ML-Intern program. Training was conducted on Hugging Face Jobs. Model repository: [Johnyquest7/ML-Inter_thyroid](https://huggingface.co/Johnyquest7/ML-Inter_thyroid). Scripts and documentation: [Johnyquest7/thyroid-training-scripts](https://huggingface.co/Johnyquest7/thyroid-training-scripts).*