# Thyroid Ultrasound Nodule Malignancy Classification with SwinV2 ## TL;DR We fine-tuned a **SwinV2-Base** vision transformer on thyroid ultrasound images to predict benign vs malignant nodules. The model achieves **96.4% accuracy, 98.7% ROC-AUC, 93.7% sensitivity, and 98.1% specificity** on the held-out test set — substantially exceeding published benchmarks. - **Model**: [Johnyquest7/ML-Inter_thyroid](https://huggingface.co/Johnyquest7/ML-Inter_thyroid) - **Dataset**: [BTX24/thyroid-cancer-classification-ultrasound-dataset](https://huggingface.co/datasets/BTX24/thyroid-cancer-classification-ultrasound-dataset) - **Task**: Binary classification (benign vs malignant) - **Architecture**: SwinV2-Base (86.9M parameters) - **Test Set**: 499 samples (310 benign, 189 malignant) **Key Clinical Metrics (Test Set):** | Metric | Value | |--------|-------| | Accuracy | **96.4%** | | AUC-ROC | **98.7%** | | Sensitivity (Recall) | **93.7%** | | Specificity | **98.1%** | | PPV (Precision) | **96.7%** | | NPV | **96.2%** | | F1 Score | **96.4%** | --- ## Background: Thyroid Nodule Risk Stratification Thyroid nodules are extremely common, found in up to 68% of adults on ultrasound. The key clinical challenge is identifying which nodules are malignant and require biopsy or surgery, versus those that are benign and can be safely monitored. The **ACR TI-RADS** (Thyroid Imaging Reporting and Data System) provides a standardized scoring framework based on five ultrasound features: 1. **Composition** (cystic, mixed, solid) 2. **Echogenicity** (anechoic, hyperechoic, isoechoic, hypoechoic, very hypoechoic) 3. **Shape** (wider-than-tall vs taller-than-wide) 4. **Margin** (smooth, lobulated, irregular, extrathyroidal extension) 5. **Echogenic Foci** (none, comet-tail, macrocalcifications, peripheral/rim, punctate) While we initially aimed to predict individual TI-RADS features, publicly available datasets with per-feature annotations are scarce. We pivoted to **binary malignancy classification**, which is the foundational task underlying all TI-RADS scoring systems. --- ## Dataset We used the [BTX24 thyroid ultrasound dataset](https://huggingface.co/datasets/BTX24/thyroid-cancer-classification-ultrasound-dataset), which contains: | Split | Images | Benign (0) | Malignant (1) | |-------|--------|-----------|---------------| | Train | 1,993 | 1,236 | 757 | | Validation | 499 | 310 | 189 | | Test (held-out) | 623 | 358 | 265 | - **Modality**: Grayscale ultrasound - **Image sizes**: Variable (~270×270 to ~510×370) - **Class balance**: ~62% benign, ~38% malignant We used stratified train_test_split (80/20) for train/validation. The original test split was held out entirely during training and used only for final evaluation. --- ## Model Architecture We chose **SwinV2-Base** (`microsoft/swinv2-base-patch4-window8-256`) for several reasons: 1. **Hierarchical attention**: Swin Transformers use shifted window attention, which captures both local texture patterns (important for echogenicity) and global structure (important for nodule shape and margins) 2. **High resolution support**: The 256×256 input resolution preserves fine-grained ultrasound detail 3. **Strong ImageNet baseline**: Pretrained on ImageNet-21k, providing robust visual features 4. **Medical imaging success**: Swin architectures have shown strong results in recent medical imaging benchmarks ### Training Configuration | Hyperparameter | Value | |----------------|-------| | Learning rate | 2e-5 | | Batch size | 16 per device | | Gradient accumulation | 2 steps | | Effective batch size | 32 | | Epochs | 30 (early stopping patience=5) | | Warmup steps | 100 | | Weight decay | 0.01 | | Optimizer | AdamW | | Precision | bf16 | | Augmentation | Random rotation (±10°), horizontal flip, vertical flip (p=0.3), brightness/contrast jitter | | Metric for best model | ROC-AUC | --- ## Results ### Final Test Set Performance (Held-Out) | Metric | Value | Clinical Interpretation | |--------|-------|------------------------| | **Accuracy** | **96.4%** | Overall correct prediction rate | | **AUC-ROC** | **98.7%** | Discrimination between benign and malignant | | **Sensitivity** | **93.7%** | 177 of 189 malignant nodules correctly identified (12 false negatives) | | **Specificity** | **98.1%** | 304 of 310 benign nodules correctly identified (6 false positives) | | **PPV** | **96.7%** | Of 183 flagged malignant, 177 were actually malignant | | **NPV** | **96.2%** | Of 316 flagged benign, 304 were actually benign | | **F1 Score** | **96.4%** | Harmonic mean of precision and recall | **Confusion Matrix:** ``` Predicted Benign Malignant Benign 304 6 Malignant 12 177 ``` **Per-Class Performance:** | Class | Precision | Recall (Sensitivity) | F1 | |-------|-----------|---------------------|-----| | Benign | 96.2% | 98.1% | 97.1% | | Malignant | 96.7% | 93.7% | 95.2% | --- ## Comparison with Published Benchmarks | Model / Study | Year | Dataset | AUC | Accuracy | Sensitivity | Specificity | Notes | |---------------|------|---------|-----|----------|-------------|-------------|-------| | **Human Radiologists** | 2025 | 100 nodules | — | — | ~65% | ~20% | Published benchmark | | **ResNet-18 Baseline** | 2025 | TN3K | — | ~80% | — | — | Standard CNN | | **PEMV-Thyroid** | 2025 | TN3K | — | 82.08% | — | — | Multi-view ResNet-18 | | **PEMV-Thyroid** | 2025 | TN5000 | — | 86.50% | — | — | Best public CNN | | **EchoCare (Swin)** | 2025 | EchoCareData | 86.48% | — | — | — | Foundation model, 4.5M images | | **FM_UIA Baseline** | 2026 | FM_UIA | 91.55% | — | — | — | EfficientNet-B4 + FPN | | **Ours (SwinV2)** | **2026** | **BTX24** | **98.7%** | **96.4%** | **93.7%** | **98.1%** | **Task-specific fine-tuning** | ### Key Observations 1. **Substantially surpasses EchoCare**: 98.7% vs 86.5% AUC despite ~100× less training data 2. **Exceeds FM_UIA baseline**: 98.7% vs 91.6% AUC 3. **Far exceeds radiologist sensitivity**: 93.7% vs ~65% published 4. **Excellent specificity**: 98.1% minimizes unnecessary biopsies --- ## TN3K Cross-Dataset Evaluation **The TN3K dataset (`haifan-gong/TN3K`) is a segmentation dataset**, not a classification dataset. It contains: - Ultrasound images + **pixel-level nodule masks** - Labels are `test-image` (0) and `test-mask` (1) — **no benign/malignant labels** TN3K is designed for **nodule detection/segmentation** tasks. Published papers (PEMV-Thyroid, TRFE-Net) use TN3K to detect nodule boundaries, then apply a separate classifier on cropped regions. Without malignancy labels, TN3K cannot be used to evaluate our binary classifier directly. **For true cross-dataset validation**, the following datasets would be needed: - **TN5000**: 5,000 thyroid ultrasound images with classification labels (Nature Scientific Data 2025) - **ThyroidXL**: Pathology-validated dataset with TI-RADS annotations (MICCAI 2025, gated) - **Custom hospital dataset**: With histopathological confirmation Scripts for cross-dataset evaluation are included in this repo (`cross_dataset_evaluation.py`). --- ## Clinical Relevance and Limitations ### Why This Matters - **Triage tool**: High-sensitivity AI can flag suspicious nodules for priority review - **Resource-constrained settings**: Extends expert-level screening to underserved regions - **Standardization**: Reduces inter-reader variability in TI-RADS scoring ### Limitations 1. **Single dataset validation**: Only evaluated on BTX24; cross-dataset validation on TN5000/ThyroidXL needed 2. **Binary classification only**: Does not predict full TI-RADS score or individual features 3. **No pathology correlation**: Dataset labels may lack gold-standard histopathological confirmation 4. **Test-validation gap**: 98.7% test AUC vs 89.1% validation AUC suggests potential distribution differences 5. **Regulatory**: Research model only; not FDA/CE approved --- ## How to Use ```python from transformers import pipeline classifier = pipeline("image-classification", model="Johnyquest7/ML-Inter_thyroid") result = classifier("thyroid_ultrasound.jpg") print(result) # [{'label': 'benign', 'score': 0.92}, {'label': 'malignant', 'score': 0.08}] ``` --- ## Repository Contents | File | Description | |------|-------------| | `train_thyroid.py` | Full training script with SwinV2 fine-tuning | | `evaluate_simple.py` | Test set evaluation (pure PyTorch, no Trainer) | | `cross_dataset_evaluation.py` | Cross-dataset evaluation framework | | `generate_gradcam_locally.py` | Grad-CAM visualization generator | | `thyroid_metrics.json` | Complete test set metrics (JSON) | | `blog_post.md` | Detailed technical blog post | | `physician-guide.md` | Guide for clinicians replicating this workflow | --- ## Citation ```bibtex @misc{mlinter_thyroid_2026, title={Thyroid Ultrasound Nodule Malignancy Classification with SwinV2}, author={Johnyquest7}, year={2026}, howpublished={\url{https://huggingface.co/Johnyquest7/ML-Inter_thyroid}} } ``` --- *This project was developed as part of the ML-Intern program. Model: [Johnyquest7/ML-Inter_thyroid](https://huggingface.co/Johnyquest7/ML-Inter_thyroid). Scripts: [thyroid-training-scripts](https://huggingface.co/Johnyquest7/thyroid-training-scripts).*