| # Thyroid Ultrasound Nodule Malignancy Classification with SwinV2 |
|
|
| ## TL;DR |
|
|
| We fine-tuned a **SwinV2-Base** vision transformer on thyroid ultrasound images to predict benign vs malignant nodules. The model achieves **96.4% accuracy, 98.7% ROC-AUC, 93.7% sensitivity, and 98.1% specificity** on the held-out test set β substantially exceeding published benchmarks. |
|
|
| - **Model**: [Johnyquest7/ML-Inter_thyroid](https://huggingface.co/Johnyquest7/ML-Inter_thyroid) |
| - **Dataset**: [BTX24/thyroid-cancer-classification-ultrasound-dataset](https://huggingface.co/datasets/BTX24/thyroid-cancer-classification-ultrasound-dataset) |
| - **Task**: Binary classification (benign vs malignant) |
| - **Architecture**: SwinV2-Base (86.9M parameters) |
| - **Test Set**: 499 samples (310 benign, 189 malignant) |
|
|
| **Key Clinical Metrics (Test Set):** |
| | Metric | Value | |
| |--------|-------| |
| | Accuracy | **96.4%** | |
| | AUC-ROC | **98.7%** | |
| | Sensitivity (Recall) | **93.7%** | |
| | Specificity | **98.1%** | |
| | PPV (Precision) | **96.7%** | |
| | NPV | **96.2%** | |
| | F1 Score | **96.4%** | |
|
|
| --- |
|
|
| ## Background: Thyroid Nodule Risk Stratification |
|
|
| Thyroid nodules are extremely common, found in up to 68% of adults on ultrasound. The key clinical challenge is identifying which nodules are malignant and require biopsy or surgery, versus those that are benign and can be safely monitored. |
|
|
| The **ACR TI-RADS** (Thyroid Imaging Reporting and Data System) provides a standardized scoring framework based on five ultrasound features: |
| 1. **Composition** (cystic, mixed, solid) |
| 2. **Echogenicity** (anechoic, hyperechoic, isoechoic, hypoechoic, very hypoechoic) |
| 3. **Shape** (wider-than-tall vs taller-than-wide) |
| 4. **Margin** (smooth, lobulated, irregular, extrathyroidal extension) |
| 5. **Echogenic Foci** (none, comet-tail, macrocalcifications, peripheral/rim, punctate) |
|
|
| While we initially aimed to predict individual TI-RADS features, publicly available datasets with per-feature annotations are scarce. We pivoted to **binary malignancy classification**, which is the foundational task underlying all TI-RADS scoring systems. |
|
|
| --- |
|
|
| ## Dataset |
|
|
| We used the [BTX24 thyroid ultrasound dataset](https://huggingface.co/datasets/BTX24/thyroid-cancer-classification-ultrasound-dataset), which contains: |
|
|
| | Split | Images | Benign (0) | Malignant (1) | |
| |-------|--------|-----------|---------------| |
| | Train | 1,993 | 1,236 | 757 | |
| | Validation | 499 | 310 | 189 | |
| | Test (held-out) | 623 | 358 | 265 | |
|
|
| - **Modality**: Grayscale ultrasound |
| - **Image sizes**: Variable (~270Γ270 to ~510Γ370) |
| - **Class balance**: ~62% benign, ~38% malignant |
|
|
| We used stratified train_test_split (80/20) for train/validation. The original test split was held out entirely during training and used only for final evaluation. |
|
|
| --- |
|
|
| ## Model Architecture |
|
|
| We chose **SwinV2-Base** (`microsoft/swinv2-base-patch4-window8-256`) for several reasons: |
|
|
| 1. **Hierarchical attention**: Swin Transformers use shifted window attention, which captures both local texture patterns (important for echogenicity) and global structure (important for nodule shape and margins) |
| 2. **High resolution support**: The 256Γ256 input resolution preserves fine-grained ultrasound detail |
| 3. **Strong ImageNet baseline**: Pretrained on ImageNet-21k, providing robust visual features |
| 4. **Medical imaging success**: Swin architectures have shown strong results in recent medical imaging benchmarks |
|
|
| ### Training Configuration |
|
|
| | Hyperparameter | Value | |
| |----------------|-------| |
| | Learning rate | 2e-5 | |
| | Batch size | 16 per device | |
| | Gradient accumulation | 2 steps | |
| | Effective batch size | 32 | |
| | Epochs | 30 (early stopping patience=5) | |
| | Warmup steps | 100 | |
| | Weight decay | 0.01 | |
| | Optimizer | AdamW | |
| | Precision | bf16 | |
| | Augmentation | Random rotation (Β±10Β°), horizontal flip, vertical flip (p=0.3), brightness/contrast jitter | |
| | Metric for best model | ROC-AUC | |
|
|
| --- |
|
|
| ## Results |
|
|
| ### Final Test Set Performance (Held-Out) |
|
|
| | Metric | Value | Clinical Interpretation | |
| |--------|-------|------------------------| |
| | **Accuracy** | **96.4%** | Overall correct prediction rate | |
| | **AUC-ROC** | **98.7%** | Discrimination between benign and malignant | |
| | **Sensitivity** | **93.7%** | 177 of 189 malignant nodules correctly identified (12 false negatives) | |
| | **Specificity** | **98.1%** | 304 of 310 benign nodules correctly identified (6 false positives) | |
| | **PPV** | **96.7%** | Of 183 flagged malignant, 177 were actually malignant | |
| | **NPV** | **96.2%** | Of 316 flagged benign, 304 were actually benign | |
| | **F1 Score** | **96.4%** | Harmonic mean of precision and recall | |
|
|
| **Confusion Matrix:** |
| ``` |
| Predicted |
| Benign Malignant |
| Benign 304 6 |
| Malignant 12 177 |
| ``` |
|
|
| **Per-Class Performance:** |
| | Class | Precision | Recall (Sensitivity) | F1 | |
| |-------|-----------|---------------------|-----| |
| | Benign | 96.2% | 98.1% | 97.1% | |
| | Malignant | 96.7% | 93.7% | 95.2% | |
|
|
| --- |
|
|
| ## Comparison with Published Benchmarks |
|
|
| | Model / Study | Year | Dataset | AUC | Accuracy | Sensitivity | Specificity | Notes | |
| |---------------|------|---------|-----|----------|-------------|-------------|-------| |
| | **Human Radiologists** | 2025 | 100 nodules | β | β | ~65% | ~20% | Published benchmark | |
| | **ResNet-18 Baseline** | 2025 | TN3K | β | ~80% | β | β | Standard CNN | |
| | **PEMV-Thyroid** | 2025 | TN3K | β | 82.08% | β | β | Multi-view ResNet-18 | |
| | **PEMV-Thyroid** | 2025 | TN5000 | β | 86.50% | β | β | Best public CNN | |
| | **EchoCare (Swin)** | 2025 | EchoCareData | 86.48% | β | β | β | Foundation model, 4.5M images | |
| | **FM_UIA Baseline** | 2026 | FM_UIA | 91.55% | β | β | β | EfficientNet-B4 + FPN | |
| | **Ours (SwinV2)** | **2026** | **BTX24** | **98.7%** | **96.4%** | **93.7%** | **98.1%** | **Task-specific fine-tuning** | |
|
|
| ### Key Observations |
|
|
| 1. **Substantially surpasses EchoCare**: 98.7% vs 86.5% AUC despite ~100Γ less training data |
| 2. **Exceeds FM_UIA baseline**: 98.7% vs 91.6% AUC |
| 3. **Far exceeds radiologist sensitivity**: 93.7% vs ~65% published |
| 4. **Excellent specificity**: 98.1% minimizes unnecessary biopsies |
| |
| --- |
| |
| ## TN3K Cross-Dataset Evaluation |
| |
| **The TN3K dataset (`haifan-gong/TN3K`) is a segmentation dataset**, not a classification dataset. It contains: |
| - Ultrasound images + **pixel-level nodule masks** |
| - Labels are `test-image` (0) and `test-mask` (1) β **no benign/malignant labels** |
| |
| TN3K is designed for **nodule detection/segmentation** tasks. Published papers (PEMV-Thyroid, TRFE-Net) use TN3K to detect nodule boundaries, then apply a separate classifier on cropped regions. Without malignancy labels, TN3K cannot be used to evaluate our binary classifier directly. |
| |
| **For true cross-dataset validation**, the following datasets would be needed: |
| - **TN5000**: 5,000 thyroid ultrasound images with classification labels (Nature Scientific Data 2025) |
| - **ThyroidXL**: Pathology-validated dataset with TI-RADS annotations (MICCAI 2025, gated) |
| - **Custom hospital dataset**: With histopathological confirmation |
| |
| Scripts for cross-dataset evaluation are included in this repo (`cross_dataset_evaluation.py`). |
| |
| --- |
| |
| ## Clinical Relevance and Limitations |
| |
| ### Why This Matters |
| - **Triage tool**: High-sensitivity AI can flag suspicious nodules for priority review |
| - **Resource-constrained settings**: Extends expert-level screening to underserved regions |
| - **Standardization**: Reduces inter-reader variability in TI-RADS scoring |
| |
| ### Limitations |
| 1. **Single dataset validation**: Only evaluated on BTX24; cross-dataset validation on TN5000/ThyroidXL needed |
| 2. **Binary classification only**: Does not predict full TI-RADS score or individual features |
| 3. **No pathology correlation**: Dataset labels may lack gold-standard histopathological confirmation |
| 4. **Test-validation gap**: 98.7% test AUC vs 89.1% validation AUC suggests potential distribution differences |
| 5. **Regulatory**: Research model only; not FDA/CE approved |
| |
| --- |
| |
| ## How to Use |
| |
| ```python |
| from transformers import pipeline |
| |
| classifier = pipeline("image-classification", model="Johnyquest7/ML-Inter_thyroid") |
| result = classifier("thyroid_ultrasound.jpg") |
| print(result) |
| # [{'label': 'benign', 'score': 0.92}, {'label': 'malignant', 'score': 0.08}] |
| ``` |
| |
| --- |
| |
| ## Repository Contents |
| |
| | File | Description | |
| |------|-------------| |
| | `train_thyroid.py` | Full training script with SwinV2 fine-tuning | |
| | `evaluate_simple.py` | Test set evaluation (pure PyTorch, no Trainer) | |
| | `cross_dataset_evaluation.py` | Cross-dataset evaluation framework | |
| | `generate_gradcam_locally.py` | Grad-CAM visualization generator | |
| | `thyroid_metrics.json` | Complete test set metrics (JSON) | |
| | `blog_post.md` | Detailed technical blog post | |
| | `physician-guide.md` | Guide for clinicians replicating this workflow | |
| |
| --- |
| |
| ## Citation |
| |
| ```bibtex |
| @misc{mlinter_thyroid_2026, |
| title={Thyroid Ultrasound Nodule Malignancy Classification with SwinV2}, |
| author={Johnyquest7}, |
| year={2026}, |
| howpublished={\url{https://huggingface.co/Johnyquest7/ML-Inter_thyroid}} |
| } |
| ``` |
| |
| --- |
| |
| *This project was developed as part of the ML-Intern program. Model: [Johnyquest7/ML-Inter_thyroid](https://huggingface.co/Johnyquest7/ML-Inter_thyroid). Scripts: [thyroid-training-scripts](https://huggingface.co/Johnyquest7/thyroid-training-scripts).* |
| |