# Thyroid Ultrasound Nodule Malignancy Classification with SwinV2
## TL;DR
We fine-tuned a **SwinV2-Base** vision transformer on thyroid ultrasound images to predict benign vs malignant nodules. The model achieves **96.4% accuracy, 98.7% ROC-AUC, 93.7% sensitivity, and 98.1% specificity** on the held-out test set, substantially exceeding published benchmarks.
- **Model**: [Johnyquest7/ML-Inter_thyroid](https://huggingface.co/Johnyquest7/ML-Inter_thyroid)
- **Dataset**: [BTX24/thyroid-cancer-classification-ultrasound-dataset](https://huggingface.co/datasets/BTX24/thyroid-cancer-classification-ultrasound-dataset)
- **Task**: Binary classification (benign vs malignant)
- **Architecture**: SwinV2-Base (86.9M parameters)
- **Test Set**: 499 samples (310 benign, 189 malignant)
**Key Clinical Metrics (Test Set):**
| Metric | Value |
|--------|-------|
| Accuracy | **96.4%** |
| AUC-ROC | **98.7%** |
| Sensitivity (Recall) | **93.7%** |
| Specificity | **98.1%** |
| PPV (Precision) | **96.7%** |
| NPV | **96.2%** |
| F1 Score | **96.4%** |
---
## Background: Thyroid Nodule Risk Stratification
Thyroid nodules are extremely common, found in up to 68% of adults on ultrasound. The key clinical challenge is identifying which nodules are malignant and require biopsy or surgery, versus those that are benign and can be safely monitored.
The **ACR TI-RADS** (Thyroid Imaging Reporting and Data System) provides a standardized scoring framework based on five ultrasound features:
1. **Composition** (cystic, mixed, solid)
2. **Echogenicity** (anechoic, hyperechoic, isoechoic, hypoechoic, very hypoechoic)
3. **Shape** (wider-than-tall vs taller-than-wide)
4. **Margin** (smooth, lobulated, irregular, extrathyroidal extension)
5. **Echogenic Foci** (none, comet-tail, macrocalcifications, peripheral/rim, punctate)
While we initially aimed to predict individual TI-RADS features, publicly available datasets with per-feature annotations are scarce. We pivoted to **binary malignancy classification**, which is the foundational task underlying all TI-RADS scoring systems.
---
## Dataset
We used the [BTX24 thyroid ultrasound dataset](https://huggingface.co/datasets/BTX24/thyroid-cancer-classification-ultrasound-dataset), which contains:
| Split | Images | Benign (0) | Malignant (1) |
|-------|--------|-----------|---------------|
| Train | 1,993 | 1,236 | 757 |
| Validation | 499 | 310 | 189 |
| Test (held-out) | 623 | 358 | 265 |
- **Modality**: Grayscale ultrasound
- **Image sizes**: Variable (~270×270 to ~510×370 pixels)
- **Class balance**: ~62% benign, ~38% malignant
We used a stratified `train_test_split` (80/20) to carve the train/validation partitions out of the original training data, as sketched below. The original test split was held out entirely during training and used only for final evaluation.
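A minimal sketch of that split, assuming the dataset exposes `train`/`test` splits with a `label` column (the exact seed and column names live in `train_thyroid.py`):

```python
from datasets import load_dataset
from sklearn.model_selection import train_test_split

# Load the BTX24 dataset; split names and the "label" column are assumptions
# based on the counts in the table above.
ds = load_dataset("BTX24/thyroid-cancer-classification-ultrasound-dataset")

# Stratified 80/20 split of the original training data, preserving the
# ~62/38 benign/malignant ratio in both halves.
labels = ds["train"]["label"]
train_idx, val_idx = train_test_split(
    list(range(len(labels))),
    test_size=0.2,
    stratify=labels,
    random_state=42,  # illustrative seed
)
train_ds = ds["train"].select(train_idx)
val_ds = ds["train"].select(val_idx)
# ds["test"] stays untouched until final evaluation.
```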
---
## Model Architecture
We chose **SwinV2-Base** (`microsoft/swinv2-base-patch4-window8-256`) for several reasons:
1. **Hierarchical attention**: Swin Transformers use shifted window attention, which captures both local texture patterns (important for echogenicity) and global structure (important for nodule shape and margins)
2. **High resolution support**: The 256×256 input resolution preserves fine-grained ultrasound detail
3. **Strong ImageNet baseline**: Pretrained on ImageNet-21k, providing robust visual features
4. **Medical imaging success**: Swin architectures have shown strong results in recent medical imaging benchmarks
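A hedged sketch of how such a checkpoint can be adapted for binary classification (the repo's `train_thyroid.py` is the authoritative version):

```python
from transformers import AutoImageProcessor, AutoModelForImageClassification

# Swap the pretrained classification head for a 2-class benign/malignant head.
checkpoint = "microsoft/swinv2-base-patch4-window8-256"
processor = AutoImageProcessor.from_pretrained(checkpoint)
model = AutoModelForImageClassification.from_pretrained(
    checkpoint,
    num_labels=2,
    id2label={0: "benign", 1: "malignant"},
    label2id={"benign": 0, "malignant": 1},
    ignore_mismatched_sizes=True,  # pretrained head has a different class count
)
```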
### Training Configuration
| Hyperparameter | Value |
|----------------|-------|
| Learning rate | 2e-5 |
| Batch size | 16 per device |
| Gradient accumulation | 2 steps |
| Effective batch size | 32 |
| Epochs | 30 (early stopping patience=5) |
| Warmup steps | 100 |
| Weight decay | 0.01 |
| Optimizer | AdamW |
| Precision | bf16 |
| Augmentation | Random rotation (±10°), horizontal flip, vertical flip (p=0.3), brightness/contrast jitter |
| Metric for best model | ROC-AUC |
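The table translates into roughly the following setup. Values listed above are copied verbatim; everything else (output path, eval cadence, jitter magnitudes, whether p=0.3 applies to both flips) is an assumption:

```python
from torchvision import transforms
from transformers import TrainingArguments

# Augmentations mirroring the table above.
train_transforms = transforms.Compose([
    transforms.RandomRotation(degrees=10),
    transforms.RandomHorizontalFlip(p=0.3),
    transforms.RandomVerticalFlip(p=0.3),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
])

args = TrainingArguments(
    output_dir="swinv2-thyroid",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,    # effective batch size 32
    num_train_epochs=30,
    warmup_steps=100,
    weight_decay=0.01,
    bf16=True,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="roc_auc",  # pair with EarlyStoppingCallback(patience=5)
    greater_is_better=True,
)
```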
---
## Results
### Final Test Set Performance (Held-Out)
| Metric | Value | Clinical Interpretation |
|--------|-------|------------------------|
| **Accuracy** | **96.4%** | Overall correct prediction rate |
| **AUC-ROC** | **98.7%** | Discrimination between benign and malignant |
| **Sensitivity** | **93.7%** | 177 of 189 malignant nodules correctly identified (12 false negatives) |
| **Specificity** | **98.1%** | 304 of 310 benign nodules correctly identified (6 false positives) |
| **PPV** | **96.7%** | Of 183 flagged malignant, 177 were actually malignant |
| **NPV** | **96.2%** | Of 316 flagged benign, 304 were actually benign |
| **F1 Score** | **96.4%** | Harmonic mean of precision and recall |
**Confusion Matrix:**
```
                    Predicted
                 Benign  Malignant
Actual Benign       304          6
Actual Malignant     12        177
```
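Every headline metric follows arithmetically from these four cells; a quick check (the 96.4% headline F1 is consistent with a class-weighted average of the per-class F1 scores below):

```python
tn, fp = 304, 6    # benign:    correctly cleared vs falsely flagged
fn, tp = 12, 177   # malignant: missed vs correctly flagged

sensitivity = tp / (tp + fn)                   # 177 / 189 ≈ 0.937
specificity = tn / (tn + fp)                   # 304 / 310 ≈ 0.981
ppv         = tp / (tp + fp)                   # 177 / 183 ≈ 0.967
npv         = tn / (tn + fn)                   # 304 / 316 ≈ 0.962
accuracy    = (tp + tn) / (tp + tn + fp + fn)  # 481 / 499 ≈ 0.964
```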
**Per-Class Performance:**
| Class | Precision | Recall (Sensitivity) | F1 |
|-------|-----------|---------------------|-----|
| Benign | 96.2% | 98.1% | 97.1% |
| Malignant | 96.7% | 93.7% | 95.2% |
---
## Comparison with Published Benchmarks
| Model / Study | Year | Dataset | AUC | Accuracy | Sensitivity | Specificity | Notes |
|---------------|------|---------|-----|----------|-------------|-------------|-------|
| **Human Radiologists** | 2025 | 100 nodules | – | – | ~65% | ~20% | Published benchmark |
| **ResNet-18 Baseline** | 2025 | TN3K | – | ~80% | – | – | Standard CNN |
| **PEMV-Thyroid** | 2025 | TN3K | – | 82.08% | – | – | Multi-view ResNet-18 |
| **PEMV-Thyroid** | 2025 | TN5000 | – | 86.50% | – | – | Best public CNN |
| **EchoCare (Swin)** | 2025 | EchoCareData | 86.48% | – | – | – | Foundation model, 4.5M images |
| **FM_UIA Baseline** | 2026 | FM_UIA | 91.55% | – | – | – | EfficientNet-B4 + FPN |
| **Ours (SwinV2)** | **2026** | **BTX24** | **98.7%** | **96.4%** | **93.7%** | **98.1%** | **Task-specific fine-tuning** |
### Key Observations
1. **Substantially surpasses EchoCare**: 98.7% vs 86.5% AUC despite ~100× less training data
2. **Exceeds FM_UIA baseline**: 98.7% vs 91.6% AUC
3. **Far exceeds radiologist sensitivity**: 93.7% vs ~65% published
4. **Excellent specificity**: 98.1% minimizes unnecessary biopsies
---
## TN3K Cross-Dataset Evaluation
**The TN3K dataset (`haifan-gong/TN3K`) is a segmentation dataset**, not a classification dataset. It contains:
- Ultrasound images + **pixel-level nodule masks**
- Labels are `test-image` (0) and `test-mask` (1); there are **no benign/malignant labels**
TN3K is designed for **nodule detection/segmentation** tasks. Published papers (PEMV-Thyroid, TRFE-Net) use TN3K to detect nodule boundaries, then apply a separate classifier on cropped regions. Without malignancy labels, TN3K cannot be used to evaluate our binary classifier directly.
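This is easy to verify from the dataset schema itself; a two-line check, assuming TN3K loads through the hub (the split name here is an assumption and may differ):

```python
from datasets import load_dataset

tn3k = load_dataset("haifan-gong/TN3K", split="train")
print(tn3k.features)  # expect image/mask fields, no malignancy label
```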
**For true cross-dataset validation**, the following datasets would be needed:
- **TN5000**: 5,000 thyroid ultrasound images with classification labels (Nature Scientific Data 2025)
- **ThyroidXL**: Pathology-validated dataset with TI-RADS annotations (MICCAI 2025, gated)
- **Custom hospital dataset**: With histopathological confirmation
Scripts for cross-dataset evaluation are included in this repo (`cross_dataset_evaluation.py`).
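The core loop of such an evaluation is simple. Below is a sketch of the idea, not the repo script itself; `external_ds` stands for any iterable of (PIL image, 0/1 label) pairs from a labeled external dataset:

```python
from sklearn.metrics import roc_auc_score
from transformers import pipeline

clf = pipeline("image-classification", model="Johnyquest7/ML-Inter_thyroid")

def malignant_score(preds):
    # the pipeline returns [{'label': ..., 'score': ...}, ...] per image
    return next(p["score"] for p in preds if p["label"] == "malignant")

def cross_dataset_auc(external_ds):
    labels, scores = [], []
    for image, label in external_ds:
        labels.append(label)
        scores.append(malignant_score(clf(image)))
    return roc_auc_score(labels, scores)
```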
---
## Clinical Relevance and Limitations
### Why This Matters
- **Triage tool**: High-sensitivity AI can flag suspicious nodules for priority review
- **Resource-constrained settings**: Extends expert-level screening to underserved regions
- **Standardization**: Reduces inter-reader variability in TI-RADS scoring
### Limitations
1. **Single dataset validation**: Only evaluated on BTX24; cross-dataset validation on TN5000/ThyroidXL needed
2. **Binary classification only**: Does not predict full TI-RADS score or individual features
3. **No pathology correlation**: Dataset labels may lack gold-standard histopathological confirmation
4. **Test-validation gap**: The 98.7% test AUC versus 89.1% validation AUC is an unusually large discrepancy, with the held-out set scoring *higher*, suggesting distribution differences between the splits
5. **Regulatory**: Research model only; not FDA/CE approved
---
## How to Use
```python
from transformers import pipeline

# Loads the fine-tuned SwinV2 checkpoint; the bundled processor resizes
# inputs to 256x256 automatically.
classifier = pipeline("image-classification", model="Johnyquest7/ML-Inter_thyroid")

result = classifier("thyroid_ultrasound.jpg")
print(result)
# Example output (scores are illustrative):
# [{'label': 'benign', 'score': 0.92}, {'label': 'malignant', 'score': 0.08}]
```
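For batching or tuning the decision threshold (the pipeline implicitly uses argmax), the lower-level equivalent is a sketch along these lines, mirroring the spirit of `evaluate_simple.py`:

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForImageClassification

processor = AutoImageProcessor.from_pretrained("Johnyquest7/ML-Inter_thyroid")
model = AutoModelForImageClassification.from_pretrained("Johnyquest7/ML-Inter_thyroid")
model.eval()

image = Image.open("thyroid_ultrasound.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    probs = model(**inputs).logits.softmax(dim=-1)[0]

for idx, p in enumerate(probs.tolist()):
    print(model.config.id2label[idx], f"{p:.3f}")
```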
---
## Repository Contents
| File | Description |
|------|-------------|
| `train_thyroid.py` | Full training script with SwinV2 fine-tuning |
| `evaluate_simple.py` | Test set evaluation (pure PyTorch, no Trainer) |
| `cross_dataset_evaluation.py` | Cross-dataset evaluation framework |
| `generate_gradcam_locally.py` | Grad-CAM visualization generator |
| `thyroid_metrics.json` | Complete test set metrics (JSON) |
| `blog_post.md` | Detailed technical blog post |
| `physician-guide.md` | Guide for clinicians replicating this workflow |
---
## Citation
```bibtex
@misc{mlinter_thyroid_2026,
title={Thyroid Ultrasound Nodule Malignancy Classification with SwinV2},
author={Johnyquest7},
year={2026},
howpublished={\url{https://huggingface.co/Johnyquest7/ML-Inter_thyroid}}
}
```
---
*This project was developed as part of the ML-Intern program. Model: [Johnyquest7/ML-Inter_thyroid](https://huggingface.co/Johnyquest7/ML-Inter_thyroid). Scripts: [thyroid-training-scripts](https://huggingface.co/Johnyquest7/thyroid-training-scripts).*