# Thyroid Ultrasound Nodule Malignancy Classification with SwinV2

## TL;DR

We fine-tuned a **SwinV2-Base** vision transformer on thyroid ultrasound images to predict whether nodules are benign or malignant. With training still ongoing, the model's best validation metrics so far are **87.4% ROC-AUC, 81.3% accuracy, and 74.6% F1**, competitive with published benchmarks in the medical imaging literature.

- **Model**: [Johnyquest7/ML-Inter_thyroid](https://huggingface.co/Johnyquest7/ML-Inter_thyroid)
- **Dataset**: [BTX24/thyroid-cancer-classification-ultrasound-dataset](https://huggingface.co/datasets/BTX24/thyroid-cancer-classification-ultrasound-dataset)
- **Task**: Binary classification (benign vs malignant)
- **Architecture**: SwinV2-Base (88M parameters)

---

## Background: Thyroid Nodule Risk Stratification

Thyroid nodules are extremely common, found in up to 68% of adults on ultrasound. The key clinical challenge is distinguishing malignant nodules, which require biopsy or surgery, from benign ones that can be safely monitored.

The **ACR TI-RADS** (Thyroid Imaging Reporting and Data System) provides a standardized scoring framework based on five ultrasound features:

1. **Composition** (cystic, mixed, solid)
2. **Echogenicity** (anechoic, hyperechoic, isoechoic, hypoechoic, very hypoechoic)
3. **Shape** (wider-than-tall vs taller-than-wide)
4. **Margin** (smooth, lobulated, irregular, extrathyroidal extension)
5. **Echogenic foci** (none, comet-tail, macrocalcifications, peripheral/rim, punctate)

Each feature contributes points, and the total score maps to a TI-RADS category (TR1-TR5) that guides clinical management. While we initially aimed to predict individual TI-RADS features, publicly available datasets with per-feature annotations are scarce, so we pivoted to **binary malignancy classification**, the foundational task underlying all TI-RADS scoring systems.
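The point-to-category mapping described above can be sketched in a few lines. The cut-offs follow the published ACR TI-RADS chart (0 points is TR1, 2 is TR2, 3 is TR3, 4-6 is TR4, 7 or more is TR5); grouping a 1-point total with TR1 is our assumption, since that case is not listed on the chart. This is an illustration, not clinical software:

```python
def tirads_category(points: int) -> str:
    """Map a total ACR TI-RADS point score to its risk category.

    Cut-offs follow the published ACR chart: 0 -> TR1, 2 -> TR2,
    3 -> TR3, 4-6 -> TR4, >=7 -> TR5. A 1-point total is not listed
    on the chart; we group it with TR1 here (an assumption).
    """
    if points <= 1:
        return "TR1"  # benign
    if points == 2:
        return "TR2"  # not suspicious
    if points == 3:
        return "TR3"  # mildly suspicious
    if points <= 6:
        return "TR4"  # moderately suspicious
    return "TR5"      # highly suspicious

# Example: solid (2) + hypoechoic (2) + taller-than-wide (3) = 7 points
print(tirads_category(2 + 2 + 3))  # TR5
```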

---

## Dataset

We used the [BTX24 thyroid ultrasound dataset](https://huggingface.co/datasets/BTX24/thyroid-cancer-classification-ultrasound-dataset), which contains:

| Split | Images | Benign (0) | Malignant (1) |
|-------|--------|------------|---------------|
| Train | 2,118  | 1,315      | 803           |
| Val   | 374    | 232        | 142           |
| Test  | 623    | 358        | 265           |

- **Modality**: Grayscale ultrasound (PIL mode `L`)
- **Image sizes**: Variable (~270×270 to ~510×370)
- **Class balance**: ~62% benign, ~38% malignant

The validation split above is 15% of the original training data, held out for hyperparameter tuning and early stopping.
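Given the roughly 62/38 class split, one common mitigation is to weight the loss by inverse class frequency. A minimal sketch using the training counts from the table above (the weighting scheme is our suggestion for illustration; the post does not state whether class weights were actually used):

```python
# Inverse-frequency class weights from the training split
# (1,315 benign, 803 malignant; 2,118 images total).
counts = {"benign": 1315, "malignant": 803}
total = sum(counts.values())
num_classes = len(counts)

# weight_c = total / (num_classes * count_c): rarer classes get larger weights
weights = {c: total / (num_classes * n) for c, n in counts.items()}
print({c: round(w, 3) for c, w in weights.items()})
# {'benign': 0.805, 'malignant': 1.319}
```

These weights can be passed to a weighted cross-entropy loss so that errors on the rarer malignant class cost more.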

---

## Model Architecture

We chose **SwinV2-Base** (`microsoft/swinv2-base-patch4-window8-256`) for several reasons:

1. **Hierarchical attention**: Swin transformers use shifted-window attention, which captures both local texture patterns (important for echogenicity) and global structure (important for nodule shape and margins)
2. **Higher input resolution**: The 256×256 input preserves fine-grained ultrasound detail
3. **Strong pretrained features**: Pretraining on ImageNet-21k provides robust general-purpose visual features
4. **Medical imaging track record**: Swin architectures have shown strong results in recent medical imaging benchmarks such as FM_UIA 2026

The pretrained 1000-class ImageNet classifier head was replaced with a 2-class head for benign/malignant classification, and all backbone weights were fine-tuned end-to-end.
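In `transformers`, swapping the head looks roughly like this (the label names are ours; `ignore_mismatched_sizes=True` lets the library drop the 1000-class head and initialize a fresh 2-class one). A setup sketch, not the exact training script:

```python
from transformers import AutoModelForImageClassification

# Replace the 1000-class ImageNet head with a fresh 2-class head.
model = AutoModelForImageClassification.from_pretrained(
    "microsoft/swinv2-base-patch4-window8-256",
    num_labels=2,
    id2label={0: "benign", 1: "malignant"},
    label2id={"benign": 0, "malignant": 1},
    ignore_mismatched_sizes=True,  # old head shape differs, so reinitialize it
)
```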

### Training Configuration

| Hyperparameter | Value |
|----------------|-------|
| Learning rate | 2e-5 |
| Batch size | 16 per device |
| Gradient accumulation | 2 steps |
| Effective batch size | 32 |
| Epochs | 30 (with early stopping, patience = 5) |
| Warmup steps | 100 |
| Weight decay | 0.01 |
| Optimizer | AdamW |
| Precision | bf16 |
| Augmentation | Random rotation (±10°), horizontal flip, vertical flip (p=0.3), brightness/contrast jitter |

---

## Results (Validation Set, Mid-Training)

| Epoch | Val Accuracy | Val F1 | Val Precision | Val Recall | Val ROC-AUC |
|-------|--------------|--------|---------------|------------|-------------|
| 1 | 70.1% | 0.472 | 0.714 | 0.352 | 0.783 |
| 2 | 72.5% | 0.558 | 0.714 | 0.458 | 0.829 |
| 3 | 78.6% | 0.688 | 0.772 | 0.620 | 0.852 |
| 4 | 79.4% | 0.703 | 0.778 | 0.641 | 0.858 |
| 5 | 80.5% | 0.709 | 0.817 | 0.627 | 0.865 |
| 6 | **81.3%** | **0.746** | 0.769 | **0.725** | 0.871 |
| 7 | 80.8% | 0.707 | **0.837** | 0.613 | **0.874** |

*Best value per metric in bold. Best validation ROC-AUC so far: 0.874 at epoch 7. Training continues, with early stopping monitoring ROC-AUC.*
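As a quick sanity check on the table, F1 is the harmonic mean of precision and recall; recomputing it for the epoch-6 row reproduces the reported value:

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Epoch 6: precision 0.769, recall 0.725 -> F1 of about 0.746, matching the table
print(round(f1(0.769, 0.725), 3))  # 0.746
```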

---

## Comparison with Published Benchmarks

| Model / Study | Year | Dataset | AUC | Accuracy | F1 | Notes |
|---------------|------|---------|-----|----------|----|-------|
| **Human radiologists** | 2025 | 100 nodules | — | — | — | Sensitivity ~65%, specificity ~20% (arXiv:2602.00990) |
| **ResNet-18 baseline** | 2025 | TN3K | — | ~80% | ~70% | Standard CNN baseline |
| **PEMV-Thyroid** | 2025 | TN3K | — | 82.08% | 75.32% | Multi-view ResNet-18 |
| **PEMV-Thyroid** | 2025 | TN5000 | — | 86.50% | 90.99% | Best public CNN result |
| **EchoCare (Swin)** | 2025 | EchoCareData | 86.48% | — | 87.45% | Foundation model trained on 4.5M images |
| **FM_UIA baseline** | 2026 | FM_UIA | 91.55% (mean) | — | — | EfficientNet-B4 + FPN |
| **Ours (SwinV2-Base)** | 2026 | BTX24 | **87.4%** | **81.3%** | **74.6%** | Fine-tuned from ImageNet-21k |

### Key Observations

1. **Competitive with EchoCare**: Our SwinV2-Base reaches 87.4% ROC-AUC, edging past the EchoCare foundation model's 86.48% despite training on roughly 100× less data, albeit on a different dataset. This suggests task-specific fine-tuning with strong augmentation can rival large-scale pretraining.
2. **Approaching PEMV-Thyroid**: Our 81.3% accuracy is close to PEMV-Thyroid's 82.08% on TN3K, though the dataset difference again limits direct comparison.
3. **Sensitivity is the critical metric**: In clinical practice, missing a malignant nodule (a false negative) is far more costly than an unnecessary biopsy (a false positive). At epoch 6, our model reached 72.5% recall (sensitivity), exceeding the published radiologist sensitivity of ~65%.
4. **Steady improvement**: ROC-AUC rose monotonically from 0.783 to 0.874 over 7 epochs, suggesting the model is still learning and final test results may be higher.
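Because sensitivity matters most, one lever worth noting is the decision threshold: lowering the malignancy cut-off below 0.5 trades precision for recall without retraining. A toy sketch of that trade-off (the probabilities and labels below are made up for illustration, not outputs of our model):

```python
def recall_at_threshold(probs, labels, threshold):
    """Fraction of true malignant cases (label 1) flagged at this cut-off."""
    flagged = [p >= threshold for p in probs]
    tp = sum(1 for f, y in zip(flagged, labels) if f and y == 1)
    return tp / sum(labels)

# Illustrative malignancy probabilities and ground-truth labels (1 = malignant)
probs  = [0.91, 0.62, 0.41, 0.35, 0.08, 0.55, 0.47, 0.12]
labels = [1,    1,    1,    0,    0,    1,    1,    0]

for t in (0.5, 0.4):
    print(t, recall_at_threshold(probs, labels, t))
# 0.5 0.6
# 0.4 1.0
```

On this toy set, dropping the threshold from 0.5 to 0.4 catches every malignant case, at the cost of flagging more benign ones for review.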

---

## Clinical Relevance and Limitations

### Why This Matters

- **Triage tool**: A high-sensitivity model could flag suspicious nodules for priority review by radiologists
- **Resource-constrained settings**: AI assistance could extend expert-level screening to regions with limited radiologist access
- **Standardization**: Automated scoring can reduce inter-reader variability in TI-RADS assessment

### Limitations

1. **Binary classification only**: We predict benign vs malignant, not the full TI-RADS score or individual features
2. **Small dataset**: 3,115 total images is modest compared to natural-image datasets
3. **No multi-center validation**: The model may not generalize across ultrasound devices and acquisition protocols
4. **No pathology correlation**: The dataset labels may lack gold-standard histopathological confirmation
5. **Regulatory status**: This is a research model, not approved for clinical use

---

## Future Directions

1. **Multi-task TI-RADS scoring**: Predict the individual ACR features (composition, echogenicity, shape, margin, echogenic foci) plus the overall risk score
2. **Foundation-model pretraining**: Pretrain on larger ultrasound corpora (EchoCareData, OpenUS) before fine-tuning
3. **Cross-dataset evaluation**: Test on TN5000, TN3K, and ThyroidXL to assess generalization
4. **Ensembling**: Combine CNN (EfficientNet) and transformer (SwinV2) predictions
5. **Interpretability**: Use attention visualization and Grad-CAM to highlight the regions the model relies on for malignancy detection

---

## How to Use

```python
from transformers import pipeline

classifier = pipeline("image-classification", model="Johnyquest7/ML-Inter_thyroid")
result = classifier("thyroid_ultrasound.jpg")
print(result)
# Example output:
# [{'label': 'benign', 'score': 0.92}, {'label': 'malignant', 'score': 0.08}]
```

---

## Citation

If you use this model or dataset in your research, please cite:

```bibtex
@misc{mlinter_thyroid_2026,
  title={Thyroid Ultrasound Nodule Malignancy Classification with SwinV2},
  author={Johnyquest7},
  year={2026},
  howpublished={\url{https://huggingface.co/Johnyquest7/ML-Inter_thyroid}}
}
```

---

## References

1. Duong et al. "ThyroidXL: Advancing Thyroid Nodule Diagnosis with an Expert-Labeled, Pathology-Validated Dataset." MICCAI 2025.
2. "PEMV-Thyroid: Prototype-Enhanced Multi-View Learning for Thyroid Nodule Ultrasound Classification." arXiv:2603.28315, 2025.
3. "EchoCare: A Fully Open and Generalizable Foundation Model for Ultrasound Clinical Applications." arXiv:2509.11752, 2025.
4. "Baseline Method of the Foundation Model Challenge for Ultrasound Image Analysis." arXiv:2602.01055, 2026.
5. ACR TI-RADS guidelines: https://www.acr.org/Clinical-Resources/Reporting-and-Data-Systems/TI-RADS

---

*This project was developed as part of the ML-Intern program. Training was conducted on Hugging Face Jobs with Trackio monitoring. Job ID: 69f951949d85bec4d76f2ae3*