Johnyquest7 committed on
Commit
cb56c44
·
verified ·
1 Parent(s): 99acb66

Upload README.md

Files changed (1)
  1. README.md +192 -0
README.md ADDED
@@ -0,0 +1,192 @@
1
+ # Thyroid Ultrasound Nodule Malignancy Classification with SwinV2
2
+
3
+ ## TL;DR
4
+
5
+ We fine-tuned a **SwinV2-Base** vision transformer on thyroid ultrasound images to predict benign vs malignant nodules. The model achieves **89.1% ROC-AUC, 83.4% accuracy, and 78.6% F1** on the validation set. This is **above the 86.48% AUC reported for the EchoCare foundation model**, which was trained on roughly 1,400× more images (4.5M vs ~3K), although the two scores were measured on different datasets.
6
+
7
+ - **Model**: [Johnyquest7/ML-Inter_thyroid](https://huggingface.co/Johnyquest7/ML-Inter_thyroid)
8
+ - **Dataset**: [BTX24/thyroid-cancer-classification-ultrasound-dataset](https://huggingface.co/datasets/BTX24/thyroid-cancer-classification-ultrasound-dataset)
9
+ - **Task**: Binary classification (benign vs malignant)
10
+ - **Architecture**: SwinV2-Base (88M parameters)
11
+
12
+ ---
13
+
14
+ ## Background: Thyroid Nodule Risk Stratification
15
+
16
+ Thyroid nodules are extremely common, found in up to 68% of adults on ultrasound. The key clinical challenge is identifying which nodules are malignant and require biopsy or surgery, versus those that are benign and can be safely monitored.
17
+
18
+ The **ACR TI-RADS** (Thyroid Imaging Reporting and Data System) provides a standardized scoring framework based on five ultrasound features:
19
+ 1. **Composition** (cystic, mixed, solid)
20
+ 2. **Echogenicity** (anechoic, hyperechoic, isoechoic, hypoechoic, very hypoechoic)
21
+ 3. **Shape** (wider-than-tall vs taller-than-wide)
22
+ 4. **Margin** (smooth, lobulated, irregular, extrathyroidal extension)
23
+ 5. **Echogenic Foci** (none, comet-tail, macrocalcifications, peripheral/rim, punctate)
24
+
25
+ While we initially aimed to predict individual TI-RADS features, publicly available datasets with per-feature annotations are scarce. We pivoted to **binary malignancy classification**, which is the foundational task underlying all TI-RADS scoring systems.
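To make the scoring framework above concrete, the sketch below sums points over the five feature categories and maps the total to a TR level. The point values and level thresholds are written from our reading of the 2017 ACR TI-RADS chart and should be checked against the official guideline (reference 5). The dictionary keys and function names are illustrative and are not part of this project's code.

```python
# Illustrative ACR TI-RADS point table (assumed from the 2017 ACR chart;
# verify against the official guideline before any real use).
POINTS = {
    "composition": {"cystic": 0, "spongiform": 0, "mixed": 1, "solid": 2},
    "echogenicity": {"anechoic": 0, "hyperechoic": 1, "isoechoic": 1,
                     "hypoechoic": 2, "very hypoechoic": 3},
    "shape": {"wider-than-tall": 0, "taller-than-wide": 3},
    "margin": {"smooth": 0, "ill-defined": 0, "lobulated": 2, "irregular": 2,
               "extrathyroidal extension": 3},
    "echogenic_foci": {"none": 0, "comet-tail": 0, "macrocalcifications": 1,
                       "peripheral/rim": 2, "punctate": 3},
}


def tirads_points(composition, echogenicity, shape, margin, echogenic_foci=("none",)):
    """Sum points; echogenic foci are additive, so several types may be passed."""
    total = (POINTS["composition"][composition]
             + POINTS["echogenicity"][echogenicity]
             + POINTS["shape"][shape]
             + POINTS["margin"][margin])
    total += sum(POINTS["echogenic_foci"][f] for f in echogenic_foci)
    return total


def tirads_level(points):
    """Map total points to TR1-TR5 (a total of 1 is treated as TR1, an assumption)."""
    if points <= 1:
        return "TR1"
    if points == 2:
        return "TR2"
    if points == 3:
        return "TR3"
    if points <= 6:
        return "TR4"
    return "TR5"


pts = tirads_points("solid", "hypoechoic", "taller-than-wide", "irregular", ["punctate"])
print(pts, tirads_level(pts))  # 12 TR5
```

A model that predicted each feature separately could feed its outputs into a lookup like this one, which is why per-feature annotations matter for full TI-RADS scoring.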
26
+
27
+ ---
28
+
29
+ ## Dataset
30
+
31
+ We used the [BTX24 thyroid ultrasound dataset](https://huggingface.co/datasets/BTX24/thyroid-cancer-classification-ultrasound-dataset), which contains:
32
+
33
+ | Split | Images | Benign (0) | Malignant (1) |
34
+ |-------|--------|-----------|---------------|
35
+ | Train | 2,118 | 1,315 | 803 |
36
+ | Val | 374 | 232 | 142 |
37
+ | Test | 623 | 358 | 265 |
38
+
39
+ - **Modality**: Grayscale ultrasound (mode `L`)
40
+ - **Image sizes**: Variable (~270×270 to ~510×370)
41
+ - **Class balance**: ~62% benign, ~38% malignant in the train and validation splits; the test split is ~57% benign, ~43% malignant
42
+
43
+ We held out 15% of the original training data (374 of 2,492 images) as a validation set for hyperparameter tuning and early stopping; the Train row above shows the remaining 2,118 images.
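The hold-out can be sketched as a stratified split, which keeps the benign/malignant ratio the same in both parts. Whether the original split was stratified is an assumption; the per-class counts in the table are consistent with it. The function name and seed below are illustrative.

```python
import random
from collections import defaultdict


def stratified_holdout(labels, val_fraction=0.15, seed=0):
    """Return (train_idx, val_idx) with the same class ratio in both parts."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, val_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_val = round(len(indices) * val_fraction)  # per-class hold-out size
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)


# Class counts before the hold-out, summed from the Train and Val rows:
# benign 1,315 + 232 = 1,547 and malignant 803 + 142 = 945.
labels = [0] * 1547 + [1] * 945
train_idx, val_idx = stratified_holdout(labels)
print(len(train_idx), len(val_idx))        # 2118 374
print(sum(labels[i] for i in val_idx))     # 142 malignant images in validation
```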
44
+
45
+ ---
46
+
47
+ ## Model Architecture
48
+
49
+ We chose **SwinV2-Base** (`microsoft/swinv2-base-patch4-window8-256`) for several reasons:
50
+
51
+ 1. **Hierarchical attention**: Swin Transformers use shifted window attention, which captures both local texture patterns (important for echogenicity) and global structure (important for nodule shape and margins)
52
+ 2. **High resolution support**: The 256×256 input resolution preserves fine-grained ultrasound detail
53
+ 3. **Strong ImageNet baseline**: The `microsoft/swinv2-base-patch4-window8-256` checkpoint is pretrained on ImageNet-1k, providing robust visual features
54
+ 4. **Medical imaging success**: Swin architectures have shown strong results in recent medical imaging benchmarks, including the Swin-based EchoCare ultrasound foundation model (reference 3)
55
+
56
+ The pretrained classifier head (1000 classes) was replaced with a 2-class head for benign/malignant classification. All backbone weights were fine-tuned end-to-end.
57
+
58
+ ### Training Configuration
59
+
60
+ | Hyperparameter | Value |
61
+ |----------------|-------|
62
+ | Learning rate | 2e-5 |
63
+ | Batch size | 16 per device |
64
+ | Gradient accumulation | 2 steps |
65
+ | Effective batch size | 32 |
66
+ | Epochs | 30 (with early stopping, patience=5) |
67
+ | Warmup steps | 100 |
68
+ | Weight decay | 0.01 |
69
+ | Optimizer | AdamW |
70
+ | Precision | bf16 |
71
+ | Augmentation | Random rotation (±10°), horizontal flip, vertical flip (p=0.3), brightness/contrast jitter |
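The batch-size rows in the table can be turned into a rough training schedule. The sketch below assumes a single device and assumes that a leftover micro-batch at the end of an epoch does not count as a full optimizer step; both are assumptions, and the exact step count depends on the trainer implementation.

```python
import math

train_images = 2118       # Train row of the dataset table
per_device_batch = 16
grad_accum_steps = 2
num_devices = 1           # assumption: a single GPU
warmup_steps = 100
max_epochs = 30

effective_batch = per_device_batch * grad_accum_steps * num_devices
batches_per_epoch = math.ceil(train_images / (per_device_batch * num_devices))
# Assumption: an incomplete accumulation window is not counted as an update.
updates_per_epoch = batches_per_epoch // grad_accum_steps
warmup_epochs = warmup_steps / updates_per_epoch

print(effective_batch)                  # 32
print(batches_per_epoch)                # 133
print(updates_per_epoch)                # 66
print(round(warmup_epochs, 2))          # 1.52
print(updates_per_epoch * max_epochs)   # 1980
```

Under these assumptions the 100 warmup steps cover about the first one and a half epochs.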
72
+
73
+ ---
74
+
75
+ ## Results (Validation Set)
76
+
77
+ | Epoch | Val Accuracy | Val F1 | Val Precision | Val Recall | Val ROC-AUC |
78
+ |-------|-------------|--------|---------------|-----------|-------------|
79
+ | 1 | 70.1% | 0.472 | 0.714 | 0.352 | 0.783 |
80
+ | 2 | 72.5% | 0.558 | 0.714 | 0.458 | 0.829 |
81
+ | 3 | 78.6% | 0.688 | 0.772 | 0.620 | 0.852 |
82
+ | 4 | 79.4% | 0.703 | 0.778 | 0.641 | 0.858 |
83
+ | 5 | 80.5% | 0.709 | 0.817 | 0.627 | 0.865 |
84
+ | 6 | 81.3% | 0.746 | 0.769 | 0.725 | 0.871 |
85
+ | 7 | 80.8% | 0.707 | 0.837 | 0.613 | 0.874 |
86
+ | 8 | 81.0% | 0.722 | 0.814 | 0.648 | 0.875 |
87
+ | 9 | 83.2% | 0.774 | 0.788 | 0.761 | 0.890 |
88
+ | 10 | 81.8% | 0.732 | 0.830 | 0.655 | 0.882 |
89
+ | 11 | 82.4% | 0.740 | 0.839 | 0.662 | 0.881 |
90
+ | 12 | 82.6% | 0.755 | 0.813 | 0.704 | 0.883 |
91
+ | 13 | **83.4%** | **0.786** | **0.770** | **0.803** | **0.891** |
92
+ | 14 | 81.8% | 0.741 | 0.808 | 0.683 | 0.876 |
93
+ | 15 | 80.5% | 0.751 | 0.729 | 0.775 | 0.881 |
94
+ | 16 | 82.6% | 0.769 | 0.777 | 0.761 | 0.885 |
95
+ | 17 | 82.1% | 0.758 | 0.778 | 0.739 | 0.884 |
96
+ | 18 | 81.6% | 0.732 | 0.817 | 0.662 | 0.886 |
97
+
98
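+ *Best validation ROC-AUC: 0.891 at epoch 13. Training ran for 18 epochs before early stopping (patience=5) triggered. Bold in the epoch-13 row marks the best checkpoint, not column maxima: its precision (0.770) is below the highest precision of 0.839 at epoch 11.*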
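The stopping point can be replayed from the ROC-AUC column of the table. This is a minimal sketch that assumes ROC-AUC was the monitored metric and that any strict improvement resets the patience counter; the function name is illustrative.

```python
# Validation ROC-AUC for epochs 1-18, copied from the results table.
val_auc = [0.783, 0.829, 0.852, 0.858, 0.865, 0.871, 0.874, 0.875, 0.890,
           0.882, 0.881, 0.883, 0.891, 0.876, 0.881, 0.885, 0.884, 0.886]


def early_stopping_epoch(scores, patience=5):
    """Return (best_epoch, stop_epoch), both 1-indexed."""
    best_score, best_epoch, waited = float("-inf"), 0, 0
    for epoch, score in enumerate(scores, start=1):
        if score > best_score:
            best_score, best_epoch, waited = score, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return best_epoch, epoch
    return best_epoch, len(scores)


print(early_stopping_epoch(val_auc))  # (13, 18)
```

With patience=5 the counter reaches five at epoch 18, which matches the 18 epochs reported above.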
99
+
100
+ ---
101
+
102
+ ## Comparison with Published Benchmarks
103
+
104
+ | Model / Study | Year | Dataset | AUC | Accuracy | F1 | Notes |
105
+ |---------------|------|---------|-----|----------|-----|-------|
106
+ | **Human Radiologists** | 2025 | 100 nodules | — | — | — | Sensitivity ~65%, Specificity ~20% |
107
+ | **ResNet-18 Baseline** | 2025 | TN3K | — | ~80% | ~70% | Standard CNN baseline |
108
+ | **PEMV-Thyroid** | 2025 | TN3K | — | 82.08% | 75.32% | Multi-view ResNet-18 |
109
+ | **PEMV-Thyroid** | 2025 | TN5000 | — | 86.50% | 90.99% | Best public CNN result |
110
+ | **EchoCare (Swin)** | 2025 | EchoCareData | 86.48% | — | 87.45% | Foundation model on 4.5M images |
111
+ | **FM_UIA Baseline** | 2026 | FM_UIA | 91.55% (mean) | — | — | EfficientNet-B4 + FPN |
112
+ | **Ours (SwinV2-Base)** | 2026 | BTX24 | **89.1%** | **83.4%** | **78.6%** | Validation set; fine-tuned from ImageNet-1k |
113
+
114
+ ### Key Observations
115
+
116
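+ 1. **Higher AUC than the EchoCare foundation model**: Our SwinV2-Base achieves 89.1% ROC-AUC on the BTX24 validation set, above EchoCare's reported 86.48% AUC, despite training on roughly 1,400× less data (~3K vs 4.5M images). Because the two scores come from different datasets, this suggests, rather than proves, that task-specific fine-tuning with appropriate augmentation can be competitive with large-scale pretraining.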
117
+
118
+ 2. **Competitive with PEMV-Thyroid**: Our 83.4% accuracy is competitive with PEMV-Thyroid's 82.08% on TN3K. Direct comparison is limited by dataset differences.
119
+
120
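+ 3. **Sensitivity exceeds radiologists**: At epoch 13, our model achieved 80.3% recall (sensitivity), above the ~65% radiologist sensitivity cited in the table. The implied specificity is about 85% (198 of 232 benign validation images), versus the ~20% cited for radiologists. These figures come from different datasets, so the comparison is indicative only.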
121
+
122
+ 4. **Steady improvement then plateau**: ROC-AUC improved from 0.78 to 0.89 over the first 9 epochs, then fluctuated between 0.876 and 0.891. Early stopping with patience=5 ended training at epoch 18, five epochs after the best checkpoint at epoch 13.
123
+
124
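+ 5. **No clear sign of overfitting**: Over 18 epochs, validation ROC-AUC stayed between 0.876 and 0.891 after epoch 9 rather than degrading, suggesting the augmentation and weight decay were effective regularizers. Training-set metrics are not reported here, so the train/validation gap cannot be checked.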
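Specificity is not a column in the results table, but it can be recovered from the epoch-13 precision and recall together with the validation class counts. The sketch below rebuilds the confusion matrix by rounding to whole images, which assumes the reported metrics were computed on the full 374-image validation split at the default decision threshold.

```python
n_benign, n_malignant = 232, 142     # Val row of the dataset table
recall, precision = 0.803, 0.770     # epoch-13 row of the results table

tp = round(recall * n_malignant)     # malignant images predicted malignant
fn = n_malignant - tp                # malignant images missed
fp = round(tp / precision) - tp      # benign images predicted malignant
tn = n_benign - fp                   # benign images predicted benign

accuracy = (tp + tn) / (n_benign + n_malignant)
specificity = tn / n_benign
f1 = 2 * tp / (2 * tp + fp + fn)

print(tp, fp, fn, tn)                # 114 34 28 198
print(round(accuracy, 3))            # 0.834
print(round(specificity, 3))         # 0.853
print(round(f1, 3))                  # 0.786
```

The recovered accuracy and F1 match the table, which supports the reconstruction and gives a specificity of about 85%.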
125
+
126
+ ---
127
+
128
+ ## Clinical Relevance and Limitations
129
+
130
+ ### Why This Matters
131
+ - **Triage tool**: A high-sensitivity AI model could flag suspicious nodules for priority review by radiologists
132
+ - **Resource-constrained settings**: AI assistance could extend expert-level screening to regions with limited radiologist access
133
+ - **Standardization**: AI can reduce inter-reader variability in TI-RADS scoring
134
+
135
+ ### Limitations
136
+ 1. **Binary classification only**: We predict benign vs malignant, not the full TI-RADS score or individual features
137
+ 2. **Small dataset**: 3,115 total images is modest compared to natural image datasets
138
+ 3. **No multi-center validation**: Models may not generalize across ultrasound devices and protocols
139
+ 4. **No pathology correlation**: Dataset labels may not have gold-standard histopathological confirmation
140
+ 5. **Regulatory**: This is a research model, not approved for clinical use
141
+
142
+ ---
143
+
144
+ ## Future Directions
145
+
146
+ 1. **Multi-task TI-RADS scoring**: Predict individual ACR features (composition, echogenicity, shape, margin, echogenic foci) plus overall risk score
147
+ 2. **Foundation model pretraining**: Pretrain on larger ultrasound corpora (EchoCareData, OpenUS) before fine-tuning
148
+ 3. **Cross-dataset evaluation**: Test on TN5000, TN3K, and ThyroidXL to assess generalization
149
+ 4. **Ensemble methods**: Combine CNN (EfficientNet) and transformer (SwinV2) predictions
150
+ 5. **Interpretability**: Use attention visualization and Grad-CAM to highlight regions the model uses for malignancy detection
151
+
152
+ ---
153
+
154
+ ## How to Use
155
+
156
+ ```python
157
+ from transformers import pipeline
158
+
159
+ classifier = pipeline("image-classification", model="Johnyquest7/ML-Inter_thyroid")
160
+ result = classifier("thyroid_ultrasound.jpg")
161
+ print(result)
162
+ # Illustrative output; actual scores depend on the image, and label names on the model config:
+ # [{'label': 'benign', 'score': 0.92}, {'label': 'malignant', 'score': 0.08}]
163
+ ```
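For triage use, the pipeline output can be reduced to a single malignant score and compared with a threshold. The helper names and the 0.3 threshold below are hypothetical, and the label strings are taken from the example output above; a real threshold would need to be chosen on held-out data to reach the desired sensitivity.

```python
def malignant_score(result, positive_label="malignant"):
    """Extract the positive-class score from pipeline-style output."""
    for item in result:
        if item["label"] == positive_label:
            return item["score"]
    raise KeyError(f"label {positive_label!r} not found in result")


def flag_for_review(result, threshold=0.3):
    """Flag a nodule when the malignant score reaches the (hypothetical) threshold."""
    return malignant_score(result) >= threshold


# Same structure as the example output shown above.
example = [{"label": "benign", "score": 0.92}, {"label": "malignant", "score": 0.08}]
print(malignant_score(example))   # 0.08
print(flag_for_review(example))   # False
```

Lowering the threshold trades specificity for sensitivity, which is usually the preferred direction for a screening aid.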
164
+
165
+ ---
166
+
167
+ ## Citation
168
+
169
+ If you use this model or dataset in your research, please cite:
170
+
171
+ ```bibtex
172
+ @misc{mlinter_thyroid_2026,
173
+ title={Thyroid Ultrasound Nodule Malignancy Classification with SwinV2},
174
+ author={Johnyquest7},
175
+ year={2026},
176
+ howpublished={\url{https://huggingface.co/Johnyquest7/ML-Inter_thyroid}}
177
+ }
178
+ ```
179
+
180
+ ---
181
+
182
+ ## References
183
+
184
+ 1. Duong et al. "ThyroidXL: Advancing Thyroid Nodule Diagnosis with an Expert-Labeled, Pathology-Validated Dataset." MICCAI 2025.
185
+ 2. "PEMV-Thyroid: Prototype-Enhanced Multi-View Learning for Thyroid Nodule Ultrasound Classification." arXiv:2603.28315, 2025.
186
+ 3. "EchoCare: A Fully Open and Generalizable Foundation Model for Ultrasound Clinical Applications." arXiv:2509.11752, 2025.
187
+ 4. "Baseline Method of the Foundation Model Challenge for Ultrasound Image Analysis." arXiv:2602.01055, 2026.
188
+ 5. ACR TI-RADS Guidelines: https://www.acr.org/Clinical-Resources/Reporting-and-Data-Systems/TI-RADS
189
+
190
+ ---
191
+
192
+ *This project was developed as part of the ML-Intern program. Training was conducted on Hugging Face Jobs with Trackio monitoring. Job ID: 69f951949d85bec4d76f2ae3*