# 🧠 DualMedBERT: Dual-Teacher Distilled Biomedical Classifier

DualMedBERT is a fast and reliable biomedical text classifier trained using **dual-teacher knowledge distillation** from BERT-base and PubMedBERT into a lightweight DistilBERT model enhanced with LoRA.

---

# 🚀 Key Highlights

* ⚡ **~1.8× faster** than BERT-base
* 🧠 Retains **~98.5% of BERT performance**
* 🎯 Combines general + biomedical knowledge via dual-teacher KD
* 📈 Confidence calibration with XGBoost (AUROC ≈ 0.89)
* 🔬 Designed for **27-class disease classification**

---

# 🧩 Model Architecture

## Student Model

* Backbone: `distilbert-base-uncased`
* LoRA:
  * Rank: **r = 8**
  * Alpha: **α = 32**
  * Applied to layers **2–5**
* Additional:
  * Layer **1 partially unfrozen**
  * Pooling: CLS + attention pooling
  * Head: dense classifier (27 classes)

---
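The LoRA configuration above can be illustrated with a minimal numerical sketch of a single adapted linear layer, using the card's r = 8 and α = 32. The 768-dimensional sizes, initialization scales, and variable names are illustrative assumptions, not the model's actual internals:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 768, 768, 8, 32  # r = 8, alpha = 32 as listed above

W0 = rng.normal(scale=0.02, size=(d_out, d_in))  # frozen pretrained weight
A = rng.normal(scale=0.02, size=(r, d_in))       # trainable low-rank factor A
B = np.zeros((d_out, r))                         # B starts at zero, so the
                                                 # adapter is a no-op at init

def lora_linear(x):
    # y = x W0^T + (alpha / r) * x A^T B^T: only A and B are trained
    return x @ W0.T + (alpha / r) * (x @ A.T) @ B.T

x = rng.normal(size=(1, d_in))
# At initialization the adapted layer matches the frozen layer exactly
assert np.allclose(lora_linear(x), x @ W0.T)
```

Because B is zero-initialized, training starts from the pretrained behavior, and the scaling factor α / r = 4 controls how strongly the learned low-rank update can move it.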

## Teachers

| Teacher    | Role                           |
| ---------- | ------------------------------ |
| BERT-base  | General language understanding |
| PubMedBERT | Biomedical domain knowledge    |

---

# 🧪 Training Method

## Dual-Teacher Knowledge Distillation

Loss:

$$
L = \alpha \cdot L_{KD} + (1 - \alpha) \cdot L_{Focal}
$$

Where:

* KD uses **two teachers**
* Teacher weights are determined via **entropy-based confidence**
* Temperature: **T = 4.0**
* α (KD balance): **0.6**

---
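The loss above can be sketched end to end in NumPy. The exact way the two teachers are blended by entropy-based confidence, and the focal focusing parameter γ = 2.0, are assumptions about details the card does not specify:

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def entropy(p):
    return -(p * np.log(p + 1e-12)).sum(axis=-1)

def dual_teacher_kd_loss(student_logits, bert_logits, pubmed_logits, labels,
                         T=4.0, alpha=0.6, gamma=2.0):
    # Soften both teachers at temperature T, then weight them per example by
    # entropy-based confidence (lower entropy -> larger weight).
    p_b, p_p = softmax(bert_logits, T), softmax(pubmed_logits, T)
    w = softmax(np.stack([-entropy(p_b), -entropy(p_p)], axis=-1))
    target = w[..., 0:1] * p_b + w[..., 1:2] * p_p

    # KD term: cross-entropy to the blended soft targets, scaled by T^2.
    log_s = np.log(softmax(student_logits, T) + 1e-12)
    l_kd = -(target * log_s).sum(axis=-1).mean() * T * T

    # Focal term on the hard labels (Lin et al., 2017).
    p_true = softmax(student_logits)[np.arange(len(labels)), labels]
    l_focal = (-((1 - p_true) ** gamma) * np.log(p_true + 1e-12)).mean()

    return alpha * l_kd + (1 - alpha) * l_focal

rng = np.random.default_rng(0)
s_logits = rng.normal(size=(4, 27))
bert_logits = rng.normal(size=(4, 27))
pub_logits = rng.normal(size=(4, 27))
labels = np.array([0, 5, 12, 26])
loss = dual_teacher_kd_loss(s_logits, bert_logits, pub_logits, labels)
```

Note that with α = 0.6 the soft-target term dominates, which matches the card's emphasis on transferring teacher knowledge over fitting hard labels alone.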

# 📈 Confidence Calibration (XGBoost)

A post-hoc XGBoost calibrator predicts whether each model prediction is likely to be correct.

### Features (31 total)

* 27 softmax probabilities
* Max probability
* Entropy
* Top-2 gap
* Top-3 sum

---
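The 31 calibrator features listed above can be assembled from a single 27-way softmax vector as follows (a sketch; the exact feature order used by the released calibrator is an assumption):

```python
import numpy as np

def calibration_features(probs):
    # probs: 27-way softmax output for one prediction
    p = np.sort(probs)[::-1]                      # probabilities, descending
    ent = -(probs * np.log(probs + 1e-12)).sum()  # predictive entropy
    # 27 raw probabilities + max prob, entropy, top-2 gap, top-3 sum = 31
    return np.concatenate([probs, [p[0], ent, p[0] - p[1], p[:3].sum()]])

logits = np.random.default_rng(0).normal(size=27)
probs = np.exp(logits) / np.exp(logits).sum()
feats = calibration_features(probs)
assert feats.shape == (31,)
```

These per-prediction vectors, paired with correct/incorrect labels on a validation split, form the training set for the XGBoost calibrator.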

# 📊 Results

| Model           | Macro F1   | Accuracy   | Latency    |
| --------------- | ---------- | ---------- | ---------- |
| BERT-base       | 0.8333     | 0.835      | ~16–18 ms  |
| PubMedBERT      | 0.8553     | 0.855      | ~16–18 ms  |
| **DualMedBERT** | **0.8207** | **0.8226** | **~10 ms** |

---

## 📉 Calibration

* AUROC: **0.898–0.903**
* Reliability detection: **~83%**

---

# ⚙️ Training Details

* Optimizer: AdamW
* Learning rate: **2e-4 (student)**
* Weight decay: **0.1**
* Epochs: 12
* KD temperature: 4.0
* LoRA dropout: 0.05

---
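In PyTorch terms, the hyperparameters above map to roughly the following setup. This is a config sketch only; `student` is a placeholder module standing in for the LoRA-adapted DistilBERT:

```python
import torch

student = torch.nn.Linear(768, 27)  # placeholder for the real student model

# Only trainable parameters (LoRA adapters, layer 1, pooling, head) are updated.
optimizer = torch.optim.AdamW(
    (p for p in student.parameters() if p.requires_grad),
    lr=2e-4,
    weight_decay=0.1,
)

T = 4.0          # KD temperature
KD_ALPHA = 0.6   # KD / focal balance
EPOCHS = 12
LORA_DROPOUT = 0.05
```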

# ⚠️ Important Notes

* Slight (~1–2%) performance drop vs. BERT-base
* Adaptive teacher weights showed **limited variation (~0.45 / 0.55)**
* The model prioritizes **speed + reliability over peak accuracy**

---

# 📚 Dataset

UCI Drug Review Dataset (Gräßer et al., 2018)

---

# 📖 Citation

If you use this model, please cite:

* Hinton et al., 2015 – Knowledge Distillation
* Hu et al., 2022 – LoRA
* Sanh et al., 2019 – DistilBERT
* Devlin et al., 2018 – BERT
* Gu et al., 2021 – PubMedBERT
* Lin et al., 2017 – Focal Loss
* Chen & Guestrin, 2016 – XGBoost
* Gräßer et al., 2018 – Dataset

---

# 🏁 Summary

DualMedBERT demonstrates that:

> A distilled model can retain **~98.5% of BERT's performance** while achieving a **~1.8× speedup** and improved reliability via calibration.

---