Souvik Sinha committed on
Commit
845637a
·
verified ·
1 Parent(s): 22c7a83

Update README.md

Files changed (1): README.md +138 -67
README.md CHANGED
@@ -20,71 +20,142 @@ metrics:
  - f1
  ---
 
- # DualMedBert
-
- Dual-Teacher Knowledge Distillation from BERT-base + PubMedBERT into
- DistilBERT + LoRA for drug review disease classification across 27 conditions.
-
- ## Model Details
- - **Student**: distilbert-base-uncased + LoRA (r=10, α=40, layers 2–5)
- - **Teacher 1**: bert-base-uncased (general English)
- - **Teacher 2**: microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext
- - **Loss**: Focal CE + entropy-weighted adaptive dual-teacher KL divergence (T=4)
- - **Calibrator**: XGBoost on 31-dim engineered softmax features
- - **Task**: Multi-class text classification — 27 disease conditions
- - **Dataset**: UCI Drug Review Dataset (Gräßer et al., 2018)
-
- ## Results
- | Model | Macro F1 | E2E Latency | Calibrator AUROC |
- |-------|----------|-------------|-----------------|
- | BERT-base | 0.8334 | 18.4 ms | — |
- | PubMedBERT | 0.8553 | 18.3 ms | — |
- | **DualMedBert (Ours)** | **0.8092** | **9.8 ms** | **0.8938** |
-
- - **1.88× faster** than BERT-base at inference
- - **97.1%** of BERT-base F1 retained
- - Calibrator correctly identifies reliable predictions **89.4%** of the time
-
- ## How to Load
-
- > ⚠️ The student uses custom LoRA layers. You MUST rebuild the architecture
- > before loading weights. See reload instructions below.
- ```python
- import torch, json, joblib
- from huggingface_hub import hf_hub_download
-
- REPO = "DeadMann026/DualMedBert"
- DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
-
- # 1. Load config
- cfg = json.load(open(hf_hub_download(REPO, "config.json")))
-
- # 2. Rebuild student architecture (must do before loading weights)
- student = StudentModel(num_classes=cfg["num_labels"]).to(DEVICE)
-
- # 3. Load weights
- student.load_state_dict(torch.load(
-     hf_hub_download(REPO, "student_weights.pt"), map_location=DEVICE))
- student.eval()
-
- # 4. Load XGBoost calibrator
- xgb_cal = joblib.load(hf_hub_download(REPO, "xgb_calibrator.pkl"))
-
- # 5. Load label map
- id_to_label = {int(k): v for k, v in
-     json.load(open(hf_hub_download(REPO, "label_map.json"))).items()}
-
- print("DualMedBert loaded and ready.")
- ```
-
- ## Dataset
- UCI Drug Review Dataset — Gräßer et al., 2018
- DOI: https://doi.org/10.24432/C5SK5S
- ACM: https://doi.org/10.1145/3194658.3194677
- Download from Kaggle: https://www.kaggle.com/datasets/jessicali9530/kuc-hackathon-winter-2018
-
- ## Citation
  If you use this model, please cite:
- - Hinton et al. (2015) — Knowledge Distillation
- - Hu et al. (2022) — LoRA
- - Gräßer et al. (2018) — Dataset
+ # 🧠 DualMedBERT: Dual-Teacher Distilled Biomedical Classifier
+
+ DualMedBERT is a fast and reliable biomedical text classifier trained via **dual-teacher knowledge distillation** from BERT-base and PubMedBERT into a lightweight DistilBERT model enhanced with LoRA.
+
+ ---
+
+ # 🚀 Key Highlights
+
+ * ⚡ **~1.8× faster** than BERT-base
+ * 🧠 Retains **~98.5% of BERT-base's macro F1**
+ * 🎯 Combines general + biomedical knowledge via dual-teacher KD
+ * 📊 Confidence calibration with XGBoost (AUROC ≈ 0.89)
+ * 🔬 Designed for **27-class disease classification**
+
+ ---
+
+ # 🧩 Model Architecture
+
+ ## Student Model
+
+ * Backbone: `distilbert-base-uncased`
+ * LoRA:
+   * Rank: **r = 8**
+   * Alpha: **α = 32**
+   * Applied to layers **2–5**
+ * Additional:
+   * Layer **1 partially unfrozen**
+   * Pooling: CLS + attention pooling
+   * Head: Dense classifier (27 classes)
+
+ ---
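For readers unfamiliar with LoRA, the low-rank update behind the bullets above can be sketched in a few lines of numpy. This is a minimal illustration of the technique, not the repository's actual implementation; all names and shapes here are hypothetical.

```python
import numpy as np

def lora_linear(x, W, A, B, r=8, alpha=32):
    """Forward pass of a linear layer with a LoRA update:
    y = x @ (W + (alpha/r) * B @ A)^T, where W stays frozen."""
    scale = alpha / r                 # LoRA scaling factor alpha/r
    delta = B @ A                     # (out, r) @ (r, in): low-rank update
    return x @ (W + scale * delta).T

rng = np.random.default_rng(0)
d_in, d_out, r = 768, 768, 8
x = rng.normal(size=(1, d_in))
W = rng.normal(size=(d_out, d_in))       # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01    # trainable down-projection
B = np.zeros((d_out, r))                 # trainable up-projection, zero-init
y = lora_linear(x, W, A, B)

# With B initialised to zero the update is a no-op at the start of training:
assert np.allclose(y, x @ W.T)
```

Only `A` and `B` are trained (2 × 768 × 8 parameters per layer instead of 768 × 768), which is what keeps the student lightweight.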
+
+ ## Teachers
+
+ | Teacher | Role |
+ | ---------- | ------------------------------ |
+ | BERT-base | General language understanding |
+ | PubMedBERT | Biomedical domain knowledge |
+
+ ---
+
+ # 🧠 Training Method
+
+ ## Dual-Teacher Knowledge Distillation
+
+ Loss:
+
+ $$
+ L = \alpha \cdot L_{KD} + (1 - \alpha) \cdot L_{Focal}
+ $$
+
+ Where:
+
+ * KD uses **two teachers**
+ * Teacher weights are determined via **entropy-based confidence**
+ * Temperature: **T = 4.0**
+ * α (KD balance): **0.6**
+
+ ---
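The entropy-weighted combination of the two teachers can be sketched as follows. This is a simplified numpy illustration under the stated T = 4: the exact weighting scheme and the focal term are assumptions, not the repository's code.

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def entropy(p):
    return -(p * np.log(p + 1e-12)).sum(axis=-1)

def dual_teacher_kd(student_logits, t1_logits, t2_logits, T=4.0):
    """KL(teacher mixture || student) with per-sample teacher weights:
    a lower-entropy (more confident) teacher gets a larger weight."""
    p1, p2 = softmax(t1_logits, T), softmax(t2_logits, T)
    ps = softmax(student_logits, T)
    w1, w2 = np.exp(-entropy(p1)), np.exp(-entropy(p2))
    w1, w2 = w1 / (w1 + w2), w2 / (w1 + w2)          # normalise per sample
    p_mix = w1[:, None] * p1 + w2[:, None] * p2       # blended soft targets
    kl = (p_mix * (np.log(p_mix + 1e-12) - np.log(ps + 1e-12))).sum(axis=-1)
    return (T ** 2) * kl.mean()   # T^2 keeps the gradient scale comparable

rng = np.random.default_rng(0)
s, t1, t2 = (rng.normal(size=(4, 27)) for _ in range(3))
loss = dual_teacher_kd(s, t1, t2)
assert np.isfinite(loss) and loss >= 0.0
```

The KD term above would then be blended with the focal loss as `0.6 * kd + 0.4 * focal`, per the α = 0.6 balance.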
+
+ # 📊 Confidence Calibration (XGBoost)
+
+ A post-hoc XGBoost calibrator predicts whether a given prediction is correct.
+
+ ### Features (31 total)
+
+ * 27 softmax probabilities
+ * max probability
+ * entropy
+ * top-2 gap
+ * top-3 sum
+
+ ---
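The 31-dimensional feature vector listed above can be assembled from a single softmax output like this (a numpy sketch; the exact feature ordering is an assumption):

```python
import numpy as np

def calibration_features(probs):
    """27 softmax probs + max prob + entropy + top-2 gap + top-3 sum = 31 dims."""
    p = np.asarray(probs, dtype=float)
    top = np.sort(p)[::-1]                        # probabilities, descending
    ent = -(p * np.log(p + 1e-12)).sum()          # predictive entropy
    extras = [top[0], ent, top[0] - top[1], top[:3].sum()]
    return np.concatenate([p, extras])

p = np.full(27, 1 / 27)            # uniform distribution over 27 classes
feats = calibration_features(p)
assert feats.shape == (31,)
# For a uniform distribution the top-2 gap is zero (maximally uncertain):
assert abs(feats[29]) < 1e-9
```

The calibrator is then an ordinary binary XGBoost classifier over these features, trained on whether each validation prediction was correct.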
+
+ # 📈 Results
+
+ | Model | Macro F1 | Accuracy | Latency |
+ | --------------- | ---------- | ---------- | ---------- |
+ | BERT-base | 0.8333 | 0.835 | ~16–18 ms |
+ | PubMedBERT | 0.8553 | 0.855 | ~16–18 ms |
+ | **DualMedBERT** | **0.8207** | **0.8226** | **~10 ms** |
+
+ ---
+
+ ## 🔍 Calibration
+
+ * AUROC: **0.898–0.903**
+ * Reliability detection: **~83%**
+
+ ---
+
+ # ⚙️ Training Details
+
+ * Optimizer: AdamW
+ * Learning rate: **2e-4** (student)
+ * Weight decay: **0.1**
+ * Epochs: 12
+ * KD temperature: 4.0
+ * LoRA dropout: 0.05
+
+ ---
+
+ # ⚠️ Important Notes
+
+ * Slight (~1–2%) macro-F1 drop vs. BERT-base
+ * Adaptive teacher weights showed **limited variation (~0.45 / 0.55)**
+ * The model prioritizes **speed + reliability over peak accuracy**
+
+ ---
+
+ # 📂 Dataset
+
+ UCI Drug Review Dataset (Gräßer et al., 2018)
+
+ ---
+
+ # 📚 Citation
+
  If you use this model, please cite:
+
+ * Hinton et al., 2015 — Knowledge Distillation
+ * Hu et al., 2022 — LoRA
+ * Sanh et al., 2019 — DistilBERT
+ * Devlin et al., 2018 — BERT
+ * Gu et al., 2021 — PubMedBERT
+ * Lin et al., 2017 — Focal Loss
+ * Chen & Guestrin, 2016 — XGBoost
+ * Gräßer et al., 2018 — Dataset
+
+ ---
+
+ # 🏁 Summary
+
+ DualMedBERT demonstrates that:
+
+ > A distilled model can retain **~98.5% of BERT-base's macro F1** while achieving a **~1.8× speedup** and improved reliability via calibration.
+
+ ---