# 🧠 DualMedBERT: Dual-Teacher Distilled Biomedical Classifier

DualMedBERT is a fast and reliable biomedical text classifier trained using **dual-teacher knowledge distillation** from BERT-base and PubMedBERT into a lightweight DistilBERT model enhanced with LoRA.

---

# 🚀 Key Highlights

* ⚡ **~1.8× faster** than BERT-base
* 🧠 Retains **~98.5% of BERT performance**
* 🎯 Combines general + biomedical knowledge via dual-teacher KD
* 📈 Confidence calibration with XGBoost (AUROC ≈ 0.89)
* 🔬 Designed for **27-class disease classification**

---

# 🧩 Model Architecture

## Student Model

* Backbone: `distilbert-base-uncased`
* LoRA:
  * Rank: **r = 8**
  * Alpha: **α = 32**
  * Applied to layers **2–5**
* Additional:
  * Layer **1 partially unfrozen**
  * Pooling: CLS + attention pooling
  * Head: dense classifier (27 classes)

---
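The LoRA configuration above can be illustrated with a minimal numerical sketch of a single adapted linear layer, using the card's r = 8 and α = 32. The 768-dimensional sizes, initialization scales, and variable names are illustrative assumptions, not the model's actual internals:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 768, 768, 8, 32  # r = 8, alpha = 32 as listed above

W0 = rng.normal(scale=0.02, size=(d_out, d_in))  # frozen pretrained weight
A = rng.normal(scale=0.02, size=(r, d_in))       # trainable low-rank factor A
B = np.zeros((d_out, r))                         # B starts at zero, so the
                                                 # adapter is a no-op at init

def lora_linear(x):
    # y = x W0^T + (alpha / r) * x A^T B^T: only A and B are trained
    return x @ W0.T + (alpha / r) * (x @ A.T) @ B.T

x = rng.normal(size=(1, d_in))
# At initialization the adapted layer matches the frozen layer exactly
assert np.allclose(lora_linear(x), x @ W0.T)
```

Because B is zero-initialized, training starts from the pretrained behavior, and the scaling factor α / r = 4 controls how strongly the learned low-rank update can move it.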

## Teachers

| Teacher    | Role                           |
| ---------- | ------------------------------ |
| BERT-base  | General language understanding |
| PubMedBERT | Biomedical domain knowledge    |

---

# 🧪 Training Method

## Dual-Teacher Knowledge Distillation

Loss:

$$
L = \alpha \cdot L_{KD} + (1 - \alpha) \cdot L_{Focal}
$$

Where:

* KD uses **two teachers**
* Teacher weights are determined via **entropy-based confidence**
* Temperature: **T = 4.0**
* α (KD balance): **0.6**

---
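The loss above can be sketched end to end in NumPy. The exact way the two teachers are blended by entropy-based confidence, and the focal focusing parameter γ = 2.0, are assumptions about details the card does not specify:

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def entropy(p):
    return -(p * np.log(p + 1e-12)).sum(axis=-1)

def dual_teacher_kd_loss(student_logits, bert_logits, pubmed_logits, labels,
                         T=4.0, alpha=0.6, gamma=2.0):
    # Soften both teachers at temperature T, then weight them per example by
    # entropy-based confidence (lower entropy -> larger weight).
    p_b, p_p = softmax(bert_logits, T), softmax(pubmed_logits, T)
    w = softmax(np.stack([-entropy(p_b), -entropy(p_p)], axis=-1))
    target = w[..., 0:1] * p_b + w[..., 1:2] * p_p

    # KD term: cross-entropy to the blended soft targets, scaled by T^2.
    log_s = np.log(softmax(student_logits, T) + 1e-12)
    l_kd = -(target * log_s).sum(axis=-1).mean() * T * T

    # Focal term on the hard labels (Lin et al., 2017).
    p_true = softmax(student_logits)[np.arange(len(labels)), labels]
    l_focal = (-((1 - p_true) ** gamma) * np.log(p_true + 1e-12)).mean()

    return alpha * l_kd + (1 - alpha) * l_focal

rng = np.random.default_rng(0)
s_logits = rng.normal(size=(4, 27))
bert_logits = rng.normal(size=(4, 27))
pub_logits = rng.normal(size=(4, 27))
labels = np.array([0, 5, 12, 26])
loss = dual_teacher_kd_loss(s_logits, bert_logits, pub_logits, labels)
```

Note that with α = 0.6 the soft-target term dominates, which matches the card's emphasis on transferring teacher knowledge over fitting hard labels alone.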

# 📈 Confidence Calibration (XGBoost)

A post-hoc XGBoost calibrator predicts whether each model prediction is likely to be correct.

### Features (31 total)

* 27 softmax probabilities
* Max probability
* Entropy
* Top-2 gap
* Top-3 sum

---
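The 31 calibrator features listed above can be assembled from a single 27-way softmax vector as follows (a sketch; the exact feature order used by the released calibrator is an assumption):

```python
import numpy as np

def calibration_features(probs):
    # probs: 27-way softmax output for one prediction
    p = np.sort(probs)[::-1]                      # probabilities, descending
    ent = -(probs * np.log(probs + 1e-12)).sum()  # predictive entropy
    # 27 raw probabilities + max prob, entropy, top-2 gap, top-3 sum = 31
    return np.concatenate([probs, [p[0], ent, p[0] - p[1], p[:3].sum()]])

logits = np.random.default_rng(0).normal(size=27)
probs = np.exp(logits) / np.exp(logits).sum()
feats = calibration_features(probs)
assert feats.shape == (31,)
```

These per-prediction vectors, paired with correct/incorrect labels on a validation split, form the training set for the XGBoost calibrator.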

# 📊 Results

| Model           | Macro F1   | Accuracy   | Latency    |
| --------------- | ---------- | ---------- | ---------- |
| BERT-base       | 0.8333     | 0.835      | ~16–18 ms  |
| PubMedBERT      | 0.8553     | 0.855      | ~16–18 ms  |
| **DualMedBERT** | **0.8207** | **0.8226** | **~10 ms** |

---

## 📉 Calibration

* AUROC: **0.898–0.903**
* Reliability detection: **~83%**

---

# ⚙️ Training Details

* Optimizer: AdamW
* Learning rate: **2e-4 (student)**
* Weight decay: **0.1**
* Epochs: 12
* KD temperature: 4.0
* LoRA dropout: 0.05

---
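In PyTorch terms, the hyperparameters above map to roughly the following setup. This is a config sketch only; `student` is a placeholder module standing in for the LoRA-adapted DistilBERT:

```python
import torch

student = torch.nn.Linear(768, 27)  # placeholder for the real student model

# Only trainable parameters (LoRA adapters, layer 1, pooling, head) are updated.
optimizer = torch.optim.AdamW(
    (p for p in student.parameters() if p.requires_grad),
    lr=2e-4,
    weight_decay=0.1,
)

T = 4.0          # KD temperature
KD_ALPHA = 0.6   # KD / focal balance
EPOCHS = 12
LORA_DROPOUT = 0.05
```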

# ⚠️ Important Notes

* Slight (~1–2%) performance drop vs. BERT-base
* Adaptive teacher weights showed **limited variation (~0.45 / 0.55)**
* The model prioritizes **speed + reliability over peak accuracy**

---

# 📚 Dataset

UCI Drug Review Dataset (Gräßer et al., 2018)

---

# 📖 Citation

If you use this model, please cite:

* Hinton et al., 2015 – Knowledge Distillation
* Hu et al., 2022 – LoRA
* Sanh et al., 2019 – DistilBERT
* Devlin et al., 2018 – BERT
* Gu et al., 2021 – PubMedBERT
* Lin et al., 2017 – Focal Loss
* Chen & Guestrin, 2016 – XGBoost
* Gräßer et al., 2018 – Dataset

---

# 🏁 Summary

DualMedBERT demonstrates that:

> A distilled model can retain **~98.5% of BERT's performance** while achieving a **~1.8× speedup** and improved reliability via calibration.

---