chrisvoncsefalvay committed
Commit 2ba2967 · verified · 1 Parent(s): bcff42b

Add comprehensive model card for CRAG-dual-encoder-base

  1. README.md +195 -0
README.md ADDED
@@ -0,0 +1,195 @@
+ ---
+ license: apache-2.0
+ language:
+ - en
+ tags:
+ - medical
+ - biomedical
+ - drug-safety
+ - adverse-drug-reactions
+ - pharmacovigilance
+ - relation-extraction
+ - dual-encoder
+ - clinical-nlp
+ - pubmedbert
+ datasets:
+ - ade-benchmark-corpus/ade_corpus_v2
+ metrics:
+ - f1
+ - roc_auc
+ pipeline_tag: text-classification
+ model-index:
+ - name: CRAG-dual-encoder-base
+   results:
+   - task:
+       type: text-classification
+       name: Drug-ADR Relation Extraction
+     dataset:
+       name: ADE Corpus V2
+       type: ade-benchmark-corpus/ade_corpus_v2
+       config: Ade_corpus_v2_drug_ade_relation
+     metrics:
+     - type: f1
+       value: 0.883
+       name: F1 Score
+ ---
+
+ # CRAG-dual-encoder-base
+
+ **CRAG: Causal Reasoning for Adversomics Graphs**
+
+ This is the base model in the CRAG dual-encoder family for drug-adverse drug reaction (ADR) relation extraction. It uses a dual-encoder architecture with PubMedBERT to score drug-ADR pairs for causal pharmacovigilance graph construction.
+
+ ## Model Description
+
+ CRAG-dual-encoder-base is designed to identify causal relationships between drugs and adverse drug reactions from biomedical text. Given a drug mention and an ADR mention in context, the model predicts whether they share a causal relationship.
+
+ ### Architecture
+
+ ```
+ ┌─────────────────────────────────────────────────────────────┐
+ │                   CRAG Dual-Encoder Base                    │
+ ├─────────────────────────────────────────────────────────────┤
+ │                                                             │
+ │    Drug Context          ADR Context                        │
+ │         │                     │                             │
+ │         ▼                     ▼                             │
+ │    ┌──────────┐          ┌──────────┐                       │
+ │    │PubMedBERT│          │PubMedBERT│  (separate weights)   │
+ │    │   Drug   │          │   ADR    │                       │
+ │    │ Encoder  │          │ Encoder  │                       │
+ │    └────┬─────┘          └────┬─────┘                       │
+ │         │                     │                             │
+ │         ▼                     ▼                             │
+ │     [CLS] Pool            [CLS] Pool                        │
+ │         │                     │                             │
+ │         └────────┬────────────┘                             │
+ │                  │                                          │
+ │                  ▼                                          │
+ │           ┌──────────────┐                                  │
+ │           │   Bilinear   │                                  │
+ │           │    Fusion    │                                  │
+ │           └──────┬───────┘                                  │
+ │                  │                                          │
+ │                  ▼                                          │
+ │           ┌──────────────┐                                  │
+ │           │   MLP Head   │                                  │
+ │           │   (256→1)    │                                  │
+ │           └──────┬───────┘                                  │
+ │                  │                                          │
+ │                  ▼                                          │
+ │              P(causal)                                      │
+ └─────────────────────────────────────────────────────────────┘
+ ```
+
+ - **Base Model:** `microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext`
+ - **Hidden Dimension:** 768
+ - **Fusion Dimension:** 256
+ - **Parameters:** ~220M (two separate BERT encoders)
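
The ~220M figure is consistent with two independent BERT-base encoders. As a rough check, the sketch below counts the parameters of one encoder from the standard BERT-base configuration (12 layers, hidden size 768, feed-forward size 3072, 512 positions) and doubles it. The vocabulary size of 30,522 is an assumption that this card does not state, and the fusion layer and MLP head are left out because their exact shapes are not specified here.

```python
def bert_encoder_params(vocab_size=30522, hidden=768, layers=12,
                        intermediate=3072, max_positions=512, type_vocab=2):
    """Count parameters of a standard BERT encoder (embeddings, layers, pooler).

    vocab_size is an assumed value, not taken from the model card.
    """
    layer_norm = 2 * hidden  # scale and bias
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    # Query, key, value and output projections, plus the attention LayerNorm
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    # Two dense layers (with biases), plus the output LayerNorm
    feed_forward = (hidden * intermediate + intermediate
                    + intermediate * hidden + hidden + layer_norm)
    pooler = hidden * hidden + hidden
    return embeddings + layers * (attention + feed_forward) + pooler


one_encoder = bert_encoder_params()
two_encoders = 2 * one_encoder
print(f"one encoder:  {one_encoder:,}")
print(f"two encoders: {two_encoders:,}")
```

Under these assumptions the two encoders account for roughly 219M parameters, which matches the rounded figure above.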
+
+ ### Training Procedure
+
+ The model was trained in two phases:
+
+ **Phase 1: Contrastive Pre-training (3 epochs)**
+ - InfoNCE loss with temperature τ=0.07
+ - Learns to bring true drug-ADR pairs close in embedding space
+ - Random negative sampling (mismatched pairs)
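
To make the Phase 1 objective concrete, here is a minimal pure-Python sketch of an InfoNCE loss with τ=0.07. It assumes cosine similarity between embeddings and treats the other ADR vectors in the batch as the mismatched negatives; both choices are illustrative assumptions, and the function name `info_nce_loss` is hypothetical rather than taken from the training script.

```python
import math


def info_nce_loss(drug_vecs, adr_vecs, temperature=0.07):
    """Mean InfoNCE loss over a batch.

    Row i of drug_vecs and row i of adr_vecs form a true pair; every other
    ADR vector in the batch serves as a negative (an assumption of this sketch).
    """
    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        return [x / norm for x in v]

    drugs = [normalize(v) for v in drug_vecs]
    adrs = [normalize(v) for v in adr_vecs]
    total = 0.0
    for i, drug in enumerate(drugs):
        # Cosine similarity to every ADR in the batch, scaled by temperature
        logits = [sum(a * b for a, b in zip(drug, adr)) / temperature
                  for adr in adrs]
        # Numerically stable log-sum-exp for the softmax denominator
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_denominator - logits[i]
    return total / len(drugs)


# Toy 2-d embeddings: matched pairs give a low loss, swapped pairs a high one
aligned = info_nce_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned, swapped)
```

The low temperature sharpens the softmax, so even a moderate similarity gap between the true pair and the negatives drives the loss close to zero.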
+
+ **Phase 2: Classification Fine-tuning (5 epochs)**
+ - Binary cross-entropy loss
+ - Balanced positive/negative samples
+ - Learning rate: 2e-5 with linear warmup
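
A linear warmup ramps the learning rate from near zero up to its peak over the first optimizer steps. The sketch below shows one way to express that for the 2e-5 peak; the warmup length of 100 steps and the constant rate after warmup are hypothetical choices, since this card does not state either.

```python
def learning_rate(step, peak_lr=2e-5, warmup_steps=100):
    """Linear warmup to peak_lr, then constant.

    warmup_steps and the post-warmup behaviour are assumptions for
    illustration; the model card only states the peak rate and the warmup.
    """
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    return peak_lr


schedule = [learning_rate(step) for step in range(200)]
print(schedule[0], schedule[49], schedule[199])
```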
+
+ ### Training Data
+
+ - **Dataset:** [ADE Corpus V2](https://huggingface.co/datasets/ade-benchmark-corpus/ade_corpus_v2)
+ - **Configuration:** `Ade_corpus_v2_drug_ade_relation`
+ - **Training Examples:** ~6,800 positive pairs + ~6,800 negative pairs
+ - **Validation Examples:** ~850 pairs
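
The balanced negative set can be pictured as one mismatched pair per positive pair. The following sketch is an illustration under assumptions, not the actual data pipeline: it pairs each drug with an ADR drawn at random from the other examples and rejects any combination that is itself a known positive. The function name `sample_negatives` and the three toy pairs are hypothetical.

```python
import random


def sample_negatives(positive_pairs, seed=0):
    """Create one mismatched (drug, ADR) pair per positive pair."""
    rng = random.Random(seed)
    known = set(positive_pairs)
    adrs = sorted({adr for _, adr in positive_pairs})
    negatives = []
    for drug, _ in positive_pairs:
        # Only ADRs that are not a known positive for this drug qualify
        candidates = [adr for adr in adrs if (drug, adr) not in known]
        if candidates:  # skip drugs already paired with every ADR
            negatives.append((drug, rng.choice(candidates)))
    return negatives


# Toy data for illustration only, not drawn from ADE Corpus V2
positives = [
    ("aspirin", "gastrointestinal bleeding"),
    ("warfarin", "haemorrhage"),
    ("metformin", "lactic acidosis"),
]
negatives = sample_negatives(positives)
print(negatives)
```

Random mismatches like these are easy negatives; the comparison table below notes that later CRAG variants switch to hard negatives.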
+
+ ## Performance
+
+ | Metric | Value |
+ |--------|-------|
+ | **F1 Score** | 88.3% |
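
For reference, F1 is the harmonic mean of precision and recall on the positive (causal) class. The short sketch below computes it from confusion counts; the counts are hypothetical and are not the evaluation results behind the figure above.

```python
def f1_score(true_positives, false_positives, false_negatives):
    """Harmonic mean of precision and recall for the positive class."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    if predicted == 0 or actual == 0 or true_positives == 0:
        return 0.0
    precision = true_positives / predicted
    recall = true_positives / actual
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for illustration
example_f1 = f1_score(true_positives=8, false_positives=2, false_negatives=2)
print(round(example_f1, 3))
```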
+
+ ### Comparison with CRAG Family
+
+ | Model | F1 | AUC | Key Features |
+ |-------|-----|-----|--------------|
+ | **CRAG-dual-encoder-base** | 88.3% | - | PubMedBERT, random negatives |
+ | CRAG-dual-encoder-ade | 97.5% | 99.1% | BioLinkBERT, hard negatives, focal loss |
+ | CRAG-dual-encoder-mimicause | 98.8% | 99.9% | + MIMICause causal reasoning |
+
+ ## Usage
+
+ ```python
+ import torch
+ from transformers import AutoTokenizer
+
+ # The model uses a custom architecture, so a DualEncoderModel class must be
+ # defined before the weights can be loaded (see the training script).
+
+ tokenizer = AutoTokenizer.from_pretrained("chrisvoncsefalvay/CRAG-dual-encoder-base")
+
+ # Example: score a drug-ADR pair
+ drug_context = "Patient was prescribed aspirin for pain management."
+ adr_context = "The patient experienced gastrointestinal bleeding."
+
+ # Tokenize each context separately, one input per encoder
+ tokenizer_kwargs = dict(return_tensors="pt", max_length=128, truncation=True, padding="max_length")
+ drug_inputs = tokenizer(drug_context, **tokenizer_kwargs)
+ adr_inputs = tokenizer(adr_context, **tokenizer_kwargs)
+
+ # Forward pass (pseudo-code: requires loading the custom model first)
+ # with torch.no_grad():
+ #     drug_repr = model.encode_drug(**drug_inputs)
+ #     adr_repr = model.encode_adr(**adr_inputs)
+ #     score = model.classify(drug_repr, adr_repr)
+ ```
+
+ ## Intended Uses
+
+ ### Primary Use Cases
+ - **Pharmacovigilance:** Automated extraction of drug-ADR relationships from literature
+ - **Causal Graph Construction:** Building drug-ADR knowledge graphs for safety analysis
+ - **Literature Mining:** Screening biomedical publications for adverse event reports
+ - **Clinical Decision Support:** Identifying potential drug safety signals
+
+ ### Out-of-Scope Uses
+ - Direct clinical decision-making without human review
+ - Diagnosis or treatment recommendations
+ - Processing non-English text
+ - Identifying drug-drug interactions (a different task)
+
+ ## Limitations
+
+ 1. **English Only:** Trained exclusively on English biomedical text
+ 2. **Domain Specific:** Optimized for drug-ADR relationships; may not generalize to other biomedical relations
+ 3. **Context Dependency:** Requires both the drug and the ADR to be mentioned in related context
+ 4. **Base Model Performance:** This base version achieves 88.3% F1; consider using CRAG-dual-encoder-ade or CRAG-dual-encoder-mimicause for production use
+
+ ## Ethical Considerations
+
+ - Model predictions should be validated by domain experts before use in clinical or regulatory settings
+ - False negatives may miss important safety signals; false positives may trigger unnecessary reviews
+ - The model reflects biases present in the training data (ADE Corpus V2, sourced from MEDLINE)
+
+ ## Citation
+
+ ```bibtex
+ @misc{crag-dual-encoder-2024,
+   title={CRAG: Causal Reasoning for Adversomics Graphs - Dual-Encoder Models for Drug-ADR Relation Extraction},
+   author={von Csefalvay, Chris},
+   year={2024},
+   publisher={Hugging Face},
+   url={https://huggingface.co/chrisvoncsefalvay/CRAG-dual-encoder-base}
+ }
+ ```
+
+ ## Model Card Authors
+
+ Chris von Csefalvay ([@chrisvoncsefalvay](https://huggingface.co/chrisvoncsefalvay))
+
+ ## Model Card Contact
+
+ For questions or issues, please open a discussion on this model's repository or contact chris@chrisvoncsefalvay.com.