+---
+license: apache-2.0
+language:
+- en
+tags:
+- medical
+- biomedical
+- drug-safety
+- adverse-drug-reactions
+- pharmacovigilance
+- relation-extraction
+- dual-encoder
+- clinical-nlp
+- pubmedbert
+datasets:
+- ade-benchmark-corpus/ade_corpus_v2
+metrics:
+- f1
+- roc_auc
+pipeline_tag: text-classification
+model-index:
+- name: CRAG-dual-encoder-base
+  results:
+  - task:
+      type: text-classification
+      name: Drug-ADR Relation Extraction
+    dataset:
+      name: ADE Corpus V2
+      type: ade-benchmark-corpus/ade_corpus_v2
+      config: Ade_corpus_v2_drug_ade_relation
+    metrics:
+    - type: f1
+      value: 0.883
+      name: F1 Score
+---
+# CRAG-dual-encoder-base
+**CRAG: Causal Reasoning for Adversomics Graphs**
+This is the base model in the CRAG dual-encoder family for extracting relations between drugs and adverse drug reactions (ADRs). It encodes the drug context and the ADR context with two separate PubMedBERT encoders and scores each drug-ADR pair for causal pharmacovigilance graph construction.
+## Model Description
+CRAG-dual-encoder-base identifies causal relationships between drugs and adverse drug reactions in biomedical text. Given a drug mention and an ADR mention, each in its own context, the model outputs the probability that the two are causally related.
+### Architecture
+```
+┌─────────────────────────────────────────────────────────────┐
+│                    CRAG Dual-Encoder Base                   │
+├─────────────────────────────────────────────────────────────┤
+│                                                             │
+│   Drug Context          ADR Context                         │
+│        │                     │                              │
+│        ▼                     ▼                              │
+│  ┌──────────┐          ┌──────────┐                         │
+│  │PubMedBERT│          │PubMedBERT│    (separate weights)   │
+│  │  Drug    │          │   ADR    │                         │
+│  │ Encoder  │          │ Encoder  │                         │
+│  └────┬─────┘          └────┬─────┘                         │
+│       │                     │                               │
+│       ▼                     ▼                               │
+│  [CLS] Pool            [CLS] Pool                           │
+│       │                     │                               │
+│       └────────┬────────────┘                               │
+│                │                                            │
+│                ▼                                            │
+│        ┌──────────────┐                                     │
+│        │   Bilinear   │                                     │
+│        │   Fusion     │                                     │
+│        └──────┬───────┘                                     │
+│               │                                             │
+│               ▼                                             │
+│        ┌──────────────┐                                     │
+│        │  MLP Head    │                                     │
+│        │  (256→1)     │                                     │
+│        └──────┬───────┘                                     │
+│               │                                             │
+│               ▼                                             │
+│           P(causal)                                         │
+└─────────────────────────────────────────────────────────────┘
+```
+- **Base Model:** `microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext`
+- **Hidden Dimension:** 768
+- **Fusion Dimension:** 256
+- **Parameters:** ~220M (two separate BERT encoders)
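+A minimal PyTorch sketch of the diagram above, not the released implementation. The class name `DualEncoderSketch`, the low-rank form of the bilinear fusion (project each side to 256 dimensions, then multiply element-wise) and the ReLU in the head are assumptions; the card only fixes the two encoders, [CLS] pooling, the 256-dimensional fusion and the 256→1 head. A full `nn.Bilinear(768, 768, 256)` would add roughly 151M weights by itself, which is hard to reconcile with the ~220M total, hence the low-rank guess.
+```python
+import torch
+import torch.nn as nn
+
+
+class DualEncoderSketch(nn.Module):
+    """Hypothetical reconstruction of the diagram; not the released class."""
+
+    def __init__(self, drug_encoder, adr_encoder, hidden_dim=768, fusion_dim=256):
+        super().__init__()
+        self.drug_encoder = drug_encoder  # e.g. one PubMedBERT copy
+        self.adr_encoder = adr_encoder    # a second copy with its own weights
+        # Assumed low-rank bilinear fusion: project each side, multiply element-wise.
+        self.drug_proj = nn.Linear(hidden_dim, fusion_dim)
+        self.adr_proj = nn.Linear(hidden_dim, fusion_dim)
+        # Assumed head layout; the card only states 256 -> 1.
+        self.head = nn.Sequential(nn.ReLU(), nn.Linear(fusion_dim, 1))
+
+    @staticmethod
+    def cls_pool(encoder, inputs):
+        # [CLS] pooling: take the hidden state of the first token.
+        return encoder(**inputs).last_hidden_state[:, 0]
+
+    def forward(self, drug_inputs, adr_inputs):
+        drug_repr = self.cls_pool(self.drug_encoder, drug_inputs)
+        adr_repr = self.cls_pool(self.adr_encoder, adr_inputs)
+        fused = self.drug_proj(drug_repr) * self.adr_proj(adr_repr)
+        # One P(causal) value per drug-ADR pair in the batch.
+        return torch.sigmoid(self.head(fused)).squeeze(-1)
+```
+Each encoder can be any module whose output exposes `last_hidden_state`, for example two independent `AutoModel.from_pretrained(...)` copies of the base model listed above.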
+### Training Procedure
+The model was trained in two phases:
+**Phase 1: Contrastive Pre-training (3 epochs)**
+- InfoNCE loss with temperature τ=0.07
+- Pulls true drug-ADR pairs together in embedding space and pushes mismatched pairs apart
+- Random negative sampling (mismatched pairs)
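+The contrastive objective in Phase 1 can be sketched as follows. This is an illustration under assumptions, not the training code: it assumes cosine similarity, treats the other ADR embeddings in the batch as the mismatched negatives, and scores only the drug-to-ADR direction. The function name `info_nce_loss` is hypothetical; only the loss family and τ=0.07 come from the card.
+```python
+import torch
+import torch.nn.functional as F
+
+
+def info_nce_loss(drug_emb, adr_emb, temperature=0.07):
+    """InfoNCE over a batch where row i of each tensor forms a true pair."""
+    # Assumption: cosine similarity, so embeddings are L2-normalised first.
+    drug_emb = F.normalize(drug_emb, dim=-1)
+    adr_emb = F.normalize(adr_emb, dim=-1)
+    # logits[i, j] = similarity of drug i and ADR j, scaled by the temperature.
+    logits = drug_emb @ adr_emb.T / temperature
+    # The matching ADR for drug i sits on the diagonal.
+    targets = torch.arange(logits.size(0), device=logits.device)
+    return F.cross_entropy(logits, targets)
+```
+A low temperature such as 0.07 sharpens the softmax, so the loss concentrates on the negatives that sit closest to the true pair.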
+**Phase 2: Classification Fine-tuning (5 epochs)**
+- Binary cross-entropy loss
+- Balanced positive/negative samples
+- Learning rate: 2e-5 with linear warmup
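+The two Phase 2 ingredients can be written out in a few lines of plain Python. This is a sketch under assumptions: the card does not state the warmup length or what happens after warmup (the rate is held constant here, although a decay is equally plausible), and both function names are hypothetical.
+```python
+import math
+
+
+def binary_cross_entropy(p_causal, label, eps=1e-7):
+    """BCE for one pair: label is 1 for a causal pair, 0 for a negative."""
+    p = min(max(p_causal, eps), 1.0 - eps)  # clamp to avoid log(0)
+    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))
+
+
+def warmup_lr(step, warmup_steps, peak_lr=2e-5):
+    """Linear ramp up to peak_lr, then constant (the constant tail is assumed)."""
+    if warmup_steps <= 0:
+        return peak_lr
+    return peak_lr * min(1.0, (step + 1) / warmup_steps)
+```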
+### Training Data
+- **Dataset:** [ADE Corpus V2](https://huggingface.co/datasets/ade-benchmark-corpus/ade_corpus_v2)
+- **Configuration:** `Ade_corpus_v2_drug_ade_relation`
+- **Training Examples:** ~6,800 positive pairs + ~6,800 negative pairs
+- **Validation Examples:** ~850 pairs
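+One plausible way to build a balanced set of random negatives is to re-pair each drug with an ADR taken from a different positive pair. The sketch below is an assumption about the procedure, which the card does not spell out; it works on (drug, ADR) name tuples for brevity, whereas the model itself consumes text contexts, and `sample_random_negatives` is a hypothetical helper.
+```python
+import random
+
+
+def sample_random_negatives(positive_pairs, seed=0):
+    """Return one mismatched (drug, adr) pair per positive pair where possible."""
+    rng = random.Random(seed)
+    known = set(positive_pairs)
+    adrs = sorted({adr for _, adr in positive_pairs})
+    negatives = []
+    for drug, _ in positive_pairs:
+        # Only keep ADRs that are not already a known positive for this drug.
+        candidates = [adr for adr in adrs if (drug, adr) not in known]
+        if candidates:
+            negatives.append((drug, rng.choice(candidates)))
+    return negatives
+```
+Sampling one negative per positive yields the 1:1 balance described above, as long as every drug has at least one ADR it is not paired with.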
+## Performance
+| Metric | Value |
+|--------|-------|
+| **F1 Score** | 88.3% |
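+For reference, F1 is the harmonic mean of precision and recall on the positive (causal) class. A small pure-Python sketch on toy labels follows; the function name and the data are illustrative and are not the evaluation script behind the number above.
+```python
+def f1_score_binary(y_true, y_pred):
+    """F1 for the positive class, given 0/1 labels and 0/1 predictions."""
+    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
+    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
+    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
+    if tp == 0:
+        return 0.0  # no true positives: precision and recall are both zero
+    precision = tp / (tp + fp)
+    recall = tp / (tp + fn)
+    return 2 * precision * recall / (precision + recall)
+
+
+# Toy data: 2 true positives, 1 false positive, 1 false negative.
+toy_f1 = f1_score_binary([1, 1, 1, 0, 0], [1, 1, 0, 1, 0])
+```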
+### Comparison with CRAG Family
+| Model | F1 | AUC | Key Features |
+|-------|-----|-----|--------------|
+| **CRAG-dual-encoder-base** | 88.3% | not reported | PubMedBERT, random negatives |
+| CRAG-dual-encoder-ade | 97.5% | 99.1% | BioLinkBERT, hard negatives, focal loss |
+| CRAG-dual-encoder-mimicause | 98.8% | 99.9% | + MIMICause causal reasoning |
+## Usage
+```python
+import torch
+from transformers import AutoTokenizer, AutoModel
+# The model uses a custom architecture: the DualEncoderModel class must be defined
+# (see the training script for its definition) before the weights can be loaded.
+tokenizer = AutoTokenizer.from_pretrained("chrisvoncsefalvay/CRAG-dual-encoder-base")
+# Example: Score a drug-ADR pair
+drug_context = "Patient was prescribed aspirin for pain management."
+adr_context = "The patient experienced gastrointestinal bleeding."
+# Tokenize
+drug_inputs = tokenizer(drug_context, return_tensors="pt", max_length=128, truncation=True, padding="max_length")
+adr_inputs = tokenizer(adr_context, return_tensors="pt", max_length=128, truncation=True, padding="max_length")
+# Forward pass (pseudo-code: `model` must be an instance of the custom DualEncoderModel)
+# with torch.no_grad():
+#     drug_repr = model.encode_drug(**drug_inputs)
+#     adr_repr = model.encode_adr(**adr_inputs)
+#     score = model.classify(drug_repr, adr_repr)
+```
+## Intended Uses
+### Primary Use Cases
+- **Pharmacovigilance:** Automated extraction of drug-ADR relationships from literature
+- **Causal Graph Construction:** Building drug-ADR knowledge graphs for safety analysis
+- **Literature Mining:** Screening biomedical publications for adverse event reports
+- **Clinical Decision Support:** Flagging potential drug safety signals for expert review
+### Out-of-Scope Uses
+- Direct clinical decision-making without human review
+- Diagnosis or treatment recommendations
+- Processing non-English text
+- Identifying drug-drug interactions (different task)
+## Limitations
+1. **English Only:** Trained exclusively on English biomedical text
+2. **Domain Specific:** Optimized for drug-ADR relationships; may not generalize to other biomedical relations
+3. **Context Dependency:** Requires the drug and the ADR to be mentioned in related contexts
+4. **Base Model Performance:** This base version achieves 88.3% F1; consider using CRAG-dual-encoder-ade or CRAG-dual-encoder-mimicause for production use
+## Ethical Considerations
+- Model predictions should be validated by domain experts before use in clinical or regulatory settings
+- False negatives may miss important safety signals; false positives may trigger unnecessary reviews
+- The model reflects biases present in the training data (ADE Corpus V2, sourced from MEDLINE)
+## Citation
+```bibtex
+@misc{crag-dual-encoder-2024,
+  title={CRAG: Causal Reasoning for Adversomics Graphs - Dual-Encoder Models for Drug-ADR Relation Extraction},
+  author={von Csefalvay, Chris},
+  year={2024},
+  publisher={Hugging Face},
+  url={https://huggingface.co/chrisvoncsefalvay/CRAG-dual-encoder-base}
+}
+```
+## Model Card Authors
+Chris von Csefalvay ([@chrisvoncsefalvay](https://huggingface.co/chrisvoncsefalvay))
+## Model Card Contact
+For questions or issues, please open a discussion on this model's repository or contact chris@chrisvoncsefalvay.com.