| --- |
| license: apache-2.0 |
| language: |
| - en |
| tags: |
| - medical |
| - biomedical |
| - drug-safety |
| - adverse-drug-reactions |
| - pharmacovigilance |
| - relation-extraction |
| - dual-encoder |
| - clinical-nlp |
| - pubmedbert |
| datasets: |
| - ade-benchmark-corpus/ade_corpus_v2 |
| metrics: |
| - f1 |
| - roc_auc |
| pipeline_tag: text-classification |
| model-index: |
| - name: CRAG-dual-encoder-base |
| results: |
| - task: |
| type: text-classification |
| name: Drug-ADR Relation Extraction |
| dataset: |
| name: ADE Corpus V2 |
| type: ade-benchmark-corpus/ade_corpus_v2 |
| config: Ade_corpus_v2_drug_ade_relation |
| metrics: |
| - type: f1 |
| value: 0.883 |
| name: F1 Score |
| --- |
| |
| # CRAG-dual-encoder-base |
|
|
| **CRAG: Causal Reasoning for Adversomics Graphs** |
|
|
| This is the base model in the CRAG dual-encoder family for drug-adverse drug reaction (ADR) relation extraction. It uses a dual-encoder architecture with PubMedBERT to score drug-ADR pairs for causal pharmacovigilance graph construction. |
|
|
| ## Model Description |
|
|
| CRAG-dual-encoder-base is designed to identify causal relationships between drugs and adverse drug reactions from biomedical text. Given a drug mention and an ADR mention in context, the model predicts whether they share a causal relationship. |
|
|
| ### Architecture |
|
|
| ``` |
| ┌─────────────────────────────────────────────────────────────┐ |
| │                   CRAG Dual-Encoder Base                    │ |
| ├─────────────────────────────────────────────────────────────┤ |
| │                                                             │ |
| │   Drug Context             ADR Context                      │ |
| │        │                        │                           │ |
| │        ▼                        ▼                           │ |
| │   ┌──────────┐             ┌──────────┐                     │ |
| │   │PubMedBERT│             │PubMedBERT│  (separate weights) │ |
| │   │   Drug   │             │   ADR    │                     │ |
| │   │ Encoder  │             │ Encoder  │                     │ |
| │   └────┬─────┘             └────┬─────┘                     │ |
| │        │                        │                           │ |
| │        ▼                        ▼                           │ |
| │   [CLS] Pool               [CLS] Pool                       │ |
| │        │                        │                           │ |
| │        └────────┬───────────────┘                           │ |
| │                 │                                           │ |
| │                 ▼                                           │ |
| │          ┌──────────────┐                                   │ |
| │          │   Bilinear   │                                   │ |
| │          │    Fusion    │                                   │ |
| │          └──────┬───────┘                                   │ |
| │                 │                                           │ |
| │                 ▼                                           │ |
| │          ┌──────────────┐                                   │ |
| │          │   MLP Head   │                                   │ |
| │          │   (256→1)    │                                   │ |
| │          └──────┬───────┘                                   │ |
| │                 │                                           │ |
| │                 ▼                                           │ |
| │             P(causal)                                       │ |
| └─────────────────────────────────────────────────────────────┘ |
| ``` |
|
|
| - **Base Model:** `microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext` |
| - **Hidden Dimension:** 768 |
| - **Fusion Dimension:** 256 |
| - **Parameters:** ~220M (two separate BERT encoders) |
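
The data flow in the diagram can be sketched in PyTorch as follows. This is a minimal illustration, not the released implementation: the class name `DualEncoderSketch`, the low-rank form of the bilinear fusion (project each side to the fusion dimension, then multiply element-wise), and the two-layer shape of the MLP head are all assumptions. A full `nn.Bilinear` layer from 768 × 768 to 256 would hold roughly 151M parameters on its own, which is why a low-rank form is assumed here. Tiny embedding layers stand in for the two PubMedBERT encoders so that the example runs without downloading weights.

```python
import torch
from torch import nn


class DualEncoderSketch(nn.Module):
    """Illustrative dual encoder: two encoders, [CLS] pooling, fusion, MLP head."""

    def __init__(self, drug_encoder, adr_encoder, hidden_dim=768, fusion_dim=256):
        super().__init__()
        # Two encoders with separate weights, as in the diagram.
        self.drug_encoder = drug_encoder
        self.adr_encoder = adr_encoder
        # ASSUMPTION: low-rank bilinear fusion (project each side, then multiply).
        self.drug_proj = nn.Linear(hidden_dim, fusion_dim)
        self.adr_proj = nn.Linear(hidden_dim, fusion_dim)
        # ASSUMPTION: head layout; the card only states fusion_dim -> 1.
        self.head = nn.Sequential(
            nn.Linear(fusion_dim, fusion_dim),
            nn.ReLU(),
            nn.Linear(fusion_dim, 1),
        )

    def forward(self, drug_ids, adr_ids):
        # Each encoder is assumed to return hidden states [batch, seq, hidden].
        drug_cls = self.drug_encoder(drug_ids)[:, 0]  # [CLS] is token 0
        adr_cls = self.adr_encoder(adr_ids)[:, 0]
        fused = self.drug_proj(drug_cls) * self.adr_proj(adr_cls)
        logit = self.head(fused).squeeze(-1)
        return torch.sigmoid(logit)  # P(causal) for each pair in the batch


# Tiny stand-in encoders (hypothetical sizes) instead of PubMedBERT.
torch.manual_seed(0)
hidden, fusion, vocab = 16, 8, 100
model = DualEncoderSketch(
    nn.Embedding(vocab, hidden), nn.Embedding(vocab, hidden), hidden, fusion
)
drug_ids = torch.randint(0, vocab, (4, 12))
adr_ids = torch.randint(0, vocab, (4, 12))
with torch.no_grad():
    probs = model(drug_ids, adr_ids)
print(probs.shape)
```

With real encoders, each stand-in would be replaced by a module that wraps the base model and returns its last hidden state.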
|
|
| ### Training Procedure |
|
|
| The model was trained in two phases: |
|
|
| **Phase 1: Contrastive Pre-training (3 epochs)** |
| - InfoNCE loss with temperature τ=0.07 |
| - Learns to bring true drug-ADR pairs close in embedding space |
| - Random negative sampling (mismatched pairs) |
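
The Phase 1 objective can be illustrated with a small pure-Python InfoNCE sketch. It assumes cosine similarity between L2-normalized embeddings, treats the other ADRs in the batch as the mismatched negatives, and computes the loss in the drug-to-ADR direction only; none of these details are stated in this card, and `info_nce` is a hypothetical helper name.

```python
import math


def _normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def info_nce(drug_vecs, adr_vecs, temperature=0.07):
    """Mean InfoNCE loss; row i of each list is a true drug-ADR pair."""
    drugs = [_normalize(v) for v in drug_vecs]
    adrs = [_normalize(v) for v in adr_vecs]
    total = 0.0
    for i, drug in enumerate(drugs):
        # Cosine similarity of drug i to every ADR in the batch, scaled by tau.
        logits = [sum(a * b for a, b in zip(drug, adr)) / temperature for adr in adrs]
        # Numerically stable log-sum-exp over the positive and the negatives.
        peak = max(logits)
        log_denom = peak + math.log(sum(math.exp(z - peak) for z in logits))
        total += log_denom - logits[i]
    return total / len(drugs)


# Toy 2-D embeddings: matched pairs point the same way, swapped pairs do not.
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

The loss is near zero when each drug embedding is closest to its own ADR and grows sharply when pairs are mismatched, which is the pull-together behaviour the list above describes.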
|
|
| **Phase 2: Classification Fine-tuning (5 epochs)** |
| - Binary cross-entropy loss |
| - Balanced positive/negative samples |
| - Learning rate: 2e-5 with linear warmup |
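
The two numeric ingredients of Phase 2 can be sketched in plain Python. The peak learning rate of 2e-5 and the linear warmup come from the list above; the linear decay after warmup, the step counts, and the helper names `lr_at` and `bce` are assumptions made for illustration.

```python
import math


def lr_at(step, total_steps, warmup_steps, peak_lr=2e-5):
    """Learning rate at `step`: linear warmup, then (assumed) linear decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / (total_steps - warmup_steps)


def bce(prob, label):
    """Binary cross-entropy for one predicted probability and a 0/1 label."""
    eps = 1e-12  # guards against log(0)
    prob = min(max(prob, eps), 1.0 - eps)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))


# Hypothetical step counts, chosen only to make the schedule's shape visible.
total_steps, warmup_steps = 1000, 100
for step in (0, 50, 100, 550, 1000):
    print(step, lr_at(step, total_steps, warmup_steps))

# A confident prediction is cheap when correct and expensive when wrong.
print(bce(0.9, 1), bce(0.9, 0))
```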
|
|
| ### Training Data |
|
|
| - **Dataset:** [ADE Corpus V2](https://huggingface.co/datasets/ade-benchmark-corpus/ade_corpus_v2) |
| - **Configuration:** `Ade_corpus_v2_drug_ade_relation` |
| - **Training Examples:** ~6,800 positive pairs + ~6,800 negative pairs |
| - **Validation Examples:** ~850 pairs |
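
A balanced set like the one above can be built by pairing each drug with an effect from a different example. The sketch below assumes the positive examples have already been reduced to `(drug, effect)` tuples; the helper `sample_negatives` and the placeholder pairs are hypothetical and are not taken from the corpus or the training script.

```python
import random


def sample_negatives(positive_pairs, seed=0, max_tries=10_000):
    """Return one mismatched (drug, effect) pair per positive pair."""
    rng = random.Random(seed)
    positives = set(positive_pairs)
    drugs = [drug for drug, _ in positive_pairs]
    effects = [effect for _, effect in positive_pairs]
    negatives = []
    tries = 0
    while len(negatives) < len(positive_pairs) and tries < max_tries:
        tries += 1
        candidate = (rng.choice(drugs), rng.choice(effects))
        # Reject anything that is actually a known positive pair.
        if candidate not in positives:
            negatives.append(candidate)
    return negatives


# Placeholder pairs for illustration only.
positive_pairs = [
    ("drug_a", "effect_a"),
    ("drug_b", "effect_b"),
    ("drug_c", "effect_c"),
    ("drug_d", "effect_d"),
]
negative_pairs = sample_negatives(positive_pairs)
labelled = [(pair, 1) for pair in positive_pairs] + [(pair, 0) for pair in negative_pairs]
print(len(positive_pairs), len(negative_pairs))
```

Random mismatching can produce duplicate negatives and cannot tell whether an unlisted pair is truly non-causal, which is one reason random negatives are easier than curated hard negatives.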
|
|
| ## Performance |
|
|
| | Metric | Value | |
| |--------|-------| |
| | **F1 Score** | 88.3% | |
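
For reference, F1 is the harmonic mean of precision and recall on the positive (causal) class. The sketch below shows the computation on made-up scores; the 0.5 decision threshold and the helper name `f1_score` are assumptions, since this card does not state how predictions were thresholded.

```python
def f1_score(labels, scores, threshold=0.5):
    """F1 on the positive class after thresholding predicted probabilities."""
    preds = [1 if score >= threshold else 0 for score in scores]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels and scores, not model outputs.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.7, 0.2, 0.6, 0.3, 0.1]
print(round(f1_score(labels, scores), 3))
```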
|
|
| ### Comparison with CRAG Family |
|
|
| | Model | F1 | AUC | Key Features | |
| |-------|-----|-----|--------------| |
| | **CRAG-dual-encoder-base** | 88.3% | - | PubMedBERT, random negatives | |
| | CRAG-dual-encoder-ade | 97.5% | 99.1% | BioLinkBERT, hard negatives, focal loss | |
| | CRAG-dual-encoder-mimicause | 98.9% | 99.8% | + MIMICause causal reasoning | |
|
|
| ## Usage |
|
|
| ```python |
| import torch |
| from transformers import AutoTokenizer |
| |
| # The model uses a custom architecture: the DualEncoderModel class must be |
| # defined before the weights can be loaded (see the training script). |
| |
| tokenizer = AutoTokenizer.from_pretrained("chrisvoncsefalvay/CRAG-dual-encoder-base") |
| |
| # Example: score a drug-ADR pair |
| drug_context = "Patient was prescribed aspirin for pain management." |
| adr_context = "The patient experienced gastrointestinal bleeding." |
| |
| # Tokenize each context separately, one input per encoder |
| drug_inputs = tokenizer(drug_context, return_tensors="pt", max_length=128, truncation=True, padding="max_length") |
| adr_inputs = tokenizer(adr_context, return_tensors="pt", max_length=128, truncation=True, padding="max_length") |
| |
| # Forward pass (pseudo-code: requires the custom model to be loaded as `model`) |
| # with torch.no_grad(): |
| #     drug_repr = model.encode_drug(**drug_inputs) |
| #     adr_repr = model.encode_adr(**adr_inputs) |
| #     score = model.classify(drug_repr, adr_repr) |
| ``` |
|
|
| ## Intended Uses |
|
|
| ### Primary Use Cases |
| - **Pharmacovigilance:** Automated extraction of drug-ADR relationships from literature |
| - **Causal Graph Construction:** Building drug-ADR knowledge graphs for safety analysis |
| - **Literature Mining:** Screening biomedical publications for adverse event reports |
| - **Clinical Decision Support:** Identifying potential drug safety signals |
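
As an illustration of the graph-construction use case, scored pairs can be filtered by a threshold and collected into an adjacency mapping. This is a sketch under stated assumptions: the helper `build_graph`, the 0.5 threshold, and the placeholder scores are hypothetical, and any edge produced this way still needs expert review.

```python
from collections import defaultdict


def build_graph(scored_pairs, threshold=0.5):
    """Map each drug to the ADRs whose score clears the threshold."""
    graph = defaultdict(dict)
    for drug, adr, score in scored_pairs:
        if score >= threshold:
            # Keep the highest score seen for a repeated drug-ADR edge.
            graph[drug][adr] = max(score, graph[drug].get(adr, 0.0))
    return dict(graph)


# Made-up scores standing in for model outputs on extracted mention pairs.
scored_pairs = [
    ("drug_a", "effect_a", 0.93),
    ("drug_a", "effect_b", 0.12),
    ("drug_b", "effect_b", 0.71),
    ("drug_a", "effect_a", 0.88),
]
graph = build_graph(scored_pairs)
print(graph)
```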
|
|
| ### Out-of-Scope Uses |
| - Direct clinical decision-making without human review |
| - Diagnosis or treatment recommendations |
| - Processing non-English text |
| - Identifying drug-drug interactions (different task) |
|
|
| ## Limitations |
|
|
| 1. **English Only:** Trained exclusively on English biomedical text |
| 2. **Domain Specific:** Optimized for drug-ADR relationships; may not generalize to other biomedical relations |
| 3. **Context Dependency:** Requires both the drug and the ADR to be mentioned in related contexts |
| 4. **Base Model Performance:** This base version achieves 88.3% F1; consider using CRAG-dual-encoder-ade or CRAG-dual-encoder-mimicause for production use |
|
|
| ## Ethical Considerations |
|
|
| - Model predictions should be validated by domain experts before use in clinical or regulatory settings |
| - False negatives may miss important safety signals; false positives may trigger unnecessary reviews |
| - The model reflects biases present in the training data (ADE Corpus V2, sourced from MEDLINE) |
|
|
| ## Citation |
|
|
| ```bibtex |
| @misc{crag-dual-encoder-2024, |
| title={CRAG: Causal Reasoning for Adversomics Graphs - Dual-Encoder Models for Drug-ADR Relation Extraction}, |
| author={von Csefalvay, Chris}, |
| year={2024}, |
| publisher={Hugging Face}, |
| url={https://huggingface.co/chrisvoncsefalvay/CRAG-dual-encoder-base} |
| } |
| ``` |
|
|
| ## Model Card Authors |
|
|
| Chris von Csefalvay ([@chrisvoncsefalvay](https://huggingface.co/chrisvoncsefalvay)) |
|
|
| ## Model Card Contact |
|
|
| For questions or issues, please open a discussion on this model's repository or contact chris@chrisvoncsefalvay.com. |
|
|