---
language: en
license: mit
tags:
- biomedical
- relation-extraction
- pubmedbert
- named-entity-recognition
datasets:
- chemprot
- bc5cdr
- gad
- biored
- ddi
metrics:
- f1
- precision
- recall
model-index:
- name: PubMedBERT Relation Extraction
  results:
  - task:
      type: relation-extraction
      name: Biomedical Relation Extraction
    metrics:
    - type: f1
      value: 0.7347
      name: F1 Macro
---

# PubMedBERT for Biomedical Relation Extraction

Fine-tuned [PubMedBERT](https://huggingface.co/microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract) for multi-class relation extraction in biomedical text.

## Model Description

This model classifies the semantic relation between two marked biomedical entities (chemicals, diseases, genes, proteins) in scientific literature.

**Base Model:** `microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract`

**Training Data:** ChemProt, BC5CDR, GAD, BioRED, DDI Corpus

**Relation Types (9):**
- `activates`
- `inhibits`
- `converts`
- `causes`
- `treats`
- `associated_with`
- `interacts_with`
- `located_in`
- `NO_RELATION`

## Performance

| Metric | Value |
|--------|------:|
| F1 Macro | 0.7347 |
| Accuracy | 75.3% |

### Per-Class F1 Scores

| Relation | F1 | Support |
|----------|---:|--------:|
| interacts_with | 0.85 | 1,304 |
| inhibits | 0.84 | 2,704 |
| activates | 0.83 | 3,412 |
| converts | 0.82 | 884 |
| associated_with | 0.81 | 1,769 |
| causes | 0.81 | 6,760 |
| NO_RELATION | 0.63 | 6,760 |
| treats | 0.28 | 678 |

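As a sanity check on how the headline score relates to the per-class table, the sketch below recomputes macro F1 as the unweighted mean of the per-class F1 values. It assumes the reported macro average covers exactly the eight classes listed (the table has no row for `located_in`), and `macro_f1` is an illustrative helper, not part of the model's code. Because the per-class scores are rounded to two decimals, the result (about 0.734) can only be expected to match the reported 0.7347 approximately.

```python
# Per-class F1 scores copied from the table (rounded to two decimals).
per_class_f1 = {
    "interacts_with": 0.85,
    "inhibits": 0.84,
    "activates": 0.83,
    "converts": 0.82,
    "associated_with": 0.81,
    "causes": 0.81,
    "NO_RELATION": 0.63,
    "treats": 0.28,
}


def macro_f1(scores):
    """Unweighted mean of per-class F1: every class counts equally."""
    return sum(scores.values()) / len(scores)


macro = macro_f1(per_class_f1)
print(f"Macro F1 from rounded per-class scores: {macro:.4f}")
```

Because every class counts equally in a macro average, the low `treats` score pulls the overall figure down even though that class has the smallest support.
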
## Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model_name = "your-username/pubmedbert-relation-extraction"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()  # disable dropout for inference

# Register the entity markers. If the saved tokenizer already contains them,
# nothing is added and the embedding matrix is left untouched.
special_tokens = {"additional_special_tokens": ["[E1]", "[/E1]", "[E2]", "[/E2]"]}
num_added = tokenizer.add_special_tokens(special_tokens)
if num_added > 0:
    # Newly added tokens get untrained embeddings, so predictions may be unreliable.
    model.resize_token_embeddings(len(tokenizer))

# Example: Extract relation between aspirin and pain
text = "[E1]Aspirin[/E1] reduces [E2]pain[/E2] in patients."

inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
with torch.no_grad():  # no gradients needed at inference time
    outputs = model(**inputs)
probs = torch.softmax(outputs.logits, dim=-1)
predicted_class = torch.argmax(probs, dim=-1).item()

print(f"Predicted relation: {model.config.id2label[predicted_class]}")
print(f"Confidence: {probs[0][predicted_class].item():.3f}")
```

## Input Format

Input text must wrap the two entities in marker tokens: `[E1]` and `[/E1]` around the first entity, `[E2]` and `[/E2]` around the second:

```
[E1]Entity1[/E1] ... context ... [E2]Entity2[/E2]
```

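When entities come from an upstream NER step as character offsets, the markers can be inserted programmatically. The following is a minimal sketch under that assumption; `mark_entities` and its `(start, end)` span convention (end exclusive) are hypothetical helpers for illustration and are not shipped with the model.

```python
def mark_entities(text, span1, span2):
    """Wrap two non-overlapping character spans in [E1]/[E2] markers.

    span1 and span2 are (start, end) character offsets, end exclusive.
    span1 is tagged as E1 and span2 as E2, whichever comes first in the text.
    """
    (s1, e1), (s2, e2) = span1, span2
    if not (0 <= s1 < e1 <= len(text) and 0 <= s2 < e2 <= len(text)):
        raise ValueError("spans must be non-empty and inside the text")
    if s1 < e2 and s2 < e1:
        raise ValueError("entity spans must not overlap")

    tagged = [(s1, e1, "[E1]", "[/E1]"), (s2, e2, "[E2]", "[/E2]")]
    # Insert from the rightmost span first so earlier offsets stay valid.
    for start, end, open_tag, close_tag in sorted(tagged, reverse=True):
        text = text[:start] + open_tag + text[start:end] + close_tag + text[end:]
    return text


sentence = "Aspirin reduces pain in patients."
marked = mark_entities(sentence, (0, 7), (16, 20))
print(marked)
```

For a sentence with more than two entities, one option is to call the helper once per entity pair and classify each marked copy separately.
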
## Training Details

- **Optimizer:** AdamW
- **Learning Rate:** 2e-5
- **Batch Size:** 16
- **Epochs:** up to 15 (with early stopping)
- **Max Length:** 256 tokens
- **Loss:** Weighted cross-entropy

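To illustrate what a weighted cross-entropy loss does for imbalanced classes, here is a small pure-Python sketch. The weighting scheme (inverse class frequency) and the example counts are assumptions made for illustration; the model card does not state how the class weights were actually chosen. The loss is normalized by the sum of the target weights, which is the convention PyTorch's `CrossEntropyLoss` uses with `weight` and mean reduction.

```python
import math


def inverse_frequency_weights(class_counts):
    """Weight each class by total / (num_classes * count), so rare classes weigh more."""
    total = sum(class_counts.values())
    n = len(class_counts)
    return {label: total / (n * count) for label, count in class_counts.items()}


def weighted_cross_entropy(probs, targets, weights):
    """Weighted mean of -log p(target), normalized by the sum of target weights."""
    num = sum(weights[t] * -math.log(p[t]) for p, t in zip(probs, targets))
    den = sum(weights[t] for t in targets)
    return num / den


# Hypothetical training counts, not taken from the model card.
counts = {"causes": 800, "treats": 200}
weights = inverse_frequency_weights(counts)

# Predicted class probabilities for two examples and their gold labels.
probs = [{"causes": 0.9, "treats": 0.1}, {"causes": 0.6, "treats": 0.4}]
targets = ["causes", "treats"]

loss = weighted_cross_entropy(probs, targets, weights)
unweighted = weighted_cross_entropy(probs, targets, {"causes": 1.0, "treats": 1.0})
print(f"weighted: {loss:.3f}, unweighted: {unweighted:.3f}")
```

In this toy case the weighted loss is larger than the unweighted one, because the mistake on the rare class contributes more to the total.
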
## Limitations

- The `treats` relation has a low F1 (0.28), attributed to limited training data
- Performance is best on Chemical↔Gene/Protein and Disease relations
- Input text must contain entity markers
- Trained on English biomedical abstracts

## Citation

```bibtex
@misc{pubmedbert-relation-extraction,
  author = {Your Name},
  title = {PubMedBERT for Biomedical Relation Extraction},
  year = {2026},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/your-username/pubmedbert-relation-extraction}}
}
```

## Acknowledgments

- Base model: [PubMedBERT](https://huggingface.co/microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract)
- Datasets: ChemProt, BC5CDR, GAD, BioRED, DDI Corpus