---
language: en
license: mit
tags:
- biomedical
- relation-extraction
- pubmedbert
- named-entity-recognition
datasets:
- chemprot
- bc5cdr
- gad
- biored
- ddi
metrics:
- f1
- precision
- recall
model-index:
- name: PubMedBERT Relation Extraction
  results:
  - task:
      type: relation-extraction
      name: Biomedical Relation Extraction
    metrics:
    - type: f1
      value: 0.7347
      name: F1 Macro
---

# PubMedBERT for Biomedical Relation Extraction

Fine-tuned [PubMedBERT](https://huggingface.co/microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract) for multi-class relation extraction in biomedical text.

## Model Description

This model classifies the semantic relation between a pair of marked biomedical entities (chemicals, diseases, genes, proteins) in scientific text.

**Base Model:** `microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract`

**Training Data:** ChemProt, BC5CDR, GAD, BioRED, DDI Corpus

**Relation Types (9):**
- `activates`
- `inhibits`
- `converts`
- `causes`
- `treats`
- `associated_with`
- `interacts_with`
- `located_in`
- `NO_RELATION`

## Performance

| Metric | Value |
|--------|------:|
| F1 Macro | 0.7347 |
| Accuracy | 0.7530 |

### Per-Class F1 Scores

| Relation | F1 | Support |
|----------|---:|--------:|
| interacts_with | 0.85 | 1,304 |
| inhibits | 0.84 | 2,704 |
| activates | 0.83 | 3,412 |
| converts | 0.82 | 884 |
| associated_with | 0.81 | 1,769 |
| causes | 0.81 | 6,760 |
| NO_RELATION | 0.63 | 6,760 |
| treats | 0.28 | 678 |
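
Macro F1 is the unweighted mean of the per-class F1 scores, so the headline number can be roughly cross-checked against the table above. The sketch below assumes the macro average was taken over the eight classes listed in the table (there is no row for `located_in`); the model card does not state this explicitly.

```python
from statistics import mean

# Per-class F1 values copied from the table above (rounded to two decimals).
per_class_f1 = {
    "interacts_with": 0.85,
    "inhibits": 0.84,
    "activates": 0.83,
    "converts": 0.82,
    "associated_with": 0.81,
    "causes": 0.81,
    "NO_RELATION": 0.63,
    "treats": 0.28,
}

# Macro averaging gives every class equal weight, regardless of its support.
macro_f1 = mean(per_class_f1.values())
print(f"Macro F1 over {len(per_class_f1)} classes: {macro_f1:.3f}")
```

The result is about 0.734, close to the reported 0.7347. The small gap is consistent with the per-class values being rounded to two decimals. Because every class counts equally, the low `treats` score pulls the macro average down noticeably despite its small support.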

## Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the fine-tuned model and tokenizer
model_name = "your-username/pubmedbert-relation-extraction"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()  # disable dropout for inference

# Make sure the entity markers are single tokens. If the saved tokenizer
# already contains them, nothing is added and no resize is needed; markers
# added here for the first time would get untrained embeddings.
special_tokens = {"additional_special_tokens": ["[E1]", "[/E1]", "[E2]", "[/E2]"]}
if tokenizer.add_special_tokens(special_tokens) > 0:
    model.resize_token_embeddings(len(tokenizer))

# Example: extract the relation between aspirin and pain
text = "[E1]Aspirin[/E1] reduces [E2]pain[/E2] in patients."

inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
with torch.no_grad():  # no gradients are needed at inference time
    outputs = model(**inputs)
probs = torch.softmax(outputs.logits, dim=-1)
predicted_class = torch.argmax(probs, dim=-1).item()

print(f"Predicted relation: {model.config.id2label[predicted_class]}")
print(f"Confidence: {probs[0][predicted_class].item():.3f}")
```

## Input Format

Text must contain entity markers `[E1]`, `[/E1]`, `[E2]`, `[/E2]` around the two entities:

```
[E1]Entity1[/E1] ... context ... [E2]Entity2[/E2]
```
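
When entities come from an upstream NER step as character offsets, the markers can be inserted programmatically. The helper below is a minimal sketch: the name `mark_entities` and the `(start, end)` span convention (end exclusive) are assumptions for illustration and are not part of this model's code.

```python
def mark_entities(text, e1_span, e2_span):
    """Wrap two non-overlapping character spans in [E1]/[E2] markers.

    Hypothetical helper: each span is a (start, end) pair of character
    offsets into `text`, with `end` exclusive.
    """
    for start, end in (e1_span, e2_span):
        if not 0 <= start < end <= len(text):
            raise ValueError("spans must be non-empty and inside the text")

    # Order the spans by position so the text can be rebuilt left to right.
    (a, b, first), (c, d, second) = sorted(
        [(*e1_span, "E1"), (*e2_span, "E2")]
    )
    if b > c:
        raise ValueError("entity spans must not overlap")

    return (
        text[:a] + f"[{first}]" + text[a:b] + f"[/{first}]"
        + text[b:c] + f"[{second}]" + text[c:d] + f"[/{second}]"
        + text[d:]
    )


sentence = "Aspirin reduces pain in patients."
e1_start = sentence.index("Aspirin")
e2_start = sentence.index("pain")
marked = mark_entities(
    sentence,
    (e1_start, e1_start + len("Aspirin")),
    (e2_start, e2_start + len("pain")),
)
print(marked)
```

The spans are sorted by position before rebuilding the string, so the helper also works when the second entity appears before the first in the sentence.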

## Training Details

- **Optimizer:** AdamW
- **Learning Rate:** 2e-5
- **Batch Size:** 16
- **Epochs:** up to 15 (with early stopping)
- **Max Length:** 256 tokens
- **Loss:** Weighted CrossEntropy
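
The list above does not say how the class weights for the weighted cross-entropy loss were derived. One common scheme, shown here purely as an assumption, is the "balanced" inverse-frequency heuristic `weight_c = N / (K * count_c)`. The function name is hypothetical, and the counts are taken from the Support column of the per-class table for illustration only; they may not match the training-set frequencies.

```python
def balanced_class_weights(counts):
    """Inverse-frequency weights: weight_c = N / (K * count_c).

    Hypothetical helper; N is the total number of examples and K the
    number of classes, so rarer classes receive larger weights.
    """
    total = sum(counts.values())
    num_classes = len(counts)
    return {label: total / (num_classes * n) for label, n in counts.items()}


# Illustrative counts only (Support column of the per-class table).
counts = {
    "interacts_with": 1304,
    "inhibits": 2704,
    "activates": 3412,
    "converts": 884,
    "associated_with": 1769,
    "causes": 6760,
    "NO_RELATION": 6760,
    "treats": 678,
}

weights = balanced_class_weights(counts)
for label, weight in sorted(weights.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{label:16s} {weight:.3f}")
```

In PyTorch, weights like these would typically be converted to a tensor ordered by the model's label ids and passed as the `weight` argument of `torch.nn.CrossEntropyLoss`.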

## Limitations

- The `treats` relation has a low F1 (0.28), attributed to limited training data; it also has the smallest support (678) in the per-class table
- Best performance on Chemical↔Gene/Protein and Disease relations
- Requires entity markers in input text
- Trained on English biomedical abstracts

## Citation

```bibtex
@misc{pubmedbert-relation-extraction,
  author = {Your Name},
  title = {PubMedBERT for Biomedical Relation Extraction},
  year = {2026},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/your-username/pubmedbert-relation-extraction}}
}
```

## Acknowledgments

- Base model: [PubMedBERT](https://huggingface.co/microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract)
- Datasets: ChemProt, BC5CDR, GAD, BioRED, DDI Corpus