---
language: en
license: mit
tags:
- biomedical
- relation-extraction
- pubmedbert
- named-entity-recognition
datasets:
- chemprot
- bc5cdr
- gad
- biored
- ddi
metrics:
- f1
- precision
- recall
model-index:
- name: PubMedBERT Relation Extraction
  results:
  - task:
      type: relation-extraction
      name: Biomedical Relation Extraction
    metrics:
    - type: f1
      value: 0.7347
      name: F1 Macro
---

# PubMedBERT for Biomedical Relation Extraction

Fine-tuned [PubMedBERT](https://huggingface.co/microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract) for multi-class relation extraction in biomedical text.

## Model Description

This model extracts semantic relations between biomedical entities (chemicals, diseases, genes, proteins) from scientific literature.

**Base Model:** `microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract`

**Training Data:** chemprot, bc5cdr, gad, biored, ddi

**Relation Types (9):**
- `activates`
- `inhibits`
- `converts`
- `causes`
- `treats`
- `associated_with`
- `interacts_with`
- `located_in`
- `NO_RELATION`

## Performance

| Metric | Value |
|--------|------:|
| F1 Macro | 0.7347 |
| Accuracy | 0.753 |

### Per-Class F1 Scores

| Relation | F1 | Support |
|----------|---:|--------:|
| interacts_with | 0.85 | 1,304 |
| inhibits | 0.84 | 2,704 |
| activates | 0.83 | 3,412 |
| converts | 0.82 | 884 |
| associated_with | 0.81 | 1,769 |
| causes | 0.81 | 6,760 |
| NO_RELATION | 0.63 | 6,760 |
| treats | 0.28 | 678 |
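
As a sanity check on how the headline number relates to this table, macro F1 is the unweighted mean of the per-class F1 scores. The sketch below recomputes it from the rounded values above; the dictionary is simply copied from the table (which has no row for `located_in`), and the result should be read as an approximation because the per-class scores are rounded to two decimals.

```python
# Per-class F1 scores copied from the table above (rounded to two decimals).
per_class_f1 = {
    "interacts_with": 0.85,
    "inhibits": 0.84,
    "activates": 0.83,
    "converts": 0.82,
    "associated_with": 0.81,
    "causes": 0.81,
    "NO_RELATION": 0.63,
    "treats": 0.28,
}

# Macro F1 weights every class equally, regardless of its support.
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)
print(f"Macro F1 from rounded per-class scores: {macro_f1:.3f}")
```

The mean of the rounded scores comes out close to the reported 0.7347. Because every class counts equally, the low `treats` score pulls the macro average down far more than its small support would in a support-weighted average.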

## Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model_name = "your-username/pubmedbert-relation-extraction"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

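# Register the entity markers. If the saved tokenizer already contains them,
# nothing is added and the embedding matrix is left untouched. If tokens ARE
# added here, their embeddings are freshly initialized (untrained).
special_tokens = {"additional_special_tokens": ["[E1]", "[/E1]", "[E2]", "[/E2]"]}
num_added = tokenizer.add_special_tokens(special_tokens)
if num_added > 0:
    model.resize_token_embeddings(len(tokenizer))

model.eval()  # disable dropout for inference

# Example: Extract relation between aspirin and pain
text = "[E1]Aspirin[/E1] reduces [E2]pain[/E2] in patients."

inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
with torch.no_grad():  # no gradients needed at inference time
    outputs = model(**inputs)
probs = torch.softmax(outputs.logits, dim=-1)
predicted_class = torch.argmax(probs, dim=-1).item()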

print(f"Predicted relation: {model.config.id2label[predicted_class]}")
print(f"Confidence: {probs[0][predicted_class].item():.3f}")
```

## Input Format

Text must contain entity markers `[E1]`, `[/E1]`, `[E2]`, `[/E2]` around the two entities:

```
[E1]Entity1[/E1] ... context ... [E2]Entity2[/E2]
```
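
If the entities are available as character offsets rather than pre-marked text, the markers can be inserted programmatically. The helper below is a minimal sketch: `mark_entities` is a hypothetical name, not part of this model or of `transformers`, and it assumes two non-overlapping `(start, end)` character spans with an exclusive end.

```python
def mark_entities(text, e1_span, e2_span):
    """Wrap two non-overlapping character spans in [E1]/[E2] markers."""
    (s1, t1), (s2, t2) = e1_span, e2_span
    if not (t1 <= s2 or t2 <= s1):
        raise ValueError("entity spans must not overlap")
    # Insert around the rightmost span first so earlier offsets stay valid.
    pieces = sorted(
        [(s1, t1, "[E1]", "[/E1]"), (s2, t2, "[E2]", "[/E2]")],
        key=lambda piece: piece[0],
        reverse=True,
    )
    for start, end, open_tag, close_tag in pieces:
        text = text[:start] + open_tag + text[start:end] + close_tag + text[end:]
    return text


sentence = "Aspirin reduces pain in patients."
marked = mark_entities(sentence, (0, 7), (16, 20))
print(marked)
```

The spans may be passed in either order, so `[E2]` can precede `[E1]` in the text when the second entity is mentioned first.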

## Training Details

- **Optimizer:** AdamW
- **Learning Rate:** 2e-5
- **Batch Size:** 16
- **Epochs:** up to 15 (with early stopping)
- **Max Length:** 256 tokens
- **Loss:** Weighted CrossEntropy
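
This card does not say how the class weights for the weighted cross-entropy loss were chosen. One common heuristic, shown here purely as an illustration, is inverse-frequency weighting, `weight_c = N / (K * n_c)`. The sketch below uses the support counts from the per-class table as stand-in class counts (the training-set counts are not given), and the function names are hypothetical.

```python
import math

# Stand-in class counts: support values copied from the per-class table.
support = {
    "interacts_with": 1304,
    "inhibits": 2704,
    "activates": 3412,
    "converts": 884,
    "associated_with": 1769,
    "causes": 6760,
    "NO_RELATION": 6760,
    "treats": 678,
}


def inverse_frequency_weights(counts):
    """Return weight_c = N / (K * n_c) for every class c."""
    total = sum(counts.values())
    num_classes = len(counts)
    return {label: total / (num_classes * n) for label, n in counts.items()}


def weighted_cross_entropy(logits, target, weights):
    """Loss for one example: -weights[target] * log_softmax(logits)[target]."""
    max_logit = max(logits)  # subtract the max for numerical stability
    log_z = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return -weights[target] * (logits[target] - log_z)


class_weights = inverse_frequency_weights(support)
for label, weight in sorted(class_weights.items(), key=lambda kv: -kv[1]):
    print(f"{label:16s} {weight:.3f}")
```

Under this heuristic, rare classes such as `treats` receive the largest weights. In PyTorch, the same weights can be passed as a tensor, ordered by label id, to `torch.nn.CrossEntropyLoss(weight=...)`.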

## Limitations

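- The `treats` relation has a low F1 (0.28), likely due to limited training data; it also has the smallest support (678) in the per-class table
- `located_in` is listed as a relation type but has no row in the per-class table, so its performance is not reported here
- Best performance is on Chemical↔Gene/Protein and Disease relations
- Requires entity markers in the input text
- Trained on English biomedical abstracts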

## Citation

```bibtex
@misc{pubmedbert-relation-extraction,
  author = {Your Name},
  title = {PubMedBERT for Biomedical Relation Extraction},
  year = {2026},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/your-username/pubmedbert-relation-extraction}}
}
```

## Acknowledgments

- Base model: [PubMedBERT](https://huggingface.co/microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract)
- Datasets: ChemProt, BC5CDR, GAD, BioRED, DDI Corpus