metadata
language: en
license: mit
tags:
  - biomedical
  - relation-extraction
  - pubmedbert
  - named-entity-recognition
datasets:
  - chemprot
  - bc5cdr
  - gad
  - biored
  - ddi
metrics:
  - f1
  - precision
  - recall
model-index:
  - name: PubMedBERT Relation Extraction
    results:
      - task:
          type: relation-extraction
          name: Biomedical Relation Extraction
        metrics:
          - type: f1
            value: 0.7347
            name: F1 Macro

PubMedBERT for Biomedical Relation Extraction

Fine-tuned PubMedBERT for multi-class relation extraction in biomedical text.

Model Description

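This model classifies the semantic relation between two marked biomedical entities (chemicals, diseases, genes, proteins) in a passage of scientific text. Each input covers one entity pair and yields one of the relation types listed below.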

Base Model: microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract

Training Data: ChemProt, BC5CDR, GAD, BioRED, DDI Corpus

Relation Types (9):

  • activates
  • inhibits
  • converts
  • causes
  • treats
  • associated_with
  • interacts_with
  • located_in
  • NO_RELATION

Performance

| Metric   | Value  |
|----------|--------|
| F1 Macro | 0.7347 |
| Accuracy | 75.3%  |

Per-Class F1 Scores

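| Relation        | F1   | Support |
|-----------------|------|---------|
| interacts_with  | 0.85 | 1,304   |
| inhibits        | 0.84 | 2,704   |
| activates       | 0.83 | 3,412   |
| converts        | 0.82 | 884     |
| associated_with | 0.81 | 1,769   |
| causes          | 0.81 | 6,760   |
| NO_RELATION     | 0.63 | 6,760   |
| treats          | 0.28 | 678     |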
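As a rough cross-check of how the headline number relates to the per-class scores, the sketch below computes a macro average (the unweighted mean of per-class F1) and, for contrast, a support-weighted average. It uses only the rounded values from the table above, so it gives about 0.734 rather than exactly the reported 0.7347. The function names are illustrative and not part of the model's code.

```python
# Per-class (F1, support) pairs copied from the table above (rounded values).
per_class = {
    "interacts_with": (0.85, 1304),
    "inhibits": (0.84, 2704),
    "activates": (0.83, 3412),
    "converts": (0.82, 884),
    "associated_with": (0.81, 1769),
    "causes": (0.81, 6760),
    "NO_RELATION": (0.63, 6760),
    "treats": (0.28, 678),
}


def macro_f1(scores):
    """Unweighted mean of per-class F1: every class counts equally."""
    return sum(f1 for f1, _ in scores.values()) / len(scores)


def weighted_f1(scores):
    """Mean of per-class F1 weighted by each class's support."""
    total = sum(n for _, n in scores.values())
    return sum(f1 * n for f1, n in scores.values()) / total


print(f"macro F1:    {macro_f1(per_class):.3f}")
print(f"weighted F1: {weighted_f1(per_class):.3f}")
```

Because the macro average treats every class equally, the low score on treats pulls it down more than its small support would in a weighted average.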

Usage

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model_name = "your-username/pubmedbert-relation-extraction"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

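# Register the entity markers. This is a no-op if the saved tokenizer already contains them.
# If they are newly added here, their embeddings are randomly initialized and predictions will be unreliable.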
special_tokens = {"additional_special_tokens": ["[E1]", "[/E1]", "[E2]", "[/E2]"]}
tokenizer.add_special_tokens(special_tokens)
model.resize_token_embeddings(len(tokenizer))

# Example: Extract relation between aspirin and pain
text = "[E1]Aspirin[/E1] reduces [E2]pain[/E2] in patients."

inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
# Inference only: skip gradient tracking
with torch.no_grad():
    outputs = model(**inputs)
probs = torch.softmax(outputs.logits, dim=-1)
predicted_class = torch.argmax(probs, dim=-1).item()

print(f"Predicted relation: {model.config.id2label[predicted_class]}")
print(f"Confidence: {probs[0][predicted_class].item():.3f}")

Input Format

Text must wrap the first entity in [E1] ... [/E1] and the second entity in [E2] ... [/E2]:

[E1]Entity1[/E1] ... context ... [E2]Entity2[/E2]
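If entities come from an upstream NER step as character offsets, the markers can be inserted programmatically. The helper below is a minimal sketch: `mark_entities` is a hypothetical name, and it assumes half-open `(start, end)` character spans that do not overlap.

```python
def mark_entities(text, e1_span, e2_span):
    """Wrap two non-overlapping character spans in [E1]/[E2] markers.

    Spans are half-open (start, end) character offsets into `text`.
    """
    (s1, t1), (s2, t2) = e1_span, e2_span
    if not (0 <= s1 < t1 <= len(text) and 0 <= s2 < t2 <= len(text)):
        raise ValueError("spans must be non-empty and inside the text")
    if s1 < t2 and s2 < t1:
        raise ValueError("entity spans must not overlap")

    pieces = [(s1, t1, "[E1]", "[/E1]"), (s2, t2, "[E2]", "[/E2]")]
    # Insert from right to left so the earlier offsets stay valid.
    for start, end, open_tag, close_tag in sorted(pieces, reverse=True):
        text = text[:start] + open_tag + text[start:end] + close_tag + text[end:]
    return text


sentence = "Aspirin reduces pain in patients."
print(mark_entities(sentence, (0, 7), (16, 20)))
```

The second entity may appear before the first in the text; the helper handles either order because it sorts the spans by position before inserting.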

Training Details

  • Optimizer: AdamW
  • Learning Rate: 2e-5
  • Batch Size: 16
  • Epochs: up to 15 (with early stopping)
  • Max Length: 256 tokens
  • Loss: Weighted CrossEntropy
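The card does not say how the class weights for the weighted cross-entropy loss were chosen. One common scheme is inverse class frequency, sketched below. This is an assumption and not necessarily what was used for this model; the counts are the support figures from the per-class table, used here only as a stand-in for the training-set label counts, which the card does not report.

```python
def inverse_frequency_weights(counts):
    """Return weight = total / (num_classes * count) for each class.

    Rare classes get weights above 1 and frequent classes get weights
    below 1. This is one common heuristic, assumed for illustration.
    """
    total = sum(counts.values())
    num_classes = len(counts)
    return {label: total / (num_classes * n) for label, n in counts.items()}


# Stand-in counts: support values from the per-class table (assumption).
counts = {
    "interacts_with": 1304,
    "inhibits": 2704,
    "activates": 3412,
    "converts": 884,
    "associated_with": 1769,
    "causes": 6760,
    "NO_RELATION": 6760,
    "treats": 678,
}

weights = inverse_frequency_weights(counts)
for label, weight in sorted(weights.items(), key=lambda item: -item[1]):
    print(f"{label:16s} {weight:.2f}")
```

In PyTorch, weights like these would be passed as a tensor, ordered by class index, to the `weight` argument of `torch.nn.CrossEntropyLoss`.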

Limitations

  • The treats relation has low F1 (0.28), likely due to limited training data; it has the smallest support of any class (678)
  • Best performance on Chemical↔Gene/Protein and Disease relations
  • Requires entity markers in input text
  • Trained on English biomedical abstracts

Citation

@misc{pubmedbert-relation-extraction,
  author = {Your Name},
  title = {PubMedBERT for Biomedical Relation Extraction},
  year = {2026},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/your-username/pubmedbert-relation-extraction}}
}

Acknowledgments

  • Base model: PubMedBERT
  • Datasets: ChemProt, BC5CDR, GAD, BioRED, DDI Corpus