---
language:
- en
tags:
- ner
- biomedical
- token-classification
- roberta
license: apache-2.0
datasets:
- bc5cdr
- ncbi_disease
---

# biomedical_ner_roberta_base

## Overview

`biomedical_ner_roberta_base` is a token classification model fine-tuned for Named Entity Recognition (NER) in the biomedical domain. It is designed to extract entities from scientific abstracts, clinical notes, and medical literature.

The model identifies three primary entity types using the BIO labeling scheme:

* **DISEASE**: Pathological conditions, signs, and symptoms.
* **CHEMICAL**: Drugs, medications, and chemical compounds.
* **GENE**: Genes, proteins, and related molecular structures.

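
To illustrate the BIO scheme, the following minimal sketch groups per-token tags into entity spans. The function name `bio_to_spans`, the whitespace-style tokens, and the example tags are assumptions made for illustration; they are not output from this model.

```python
from typing import List, Tuple


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (entity_text, entity_type) pairs."""
    spans = []
    current_tokens, current_type = [], None

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new entity, closing any open one.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag of the same type continues the open entity.
            current_tokens.append(token)
        else:
            # "O" (or an I- tag with no matching open entity) closes the span.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None

    if current_tokens:
        spans.append((" ".join(current_tokens), current_type))
    return spans


# Hand-written example tags, not actual model predictions.
tokens = ["metformin", "for", "type", "2", "diabetes"]
tags = ["B-CHEMICAL", "O", "B-DISEASE", "I-DISEASE", "I-DISEASE"]
print(bio_to_spans(tokens, tags))
# → [('metformin', 'CHEMICAL'), ('type 2 diabetes', 'DISEASE')]
```

In practice the `aggregation_strategy` option of the `transformers` pipeline performs this grouping, so a helper like this is mainly useful when working with raw per-token predictions.
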
## Model Architecture

This model is based on the `roberta-base` architecture, fine-tuned using `RobertaForTokenClassification`. It was trained on a composite dataset including BC5CDR (BioCreative V CDR task corpus) and the NCBI Disease corpus.

- **Base Model:** RoBERTa Base (12 layers, 768 hidden dimension, 12 heads, 125M parameters).
- **Task:** Token Classification (7 labels: O, B-DISEASE, I-DISEASE, B-CHEMICAL, I-CHEMICAL, B-GENE, I-GENE).

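
Token classification heads map each output index to one of these labels. The sketch below builds such a mapping from the seven labels listed above; the index order is an assumption for illustration, and the authoritative mapping is whatever `model.config.id2label` reports for the loaded checkpoint.

```python
# Entity types listed in this card; the index order below is assumed.
ENTITY_TYPES = ["DISEASE", "CHEMICAL", "GENE"]

# "O" marks tokens outside any entity; each type gets a B- and an I- tag.
labels = ["O"] + [
    f"{prefix}-{etype}" for etype in ENTITY_TYPES for prefix in ("B", "I")
]

# Forward and reverse lookups, in the shape used by model configurations.
id2label = dict(enumerate(labels))
label2id = {label: idx for idx, label in id2label.items()}

print(id2label)
```

Three entity types with a B- and an I- tag each, plus `O`, give the 7 labels stated above.
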
## Intended Use

This model is intended for researchers and developers working with biomedical text data.

- **Information Extraction:** Automated parsing of PubMed abstracts to identify key biomedical concepts.
- **Knowledge Graph Construction:** Linking genes, drugs, and diseases discovered in text to structured knowledge bases.
- **Clinical Text Mining:** Assisting in extracting relevant information from unstructured electronic health records (EHRs).

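
As a possible first step toward the knowledge-graph use case, the sketch below groups aggregated pipeline output by entity type and drops low-confidence predictions. The helper name `group_entities`, the `0.5` threshold, and the sample records are illustrative assumptions; the records only mimic the dictionary keys (`entity_group`, `word`, `score`) that the `ner` pipeline returns when an aggregation strategy is set.

```python
from collections import defaultdict
from typing import Dict, List


def group_entities(results: List[dict], min_score: float = 0.5) -> Dict[str, List[str]]:
    """Collect unique entity strings per label, skipping low-confidence hits."""
    grouped = defaultdict(list)
    for entity in results:
        if entity["score"] < min_score:
            continue
        # Decoded words may carry surrounding whitespace, so normalize them.
        word = entity["word"].strip()
        if word not in grouped[entity["entity_group"]]:
            grouped[entity["entity_group"]].append(word)
    return dict(grouped)


# Hand-written records in the pipeline's output shape, not real predictions.
sample = [
    {"entity_group": "CHEMICAL", "word": " metformin", "score": 0.99},
    {"entity_group": "DISEASE", "word": " type 2 diabetes", "score": 0.98},
    {"entity_group": "CHEMICAL", "word": " metformin", "score": 0.97},
    {"entity_group": "GENE", "word": " SLC22A1", "score": 0.31},
]
print(group_entities(sample))
```

Linking the grouped strings to identifiers in a structured knowledge base is a separate normalization step that this model does not perform.
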
### How to use

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

# Placeholder repository ID; replace with the model's actual Hub path.
model_name = "your_username/biomedical_ner_roberta_base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# aggregation_strategy="simple" merges sub-word tokens into whole entity spans.
nlp = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

text = "The patient was treated with metformin for type 2 diabetes, but showed resistance related to the SLC22A1 gene variant."
results = nlp(text)

for entity in results:
    print(f"Entity: {entity['word']}, Label: {entity['entity_group']}, Score: {entity['score']:.4f}")

# Expected Output structure:
# Entity: metformin, Label: CHEMICAL, Score: 0.99...
# Entity: type 2 diabetes, Label: DISEASE, Score: 0.98...
# Entity: SLC22A1, Label: GENE, Score: 0.97...