---
language:
- en
license: mit
pipeline_tag: token-classification
tags:
- medical
- biomedical
- ner
- named-entity-recognition
- biobert
- jargon-detection
datasets:
- tner/bc5cdr
base_model: dmis-lab/biobert-v1.1
metrics:
- f1
- precision
- recall
model-index:
- name: BioBERT-BC5CDR-NER
  results:
  - task:
      type: token-classification
      name: Named Entity Recognition
    dataset:
      name: BC5CDR
      type: tner/bc5cdr
    metrics:
    - type: f1
      value: 0.9186
      name: F1 Score
    - type: precision
      value: 0.9056
      name: Precision
    - type: recall
      value: 0.9319
      name: Recall
---
# Medical Named Entity Recognition (NER) Model

## Model Description

This model is a fine-tuned version of [dmis-lab/biobert-v1.1](https://huggingface.co/dmis-lab/biobert-v1.1) on the BC5CDR dataset for medical named entity recognition.

**What it does:** identifies medical terminology in text, specifically:

- **Chemical entities**: drug names and chemical compounds (e.g., aspirin, metformin)
- **Disease entities**: medical conditions and diseases (e.g., hypertension, diabetes)

**Intended use:** assisting readers of medical literature by highlighting and explaining technical terminology.
## Training Data

- **Dataset**: BC5CDR (BioCreative V Chemical-Disease Relation)
- **Training samples**: 5,228 sentences
- **Validation samples**: 5,330 sentences
- **Test samples**: 5,865 sentences
- **Entity types**: 5 labels (`O`, `B-Chemical`, `I-Chemical`, `B-Disease`, `I-Disease`)
## Model Performance

Evaluated on the BC5CDR test set:

| Metric | Score |
|---|---|
| F1 score | 0.9186 |
| Precision | 0.9056 |
| Recall | 0.9319 |
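Scores of this kind are conventionally computed at the entity level: a prediction counts as correct only if both the span and the type match a gold entity exactly (libraries such as seqeval implement this). A minimal, self-contained sketch of that logic on a toy tag sequence:

```python
# Entity-level P/R/F1 sketch (assumption: exact span-and-type matching,
# the usual convention for BC5CDR-style NER evaluation).

def extract_entities(tags):
    """Turn an IOB2 tag sequence into a set of (start, end, type) spans."""
    entities, start, etype = set(), None, None
    for i, tag in enumerate(tags + ["O"]):  # trailing "O" flushes the last span
        inside_same = tag.startswith("I-") and tag[2:] == etype
        if start is not None and not inside_same:
            entities.add((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, etype = i, tag[2:]
    return entities

gold = ["O", "B-Disease", "I-Disease", "O", "B-Chemical"]
pred = ["O", "B-Disease", "I-Disease", "O", "B-Disease"]

g, p = extract_entities(gold), extract_entities(pred)
tp = len(g & p)                             # exact span+type matches
precision = tp / len(p)                     # 1 correct of 2 predicted -> 0.5
recall = tp / len(g)                        # 1 found of 2 gold -> 0.5
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)                # 0.5 0.5 0.5
```

Note that the mistyped `B-Disease` on the last token counts as both a false positive and a false negative, which is why entity-level scores are stricter than token accuracy.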
## Usage

### Basic Usage

```python
from transformers import pipeline

# Load the model (replace {repo_id} with this repository's id)
ner = pipeline(
    "token-classification",
    model="{repo_id}",
    aggregation_strategy="simple",
)

# Analyze medical text
text = "Patient diagnosed with hypertension and prescribed metformin."
results = ner(text)

# Print results
for entity in results:
    print(f"{entity['word']}: {entity['entity_group']} ({entity['score']:.2f})")
```

Output:

```
hypertension: Disease (0.99)
metformin: Chemical (0.99)
```
### Advanced Usage

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

# Load model and tokenizer (replace {repo_id} with this repository's id)
tokenizer = AutoTokenizer.from_pretrained("{repo_id}")
model = AutoModelForTokenClassification.from_pretrained("{repo_id}")

# Tokenize input
text = "Patient has diabetes and takes aspirin."
inputs = tokenizer(text, return_tensors="pt")

# Get predictions
with torch.no_grad():
    outputs = model(**inputs)
predictions = torch.argmax(outputs.logits, dim=-1)

# Decode predictions
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
labels = [model.config.id2label[p.item()] for p in predictions[0]]

for token, label in zip(tokens, labels):
    if label != "O":
        print(f"{token}: {label}")
```
## Label Schema

The model uses the IOB2 tagging scheme:

| Label | Description |
|---|---|
| `O` | Outside any entity |
| `B-Chemical` | Beginning of a chemical/drug entity |
| `I-Chemical` | Inside a chemical/drug entity (continuation) |
| `B-Disease` | Beginning of a disease entity |
| `I-Disease` | Inside a disease entity (continuation) |
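To turn a tagged token sequence into readable entities, a `B-` tag opens a span and subsequent `I-` tags of the same type extend it. A small illustrative helper (hypothetical, not part of the model's API):

```python
def group_entities(tokens, labels):
    """Merge IOB2-tagged tokens into (phrase, entity_type) pairs.

    Illustrative helper only; the pipeline's aggregation_strategy
    does this (plus subword handling) for you.
    """
    entities, current, etype = [], [], None
    for tok, lab in zip(tokens, labels):
        if lab.startswith("B-"):
            if current:                                # close previous span
                entities.append((" ".join(current), etype))
            current, etype = [tok], lab[2:]
        elif lab.startswith("I-") and current and lab[2:] == etype:
            current.append(tok)                        # continue current span
        else:                                          # "O" or stray tag
            if current:
                entities.append((" ".join(current), etype))
            current, etype = [], None
    if current:                                        # flush trailing span
        entities.append((" ".join(current), etype))
    return entities

tokens = ["Acute", "renal", "failure", "after", "cisplatin", "."]
labels = ["B-Disease", "I-Disease", "I-Disease", "O", "B-Chemical", "O"]
print(group_entities(tokens, labels))
# [('Acute renal failure', 'Disease'), ('cisplatin', 'Chemical')]
```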
## Training Details

### Training Hyperparameters

- **Base model**: dmis-lab/biobert-v1.1
- **Training regime**: fine-tuning
- **Optimizer**: AdamW
- **Learning rate**: 5e-5
- **Batch size**: 16 (per device)
- **Number of epochs**: 3
- **Weight decay**: 0.01
- **Learning rate scheduler**: linear with warmup
- **Mixed precision**: FP16
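The hyperparameters above map onto Hugging Face `TrainingArguments` roughly as follows. This is a sketch, not the exact training script: `output_dir` and `warmup_ratio` are assumptions (only "linear warmup" is stated, not its length).

```python
from transformers import TrainingArguments

# Sketch of the configuration implied by the list above; all values
# except output_dir and warmup_ratio mirror the stated hyperparameters.
args = TrainingArguments(
    output_dir="biobert-bc5cdr-ner",   # assumption
    learning_rate=5e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    lr_scheduler_type="linear",
    warmup_ratio=0.1,                  # assumption: warmup length not stated
    fp16=True,                         # mixed precision (requires a GPU)
)
```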
### Training Environment

- **Framework**: PyTorch with the Transformers library
- **Hardware**: NVIDIA T4 GPU (Google Colab)
- **Training time**: ~30 minutes
### Data Preprocessing

- Tokenization with the BioBERT WordPiece tokenizer
- Maximum sequence length: 128 tokens
- Label alignment for subword tokens
- Special tokens: `[CLS]`, `[SEP]`
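The label-alignment step propagates word-level labels onto WordPiece subwords. A common convention (used by the Hugging Face token-classification examples) masks special tokens with `-100` so the loss ignores them, and gives continuation pieces of a `B-` word the matching `I-` label. A sketch under that assumption (the `label2id` order here is illustrative; the trained model's mapping may differ):

```python
def align_labels(word_ids, word_labels, label2id):
    """Map word-level IOB2 labels onto subword positions.

    word_ids has one entry per subword, as returned by a fast
    tokenizer's word_ids() (None for specials like [CLS]/[SEP]).
    """
    aligned, prev = [], None
    for wid in word_ids:
        if wid is None:                       # special token -> ignored by loss
            aligned.append(-100)
        elif wid != prev:                     # first piece keeps the word label
            aligned.append(label2id[word_labels[wid]])
        else:                                 # continuation piece: B- becomes I-
            lab = word_labels[wid]
            if lab.startswith("B-"):
                lab = "I-" + lab[2:]
            aligned.append(label2id[lab])
        prev = wid
    return aligned

label2id = {"O": 0, "B-Chemical": 1, "I-Chemical": 2, "B-Disease": 3, "I-Disease": 4}
# "Patient has diabetes" -> [CLS] patient has dia ##bet ##es [SEP]
word_ids = [None, 0, 1, 2, 2, 2, None]
word_labels = ["O", "O", "B-Disease"]
print(align_labels(word_ids, word_labels, label2id))
# [-100, 0, 0, 3, 4, 4, -100]
```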
## Limitations and Bias

### Limitations

- **Domain-specific**: trained on biomedical literature; may not perform well on clinical notes or patient records
- **Entity types**: only detects chemicals and diseases; does not identify procedures, anatomical terms, or symptoms
- **Language**: English only
- **Abbreviations**: may struggle with uncommon medical abbreviations
- **Context**: does not disambiguate terms (e.g., "cold" as temperature vs. illness)
### Potential Biases

- Training data (BC5CDR) comes from scientific publications, which may use different terminology than patient-facing materials
- An imbalance between chemical and disease entities in the training data may affect performance across the two types
- Recent medical terminology may be underrepresented if absent from the training corpus
## Ethical Considerations

- **Not for medical diagnosis**: this model is for educational/assistive purposes only
- **Human oversight required**: always verify medical information with qualified healthcare professionals
- **Privacy**: do not input personally identifiable information (PII) or protected health information (PHI)
## Citation

If you use this model, please cite:
```bibtex
@misc{BioBERT-BC5CDR-NER,
  author    = {YOUR_NAME},
  title     = {Medical Named Entity Recognition with BioBERT},
  year      = {2024},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/{repo_id}}
}
```
Also cite the original BC5CDR dataset:

```bibtex
@article{wei2016assessing,
  title     = {Assessing the state of the art in biomedical relation extraction: overview of the BioCreative V chemical-disease relation (CDR) task},
  author    = {Wei, Chih-Hsuan and Peng, Yifan and Leaman, Robert and Davis, Allan Peter and Mattingly, Carolyn J and Li, Jiao and Wiegers, Thomas C and Lu, Zhiyong},
  journal   = {Database},
  volume    = {2016},
  year      = {2016},
  publisher = {Oxford Academic}
}
```
And the BioBERT model:

```bibtex
@article{lee2020biobert,
  title     = {BioBERT: a pre-trained biomedical language representation model for biomedical text mining},
  author    = {Lee, Jinhyuk and Yoon, Wonjin and Kim, Sungdong and Kim, Donghyeon and Kim, Sunkyu and So, Chan Ho and Kang, Jaewoo},
  journal   = {Bioinformatics},
  volume    = {36},
  number    = {4},
  pages     = {1234--1240},
  year      = {2020},
  publisher = {Oxford University Press}
}
```
## Contact

- **Author**: {YOUR_NAME}
- **Email**: {YOUR_EMAIL}
- **GitHub**: [Your GitHub Profile]
- **Project repository**: [Link to your project repo]
## Acknowledgments

- Base model: [dmis-lab/biobert-v1.1](https://huggingface.co/dmis-lab/biobert-v1.1)
- Dataset: [BC5CDR](https://huggingface.co/datasets/tner/bc5cdr)
- Built with HuggingFace Transformers

## License

This model is released under the MIT License. See LICENSE for details.