---
language:
  - en
license: mit
pipeline_tag: token-classification
tags:
  - medical
  - biomedical
  - ner
  - named-entity-recognition
  - biobert
  - jargon-detection
datasets:
  - tner/bc5cdr
base_model: dmis-lab/biobert-v1.1
metrics:
  - f1
  - precision
  - recall
model-index:
  - name: BioBERT-BC5CDR-NER
    results:
      - task:
          type: token-classification
          name: Named Entity Recognition
        dataset:
          name: BC5CDR
          type: tner/bc5cdr
        metrics:
          - type: f1
            value: 0.9186
            name: F1 Score
          - type: precision
            value: 0.9056
          - type: recall
            value: 0.9319
---

# Medical Named Entity Recognition (NER) Model

## Model Description

This model is a fine-tuned version of [dmis-lab/biobert-v1.1](https://huggingface.co/dmis-lab/biobert-v1.1) on the BC5CDR dataset for medical named entity recognition.

**What it does:** identifies medical terminology in text, specifically:

- **Chemical entities:** drug names and chemical compounds (e.g., aspirin, metformin)
- **Disease entities:** medical conditions and diseases (e.g., hypertension, diabetes)

**Intended use:** assist in reading medical literature by highlighting and explaining technical terminology.

## Training Data

- **Dataset:** BC5CDR (BioCreative V Chemical-Disease Relation)
- **Training samples:** 5,228 sentences
- **Validation samples:** 5,330 sentences
- **Test samples:** 5,865 sentences
- **Entity types:** 5 labels (O, B-Chemical, I-Chemical, B-Disease, I-Disease)
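
The five labels above correspond to the usual `id2label`/`label2id` mappings. The sketch below shows that mapping; the integer ordering is an assumption, so verify it against `model.config.id2label` or the `tner/bc5cdr` feature definition before relying on it:

```python
# The 5 BC5CDR labels; this ordering is an assumption -- confirm it against
# model.config.id2label or the tner/bc5cdr feature definition.
LABELS = ["O", "B-Chemical", "I-Chemical", "B-Disease", "I-Disease"]

id2label = dict(enumerate(LABELS))
label2id = {label: i for i, label in id2label.items()}

print(id2label)
```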

## Model Performance

Evaluated on the BC5CDR test set:

| Metric    | Score    |
|-----------|----------|
| F1 score  | 0.918555 |
| Precision | 0.905610 |
| Recall    | 0.931875 |
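
For context, entity-level scores of the kind reported above (as computed by tools such as `seqeval`) count an entity as correct only when both its type and its span match exactly. A minimal stdlib sketch of that calculation, using hypothetical gold and predicted spans:

```python
def entity_prf(true_spans, pred_spans):
    """Entity-level precision/recall/F1 over sets of (type, start, end) spans.
    A prediction counts as correct only on an exact type-and-span match."""
    tp = len(true_spans & pred_spans)
    precision = tp / len(pred_spans) if pred_spans else 0.0
    recall = tp / len(true_spans) if true_spans else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical example: two gold entities, one correctly predicted
gold = {("Disease", 3, 4), ("Chemical", 7, 8)}
pred = {("Disease", 3, 4)}
print(entity_prf(gold, pred))  # (1.0, 0.5, 0.6666666666666666)
```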

## Usage

### Basic Usage

```python
from transformers import pipeline

# Load the model
ner = pipeline(
    "token-classification",
    model="viop1504/medjar-ner-model",
    aggregation_strategy="simple",
)

# Analyze medical text
text = "Patient diagnosed with hypertension and prescribed metformin."
results = ner(text)

# Print results
for entity in results:
    print(f"{entity['word']}: {entity['entity_group']} ({entity['score']:.2f})")
```

Output:

```
hypertension: Disease (0.99)
metformin: Chemical (0.99)
```

### Advanced Usage

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("viop1504/medjar-ner-model")
model = AutoModelForTokenClassification.from_pretrained("viop1504/medjar-ner-model")

# Tokenize input
text = "Patient has diabetes and takes aspirin."
inputs = tokenizer(text, return_tensors="pt")

# Get predictions
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.argmax(outputs.logits, dim=-1)

# Decode predictions
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
labels = [model.config.id2label[p.item()] for p in predictions[0]]

for token, label in zip(tokens, labels):
    if label != "O":
        print(f"{token}: {label}")
```

## Label Schema

The model uses the IOB2 tagging scheme:

| Label      | Description                                  |
|------------|----------------------------------------------|
| O          | Outside any entity                           |
| B-Chemical | Beginning of a chemical/drug entity          |
| I-Chemical | Inside a chemical/drug entity (continuation) |
| B-Disease  | Beginning of a disease entity                |
| I-Disease  | Inside a disease entity (continuation)       |
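
To make the scheme concrete, here is a small, hypothetical helper (not part of this repository) that groups IOB2 tags into entity spans:

```python
def iob2_spans(tokens, labels):
    """Group IOB2-tagged tokens into (entity_type, text) spans."""
    spans, current = [], None
    for token, label in zip(tokens, labels):
        if label.startswith("B-"):                # a new entity begins
            if current:
                spans.append(current)
            current = (label[2:], [token])
        elif label.startswith("I-") and current and current[0] == label[2:]:
            current[1].append(token)              # continuation of the entity
        else:                                     # "O" or an inconsistent "I-" tag
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return [(etype, " ".join(toks)) for etype, toks in spans]

tokens = ["Patient", "has", "type", "2", "diabetes", "and", "takes", "aspirin", "."]
labels = ["O", "O", "B-Disease", "I-Disease", "I-Disease", "O", "O", "B-Chemical", "O"]
print(iob2_spans(tokens, labels))
# [('Disease', 'type 2 diabetes'), ('Chemical', 'aspirin')]
```

With `aggregation_strategy="simple"`, the pipeline in the usage examples performs similar grouping for you.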

## Training Details

### Training Hyperparameters

- **Base model:** dmis-lab/biobert-v1.1
- **Training regime:** fine-tuning
- **Optimizer:** AdamW
- **Learning rate:** 5e-5
- **Batch size:** 16 (per device)
- **Number of epochs:** 3
- **Weight decay:** 0.01
- **Learning rate scheduler:** linear with warmup
- **Mixed precision:** FP16
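
These hyperparameters map onto `transformers.TrainingArguments` roughly as follows. This is a sketch, not the actual training script; `output_dir` and `warmup_ratio` are assumptions not stated in this card:

```python
from transformers import TrainingArguments

# Sketch of the hyperparameters listed above.
training_args = TrainingArguments(
    output_dir="./biobert-ner",       # assumption: the card does not state a path
    learning_rate=5e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    lr_scheduler_type="linear",
    warmup_ratio=0.1,                 # "linear warmup"; the exact ratio is not stated
    fp16=True,                        # mixed precision; requires a CUDA GPU
)
```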

### Training Environment

- **Framework:** PyTorch with the Transformers library
- **Hardware:** NVIDIA T4 GPU (Google Colab)
- **Training time:** ~30 minutes

### Data Preprocessing

1. Tokenization with the BioBERT WordPiece tokenizer
2. Maximum sequence length: 128 tokens
3. Label alignment for subword tokens
4. Special tokens: `[CLS]`, `[SEP]`
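
Step 3 is the subtle part: WordPiece may split one word into several sub-tokens, but there is only one label per word. A common scheme, sketched below on the assumption that this model followed it, labels the first sub-token of each word and masks the rest with `-100` so the loss function ignores them:

```python
def align_labels(word_labels, word_ids, ignore_index=-100):
    """Propagate word-level labels to subword tokens. Special tokens and
    continuation sub-tokens receive ignore_index and are skipped by the loss."""
    aligned, previous = [], None
    for word_id in word_ids:
        if word_id is None:               # [CLS], [SEP], or padding
            aligned.append(ignore_index)
        elif word_id != previous:         # first sub-token of a word
            aligned.append(word_labels[word_id])
        else:                             # continuation sub-token
            aligned.append(ignore_index)
        previous = word_id
    return aligned

# word_ids as produced by a fast tokenizer's encoding.word_ids();
# here the second word is split into two sub-tokens.
word_ids = [None, 0, 1, 1, 2, None]
word_labels = [0, 1, 0]                   # O, B-Chemical, O as integer ids
print(align_labels(word_labels, word_ids))  # [-100, 0, 1, -100, 0, -100]
```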

## Limitations and Bias

### Limitations

- **Domain-specific:** trained on biomedical literature; may not perform well on clinical notes or patient records
- **Entity types:** only detects chemicals and diseases; does not identify procedures, anatomical terms, or symptoms
- **Language:** English only
- **Abbreviations:** may struggle with uncommon medical abbreviations
- **Context:** does not disambiguate terms (e.g., "cold" as temperature vs. illness)

### Potential Biases

- The training data (BC5CDR) comes from scientific publications, whose terminology differs from patient-facing materials
- The training data contains more chemical entities than disease entities, which may skew performance between the two types
- Drug and disease names coined after the corpus was collected are absent from it and may not be recognized reliably

### Ethical Considerations

- **Not for medical diagnosis:** this model is for educational/assistive purposes only
- **Human oversight required:** always verify medical information with qualified healthcare professionals
- **Privacy:** do not input personally identifiable information (PII) or protected health information (PHI)

## Citation

If you use this model, please cite:

```bibtex
@misc{viop1504-medjar-ner-model,
  author = {YOUR_NAME},
  title = {Medical Named Entity Recognition with BioBERT},
  year = {2024},
  publisher = {HuggingFace},
  url = {https://huggingface.co/viop1504/medjar-ner-model}
}
```

Also cite the original BC5CDR dataset:

```bibtex
@article{wei2016assessing,
  title={Assessing the state of the art in biomedical relation extraction: overview of the BioCreative V chemical-disease relation (CDR) task},
  author={Wei, Chih-Hsuan and Peng, Yifan and Leaman, Robert and Davis, Allan Peter and Mattingly, Carolyn J and Li, Jiao and Wiegers, Thomas C and Lu, Zhiyong},
  journal={Database},
  volume={2016},
  year={2016},
  publisher={Oxford Academic}
}
```

And the BioBERT model:

```bibtex
@article{lee2020biobert,
  title={BioBERT: a pre-trained biomedical language representation model for biomedical text mining},
  author={Lee, Jinhyuk and Yoon, Wonjin and Kim, Sungdong and Kim, Donghyeon and Kim, Sunkyu and So, Chan Ho and Kang, Jaewoo},
  journal={Bioinformatics},
  volume={36},
  number={4},
  pages={1234--1240},
  year={2020},
  publisher={Oxford University Press}
}
```

## Contact

- **Author:** {YOUR_NAME}
- **Email:** {YOUR_EMAIL}
- **GitHub:** Your GitHub Profile
- **Project repository:** [Link to your project repo]


## License

This model is released under the MIT License. See LICENSE for details.


