---
language:
  - en
license: mit
pipeline_tag: token-classification
tags:
  - medical
  - biomedical
  - ner
  - named-entity-recognition
  - biobert
  - jargon-detection
datasets:
  - tner/bc5cdr
base_model: dmis-lab/biobert-v1.1
metrics:
  - f1
  - precision
  - recall
model-index:
  - name: BioBERT-BC5CDR-NER
    results:
      - task:
          type: token-classification
          name: Named Entity Recognition
        dataset:
          name: BC5CDR
          type: tner/bc5cdr
        metrics:
          - type: f1
            value: 0.9186
            name: F1 Score
          - type: precision
            value: 0.9056
          - type: recall
            value: 0.9319
---

# Medical Named Entity Recognition (NER) Model

## Model Description

This model is a fine-tuned version of [dmis-lab/biobert-v1.1](https://huggingface.co/dmis-lab/biobert-v1.1) on the BC5CDR dataset for medical named entity recognition.

**What it does:** identifies medical terminology in text, specifically:

- **Chemical entities:** drug names and chemical compounds (e.g., aspirin, metformin)
- **Disease entities:** medical conditions and diseases (e.g., hypertension, diabetes)

**Intended use:** assist in reading medical literature by highlighting and explaining technical terminology.

## Training Data

- **Dataset:** BC5CDR (BioCreative V Chemical-Disease Relation)
- **Training samples:** 5,228 sentences
- **Validation samples:** 5,330 sentences
- **Test samples:** 5,865 sentences
- **Entity types:** 5 labels (O, B-Chemical, I-Chemical, B-Disease, I-Disease)
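
The five labels above correspond to the usual `id2label`/`label2id` mappings. The sketch below shows that mapping; the integer ordering is an assumption, so verify it against `model.config.id2label` or the `tner/bc5cdr` feature definition before relying on it:

```python
# The 5 BC5CDR labels; this ordering is an assumption -- confirm it against
# model.config.id2label or the tner/bc5cdr feature definition.
LABELS = ["O", "B-Chemical", "I-Chemical", "B-Disease", "I-Disease"]

id2label = dict(enumerate(LABELS))
label2id = {label: i for i, label in id2label.items()}

print(id2label)
```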

## Model Performance

Evaluated on the BC5CDR test set:

| Metric    | Score    |
|-----------|----------|
| F1 score  | 0.918555 |
| Precision | 0.905610 |
| Recall    | 0.931875 |
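
For context, entity-level scores of the kind reported above (as computed by tools such as `seqeval`) count an entity as correct only when both its type and its span match exactly. A minimal stdlib sketch of that calculation, using hypothetical gold and predicted spans:

```python
def entity_prf(true_spans, pred_spans):
    """Entity-level precision/recall/F1 over sets of (type, start, end) spans.
    A prediction counts as correct only on an exact type-and-span match."""
    tp = len(true_spans & pred_spans)
    precision = tp / len(pred_spans) if pred_spans else 0.0
    recall = tp / len(true_spans) if true_spans else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical example: two gold entities, one correctly predicted
gold = {("Disease", 3, 4), ("Chemical", 7, 8)}
pred = {("Disease", 3, 4)}
print(entity_prf(gold, pred))  # (1.0, 0.5, 0.6666666666666666)
```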

## Usage

### Basic Usage

```python
from transformers import pipeline

# Load the model
ner = pipeline(
    "token-classification",
    model="viop1504/medjar-ner-model",
    aggregation_strategy="simple",
)

# Analyze medical text
text = "Patient diagnosed with hypertension and prescribed metformin."
results = ner(text)

# Print results
for entity in results:
    print(f"{entity['word']}: {entity['entity_group']} ({entity['score']:.2f})")
```

Output:

```
hypertension: Disease (0.99)
metformin: Chemical (0.99)
```

### Advanced Usage

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("viop1504/medjar-ner-model")
model = AutoModelForTokenClassification.from_pretrained("viop1504/medjar-ner-model")

# Tokenize input
text = "Patient has diabetes and takes aspirin."
inputs = tokenizer(text, return_tensors="pt")

# Get predictions
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.argmax(outputs.logits, dim=-1)

# Decode predictions
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
labels = [model.config.id2label[p.item()] for p in predictions[0]]

for token, label in zip(tokens, labels):
    if label != "O":
        print(f"{token}: {label}")
```

## Label Schema

The model uses the IOB2 tagging scheme:

| Label      | Description                                  |
|------------|----------------------------------------------|
| O          | Outside any entity                           |
| B-Chemical | Beginning of a chemical/drug entity          |
| I-Chemical | Inside a chemical/drug entity (continuation) |
| B-Disease  | Beginning of a disease entity                |
| I-Disease  | Inside a disease entity (continuation)       |
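
To make the scheme concrete, here is a small, hypothetical helper (not part of this repository) that groups IOB2 tags into entity spans:

```python
def iob2_spans(tokens, labels):
    """Group IOB2-tagged tokens into (entity_type, text) spans."""
    spans, current = [], None
    for token, label in zip(tokens, labels):
        if label.startswith("B-"):                # a new entity begins
            if current:
                spans.append(current)
            current = (label[2:], [token])
        elif label.startswith("I-") and current and current[0] == label[2:]:
            current[1].append(token)              # continuation of the entity
        else:                                     # "O" or an inconsistent "I-" tag
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return [(etype, " ".join(toks)) for etype, toks in spans]

tokens = ["Patient", "has", "type", "2", "diabetes", "and", "takes", "aspirin", "."]
labels = ["O", "O", "B-Disease", "I-Disease", "I-Disease", "O", "O", "B-Chemical", "O"]
print(iob2_spans(tokens, labels))
# [('Disease', 'type 2 diabetes'), ('Chemical', 'aspirin')]
```

With `aggregation_strategy="simple"`, the pipeline in the usage examples performs similar grouping for you.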

## Training Details

### Training Hyperparameters

- **Base model:** dmis-lab/biobert-v1.1
- **Training regime:** fine-tuning
- **Optimizer:** AdamW
- **Learning rate:** 5e-5
- **Batch size:** 16 (per device)
- **Number of epochs:** 3
- **Weight decay:** 0.01
- **Learning rate scheduler:** linear with warmup
- **Mixed precision:** FP16
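
These hyperparameters map onto `transformers.TrainingArguments` roughly as follows. This is a sketch, not the actual training script; `output_dir` and `warmup_ratio` are assumptions not stated in this card:

```python
from transformers import TrainingArguments

# Sketch of the hyperparameters listed above.
training_args = TrainingArguments(
    output_dir="./biobert-ner",       # assumption: the card does not state a path
    learning_rate=5e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    lr_scheduler_type="linear",
    warmup_ratio=0.1,                 # "linear warmup"; the exact ratio is not stated
    fp16=True,                        # mixed precision; requires a CUDA GPU
)
```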

### Training Environment

- **Framework:** PyTorch with the Transformers library
- **Hardware:** NVIDIA T4 GPU (Google Colab)
- **Training time:** ~30 minutes

### Data Preprocessing

1. Tokenization with the BioBERT WordPiece tokenizer
2. Maximum sequence length: 128 tokens
3. Label alignment for subword tokens
4. Special tokens: `[CLS]`, `[SEP]`
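
Step 3 is the subtle part: WordPiece may split one word into several sub-tokens, but there is only one label per word. A common scheme, sketched below on the assumption that this model followed it, labels the first sub-token of each word and masks the rest with `-100` so the loss function ignores them:

```python
def align_labels(word_labels, word_ids, ignore_index=-100):
    """Propagate word-level labels to subword tokens. Special tokens and
    continuation sub-tokens receive ignore_index and are skipped by the loss."""
    aligned, previous = [], None
    for word_id in word_ids:
        if word_id is None:               # [CLS], [SEP], or padding
            aligned.append(ignore_index)
        elif word_id != previous:         # first sub-token of a word
            aligned.append(word_labels[word_id])
        else:                             # continuation sub-token
            aligned.append(ignore_index)
        previous = word_id
    return aligned

# word_ids as produced by a fast tokenizer's encoding.word_ids();
# here the second word is split into two sub-tokens.
word_ids = [None, 0, 1, 1, 2, None]
word_labels = [0, 1, 0]                   # O, B-Chemical, O as integer ids
print(align_labels(word_labels, word_ids))  # [-100, 0, 1, -100, 0, -100]
```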

## Limitations and Bias

### Limitations

- **Domain-specific:** trained on biomedical literature; may not perform well on clinical notes or patient records
- **Entity types:** only detects chemicals and diseases; does not identify procedures, anatomical terms, or symptoms
- **Language:** English only
- **Abbreviations:** may struggle with uncommon medical abbreviations
- **Context:** does not disambiguate terms (e.g., "cold" as temperature vs. illness)

### Potential Biases

- The training data (BC5CDR) comes from scientific publications, whose terminology differs from patient-facing materials
- The training data contains more chemical entities than disease entities, which may skew performance between the two types
- Drug and disease names coined after the corpus was collected are absent from it and may not be recognized reliably

### Ethical Considerations

- **Not for medical diagnosis:** this model is for educational/assistive purposes only
- **Human oversight required:** always verify medical information with qualified healthcare professionals
- **Privacy:** do not input personally identifiable information (PII) or protected health information (PHI)

## Citation

If you use this model, please cite:

```bibtex
@misc{viop1504-medjar-ner-model,
  author = {YOUR_NAME},
  title = {Medical Named Entity Recognition with BioBERT},
  year = {2024},
  publisher = {HuggingFace},
  url = {https://huggingface.co/viop1504/medjar-ner-model}
}
```

Also cite the original BC5CDR dataset:

```bibtex
@article{wei2016assessing,
  title={Assessing the state of the art in biomedical relation extraction: overview of the BioCreative V chemical-disease relation (CDR) task},
  author={Wei, Chih-Hsuan and Peng, Yifan and Leaman, Robert and Davis, Allan Peter and Mattingly, Carolyn J and Li, Jiao and Wiegers, Thomas C and Lu, Zhiyong},
  journal={Database},
  volume={2016},
  year={2016},
  publisher={Oxford Academic}
}
```

And the BioBERT model:

```bibtex
@article{lee2020biobert,
  title={BioBERT: a pre-trained biomedical language representation model for biomedical text mining},
  author={Lee, Jinhyuk and Yoon, Wonjin and Kim, Sungdong and Kim, Donghyeon and Kim, Sunkyu and So, Chan Ho and Kang, Jaewoo},
  journal={Bioinformatics},
  volume={36},
  number={4},
  pages={1234--1240},
  year={2020},
  publisher={Oxford University Press}
}
```

## Contact

- **Author:** {YOUR_NAME}
- **Email:** {YOUR_EMAIL}
- **GitHub:** Your GitHub Profile
- **Project repository:** [Link to your project repo]


## License

This model is released under the MIT License. See LICENSE for details.


