viop1504
/

medjar-ner-model

+---
+language:
+- en
+license: mit
+pipeline_tag: token-classification
+task_categories:
+- token-classification
+tags:
+- medical
+- biomedical
+- ner
+- named-entity-recognition
+- biobert
+- jargon-detection
+datasets:
+- tner/bc5cdr
+base_model: dmis-lab/biobert-v1.1
+metrics:
+- f1
+- precision
+- recall
+model-index:
+- name: BioBERT-BC5CDR-NER
+  results:
+  - task:
+      type: token-classification
+      name: Named Entity Recognition
+    dataset:
+      name: BC5CDR
+      type: tner/bc5cdr
+    metrics:
+    - type: f1
+      value: 0.88
+      name: F1 Score
+    - type: precision
+      value: 0.88
+    - type: recall
+      value: 0.89
+---
+# Medical Named Entity Recognition (NER) Model
+## Model Description
+This model is a fine-tuned version of [dmis-lab/biobert-v1.1](https://huggingface.co/dmis-lab/biobert-v1.1) on the BC5CDR dataset for medical named entity recognition.
+**What it does:** Identifies medical terminology in text, specifically:
+- **Chemical entities**: Drug names, chemical compounds (e.g., aspirin, metformin)
+- **Disease entities**: Medical conditions, diseases (e.g., hypertension, diabetes)
+
+**Intended use:** Assist in reading medical literature by highlighting technical terminology (chemical and disease mentions) so that it can be looked up or explained.
+## Training Data
+- **Dataset**: [BC5CDR](https://huggingface.co/datasets/tner/bc5cdr) (BioCreative V Chemical Disease Relation)
+- **Training samples**: 5,228 sentences
+- **Validation samples**: 5,330 sentences
+- **Test samples**: 5,865 sentences
+- **Labels**: 5 IOB2 tags (O, B-Chemical, I-Chemical, B-Disease, I-Disease) covering 2 entity types (Chemical, Disease)
+## Model Performance
+Evaluated on the BC5CDR test set:
+| Metric    | Score |
+|-----------|-------|
+| F1 Score  | 0.918555 |
+| Precision | 0.905610 |
+| Recall    | 0.931875 |
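
As a quick consistency check, F1 is the harmonic mean of precision and recall, so the F1 row can be recomputed from the other two rows. The snippet below is a minimal sketch of that arithmetic using the values from the table above; the variable names are illustrative.

```python
# Values copied from the table above.
precision = 0.905610
recall = 0.931875

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

# Should agree with the reported F1 up to rounding.
print(round(f1, 6))
```
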
+## Usage
+### Basic Usage
+```python
+from transformers import pipeline
+# Load the model
+ner = pipeline(
+    "token-classification",
+    model="{repo_id}",
+    aggregation_strategy="simple"
+)
+# Analyze medical text
+text = "Patient diagnosed with hypertension and prescribed metformin."
+results = ner(text)
+# Print results
+for entity in results:
+    print(f"{{entity['word']}}: {{entity['entity_group']}} ({{entity['score']:.2f}})")
+```
+**Output:**
+```
+hypertension: Disease (0.99)
+metformin: Chemical (0.99)
+```
+### Advanced Usage
+```python
+from transformers import AutoTokenizer, AutoModelForTokenClassification
+import torch
+# Load model and tokenizer
+tokenizer = AutoTokenizer.from_pretrained("{repo_id}")
+model = AutoModelForTokenClassification.from_pretrained("{repo_id}")
+# Tokenize input
+text = "Patient has diabetes and takes aspirin."
+inputs = tokenizer(text, return_tensors="pt")
+# Get predictions
+with torch.no_grad():
+    outputs = model(**inputs)
+    predictions = torch.argmax(outputs.logits, dim=-1)
+# Decode predictions (one label per WordPiece token, special tokens included)
+tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
+labels = [model.config.id2label[p.item()] for p in predictions[0]]
+for token, label in zip(tokens, labels):
+    if label != "O":
+        print(f"{{token}}: {{label}}")
+```
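
The loop above prints one line per WordPiece token, so a word that the tokenizer splits appears as several pieces (continuation pieces start with `##`). The following is a minimal sketch of merging pieces back into words while keeping the label predicted for the first piece. The helper name `merge_wordpieces` and the example tokens are hypothetical; the actual split depends on the BioBERT vocabulary.

```python
SPECIAL_TOKENS = ("[CLS]", "[SEP]", "[PAD]")

def merge_wordpieces(tokens, labels):
    # Returns a list of (word, label) pairs, one per original word.
    words = []
    for token, label in zip(tokens, labels):
        if token in SPECIAL_TOKENS:
            continue  # special tokens are not part of the text
        if token.startswith("##") and words:
            # Continuation piece: append to the previous word, keep its label.
            prev_word, prev_label = words[-1]
            words[-1] = (prev_word + token[2:], prev_label)
        else:
            words.append((token, label))
    return words

# Illustrative input only; real tokenization may split differently.
tokens = ["[CLS]", "takes", "met", "##form", "##in", "[SEP]"]
labels = ["O", "O", "B-Chemical", "I-Chemical", "I-Chemical", "O"]
merged = merge_wordpieces(tokens, labels)
print(merged)
```

Keeping the first piece's label is one of several possible conventions; taking a majority vote over the pieces is another.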
+## Label Schema
+The model uses the IOB2 tagging scheme:
+| Label | Description |
+|-------|-------------|
+| `O` | Outside any entity |
+| `B-Chemical` | Beginning of a chemical/drug entity |
+| `I-Chemical` | Inside a chemical/drug entity (continuation) |
+| `B-Disease` | Beginning of a disease entity |
+| `I-Disease` | Inside a disease entity (continuation) |
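
To turn a sequence of IOB2 tags into entity spans, a `B-` tag opens a span and subsequent `I-` tags of the same type extend it. A minimal sketch under that reading of the schema follows. `iob2_to_spans` is a hypothetical helper, not part of this repository, and treating a stray `I-` tag as the start of a new span is an assumption (a lenient repair; stricter decoders may discard it).

```python
def iob2_to_spans(tags):
    # Returns (entity_type, start, end) tuples with an exclusive end index.
    spans = []
    current_type, start = None, None
    for i, tag in enumerate(tags):
        prefix, _, entity_type = tag.partition("-")
        if prefix == "I" and entity_type == current_type:
            continue  # the open span keeps growing
        # Any other tag closes the open span, if there is one.
        if current_type is not None:
            spans.append((current_type, start, i))
            current_type, start = None, None
        # B- opens a span; a stray I- is treated like B- (assumption).
        if prefix in ("B", "I"):
            current_type, start = entity_type, i
    if current_type is not None:
        spans.append((current_type, start, len(tags)))
    return spans

tags = ["O", "B-Disease", "I-Disease", "O", "B-Chemical"]
spans = iob2_to_spans(tags)
print(spans)
```
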
+## Training Details
+### Training Hyperparameters
+- **Base model**: dmis-lab/biobert-v1.1
+- **Training regime**: Fine-tuning
+- **Optimizer**: AdamW
+- **Learning rate**: 5e-5
+- **Batch size**: 16 (per device)
+- **Number of epochs**: 3
+- **Weight decay**: 0.01
+- **Learning rate scheduler**: Linear warmup
+- **Mixed precision**: FP16
+### Training Environment
+- **Framework**: PyTorch with Transformers library
+- **Hardware**: NVIDIA T4 GPU (Google Colab)
+- **Training time**: ~30 minutes
+### Data Preprocessing
+1. Tokenization using BioBERT WordPiece tokenizer
+2. Maximum sequence length: 128 tokens
+3. Label alignment for subword tokens
+4. Special tokens: [CLS], [SEP]
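
Step 3 is not spelled out in this card. One common convention, assumed here, is to keep the word-level label on the first subword of each word and assign `-100` to special tokens and to the remaining subwords, since `-100` is the default `ignore_index` of PyTorch's cross-entropy loss. The sketch below works on the list returned by a fast tokenizer's `word_ids()`; the helper name `align_labels`, the label ids, and the sample values are illustrative placeholders.

```python
IGNORE_INDEX = -100  # skipped by torch.nn.CrossEntropyLoss by default

def align_labels(word_ids, word_labels):
    # word_ids: one entry per token; None marks special tokens.
    # word_labels: one integer label id per original word.
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(IGNORE_INDEX)  # [CLS], [SEP], padding
        elif word_id != previous:
            aligned.append(word_labels[word_id])  # first subword of a word
        else:
            aligned.append(IGNORE_INDEX)  # later subwords of the same word
        previous = word_id
    return aligned

# Hypothetical example: two words, the second split into three subwords.
word_ids = [None, 0, 1, 1, 1, None]
word_labels = [0, 1]
aligned = align_labels(word_ids, word_labels)
print(aligned)
```
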
+## Limitations and Bias
+### Limitations
+- **Domain-specific**: Trained on biomedical literature; may not perform well on clinical notes or patient records
+- **Entity types**: Only detects chemicals and diseases; does not identify procedures, anatomical terms, or symptoms
+- **Language**: English only
+- **Abbreviations**: May struggle with uncommon medical abbreviations
+- **Context**: May mislabel ambiguous terms (e.g., "cold" as temperature vs. illness)
+### Potential Biases
+- Training data (BC5CDR) comes from scientific publications, which may have different terminology than patient-facing materials
+- The training data contains more chemical mentions than disease mentions, which may skew performance between the two entity types
+- Terminology that does not appear in the training corpus, including newer medical terms, may not be recognized
+## Ethical Considerations
+- **Not for medical diagnosis**: This model is for educational/assistive purposes only
+- **Human oversight required**: Always verify medical information with qualified healthcare professionals
+- **Privacy**: Do not input personally identifiable information (PII) or protected health information (PHI)
+## Citation
+If you use this model, please cite:
+```bibtex
+@misc{{{repo_id.replace('/', '-')},
+  author = {{{YOUR_NAME}}},
+  title = {{Medical Named Entity Recognition with BioBERT}},
+  year = {{2024}},
+  publisher = {{Hugging Face}},
+  url = {{https://huggingface.co/{repo_id}}}
+}}
+```
+Also cite the original BC5CDR dataset:
+```bibtex
+@article{{wei2016assessing,
+  title={{Assessing the state of the art in biomedical relation extraction: overview of the BioCreative V chemical-disease relation (CDR) task}},
+  author={{Wei, Chih-Hsuan and Peng, Yifan and Leaman, Robert and Davis, Allan Peter and Mattingly, Carolyn J and Li, Jiao and Wiegers, Thomas C and Lu, Zhiyong}},
+  journal={{Database}},
+  volume={{2016}},
+  year={{2016}},
+  publisher={{Oxford Academic}}
+}}
+```
+And the BioBERT model:
+```bibtex
+@article{{lee2020biobert,
+  title={{BioBERT: a pre-trained biomedical language representation model for biomedical text mining}},
+  author={{Lee, Jinhyuk and Yoon, Wonjin and Kim, Sungdong and Kim, Donghyeon and Kim, Sunkyu and So, Chan Ho and Kang, Jaewoo}},
+  journal={{Bioinformatics}},
+  volume={{36}},
+  number={{4}},
+  pages={{1234--1240}},
+  year={{2020}},
+  publisher={{Oxford University Press}}
+}}
+```
+## Contact
+- **Author**: {YOUR_NAME}
+- **Email**: {YOUR_EMAIL}
+- **GitHub**: [Your GitHub Profile](https://github.com/your-username)
+- **Project Repository**: [Link to your project repo]
+## Acknowledgments
+- Base model: [dmis-lab/biobert-v1.1](https://huggingface.co/dmis-lab/biobert-v1.1)
+- Dataset: [BC5CDR](https://biocreative.bioinformatics.udel.edu/)
+- Built with [Hugging Face Transformers](https://huggingface.co/docs/transformers)
+## License
+This model is released under the MIT License. See [LICENSE](LICENSE) for details.
+---
+*Model card last updated: {__import__('datetime').datetime.now().strftime('%Y-%m-%d')}*
+"""
+# Save to file
+model_path = "./biobert-ner-final"
+readme_path = f"{model_path}/README.md"
+with open(readme_path, "w", encoding="utf-8") as f:
+    f.write(model_card)
+print("✓ Model card created!")
+print(f"Saved to: {readme_path}")
+# Upload to HuggingFace
+api = HfApi()
+api.upload_file(
+    path_or_fileobj=readme_path,
+    path_in_repo="README.md",
+    repo_id=repo_id,
+    repo_type="model",
+)
+print(f"✓ Uploaded to: https://huggingface.co/{repo_id}")