---
library_name: transformers
license: mit
---

# distilbert-medication-ner

This model is a fine-tuned version of [distilbert-base-cased](https://huggingface.co/distilbert/distilbert-base-cased), trained on synthetic medication data generated with [Synthea](https://github.com/synthetichealth/synthea).

More details on how this model was trained can be found on [GitHub](https://github.com/JackLeeJM/slm-medication-ner).
|
| | ## Model Description |
| |
|
| | A fine-tuned NER model developed to handle 5 specific entities (i.e. DRUG, DOSAGE, ROUTE, BRAND, QUANTITY) when processing medication strings such as: |
| | - Ibuprofen 100 MG Oral Tablet |
| | - 1 ML medroxyprogesterone acetate 150 MG/ML Injection |
| | - Acetaminophen 325 MG / Oxycodone Hydrochloride 10 MG Oral Tablet [Percocet] |
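For intuition, the first example above would be tagged roughly as follows under a BIO-style labelling scheme. This is only a sketch: the exact token split and label names are assumptions for illustration, not read from the model's config.

```python
# Illustrative BIO-style tagging of "Ibuprofen 100 MG Oral Tablet".
# Token split and label names are assumed, not taken from the model.
tokens = ["Ibuprofen", "100", "MG", "Oral", "Tablet"]
labels = ["B-DRUG", "B-DOSAGE", "I-DOSAGE", "B-ROUTE", "I-ROUTE"]

for token, label in zip(tokens, labels):
    print(f"{token:<12} {label}")
```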

The model was trained and evaluated on small, manually annotated datasets (train_n_samples=309, eval_n_samples=335) and achieved the following evaluation metrics:
- **Precision**: 0.998
- **Recall**: 0.983
- **F1**: 0.991
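These follow the usual definitions of precision, recall, and F1 over predicted entities. As a quick reminder of how they are computed, with hypothetical counts (not the actual counts behind the numbers above):

```python
# Precision/recall/F1 from hypothetical counts. The values of
# tp/fp/fn are made up for illustration; they are NOT the counts
# behind the evaluation metrics reported above.
tp, fp, fn = 180, 2, 4  # true positives, false positives, false negatives

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```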

## Usage

1. Load the model and tokenizer:
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_name = "jackleejm/distilbert-medication-ner"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)
```
2. Set up a pipeline and run inference:
```python
from transformers import pipeline

ner_pipeline = pipeline(
    task="token-classification",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple",
    device_map="auto",
)

# Use `inputs` rather than `input` to avoid shadowing the builtin.
inputs = ["Acetaminophen 325 MG Oral Tablet"]
results = ner_pipeline(inputs)

print(results)

# Example output (one list of entity dicts per input string;
# scores are numpy float32 values):
# [
#   [
#     {"word": "Acetaminophen", "score": 0.99948627, "entity_group": "DRUG", "start": 0, "end": 13},
#     {"word": "325 MG", "score": 0.99882394, "entity_group": "DOSAGE", "start": 14, "end": 20},
#     {"word": "Oral Tablet", "score": 0.9994621, "entity_group": "ROUTE", "start": 21, "end": 32},
#   ]
# ]
```
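For downstream use, the per-entity dicts can be folded into one structured record per input string. A minimal sketch using the sample output above (the `to_record` helper is illustrative, not part of the model's API; scores truncated):

```python
# Collapse token-classification pipeline output (one list of entity
# dicts per input) into {entity_group: [words]} records.
sample_results = [
    [
        {"word": "Acetaminophen", "score": 0.9995, "entity_group": "DRUG", "start": 0, "end": 13},
        {"word": "325 MG", "score": 0.9988, "entity_group": "DOSAGE", "start": 14, "end": 20},
        {"word": "Oral Tablet", "score": 0.9995, "entity_group": "ROUTE", "start": 21, "end": 32},
    ]
]

def to_record(entities):
    record = {}
    for ent in entities:
        record.setdefault(ent["entity_group"], []).append(ent["word"])
    return record

records = [to_record(entities) for entities in sample_results]
print(records)
# -> [{'DRUG': ['Acetaminophen'], 'DOSAGE': ['325 MG'], 'ROUTE': ['Oral Tablet']}]
```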

### Training Procedure

#### Training Hyperparameters

- learning_rate: 2e-5
- per_device_train_batch_size: 16
- per_device_eval_batch_size: 16
- num_train_epochs: 20
- weight_decay: 0.01
- eval_strategy: "steps"
- eval_steps: 50
- load_best_model_at_end: True
- metric_for_best_model: "f1"
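The hyperparameters above map onto a transformers `TrainingArguments` object roughly as follows. This is a sketch, not the actual training script (see the GitHub repo for that); `output_dir` and any settings not listed above are placeholders.

```python
from transformers import TrainingArguments

# Sketch of how the listed hyperparameters translate to TrainingArguments.
# output_dir is a placeholder, not taken from the actual training script.
# Note: `evaluation_strategy` was renamed to `eval_strategy` in recent
# transformers releases (the card lists Transformers 4.49.0).
training_args = TrainingArguments(
    output_dir="distilbert-medication-ner",  # placeholder
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=20,
    weight_decay=0.01,
    eval_strategy="steps",
    eval_steps=50,
    load_best_model_at_end=True,
    metric_for_best_model="f1",
)
```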

## Framework versions

- Transformers 4.49.0
- PyTorch 2.6.0
- Datasets 3.3.2
- Tokenizers 0.21.0