Commit f01b848 (parent: 191a662): Update README.md

README.md (after this commit):

## Model description

**EpiExtract4GARD** is a fine-tuned [BioBERT-base-cased](https://huggingface.co/dmis-lab/biobert-base-cased-v1.1) model, ready to use for **Named Entity Recognition** of locations (LOC), epidemiologic types (EPI), and epidemiologic rates (STAT). It was fine-tuned on [EpiSet4NER](https://huggingface.co/datasets/ncats/EpiSet4NER) to extract epidemiologic information from rare disease abstracts. See the dataset documentation for details on the weakly supervised labeling methods and on dataset biases and limitations, and see [EpiExtract4GARD on GitHub](https://github.com/ncats/epi4GARD/tree/master/EpiExtract4GARD#epiextract4gard) for details on the entire pipeline.

## Intended uses & limitations

#### How to use

You can use this model with the Transformers *pipeline* for NER:

~~~
from transformers import pipeline, AutoModelForTokenClassification, AutoTokenizer

model = AutoModelForTokenClassification.from_pretrained("ncats/EpiExtract4GARD")
tokenizer = AutoTokenizer.from_pretrained("ncats/EpiExtract4GARD")

# Aggregate subword predictions into whole-entity outputs
NER_pipeline = pipeline('ner', model=model, tokenizer=tokenizer, aggregation_strategy='simple')

sample = "..."  # any rare disease abstract text
sample2 = "Early Diagnosis of Classic Homocystinuria in Kuwait through Newborn S..."

NER_pipeline(sample)
NER_pipeline(sample2)
~~~
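With `aggregation_strategy='simple'`, the pipeline returns one dict per detected entity, with the standard Transformers keys `entity_group`, `score`, `word`, `start`, and `end`. A small post-processing sketch on a hypothetical output (the entities and scores below are made up for illustration):

```python
import pandas as pd

# Hypothetical pipeline output for illustration only; real results come
# from NER_pipeline(sample) above.
results = [
    {"entity_group": "LOC", "score": 0.998, "word": "Kuwait", "start": 52, "end": 58},
    {"entity_group": "EPI", "score": 0.995, "word": "incidence", "start": 70, "end": 79},
    {"entity_group": "STAT", "score": 0.851, "word": "1 : 1, 800", "start": 83, "end": 92},
]

# Tabulate the entities and keep only confident predictions
df = pd.DataFrame(results)
confident = df[df["score"] > 0.9]
print(confident[["entity_group", "word"]])
```

The same pattern works on real pipeline output, since the keys are fixed by the Transformers token-classification pipeline.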

Or you can use this model with the entire EpiExtract4GARD pipeline: download [*classify_abs.py*](https://github.com/ncats/epi4GARD/blob/master/EpiExtract4GARD/classify_abs.py), [*extract_abs.py*](https://github.com/ncats/epi4GARD/blob/master/EpiExtract4GARD/extract_abs.py), and [*gard-id-name-synonyms.json*](https://github.com/ncats/epi4GARD/blob/master/EpiExtract4GARD/gard-id-name-synonyms.json) from GitHub, then test with this [*additional* code](https://github.com/ncats/epi4GARD/blob/master/EpiExtract4GARD/Case%20Study.ipynb):

~~~
import pandas as pd
# ... (setup using classify_abs.py and extract_abs.py; lines elided here)
e = search('Homocystinuria')
e
~~~

#### Limitations and bias

## Training data

It was trained on [EpiSet4NER](https://huggingface.co/datasets/ncats/EpiSet4NER). See the dataset documentation for details on the weakly supervised labeling methods and on dataset biases and limitations. The training dataset distinguishes between the beginning and the continuation of an entity, so that if there are back-to-back entities of the same type, the model can output where the second entity begins. As in the dataset, each token is classified as one of the following classes:

Abbreviation|Description
---------|--------------
O | Outside of a named entity
B-LOC | Beginning of a location
I-LOC | Inside of a location
B-EPI | Beginning of an epidemiologic type (e.g. "incidence", "prevalence", "occurrence")
I-EPI | Epidemiologic type that is not the beginning token
B-STAT | Beginning of an epidemiologic rate
I-STAT | Inside of an epidemiologic rate

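The B-/I- scheme above can be decoded into entity spans mechanically. A minimal sketch (a hypothetical helper, not part of this repository; the pipeline's `aggregation_strategy='simple'` performs equivalent grouping for you):

```python
# Minimal BIO decoder: collapse token-level tags into (entity_type, text)
# spans. Illustrative helper only; tokens/tags below are toy data.
def decode_bio(tokens, tags):
    spans, current = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                spans.append(current)
            current = (tag[2:], [token])  # open a new span
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[1].append(token)      # continue the open span
        else:  # "O", or an I- tag that does not continue the open span
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return [(label, " ".join(words)) for label, words in spans]

tokens = ["Homocystinuria", "prevalence", "in", "Qatar", "is", "1", "in", "1,800"]
tags   = ["O", "B-EPI", "O", "B-LOC", "O", "B-STAT", "I-STAT", "I-STAT"]
print(decode_bio(tokens, tags))
# → [('EPI', 'prevalence'), ('LOC', 'Qatar'), ('STAT', '1 in 1,800')]
```

Note how the B-STAT/I-STAT distinction lets the multi-token rate "1 in 1,800" come back as a single entity.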

### EpiSet Statistics

Beyond any limitations due to the EpiSet4NER dataset, this model is limited in numeracy due to BERT-based models' use of subword embeddings, which is crucial for epidemiologic rate identification and limits the entity-level results. Additionally, more recent weakly supervised learning techniques could be used to improve the performance of the model without improving the underlying dataset.
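The subword issue is easy to observe directly: a WordPiece tokenizer shatters a numeric rate into several fragments, so no single embedding represents the number. A sketch (exact splits depend on the vocabulary):

```python
from transformers import AutoTokenizer

# BioBERT's WordPiece vocabulary was not built around numbers, so an
# epidemiologic rate is typically split into multiple subword pieces.
tokenizer = AutoTokenizer.from_pretrained("dmis-lab/biobert-base-cased-v1.1")
pieces = tokenizer.tokenize("prevalence of 1.9 per 100,000 live births")
print(pieces)
# The digits come back as several fragments, so the model must stitch
# the rate together from separate subword embeddings.
```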

## Training procedure

This model was trained on an [AWS EC2 p3.2xlarge](https://aws.amazon.com/ec2/instance-types/) instance with a single Tesla V100 GPU, using these hyperparameters: 4 epochs of training (AdamW, weight decay = 0.05) with a batch size of 16 and a maximum sequence length of 192; the model was fed one sentence at a time. Full config [here](https://wandb.ai/wzkariampuzha/huggingface/runs/353prhts/files/config.yaml).

## Hold-out validation results

Metric | Entity-level result
-|-
F1 | 83.8
Precision | 83.2
Recall | 84.5
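As a sanity check, the entity-level F1 above is (to rounding) the harmonic mean of the reported precision and recall:

```python
# F1 is the harmonic mean of precision and recall; the hold-out numbers
# reported above are self-consistent to one decimal place.
precision, recall = 83.2, 84.5
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 1))
# → 83.8
```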

## Test results

| Dataset for Model Training | Evaluation Level | Entity | Precision | Recall | F1 |
|:--------------------------:|:----------------:|:------------------:|:---------:|:------:|:-----:|
| EpiSet | Entity-Level | Overall | 0.556 | 0.662 | 0.605 |
| | | Location | 0.661 | 0.696 | 0.678 |
| | | Epidemiologic Type | 0.854 | 0.911 | 0.882 |
| | | Epidemiologic Rate | 0.143 | 0.218 | 0.173 |
| | Token-Level | Overall | 0.811 | 0.713 | 0.759 |
| | | Location | 0.949 | 0.742 | 0.833 |
| | | Epidemiologic Type | 0.900 | 0.917 | 0.908 |
| | | Epidemiologic Rate | 0.724 | 0.636 | 0.677 |