Commit f01b848 (parent: 191a662): Update README.md

README.md (after this commit):

## Model description

**EpiExtract4GARD** is a fine-tuned [BioBERT-base-cased](https://huggingface.co/dmis-lab/biobert-base-cased-v1.1) model, ready to use for **Named Entity Recognition** of locations (LOC), epidemiologic types (EPI), and epidemiologic rates (STAT). It was fine-tuned on [EpiSet4NER](https://huggingface.co/datasets/ncats/EpiSet4NER) to extract epidemiologic information from rare disease abstracts. See the dataset documentation for details on the weakly supervised labeling methods and on dataset biases and limitations, and see [EpiExtract4GARD on GitHub](https://github.com/ncats/epi4GARD/tree/master/EpiExtract4GARD#epiextract4gard) for details on the entire pipeline.

## Intended uses & limitations

#### How to use

You can use this model with the Transformers *pipeline* for NER:

~~~
from transformers import pipeline, AutoModelForTokenClassification, AutoTokenizer

model = AutoModelForTokenClassification.from_pretrained("ncats/EpiExtract4GARD")
tokenizer = AutoTokenizer.from_pretrained("ncats/EpiExtract4GARD")

# Aggregate subword predictions into whole-entity outputs
NER_pipeline = pipeline('ner', model=model, tokenizer=tokenizer, aggregation_strategy='simple')

sample = "..."  # any rare disease abstract text
sample2 = "Early Diagnosis of Classic Homocystinuria in Kuwait through Newborn S..."

NER_pipeline(sample)
NER_pipeline(sample2)
~~~
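With `aggregation_strategy='simple'`, the pipeline returns one dict per detected entity, with the standard Transformers keys `entity_group`, `score`, `word`, `start`, and `end`. A small post-processing sketch on a hypothetical output (the entities and scores below are made up for illustration):

```python
import pandas as pd

# Hypothetical pipeline output for illustration only; real results come
# from NER_pipeline(sample) above.
results = [
    {"entity_group": "LOC", "score": 0.998, "word": "Kuwait", "start": 52, "end": 58},
    {"entity_group": "EPI", "score": 0.995, "word": "incidence", "start": 70, "end": 79},
    {"entity_group": "STAT", "score": 0.851, "word": "1 : 1, 800", "start": 83, "end": 92},
]

# Tabulate the entities and keep only confident predictions
df = pd.DataFrame(results)
confident = df[df["score"] > 0.9]
print(confident[["entity_group", "word"]])
```

The same pattern works on real pipeline output, since the keys are fixed by the Transformers token-classification pipeline.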

Or you can use this model with the entire EpiExtract4GARD pipeline: download [*classify_abs.py*](https://github.com/ncats/epi4GARD/blob/master/EpiExtract4GARD/classify_abs.py), [*extract_abs.py*](https://github.com/ncats/epi4GARD/blob/master/EpiExtract4GARD/extract_abs.py), and [*gard-id-name-synonyms.json*](https://github.com/ncats/epi4GARD/blob/master/EpiExtract4GARD/gard-id-name-synonyms.json) from GitHub, then test with this [*additional* code](https://github.com/ncats/epi4GARD/blob/master/EpiExtract4GARD/Case%20Study.ipynb):

~~~
import pandas as pd
# ... (setup using classify_abs.py and extract_abs.py; lines elided here)
e = search('Homocystinuria')
e
~~~

#### Limitations and bias

## Training data

It was trained on [EpiSet4NER](https://huggingface.co/datasets/ncats/EpiSet4NER). See the dataset documentation for details on the weakly supervised labeling methods and on dataset biases and limitations. The training dataset distinguishes between the beginning and the continuation of an entity, so that if there are back-to-back entities of the same type, the model can output where the second entity begins. As in the dataset, each token is classified as one of the following classes:

Abbreviation|Description
---------|--------------
O | Outside of a named entity
B-LOC | Beginning of a location
I-LOC | Inside of a location
B-EPI | Beginning of an epidemiologic type (e.g. "incidence", "prevalence", "occurrence")
I-EPI | Epidemiologic type that is not the beginning token
B-STAT | Beginning of an epidemiologic rate
I-STAT | Inside of an epidemiologic rate

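The B-/I- scheme above can be decoded into entity spans mechanically. A minimal sketch (a hypothetical helper, not part of this repository; the pipeline's `aggregation_strategy='simple'` performs equivalent grouping for you):

```python
# Minimal BIO decoder: collapse token-level tags into (entity_type, text)
# spans. Illustrative helper only; tokens/tags below are toy data.
def decode_bio(tokens, tags):
    spans, current = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                spans.append(current)
            current = (tag[2:], [token])  # open a new span
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[1].append(token)      # continue the open span
        else:  # "O", or an I- tag that does not continue the open span
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return [(label, " ".join(words)) for label, words in spans]

tokens = ["Homocystinuria", "prevalence", "in", "Qatar", "is", "1", "in", "1,800"]
tags   = ["O", "B-EPI", "O", "B-LOC", "O", "B-STAT", "I-STAT", "I-STAT"]
print(decode_bio(tokens, tags))
# → [('EPI', 'prevalence'), ('LOC', 'Qatar'), ('STAT', '1 in 1,800')]
```

Note how the B-STAT/I-STAT distinction lets the multi-token rate "1 in 1,800" come back as a single entity.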

### EpiSet Statistics

Beyond any limitations due to the EpiSet4NER dataset, this model is limited in numeracy due to BERT-based models' use of subword embeddings, which is crucial for epidemiologic rate identification and limits the entity-level results. Additionally, more recent weakly supervised learning techniques could be used to improve the performance of the model without improving the underlying dataset.
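The subword issue is easy to observe directly: a WordPiece tokenizer shatters a numeric rate into several fragments, so no single embedding represents the number. A sketch (exact splits depend on the vocabulary):

```python
from transformers import AutoTokenizer

# BioBERT's WordPiece vocabulary was not built around numbers, so an
# epidemiologic rate is typically split into multiple subword pieces.
tokenizer = AutoTokenizer.from_pretrained("dmis-lab/biobert-base-cased-v1.1")
pieces = tokenizer.tokenize("prevalence of 1.9 per 100,000 live births")
print(pieces)
# The digits come back as several fragments, so the model must stitch
# the rate together from separate subword embeddings.
```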

## Training procedure

This model was trained on an [AWS EC2 p3.2xlarge](https://aws.amazon.com/ec2/instance-types/) instance with a single Tesla V100 GPU, using these hyperparameters: 4 epochs of training (AdamW, weight decay = 0.05) with a batch size of 16 and a maximum sequence length of 192; the model was fed one sentence at a time. Full config [here](https://wandb.ai/wzkariampuzha/huggingface/runs/353prhts/files/config.yaml).

## Hold-out validation results

Metric | Entity-level result
-|-
F1 | 83.8
Precision | 83.2
Recall | 84.5
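As a sanity check, the entity-level F1 above is (to rounding) the harmonic mean of the reported precision and recall:

```python
# F1 is the harmonic mean of precision and recall; the hold-out numbers
# reported above are self-consistent to one decimal place.
precision, recall = 83.2, 84.5
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 1))
# → 83.8
```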

## Test results

| Dataset for Model Training | Evaluation Level | Entity | Precision | Recall | F1 |
|:--------------------------:|:----------------:|:------------------:|:---------:|:------:|:-----:|
| EpiSet | Entity-Level | Overall | 0.556 | 0.662 | 0.605 |
| | | Location | 0.661 | 0.696 | 0.678 |
| | | Epidemiologic Type | 0.854 | 0.911 | 0.882 |
| | | Epidemiologic Rate | 0.143 | 0.218 | 0.173 |
| | Token-Level | Overall | 0.811 | 0.713 | 0.759 |
| | | Location | 0.949 | 0.742 | 0.833 |
| | | Epidemiologic Type | 0.900 | 0.917 | 0.908 |
| | | Epidemiologic Rate | 0.724 | 0.636 | 0.677 |