wzkariampuzha committed
Commit f01b848 · 1 Parent(s): 191a662

Update README.md

Files changed (1):
  1. README.md +35 -35

README.md CHANGED
@@ -1,6 +1,10 @@
- This [BioBERT-base-cased](https://huggingface.co/dmis-lab/biobert-base-cased-v1.1) model was fine-tuned for epidemiological information from rare disease abstracts. It was trained on [EpiSet4NER](https://huggingface.co/datasets/ncats/EpiSet4NER). See dataset documentation for details on the weakly supervised teaching methods and dataset biases and limitations. See [EpiExtract4GARD on GitHub](https://github.com/ncats/epi4GARD/tree/master/EpiExtract4GARD#epiextract4gard) for details on the entire pipeline.
 
- Use this code to test
 
 ~~~
 from transformers import pipeline, AutoModelForTokenClassification, AutoTokenizer
 model = AutoModelForTokenClassification.from_pretrained("ncats/EpiExtract4GARD")
@@ -14,9 +18,8 @@ sample2 = "Early Diagnosis of Classic Homocystinuria in Kuwait through Newborn S
 
 NER_pipeline(sample)
 NER_pipeline(sample2)
-
 ~~~
- If you download *classify_abs.py*, *extract_abs.py*, and *gard-id-name-synonyms.json* from [GitHub](https://github.com/ncats/epi4GARD/tree/master/EpiExtract4GARD#epiextract4gard) then you can test with this [*additional* code](https://github.com/ncats/epi4GARD/blob/master/EpiExtract4GARD/Case%20Study.ipynb):
 
 ~~~
 import pandas as pd
@@ -49,46 +52,43 @@ e = search('Homocystinuria')
 e
 ~~~
 
-
- ## Model description
- **EpiExtract4GARD** is a fine-tuned BioBERT model that is ready to use for **Named Entity Recognition** of locations (LOC), epidemiologic types (EPI), and epidemiologic rates (STAT). Specifically, this model is a *biobert-base-cased* model that was fine-tuned on a custom built epidemiologic dataset.
-
- ## Intended uses & limitations
- #### How to use
- You can use this model with Transformers *pipeline* for NER.
-
 #### Limitations and bias
- It was trained on [EpiSet4NER](https://huggingface.co/datasets/ncats/EpiSet4NER). See dataset documentation for details on the weakly supervised teaching methods and dataset biases and limitations.
- This is also limited in numeracy due to BERT-based model's use of subword embeddings. This is crucial for l
-
 ## Training data
- This model was fine-tuned on English version of the standard [CoNLL-2003 Named Entity Recognition](https://www.aclweb.org/anthology/W03-0419.pdf) dataset.
- The training dataset distinguishes between the beginning and continuation of an entity so that if there are back-to-back entities of the same type, the model can output where the second entity begins. As in the dataset, each token will be classified as one of the following classes:
 Abbreviation|Description
- -|-
- O|Outside of a named entity
 B-LOC | Beginning of a location
 I-LOC | Inside of a location
- B-EPI | Beginning of an epidemiologic identifier (e.g. "incidence", "prevalence", "occurrence")
- I-EPI | Epidemiologic identifier that is not the beginning token.
 B-STAT | Beginning of an epidemiologic rate
 I-STAT | Inside of an epidemiologic rate
 
- # PENDING DOCUMENTATION:
-
 ### EpiSet Statistics
- This dataset was derived from 620 epidemiological abstracts on rare diseases. The training and validation sets were programmatically labeled using spaCy and rules implemented [here](https://github.com/ncats/epi4GARD/blob/master/EpiExtract4GARD/create_labeled_dataset_V2.ipynb) and then [here](https://github.com/ncats/epi4GARD/blob/master/EpiExtract4GARD/modify_existing_labels.ipynb).
 
- UPDATE
- #### # of training examples per entity type
 
 ## Training procedure
- This model was trained on a single NVIDIA V100 GPU with recommended hyperparameters from the [original BERT paper](https://arxiv.org/pdf/1810.04805) which trained & evaluated the model on CoNLL-2003 NER task.
-
- ## Eval results
- metric|dev|test
- -|-|-
- f1 |95.1 |91.3
- precision |95.0 |90.7
- recall |95.3 |91.9
- The test metrics are a little lower than the official Google BERT results which encoded document context & experimented with CRF. More on replicating the original results [here](https://github.com/google-research/bert/issues/223).
 
+ ## Model description
+ **EpiExtract4GARD** is a fine-tuned [BioBERT-base-cased](https://huggingface.co/dmis-lab/biobert-base-cased-v1.1) model that is ready to use for **Named Entity Recognition** of locations (LOC), epidemiologic types (EPI), and epidemiologic rates (STAT). This model was fine-tuned on [EpiSet4NER](https://huggingface.co/datasets/ncats/EpiSet4NER) for epidemiological information from rare disease abstracts. See dataset documentation for details on the weakly supervised teaching methods and dataset biases and limitations. See [EpiExtract4GARD on GitHub](https://github.com/ncats/epi4GARD/tree/master/EpiExtract4GARD#epiextract4gard) for details on the entire pipeline.
 
+ ## Intended uses & limitations
+ #### How to use
+ You can use this model with the Transformers *pipeline* for NER; see the code below:
 ~~~
 from transformers import pipeline, AutoModelForTokenClassification, AutoTokenizer
 model = AutoModelForTokenClassification.from_pretrained("ncats/EpiExtract4GARD")
 
 
 NER_pipeline(sample)
 NER_pipeline(sample2)
 
 ~~~
+ Or you can use this model with the entire EpiExtract4GARD pipeline if you download [*classify_abs.py*](https://github.com/ncats/epi4GARD/blob/master/EpiExtract4GARD/classify_abs.py), [*extract_abs.py*](https://github.com/ncats/epi4GARD/blob/master/EpiExtract4GARD/extract_abs.py), and [*gard-id-name-synonyms.json*](https://github.com/ncats/epi4GARD/blob/master/EpiExtract4GARD/gard-id-name-synonyms.json) from GitHub, then test with this [*additional* code](https://github.com/ncats/epi4GARD/blob/master/EpiExtract4GARD/Case%20Study.ipynb):
 
 ~~~
 import pandas as pd
 
 e
 ~~~
 
 #### Limitations and bias
 ## Training data
+ It was trained on [EpiSet4NER](https://huggingface.co/datasets/ncats/EpiSet4NER). See dataset documentation for details on the weakly supervised teaching methods and dataset biases and limitations. The training dataset distinguishes between the beginning and continuation of an entity so that if there are back-to-back entities of the same type, the model can output where the second entity begins. As in the dataset, each token will be classified as one of the following classes:
+
 Abbreviation|Description
+ ---------|--------------
+ O |Outside of a named entity
 B-LOC | Beginning of a location
 I-LOC | Inside of a location
+ B-EPI | Beginning of an epidemiologic type (e.g. "incidence", "prevalence", "occurrence")
+ I-EPI | Epidemiologic type that is not the beginning token.
 B-STAT | Beginning of an epidemiologic rate
 I-STAT | Inside of an epidemiologic rate
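The IOB scheme above can be decoded into whole entities with a short helper. A minimal sketch in pure Python; the tokens and tags here are hand-written illustrations of the classes in the table, not actual model output:

```python
def decode_iob(tokens, tags):
    """Group IOB tags into (entity_type, text) spans.

    A B-X tag opens a span; a matching I-X continues it;
    anything else ("O", or a mismatched I- tag) closes it.
    """
    spans, current_type, current_tokens = [], None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)
        else:
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans

# Illustrative sentence, hand-tagged with the classes from the table above.
tokens = ["Prevalence", "in", "Kuwait", "was", "1", "in", "100,000"]
tags = ["B-EPI", "O", "B-LOC", "O", "B-STAT", "I-STAT", "I-STAT"]
print(decode_iob(tokens, tags))
# [('EPI', 'Prevalence'), ('LOC', 'Kuwait'), ('STAT', '1 in 100,000')]
```

Because the scheme marks beginnings explicitly, two back-to-back entities of the same type decode as two separate spans rather than one merged span.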
 
 ### EpiSet Statistics
 
+ Beyond any limitations due to the EpiSet4NER dataset, this model is limited in numeracy due to BERT-based models' use of subword embeddings, which is crucial for epidemiologic rate identification and limits the entity-level results. Additionally, more recent weakly supervised learning techniques could be used to improve the performance of the model without improving the underlying dataset.
 
 ## Training procedure
+ This model was trained on an [AWS EC2 p3.2xlarge](https://aws.amazon.com/ec2/instance-types/), which utilized a single Tesla V100 GPU, with these hyperparameters:
+ 4 epochs of training (AdamW weight decay = 0.05) with a batch size of 16. Maximum sequence length = 192. The model was fed one sentence at a time. Full config [here](https://wandb.ai/wzkariampuzha/huggingface/runs/353prhts/files/config.yaml).
+
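The hyperparameters above can be gathered into one place for reference. A sketch only; the key names below are illustrative, and the authoritative values live in the linked wandb config:

```python
# Hyperparameters reported in the Training procedure section.
# (Key names are illustrative, not copied from the wandb config.)
hyperparameters = {
    "num_train_epochs": 4,
    "optimizer": "AdamW",
    "weight_decay": 0.05,
    "per_device_train_batch_size": 16,
    "max_seq_length": 192,  # the model was fed one sentence at a time
}
```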
+ ## Hold-out validation results
+ metric| entity-level result
+ -|-
+ f1 | 83.8
+ precision | 83.2
+ recall | 84.5
+
+ ## Test results
+ | Dataset for Model Training | Evaluation Level | Entity | Precision | Recall | F1 |
+ |:--------------------------:|:----------------:|:------------------:|:---------:|:------:|:-----:|
+ | EpiSet | Entity-Level | Overall | 0.556 | 0.662 | 0.605 |
+ | | | Location | 0.661 | 0.696 | 0.678 |
+ | | | Epidemiologic Type | 0.854 | 0.911 | 0.882 |
+ | | | Epidemiologic Rate | 0.143 | 0.218 | 0.173 |
+ | | Token-Level | Overall | 0.811 | 0.713 | 0.759 |
+ | | | Location | 0.949 | 0.742 | 0.833 |
+ | | | Epidemiologic Type | 0.9 | 0.917 | 0.908 |
+ | | | Epidemiologic Rate | 0.724 | 0.636 | 0.677 |
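The F1 column is the harmonic mean of precision and recall, so the reported values can be sanity-checked directly (they match the table to within rounding of the reported precision and recall):

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Overall entity-level row from the table above.
print(round(f1_score(0.556, 0.662), 3))
# → 0.604, matching the reported 0.605 to rounding
```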