---
license: mit
---

# Model Card for named_entity_recognition.pt

This is a fine-tuned model checkpoint for the named entity recognition (NER) task used in the biodata resource inventory performed by the
[Global Biodata Coalition](https://globalbiodata.org/) in collaboration with [Chan Zuckerberg Initiative](https://chanzuckerberg.com/).

# Model Details

## Model Description

This model has been fine-tuned to detect resource names in scientific articles (title and abstract). It performs token classification, assigning each token a predicted
label following the [BIO scheme](https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)). The labels are post-processed to determine the
predicted "common names" (often an acronym) and "full names" of a resource present in an article.
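
As an illustration of the post-processing idea, the following is a minimal sketch of how BIO labels can be grouped into named spans. It is not the code from the repository: the function name `decode_bio` and the label names `B-FUL`, `I-FUL`, `B-COM`, and `I-COM` are assumptions chosen for this example, and tokens are simply joined with spaces.

```python
from typing import List, Optional, Tuple


def decode_bio(tokens: List[str], labels: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (entity_type, text) spans.

    Label names such as "B-FUL" or "I-COM" are hypothetical; any
    "B-<type>" / "I-<type>" / "O" labels are handled the same way.
    """
    spans: List[Tuple[str, str]] = []
    current_type: Optional[str] = None
    current_tokens: List[str] = []

    for token, label in zip(tokens, labels):
        prefix, _, entity_type = label.partition("-")

        # An "I-" tag of the same type continues the open span.
        if prefix == "I" and entity_type == current_type:
            current_tokens.append(token)
            continue

        # Any other label closes the open span, if there is one.
        if current_type is not None:
            spans.append((current_type, " ".join(current_tokens)))

        if prefix == "B":
            current_type, current_tokens = entity_type, [token]
        else:
            # "O", or an "I-" tag with no matching open span, starts nothing.
            current_type, current_tokens = None, []

    # Close a span that runs to the end of the sequence.
    if current_type is not None:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


tokens = ["The", "Protein", "Data", "Bank", "(", "PDB", ")", "stores", "structures"]
labels = ["O", "B-FUL", "I-FUL", "I-FUL", "O", "B-COM", "O", "O", "O"]
print(decode_bio(tokens, labels))
```

In this sketch an `I-` tag that does not continue a span of the same type is discarded; a real pipeline may prefer to treat it as the start of a new span.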



- **Developed by:** Ana-Maria Istrate and Kenneth E. Schackart III
- **Shared by:** Kenneth E. Schackart III
- **Model type:** RoBERTa (BERT; Transformer)
- **Language(s) (NLP):** English
- **License:** MIT
- **Finetuned from model:** https://huggingface.co/allenai/dsp_roberta_base_dapt_biomed_tapt_rct_500

## Model Sources

- **Repository:** https://github.com/globalbiodata/inventory_2022/tree/inventory_2022_dev
- **Paper [optional]:** TBA
- **Demo [optional]:** TBA

# Uses

This model can be used to find predicted biodata resource names in an article's title and abstract.

## Direct Use

The model has not been designed or assessed for direct use.

## Out-of-Scope Use

The model should not be used for anything other than the use described in [Uses](named_entity_recognition_modelcard.md#uses).

# Bias, Risks, and Limitations

Biases may have been introduced at several stages of the development and training of this model. First, the model was trained on biomedical corpora
as described in [Gururangan S., *et al.,* 2020](http://arxiv.org/abs/2004.10964). Second, the model was fine-tuned on scientific articles that were
manually annotated by two curators. Biases in the manual annotation may have affected model fine-tuning. Additionally, the manually annotated data were
obtained using a specific search query to Europe PMC, so generalizability may be limited when applying the model to articles from other sources.

## Recommendations

The model should only be used for identifying resource names in articles from Europe PMC using the
[query](https://github.com/globalbiodata/inventory_2022/blob/inventory_2022_dev/config/query.txt) present in the GitHub repository.
Additionally, only articles predicted or known to describe a biodata resource should be used.

## How to Get Started with the Model

Follow the directions in the [GitHub repository](https://github.com/globalbiodata/inventory_2022/tree/inventory_2022_dev).

# Training Details

## Training Data

The model was trained on the training split from the [labeled training data](https://github.com/globalbiodata/inventory_2022/blob/inventory_2022_dev/data/manual_ner_extraction.csv).

*Note*: The data can be split into consistent training, validation, and test splits using the procedures detailed in the GitHub repository.

## Training Procedure

The model was trained for 10 epochs, and *F*1-score, precision, recall, and loss were computed after each epoch. The model checkpoint with the highest *F*1-score on the validation
set was saved (regardless of epoch number).
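
The selection rule above can be sketched as follows. This is only an illustration, not the repository's training loop: the function name `select_best_epoch` and the example scores are hypothetical, and breaking ties in favor of the earliest epoch is an assumption.

```python
from typing import List


def select_best_epoch(val_f1_by_epoch: List[float]) -> int:
    """Return the 1-based epoch with the highest validation F1-score.

    Ties go to the earliest epoch (an assumption made for this sketch).
    """
    best_epoch, best_f1 = 0, float("-inf")
    for epoch, f1 in enumerate(val_f1_by_epoch, start=1):
        # Strict comparison keeps the earlier epoch on ties.
        if f1 > best_f1:
            best_epoch, best_f1 = epoch, f1
    return best_epoch


# Hypothetical validation F1-scores for four epochs (not real results).
example_scores = [0.50, 0.70, 0.65, 0.70]
print(select_best_epoch(example_scores))
```

In a training loop, the checkpoint would be written to disk whenever the current epoch becomes the best one, so that only the best-scoring weights are kept.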

### Preprocessing

To generate the input to the model, the article title and abstract were concatenated into a single string, separated by one whitespace character. All
XML tags were removed using a regular expression.
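
A minimal sketch of this preprocessing step is shown below. The regular expression and the function name `build_model_input` are assumptions made for illustration; the pattern actually used is in the GitHub repository.

```python
import re

# Assumed pattern: anything between "<" and the next ">" is treated as a tag.
XML_TAG = re.compile(r"<[^>]+>")


def build_model_input(title: str, abstract: str) -> str:
    """Join title and abstract with one space, then strip XML tags."""
    text = f"{title} {abstract}"
    return XML_TAG.sub("", text)


print(build_model_input("The <i>PDB</i> archive", "<p>Structures of proteins.</p>"))
```

A pattern this simple also removes plain text that happens to sit between a `<` and a later `>`, such as inequalities in an abstract, so it should be checked against the data before reuse.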

### Speeds, Sizes, Times

The model checkpoint is 496 MB. Speed has not been benchmarked.

# Evaluation

<!-- This section describes the evaluation protocols and provides the results. -->

## Testing Data, Factors & Metrics

### Testing Data

<!-- This should link to a Data Card if possible. -->

The model was evaluated using the test split of the [labeled data](https://github.com/globalbiodata/inventory_2022/blob/inventory_2022_dev/data/manual_ner_extraction.csv).

### Metrics

<!-- These are the evaluation metrics being used, ideally with a description of why. -->

The model was evaluated using *F*1-score, precision, and recall. Precision was prioritized during fine-tuning and model selection.

## Results

- *F*1-score: 0.717
- Precision: 0.689
- Recall: 0.748
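
As a consistency check, the reported *F*1-score can be reproduced from the reported precision and recall, since *F*1 is their harmonic mean. The helper name `f1_score` is chosen for this sketch only.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values reported above.
print(round(f1_score(0.689, 0.748), 3))  # 0.717
```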

### Summary



# Model Examination

The model works satisfactorily for identifying resource names from articles describing biodata resources in the literature.

## Model Architecture and Objective

The base model architecture is as described in [Gururangan S., *et al.,* 2020](http://arxiv.org/abs/2004.10964). Token classification is performed using
a linear classification layer applied to each token on top of the base model, set up through [transformers.AutoModelForTokenClassification](https://huggingface.co/docs/transformers/model_doc/auto).

## Compute Infrastructure

The model was fine-tuned on Google Colaboratory.

### Hardware

The model was fine-tuned using GPU acceleration provided by Google Colaboratory.

### Software

The training software was written in Python.

# Citation

<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->

TBA

**BibTeX:**

TBA

**APA:**

TBA

# Model Card Authors

This model card was written by Kenneth E. Schackart III.

# Model Card Contact

Ken Schackart: <schackartk1@gmail.com>