+---
+license: mit
+---
+# Model Card for named_entity_recognition.pt
+This is a fine-tuned model checkpoint for the named entity recognition (NER) task used in the biodata resource inventory performed by the
+[Global Biodata Coalition](https://globalbiodata.org/) in collaboration with [Chan Zuckerberg Initiative](https://chanzuckerberg.com/).
+# Model Details
+## Model Description
+This model has been fine-tuned to detect resource names in scientific articles (title and abstract). This is done using token classification, which assigns predicted
+token labels following the [BIO scheme](https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)). These are post-processed to determine the
+predicted "common names" (often an acronym) and "full names" of a resource present in an article.
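The post-processing step described above can be illustrated with a minimal sketch that groups BIO-tagged tokens into mentions. The label names (`B-COM`/`I-COM` for common names, `B-FUL`/`I-FUL` for full names) and the function `decode_bio` are hypothetical and chosen only for illustration; the repository's actual label set and post-processing logic may differ.

```python
from typing import List, Tuple


def decode_bio(tokens: List[str], labels: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-labeled tokens into (entity_type, text) mentions.

    Label names are assumptions for this sketch, not the model's real labels.
    """
    mentions: List[Tuple[str, str]] = []
    current_type = None
    current_tokens: List[str] = []

    def flush() -> None:
        # Close the currently open mention, if there is one.
        nonlocal current_type, current_tokens
        if current_type is not None:
            mentions.append((current_type, " ".join(current_tokens)))
        current_type, current_tokens = None, []

    for token, label in zip(tokens, labels):
        if label.startswith("B-"):
            # A "B-" tag always starts a new mention.
            flush()
            current_type, current_tokens = label[2:], [token]
        elif label.startswith("I-") and current_type == label[2:]:
            # An "I-" tag continues the open mention of the same type.
            current_tokens.append(token)
        else:
            # "O", or an "I-" tag that does not continue the open mention;
            # such stray "I-" tokens are dropped in this sketch.
            flush()
    flush()
    return mentions


tokens = ["The", "Protein", "Data", "Bank", "(", "PDB", ")", "is", "updated"]
labels = ["O", "B-FUL", "I-FUL", "I-FUL", "O", "B-COM", "O", "O", "O"]
print(decode_bio(tokens, labels))
# [('FUL', 'Protein Data Bank'), ('COM', 'PDB')]
```

Joining tokens with a space is a simplification; with a subword tokenizer, mentions would more likely be reassembled from character offsets.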
+- **Developed by:** Ana-Maria Istrate and Kenneth E. Schackart III
+- **Shared by:** Kenneth E. Schackart III
+- **Model type:** RoBERTa (BERT; Transformer)
+- **Language(s) (NLP):** English
+- **License:** MIT
+- **Finetuned from model:** https://huggingface.co/allenai/dsp_roberta_base_dapt_biomed_tapt_rct_500
+## Model Sources
+- **Repository:** https://github.com/globalbiodata/inventory_2022/tree/inventory_2022_dev
+- **Paper:** TBA
+- **Demo:** TBA
+# Uses
+This model can be used to find predicted biodata resource names in an article's title and abstract.
+## Direct Use
+The model has not been designed or assessed for direct use.
+## Out-of-Scope Use
+The model should not be used for anything other than the use described in [Uses](named_entity_recognition_modelcard.md#uses).
+# Bias, Risks, and Limitations
+Biases may have been introduced at several stages of the development and training of this model. First, the model was trained on biomedical corpora
+as described in [Gururangan S., *et al.,* 2020](http://arxiv.org/abs/2004.10964). Second, the model was fine-tuned on scientific articles that were
+manually annotated by two curators. Biases in the manual annotation may have affected model fine-tuning. Additionally, manually annotated data were
+procured using a specific search query to Europe PMC, so generalizability may be limited when applying to articles from other sources.
+## Recommendations
+The model should only be used for identifying resource names in articles from Europe PMC using the
+[query](https://github.com/globalbiodata/inventory_2022/blob/inventory_2022_dev/config/query.txt) present in the GitHub repository.
+Additionally, only articles predicted or known to describe a biodata resource should be used.
+## How to Get Started with the Model
+Follow the directions in the [GitHub repository](https://github.com/globalbiodata/inventory_2022/tree/inventory_2022_dev).
+# Training Details
+## Training Data
+The model was trained on the training split from the [labeled training data](https://github.com/globalbiodata/inventory_2022/blob/inventory_2022_dev/data/manual_ner_extraction.csv).
+*Note*: The data can be split into consistent training, validation, and testing splits using the procedures detailed in the GitHub repository.
+## Training Procedure
+The model was trained for 10 epochs, and *F*1-score, precision, recall, and loss were computed after each epoch. The model checkpoint with the highest *F*1-score on the validation
+set was saved (regardless of epoch number).
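The checkpoint selection rule above can be sketched as follows. This is an illustration only: the function name `select_best_epoch`, the `"f1"` key, and the metric values are hypothetical, and keeping the earliest epoch on a tie is an assumption rather than documented behavior of the training code.

```python
from typing import Dict, List


def select_best_epoch(val_metrics: List[Dict[str, float]]) -> int:
    """Return the 1-based epoch whose validation F1-score is highest.

    On ties, the earliest epoch is kept (an assumption for this sketch).
    """
    best_epoch, best_f1 = 0, float("-inf")
    for epoch, metrics in enumerate(val_metrics, start=1):
        if metrics["f1"] > best_f1:
            best_epoch, best_f1 = epoch, metrics["f1"]
    return best_epoch


# Made-up validation scores for three epochs, not the model's real results.
history = [{"f1": 0.61}, {"f1": 0.70}, {"f1": 0.68}]
print(select_best_epoch(history))  # 2
```

In a real training loop, the model weights would be written to disk whenever the best score improves, so that only the best checkpoint remains after the final epoch.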
+### Preprocessing
+To generate the input to the model, each article's title and abstract were concatenated into one contiguous string, separated by a single whitespace character. All
+XML tags were removed using a regular expression.
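A minimal sketch of this preprocessing is shown below. The regular expression and the function name `build_model_input` are assumptions made for illustration; the pattern actually used in the repository may differ.

```python
import re

# Assumed tag pattern: "<", then one or more characters that are not ">", then ">".
# Caveat: this would also strip text such as "< 0.05 and x >" that merely
# looks like a tag, so a production pattern may need to be stricter.
TAG_PATTERN = re.compile(r"<[^>]+>")


def build_model_input(title: str, abstract: str) -> str:
    """Join title and abstract with one space, then strip XML-like tags."""
    text = f"{title} {abstract}"
    return TAG_PATTERN.sub("", text)


print(build_model_input("<i>MyDB</i>: a database", "We present <b>MyDB</b>."))
# MyDB: a database We present MyDB.
```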
+### Speeds, Sizes, Times
+The model checkpoint is 496 MB. Speed has not been benchmarked.
+# Evaluation
+## Testing Data, Factors & Metrics
+### Testing Data
+The model was evaluated using the test split of the [labeled data](https://github.com/globalbiodata/inventory_2022/blob/inventory_2022_dev/data/manual_ner_extraction.csv).
+### Metrics
+The model was evaluated using *F*1-score, precision, and recall. Precision was prioritized during fine-tuning and model selection.
+## Results
+- *F*1-score: 0.717
+- Precision: 0.689
+- Recall: 0.748
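As a consistency check on the numbers above, the *F*1-score is the harmonic mean of precision and recall. The short sketch below recomputes it from the reported precision and recall; the helper name `f1_score` is chosen for illustration only.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Reported precision and recall from the results list above.
print(round(f1_score(0.689, 0.748), 3))  # 0.717
```

The recomputed value agrees with the reported *F*1-score of 0.717.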
+### Summary
+# Model Examination
+The model works satisfactorily for identifying resource names from articles describing biodata resources in the literature.
+## Model Architecture and Objective
+The base model architecture is as described in [Gururangan S., *et al.,* 2020](http://arxiv.org/abs/2004.10964). Token classification is performed using
+a linear token classification layer initialized using [transformers.AutoModelForTokenClassification](https://huggingface.co/docs/transformers/model_doc/auto).
+## Compute Infrastructure
+The model was fine-tuned on Google Colaboratory.
+### Hardware
+The model was fine-tuned using GPU acceleration provided by Google Colaboratory.
+### Software
+Training software was written in Python.
+# Citation
+TBA
+**BibTeX:**
+TBA
+**APA:**
+TBA
+# Model Card Authors
+This model card was written by Kenneth E. Schackart III.
+# Model Card Contact
+Ken Schackart: <schackartk1@gmail.com>