Dr. Jorge Abreu Vicente committed
Commit c99d44a · 1 parent: dc342e4

Update README.md

Files changed (1): README.md (+60, −3)
README.md CHANGED
@@ -45,15 +45,53 @@ It achieves the following results on the evaluation set:
 
 ## Model description
 
-More information needed
+The generation of this model is explained in more detail in Abreu-Vicente & Lemberger (in prep).
+The model is fine-tuned from [michiyasunaga/BioLinkBERT-large](https://huggingface.co/michiyasunaga/BioLinkBERT-large).
+[michiyasunaga/BioLinkBERT-large](https://huggingface.co/michiyasunaga/BioLinkBERT-large) was chosen after an analysis of 14 different models
+on the [SourceData](https://huggingface.co/datasets/EMBO/sd-nlp-non-tokenized) dataset.
+
+### The SourceData dataset
+
+This dataset is based on the content of the SourceData (https://sourcedata.embo.org) database, which contains manually annotated figure legends written in English and extracted from scientific papers in the domain of cell and molecular biology (Liechti et al., Nature Methods, 2017, https://doi.org/10.1038/nmeth.4471). Unlike the sd-nlp dataset, which is pre-tokenized with the roberta-base tokenizer, this dataset is not pre-tokenized, only split into words, so users can use it to fine-tune other models. Additional details at https://github.com/source-data/soda-roberta.
+
+The dataset in the 🤗 Hub is a processed version of the entire annotated dataset, which is also presented in Abreu-Vicente & Lemberger (in prep).
+Further details on the entire dataset can be found in the associated [BCVI BIO-ID track](https://biocreative.bioinformatics.udel.edu/resources/corpora/bcvi-bio-id-track/) task.
+
+This model is fine-tuned for the biological `NER` task, in which biological and chemical entities are labeled. Specifically, the following entities are tagged:
+
+- `SMALL_MOLECULE`: small molecules
+- `GENEPROD`: gene products (genes and proteins)
+- `SUBCELLULAR`: subcellular components
+- `CELL`: cell types and cell lines
+- `TISSUE`: tissues and organs
+- `ORGANISM`: species
+- `EXP_ASSAY`: experimental assays
 
 ## Intended uses & limitations
 
-More information needed
+The intended use of this model is Named Entity Recognition of the biological entities used in SourceData annotations (https://sourcedata.embo.org): small molecules, gene products (genes and proteins), subcellular components, cell lines and cell types, organs and tissues, and species, as well as experimental methods.
+
+For a quick check of the model:
+
+```python
+from transformers import pipeline, AutoTokenizer, AutoModelForTokenClassification
+
+example = """<s> F. Western blot of input and eluates of Upf1 domains purification in a Nmd4-HA strain. The band with the # might corresponds to a dimer of Upf1-CH, bands marked with a star correspond to residual signal with the anti-HA antibodies (Nmd4). Fragments in the eluate have a smaller size because the protein A part of the tag was removed by digestion with the TEV protease. G6PDH served as a loading control in the input samples </s>"""
+tokenizer = AutoTokenizer.from_pretrained('EMBO/sd-ner-v2', max_len=512)
+model = AutoModelForTokenClassification.from_pretrained('EMBO/sd-ner-v2')
+ner = pipeline('ner', model=model, tokenizer=tokenizer)
+res = ner(example)
+for r in res:
+    print(r['word'], r['entity'])
+```
+
+### Possible limitations
+
+The model has been trained on pre-tokenized words. Although the SentencePiece tokenizer and the pre-processing included in the 🤗 tokenizers library generally do a good job, this may cause some issues related to white space between characters.
 
 ## Training and evaluation data
 
-More information needed
+The training, evaluation, and test splits of the data can be found in the [SourceData dataset](https://huggingface.co/datasets/EMBO/sd-nlp-non-tokenized).
 
 ## Training procedure
 
@@ -76,6 +114,25 @@ The following hyperparameters were used during training:
 | 0.0885 | 2.0 | 4132 | 0.1685 | 0.9438 | 0.7377 | 0.8168 | 0.7752 |
 
 
+## Performance of the model on the test set
+
+```
+                precision    recall  f1-score   support
+CELL                 0.71      0.79      0.75      4948
+EXP_ASSAY            0.59      0.60      0.60      9885
+GENEPROD             0.79      0.89      0.84     21865
+ORGANISM             0.72      0.85      0.78      3464
+SMALL_MOLECULE       0.72      0.81      0.76      6431
+SUBCELLULAR          0.72      0.77      0.74      3850
+TISSUE               0.68      0.76      0.72      2975
+
+micro avg            0.72      0.80      0.76
+macro avg            0.70      0.78      0.74     53418
+weighted avg         0.72      0.80      0.76     53418
+
+{'test_loss': 0.16807569563388824, 'test_accuracy_score': 0.9427137503742414, 'test_precision': 0.7242540660382148, 'test_recall': 0.8011157287805608, 'test_f1': 0.7607484111817252, 'test_runtime': 88.1851, 'test_samples_per_second': 93.27, 'test_steps_per_second': 0.374}
+```
+
 ### Framework versions
 
 - Transformers 4.15.0
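
A usage note on the pipeline example added in this commit: a token-classification pipeline returns one prediction per (sub)word token, so entity spans usually need to be merged in post-processing. The sketch below is a minimal illustration, assuming IOB2-style labels (`B-GENEPROD`, `I-GENEPROD`, `O`, …); the actual label scheme should be checked in the model's `id2label` config, and `merge_entities` is a hypothetical helper name, not part of 🤗 Transformers.

```python
# Sketch: merging token-level NER predictions into entity spans.
# ASSUMPTION: IOB2-style labels (B-GENEPROD, I-GENEPROD, O, ...);
# check the model's config (id2label) for the actual scheme.

def merge_entities(tokens):
    """Group consecutive B-/I- tagged tokens into (text, label) spans."""
    spans = []
    current_words, current_label = [], None
    for tok in tokens:
        tag = tok["entity"]
        if tag.startswith("B-") or (tag.startswith("I-") and tag[2:] != current_label):
            # A B- tag, or an I- tag with a new label, starts a new span.
            if current_words:
                spans.append((" ".join(current_words), current_label))
            current_words, current_label = [tok["word"]], tag[2:]
        elif tag.startswith("I-"):
            # Same-label continuation: extend the current span.
            current_words.append(tok["word"])
        else:
            # "O" (or anything else) closes the current span.
            if current_words:
                spans.append((" ".join(current_words), current_label))
            current_words, current_label = [], None
    if current_words:
        spans.append((" ".join(current_words), current_label))
    return spans

# Toy example with hand-written predictions (not real model output):
preds = [
    {"word": "Western", "entity": "B-EXP_ASSAY"},
    {"word": "blot", "entity": "I-EXP_ASSAY"},
    {"word": "of", "entity": "O"},
    {"word": "Upf1", "entity": "B-GENEPROD"},
]
print(merge_entities(preds))  # [('Western blot', 'EXP_ASSAY'), ('Upf1', 'GENEPROD')]
```

Recent versions of 🤗 Transformers can also do this merging directly via the pipeline's `aggregation_strategy` argument; the manual version above just makes the grouping logic explicit.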