Dr. Jorge Abreu Vicente committed
Commit c99d44a · Parent(s): dc342e4
Update README.md
## Model description

The generation of this model is explained in more detail in Abreu-Vicente & Lemberger (in prep).
The model is fine-tuned from [michiyasunaga/BioLinkBERT-large](https://huggingface.co/michiyasunaga/BioLinkBERT-large). This base model was chosen after an analysis of 14 different models on the [SourceData](https://huggingface.co/datasets/EMBO/sd-nlp-non-tokenized) dataset.

### The SourceData dataset

This dataset is based on the content of the SourceData database (https://sourcedata.embo.org), which contains manually annotated figure legends written in English and extracted from scientific papers in the domain of cell and molecular biology (Liechti et al, Nature Methods, 2017, https://doi.org/10.1038/nmeth.4471). Unlike the sd-nlp dataset, which is pre-tokenized with the roberta-base tokenizer, this dataset is not tokenized beforehand, only split into words, so it can be used to fine-tune other models. Additional details are available at https://github.com/source-data/soda-roberta.

The dataset in the 🤗 Hub is a processed version of the entire annotated dataset, which is also presented in Abreu-Vicente & Lemberger (in prep). Further details on the entire dataset can be found in the associated [BCVI BIO-ID track](https://biocreative.bioinformatics.udel.edu/resources/corpora/bcvi-bio-id-track/) task.

This model is fine-tuned for the biological `NER` task, in which biological and chemical entities are labeled. Specifically, the following entities are tagged:

- `SMALL_MOLECULE`: small molecules
- `GENEPROD`: gene products (genes and proteins)
- `SUBCELLULAR`: subcellular components
- `CELL`: cell types and cell lines
- `TISSUE`: tissues and organs
- `ORGANISM`: species
- `EXP_ASSAY`: experimental assays
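As a rough sketch, the seven entity types above translate into a token-level label set. The IOB2 expansion below is an assumption for illustration; the authoritative mapping is in the model's `config.id2label`.

```python
# Hypothetical sketch of the token-level label set. The IOB2 scheme is an
# assumption here; check model.config.id2label on 'EMBO/sd-ner-v2' for the
# authoritative mapping.
ENTITY_TYPES = [
    "SMALL_MOLECULE", "GENEPROD", "SUBCELLULAR",
    "CELL", "TISSUE", "ORGANISM", "EXP_ASSAY",
]
labels = ["O"] + [f"{prefix}-{t}" for t in ENTITY_TYPES for prefix in ("B", "I")]
print(len(labels))  # 15: the outside tag plus B-/I- for each of the 7 types
```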

## Intended uses & limitations

The intended use of this model is Named Entity Recognition of the biological entities used in SourceData annotations (https://sourcedata.embo.org): small molecules, gene products (genes and proteins), subcellular components, cell lines and cell types, organs and tissues, and species, as well as experimental methods.

To quickly check the model:

```python
from transformers import pipeline, AutoTokenizer, AutoModelForTokenClassification

example = """<s> F. Western blot of input and eluates of Upf1 domains purification in a Nmd4-HA strain. The band with the # might corresponds to a dimer of Upf1-CH, bands marked with a star correspond to residual signal with the anti-HA antibodies (Nmd4). Fragments in the eluate have a smaller size because the protein A part of the tag was removed by digestion with the TEV protease. G6PDH served as a loading control in the input samples </s>"""

# Load the fine-tuned model and its tokenizer from the 🤗 Hub
tokenizer = AutoTokenizer.from_pretrained('EMBO/sd-ner-v2', max_len=512)
model = AutoModelForTokenClassification.from_pretrained('EMBO/sd-ner-v2')

# Run token classification on the example figure legend
ner = pipeline('ner', model=model, tokenizer=tokenizer)
res = ner(example)
for r in res:
    print(r['word'], r['entity'])
```

### Possible limitations

The model has been trained on pre-tokenized words. Although the SentencePiece tokenizer and the pre-processing included in the 🤗 tokenizers library generally do a good job, this might generate some issues related to the use of white spaces between characters.
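A minimal illustration of this caveat (the words below are a made-up fragment, not from the dataset): re-joining pre-split words with plain spaces does not always reproduce the original text, so spacing around punctuation can differ from what the tokenizer saw at training time.

```python
# Minimal sketch of the pre-tokenization caveat: the SourceData text is split
# into words, and naively re-joining with spaces inserts white space around
# punctuation that the original sentence did not have.
words = ["Upf1", "-", "CH", "fragments", "."]  # hypothetical pre-split input
rejoined = " ".join(words)
print(rejoined)  # 'Upf1 - CH fragments .' vs. the original 'Upf1-CH fragments.'
```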

## Training and evaluation data

The training, evaluation, and test splits of the data can be found in the [SourceData dataset](https://huggingface.co/datasets/EMBO/sd-nlp-non-tokenized).
## Training procedure

| Training Loss | Epoch | Step | Validation Loss | Accuracy | Precision | Recall | F1 |
|:-------------:|:-----:|:----:|:---------------:|:--------:|:---------:|:------:|:------:|
| 0.0885 | 2.0 | 4132 | 0.1685 | 0.9438 | 0.7377 | 0.8168 | 0.7752 |

## Performance of the model on the test set

```
                precision    recall  f1-score   support

          CELL       0.71      0.79      0.75      4948
     EXP_ASSAY       0.59      0.60      0.60      9885
      GENEPROD       0.79      0.89      0.84     21865
      ORGANISM       0.72      0.85      0.78      3464
SMALL_MOLECULE       0.72      0.81      0.76      6431
   SUBCELLULAR       0.72      0.77      0.74      3850
        TISSUE       0.68      0.76      0.72      2975

     micro avg       0.72      0.80      0.76
     macro avg       0.70      0.78      0.74     53418
  weighted avg       0.72      0.80      0.76     53418

{'test_loss': 0.16807569563388824, 'test_accuracy_score': 0.9427137503742414, 'test_precision': 0.7242540660382148, 'test_recall': 0.8011157287805608, 'test_f1': 0.7607484111817252, 'test_runtime': 88.1851, 'test_samples_per_second': 93.27, 'test_steps_per_second': 0.374}
```
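As a quick sanity check on the numbers above (using only the reported values), the micro F1 should be the harmonic mean of the micro precision and recall:

```python
# Sanity check: micro F1 is the harmonic mean of the reported micro
# precision and micro recall from the test metrics above.
precision = 0.7242540660382148
recall = 0.8011157287805608
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # 0.7607, matching the reported test_f1
```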

### Framework versions

- Transformers 4.15.0