Dr. Jorge Abreu Vicente committed on
Commit 304bd5e · 1 Parent(s): 8315fe7

update model card README.md

Files changed (1)
  1. README.md +20 -82
README.md CHANGED
@@ -1,5 +1,5 @@
  ---
- license: apache-2.0
  tags:
  - generated_from_trainer
  datasets:
@@ -21,18 +21,13 @@ model-index:
  metrics:
  - name: Precision
  type: precision
- value: 0.7376642982556899
  - name: Recall
  type: recall
- value: 0.8167811330353574
  - name: F1
  type: f1
- value: 0.7752093051328258
-
- widget:
- - text: "Confocal images of Bmm-GFP (green) in 3rd instar larval fat bodies of different genotypes. DAPI (blue) stains nuclei. Scale bar represents 25 µm. (A) Knocking down CSN2 or overexpressing RDH/CG2064 in animals with DGAT1 overexpression (ppl>DGAT1) decreases the level and lipid droplet localization of Bmm-GFP."
- - text: "The GFP intensity along the line across a lipid droplet in (A) was measured by ImageJ.The lipid droplet localization of Bmm-GFP, represented by two peaks, is clearly visible in fat cells from ppl > DGAT1 larvae , but it is lost in fat cells from ppl > DGAT1 larvae with CSN2 RNAi or overexpression of RDH/CG2064. More than 30 lipid droplets of each genotype were measured. One typical image curve is shown for each genotype."
- - text: "XPT of siRNA treated DC3. 2R cells after 48 hours of knockdown. Treated cells were fed with the indicated amounts of C8L peptid conjugated to iron oxide beads via a disulfide bond. The cells were then exposed to RF33. 70-Luc Reporter CD8 T cells overnight. Error bars show SD of >3 replicate wells. * p<0.05 for siRNA vs control I-Ab using two-way ANOVA. Representative plot of 3 independent experiments."
  ---
 
  <!-- This model card has been generated automatically according to the information the Trainer had access to. You
@@ -40,71 +35,33 @@ should probably proofread and complete it, then remove this comment. -->
 
  # sd-ner-v2
 
- This model is a fine-tuned version of [michiyasunaga/BioLinkBERT-large](https://huggingface.co/michiyasunaga/BioLinkBERT-large) on the source_data_nlp dataset.
  It achieves the following results on the evaluation set:
- - Loss: 0.1685
- - Accuracy Score: 0.9438
- - Precision: 0.7377
- - Recall: 0.8168
- - F1: 0.7752
 
  ## Model description
 
- The generation of this model is explained in more detail in Abreu-Vicente & Lemberger (in prep).
- The model is fine-tuned from [michiyasunaga/BioLinkBERT-large](https://huggingface.co/michiyasunaga/BioLinkBERT-large).
- The use of [michiyasunaga/BioLinkBERT-large](https://huggingface.co/michiyasunaga/BioLinkBERT-large) was decided after proceeding to the analysis of 14 different models
- in the [SourceData](https://huggingface.co/datasets/EMBO/sd-nlp-non-tokenized) dataset.
-
- ### The SourceData dataset
-
- This dataset is based on the content of the SourceData (https://sourcedata.embo.org) database, which contains manually annotated figure legends written in English and extracted from scientific papers in the domain of cell and molecular biology (Liechti et al, Nature Methods, 2017, https://doi.org/10.1038/nmeth.4471). Unlike the dataset sd-nlp, pre-tokenized with the roberta-base tokenizer, this dataset is not previously tokenized, but just splitted into words. Users can therefore use it to fine-tune other models. Additional details at https://github.com/source-data/soda-roberta
-
- The dataset in the 🤗 Hub is just a processed version of the entire annotated dataset that is presented also in Abreu-Vicente & Lemberger (in prep).
- Further details on the entire dataset can be found in the [BCVI BIO-ID track](https://biocreative.bioinformatics.udel.edu/resources/corpora/bcvi-bio-id-track/) task associated.
-
- This model is fine-tuned in the biological `NER` task. On it, biological and chemical entities are labeled. Specifically the following entities are tagged:
-
- `NER`: biological and chemical entities are labeled. Specifically the following entities are tagged:
- - `SMALL_MOLECULE`: small molecules
- - `GENEPROD`: gene products (genes and proteins)
- - `SUBCELLULAR`: subcellular components
- - `CELL`: cell types and cell lines.
- - `TISSUE`: tissues and organs
- - `ORGANISM`: species
- - `EXP_ASSAY`: experimental assays
 
  ## Intended uses & limitations
 
- The intended use of this model is for Named Entity Recognition of biological entities used in SourceData annotations (https://sourcedata.embo.org), including small molecules, gene products (genes and proteins), subcellular components, cell line and cell types, organ and tissues, species as well as experimental methods.
-
- To have a quick check of the model:
-
- ```python
- from transformers import pipeline, AutoTokenizer, AutoModelForTokenClassification
- example = """<s> F. Western blot of input and eluates of Upf1 domains purification in a Nmd4-HA strain. The band with the # might corresponds to a dimer of Upf1-CH, bands marked with a star correspond to residual signal with the anti-HA antibodies (Nmd4). Fragments in the eluate have a smaller size because the protein A part of the tag was removed by digestion with the TEV protease. G6PDH served as a loading control in the input samples </s>"""
- tokenizer = AutoTokenizer.from_pretrained('EMBO/sd-ner-v2', max_len=512)
- model = AutoModelForTokenClassification.from_pretrained('EMBO/sd-ner-v2')
- ner = pipeline('ner', model, tokenizer=tokenizer)
- res = ner(example)
- for r in res:
-     print(r['word'], r['entity'])
- ```
-
- ### Possible limitations
-
- The model has been trained on pre-tokenized words. Although in general the SentencePiece tokenizer and part of the pre-processing included in the 🤗 tokenizers library seem to do a good job, this might generate some issues related to the use of white spaces between characters.
 
  ## Training and evaluation data
 
- The training, evaluation, and test splits of the data used can be found in [SourceData dataset](https://huggingface.co/datasets/EMBO/sd-nlp-non-tokenized).
 
  ## Training procedure
 
  ### Training hyperparameters
 
  The following hyperparameters were used during training:
- - learning_rate: 5e-05
- - train_batch_size: 32
  - eval_batch_size: 256
  - seed: 42
  - optimizer: Adafactor
@@ -115,32 +72,13 @@ The following hyperparameters were used during training:
 
  | Training Loss | Epoch | Step | Validation Loss | Accuracy Score | Precision | Recall | F1 |
  |:-------------:|:-----:|:----:|:---------------:|:--------------:|:---------:|:------:|:------:|
- | 0.1308 | 1.0 | 2066 | 0.1634 | 0.9414 | 0.7155 | 0.8278 | 0.7676 |
- | 0.0885 | 2.0 | 4132 | 0.1685 | 0.9438 | 0.7377 | 0.8168 | 0.7752 |
-
-
- ## Performance of the model in the training dataset
-
- ```
- precision recall f1-score support
- CELL 0.71 0.79 0.75 4948
- EXP_ASSAY 0.59 0.60 0.60 9885
- GENEPROD 0.79 0.89 0.84 21865
- ORGANISM 0.72 0.85 0.78 3464
- SMALL_MOLECULE 0.72 0.81 0.76 6431
- SUBCELLULAR 0.72 0.77 0.74 3850
- TISSUE 0.68 0.76 0.72 2975
-
- micro avg 0.72 0.80 0.76
- macro avg 0.70 0.78 0.74 53418
- weighted avg 0.72 0.80 0.76 53418
-
- {'test_loss': 0.16807569563388824, 'test_accuracy_score': 0.9427137503742414, 'test_precision': 0.7242540660382148, 'test_recall': 0.8011157287805608, 'test_f1': 0.7607484111817252, 'test_runtime': 88.1851, 'test_samples_per_second': 93.27, 'test_steps_per_second': 0.374}
- ```
 
  ### Framework versions
 
- - Transformers 4.15.0
  - Pytorch 1.11.0a0+bfe5ad2
  - Datasets 1.17.0
- - Tokenizers 0.10.3
 
  ---
+ license: mit
  tags:
  - generated_from_trainer
  datasets:
 
  metrics:
  - name: Precision
  type: precision
+ value: 0.8030010681183889
  - name: Recall
  type: recall
+ value: 0.837754771918473
  - name: F1
  type: f1
+ value: 0.8200098518700961
  ---
 
  <!-- This model card has been generated automatically according to the information the Trainer had access to. You
 
 
  # sd-ner-v2
 
+ This model is a fine-tuned version of [microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract](https://huggingface.co/microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract) on the source_data_nlp dataset.
  It achieves the following results on the evaluation set:
+ - Loss: 0.1551
+ - Accuracy Score: 0.9513
+ - Precision: 0.8030
+ - Recall: 0.8378
+ - F1: 0.8200
 
  ## Model description
 
+ More information needed
 
  ## Intended uses & limitations
 
+ More information needed
 
  ## Training and evaluation data
 
+ More information needed
 
  ## Training procedure
 
  ### Training hyperparameters
 
  The following hyperparameters were used during training:
+ - learning_rate: 0.0001
+ - train_batch_size: 64
  - eval_batch_size: 256
  - seed: 42
  - optimizer: Adafactor
 
 
  | Training Loss | Epoch | Step | Validation Loss | Accuracy Score | Precision | Recall | F1 |
  |:-------------:|:-----:|:----:|:---------------:|:--------------:|:---------:|:------:|:------:|
+ | 0.1082 | 1.0 | 785 | 0.1550 | 0.9493 | 0.7826 | 0.8402 | 0.8104 |
+ | 0.073 | 2.0 | 1570 | 0.1551 | 0.9513 | 0.8030 | 0.8378 | 0.8200 |
+
 
  ### Framework versions
 
+ - Transformers 4.20.0
  - Pytorch 1.11.0a0+bfe5ad2
  - Datasets 1.17.0
+ - Tokenizers 0.12.1
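
The metric values touched by this commit can be cross-checked. F1 is the harmonic mean of precision and recall, so each reported F1 should follow from its precision/recall pair. The sketch below uses only numbers copied from the `model-index` metadata in the diff above; the helper name `f1_from` and the tolerance are illustrative assumptions, not part of the commit or of any library.

```python
import math


def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (hypothetical helper)."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1), copied from the removed and added metadata lines.
old = (0.7376642982556899, 0.8167811330353574, 0.7752093051328258)
new = (0.8030010681183889, 0.837754771918473, 0.8200098518700961)

for label, (p, r, reported) in (("old", old), ("new", new)):
    computed = f1_from(p, r)
    # The tolerance is an assumption; it only absorbs floating-point noise.
    consistent = math.isclose(computed, reported, abs_tol=1e-6)
    print(f"{label}: computed F1 = {computed:.4f}, consistent = {consistent}")

# Difference between the F1 values reported before and after this commit.
f1_gain = new[2] - old[2]
print(f"F1 change between commits: {f1_gain:+.4f}")
```

Both pairs should come out consistent with their reported F1, and the last line prints how far the evaluation F1 moved between the BioLinkBERT-large card and the PubMedBERT-based card.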