Commit
·
ad091d5
1
Parent(s):
994b162
Update README.md
README.md
CHANGED
|
@@ -1,13 +1,17 @@
|
|
| 1 |
---
|
| 2 |
tags:
|
| 3 |
- Beta
|
| 4 |
-
license:
|
| 5 |
-
thumbnail:
|
|
|
|
| 6 |
widget:
|
| 7 |
-
- text:
|
| 8 |
-
example_title:
|
| 9 |
-
- text:
|
| 10 |
-
example_title:
|
| 11 |
---
|
| 12 |
|
| 13 |
<img src="https://huggingface.co/finding-fossils/metaextractor/resolve/main/ffossils-logo-text.png" width="400">
|
|
@@ -52,23 +56,13 @@ The entities detected by this model are:
|
|
| 52 |
|
| 53 |
<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
|
| 54 |
|
| 55 |
-
|
| 56 |
-
|
| 57 |
-
<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
|
| 58 |
-
|
| 59 |
-
[More Information Needed]
|
| 60 |
-
|
| 61 |
-
### Downstream Use [optional]
|
| 62 |
-
|
| 63 |
-
<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
|
| 64 |
-
|
| 65 |
-
[More Information Needed]
|
| 66 |
|
| 67 |
-
###
|
| 68 |
|
| 69 |
-
|
| 70 |
|
| 71 |
-
[
|
| 72 |
|
| 73 |
## Bias, Risks, and Limitations
|
| 74 |
|
|
@@ -86,7 +80,41 @@ Users (both direct and downstream) should be made aware of the risks, biases and
|
|
| 86 |
|
| 87 |
Use the code below to get started with the model.
|
| 88 |
|
| 89 |
-
|
| 90 |
|
| 91 |
## Training Details
|
| 92 |
|
|
@@ -94,7 +122,21 @@ Use the code below to get started with the model.
|
|
| 94 |
|
| 95 |
<!-- This should link to a Data Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
|
| 96 |
|
| 97 |
-
|
| 98 |
|
| 99 |
### Training Procedure
|
| 100 |
|
|
@@ -211,4 +253,4 @@ Carbon emissions can be estimated using the [Machine Learning Impact calculator]
|
|
| 211 |
|
| 212 |
## Model Card Contact
|
| 213 |
|
| 214 |
-
[More Information Needed]
|
|
|
|
| 1 |
---
|
| 2 |
tags:
|
| 3 |
- Beta
|
| 4 |
+
license: mit
|
| 5 |
+
thumbnail: >-
|
| 6 |
+
https://huggingface.co/finding-fossils/metaextractor/resolve/main/ffossils-logo-text.png
|
| 7 |
widget:
|
| 8 |
+
- text: The core sample was aged at 12300 - 13500 BP and found at 210m a.s.l.
|
| 9 |
+
example_title: Age/Alti
|
| 10 |
+
- text: In Northern Canada, the BGC site core was primarily made up of Pinus pollen.
|
| 11 |
+
example_title: Taxa/Site/Region
|
| 12 |
+
metrics:
|
| 13 |
+
- precision
|
| 14 |
+
- recall
|
| 15 |
---
|
| 16 |
|
| 17 |
<img src="https://huggingface.co/finding-fossils/metaextractor/resolve/main/ffossils-logo-text.png" width="400">
|
|
|
|
| 56 |
|
| 57 |
<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
|
| 58 |
|
| 59 |
+
This model can be used to extract entities from any text that is related or tangential to paleoecology. Potential uses include identifying unique SITE names in research papers from other domains.
|
| 60 |
|
| 61 |
+
### Direct Use
|
| 62 |
|
| 63 |
+
This model is deployed on the xDD (formerly GeoDeepDive) servers, where it receives new research articles relevant to Neotoma and returns the extracted data.
|
| 64 |
|
| 65 |
+
This approach could be adapted to other research domains by using the training and development code found at [github.com/NeotomaDB/MetaExtractor](https://github.com/NeotomaDB/MetaExtractor) to run similar data extraction.
|
| 66 |
|
| 67 |
## Bias, Risks, and Limitations
|
| 68 |
|
|
|
|
| 80 |
|
| 81 |
Use the code below to get started with the model.
|
| 82 |
|
| 83 |
+
```python
|
| 84 |
+
from transformers import AutoTokenizer, AutoModelForTokenClassification
|
| 85 |
+
from transformers import pipeline
|
| 86 |
+
|
| 87 |
+
tokenizer = AutoTokenizer.from_pretrained("finding-fossils/metaextractor")
|
| 88 |
+
model = AutoModelForTokenClassification.from_pretrained("finding-fossils/metaextractor")
|
| 89 |
+
ner_pipe = pipeline('ner', model=model, tokenizer=tokenizer, aggregation_strategy="simple")
|
| 90 |
+
|
| 91 |
+
ner_pipe("In Northern Canada, the BGC site core was primarily made up of Pinus pollen.")
|
| 92 |
+
|
| 93 |
+
# Output
|
| 94 |
+
[
|
| 95 |
+
{
|
| 96 |
+
"entity_group": "REGION",
|
| 97 |
+
"score": 0.8088379502296448,
|
| 98 |
+
"word": " Northern Canada,",
|
| 99 |
+
"start": 3,
|
| 100 |
+
"end": 19
|
| 101 |
+
},
|
| 102 |
+
{
|
| 103 |
+
"entity_group": "SITE",
|
| 104 |
+
"score": 0.8307041525840759,
|
| 105 |
+
"word": " BGC",
|
| 106 |
+
"start": 24,
|
| 107 |
+
"end": 27
|
| 108 |
+
},
|
| 109 |
+
{
|
| 110 |
+
"entity_group": "TAXA",
|
| 111 |
+
"score": 0.9806344509124756,
|
| 112 |
+
"word": " Pinus",
|
| 113 |
+
"start": 63,
|
| 114 |
+
"end": 68
|
| 115 |
+
}
|
| 116 |
+
]
|
| 117 |
+
```
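The pipeline output shown above keeps leading spaces and trailing punctuation in the `word` field (for example `" Northern Canada,"`). A minimal post-processing sketch follows, assuming you want clean entity strings grouped by label. The helper name `group_entities` and the `min_score` threshold are illustrative assumptions, not part of the model or the `transformers` API. The sample data is copied from the output above so the snippet runs without downloading the model.

```python
from collections import defaultdict

# Sample pipeline output copied from the example above.
sample_output = [
    {"entity_group": "REGION", "score": 0.8088379502296448,
     "word": " Northern Canada,", "start": 3, "end": 19},
    {"entity_group": "SITE", "score": 0.8307041525840759,
     "word": " BGC", "start": 24, "end": 27},
    {"entity_group": "TAXA", "score": 0.9806344509124756,
     "word": " Pinus", "start": 63, "end": 68},
]


def group_entities(entities, min_score=0.0):
    """Group entity text by label (hypothetical helper, not a library call).

    Entities scoring below `min_score` are dropped; surrounding whitespace
    and trailing punctuation are stripped from each `word`.
    """
    grouped = defaultdict(list)
    for ent in entities:
        if ent["score"] < min_score:
            continue
        text = ent["word"].strip().rstrip(".,;:")
        grouped[ent["entity_group"]].append(text)
    return dict(grouped)


print(group_entities(sample_output))
print(group_entities(sample_output, min_score=0.82))
```

In practice you would pass the list returned by `ner_pipe(...)` instead of `sample_output`; a suitable `min_score` depends on how much precision you need.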
|
| 118 |
|
| 119 |
## Training Details
|
| 120 |
|
|
|
|
| 122 |
|
| 123 |
<!-- This should link to a Data Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
|
| 124 |
|
| 125 |
+
The model was trained using a set of 39 research articles deemed relevant to the Neotoma Database. All articles were written in English. The entities were labeled by the project team, with pre-labelling by early models used to speed up the labelling process.
|
| 126 |
+
|
| 127 |
+
A 70/15/15 train/validation/test split was used, with the following breakdown of words and entities.
|
| 128 |
+
|
| 129 |
+
| | Train | Validation | Test |
|
| 130 |
+
|---|:---:|:---:|:---:|
|
| 131 |
+
| Articles | 28 | 6 | 6 |
|
| 132 |
+
| Words | 220857 | 37809 | 36098 |
|
| 133 |
+
|TAXA Entities | 3352 | 650 | 570 |
|
| 134 |
+
|SITE Entities | 1228 | 177 | 219 |
|
| 135 |
+
| REGION Entities | 2314 | 318 | 258 |
|
| 136 |
+
|GEOG Entities | 188 | 37 | 8 |
|
| 137 |
+
|AGE Entities | 919 | 206 | 153 |
|
| 138 |
+
|ALTI Entities | 99 | 24 | 14 |
|
| 139 |
+
| EMAIL Entities | 14 | 4 | 11 |
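
The article counts in the table match the nominal 70/15/15 split, but the word and entity shares need not, since articles differ in length and content. The sketch below computes the per-split percentage for each row so the deviation is easy to inspect. The counts are copied from the table; the dictionary layout and the helper name `split_shares` are assumptions made for illustration.

```python
# Counts copied from the table above, ordered (train, validation, test).
counts = {
    "Articles": (28, 6, 6),
    "Words": (220857, 37809, 36098),
    "TAXA": (3352, 650, 570),
    "SITE": (1228, 177, 219),
    "REGION": (2314, 318, 258),
    "GEOG": (188, 37, 8),
    "AGE": (919, 206, 153),
    "ALTI": (99, 24, 14),
    "EMAIL": (14, 4, 11),
}


def split_shares(row):
    """Return each split's share of the row total, in percent (hypothetical helper)."""
    total = sum(row)
    return tuple(round(100 * n / total, 1) for n in row)


for name, row in counts.items():
    train, val, test = split_shares(row)
    print(f"{name:<8} train={train:5.1f}%  val={val:5.1f}%  test={test:5.1f}%")
```

Rows with few examples, such as GEOG and EMAIL, can sit far from the nominal split, which is worth keeping in mind when reading per-entity test metrics.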
|
| 140 |
|
| 141 |
### Training Procedure
|
| 142 |
|
|
|
|
| 253 |
|
| 254 |
## Model Card Contact
|
| 255 |
|
| 256 |
+
[More Information Needed]
|