@@ -1,13 +1,17 @@
 ---
 tags:
 - Beta
-license: "mit"
-thumbnail: "https://huggingface.co/finding-fossils/metaextractor/resolve/main/ffossils-logo-text.png"
 widget:
-- text: "The core sample was aged at 12300 - 13500 BP and found at 210m a.s.l."
-  example_title: "Age/Alti"
-- text: "In Northern Canada, the BGC site core was primarily made up of Pinus pollen."
-  example_title: "Taxa/Site/Region"
 ---
 <img src="https://huggingface.co/finding-fossils/metaextractor/resolve/main/ffossils-logo-text.png" width="400">
@@ -52,23 +56,13 @@ The entities detected by this model are:
 <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
-### Direct Use
-<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
-[More Information Needed]
-### Downstream Use [optional]
-<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
-[More Information Needed]
-### Out-of-Scope Use
-<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
-[More Information Needed]
 ## Bias, Risks, and Limitations
@@ -86,7 +80,41 @@ Users (both direct and downstream) should be made aware of the risks, biases and
 Use the code below to get started with the model.
-[More Information Needed]
 ## Training Details
@@ -94,7 +122,21 @@ Use the code below to get started with the model.
 <!-- This should link to a Data Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
-[More Information Needed]
 ### Training Procedure
@@ -211,4 +253,4 @@ Carbon emissions can be estimated using the [Machine Learning Impact calculator]
 ## Model Card Contact
-[More Information Needed]

 ---
 tags:
 - Beta
+license: mit
+thumbnail: >-
+  https://huggingface.co/finding-fossils/metaextractor/resolve/main/ffossils-logo-text.png
 widget:
+- text: The core sample was aged at 12300 - 13500 BP and found at 210m a.s.l.
+  example_title: Age/Alti
+- text: In Northern Canada, the BGC site core was primarily made up of Pinus pollen.
+  example_title: Taxa/Site/Region
+metrics:
+- precision
+- recall
 ---
 <img src="https://huggingface.co/finding-fossils/metaextractor/resolve/main/ffossils-logo-text.png" width="400">
 <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
+This model can be used to extract entities from any text that is related or tangential to paleoecology. Potential uses include identifying unique SITE names in research papers in other domains.
+### Direct Use
+This model is deployed on the xDD (formerly GeoDeepDive) servers, where it receives new research articles relevant to Neotoma and returns the extracted data.
+This approach could be adapted to other research domains by using the training and development code at [github.com/NeotomaDB/MetaExtractor](https://github.com/NeotomaDB/MetaExtractor) to build a similar extraction pipeline.
 ## Bias, Risks, and Limitations
 Use the code below to get started with the model.
+```python
+from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline
+# Load the tokenizer and token-classification model from the Hugging Face Hub
+tokenizer = AutoTokenizer.from_pretrained("finding-fossils/metaextractor")
+model = AutoModelForTokenClassification.from_pretrained("finding-fossils/metaextractor")
+# aggregation_strategy="simple" merges sub-word tokens into whole entity spans
+ner_pipe = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
+ner_pipe("In Northern Canada, the BGC site core was primarily made up of Pinus pollen.")
+# Output (the list of entity dicts returned by the call above)
+[
+  {
+    "entity_group": "REGION",
+    "score": 0.8088379502296448,
+    "word": " Northern Canada,",
+    "start": 3,
+    "end": 19
+  },
+  {
+    "entity_group": "SITE",
+    "score": 0.8307041525840759,
+    "word": " BGC",
+    "start": 24,
+    "end": 27
+  },
+  {
+    "entity_group": "TAXA",
+    "score": 0.9806344509124756,
+    "word": " Pinus",
+    "start": 63,
+    "end": 68
+  }
+]
+```
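The pipeline output shown above is a flat list of dictionaries, so downstream code usually needs to filter and group it. The following is a minimal sketch of one way to do that, not part of the model card itself: `group_entities` and its `min_score` parameter are hypothetical names introduced for illustration, and the sample data is copied from the example output so the snippet runs without downloading the model.

```python
from collections import defaultdict

# Sample output copied from the example above
# (pipeline created with aggregation_strategy="simple").
sample_entities = [
    {"entity_group": "REGION", "score": 0.8088379502296448,
     "word": " Northern Canada,", "start": 3, "end": 19},
    {"entity_group": "SITE", "score": 0.8307041525840759,
     "word": " BGC", "start": 24, "end": 27},
    {"entity_group": "TAXA", "score": 0.9806344509124756,
     "word": " Pinus", "start": 63, "end": 68},
]


def group_entities(entities, min_score=0.0):
    """Group entity text by label (hypothetical helper, not a library API).

    Predictions with a score below min_score are dropped.
    """
    grouped = defaultdict(list)
    for ent in entities:
        if ent["score"] < min_score:
            continue
        # The "word" field can carry a leading space and trailing
        # punctuation, as in " Northern Canada," above; strip both.
        text = ent["word"].strip().rstrip(",.;:")
        grouped[ent["entity_group"]].append(text)
    return dict(grouped)


# Keep everything, then keep only predictions scored 0.82 or higher.
print(group_entities(sample_entities))
print(group_entities(sample_entities, min_score=0.82))
```

The 0.82 threshold is arbitrary and chosen only to show the filter dropping the REGION prediction; a suitable cutoff would depend on the precision and recall needed for a given use.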
 ## Training Details
 <!-- This should link to a Data Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
+The model was trained on a set of 39 research articles deemed relevant to the Neotoma Database. All articles were written in English. The entities were labeled by the project team, with pre-labelling by early models used to speed up the labelling process.
+A 70/15/15 train/val/test split was used, with the following breakdown of words and entities.
+|   | Train | Validation | Test |
+|---|:---:|:---:|:---:|
+| Articles | 28 | 6 | 6 |
+| Words | 220857 | 37809 | 36098 |
+| TAXA Entities | 3352 | 650 | 570 |
+| SITE Entities | 1228 | 177 | 219 |
+| REGION Entities | 2314 | 318 | 258 |
+| GEOG Entities | 188 | 37 | 8 |
+| AGE Entities | 919 | 206 | 153 |
+| ALTI Entities | 99 | 24 | 14 |
+| Email Entities | 14 | 4 | 11 |
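To see how the stated 70/15/15 split plays out for each row of the table, the shares can be computed directly from the counts. This is a small illustrative sketch: the numbers are copied from the table above, and `split_shares` is a hypothetical helper name, not part of the MetaExtractor code.

```python
# Counts copied from the table above, as (train, validation, test).
split_counts = {
    "Articles": (28, 6, 6),
    "Words": (220857, 37809, 36098),
    "TAXA": (3352, 650, 570),
    "SITE": (1228, 177, 219),
    "REGION": (2314, 318, 258),
    "GEOG": (188, 37, 8),
    "AGE": (919, 206, 153),
    "ALTI": (99, 24, 14),
    "Email": (14, 4, 11),
}


def split_shares(counts):
    """Return each split's share of the row total, in percent (hypothetical helper)."""
    total = sum(counts)
    return tuple(round(100 * n / total, 1) for n in counts)


for name, counts in split_counts.items():
    train, val, test = split_shares(counts)
    print(f"{name:<8} train={train:5.1f}%  val={val:5.1f}%  test={test:5.1f}%")
```

Only the article counts are expected to match 70/15/15 exactly, which suggests the split was made per article; word and entity shares vary with the content of each article, so evaluation on sparse labels in the test set rests on few examples.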
 ### Training Procedure
 ## Model Card Contact
+[More Information Needed]