sarnoult committed
Commit 7497942 · verified · 1 Parent(s): 540d018

Upload model and README

Files changed (2)
  1. README.md +106 -3
  2. pytorch.bin +3 -0
README.md CHANGED
@@ -1,3 +1,106 @@
- ---
- license: mit
- ---
+ ---
+ license: mit
+ language:
+ - nl
+ base_model:
+ - FacebookAI/xlm-roberta-base
+ tags:
+ - digital-humanities
+ - token-classification
+ ---
+
+ # Model Card for NER-base
+
+ [Globalise](https://globalise.huygens.knaw.nl/) NER token-classification model, development version.
+
+ ## Model Details
+
+ ### Model Description
+
+ This is the first version of a NER model developed for the Globalise project.
+
+ - **Developed by:** Sophie Arnoult
+ - **Shared by:** Globalise Team
+ - **Funded by:** NWO
+ - **Model type:** token classification
+
+ ## Uses
+
+ Named-entity tagging of historical (17th- and 18th-century), VOC-related Dutch documents.
+
+
+ ## Bias, Risks, and Limitations
+
+ <!-- This section is meant to convey both technical and sociotechnical limitations. -->
+ The texts the model was fine-tuned on are heavily biased and represent colonial standpoints. While care has been taken in designing the label set and annotating the data, biases may remain when the model is applied to similar data; the model has not been tested on other data.
+
+ This is a development version. The training and development data consist of [VOC missives](https://research.vu.nl/en/datasets/voc-gm-ner-corpus) data enriched with new annotations. Most entity types used in Globalise are not present in the VOC missives data, and the new annotations are limited in number. Performance on these entity types may therefore not be representative.
+
+
+ ## Training Details
+
+ ### Training Data
+
+ <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
+
+ The training and development data consist of:
+ - the GM NER corpus ([datasplit-all-standard](https://data.yoda.vu.nl:9443/vault-fgw-llc-vocmissives/voc_gm_ner%5B1670857835%5D/original/datasplit_all_standard/), train/dev data), with labels mapped to their Globalise equivalents
+ - Globalise annotated data (first set of annotations, to be extended and published at a later date)
+
+ The data are pretokenized with [spaCy](https://spacy.io/models/nl#nl_core_news_lg). Sequences are split at 240 word tokens.
+
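The splitting step described above can be sketched as follows. This is only an illustration under a stated assumption: that "split at 240 word tokens" means cutting each pretokenized document into consecutive chunks of at most 240 word tokens. The function name `split_tokens` and the constant `MAX_WORDS` are hypothetical, and the project's actual preprocessing may differ (for example, it might prefer sentence boundaries).

```python
from typing import List

MAX_WORDS = 240  # chunk size stated in the model card


def split_tokens(tokens: List[str], max_words: int = MAX_WORDS) -> List[List[str]]:
    """Cut a pretokenized document into consecutive chunks of at most max_words tokens."""
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    return [tokens[i:i + max_words] for i in range(0, len(tokens), max_words)]


# Hypothetical document of 500 word tokens (placeholder strings, not real data).
doc = [f"w{i}" for i in range(500)]
chunks = split_tokens(doc)
print([len(chunk) for chunk in chunks])  # [240, 240, 20]
```

Chunking on word tokens rather than subword tokens keeps the chunks aligned with the word-level annotations; whether 240 words always fit within the 512-token limit after subword tokenization depends on the text and is not guaranteed by this sketch.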
+ ### Training Procedure
+
+ <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
+
+ #### Training Hyperparameters
+
+ - **Training regime:** fp32
+ - **Optimizer:** Adam, learning rate 3e-5
+ - **max-sequence-length:** 512
+ - **batch size:** 32
+ - **max-epochs:** 20
+
+
+ ## Evaluation
+
+ The model was selected based on the weighted multiclass F1 score on the validation set, using a single seed.
+ <!-- This section describes the evaluation protocols and provides the results. -->
+
+
+ ### Results
+ | label | precision | recall | f1-score | support |
+ | --- | --- | --- | --- | --- |
+ | CMTY_NAME | 0.72 | 0.80 | 0.76 | 109 |
+ | CMTY_QUAL | 1.00 | 0.67 | 0.80 | 9 |
+ | CMTY_QUANT | 0.76 | 0.85 | 0.80 | 66 |
+ | DATE | 0.48 | 0.53 | 0.51 | 43 |
+ | DOC | 0.61 | 0.55 | 0.58 | 20 |
+ | ETH_REL | 0.78 | 0.81 | 0.79 | 31 |
+ | LOC_ADJ | 0.91 | 0.96 | 0.94 | 464 |
+ | LOC_NAME | 0.91 | 0.94 | 0.92 | 1324 |
+ | ORG | 0.92 | 0.87 | 0.89 | 265 |
+ | PER_ATTR | 0.69 | 0.82 | 0.75 | 44 |
+ | PER_NAME | 0.80 | 0.87 | 0.83 | 613 |
+ | PRF | 0.70 | 0.76 | 0.73 | 97 |
+ | SHIP | 0.89 | 0.86 | 0.87 | 519 |
+ | SHIP_TYPE | 0.79 | 0.82 | 0.81 | 33 |
+ | STATUS | 0.96 | 0.96 | 0.96 | 27 |
+ | micro avg | 0.86 | 0.89 | 0.88 | 3664 |
+ | macro avg | 0.79 | 0.80 | 0.80 | 3664 |
+ | weighted avg | 0.86 | 0.89 | 0.88 | 3664 |
+
+
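To make the summary rows of the results table easier to read, the sketch below recomputes the macro and weighted F1 averages from the per-label rows. It assumes the usual definitions (macro: unweighted mean over labels; weighted: mean weighted by support; F1: harmonic mean of precision and recall). The names `ROWS` and `f1` are illustrative only, and because the inputs are the rounded values from the table, per-label F1 can differ from the reported figure in the last digit.

```python
# (label, precision, recall, f1, support), copied from the results table.
ROWS = [
    ("CMTY_NAME", 0.72, 0.80, 0.76, 109),
    ("CMTY_QUAL", 1.00, 0.67, 0.80, 9),
    ("CMTY_QUANT", 0.76, 0.85, 0.80, 66),
    ("DATE", 0.48, 0.53, 0.51, 43),
    ("DOC", 0.61, 0.55, 0.58, 20),
    ("ETH_REL", 0.78, 0.81, 0.79, 31),
    ("LOC_ADJ", 0.91, 0.96, 0.94, 464),
    ("LOC_NAME", 0.91, 0.94, 0.92, 1324),
    ("ORG", 0.92, 0.87, 0.89, 265),
    ("PER_ATTR", 0.69, 0.82, 0.75, 44),
    ("PER_NAME", 0.80, 0.87, 0.83, 613),
    ("PRF", 0.70, 0.76, 0.73, 97),
    ("SHIP", 0.89, 0.86, 0.87, 519),
    ("SHIP_TYPE", 0.79, 0.82, 0.81, 33),
    ("STATUS", 0.96, 0.96, 0.96, 27),
]


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


total_support = sum(row[4] for row in ROWS)
macro_f1 = sum(row[3] for row in ROWS) / len(ROWS)
weighted_f1 = sum(row[3] * row[4] for row in ROWS) / total_support

# Largest gap between reported F1 and F1 recomputed from rounded precision/recall.
max_gap = max(abs(f1(p, r) - reported) for _, p, r, reported, _ in ROWS)

print(total_support)          # 3664
print(round(macro_f1, 2))     # 0.8
print(round(weighted_f1, 2))  # 0.88
```

The gap between the macro and weighted averages reflects the label distribution: the frequent, high-scoring labels (LOC_NAME, PER_NAME, SHIP, LOC_ADJ) dominate the weighted average, while rarer labels such as DATE and DOC pull the macro average down.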
+ ## Technical Specifications
+
+
+ ### Compute Infrastructure
+
+ SURF Snellius
+
+ #### Hardware
+
+ A100
+
+
+
pytorch.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:efb8c129602acfe0fe243d75de20791f073122507edbd585729ce6902a4b69f2
+ size 1109989028
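The three added lines are a Git LFS pointer: the repository stores this small text file, while the weights themselves (about 1.1 GB according to the `size` field) live in LFS storage. As a rough sketch, a downloaded copy of the weights could be checked against the pointer as shown below. The helper names `parse_lfs_pointer` and `matches_pointer` are hypothetical and not part of any Git LFS tooling.

```python
import hashlib
from pathlib import Path

# Pointer contents as added in this commit.
POINTER_TEXT = """\
version https://git-lfs.github.com/spec/v1
oid sha256:efb8c129602acfe0fe243d75de20791f073122507edbd585729ce6902a4b69f2
size 1109989028
"""


def parse_lfs_pointer(text: str) -> dict:
    """Parse the 'key value' lines of a Git LFS pointer file."""
    fields = dict(line.split(" ", 1) for line in text.splitlines() if line.strip())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(path, pointer: dict, chunk_size: int = 1 << 20) -> bool:
    """Return True if the file at path has the size and hash recorded in the pointer."""
    path = Path(path)
    if path.stat().st_size != pointer["size"]:
        return False
    hasher = hashlib.new(pointer["algo"])
    with path.open("rb") as handle:
        # Read in chunks so a large weights file is never held in memory at once.
        for block in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(block)
    return hasher.hexdigest() == pointer["digest"]


pointer = parse_lfs_pointer(POINTER_TEXT)
print(pointer["algo"], pointer["size"])  # sha256 1109989028
```

With a local copy of the file, `matches_pointer("pytorch.bin", pointer)` would report whether the download is complete and uncorrupted; the local path is an assumption about where the file was saved.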