Commit 373cfeb (parent: 5afd4b8): Update README.md

README.md changed:
## Training data

Some of the entities (for instance WORK_OF_ART, LAW, MONEY) that have been annotated in the [Turku OntoNotes Entities Corpus](https://github.com/TurkuNLP/turku-one) dataset were filtered out from the dataset used for training the model.
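The filtering step described above can be sketched as follows. This is a minimal illustration, not the actual preprocessing code: the BIO tag layout and the helper `filter_entities` are assumptions, while the filtered class names come from this model card.

```python
# Hypothetical sketch: drop selected entity classes from BIO-tagged data
# by resetting their B-/I- tags to the outside tag "O".
FILTERED_CLASSES = {"WORK_OF_ART", "LAW", "MONEY"}

def filter_entities(tags):
    """Replace B-/I- tags of filtered classes with 'O'; keep the rest."""
    out = []
    for tag in tags:
        if tag != "O" and tag.split("-", 1)[1] in FILTERED_CLASSES:
            out.append("O")
        else:
            out.append(tag)
    return out

tags = ["B-PERSON", "I-PERSON", "O", "B-MONEY", "I-MONEY", "B-GPE"]
print(filter_entities(tags))
# ['B-PERSON', 'I-PERSON', 'O', 'O', 'O', 'B-GPE']
```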
In addition to this dataset, OCR'd and annotated content of digitized documents from Finnish public administration was also used for model training. The number of entities belonging to the different entity classes in the training, validation and test datasets is listed below:

### Number of entity types in the data

Train|11691|30026|868|12999|7473|1184|14918|01360|1879|2068
Val|1542|4042|108|1654|879|160|1858|177|257|299
Test|1267|3698|86|1713|901|137|1843|174|233|260
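Per-class counts like those in the table above can be produced by tallying the `B-` tags of a BIO-annotated dataset, since each entity span begins with exactly one `B-` tag. A minimal sketch; the example data and the helper `count_entities` are illustrative assumptions, not the card's actual evaluation code.

```python
from collections import Counter

def count_entities(tag_sequences):
    """Count entities per class: each entity starts with one 'B-' tag in BIO."""
    counts = Counter()
    for tags in tag_sequences:
        for tag in tags:
            if tag.startswith("B-"):
                counts[tag[2:]] += 1
    return counts

train = [["B-PERSON", "I-PERSON", "O", "B-GPE"], ["B-PERSON", "O"]]
print(count_entities(train))
# Counter({'PERSON': 2, 'GPE': 1})
```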

The annotation of the data was performed in cooperation between the National Archives of Finland and the [FIN-CLARIAH](https://www.kielipankki.fi/organization/fin-clariah/) research infrastructure for Social Sciences and Humanities.
## Training procedure

This model was trained using an NVIDIA RTX A6000 GPU with the following hyperparameters:
- maximum length of data sequence: 512
- patience: 2 epochs
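The `patience: 2 epochs` setting means training stops once the validation metric has failed to improve for two consecutive epochs. A generic sketch of that stopping rule, under the assumption of a per-epoch validation loss; the function and loss values below are illustrative, and the actual training code is to be published separately.

```python
def stop_epoch(val_losses, patience=2):
    """Return the epoch index at which early stopping triggers, given
    per-epoch validation losses, or the last epoch if it never triggers."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch  # patience exhausted: stop here
    return len(val_losses) - 1

print(stop_epoch([0.9, 0.7, 0.71, 0.72, 0.5]))
# 3  (epochs 2 and 3 show no improvement over the best loss from epoch 1)
```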
In the preprocessing stage, the input texts were split into chunks with a maximum length of 300 tokens, in order to avoid the tokenized chunks exceeding the maximum length of 512. Tokenization was performed using the tokenizer of the [bert-base-finnish-cased-v1](https://huggingface.co/TurkuNLP/bert-base-finnish-cased-v1) model.
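The chunking described above can be sketched as follows: splitting on whitespace at 300 tokens leaves headroom for WordPiece tokenization, which breaks words into several subword tokens. The helper `chunk_text` is an illustrative assumption, not the model's published preprocessing code.

```python
def chunk_text(text, max_tokens=300):
    """Split text into chunks of at most `max_tokens` whitespace tokens,
    so subword tokenization stays safely under the 512-token model limit."""
    words = text.split()
    return [" ".join(words[i:i + max_tokens])
            for i in range(0, len(words), max_tokens)]

chunks = chunk_text("sana " * 650)  # 650 words -> chunks of 300, 300, 50
print([len(c.split()) for c in chunks])
# [300, 300, 50]
```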

The training code with instructions will be available soon [here](https://github.com/DALAI-hanke/BERT_NER).