---
license: mit
language:
- sk
tags:
- generated_from_trainer
datasets:
- NBS_sentence
metrics:
- precision
- recall
- f1
- accuracy
inference: false
model-index:
- name: slovakbert-ner-v2
  results:
  - task:
      name: Token Classification
      type: token-classification
    metrics:
    - name: Precision
      type: precision
      value: 0.9715
    - name: Recall
      type: recall
      value: 0.9433
    - name: F1
      type: f1
      value: 0.9547
    - name: Accuracy
      type: accuracy
      value: 0.9897
---

# SlovakBERT-based Named Entity Recognition

A deep learning model for Named Entity Recognition (NER) in Slovak. The model is based on [**Gerulata/SlovakBERT**](https://huggingface.co/gerulata/slovakbert) and fine-tuned on web-scraped Slovak news articles. It supports the following IOB-tagged entity categories: **PERSON**, **ORGANIZATION**, **LOCATION**, **DATE**, **TIME**, **MONEY** and **PERCENTAGE**.

### **Related Work**
[![Thesis][Thesis]][Thesis-url]

## Model usage

### Simple Named Entity Recognition (NER)
```python
from transformers import pipeline

# Load the fine-tuned token-classification model from the Hugging Face Hub
ner_pipeline = pipeline(task='ner', model='Raychani1/slovakbert-ner-v2')
input_sentence = 'Hoci podľa ostatných údajov NBS pre Bratislavský kraj je aktuálna priemerná cena nehnuteľností na úrovni 2 072 eur za štvorcový meter, ceny bytov v hlavnom meste sú podstatne vyššie.'
# Returns a list of per-token predictions (dictionaries with the predicted tag and its score)
classifications = ner_pipeline(input_sentence)
```
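
Without an aggregation strategy, the `ner` pipeline returns one dictionary per token, with keys such as `entity`, `start`, `end` and `score`. The sketch below shows one possible way to merge such IOB-tagged tokens into whole entities using the character offsets. The helper `group_entities`, the short sample sentence and the hand-written sample predictions are illustrative assumptions, not output of this model. The label strings (for example `B-Organization`) are also assumed; the real ones can be read from `ner_pipeline.model.config.id2label`.

```python
from typing import Dict, List


def group_entities(text: str, predictions: List[Dict]) -> List[Dict]:
    """Merge IOB-tagged token predictions into entity spans (hypothetical helper)."""
    entities: List[Dict] = []
    current = None
    for pred in predictions:
        tag = pred["entity"]
        if tag == "O":
            current = None
            continue
        prefix, sep, label = tag.partition("-")
        if not sep:
            # Tag without a B-/I- prefix: treat it as the start of an entity.
            prefix, label = "B", tag
        continues = (
            current is not None
            and prefix == "I"
            and current["label"] == label
            # Allow at most one character (e.g. a space) between tokens.
            and pred["start"] <= current["end"] + 1
        )
        if continues:
            current["end"] = pred["end"]
        else:
            current = {"label": label, "start": pred["start"], "end": pred["end"]}
            entities.append(current)
    for entity in entities:
        entity["text"] = text[entity["start"]:entity["end"]]
    return entities


# Hand-written input for illustration only, NOT real model output.
sentence = "NBS je v Bratislave."
sample_predictions = [
    {"entity": "B-Organization", "start": 0, "end": 3},
    {"entity": "B-Location", "start": 9, "end": 16},
    {"entity": "I-Location", "start": 16, "end": 19},
]
entities = group_entities(sentence, sample_predictions)
for entity in entities:
    print(entity["label"], entity["text"])
```

With real predictions, `group_entities(input_sentence, classifications)` would be the corresponding call, assuming the pipeline output carries `start` and `end` offsets.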

### Named Entity Recognition (NER) with Visualization
For a visualization example, see this [Gist](https://gist.github.com/Raychani1/7d4455491f0aa681ed8ea99d8b1d8279).

### Model Prediction Output Example

![prediction_output](https://github.com/Raychani1/Text_Parsing_Methods_Using_NLP/assets/45550552/723ab7f1-4efb-4d03-87d6-b9ac1e40990f)

## Model Training

### Training Hyperparameters

| **Parameter**               | **Value** |
|:---------------------------:|:---------:|
| per_device_train_batch_size | 4         |
| per_device_eval_batch_size  | 4         |
| learning_rate               | 5e-05     |
| adam_beta1                  | 0.9       |
| adam_beta2                  | 0.999     |
| adam_epsilon                | 1e-08     |
| num_train_epochs            | 15        |
| lr_scheduler_type           | linear    |
| seed                        | 42        |

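
To illustrate what `lr_scheduler_type = linear` means for these values, the sketch below computes the learning rate at a given optimizer step under a linear decay to zero. It assumes the schedule is planned over all 15 epochs, that there is no warmup (none is listed in this card), and that each epoch has 70 optimizer steps, as reported in the training results table. The function `linear_lr` is a hypothetical helper written for this example, not code from the training run.

```python
LEARNING_RATE = 5e-05   # learning_rate from the table
NUM_TRAIN_EPOCHS = 15   # num_train_epochs from the table
STEPS_PER_EPOCH = 70    # steps per epoch, taken from the training results table
WARMUP_STEPS = 0        # assumption: no warmup is listed in this card

total_steps = NUM_TRAIN_EPOCHS * STEPS_PER_EPOCH


def linear_lr(step: int,
              base_lr: float = LEARNING_RATE,
              total: int = total_steps,
              warmup: int = WARMUP_STEPS) -> float:
    """Learning rate after `step` optimizer steps under linear warmup and decay."""
    if step < warmup:
        # Linear ramp from 0 up to base_lr during warmup.
        return base_lr * step / max(1, warmup)
    # Linear decay from base_lr down to 0 at the final step.
    remaining = max(0, total - step)
    return base_lr * remaining / max(1, total - warmup)


for epoch in (0, 8, 15):
    step = epoch * STEPS_PER_EPOCH
    print(f"epoch {epoch:2d} (step {step:4d}): lr = {linear_lr(step):.3e}")
```

Under these assumptions, the learning rate has dropped to a little under half of its initial value by the end of the 8th epoch.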
### Training results
The best results were reached in the 8th training epoch.

| Training Loss | Epoch | Step | Validation Loss | Precision | Recall | F1     | Accuracy |
|:-------------:|:-----:|:----:|:---------------:|:---------:|:------:|:------:|:--------:|
| 0.6721        | 1.0   | 70   | 0.2214          | 0.6972    | 0.7308 | 0.7136 | 0.9324   |
| 0.1849        | 2.0   | 140  | 0.1697          | 0.8056    | 0.8365 | 0.8208 | 0.952    |
| 0.0968        | 3.0   | 210  | 0.1213          | 0.882     | 0.8622 | 0.872  | 0.9728   |
| 0.0468        | 4.0   | 280  | 0.1107          | 0.8372    | 0.907  | 0.8708 | 0.9684   |
| 0.0415        | 5.0   | 350  | 0.1644          | 0.8059    | 0.8782 | 0.8405 | 0.9615   |
| 0.0233        | 6.0   | 420  | 0.1255          | 0.8576    | 0.8878 | 0.8724 | 0.9716   |
| 0.0198        | 7.0   | 490  | 0.1383          | 0.8545    | 0.8846 | 0.8693 | 0.9703   |
| 0.0133        | 8.0   | 560  | 0.1241          | 0.884     | 0.9038 | 0.8938 | 0.9735   |

## Model Evaluation

### Evaluation Dataset Distribution

| **NER Tag**        | **Number of Tokens** |
|:------------------:|:--------------------:|
| **O**              | 6568                 |
| **B-Person**       | 96                   |
| **I-Person**       | 83                   |
| **B-Organization** | 583                  |
| **I-Organization** | 585                  |
| **B-Location**     | 59                   |
| **I-Location**     | 15                   |
| **B-Date**         | 113                  |
| **I-Date**         | 87                   |
| **Time**           | 5                    |
| **B-Money**        | 44                   |
| **I-Money**        | 74                   |
| **B-Percentage**   | 57                   |
| **I-Percentage**   | 54                   |

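
The counts above are heavily skewed toward the outside tag (the first row of the table). The short sketch below, using only the numbers from the table, computes the total token count and the share of outside tokens. The dictionary and variable names are chosen for this example.

```python
# Token counts copied from the evaluation dataset distribution table.
token_counts = {
    "O": 6568,
    "B-Person": 96,
    "I-Person": 83,
    "B-Organization": 583,
    "I-Organization": 585,
    "B-Location": 59,
    "I-Location": 15,
    "B-Date": 113,
    "I-Date": 87,
    "Time": 5,
    "B-Money": 44,
    "I-Money": 74,
    "B-Percentage": 57,
    "I-Percentage": 54,
}

total_tokens = sum(token_counts.values())
entity_tokens = total_tokens - token_counts["O"]
outside_share = token_counts["O"] / total_tokens

print(f"total tokens:  {total_tokens}")
print(f"entity tokens: {entity_tokens}")
print(f"share of outside tokens: {outside_share:.1%}")
```

Since roughly 78% of the evaluation tokens are outside any entity, a classifier that always predicts the outside tag would already reach about that token accuracy. This is why the macro-averaged metrics are worth reading alongside accuracy.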
### Evaluation Confusion Matrix

![image](https://github.com/Raychani1/Text_Parsing_Methods_Using_NLP/assets/45550552/e6d1a1c6-e02f-4de9-9684-5882a405d31f)

### Evaluation Model Metrics

| **Precision** | **Macro-Precision** | **Recall** | **Macro-Recall** | **F1** | **Macro-F1** | **Accuracy** |
|:-------------:|:-------------------:|:----------:|:----------------:|:------:|:------------:|:------------:|
| 0.9897        | 0.9715              | 0.9897     | 0.9433           | 0.9895 | 0.9547       | 0.9897       |

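
The table reports each metric both plain and macro-averaged. As a reminder of what macro averaging does, the sketch below computes per-tag precision, recall and F1 from gold and predicted tag sequences and then takes the unweighted mean over tags, so rare tags count as much as frequent ones. The tag sequences are made up for illustration and the helper names are hypothetical. How the non-macro columns in the table were averaged is not stated in this card, so the sketch does not try to reproduce them.

```python
from collections import Counter
from typing import Dict, List, Tuple


def per_class_scores(gold: List[str], predicted: List[str]) -> Dict[str, Tuple[float, float, float]]:
    """Per-tag (precision, recall, F1) for token-level predictions."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted tag p where it does not belong
            fn[g] += 1  # missed the gold tag g
    scores = {}
    for label in sorted(set(gold) | set(predicted)):
        predicted_count = tp[label] + fp[label]
        gold_count = tp[label] + fn[label]
        precision = tp[label] / predicted_count if predicted_count else 0.0
        recall = tp[label] / gold_count if gold_count else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores[label] = (precision, recall, f1)
    return scores


def macro_average(scores: Dict[str, Tuple[float, float, float]]) -> Tuple[float, ...]:
    """Unweighted mean of per-tag precision, recall and F1."""
    n = len(scores)
    return tuple(sum(s[i] for s in scores.values()) / n for i in range(3))


# Toy sequences for illustration only.
gold = ["O", "O", "O", "O", "B-Person", "I-Person", "B-Location", "O"]
pred = ["O", "O", "O", "O", "B-Person", "O", "B-Location", "O"]

accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
macro_p, macro_r, macro_f1 = macro_average(per_class_scores(gold, pred))
print(f"accuracy={accuracy:.4f} macro-P={macro_p:.4f} macro-R={macro_r:.4f} macro-F1={macro_f1:.4f}")
```

In this toy case a single missed `I-Person` token leaves accuracy at 7 of 8 tokens but pulls every macro score down noticeably, because that tag contributes a zero to each average.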
## Framework Versions

- Transformers 4.26.1
- PyTorch 1.13.1
- Tokenizers 0.13.2

<!-- Variables -->

[Thesis]: https://img.shields.io/badge/%F0%9F%93%9C-Masters%20Thesis-blue?style=for-the-badge
[Thesis-url]: https://opac.crzp.sk/?fn=detailBiblioForm&sid=C0DEB8E07572332BA2230915805F