SlovakBERT based Named Entity Recognition
A deep learning model for Named Entity Recognition (NER) in Slovak. The model is based on Gerulata/SlovakBERT and fine-tuned on web-scraped Slovak news articles. The finished model supports the following IOB-tagged entity categories: PERSON, ORGANIZATION, LOCATION, DATE, TIME, MONEY and PERCENTAGE.
Related Work

Model usage
Simple Named Entity Recognition (NER)
```python
from transformers import pipeline

# Load the fine-tuned NER pipeline from the Hugging Face Hub
ner_pipeline = pipeline(task='ner', model='Raychani1/slovakbert-ner-v2')

input_sentence = 'Hoci podľa ostatných údajov NBS pre Bratislavský kraj je aktuálna priemerná cena nehnuteľností na úrovni 2 072 eur za štvorcový meter, ceny bytov v hlavnom meste sú podstatne vyššie.'

classifications = ner_pipeline(input_sentence)
```
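The pipeline returns token-level IOB predictions. A minimal sketch of how such predictions can be merged into entity spans — the sample tags and words below are illustrative, not actual model output:

```python
def group_iob(predictions):
    """Merge token-level IOB predictions into (entity_type, text) spans."""
    entities = []
    current_type, current_tokens = None, []
    for pred in predictions:
        tag, word = pred['entity'], pred['word']
        if tag.startswith('B-'):
            # A B- tag always opens a new entity; close any open one first
            if current_type:
                entities.append((current_type, ' '.join(current_tokens)))
            current_type, current_tokens = tag[2:], [word]
        elif tag.startswith('I-') and current_type == tag[2:]:
            # An I- tag continues the currently open entity of the same type
            current_tokens.append(word)
        else:
            # 'O' or a mismatched I- tag closes the open entity
            if current_type:
                entities.append((current_type, ' '.join(current_tokens)))
            current_type, current_tokens = None, []
    if current_type:
        entities.append((current_type, ' '.join(current_tokens)))
    return entities

# Illustrative token-level output (same dict shape as the pipeline's)
sample = [
    {'entity': 'B-Organization', 'word': 'NBS'},
    {'entity': 'O', 'word': 'pre'},
    {'entity': 'B-Location', 'word': 'Bratislavský'},
    {'entity': 'I-Location', 'word': 'kraj'},
]
print(group_iob(sample))  # → [('Organization', 'NBS'), ('Location', 'Bratislavský kraj')]
```

Alternatively, recent versions of `transformers` can do this grouping for you via `pipeline(task='ner', ..., aggregation_strategy='simple')`.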
Named Entity Recognition (NER) with Visualization
For a visualization example, please refer to the following Gist.
Model Prediction Output Example

Model Training
Training Hyperparameters
| Parameter | Value |
|---|---|
| per_device_train_batch_size | 4 |
| per_device_eval_batch_size | 4 |
| learning_rate | 5e-05 |
| adam_beta1 | 0.9 |
| adam_beta2 | 0.999 |
| adam_epsilon | 1e-08 |
| num_train_epochs | 15 |
| lr_scheduler_type | linear |
| seed | 42 |
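The hyperparameters above map directly onto Hugging Face `TrainingArguments`; a minimal sketch (the output directory name is an assumption, not part of the original training setup):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir='slovakbert-ner',          # hypothetical directory name
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    learning_rate=5e-05,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-08,
    num_train_epochs=15,
    lr_scheduler_type='linear',
    seed=42,
)
```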
Training results
The best results are reached in the 8th training epoch.
| Training Loss | Epoch | Step | Validation Loss | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|---|---|
| 0.6721 | 1.0 | 70 | 0.2214 | 0.6972 | 0.7308 | 0.7136 | 0.9324 |
| 0.1849 | 2.0 | 140 | 0.1697 | 0.8056 | 0.8365 | 0.8208 | 0.952 |
| 0.0968 | 3.0 | 210 | 0.1213 | 0.882 | 0.8622 | 0.872 | 0.9728 |
| 0.0468 | 4.0 | 280 | 0.1107 | 0.8372 | 0.907 | 0.8708 | 0.9684 |
| 0.0415 | 5.0 | 350 | 0.1644 | 0.8059 | 0.8782 | 0.8405 | 0.9615 |
| 0.0233 | 6.0 | 420 | 0.1255 | 0.8576 | 0.8878 | 0.8724 | 0.9716 |
| 0.0198 | 7.0 | 490 | 0.1383 | 0.8545 | 0.8846 | 0.8693 | 0.9703 |
| 0.0133 | 8.0 | 560 | 0.1241 | 0.884 | 0.9038 | 0.8938 | 0.9735 |
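The F1 column is the harmonic mean of the Precision and Recall columns; the final epoch's value can be verified directly:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Epoch 8 values from the table above
print(round(f1_score(0.884, 0.9038), 4))  # → 0.8938
```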
Model Evaluation
Evaluation Dataset Distribution
| NER Tag | Number of Tokens |
|---|---|
| O | 6568 |
| B-Person | 96 |
| I-Person | 83 |
| B-Organization | 583 |
| I-Organization | 585 |
| B-Location | 59 |
| I-Location | 15 |
| B-Date | 113 |
| I-Date | 87 |
| Time | 5 |
| B-Money | 44 |
| I-Money | 74 |
| B-Percentage | 57 |
| I-Percentage | 54 |
Evaluation Confusion Matrix

Evaluation Model Metrics
| Precision | Macro-Precision | Recall | Macro-Recall | F1 | Macro-F1 | Accuracy |
|---|---|---|---|---|---|---|
| 0.9897 | 0.9715 | 0.9897 | 0.9433 | 0.9895 | 0.9547 | 0.9897 |
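Note that the micro-averaged Precision, Recall and Accuracy coincide (0.9897): in token classification every token receives exactly one predicted tag, so all three reduce to the fraction of correctly tagged tokens, while the macro variants average per-class scores and are pulled down by rare tags. A small illustration with made-up labels:

```python
def micro_accuracy(true, pred):
    """Fraction of tokens tagged correctly (equals micro precision and
    micro recall when every token gets exactly one prediction)."""
    return sum(t == p for t, p in zip(true, pred)) / len(true)

def macro_recall(true, pred):
    """Unweighted mean of per-class recall: rare classes weigh as much
    as frequent ones."""
    classes = sorted(set(true))
    per_class = []
    for c in classes:
        support = sum(t == c for t in true)
        hits = sum(t == p == c for t, p in zip(true, pred))
        per_class.append(hits / support)
    return sum(per_class) / len(classes)

# Made-up tags: the frequent 'O' class is always right,
# the rare 'B-Time' class is wrong half the time.
true = ['O'] * 8 + ['B-Time'] * 2
pred = ['O'] * 8 + ['B-Time', 'O']

print(micro_accuracy(true, pred))  # → 0.9
print(macro_recall(true, pred))    # → 0.75
```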
Framework Versions
- Transformers 4.26.1
- PyTorch 1.13.1
- Tokenizers 0.13.2