This Electra model was trained on more than 8 billion tokens of Bosnian, Croatian, Montenegrin and Serbian text.

***new*** We have published a version of this model fine-tuned on the named entity recognition task ([bcms-bertic-ner](https://huggingface.co/CLASSLA/bcms-bertic-ner)).

If you use the model, please cite the following paper:

```
@inproceedings{ljubesic-lauc-2021-bertic,
    title = "{BERTić} - The Transformer Language Model for {B}osnian, {C}roatian, {M}ontenegrin and {S}erbian",
    author = "Ljube{\v{s}}i{\'c}, Nikola  and
      Lauc, Davor",
    booktitle = "Proceedings of the 8th Workshop on Balto-Slavic Natural Language Processing",
    year = "2021",
    address = "Kiev, Ukraine",
    publisher = "Association for Computational Linguistics"
}
```

## Benchmarking

Comparing this model to [multilingual BERT](https://huggingface.co/bert-base-multilingual-cased) and [CroSloEngual BERT](https://huggingface.co/EMBEDDIA/crosloengual-bert) on the tasks of (1) part-of-speech tagging, (2) named entity recognition, (3) geolocation prediction, and (4) commonsense causal reasoning shows the BERTić model to be superior to the other two.
### Part-of-speech tagging

The evaluation metric is (seqeval) microF1. Reported results are the means of five runs. Best results are presented in bold. Statistical significance between the two best-performing systems is calculated via a two-tailed t-test (* p<=0.05, ** p<=0.01, *** p<=0.001, **** p<=0.0001).

reldi-hr | Croatian | internet non-standard | - | 88.87 | 91.63 | **92.28***
SETimes.SR | Serbian | standard | 95.00 | 95.50 | **96.41** | 96.31
reldi-sr | Serbian | internet non-standard | - | 91.26 | 93.54 | **93.90*****

### Named entity recognition

The evaluation metric is (seqeval) microF1. Reported results are the means of five runs. Best results are presented in bold. Statistical significance between the two best-performing systems is calculated via a two-tailed t-test (* p<=0.05, ** p<=0.01, *** p<=0.001, **** p<=0.0001).

SETimes.SR | Serbian | standard | 84.64 | **92.41** | 92.28 | 92.02
reldi-sr | Serbian | internet non-standard | - | 81.29 | 82.76 | **87.92******
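
For reference, a (seqeval-style) microF1 counts an entity as correct only when both its span and its type match the gold annotation exactly. A minimal pure-Python illustration of that idea — not the seqeval implementation itself, and with function names of our own choosing:

```python
def bio_spans(tags):
    """Extract (type, start, end) entity spans from a BIO tag sequence.
    Stray I- tags without a preceding B- are ignored in this sketch."""
    spans, start, etype = [], None, None
    for i, tag in enumerate(tags + ["O"]):  # trailing sentinel flushes the last entity
        if tag.startswith("B-") or tag == "O" or (tag.startswith("I-") and tag[2:] != etype):
            if start is not None:
                spans.append((etype, start, i))
            start, etype = ((i, tag[2:]) if tag.startswith("B-") else (None, None))
    return set(spans)

def micro_f1(gold_seqs, pred_seqs):
    """Micro-averaged F1 over exact entity-span matches, pooled across sentences."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_seqs, pred_seqs):
        g, p = bio_spans(gold), bio_spans(pred)
        tp += len(g & p)
        fp += len(p - g)
        fn += len(g - p)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

gold = [["B-PER", "I-PER", "O", "B-LOC"]]
pred = [["B-PER", "I-PER", "O", "O"]]
score = micro_f1(gold, pred)  # → 0.666... (precision 1.0, recall 0.5)
```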
### Geolocation prediction
The dataset comes from the VarDial 2020 evaluation campaign's shared task on [Social Media variety Geolocation prediction](https://sites.google.com/view/vardial2020/evaluation-campaign). The task is to predict the latitude and longitude of a tweet given its text.

Reported are the median and the mean distance between the true and the predicted geolocation, in kilometers (lower is better).

mBERT | 42.25 | 82.05
cseBERT | 40.76 | 81.88
BERTić | **37.96** | **79.30**
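
Distances like those above can be computed from predicted and true latitude/longitude pairs with the haversine formula. A minimal sketch of that computation (function names and the toy coordinates are ours, not from the evaluation campaign):

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def median(xs):
    """Median of a non-empty list of numbers."""
    s = sorted(xs)
    n = len(s)
    return s[n // 2] if n % 2 else (s[n // 2 - 1] + s[n // 2]) / 2

# Toy example: per-tweet error distances, then the median/mean summary
pred = [(45.81, 15.98), (44.79, 20.45)]  # predicted (lat, lon)
gold = [(45.81, 15.98), (45.25, 19.84)]  # true (lat, lon)
dists = [haversine_km(p[0], p[1], g[0], g[1]) for p, g in zip(pred, gold)]
med_km = median(dists)
mean_km = sum(dists) / len(dists)
```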
### Choice Of Plausible Alternatives

The dataset is a translation of the [COPA dataset](https://people.ict.usc.edu/~gordon/copa.html) into Croatian (to be released).
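
Each COPA item pairs a premise with two alternatives and asks which is the more plausible cause (or effect). Any plausibility scorer can be plugged into a simple argmax decision; the sketch below uses a crude word-overlap stand-in purely for illustration — a fine-tuned sentence-pair model such as BERTić would replace it:

```python
def overlap_score(premise: str, alternative: str) -> float:
    """Stand-in plausibility scorer: fraction of the alternative's words
    that also occur in the premise. A trained model would replace this."""
    p = set(premise.lower().split())
    a = set(alternative.lower().split())
    return len(p & a) / max(len(a), 1)

def choose_alternative(premise, alt1, alt2, scorer=overlap_score):
    """Return the index (0 or 1) of the higher-scoring alternative."""
    scores = [scorer(premise, alt1), scorer(premise, alt2)]
    return scores.index(max(scores))

item = {
    "premise": "The man broke his toe.",
    "choice1": "He got a hole in his sock.",
    "choice2": "He dropped a hammer on his toe.",
}
best = choose_alternative(item["premise"], item["choice1"], item["choice2"])  # → 1
```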