@@ -8,25 +8,106 @@ language:
 # RoBERT-small
-## BERT small model for Romanian
 #### How to use
-TBC
 ## Training data
-TBC
-## Training procedure
-TBC
 ## Eval results
-TBC
 ### BibTeX entry and citation info

 # RoBERT-small
+## Pretrained BERT model for Romanian
+A model pretrained on Romanian text with masked language modeling (MLM) and next sentence prediction (NSP) objectives.
+It was introduced in this [paper](https://www.blank.org/). Three BERT models were released: RoBERT-small, RoBERT-base and RoBERT-large, all uncased.
+Model | Weights | Layers (L) | Hidden size (H) | Attention heads (A) | MLM accuracy | NSP accuracy
+-------|---------|----------|----------|----------|----------|----------|
+RoBERT-small | 19M  | 12 | 256 | 8 | 0.5363 | 0.9687
+RoBERT-base  | 114M | 12 | 768 | 12 | 0.6511 | 0.9802
+RoBERT-large | 341M | 24 | 1024 | 24 | 0.6929 | 0.9843
+All models are available:
+* [RoBERT-small](https://huggingface.co/readerbench/RoBERT-small)
+* [RoBERT-base](https://huggingface.co/readerbench/RoBERT-base)
+* [RoBERT-large](https://huggingface.co/readerbench/RoBERT-large)
 #### How to use
+```python
+# TensorFlow
+from transformers import AutoTokenizer, TFAutoModel
+
+tokenizer = AutoTokenizer.from_pretrained("readerbench/RoBERT-small")
+model = TFAutoModel.from_pretrained("readerbench/RoBERT-small")
+# "exemplu de propoziție" means "example sentence"
+inputs = tokenizer("exemplu de propoziție", return_tensors="tf")
+outputs = model(inputs)  # outputs.last_hidden_state holds the per-token embeddings
+
+# PyTorch
+from transformers import AutoModel, AutoTokenizer
+
+tokenizer = AutoTokenizer.from_pretrained("readerbench/RoBERT-small")
+model = AutoModel.from_pretrained("readerbench/RoBERT-small")
+inputs = tokenizer("exemplu de propoziție", return_tensors="pt")
+outputs = model(**inputs)  # outputs.last_hidden_state holds the per-token embeddings
+```
 ## Training data
+The model was trained on the following compilation of corpora. The statistics are reported after the cleaning process.
+Corpus   | Words | Sentences | Size (GB)
+-------|---------|----------|----------|
+Oscar  | 1.78B | 87M | 10.8
+RoTex  | 240M | 14M | 1.5
+RoWiki | 50M | 2M | 0.3
+Total | 2.07B | 103M | 12.6
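The totals in the table can be cross-checked with a few lines of arithmetic. The snippet below is only an illustrative sketch: the figures are copied from the table (words and sentences in millions, size in GB), and the words-per-sentence ratio is a derived quantity that the authors do not report.

```python
# Corpus statistics copied from the table: (words in millions, sentences in millions, size in GB).
corpora = {
    "Oscar": (1780, 87, 10.8),
    "RoTex": (240, 14, 1.5),
    "RoWiki": (50, 2, 0.3),
}

total_words = sum(words for words, _, _ in corpora.values())
total_sentences = sum(sentences for _, sentences, _ in corpora.values())
# Round to one decimal to avoid floating-point noise in the sum.
total_size = round(sum(size for _, _, size in corpora.values()), 1)

# Derived quantity (not reported in the table): average sentence length per corpus.
for name, (words, sentences, _) in corpora.items():
    print(f"{name}: {words / sentences:.1f} words per sentence")

print(f"Total: {total_words}M words, {total_sentences}M sentences, {total_size} GB")
```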
 ## Eval results
+### Sentiment analysis
+We report the macro-averaged F1 score (in %).
+Model   | Dev | Test
+-------|---------|----------
+multilingual-BERT  | 68.96 | 69.57
+XLM-R-base | 71.26 | 71.71
+[BERT-base-ro](https://huggingface.co/dumitrescustefan/bert-base-romanian-uncased-v1) | 70.49 | 71.02
+RoBERT-small | 66.32 | 66.37
+RoBERT-base | 70.89 | 71.61
+RoBERT-large | 72.48 | 72.11
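For readers unfamiliar with the metric, macro-averaged F1 is the unweighted mean of the per-class F1 scores, so every class counts equally regardless of its frequency. The following is a minimal standard-library sketch of that definition; the function name `macro_f1` and the toy labels are illustrative assumptions, not the authors' evaluation script or benchmark data.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Return the unweighted mean of per-class F1 scores (hypothetical helper)."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for gold_label, pred_label in zip(y_true, y_pred):
        if gold_label == pred_label:
            tp[gold_label] += 1
        else:
            fp[pred_label] += 1  # predicted class gains a false positive
            fn[gold_label] += 1  # gold class gains a false negative
    scores = []
    for label in labels:
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels for illustration only, not taken from the benchmark.
gold = ["pos", "pos", "neg", "neg"]
pred = ["pos", "neg", "neg", "neg"]
score = macro_f1(gold, pred)
print(f"{100 * score:.2f}")  # reported in %, as in the table
```

In this toy case the "pos" class has F1 = 2/3 and the "neg" class has F1 = 0.8, so the macro average is their plain mean.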
+### Moldavian vs. Romanian Dialect and Cross-dialect Topic identification
+We report results on the [VarDial 2019](https://sites.google.com/view/vardial2019/campaign) Moldavian vs. Romanian Cross-dialect Topic identification Challenge, as macro-averaged F1 score (in %).
+Model   | Dialect Classification | MD to RO | RO to MD
+-------|---------|----------|----------
+2-CNN + SVM | 93.40 | 65.09 | 75.21
+Char+Word SVM | 96.20 | 69.08 | 81.93
+BiGRU | 93.30 | 70.10 | 80.30
+multilingual-BERT  | 95.34 | 68.76 | 78.24
+XLM-R-base | 96.28 | 69.93 | 82.28
+[BERT-base-ro](https://huggingface.co/dumitrescustefan/bert-base-romanian-uncased-v1) | 96.20 | 69.93 | 78.79
+RoBERT-small | 95.67 | 69.01 | 80.40
+RoBERT-base | 97.39 | 68.30 | 81.09
+RoBERT-large | 97.78 | 69.91 | 83.65
+### Diacritics Restoration
+The challenge can be found [here](https://diacritics-challenge.speed.pub.ro/).
+We report results on the official test set, as accuracies in %.
+Model | Word level | Char level
+-------|---------|----------
+BiLSTM  | 99.42 | -
+CharCNN | 98.40 | 99.65
+CharCNN + multilingual-BERT | 99.72 | 99.94
+CharCNN + XLM-R-base | 99.76 | 99.95
+CharCNN + [BERT-base-ro](https://huggingface.co/dumitrescustefan/bert-base-romanian-uncased-v1) | 99.79 | 99.95
+CharCNN + RoBERT-small | 99.73 | 99.94
+CharCNN + RoBERT-base | 99.78 | 99.95
+CharCNN + RoBERT-large | 99.76 | 99.95
 ### BibTeX entry and citation info