SI2M-Lab/DarijaBERT

Moroccan Arabic

Kamel committed on Oct 31, 2021

Commit e71dd13 · 1 Parent(s): 331e57f

Update README.md

Files changed (1)

README.md +15 -0

@@ -1,3 +1,18 @@
+---
+language: ar
+datasets:
+ - wikipedia
+ - OSIAN
+ - 1.5B Arabic Corpus
+ - OSCAR Arabic Unshuffled
+widget:
+ - text: " جاب ليا [MASK] ."
+---
 **DarijaBERT** is the first BERT model for the Moroccan Arabic dialect called “Darija”. It is based on the same architecture as BERT-base, but without the Next Sentence Prediction (NSP) objective. This model was trained on a total of ~3 Million sequences of Darija dialect representing 691MB of text or a total of ~100M tokens.
 The model was trained on a dataset issued from three different sources: