SI2M-Lab/DarijaBERT

Moroccan Arabic

Kamel committed on Oct 31, 2021

Commit

55fc5d7

·

1 Parent(s): 6de7b49

Create README.md

Files changed (1)

README.md +24 -0

README.md ADDED

	@@ -0,0 +1,24 @@

+**DBERT** is the first BERT model for the Moroccan Arabic dialect known as “Darija”. It uses the same architecture as BERT-base but was trained without the Next Sentence Prediction (NSP) objective. The model was trained on ~3 million Darija sequences, amounting to 691 MB of text, or ~100M tokens.
+The training data was drawn from three sources:
+*  Stories written in Darija, scraped from a dedicated website
+*  YouTube comments from 40 different Moroccan channels
+*  Tweets crawled using a list of Darija keywords
+More details about DarijaBERT are available in the dedicated GitHub repository.
+**Loading the model**
+The model can be loaded directly using the Hugging Face Transformers library:
+```python
+from transformers import AutoTokenizer, AutoModel
+DBERT_tokenizer = AutoTokenizer.from_pretrained("Kamel/DBERT")
+DBERT_Bert_model = AutoModel.from_pretrained("Kamel/DBERT")
+```
+**Acknowledgments**
+We gratefully acknowledge Google’s TensorFlow Research Cloud (TRC) program for providing us with free Cloud TPUs.
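
The README's loading snippet stops at instantiating the tokenizer and model. As a rough sketch of one possible next step, the helper below turns a few sentences into fixed-size vectors by mean-pooling the final hidden states. This is not part of the commit: the function name `embed`, the choice of mean pooling, and the example strings are assumptions made for illustration, and the model ID is simply the one quoted in the README above. It assumes `torch` and `transformers` are installed.

```python
MODEL_ID = "Kamel/DBERT"  # model ID as quoted in the README; not verified here


def embed(texts, model_id=MODEL_ID):
    """Return one mean-pooled vector per input text (hypothetical helper)."""
    # Heavy dependencies are imported lazily so that importing this module
    # does not require torch/transformers until embed() is actually called.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id)
    model.eval()  # disable dropout for deterministic inference

    # Pad to the longest sentence in the batch and return PyTorch tensors.
    batch = tokenizer(list(texts), padding=True, truncation=True, return_tensors="pt")

    with torch.no_grad():
        hidden = model(**batch).last_hidden_state  # (batch, seq_len, hidden_size)

    # Average over real tokens only: the attention mask is 0 on padding.
    mask = batch["attention_mask"].unsqueeze(-1).to(hidden.dtype)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)


# Example call (downloads the model on first use; strings are placeholders):
# vectors = embed(["salam", "labas 3lik"])
```

Mean pooling is only one common way to get a sentence vector from a BERT-style encoder; taking the `[CLS]` position or fine-tuning on a downstream task are equally plausible choices, and the README does not recommend any of them.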