fadi77/pl-bert

fadi77 committed on Apr 15, 2025

Commit 8ca4f72 (verified)

1 Parent(s): 8f5d729

Update README.md

Files changed (1)

README.md CHANGED

@@ -35,7 +35,7 @@ The collection includes three models:
 All models were initially trained on a cleaned version of the Arabic Wikipedia dataset. The dataset is available at [fadi77/wikipedia_20231101.ar.phonemized](https://huggingface.co/datasets/fadi77/wikipedia_20231101.ar.phonemized).
-For the **mlm_only_with_diacritics** model, a random sample of 200,000 entries (out of approximately 1.2 million) was selected from the Wikipedia Arabic dataset and fully diacritized using the state-of-the-art CATT diacritizer ([Abjad AI, 2024](https://github.com/abjadai/catt)), introduced in [this paper](https://arxiv.org/abs/2407.03236).
+For the **mlm_only_with_diacritics** model, a random sample of 200,000 entries (out of approximately 1.2 million) was selected from the Wikipedia Arabic dataset and fully diacritized using the state-of-the-art CATT diacritizer ([Abjad AI, 2024](https://github.com/abjadai/catt)), introduced in [this paper](https://arxiv.org/abs/2407.03236) and licensed under CC BY-NC 4.0.
 ### Training Procedure