---
# For reference on model card metadata, see the spec: https://github.com/huggingface/hub-docs/blob/main/modelcard.md?plain=1
# Doc / guide: https://huggingface.co/docs/hub/model-cards
{}
---

# Model Card for Arabic PL-BERT Models

This model card describes a collection of three Arabic BERT models trained with different objectives and datasets for phoneme-aware language modeling.

## Model Details

### Model Description

These models are Arabic adaptations of PL-BERT (Phoneme-Level BERT), introduced by [Li et al. (2023)](https://arxiv.org/pdf/2301.08810). The models incorporate phonemic information to enhance language understanding, with variations in training objectives and data preprocessing.

The collection includes three models:
- **mlm_p2g_non_diacritics**: Trained with both MLM (Masked Language Modeling) and P2G (Phoneme-to-Grapheme) objectives on non-diacritized Arabic text
- **mlm_only_non_diacritics**: Trained with only the MLM objective on non-diacritized Arabic text
- **mlm_only_with_diacritics**: Fine-tuned version of mlm_only_non_diacritics on diacritized Arabic text

- **Developed by:** [More Information Needed]
- **Model type:** Transformer-based language models (BERT variants)
- **Language:** Arabic
- **License:** [More Information Needed]
- **Finetuned from model:** The mlm_only_with_diacritics model is fine-tuned from mlm_only_non_diacritics

### Model Sources

- **Paper (PL-BERT approach):** [Li et al. (2023)](https://arxiv.org/pdf/2301.08810)

## Training Details

### Training Data

All models were initially trained on a cleaned version of the Arabic Wikipedia dataset, available at [fadi77/wikipedia_20231101.ar.phonemized](https://huggingface.co/datasets/fadi77/wikipedia_20231101.ar.phonemized).

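To explore the phonemized corpus, it can be loaded with 🤗 Datasets; the split and column names below are assumptions, so inspect the actual schema rather than relying on them:

```python
from datasets import load_dataset

# Load the phonemized Arabic Wikipedia corpus used for pre-training.
# The "train" split name is an assumption; list splits with
# datasets.get_dataset_split_names() if it differs.
ds = load_dataset("fadi77/wikipedia_20231101.ar.phonemized", split="train")

# Inspect the actual schema rather than guessing column names.
print(ds.column_names)
print(ds[0])
```
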
For the **mlm_only_with_diacritics** model, a random sample of 200,000 entries (out of approximately 1.2 million) was selected from the Arabic Wikipedia dataset and fully diacritized using the state-of-the-art CATT diacritizer ([Abjad AI, 2024](https://github.com/abjadai/catt)), introduced in [this paper](https://arxiv.org/abs/2407.03236).

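The sampling step could be reproduced along these lines; `diacritize` is a hypothetical stand-in for the CATT model, whose real interface is documented in its repository:

```python
from datasets import load_dataset

ds = load_dataset("fadi77/wikipedia_20231101.ar.phonemized", split="train")

# Reproducible random sample of 200k entries out of ~1.2M.
sample = ds.shuffle(seed=42).select(range(200_000))

# Hypothetical stand-in for the CATT diacritizer
# (https://github.com/abjadai/catt); see that repo for actual usage.
def diacritize(text: str) -> str:
    raise NotImplementedError("wrap a CATT checkpoint here")

# The "text" column name is an assumption about the dataset schema.
sample = sample.map(lambda row: {"text": diacritize(row["text"])})
```
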
### Training Procedure

#### Model Architecture and Objectives

The models follow different training objectives (a sketch of the dual-objective setup is shown after this list):

1. **mlm_p2g_non_diacritics**:
   - Trained with dual objectives similar to the original PL-BERT:
     - Masked Language Modeling (MLM): the standard BERT pre-training objective
     - Phoneme-to-Grapheme (P2G): predicting grapheme token IDs from phonemic representations
   - Tokenization was performed using [aubmindlab/bert-base-arabertv2](https://huggingface.co/aubmindlab/bert-base-arabertv2), which uses subword tokenization
   - Trained for 10 epochs on non-diacritized Arabic Wikipedia

2. **mlm_only_non_diacritics**:
   - Trained with only the MLM objective
   - Drops the P2G objective, which ablation studies in the PL-BERT paper found to have only a minimal effect on performance
   - Removing P2G eliminated the dependence on a tokenizer, which:
     - Reduced the model size considerably (a word/subword vocabulary is much larger than a phoneme vocabulary)
     - Allowed entire sentences to be phonemized at once, resulting in more accurate phonemization
   - Trained on non-diacritized Arabic Wikipedia

3. **mlm_only_with_diacritics**:
   - A fine-tuned version of mlm_only_non_diacritics
   - Trained for 10 epochs on diacritized Arabic text
   - Uses the same MLM-only objective

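As a rough illustration (not the released training code), the dual-objective setup can be thought of as a shared phoneme encoder with two heads, where the MLM-only variants simply drop the second head; all sizes and names below are illustrative assumptions:

```python
import torch.nn as nn
from transformers import BertConfig, BertModel

PHONEME_VOCAB = 178     # illustrative phoneme vocabulary size
GRAPHEME_VOCAB = 64000  # illustrative subword vocabulary (AraBERT-scale)

class DualObjectivePLBert(nn.Module):
    """Shared encoder over phoneme IDs with two prediction heads:
    MLM (masked phoneme recovery) and P2G (grapheme token prediction)."""

    def __init__(self):
        super().__init__()
        config = BertConfig(vocab_size=PHONEME_VOCAB, hidden_size=256,
                            num_hidden_layers=6, num_attention_heads=4,
                            intermediate_size=1024)
        self.encoder = BertModel(config)
        self.mlm_head = nn.Linear(config.hidden_size, PHONEME_VOCAB)
        self.p2g_head = nn.Linear(config.hidden_size, GRAPHEME_VOCAB)

    def forward(self, phoneme_ids, attention_mask, mlm_labels, p2g_labels):
        hidden = self.encoder(input_ids=phoneme_ids,
                              attention_mask=attention_mask).last_hidden_state
        loss_fn = nn.CrossEntropyLoss(ignore_index=-100)  # -100 = unlabeled
        # Heads produce (batch, seq, vocab); CrossEntropyLoss wants
        # (batch, vocab, seq), hence the transpose.
        mlm_loss = loss_fn(self.mlm_head(hidden).transpose(1, 2), mlm_labels)
        p2g_loss = loss_fn(self.p2g_head(hidden).transpose(1, 2), p2g_labels)
        # The MLM-only variants drop p2g_loss (and the large P2G head).
        return mlm_loss + p2g_loss
```

Because the P2G head projects onto the full subword vocabulary, dropping it accounts for most of the size reduction mentioned above.
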
## Technical Considerations

### Tokenization Challenges

For the **mlm_p2g_non_diacritics** model, a notable limitation was the use of subword tokenization. This approach is not ideal for pronunciation modeling because phonemizing parts of a word independently discards the word-level context that heavily influences pronunciation, as the example below illustrates. The authors of the original PL-BERT paper used a word-level tokenizer for English, but no comparable high-quality word-level tokenizer was available for Arabic. This limitation was addressed in the subsequent models by removing the P2G objective.

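The issue can be reproduced with the [phonemizer](https://github.com/bootphon/phonemizer) library, used here purely for illustration (the model card does not state which phonemizer produced the training data, and the subword split shown is illustrative):

```python
# pip install phonemizer  (requires the espeak-ng system package)
from phonemizer import phonemize

word = "مدرسة"               # "school"
fragments = ["مدر", "##سة"]  # illustrative AraBERT-style subword split

# Phonemizing the whole word keeps its context...
print(phonemize(word, language="ar", backend="espeak"))

# ...while phonemizing fragments independently can yield a different,
# incorrect sequence, since each piece is read as a standalone word.
for piece in fragments:
    print(phonemize(piece.replace("##", ""), language="ar", backend="espeak"))
```
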
### Diacritization

Arabic text can be written with or without diacritics (short vowel marks). The **mlm_only_with_diacritics** model specifically addresses this by training on fully diacritized text, which provides explicit pronunciation information that is typically absent in standard written Arabic.

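Concretely, the diacritic marks occupy the Unicode range U+064B to U+0652, so non-diacritized text is essentially what remains after stripping them:

```python
import re

# Arabic diacritics (tashkeel) occupy U+064B..U+0652.
DIACRITICS = re.compile(r"[\u064B-\u0652]")

diacritized = "كَتَبَ الوَلَدُ"   # "the boy wrote", fully diacritized
plain = DIACRITICS.sub("", diacritized)
print(plain)                      # -> "كتب الولد"
```
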
## Uses

### Direct Use

These models can be used for Arabic natural language understanding tasks where phonemic awareness may be beneficial, such as:
- Speech recognition post-processing
- Text-to-speech preprocessing
- Dialect identification
- Pronunciation-sensitive applications

### Downstream Use

The models can be fine-tuned for specific Arabic NLP tasks where pronunciation information might improve performance, such as:
- Dialect classification
- Speech-text alignment
- Pronunciation error detection in language learning applications

## Bias, Risks, and Limitations

The models are trained on Wikipedia data, which may not represent all dialects and varieties of Arabic equally. The diacritization process, while state-of-the-art, may introduce some errors or biases into the training data.

The subword tokenization approach used in the mlm_p2g_non_diacritics model has limitations for phonemic modeling, as noted above.

## How to Get Started with the Model

[More Information Needed]

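If the checkpoints follow the standard 🤗 Transformers layout, loading one should look roughly like the sketch below; the repository path is a placeholder, and the phoneme-ID input format is an assumption based on the training details above:

```python
import torch
from transformers import AutoModel

# Placeholder repo id; substitute the actual checkpoint path for the
# variant you want (mlm_p2g_non_diacritics, mlm_only_non_diacritics,
# or mlm_only_with_diacritics).
model = AutoModel.from_pretrained("fadi77/<checkpoint-path>")

# PL-BERT-style models consume phoneme ID sequences rather than raw text,
# so inputs must first be phonemized and mapped through the model's own
# phoneme vocabulary (an assumption; no official example is provided).
phoneme_ids = torch.tensor([[5, 17, 42, 8]])  # dummy IDs for illustration
hidden = model(input_ids=phoneme_ids).last_hidden_state
print(hidden.shape)
```
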
## Evaluation

[More Information Needed]

## Model Card Contact

[More Information Needed]