---
# For reference on model card metadata, see the spec: https://github.com/huggingface/hub-docs/blob/main/modelcard.md?plain=1
# Doc / guide: https://huggingface.co/docs/hub/model-cards
{}
---

# Model Card for Arabic PL-BERT Models

This model card describes a collection of three Arabic BERT models trained with different objectives and datasets for phoneme-aware language modeling.

## Model Details

### Model Description

These models are Arabic adaptations of the PL-BERT (Phoneme-Level BERT) approach introduced in [Li et al. (2023)](https://arxiv.org/pdf/2301.08810). The models incorporate phonemic information to enhance language understanding, with variations in training objectives and data preprocessing.

The collection includes three models:
- **mlm_p2g_non_diacritics**: Trained with both MLM (Masked Language Modeling) and P2G (Phoneme-to-Grapheme) objectives on non-diacritized Arabic text
- **mlm_only_non_diacritics**: Trained with only the MLM objective on non-diacritized Arabic text
- **mlm_only_with_diacritics**: Fine-tuned version of mlm_only_non_diacritics on diacritized Arabic text

**Developed by:** Fadi (GitHub: Fadi987)  
**Model type:** Transformer-based language models (BERT variants)  
**Language:** Arabic

### Model Sources

- **Paper (PL-BERT approach):** [Li et al. (2023)](https://arxiv.org/pdf/2301.08810)

## Training Details

### Training Data

All models were initially trained on a cleaned version of the Arabic Wikipedia dataset. The dataset is available at [wikipedia.20231101.ar](https://huggingface.co/datasets/wikimedia/wikipedia/tree/main/20231101.ar).

For the **mlm_only_with_diacritics** model, a random sample of 200,000 entries (out of approximately 1.2 million) was selected from the Wikipedia Arabic dataset and fully diacritized using the state-of-the-art CATT diacritizer ([Abjad AI, 2024](https://github.com/abjadai/catt)), introduced in [this paper](https://arxiv.org/abs/2407.03236) and licensed under CC BY-NC 4.0.
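
As a rough illustration of that sampling step (not the exact preparation script; the seed and printed field below are assumptions), the Hugging Face `datasets` library can draw such a sample directly from the dump:

```python
from datasets import load_dataset

# Arabic Wikipedia dump referenced above (config name from the dataset card).
wiki_ar = load_dataset("wikimedia/wikipedia", "20231101.ar", split="train")

# Random sample of 200,000 articles; these would then be diacritized with CATT.
# The seed is illustrative, not the one used for the released model.
sample = wiki_ar.shuffle(seed=42).select(range(200_000))
print(len(sample), sample[0]["title"])
```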

### Training Procedure

#### Model Architecture and Objectives

The models follow different training objectives (a minimal code sketch of the two heads follows this list):

1. **mlm_p2g_non_diacritics**: 
   - Trained with dual objectives similar to the original PL-BERT:
     - Masked Language Modeling (MLM): Standard BERT pre-training objective
     - Phoneme-to-Grapheme (P2G): Predicting token IDs from phonemic representations
   - Tokenization was performed using [aubmindlab/bert-base-arabertv2](https://huggingface.co/aubmindlab/bert-base-arabertv2), which uses subword tokenization
   - Trained for 10 epochs on non-diacritized Wikipedia Arabic

2. **mlm_only_non_diacritics**:
   - Trained with only the MLM objective
   - Removes the P2G objective, which, according to the ablation studies in the PL-BERT paper, had only a minimal effect on performance
   - This removal eliminated the dependence on a grapheme tokenizer, which:
     - Reduced the model size considerably (a word/subword vocabulary is much larger than the phoneme vocabulary)
     - Allowed entire sentences to be phonemized at once, resulting in more accurate phonemization
   - Trained on non-diacritized Wikipedia Arabic

3. **mlm_only_with_diacritics**:
   - Fine-tuned version of mlm_only_non_diacritics
   - Trained for 10 epochs on diacritized Arabic text
   - Uses the same MLM-only objective
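
To make the two objectives concrete, below is a minimal, hypothetical PyTorch sketch of a phoneme-level encoder with an MLM head over the phoneme vocabulary and an optional P2G head over the grapheme (subword) vocabulary. This is not the released architecture or training code; all dimensions, vocabulary sizes, and names are placeholders.

```python
import torch
import torch.nn as nn

class PLBertSketch(nn.Module):
    """Illustrative phoneme-level encoder with an MLM head and an optional P2G head.

    Not the released architecture; vocabulary sizes and dimensions are placeholders.
    """

    def __init__(self, phoneme_vocab=200, grapheme_vocab=64000,
                 d_model=512, n_heads=8, n_layers=6, use_p2g=True):
        super().__init__()
        self.embed = nn.Embedding(phoneme_vocab, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        # MLM head: recover the masked phoneme at each position.
        self.mlm_head = nn.Linear(d_model, phoneme_vocab)
        # P2G head: predict the grapheme (subword) token aligned with each phoneme.
        # The mlm_only_* models drop this head and its large grapheme softmax.
        self.p2g_head = nn.Linear(d_model, grapheme_vocab) if use_p2g else None

    def forward(self, phoneme_ids):
        h = self.encoder(self.embed(phoneme_ids))
        mlm_logits = self.mlm_head(h)
        p2g_logits = self.p2g_head(h) if self.p2g_head is not None else None
        return mlm_logits, p2g_logits

# Toy usage: a batch of 2 sequences of 16 phoneme IDs.
model = PLBertSketch()
mlm_logits, p2g_logits = model(torch.randint(0, 200, (2, 16)))
print(mlm_logits.shape, p2g_logits.shape)  # (2, 16, 200) and (2, 16, 64000)
```

Setting `use_p2g=False` corresponds to the `mlm_only_*` models; removing the grapheme projection is where most of the size reduction mentioned above comes from.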

## Technical Considerations

### Tokenization Challenges

For the **mlm_p2g_non_diacritics** model, a notable limitation was the use of subword tokenization. This approach is not ideal for pronunciation modeling because phonemizing parts of words independently loses the context of the word, which heavily affects pronunciation. The authors of the original PL-BERT paper used a word-level tokenizer for English, but a comparable high-quality word-level tokenizer was not available for Arabic. This limitation was addressed in the subsequent models by removing the P2G objective.
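
As an illustration (assuming the `transformers` library is installed), the tokenizer below splits many Arabic words into subword pieces; phonemizing each piece on its own discards the surrounding word, which is exactly the limitation described above. The printed split is indicative only and may differ between tokenizer versions.

```python
from transformers import AutoTokenizer

# Subword tokenizer used for the P2G targets of mlm_p2g_non_diacritics.
tokenizer = AutoTokenizer.from_pretrained("aubmindlab/bert-base-arabertv2")

# A single word may be split into several pieces; the exact split can vary.
print(tokenizer.tokenize("الاستقلال"))
```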

### Diacritization

Arabic text can be written with or without diacritics (short vowel marks). The **mlm_only_with_diacritics** model specifically addresses this by training on fully diacritized text, which provides explicit pronunciation information that is typically absent in standard written Arabic.
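
For readers unfamiliar with diacritics, the short sketch below strips the common harakat range (U+064B to U+0652) from a diacritized string. The regex range is an assumption that covers the usual short-vowel marks, shadda, and sukun, not an exhaustive list of Arabic combining characters.

```python
import re

# Common Arabic diacritic (harakat) range: fathatan .. sukun (U+064B-U+0652).
HARAKAT = re.compile(r"[\u064B-\u0652]")

def strip_diacritics(text: str) -> str:
    """Remove short-vowel marks, turning diacritized text into plain text."""
    return HARAKAT.sub("", text)

diacritized = "العَرَبِيَّةُ"
print(strip_diacritics(diacritized))  # -> "العربية"
```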

## Uses

These models can be used for Arabic natural language understanding tasks where phonemic awareness may be beneficial, such as:
- Text-to-speech
- Speech recognition post-processing
- Dialect identification
- Pronunciation-sensitive applications

For examples of how to use these models in code, see: https://github.com/Fadi987/StyleTTS2/blob/main/Utils/PLBERT/util.py
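
As a hedged sketch only, assuming the fork keeps the upstream StyleTTS2 loader (`load_plbert` in `Utils/PLBERT/util.py`) and a checkpoint directory laid out as that loader expects; the path below is illustrative, so consult the linked file for the actual interface.

```python
# Hypothetical usage inside a clone of the StyleTTS2 fork linked above.
# Assumes Utils/PLBERT/util.py exposes load_plbert and that the directory
# passed in contains the PL-BERT config and checkpoint it expects.
from Utils.PLBERT.util import load_plbert

plbert = load_plbert("Utils/PLBERT/")  # illustrative path
print(plbert)
```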

## Bias, Risks, and Limitations

The models are trained on Wikipedia data, which may not represent all varieties of Arabic equally. The diacritization process, while state-of-the-art, may introduce some errors or biases in the training data.

The subword tokenization approach used in the mlm_p2g_non_diacritics model has limitations for phonemic modeling as noted above.

## Citation

**BibTeX:**
```bibtex
@article{catt2024,
  title={CATT: Character-based Arabic Tashkeel Transformer},
  author={Alasmary, Faris and Zaafarani, Orjuwan and Ghannam, Ahmad},
  journal={arXiv preprint arXiv:2407.03236},
  year={2024}
}

@article{plbert2023,
  title={Phoneme-Level BERT for Enhanced Prosody of Text-to-Speech with Grapheme Predictions},
  author={Li, Yinghao Aaron and Han, Cong and Jiang, Xilin and Mesgarani, Nima},
  journal={arXiv preprint arXiv:2301.08810},
  year={2023}
}
```