---
# For reference on model card metadata, see the spec: https://github.com/huggingface/hub-docs/blob/main/modelcard.md?plain=1
# Doc / guide: https://huggingface.co/docs/hub/model-cards
{}
---

# Model Card for Arabic PL-BERT Models

This model card describes a collection of three Arabic BERT models trained with different objectives and datasets for phoneme-aware language modeling.

## Model Details

### Model Description

These models are Arabic adaptations of PL-BERT (Phoneme-Level BERT), introduced by [Li et al. (2023)](https://arxiv.org/pdf/2301.08810). The models incorporate phonemic information to enhance language understanding, with variations in training objectives and data preprocessing.

The collection includes three models:
- **mlm_p2g_non_diacritics**: Trained with both MLM (Masked Language Modeling) and P2G (Phoneme-to-Grapheme) objectives on non-diacritized Arabic text
- **mlm_only_non_diacritics**: Trained with only the MLM objective on non-diacritized Arabic text
- **mlm_only_with_diacritics**: Fine-tuned version of mlm_only_non_diacritics on diacritized Arabic text

- **Developed by:** [More Information Needed]
- **Model type:** Transformer-based language models (BERT variants)
- **Language:** Arabic
- **License:** [More Information Needed]
- **Finetuned from model:** The mlm_only_with_diacritics model is fine-tuned from mlm_only_non_diacritics

### Model Sources

- **Paper (PL-BERT approach):** [Li et al. (2023)](https://arxiv.org/pdf/2301.08810)

## Training Details

### Training Data

All models were initially trained on a cleaned version of the Arabic Wikipedia dataset, available at [fadi77/wikipedia_20231101.ar.phonemized](https://huggingface.co/datasets/fadi77/wikipedia_20231101.ar.phonemized).

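To explore the phonemized corpus, it can be loaded with 🤗 Datasets; the split and column names below are assumptions, so inspect the actual schema rather than relying on them:

```python
from datasets import load_dataset

# Load the phonemized Arabic Wikipedia corpus used for pre-training.
# The "train" split name is an assumption; list splits with
# datasets.get_dataset_split_names() if it differs.
ds = load_dataset("fadi77/wikipedia_20231101.ar.phonemized", split="train")

# Inspect the actual schema rather than guessing column names.
print(ds.column_names)
print(ds[0])
```
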
For the **mlm_only_with_diacritics** model, a random sample of 200,000 entries (out of approximately 1.2 million) was selected from the Arabic Wikipedia dataset and fully diacritized using the state-of-the-art CATT diacritizer ([Abjad AI, 2024](https://github.com/abjadai/catt)), introduced in [this paper](https://arxiv.org/abs/2407.03236).

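The sampling step could be reproduced along these lines; `diacritize` is a hypothetical stand-in for the CATT model, whose real interface is documented in its repository:

```python
from datasets import load_dataset

ds = load_dataset("fadi77/wikipedia_20231101.ar.phonemized", split="train")

# Reproducible random sample of 200k entries out of ~1.2M.
sample = ds.shuffle(seed=42).select(range(200_000))

# Hypothetical stand-in for the CATT diacritizer
# (https://github.com/abjadai/catt); see that repo for actual usage.
def diacritize(text: str) -> str:
    raise NotImplementedError("wrap a CATT checkpoint here")

# The "text" column name is an assumption about the dataset schema.
sample = sample.map(lambda row: {"text": diacritize(row["text"])})
```
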
### Training Procedure

#### Model Architecture and Objectives

The models follow different training objectives (a sketch of the dual-objective setup is shown after this list):

1. **mlm_p2g_non_diacritics**:
   - Trained with dual objectives similar to the original PL-BERT:
     - Masked Language Modeling (MLM): the standard BERT pre-training objective
     - Phoneme-to-Grapheme (P2G): predicting grapheme token IDs from phonemic representations
   - Tokenization was performed using [aubmindlab/bert-base-arabertv2](https://huggingface.co/aubmindlab/bert-base-arabertv2), which uses subword tokenization
   - Trained for 10 epochs on non-diacritized Arabic Wikipedia

2. **mlm_only_non_diacritics**:
   - Trained with only the MLM objective
   - Drops the P2G objective, which ablation studies in the PL-BERT paper found to have only a minimal effect on performance
   - Removing P2G eliminated the dependence on a tokenizer, which:
     - Reduced the model size considerably (a word/subword vocabulary is much larger than a phoneme vocabulary)
     - Allowed entire sentences to be phonemized at once, resulting in more accurate phonemization
   - Trained on non-diacritized Arabic Wikipedia

3. **mlm_only_with_diacritics**:
   - A fine-tuned version of mlm_only_non_diacritics
   - Trained for 10 epochs on diacritized Arabic text
   - Uses the same MLM-only objective

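As a rough illustration (not the released training code), the dual-objective setup can be thought of as a shared phoneme encoder with two heads, where the MLM-only variants simply drop the second head; all sizes and names below are illustrative assumptions:

```python
import torch.nn as nn
from transformers import BertConfig, BertModel

PHONEME_VOCAB = 178     # illustrative phoneme vocabulary size
GRAPHEME_VOCAB = 64000  # illustrative subword vocabulary (AraBERT-scale)

class DualObjectivePLBert(nn.Module):
    """Shared encoder over phoneme IDs with two prediction heads:
    MLM (masked phoneme recovery) and P2G (grapheme token prediction)."""

    def __init__(self):
        super().__init__()
        config = BertConfig(vocab_size=PHONEME_VOCAB, hidden_size=256,
                            num_hidden_layers=6, num_attention_heads=4,
                            intermediate_size=1024)
        self.encoder = BertModel(config)
        self.mlm_head = nn.Linear(config.hidden_size, PHONEME_VOCAB)
        self.p2g_head = nn.Linear(config.hidden_size, GRAPHEME_VOCAB)

    def forward(self, phoneme_ids, attention_mask, mlm_labels, p2g_labels):
        hidden = self.encoder(input_ids=phoneme_ids,
                              attention_mask=attention_mask).last_hidden_state
        loss_fn = nn.CrossEntropyLoss(ignore_index=-100)  # -100 = unlabeled
        # Heads produce (batch, seq, vocab); CrossEntropyLoss wants
        # (batch, vocab, seq), hence the transpose.
        mlm_loss = loss_fn(self.mlm_head(hidden).transpose(1, 2), mlm_labels)
        p2g_loss = loss_fn(self.p2g_head(hidden).transpose(1, 2), p2g_labels)
        # The MLM-only variants drop p2g_loss (and the large P2G head).
        return mlm_loss + p2g_loss
```

Because the P2G head projects onto the full subword vocabulary, dropping it accounts for most of the size reduction mentioned above.
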
## Technical Considerations

### Tokenization Challenges

For the **mlm_p2g_non_diacritics** model, a notable limitation was the use of subword tokenization. This approach is not ideal for pronunciation modeling because phonemizing parts of a word independently discards the word-level context that heavily influences pronunciation, as the example below illustrates. The authors of the original PL-BERT paper used a word-level tokenizer for English, but no comparable high-quality word-level tokenizer was available for Arabic. This limitation was addressed in the subsequent models by removing the P2G objective.

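The issue can be reproduced with the [phonemizer](https://github.com/bootphon/phonemizer) library, used here purely for illustration (the model card does not state which phonemizer produced the training data, and the subword split shown is illustrative):

```python
# pip install phonemizer  (requires the espeak-ng system package)
from phonemizer import phonemize

word = "مدرسة"               # "school"
fragments = ["مدر", "##سة"]  # illustrative AraBERT-style subword split

# Phonemizing the whole word keeps its context...
print(phonemize(word, language="ar", backend="espeak"))

# ...while phonemizing fragments independently can yield a different,
# incorrect sequence, since each piece is read as a standalone word.
for piece in fragments:
    print(phonemize(piece.replace("##", ""), language="ar", backend="espeak"))
```
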
### Diacritization

Arabic text can be written with or without diacritics (short vowel marks). The **mlm_only_with_diacritics** model specifically addresses this by training on fully diacritized text, which provides explicit pronunciation information that is typically absent in standard written Arabic.

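Concretely, the diacritic marks occupy the Unicode range U+064B to U+0652, so non-diacritized text is essentially what remains after stripping them:

```python
import re

# Arabic diacritics (tashkeel) occupy U+064B..U+0652.
DIACRITICS = re.compile(r"[\u064B-\u0652]")

diacritized = "كَتَبَ الوَلَدُ"   # "the boy wrote", fully diacritized
plain = DIACRITICS.sub("", diacritized)
print(plain)                      # -> "كتب الولد"
```
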
## Uses

### Direct Use

These models can be used for Arabic natural language understanding tasks where phonemic awareness may be beneficial, such as:
- Speech recognition post-processing
- Text-to-speech preprocessing
- Dialect identification
- Pronunciation-sensitive applications

### Downstream Use

The models can be fine-tuned for specific Arabic NLP tasks where pronunciation information might improve performance, such as:
- Dialect classification
- Speech-text alignment
- Pronunciation error detection in language learning applications

## Bias, Risks, and Limitations

The models are trained on Wikipedia data, which may not represent all dialects and varieties of Arabic equally. The diacritization process, while state-of-the-art, may introduce some errors or biases into the training data.

The subword tokenization approach used in the mlm_p2g_non_diacritics model has limitations for phonemic modeling, as noted above.

## How to Get Started with the Model

[More Information Needed]

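If the checkpoints follow the standard 🤗 Transformers layout, loading one should look roughly like the sketch below; the repository path is a placeholder, and the phoneme-ID input format is an assumption based on the training details above:

```python
import torch
from transformers import AutoModel

# Placeholder repo id; substitute the actual checkpoint path for the
# variant you want (mlm_p2g_non_diacritics, mlm_only_non_diacritics,
# or mlm_only_with_diacritics).
model = AutoModel.from_pretrained("fadi77/<checkpoint-path>")

# PL-BERT-style models consume phoneme ID sequences rather than raw text,
# so inputs must first be phonemized and mapped through the model's own
# phoneme vocabulary (an assumption; no official example is provided).
phoneme_ids = torch.tensor([[5, 17, 42, 8]])  # dummy IDs for illustration
hidden = model(input_ids=phoneme_ids).last_hidden_state
print(hidden.shape)
```
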
## Evaluation

[More Information Needed]

## Model Card Contact

[More Information Needed]