noorrtamerr/EGYbert

Mask Generation

egyptian-arabic


noorrtamerr committed on Dec 26, 2024

Commit 69c7571 (verified) · 1 parent: d8a4238

Update README.md

Files changed (1)

README.md +72 -1

@@ -15,4 +15,75 @@ tags:
 - fine-tuned
 - arabert
 license: apache-2.0  # Add a license (choose one appropriate for your work)
----

+---
+# EgBERT: Fine-Tuned AraBERT for Egyptian Arabic
+## Model Description
+EgBERT is a version of the pre-trained AraBERT model that has been fine-tuned for Egyptian Arabic. It was developed to improve performance on tasks that require understanding and generation of Egyptian dialect text, with a focus on Masked Language Modeling (MLM). Fine-tuning used a custom dataset of colloquial Egyptian Arabic, which makes the model particularly suited to casual and conversational text.
+Key Features:
+- Based on **[aubmindlab/bert-base-arabert](https://huggingface.co/aubmindlab/bert-base-arabert)**.
+- Fine-tuned specifically for **Egyptian Arabic**.
+- Optimized for **Masked Language Modeling (MLM)** tasks.
+## Training Details
+- **Dataset**:
+  - A custom dataset of Egyptian Arabic collected from conversational text sources.
+  - Preprocessed to include common colloquial phrases and to reduce noise in the data.
+- **Training Setup**:
+  - Pre-trained model: `aubmindlab/bert-base-arabert`
+  - Fine-tuning performed for 3 epochs with a batch size of 16.
+  - Learning rate: 2e-5.
+  - MLM Probability: 15%.
+- **Tools**:
+  - **Hugging Face Transformers Library**
+  - **PyTorch**
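
The 15% MLM probability listed in the training setup can be illustrated with a small, dependency-free sketch. It assumes the standard BERT masking recipe (of the selected tokens, 80% become `[MASK]`, 10% become a random token, 10% stay unchanged), which is what `DataCollatorForLanguageModeling(mlm_probability=0.15)` in Transformers applies by default; the README does not say which collator was actually used. The function name `mask_tokens` and all token IDs below are hypothetical.

```python
import random

MLM_PROBABILITY = 0.15  # from the training setup above
IGNORE_INDEX = -100     # label value that PyTorch's cross-entropy loss skips by default


def mask_tokens(token_ids, mask_token_id, vocab_size, special_ids=(), rng=None):
    """Return (masked_inputs, labels) for one sequence of token IDs.

    Assumes the standard BERT 80/10/10 recipe; this is an illustration,
    not the authors' training code.
    """
    rng = rng or random.Random()
    inputs = list(token_ids)
    labels = [IGNORE_INDEX] * len(inputs)
    for i, tok in enumerate(token_ids):
        # Special tokens are never masked; other tokens are selected with p = 0.15.
        if tok in special_ids or rng.random() >= MLM_PROBABILITY:
            continue  # not selected: this position contributes nothing to the loss
        labels[i] = tok  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_token_id              # 80%: replace with [MASK]
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return inputs, labels


# Hypothetical IDs: 2 = [CLS], 3 = [SEP], 4 = [MASK]
rng = random.Random(0)
ids = [2] + list(range(100, 120)) + [3]
masked, labels = mask_tokens(ids, mask_token_id=4, vocab_size=1000,
                             special_ids={2, 3}, rng=rng)
print(masked)
print(labels)
```

Only positions whose label is not `-100` contribute to the MLM loss, so on average about 15% of the non-special tokens in each batch drive the gradient.
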
+## Evaluation Results
+### Model Perplexity
+- **Baseline Model**: 36.2377
+- **Fine-Tuned Model**: 26.5359
+The fine-tuned model reaches a lower perplexity than the baseline AraBERT model (26.54 vs. 36.24, a reduction of roughly 27%; lower is better), indicating better MLM performance on Egyptian Arabic.
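
The README does not state how perplexity was computed. A common convention for masked language models, assumed here, is the exponential of the mean cross-entropy loss (in nats) over the masked tokens of a held-out set. The sketch below shows that relationship and the relative improvement implied by the two reported figures; the function names and the per-token losses are hypothetical.

```python
import math


def perplexity(masked_token_losses):
    """Perplexity = exp(mean cross-entropy in nats over the masked tokens)."""
    mean_loss = sum(masked_token_losses) / len(masked_token_losses)
    return math.exp(mean_loss)


def relative_reduction(baseline, fine_tuned):
    """Fraction by which the fine-tuned value is lower than the baseline."""
    return (baseline - fine_tuned) / baseline


# Hypothetical per-token losses, only to show the call.
example_ppl = perplexity([3.1, 3.4, 3.3])
print(f"Example perplexity: {example_ppl:.2f}")

# Perplexities reported in the results above.
reduction = relative_reduction(36.2377, 26.5359)
print(f"Relative perplexity reduction: {reduction:.1%}")
```

Because perplexity is an exponential of the loss, a modest drop in mean loss translates into a comparatively large drop in perplexity.
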
+## How to Use
+Here’s an example of how to use EgBERT in your project:
+```python
+import torch
+from transformers import AutoTokenizer, AutoModelForMaskedLM
+
+# Load the fine-tuned model and tokenizer.
+# Repo ID as written in this README; the repository header lists it as noorrtamerr/EGYbert.
+tokenizer = AutoTokenizer.from_pretrained("noortamerr/EgBERT")
+model = AutoModelForMaskedLM.from_pretrained("noortamerr/EgBERT")
+
+# Input text with a masked token
+# (roughly: "Football in Egypt [MASK] something everyone follows.")
+text = "الكورة في مصر [MASK] حاجة كل الناس بتتابعها."
+
+# Tokenize and locate the position of the [MASK] token
+inputs = tokenizer(text, return_tensors="pt")
+mask_token_index = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
+
+# Run a forward pass without tracking gradients
+with torch.no_grad():
+    outputs = model(**inputs)
+    predictions = outputs.logits
+
+# Decode the top 5 predictions for the [MASK] token
+mask_token_logits = predictions[0, mask_token_index, :]
+top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()
+predicted_words = [tokenizer.decode([token]) for token in top_5_tokens]
+print(f"Predicted words: {predicted_words}")
+```
+## Citation
+```bibtex
+@misc{EgBERT,
+  author = {Noor Tamer and Roba Mahmoud and Orchid Hazem},
+  title = {EgBERT: Fine-Tuned AraBERT for Egyptian Arabic},
+  year = {2024},
+  publisher = {Hugging Face},
+  url = {https://huggingface.co/noortamerr/EgBERT}
+}
+```