---
language: ar
license: apache-2.0
library_name: transformers
tags:
- sentiment-analysis
- arabic
- marbert
- twitter
- text-classification
datasets:
- mksaad/arabic-sentiment-twitter-corpus
metrics:
- accuracy
- f1
- precision
- recall
---

# MARBERT Model for Arabic Sentiment Analysis (Positive/Negative)

This is a fine-tuned version of `UBC-NLP/MARBERTv2` for Arabic Sentiment Analysis.
The model is trained to classify Arabic text (specifically tweets) into two categories: **Positive (`LABEL_1`)** or **Negative (`LABEL_0`)**.

## 🚀 Live Demo

You can test the model live on the Hugging Face Space:
**[https://huggingface.co/spaces/iMeshal/arabic-sentiment-app](https://huggingface.co/spaces/iMeshal/arabic-sentiment-app)**

---

## 📊 Model Performance

The training data was split 80/20 into training and validation sets. The final evaluation was performed on a separate, unseen test set.

**Final Test Set Results (Accuracy: 94.40%)**

| Metric | Score |
| :--- | :---: |
| **Accuracy** | **94.40%** |
| F1 (Macro) | 94.40% |
| Precision (Macro) | 94.40% |
| Recall (Macro) | 94.40% |
| Loss | 0.1667 |

The model achieved its best validation accuracy of **93.4%** at Epoch 2. `load_best_model_at_end` was enabled, so the best checkpoint was restored when training finished.

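
The macro-averaged rows in the table are unweighted means of the per-class scores. Below is a minimal sketch of that computation in plain Python. The function name `macro_scores` and the toy label lists are illustrative assumptions; this is not the evaluation code or data used for this model.

```python
def macro_scores(y_true, y_pred, labels=(0, 1)):
    """Return (precision, recall, f1), each averaged equally over `labels`."""
    precisions, recalls, f1s = [], [], []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Toy labels, made up for illustration: 1 = Positive, 0 = Negative.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]
precision, recall, f1 = macro_scores(y_true, y_pred)
print(precision, recall, f1)
```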

---

## 💻 Intended Use (How to use)

You can use this model directly with the `transformers` pipeline.

```python
from transformers import pipeline

# Load the sentiment-analysis pipeline with the fine-tuned model
pipe = pipeline(
    "sentiment-analysis",
    model="iMeshal/arabic-sentiment-classifier-marbert"
)

# Test with new texts
texts = [
    # "This product is really great, I recommend it"
    "هذا المنتج رائع جداً أنصح به",
    # "The worst customer service ever"
    "أسوأ خدمة عملاء على الإطلاق",
    # "The weather is nice today"
    "الجو اليوم جميل"
]

results = pipe(texts)
print(results)
# Output:
# [
#   {'label': 'LABEL_1', 'score': 0.99...},  # Positive
#   {'label': 'LABEL_0', 'score': 0.99...},  # Negative
#   {'label': 'LABEL_1', 'score': 0.98...}   # Positive
# ]
```
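
The pipeline returns the generic names `LABEL_0` and `LABEL_1`. A small helper can translate them into readable names using the mapping stated at the top of this card. This is only a sketch: `LABEL_NAMES` and `readable` are hypothetical names, and the `results` list below is hand-written in the shape the pipeline returns, not real model output.

```python
# Mapping taken from this card: LABEL_0 = Negative, LABEL_1 = Positive.
LABEL_NAMES = {"LABEL_0": "Negative", "LABEL_1": "Positive"}


def readable(results):
    """Replace generic label IDs with readable names, keeping the scores."""
    return [
        {"label": LABEL_NAMES.get(r["label"], r["label"]), "score": r["score"]}
        for r in results
    ]


# Hand-written stand-in for `pipe(texts)`; the scores are placeholders.
results = [
    {"label": "LABEL_1", "score": 0.99},
    {"label": "LABEL_0", "score": 0.97},
]
print(readable(results))
```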

## 📚 Training Data

The model was trained on the **[Arabic Sentiment Twitter Corpus](https://www.kaggle.com/datasets/mksaad/arabic-sentiment-twitter-corpus)** dataset from Kaggle.

* **Preprocessing:** Long/concatenated tweets (which appeared to be noise) were cleaned.
* **Training Set:** ~24,163 samples.
* **Validation Set:** ~6,041 samples.
* **Test Set:** ~11,508 samples.
* **Balance:** All splits were balanced (approx. 50% Positive / 50% Negative).

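
An 80/20 training/validation split stays balanced if each class is split separately. The following is a minimal sketch in plain Python, not the author's actual splitting code; the function `stratified_split`, the seed, and the toy data are assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_split(samples, val_fraction=0.2, seed=42):
    """Split (text, label) pairs into train/validation lists, class by class."""
    by_label = defaultdict(list)
    for text, label in samples:
        by_label[label].append((text, label))

    rng = random.Random(seed)
    train, val = [], []
    for label in sorted(by_label):
        items = by_label[label]
        rng.shuffle(items)
        n_val = round(len(items) * val_fraction)
        val.extend(items[:n_val])
        train.extend(items[n_val:])

    # Shuffle again so the classes are interleaved within each split.
    rng.shuffle(train)
    rng.shuffle(val)
    return train, val


# Toy data: 50 positive (1) and 50 negative (0) placeholder tweets.
samples = [(f"tweet {i}", i % 2) for i in range(100)]
train, val = stratified_split(samples)
print(len(train), len(val))
```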

---

## ⚙️ Training Procedure

The model was trained using the `transformers.Trainer` class with the following key hyperparameters:

* **Framework:** PyTorch
* **Base Model:** `UBC-NLP/MARBERTv2`
* **Epochs:** 3 (with Early Stopping)
* **Early Stopping:** Patience set to 2 (training ended at Epoch 3; Epoch 2 was the best).
* **Batch Size:** 16
* **Learning Rate:** 2e-5
* **Tokenizer:** `AutoTokenizer` (with `padding="max_length"`, `truncation=True`, `max_length=512`)

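
The hyperparameters listed above map onto `transformers.TrainingArguments` keyword arguments roughly as follows. This is a sketch, not the author's training script. Treating the batch size as per-device, saving and evaluating once per epoch, and selecting the best checkpoint by accuracy are all assumptions, and `make_training_arguments` is a hypothetical helper. The evaluation-strategy argument is named `eval_strategy` in recent `transformers` releases and `evaluation_strategy` in older ones, so the helper checks which name is available.

```python
import inspect

# Values stated in this card, unless marked as an assumption.
TRAINING_KWARGS = {
    "num_train_epochs": 3,
    "per_device_train_batch_size": 16,    # assumption: stated batch size is per device
    "learning_rate": 2e-5,
    "load_best_model_at_end": True,
    "save_strategy": "epoch",             # assumption
    "metric_for_best_model": "accuracy",  # assumption
}
# Would be passed as EarlyStoppingCallback(early_stopping_patience=...)
# in the Trainer's `callbacks` list.
EARLY_STOPPING_PATIENCE = 2
# Keyword arguments for the tokenizer call, as listed in this card.
TOKENIZER_KWARGS = {"padding": "max_length", "truncation": True, "max_length": 512}


def make_training_arguments(output_dir):
    """Build TrainingArguments; transformers is imported only when this is called."""
    from transformers import TrainingArguments

    params = inspect.signature(TrainingArguments.__init__).parameters
    eval_key = "eval_strategy" if "eval_strategy" in params else "evaluation_strategy"
    return TrainingArguments(
        output_dir=output_dir,
        **{eval_key: "epoch"},  # assumption: evaluate once per epoch
        **TRAINING_KWARGS,
    )
```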

---

### 📞 Contact

* **Name:** Meshal AL-Qushaym
* **Email:** meshalqushim@outlook.com
* **Kaggle:** [kaggle.com/meshalfalah](https://www.kaggle.com/meshalfalah)