---
language: ar
license: apache-2.0
library_name: transformers
tags:
- sentiment-analysis
- arabic
- marbert
- twitter
- text-classification
datasets:
- mksaad/arabic-sentiment-twitter-corpus
metrics:
- accuracy
- f1
- precision
- recall
---

# MARBERT Model for Arabic Sentiment Analysis (Positive/Negative)

This is a fine-tuned version of `UBC-NLP/MARBERTv2` for Arabic sentiment analysis. The model classifies Arabic text (specifically tweets) into two categories: **Positive (`LABEL_1`)** or **Negative (`LABEL_0`)**.

## 🚀 Live Demo

You can test the model live on the Hugging Face Space:
**[https://huggingface.co/spaces/iMeshal/arabic-sentiment-app](https://huggingface.co/spaces/iMeshal/arabic-sentiment-app)**

---

## 📊 Model Performance

The model was trained on 80% of the training data and validated on the remaining 20%. The final evaluation was performed on a separate, unseen test set.

**Final Test Set Results (Accuracy: 94.40%)**

| Metric            |   Score    |
| :---------------- | :--------: |
| **Accuracy**      | **94.40%** |
| F1 (Macro)        |   94.40%   |
| Precision (Macro) |   94.40%   |
| Recall (Macro)    |   94.40%   |
| Loss              |   0.1667   |

The model achieved its best validation accuracy of **93.4%** at Epoch 2; since `load_best_model_at_end` was enabled, that checkpoint was retained as the final model.

---

## 💻 Intended Use (How to Use)

You can use this model directly with the `transformers` pipeline:

```python
from transformers import pipeline

# Load the fine-tuned model as a sentiment-analysis pipeline
pipe = pipeline(
    "sentiment-analysis",
    model="iMeshal/arabic-sentiment-classifier-marbert"
)

# Test with new texts
texts = [
    "هذا المنتج رائع جداً أنصح به",  # "This product is great, I recommend it"
    "أسوأ خدمة عملاء على الإطلاق",  # "The worst customer service ever"
    "الجو اليوم جميل",  # "The weather is beautiful today"
]

results = pipe(texts)
print(results)
# Expected output:
# [
#     {'label': 'LABEL_1', 'score': 0.99...},  # Positive
#     {'label': 'LABEL_0', 'score': 0.99...},  # Negative
#     {'label': 'LABEL_1', 'score': 0.98...}   # Positive
# ]
```

A lower-level inference example using the raw model and tokenizer is included at the end of this card.

## 📚 Training Data

The model was trained on the **[Arabic Sentiment Twitter Corpus](https://www.kaggle.com/datasets/mksaad/arabic-sentiment-twitter-corpus)** dataset from Kaggle.

* **Preprocessing:** Abnormally long, concatenated tweets (which appeared to be noise) were cleaned out.
* **Training Set:** ~24,163 samples.
* **Validation Set:** ~6,041 samples.
* **Test Set:** ~11,508 samples.
* **Balance:** All splits were balanced (approximately 50% Positive / 50% Negative).

---

## ⚙️ Training Procedure

The model was trained with the `transformers.Trainer` class using the following key hyperparameters (a minimal training sketch based on these settings appears at the end of this card):

* **Framework:** PyTorch
* **Base Model:** `UBC-NLP/MARBERTv2`
* **Epochs:** 3 (with early stopping)
* **Early Stopping:** Patience of 2; training completed all 3 epochs, with Epoch 2 yielding the best validation accuracy.
* **Batch Size:** 16
* **Learning Rate:** 2e-5
* **Tokenizer:** `AutoTokenizer` (with `padding="max_length"`, `truncation=True`, `max_length=512`)

---

## 📞 Contact

* **Name:** Meshal AL-Qushaym
* **Email:** meshalqushim@outlook.com
* **Kaggle:** [kaggle.com/meshalfalah](https://www.kaggle.com/meshalfalah)
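---

## 🔍 Lower-Level Inference Sketch

For cases where the `pipeline` wrapper is not convenient, here is a minimal sketch of the raw-model inference path referenced in the "Intended Use" section. It is illustrative rather than an official API of this model; the human-readable label names follow the `LABEL_0` = Negative, `LABEL_1` = Positive scheme stated above.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "iMeshal/arabic-sentiment-classifier-marbert"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

# Label scheme from this card: LABEL_0 = Negative, LABEL_1 = Positive
id2name = {0: "Negative", 1: "Positive"}

text = "هذا المنتج رائع جداً أنصح به"  # "This product is great, I recommend it"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    logits = model(**inputs).logits

# Convert logits to probabilities and pick the highest-scoring class
probs = torch.softmax(logits, dim=-1)[0]
pred = int(torch.argmax(probs))
print(id2name[pred], round(float(probs[pred]), 4))
```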
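---

## 🧪 Training Sketch

The following sketch reconstructs the fine-tuning setup from the hyperparameters listed in the "Training Procedure" section. The dataset file names and column names (`text`, `label`) and the output path are assumptions for illustration; this is not the exact training script.

```python
import numpy as np
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
)

model_name = "UBC-NLP/MARBERTv2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Assumed: CSV splits with "text" and "label" (0 = Negative, 1 = Positive) columns
dataset = load_dataset(
    "csv", data_files={"train": "train.csv", "validation": "valid.csv"}
)

def tokenize(batch):
    # Tokenization settings listed in the Training Procedure section
    return tokenizer(
        batch["text"], padding="max_length", truncation=True, max_length=512
    )

dataset = dataset.map(tokenize, batched=True)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"accuracy": (preds == labels).mean()}

args = TrainingArguments(
    output_dir="marbert-arabic-sentiment",  # assumed output path
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    learning_rate=2e-5,
    eval_strategy="epoch",  # "evaluation_strategy" on older transformers versions
    save_strategy="epoch",
    load_best_model_at_end=True,  # retains the best checkpoint (Epoch 2 in the reported run)
    metric_for_best_model="accuracy",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
)

trainer.train()
```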