---
language: ar
license: mit
library_name: transformers
pipeline_tag: text-classification
datasets:
- custom
tags:
- arabic
- text-classification
- iraqi-dialect
- msa
- message-classification
- xlm-roberta
widget:
- text: السلام عليكم ورحمة الله وبركاته
  example_title: Arabic Greeting
- text: شلونك اليوم؟
  example_title: Iraqi Question
- text: عندي مشكلة بالانترنت
  example_title: Iraqi Complaint
- text: أحب القراءة كثيراً
  example_title: General Statement
model-index:
- name: Arabic_MI_Classifier
  results:
  - task:
      type: text-classification
      name: Text Classification
    dataset:
      type: custom
      name: Arabic Messages Dataset
    metrics:
    - type: accuracy
      value: 0.95
      name: Accuracy
---
# Arabic Message Classification Model

This model fine-tunes XLM-RoBERTa for Arabic message classification, supporting both Modern Standard Arabic (MSA) and the Iraqi dialect.

## Model Description

- Developed by: Ahmed Majid
- Model type: XLM-RoBERTa for Sequence Classification
- Language(s): Arabic (MSA and Iraqi dialect)
- License: MIT
- Finetuned from model: morit/arabic_xlm_xnli
## Intended Uses

### Direct Use

This model can be used for:
- Classifying Arabic messages in customer service systems
- Organizing Arabic text messages by intent
- Building chatbots for Arabic-speaking users
- Content moderation for Arabic forums and social media
### Downstream Use

The model can be further fine-tuned for:
- Other Arabic dialects
- Domain-specific message classification
- Multi-label classification tasks
## How to Get Started with the Model

Use the code below to get started with the model:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

# Load the fine-tuned model and tokenizer from the Hub
model_name = "ahmedmajid92/Arabic_MI_Classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Build a text-classification pipeline around them
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)

result = classifier("السلام عليكم")  # "Peace be upon you" (a greeting)
print(result)  # a list of {'label': ..., 'score': ...} dicts
```
## Training Details

### Training Data

The model was trained on a custom dataset of 5,000 Arabic messages:
- 50% Modern Standard Arabic (MSA)
- 50% Iraqi dialect
- 4 classes: greeting, question, complaint, general
- Balanced distribution: 1,250 examples per class
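A sketch of a label mapping for these four classes (the index order here is an assumption; the authoritative mapping is the `id2label` entry in the model's `config.json`):

```python
# Assumed index order; check the model's config.json (id2label / label2id)
# for the actual mapping used at training time.
label2id = {"greeting": 0, "question": 1, "complaint": 2, "general": 3}
id2label = {i: name for name, i in label2id.items()}

def label_name(class_index):
    """Map a predicted class index back to its label name."""
    return id2label[class_index]
```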
### Training Procedure

#### Preprocessing

- Tokenization using XLM-RoBERTa tokenizer
- Maximum sequence length: 128 tokens
- Padding and truncation applied
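The padding/truncation step fixes every example at 128 token ids. In practice the tokenizer handles this via `tokenizer(texts, padding="max_length", truncation=True, max_length=128)`; a plain-Python sketch of the idea (XLM-RoBERTa's pad token id is 1):

```python
def pad_or_truncate(token_ids, max_length=128, pad_id=1):
    """Truncate to max_length, then right-pad with pad_id (XLM-R uses id 1)."""
    token_ids = token_ids[:max_length]
    return token_ids + [pad_id] * (max_length - len(token_ids))
```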
#### Training Hyperparameters

- Training regime: fp32
- Epochs: 20
- Batch size: 8 (training), 16 (evaluation)
- Optimizer: AdamW
- Learning rate: Transformers default (5e-5)
- Warmup steps: not specified
- Weight decay: Transformers default (0.0)
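For reference, a sketch of these hyperparameters as Hugging Face `TrainingArguments` (the output path is a hypothetical placeholder, and the defaults are filled in explicitly; this mirrors the list above rather than the original training script):

```python
from transformers import TrainingArguments

# Sketch only: mirrors the hyperparameters listed above.
training_args = TrainingArguments(
    output_dir="./arabic-mi-classifier",  # hypothetical path
    num_train_epochs=20,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=16,
    learning_rate=5e-5,  # Transformers default
    weight_decay=0.0,    # Transformers default
)
```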
#### Speeds, Sizes, Times

- Model size: ~280M parameters
- Training time: Approximately 2-3 hours on GPU
- Inference time: ~50ms per message on GPU
## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

- 10% of the custom dataset (500 examples)
- Balanced across all 4 classes
- Mix of MSA and Iraqi dialect
#### Factors

- Dialects: MSA vs Iraqi dialect
- Message length: Short to medium messages
- Domain: General conversational messages
#### Metrics

- Accuracy: Primary metric
- Per-class performance: Evaluated for each label
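Per-class performance here means, for each label, the fraction of its test examples the model classified correctly (per-class recall). A minimal sketch on illustrative dummy labels (not the model's actual predictions):

```python
from collections import defaultdict

def per_class_accuracy(y_true, y_pred):
    """Fraction of examples of each true class that were predicted correctly."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    return {label: correct[label] / total[label] for label in total}

# Illustrative example using this model's four labels
y_true = ["greeting", "question", "complaint", "general", "greeting"]
y_pred = ["greeting", "question", "general", "general", "greeting"]
print(per_class_accuracy(y_true, y_pred))
# {'greeting': 1.0, 'question': 1.0, 'complaint': 0.0, 'general': 1.0}
```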
### Results

The model reaches 95% accuracy overall on the held-out test set, performing well across all four classes:
- Greeting detection (both MSA and Iraqi)
- Question identification
- Complaint classification
- General statement recognition
## Environmental Impact

Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute).
## Technical Specifications

### Model Architecture and Objective

- Architecture: XLM-RoBERTa with classification head
- Objective: Multi-class text classification
- Base model: morit/arabic_xlm_xnli
- Classification head: 4 output classes
### Compute Infrastructure

#### Hardware

- GPU: NVIDIA GPU (recommended)
- Memory: 8GB+ GPU memory recommended
#### Software

- Framework: PyTorch
- Libraries: Transformers, Datasets
- Python version: 3.8+
## Bias, Risks, and Limitations

### Bias

- The model may exhibit biases present in the training data
- Performance may vary between different Arabic dialects
- Regional variations in Iraqi dialect may not be fully captured
### Risks

- Misclassification of ambiguous messages
- Potential cultural bias in greeting/complaint detection
- Limited generalization to formal/informal register variations
### Limitations

- Only supports 4 predefined classes
- Optimized for MSA and Iraqi dialect specifically
- Maximum input length of 128 tokens
- May not generalize well to other Arabic dialects
## Additional Information

### Author

Ahmed Majid
### Licensing Information

This model is released under the MIT License.
### Citation Information

```bibtex
@misc{arabic-mi-classifier,
  title={Arabic Message Classification Model},
  author={Ahmed Majid},
  year={2025},
  howpublished={Hugging Face Model Hub},
  url={https://huggingface.co/ahmedmajid92/Arabic_MI_Classifier}
}
```
### Acknowledgments

- Based on the XLM-RoBERTa model by Facebook AI
- Fine-tuned from morit/arabic_xlm_xnli
- Trained on custom Arabic message dataset