---
language: ar
license: mit
library_name: transformers
pipeline_tag: text-classification
datasets:
- custom
tags:
- arabic
- text-classification
- iraqi-dialect
- msa
- message-classification
- xlm-roberta
widget:
- text: "السلام عليكم ورحمة الله وبركاته"
  example_title: "Arabic Greeting"
- text: "شلونك اليوم؟"
  example_title: "Iraqi Question"
- text: "عندي مشكلة بالانترنت"
  example_title: "Iraqi Complaint"
- text: "أحب القراءة كثيراً"
  example_title: "General Statement"
model-index:
- name: Arabic_MI_Classifier
  results:
  - task:
      type: text-classification
      name: Text Classification
    dataset:
      type: custom
      name: Arabic Messages Dataset
    metrics:
    - type: accuracy
      value: 0.95
      name: Accuracy
---

# Arabic Message Classification Model

This model fine-tunes XLM-RoBERTa for Arabic message classification, supporting both Modern Standard Arabic (MSA) and the Iraqi dialect.

## Model Description

- **Developed by:** Ahmed Majid
- **Model type:** XLM-RoBERTa for sequence classification
- **Language(s):** Arabic (MSA and Iraqi dialect)
- **License:** MIT
- **Finetuned from model:** morit/arabic_xlm_xnli

## Intended Uses

### Direct Use

This model can be used for:

- Classifying Arabic messages in customer-service systems
- Organizing Arabic text messages by intent
- Building chatbots for Arabic-speaking users
- Content moderation for Arabic forums and social media

### Downstream Use

The model can be further fine-tuned for:

- Other Arabic dialects
- Domain-specific message classification
- Multi-label classification tasks

## How to Get Started with the Model

Use the code below to get started with the model.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

model_name = "ahmedmajid92/Arabic_MI_Classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
result = classifier("السلام عليكم")
print(result)
```
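The pipeline call returns only the top label with its score. If the full probability distribution over classes is needed, a minimal sketch working from the raw logits (label names are read from the model's `config.id2label`, so nothing is hard-coded):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "ahmedmajid92/Arabic_MI_Classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Tokenize with the same settings used in training (max length 128).
inputs = tokenizer("شلونك اليوم؟", return_tensors="pt", truncation=True, max_length=128)
with torch.no_grad():
    logits = model(**inputs).logits

# Softmax converts logits into a probability distribution over the classes.
probs = torch.softmax(logits, dim=-1)[0]
for idx, p in enumerate(probs):
    print(model.config.id2label[idx], round(float(p), 3))
```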

## Training Details

### Training Data

The model was trained on a custom dataset of 5,000 Arabic messages:

- 50% Modern Standard Arabic (MSA)
- 50% Iraqi dialect
- 4 classes: greeting, question, complaint, general
- Balanced distribution: 1,250 examples per class

### Training Procedure

#### Preprocessing

- Tokenization using the XLM-RoBERTa tokenizer
- Maximum sequence length: 128 tokens
- Padding and truncation applied

#### Training Hyperparameters

- **Training regime:** fp32
- **Epochs:** 20
- **Batch size:** 8 (training), 16 (evaluation)
- **Optimizer:** AdamW
- **Learning rate:** `transformers` default (5e-5)
- **Warmup steps:** not specified
- **Weight decay:** `transformers` default (0.0)

#### Speeds, Sizes, Times

- **Model size:** ~280M parameters
- **Training time:** approximately 2-3 hours on a single GPU
- **Inference time:** ~50 ms per message on GPU

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

- 10% of the custom dataset (500 examples)
- Balanced across all 4 classes
- Mix of MSA and Iraqi dialect

#### Factors

- **Dialects:** MSA vs. Iraqi dialect
- **Message length:** short to medium messages
- **Domain:** general conversational messages

#### Metrics

- **Accuracy:** primary metric
- **Per-class performance:** evaluated for each label
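Overall and per-class accuracy can be computed without extra dependencies; a small sketch (the prediction and gold lists below are illustrative, not real evaluation output):

```python
from collections import defaultdict

def accuracy_report(predictions, references):
    """Overall accuracy plus per-label accuracy (recall per class)."""
    correct, total = defaultdict(int), defaultdict(int)
    for pred, gold in zip(predictions, references):
        total[gold] += 1
        correct[gold] += pred == gold
    overall = sum(correct.values()) / len(references)
    per_class = {label: correct[label] / total[label] for label in total}
    return overall, per_class

# Illustrative labels only.
preds = ["greeting", "question", "complaint", "general", "greeting"]
golds = ["greeting", "question", "question", "general", "greeting"]
overall, per_class = accuracy_report(preds, golds)
print(overall)                # 0.8
print(per_class["question"])  # 0.5
```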

### Results

The model achieves 95% accuracy on the held-out test set, performing well across all four classes:

- Greeting detection (both MSA and Iraqi)
- Question identification
- Complaint classification
- General statement recognition

## Environmental Impact

Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute).

## Technical Specifications

### Model Architecture and Objective

- **Architecture:** XLM-RoBERTa with a classification head
- **Objective:** multi-class text classification
- **Base model:** morit/arabic_xlm_xnli
- **Classification head:** 4 output classes
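The 4-way head corresponds to a config along these lines; the label names are taken from the Training Data section, but their index order is an assumption, not something the card specifies:

```python
from transformers import XLMRobertaConfig

# Assumed label order; the released model's config.id2label is authoritative.
id2label = {0: "greeting", 1: "question", 2: "complaint", 3: "general"}
config = XLMRobertaConfig(
    id2label=id2label,
    label2id={label: idx for idx, label in id2label.items()},
)
# num_labels is derived from the label map.
print(config.num_labels)   # 4
print(config.id2label[2])  # complaint
```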

### Compute Infrastructure

#### Hardware

- **GPU:** NVIDIA GPU (recommended)
- **Memory:** 8 GB+ GPU memory recommended

#### Software

- **Framework:** PyTorch
- **Libraries:** Transformers, Datasets
- **Python version:** 3.8+

## Bias, Risks, and Limitations

### Bias

- The model may exhibit biases present in the training data
- Performance may vary between different Arabic dialects
- Regional variations in Iraqi dialect may not be fully captured

### Risks

- Misclassification of ambiguous messages
- Potential cultural bias in greeting/complaint detection
- Limited generalization to formal/informal register variations

### Limitations

- Supports only 4 predefined classes
- Optimized specifically for MSA and the Iraqi dialect
- Maximum input length of 128 tokens
- May not generalize well to other Arabic dialects

## Additional Information

### Author

Ahmed Majid

### Licensing Information

This model is released under the MIT License.

### Citation Information

```bibtex
@misc{arabic-mi-classifier,
  title={Arabic Message Classification Model},
  author={Ahmed Majid},
  year={2025},
  howpublished={Hugging Face Model Hub},
  url={https://huggingface.co/ahmedmajid92/Arabic_MI_Classifier}
}
```

### Acknowledgments

- Based on the XLM-RoBERTa model by Facebook AI
- Fine-tuned from morit/arabic_xlm_xnli
- Trained on a custom Arabic message dataset