---
language: ar
license: mit
library_name: transformers
pipeline_tag: text-classification
datasets:
- custom
tags:
- arabic
- text-classification
- iraqi-dialect
- msa
- message-classification
- xlm-roberta
widget:
- text: "السلام عليكم ورحمة الله وبركاته"
  example_title: "Arabic Greeting"
- text: "شلونك اليوم؟"
  example_title: "Iraqi Question"
- text: "عندي مشكلة بالانترنت"
  example_title: "Iraqi Complaint"
- text: "أحب القراءة كثيراً"
  example_title: "General Statement"
model-index:
- name: Arabic_MI_Classifier
  results:
  - task:
      type: text-classification
      name: Text Classification
    dataset:
      type: custom
      name: Arabic Messages Dataset
    metrics:
    - type: accuracy
      value: 0.95
      name: Accuracy
---
# Arabic Message Classification Model
This model fine-tunes XLM-RoBERTa for Arabic message classification, supporting both Modern Standard Arabic (MSA) and Iraqi dialect.
## Model Description
- **Developed by:** Ahmed Majid
- **Model type:** XLM-RoBERTa for Sequence Classification
- **Language(s):** Arabic (MSA and Iraqi dialect)
- **License:** MIT
- **Finetuned from model:** morit/arabic_xlm_xnli
## Intended Uses
### Direct Use
This model can be used for:
- Classifying Arabic messages in customer service systems
- Organizing Arabic text messages by intent
- Building chatbots for Arabic-speaking users
- Content moderation for Arabic forums and social media
### Downstream Use
The model can be further fine-tuned for:
- Other Arabic dialects
- Domain-specific message classification
- Multi-label classification tasks
## How to Get Started with the Model
Use the code below to get started with the model.
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

model_name = "ahmedmajid92/Arabic_MI_Classifier"

# Load the fine-tuned tokenizer and classification model from the Hub
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Wrap them in a text-classification pipeline for one-line inference
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)

result = classifier("السلام عليكم")
print(result)  # list of {'label': ..., 'score': ...} dicts
```
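The pipeline returns a list of `{'label', 'score'}` dicts. If the model config exposes generic `LABEL_i` names rather than the class names, a small post-processing map can make the output readable. The label order below is a hypothetical assumption; check the model's `config.json` for the actual mapping:

```python
# Hypothetical mapping from generic config labels to the card's four classes;
# verify the actual id-to-label order in config.json before relying on it.
LABEL_NAMES = {
    "LABEL_0": "greeting",
    "LABEL_1": "question",
    "LABEL_2": "complaint",
    "LABEL_3": "general",
}

def readable(prediction):
    """Rewrite a pipeline prediction with a human-readable class name."""
    label = prediction["label"]
    return {"label": LABEL_NAMES.get(label, label), "score": prediction["score"]}

# Example pipeline-style output (illustrative values, not real model scores)
raw = [{"label": "LABEL_0", "score": 0.97}]
print([readable(p) for p in raw])  # [{'label': 'greeting', 'score': 0.97}]
```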
## Training Details
### Training Data
The model was trained on a custom dataset of 5,000 Arabic messages:
- 50% Modern Standard Arabic (MSA)
- 50% Iraqi dialect
- 4 classes: greeting, question, complaint, general
- Balanced distribution: 1,250 examples per class
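The class inventory above can be expressed as a label mapping. The id-to-label order here is a hypothetical sketch (the model's actual ordering may differ); the class names and counts come from the dataset description:

```python
# Hypothetical label mapping for the four classes described above;
# the real id-to-label order in the model config may differ.
id2label = {0: "greeting", 1: "question", 2: "complaint", 3: "general"}
label2id = {name: idx for idx, name in id2label.items()}

# Balanced distribution: 5,000 messages split evenly over 4 classes
examples_per_class = 5000 // len(id2label)
print(examples_per_class)  # 1250
```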
### Training Procedure
#### Preprocessing
- Tokenization using XLM-RoBERTa tokenizer
- Maximum sequence length: 128 tokens
- Padding and truncation applied
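The padding/truncation policy above can be sketched in plain Python (illustrative only, not the real XLM-RoBERTa tokenizer): every sequence is forced to exactly 128 token ids, cutting long inputs and padding short ones. The pad id of 1 follows XLM-RoBERTa's convention:

```python
# Illustrative sketch of the preprocessing policy: fixed-length sequences.
MAX_LEN = 128
PAD_ID = 1  # XLM-RoBERTa uses <pad> = 1

def pad_or_truncate(token_ids, max_len=MAX_LEN, pad_id=PAD_ID):
    """Truncate sequences longer than max_len, right-pad shorter ones."""
    ids = token_ids[:max_len]
    return ids + [pad_id] * (max_len - len(ids))

short = pad_or_truncate([5, 6, 7])          # padded up to 128
long = pad_or_truncate(list(range(200)))    # truncated down to 128
print(len(short), len(long))  # 128 128
```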
#### Training Hyperparameters
- **Training regime:** fp32
- **Epochs:** 20
- **Batch size:** 8 (training), 16 (evaluation)
- **Learning rate:** Transformers library default for AdamW
- **Warmup steps:** Not specified
- **Weight decay:** Library default
- **Optimizer:** AdamW
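Collected together, the hyperparameters above look roughly like the following `transformers.TrainingArguments`-style kwargs. This is a hedged sketch: the learning rate and weight decay values are the Transformers library defaults, since the card only says "default", and are assumptions rather than confirmed values:

```python
# Sketch of the training configuration as TrainingArguments-style kwargs.
# learning_rate and weight_decay are the Transformers defaults (assumed,
# not confirmed by the card); fp16=False reflects the fp32 training regime.
training_config = {
    "num_train_epochs": 20,
    "per_device_train_batch_size": 8,
    "per_device_eval_batch_size": 16,
    "learning_rate": 5e-5,
    "weight_decay": 0.0,
    "fp16": False,
}
print(training_config["num_train_epochs"])  # 20
```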
#### Speeds, Sizes, Times
- **Model size:** ~280M parameters
- **Training time:** Approximately 2-3 hours on GPU
- **Inference time:** ~50ms per message on GPU
## Evaluation
### Testing Data, Factors & Metrics
#### Testing Data
- 10% of the custom dataset (500 examples)
- Balanced across all 4 classes
- Mix of MSA and Iraqi dialect
#### Factors
- **Dialects:** MSA vs Iraqi dialect
- **Message length:** Short to medium messages
- **Domain:** General conversational messages
#### Metrics
- **Accuracy:** Primary metric
- **Per-class performance:** Evaluated for each label
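Overall accuracy plus per-class performance can be computed as sketched below (a minimal pure-Python illustration; in practice `sklearn.metrics.classification_report` would serve the same purpose). The toy labels are illustrative, not real model outputs:

```python
from collections import defaultdict

def accuracy_and_per_class(y_true, y_pred):
    """Overall accuracy plus per-label accuracy (i.e. per-class recall)."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    per_class = defaultdict(lambda: [0, 0])  # label -> [correct, total]
    for t, p in zip(y_true, y_pred):
        per_class[t][1] += 1
        per_class[t][0] += int(t == p)
    overall = correct / len(y_true)
    per_label = {label: c / n for label, (c, n) in per_class.items()}
    return overall, per_label

# Toy predictions over the four classes (illustrative only)
truth = ["greeting", "question", "complaint", "general", "greeting"]
preds = ["greeting", "question", "general",   "general", "greeting"]
overall, per_label = accuracy_and_per_class(truth, preds)
print(overall)                 # 0.8
print(per_label["complaint"])  # 0.0
```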
### Results
The model performs well across all four target classes:
- Greeting detection (both MSA and Iraqi)
- Question identification
- Complaint classification
- General statement recognition
## Environmental Impact
Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute).
## Technical Specifications
### Model Architecture and Objective
- **Architecture:** XLM-RoBERTa with classification head
- **Objective:** Multi-class text classification
- **Base model:** morit/arabic_xlm_xnli
- **Classification head:** 4 output classes
### Compute Infrastructure
#### Hardware
- **GPU:** NVIDIA GPU (recommended)
- **Memory:** 8GB+ GPU memory recommended
#### Software
- **Framework:** PyTorch
- **Libraries:** Transformers, Datasets
- **Python version:** 3.8+
## Bias, Risks, and Limitations
### Bias
- The model may exhibit biases present in the training data
- Performance may vary between different Arabic dialects
- Regional variations in Iraqi dialect may not be fully captured
### Risks
- Misclassification of ambiguous messages
- Potential cultural bias in greeting/complaint detection
- Limited generalization to formal/informal register variations
### Limitations
- Only supports 4 predefined classes
- Optimized for MSA and Iraqi dialect specifically
- Maximum input length of 128 tokens
- May not generalize well to other Arabic dialects
## Additional Information
### Author
Ahmed Majid
### Licensing Information
This model is released under the MIT License.
### Citation Information
```bibtex
@misc{arabic-mi-classifier,
  title={Arabic Message Classification Model},
  author={Ahmed Majid},
  year={2025},
  howpublished={Hugging Face Model Hub},
  url={https://huggingface.co/ahmedmajid92/Arabic_MI_Classifier}
}
```
### Acknowledgments
- Based on the XLM-RoBERTa model by Facebook AI
- Fine-tuned from morit/arabic_xlm_xnli
- Trained on custom Arabic message dataset