---
language: ar
license: mit
library_name: transformers
pipeline_tag: text-classification
datasets:
- custom
tags:
- arabic
- text-classification
- iraqi-dialect
- msa
- message-classification
- xlm-roberta
- fine-tuned
widget:
- text: "السلام عليكم ورحمة الله وبركاته"
  example_title: "Arabic Greeting (MSA)"
- text: "هلو شلونك اليوم؟"
  example_title: "Iraqi Greeting + Question"
- text: "متى يبدأ الاجتماع؟"
  example_title: "Question (MSA)"
- text: "عندي مشكلة بالانترنت"
  example_title: "Complaint (Iraqi)"
- text: "أحب القراءة والكتابة"
  example_title: "General Statement (MSA)"
- text: "الكهرباء نفطت"
  example_title: "Complaint (Iraqi)"
model-index:
- name: Arabic_MI_Classifier
  results:
  - task:
      type: text-classification
      name: Text Classification
    dataset:
      type: custom
      name: Arabic Messages Dataset (MSA + Iraqi)
    metrics:
    - type: accuracy
      value: 0.95
      name: Accuracy
base_model: morit/arabic_xlm_xnli
---
# Arabic Message Classification Model
## Model Description
This is a fine-tuned XLM-RoBERTa model for Arabic message classification, specifically designed to classify messages in both Modern Standard Arabic (MSA) and Iraqi dialect. The model is based on `morit/arabic_xlm_xnli` and has been fine-tuned on a custom dataset of 5,000 Arabic messages.
## Model Details
- **Base Model**: `morit/arabic_xlm_xnli`
- **Architecture**: XLMRobertaForSequenceClassification
- **Language**: Arabic (MSA and Iraqi dialect)
- **Task**: Text Classification
- **Number of Labels**: 4
- **Model Size**: ~280M parameters
## Labels
The model classifies messages into four categories:
| Label ID | Label Name | Description | Examples |
|----------|------------|-------------|----------|
| 0 | greeting | Greetings and salutations | "السلام عليكم", "هلو", "مرحبا" |
| 1 | question | Questions and inquiries | "كيف حالك؟", "شلونك؟", "متى الاجتماع؟" |
| 2 | complaint | Complaints and problems | "عندي مشكلة", "الانترنت معطل", "الجهاز لا يعمل" |
| 3 | general | General statements | "أحب القراءة", "أعمل مهندساً", "أسافر كثيراً" |
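The table above can be read as a lookup from class index to label name. The following is a minimal sketch of that lookup in plain Python, assuming the model emits one raw score (logit) per label in the order shown in the table. The helper name `decode_logits` and the example scores are hypothetical and chosen only for illustration.

```python
import math

# Label mapping copied from the table above.
ID2LABEL = {0: "greeting", 1: "question", 2: "complaint", 3: "general"}
LABEL2ID = {name: idx for idx, name in ID2LABEL.items()}


def decode_logits(logits):
    """Return (label, probability) for one list of four raw scores."""
    if len(logits) != len(ID2LABEL):
        raise ValueError(f"expected {len(ID2LABEL)} scores, got {len(logits)}")
    # Numerically stable softmax: subtract the maximum before exponentiating.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Pick the index with the highest probability.
    best = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[best], probs[best]


# Hypothetical logits, made up for illustration only.
label, prob = decode_logits([0.2, 3.1, -0.5, 0.4])
print(label, round(prob, 3))
```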
## Training Data
The model was trained on a custom dataset containing:
- **5,000 Arabic messages** (50% MSA, 50% Iraqi dialect)
- **Balanced distribution**: 1,250 examples per class
- **Train/Test Split**: 90%/10%
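With 5,000 messages and a 90%/10% split, the figures above imply 4,500 training and 500 test messages. The sketch below reproduces that arithmetic on placeholder data. It assumes a stratified split (equal test share per class), which this card does not state, and the helper `stratified_split` is hypothetical rather than the author's actual preprocessing code.

```python
import random
from collections import Counter

LABELS = ["greeting", "question", "complaint", "general"]
PER_CLASS = 1250       # balanced distribution stated above
TEST_FRACTION = 0.10   # 90%/10% train/test split


def stratified_split(examples, test_fraction, seed=0):
    """Split (text, label) pairs so each label keeps the same test share."""
    rng = random.Random(seed)
    by_label = {}
    for text, label in examples:
        by_label.setdefault(label, []).append((text, label))
    train, test = [], []
    for items in by_label.values():
        rng.shuffle(items)
        n_test = round(len(items) * test_fraction)
        test.extend(items[:n_test])
        train.extend(items[n_test:])
    return train, test


# Placeholder strings stand in for the real messages, which are not published here.
dataset = [(f"{label}-{i}", label) for label in LABELS for i in range(PER_CLASS)]
train_set, test_set = stratified_split(dataset, TEST_FRACTION)

print(len(train_set), len(test_set))
print(Counter(label for _, label in test_set))
```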
## Training Details
- **Training Epochs**: 20
- **Batch Size**: 8 (training), 16 (evaluation)
- **Optimizer**: AdamW with its default learning rate
- **Maximum Sequence Length**: 128 tokens
- **Evaluation Strategy**: Every 500 steps
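These settings determine roughly how many optimizer steps and evaluation runs the training involved. The estimate below is a back-of-the-envelope sketch that assumes a single device, no gradient accumulation, and that the final partial batch of each epoch is kept. None of these assumptions is confirmed by this card.

```python
import math

TRAIN_EXAMPLES = 4500   # 90% of the 5,000 messages
TRAIN_BATCH_SIZE = 8
EPOCHS = 20
EVAL_EVERY = 500        # evaluation interval, in steps

# One optimizer step per batch; the last batch of an epoch may be smaller.
steps_per_epoch = math.ceil(TRAIN_EXAMPLES / TRAIN_BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS
# Number of evaluations triggered by the step interval.
eval_runs = total_steps // EVAL_EVERY

print(f"steps per epoch: {steps_per_epoch}")
print(f"total steps:     {total_steps}")
print(f"evaluation runs: {eval_runs}")
```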
## Usage
### Using Transformers Pipeline
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
# Load the model and tokenizer
model_name = "ahmedmajid92/Arabic_MI_Classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
# Create a classification pipeline
classifier = pipeline(
    "text-classification",
    model=model,
    tokenizer=tokenizer,
)
# Classify a message
text = "السلام عليكم ورحمة الله"
result = classifier(text)
print(f"Label: {result[0]['label']}, Score: {result[0]['score']:.4f}")
```
### Using the Model Directly
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
# Load model and tokenizer
model_name = "ahmedmajid92/Arabic_MI_Classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
# Tokenize input
text = "شلونك اليوم؟"
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=128)
# Get predictions
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
predicted_class_id = predictions.argmax().item()
confidence = predictions.max().item()
# Map to label names
id2label = {0: "greeting", 1: "question", 2: "complaint", 3: "general"}
predicted_label = id2label[predicted_class_id]
print(f"Text: {text}")
print(f"Predicted Label: {predicted_label}")
print(f"Confidence: {confidence:.4f}")
```
### Gradio Web Interface
```python
import gradio as gr
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
# Load model
model_name = "ahmedmajid92/Arabic_MI_Classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
# Create classifier
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
def classify_text(text):
    # Return the top label and its score for a single input string
    result = classifier(text)[0]
    return result["label"], float(result["score"])
# Create Gradio interface
iface = gr.Interface(
    fn=classify_text,
    inputs=gr.Textbox(lines=2, placeholder="اكتب جملتك هنا…", label="Input Text"),
    outputs=[
        gr.Textbox(label="Predicted Label"),
        gr.Number(label="Confidence"),
    ],
    title="Arabic Message Classifier",
    description="Classify Arabic messages into: greeting, question, complaint, or general.",
)
iface.launch()
```
## Model Performance
The model card metadata reports an accuracy of 0.95 on the Arabic Messages Dataset (MSA + Iraqi). The model is particularly effective at:
- Distinguishing between greetings and general statements
- Identifying questions in both MSA and Iraqi dialect
- Classifying complaints and technical issues
- Handling mixed dialectal variations
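To check these strengths on your own labelled messages, overall accuracy and per-label recall are a reasonable starting point. The following is a minimal sketch. The helper `evaluate` is hypothetical, and the gold and predicted labels are made up for illustration, not taken from the model's actual outputs.

```python
from collections import Counter

LABELS = ["greeting", "question", "complaint", "general"]


def evaluate(gold, predicted):
    """Return overall accuracy and per-label recall for two label lists."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("at least one example is required")
    totals = Counter(gold)   # how often each label occurs in the gold data
    correct = Counter()      # how often each label was predicted correctly
    for g, p in zip(gold, predicted):
        if g == p:
            correct[g] += 1
    accuracy = sum(correct.values()) / len(gold)
    recall = {lab: correct[lab] / totals[lab] for lab in LABELS if totals[lab]}
    return accuracy, recall


# Made-up labels for illustration; these are not real model predictions.
gold = ["greeting", "question", "complaint", "general", "question", "complaint"]
pred = ["greeting", "question", "complaint", "general", "general", "complaint"]

accuracy, recall = evaluate(gold, pred)
print(f"accuracy={accuracy:.3f}")
for lab, value in recall.items():
    print(f"{lab}: recall={value:.2f}")
```

Per-label recall makes it easier to see whether errors concentrate in one category, such as questions being mistaken for general statements.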
## Supported Dialects
- **Modern Standard Arabic (MSA)**: Formal Arabic text
- **Iraqi Dialect**: Colloquial Iraqi Arabic expressions and vocabulary
## Limitations
- The model is specifically trained on MSA and Iraqi dialect; performance may vary with other Arabic dialects
- Limited to 4 predefined categories
- Performance depends on the similarity of input text to training data patterns
- The model was trained with a maximum sequence length of 128 tokens, so longer inputs should be truncated to that length
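One common way to work within these limitations is to reject low-confidence predictions instead of forcing every message into one of the four categories. The sketch below operates on dictionaries shaped like the pipeline results used in the Usage section (`label` and `score` keys). The helper `accept_prediction` and the 0.80 threshold are hypothetical and should be tuned on your own validation data.

```python
def accept_prediction(result, min_score=0.80):
    """Return the predicted label if its score clears min_score, else None."""
    return result["label"] if result["score"] >= min_score else None


# Hand-written results in the pipeline's output shape, for illustration only.
confident = {"label": "complaint", "score": 0.97}
uncertain = {"label": "general", "score": 0.41}

print(accept_prediction(confident))
print(accept_prediction(uncertain))
```

Messages that come back as `None` can be routed to a fallback, such as manual review.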
## Ethical Considerations
This model is intended for text classification purposes and should be used responsibly. Users should be aware that:
- The model may reflect biases present in the training data
- Performance may vary across different Arabic dialects not represented in training
- The model should not be used for sensitive applications without proper validation
## Citation
If you use this model in your research, please cite:
```bibtex
@misc{arabic-mi-classifier,
title={Arabic Message Classification Model},
author={Ahmed Majid},
year={2025},
howpublished={Hugging Face Model Hub},
url={https://huggingface.co/ahmedmajid92/Arabic_MI_Classifier}
}
```
## Model Card
For more detailed information about the model's intended use, training data, and ethical considerations, please refer to the model card.
## Contact
For questions or issues, please contact ahmed1991madrid@gmail.com or create an issue in the model repository.
## License
This model is released under the MIT License, same as the base model `morit/arabic_xlm_xnli`.