---
language: ar
license: mit
library_name: transformers
pipeline_tag: text-classification
datasets:
- custom
tags:
- arabic
- text-classification
- iraqi-dialect
- msa
- message-classification
- xlm-roberta
- fine-tuned
widget:
- text: "السلام عليكم ورحمة الله وبركاته"
  example_title: "Arabic Greeting (MSA)"
- text: "هلو شلونك اليوم؟"
  example_title: "Iraqi Greeting + Question"
- text: "متى يبدأ الاجتماع؟"
  example_title: "Question (MSA)"
- text: "عندي مشكلة بالانترنت"
  example_title: "Complaint (Iraqi)"
- text: "أحب القراءة والكتابة"
  example_title: "General Statement (MSA)"
- text: "الكهرباء نفطت"
  example_title: "Complaint (Iraqi)"
model-index:
- name: Arabic_MI_Classifier
  results:
  - task:
      type: text-classification
      name: Text Classification
    dataset:
      type: custom
      name: Arabic Messages Dataset (MSA + Iraqi)
    metrics:
    - type: accuracy
      value: 0.95
      name: Accuracy
base_model: morit/arabic_xlm_xnli
---
# Arabic Message Classification Model
## Model Description
This model is a fine-tuned XLM-RoBERTa classifier for Arabic messages, covering both Modern Standard Arabic (MSA) and the Iraqi dialect. It is based on `morit/arabic_xlm_xnli` and was fine-tuned on a custom dataset of 5,000 Arabic messages.
## Model Details
- **Base Model**: `morit/arabic_xlm_xnli`
- **Architecture**: XLMRobertaForSequenceClassification
- **Language**: Arabic (MSA and Iraqi dialect)
- **Task**: Text Classification
- **Number of Labels**: 4
- **Model Size**: ~280M parameters
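These details can be verified directly from the published configuration; a minimal sketch (the printed values depend on what is stored in the hub config):

```python
from transformers import AutoConfig

# Fetch only the configuration, not the ~280M-parameter weights
config = AutoConfig.from_pretrained("ahmedmajid92/Arabic_MI_Classifier")
print(config.architectures)  # expected: ['XLMRobertaForSequenceClassification']
print(config.num_labels)     # expected: 4
print(config.id2label)       # id -> label-name mapping, if stored in the config
```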
## Labels
The model classifies messages into four categories:

| Label ID | Label Name | Description | Examples |
|----------|------------|-------------|----------|
| 0 | greeting | Greetings and salutations | "السلام عليكم", "هلو", "مرحبا" |
| 1 | question | Questions and inquiries | "كيف حالك؟", "شلونك؟", "متى الاجتماع؟" |
| 2 | complaint | Complaints and problems | "عندي مشكلة", "الانترنت معطل", "الجهاز لا يعمل" |
| 3 | general | General statements | "أحب القراءة", "أعمل مهندساً", "أسافر كثيراً" |
## Training Data
The model was trained on a custom dataset containing:
- **5,000 Arabic messages** (50% MSA, 50% Iraqi dialect)
- **Balanced distribution**: 1,250 examples per class
- **Train/Test Split**: 90%/10%
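The dataset itself is custom and not published; the sketch below assumes a hypothetical CSV with `text` and `label` columns and reproduces the stated 90/10 split:

```python
from datasets import load_dataset

# Hypothetical file: "arabic_messages.csv" with "text" and "label" columns
ds = load_dataset("csv", data_files="arabic_messages.csv")["train"]

# 90% train / 10% test, as stated above
splits = ds.train_test_split(test_size=0.1, seed=42)
train_ds, test_ds = splits["train"], splits["test"]
```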
## Training Details
- **Training Epochs**: 20
- **Batch Size**: 8 (training), 16 (evaluation)
- **Optimizer**: AdamW with its default learning rate
- **Maximum Sequence Length**: 128 tokens
- **Evaluation Strategy**: Every 500 steps
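The exact training script is not published; the following is a minimal `Trainer` sketch that mirrors the hyperparameters above, reusing `train_ds`/`test_ds` from the data-loading sketch:

```python
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("morit/arabic_xlm_xnli")
# The XNLI base model ships a 3-label head, so the 4-label head is re-initialized
model = AutoModelForSequenceClassification.from_pretrained(
    "morit/arabic_xlm_xnli", num_labels=4, ignore_mismatched_sizes=True
)

def tokenize(batch):
    # Matches the stated 128-token maximum sequence length
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

args = TrainingArguments(
    output_dir="arabic_mi_classifier",
    num_train_epochs=20,             # stated training epochs
    per_device_train_batch_size=8,   # stated training batch size
    per_device_eval_batch_size=16,   # stated evaluation batch size
    eval_strategy="steps",           # "evaluation_strategy" on older transformers
    eval_steps=500,                  # evaluate every 500 steps
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds.map(tokenize, batched=True),
    eval_dataset=test_ds.map(tokenize, batched=True),
)
trainer.train()
```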
## Usage
### Using Transformers Pipeline
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
# Load the model and tokenizer
model_name = "ahmedmajid92/Arabic_MI_Classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
# Create a classification pipeline
classifier = pipeline(
    "text-classification",
    model=model,
    tokenizer=tokenizer
)
# Classify a message
text = "السلام عليكم ورحمة الله"
result = classifier(text)
print(f"Label: {result[0]['label']}, Score: {result[0]['score']:.4f}")
```
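The pipeline also accepts a list of messages, which is convenient for batch scoring:

```python
# Reuses the classifier created above
messages = ["هلو شلونك اليوم؟", "عندي مشكلة بالانترنت", "أحب القراءة والكتابة"]
for msg, pred in zip(messages, classifier(messages)):
    print(f"{msg} -> {pred['label']} ({pred['score']:.4f})")
```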
### Using the Model Directly
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
# Load model and tokenizer
model_name = "ahmedmajid92/Arabic_MI_Classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
# Tokenize input
text = "شلونك اليوم؟"
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=128)
# Get predictions
with torch.no_grad():
    outputs = model(**inputs)
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
predicted_class_id = predictions.argmax().item()
confidence = predictions.max().item()
# Map to label names
id2label = {0: "greeting", 1: "question", 2: "complaint", 3: "general"}
predicted_label = id2label[predicted_class_id]
print(f"Text: {text}")
print(f"Predicted Label: {predicted_label}")
print(f"Confidence: {confidence:.4f}")
```
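If a GPU is available, the same snippet runs there after moving the model and the tokenized inputs to the device:

```python
import torch

# Continues the snippet above: move model and inputs to the GPU if present
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
inputs = {k: v.to(device) for k, v in inputs.items()}
```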
### Gradio Web Interface
```python
import gradio as gr
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
# Load model
model_name = "ahmedmajid92/Arabic_MI_Classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
# Create classifier
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
def classify_text(text):
    result = classifier(text)[0]
    return result["label"], float(result["score"])

# Create Gradio interface
iface = gr.Interface(
    fn=classify_text,
    inputs=gr.Textbox(lines=2, placeholder="اكتب جملتك هنا…", label="Input Text"),
    outputs=[
        gr.Textbox(label="Predicted Label"),
        gr.Number(label="Confidence")
    ],
    title="Arabic Message Classifier",
    description="Classify Arabic messages into: greeting, question, complaint, or general."
)
iface.launch()
```
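Passing `share=True` to `iface.launch()` additionally creates a temporary public URL, which is handy for quick demos.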
## Model Performance
The model reaches 95% accuracy on the held-out test set (see the metrics above) and is particularly effective at:
- Distinguishing between greetings and general statements
- Identifying questions in both MSA and Iraqi dialect
- Classifying complaints and technical issues
- Handling mixed dialectal variations
## Supported Dialects
- **Modern Standard Arabic (MSA)**: Formal Arabic text
- **Iraqi Dialect**: Colloquial Iraqi Arabic expressions and vocabulary
## Limitations
- The model is specifically trained on MSA and Iraqi dialect; performance may vary with other Arabic dialects
- Limited to 4 predefined categories
- Performance depends on the similarity of input text to training data patterns
- Maximum input length is 128 tokens
## Ethical Considerations
This model is intended for text classification purposes and should be used responsibly. Users should be aware that:
- The model may reflect biases present in the training data
- Performance may vary across different Arabic dialects not represented in training
- The model should not be used for sensitive applications without proper validation
## Citation
If you use this model in your research, please cite:
```bibtex
@misc{arabic-mi-classifier,
title={Arabic Message Classification Model},
author={Ahmed Majid},
year={2025},
howpublished={Hugging Face Model Hub},
url={https://huggingface.co/ahmedmajid92/Arabic_MI_Classifier}
}
```
## Model Card
This README serves as the model card: the sections above cover the model's intended use, training data, limitations, and ethical considerations.
## Contact
For questions or issues, please contact ahmed1991madrid@gmail.com or create an issue in the model repository.
## License
This model is released under the MIT License, the same license as its base model `morit/arabic_xlm_xnli`.