---
language: ar
license: mit
library_name: transformers
pipeline_tag: text-classification
datasets:
- custom
tags:
- arabic
- text-classification
- iraqi-dialect
- msa
- message-classification
- xlm-roberta
- fine-tuned
widget:
- text: "السلام عليكم ورحمة الله وبركاته"
  example_title: "Arabic Greeting (MSA)"
- text: "هلو شلونك اليوم؟"
  example_title: "Iraqi Greeting + Question"
- text: "متى يبدأ الاجتماع؟"
  example_title: "Question (MSA)"
- text: "عندي مشكلة بالانترنت"
  example_title: "Complaint (Iraqi)"
- text: "أحب القراءة والكتابة"
  example_title: "General Statement (MSA)"
- text: "الكهرباء نفطت"
  example_title: "Complaint (Iraqi)"
model-index:
- name: Arabic_MI_Classifier
  results:
  - task:
      type: text-classification
      name: Text Classification
    dataset:
      type: custom
      name: Arabic Messages Dataset (MSA + Iraqi)
    metrics:
    - type: accuracy
      value: 0.95
      name: Accuracy
base_model: morit/arabic_xlm_xnli
---

# Arabic Message Classification Model

## Model Description

This is a fine-tuned XLM-RoBERTa model for Arabic message classification, specifically designed to classify messages in both Modern Standard Arabic (MSA) and Iraqi dialect. The model is based on `morit/arabic_xlm_xnli` and has been fine-tuned on a custom dataset of 5,000 Arabic messages.

## Model Details

- **Base Model**: `morit/arabic_xlm_xnli`
- **Architecture**: XLMRobertaForSequenceClassification
- **Language**: Arabic (MSA and Iraqi dialect)
- **Task**: Text Classification
- **Number of Labels**: 4
- **Model Size**: ~280M parameters

## Labels

The model classifies messages into four categories:

| Label ID | Label Name | Description | Examples |
|----------|------------|-------------|----------|
| 0 | greeting | Greetings and salutations | "السلام عليكم", "هلو", "مرحبا" |
| 1 | question | Questions and inquiries | "كيف حالك؟", "شلونك؟", "متى الاجتماع؟" |
| 2 | complaint | Complaints and problems | "عندي مشكلة", "الانترنت معطل", "الجهاز لا يعمل" |
| 3 | general | General statements | "أحب القراءة", "أعمل مهندساً", "أسافر كثيراً" |
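
Depending on how the hosted config is populated, the pipeline may return either these readable names or generic IDs such as `LABEL_2`. A small hypothetical helper (mirroring the table above) normalizes both forms:

```python
# Map raw label IDs (or pipeline-style "LABEL_n" strings) to readable names,
# following the label table above.
ID2LABEL = {0: "greeting", 1: "question", 2: "complaint", 3: "general"}

def readable_label(raw):
    """Accept an int ID or a 'LABEL_n' string and return the category name."""
    if isinstance(raw, str) and raw.startswith("LABEL_"):
        raw = int(raw.split("_", 1)[1])
    return ID2LABEL[raw]

print(readable_label("LABEL_2"))  # complaint
print(readable_label(0))          # greeting
```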

## Training Data

The model was trained on a custom dataset containing:
- **5,000 Arabic messages** (50% MSA, 50% Iraqi dialect)
- **Balanced distribution**: 1,250 examples per class
- **Train/Test Split**: 90%/10%
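
The dataset itself is not published; as an illustration only, a balanced 90/10 split of the kind described above can be sketched in plain Python (the `example_...` texts are placeholders, not real training data):

```python
import random

LABELS = ["greeting", "question", "complaint", "general"]

# Illustrative stand-in: 1,250 (text, label) pairs per class, 5,000 total.
dataset = [(f"example_{label}_{i}", label) for label in LABELS for i in range(1250)]

random.seed(42)  # reproducible shuffle before splitting
random.shuffle(dataset)

split = int(0.9 * len(dataset))  # 90% train / 10% test
train, test = dataset[:split], dataset[split:]

print(len(train), len(test))  # 4500 500
```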

## Training Details

- **Training Epochs**: 20
- **Batch Size**: 8 (training), 16 (evaluation)
- **Optimizer**: AdamW with the default learning rate
- **Maximum Sequence Length**: 128 tokens
- **Evaluation Strategy**: Every 500 steps
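
The original training script is not published; a plausible `Trainer` configuration matching the hyperparameters listed above might look like the following sketch (argument names follow recent `transformers` releases and may differ slightly across versions):

```python
from transformers import TrainingArguments, Trainer

# Hypothetical reconstruction of the reported hyperparameters;
# not the author's actual training script.
args = TrainingArguments(
    output_dir="arabic_mi_classifier",
    num_train_epochs=20,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=16,
    eval_strategy="steps",   # evaluate every eval_steps steps
    eval_steps=500,
)

# trainer = Trainer(model=model, args=args,
#                   train_dataset=train_ds, eval_dataset=test_ds,
#                   tokenizer=tokenizer)
# trainer.train()
```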

## Usage

### Using Transformers Pipeline

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

# Load the model and tokenizer
model_name = "ahmedmajid92/Arabic_MI_Classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Create a classification pipeline
classifier = pipeline(
    "text-classification",
    model=model,
    tokenizer=tokenizer
)

# Classify a message
text = "السلام عليكم ورحمة الله"
result = classifier(text)
print(f"Label: {result[0]['label']}, Score: {result[0]['score']:.4f}")
```

### Using the Model Directly

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model_name = "ahmedmajid92/Arabic_MI_Classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Tokenize input
text = "شلونك اليوم؟"
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=128)

# Get predictions
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
    predicted_class_id = predictions.argmax().item()
    confidence = predictions.max().item()

# Map to label names
id2label = {0: "greeting", 1: "question", 2: "complaint", 3: "general"}
predicted_label = id2label[predicted_class_id]

print(f"Text: {text}")
print(f"Predicted Label: {predicted_label}")
print(f"Confidence: {confidence:.4f}")
```

### Gradio Web Interface

```python
import gradio as gr
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

# Load model
model_name = "ahmedmajid92/Arabic_MI_Classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Create classifier
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)

def classify_text(text):
    result = classifier(text)[0]
    return result["label"], float(result["score"])

# Create Gradio interface
iface = gr.Interface(
    fn=classify_text,
    inputs=gr.Textbox(lines=2, placeholder="اكتب جملتك هنا…", label="Input Text"),
    outputs=[
        gr.Textbox(label="Predicted Label"),
        gr.Number(label="Confidence")
    ],
    title="Arabic Message Classifier",
    description="Classify Arabic messages into: greeting, question, complaint, or general."
)

iface.launch()
```

## Model Performance

The model achieves approximately 95% accuracy on the held-out test set. It is particularly effective at:
- Distinguishing between greetings and general statements
- Identifying questions in both MSA and Iraqi dialect
- Classifying complaints and technical issues
- Handling mixed dialectal variations

## Supported Dialects

- **Modern Standard Arabic (MSA)**: Formal Arabic text
- **Iraqi Dialect**: Colloquial Iraqi Arabic expressions and vocabulary

## Limitations

- The model is specifically trained on MSA and Iraqi dialect; performance may vary with other Arabic dialects
- Limited to 4 predefined categories
- Performance depends on the similarity of input text to training data patterns
- Maximum input length is 128 tokens

## Ethical Considerations

This model is intended for text classification purposes and should be used responsibly. Users should be aware that:
- The model may reflect biases present in the training data
- Performance may vary across different Arabic dialects not represented in training
- The model should not be used for sensitive applications without proper validation

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{arabic-mi-classifier,
  title={Arabic Message Classification Model},
  author={Ahmed Majid},
  year={2025},
  howpublished={Hugging Face Model Hub},
  url={https://huggingface.co/ahmedmajid92/Arabic_MI_Classifier}
}
```

## Contact

For questions or issues, please contact ahmed1991madrid@gmail.com or create an issue in the model repository.

## License

This model is released under the MIT License, same as the base model `morit/arabic_xlm_xnli`.