---

language: ar
license: mit
library_name: transformers
pipeline_tag: text-classification
datasets:
- custom
tags:
- arabic
- text-classification
- iraqi-dialect
- msa
- message-classification
- xlm-roberta
widget:
- text: "السلام عليكم ورحمة الله وبركاته"
  example_title: "Arabic Greeting"
- text: "شلونك اليوم؟"
  example_title: "Iraqi Question"
- text: "عندي مشكلة بالانترنت"
  example_title: "Iraqi Complaint"
- text: "أحب القراءة كثيراً"
  example_title: "General Statement"
model-index:
- name: Arabic_MI_Classifier
  results:
  - task:
      type: text-classification
      name: Text Classification
    dataset:
      type: custom
      name: Arabic Messages Dataset
    metrics:
    - type: accuracy
      value: 0.95
      name: Accuracy
---


# Arabic Message Classification Model

This model fine-tunes XLM-RoBERTa for Arabic message classification, supporting both Modern Standard Arabic (MSA) and Iraqi dialect.

## Model Description

- **Developed by:** Ahmed Majid
- **Model type:** XLM-RoBERTa for Sequence Classification
- **Language(s):** Arabic (MSA and Iraqi dialect)
- **License:** MIT
- **Finetuned from model:** morit/arabic_xlm_xnli

## Intended Uses

### Direct Use

This model can be used for:
- Classifying Arabic messages in customer service systems
- Organizing Arabic text messages by intent
- Building chatbots for Arabic-speaking users
- Content moderation for Arabic forums and social media

### Downstream Use

The model can be further fine-tuned for:
- Other Arabic dialects
- Domain-specific message classification
- Multi-label classification tasks

## How to Get Started with the Model

Use the code below to get started with the model.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

# Load the fine-tuned tokenizer and classifier from the Hugging Face Hub
model_name = "ahmedmajid92/Arabic_MI_Classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Wrap both in a text-classification pipeline and classify one message
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
result = classifier("السلام عليكم")
print(result)
```
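The pipeline returns a list of dictionaries, each with a `label` and a `score`. The sketch below shows one possible way to act on that output by falling back to a neutral label when the top score is low. The helper `pick_label`, the 0.6 threshold, and the sample predictions are illustrative assumptions, not part of the model; the actual label strings depend on the model's configuration and may appear as `LABEL_0`, `LABEL_1`, and so on.

```python
from typing import Dict, List, Union

Prediction = Dict[str, Union[str, float]]


def pick_label(predictions: List[Prediction],
               threshold: float = 0.6,
               fallback: str = "uncertain") -> str:
    """Return the top-scoring label, or `fallback` if its score is below `threshold`."""
    best = max(predictions, key=lambda p: p["score"])
    return best["label"] if best["score"] >= threshold else fallback


# Hand-written stand-ins shaped like pipeline output; not real model predictions.
confident = [{"label": "greeting", "score": 0.97}]
ambiguous = [{"label": "question", "score": 0.41}]

print(pick_label(confident))  # greeting
print(pick_label(ambiguous))  # uncertain
```

A threshold like this is one simple way to route ambiguous messages to a human or a default handler; the right value would need to be tuned on real data.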

## Training Details

### Training Data

The model was trained on a custom dataset of 5,000 Arabic messages:
- 50% Modern Standard Arabic (MSA)
- 50% Iraqi dialect
- 4 classes: greeting, question, complaint, general
- Balanced distribution: 1,250 examples per class
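The composition figures above can be checked with a little arithmetic. This is only a sketch: the order of the class names in `LABEL2ID` and the idea that the 50/50 dialect split also holds inside each class are assumptions, not facts stated by the card.

```python
TOTAL_MESSAGES = 5000
CLASSES = ["greeting", "question", "complaint", "general"]
VARIETIES = ["MSA", "Iraqi"]

# Balanced classes: 5,000 messages spread evenly over 4 labels.
per_class = TOTAL_MESSAGES // len(CLASSES)

# 50% MSA and 50% Iraqi dialect over the whole dataset.
per_variety = TOTAL_MESSAGES // len(VARIETIES)

# Assumption: the dialect split is also even within each class.
per_class_and_variety = per_class // len(VARIETIES)

# Assumption: label ids follow the order in which the card lists the classes.
LABEL2ID = {name: index for index, name in enumerate(CLASSES)}

print(per_class, per_variety, per_class_and_variety)  # 1250 2500 625
print(LABEL2ID)
```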

### Training Procedure

#### Preprocessing

- Tokenization using XLM-RoBERTa tokenizer
- Maximum sequence length: 128 tokens
- Padding and truncation applied
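To illustrate what padding and truncation to a fixed length do, here is a simplified sketch in plain Python that operates on lists of token ids. The helper `pad_or_truncate` is hypothetical, and the pad id of 1 is assumed to match the XLM-RoBERTa vocabulary. Unlike a real tokenizer, this sketch simply cuts long sequences and does not preserve the closing special token. With the Transformers tokenizer, the same effect is normally requested with `padding="max_length"`, `truncation=True`, and `max_length=128`.

```python
from typing import List, Tuple

MAX_LENGTH = 128  # maximum sequence length stated in this card
PAD_ID = 1        # assumption: pad token id in the XLM-RoBERTa vocabulary


def pad_or_truncate(token_ids: List[int],
                    max_length: int = MAX_LENGTH,
                    pad_id: int = PAD_ID) -> Tuple[List[int], List[int]]:
    """Return (input_ids, attention_mask), both exactly `max_length` long."""
    kept = token_ids[:max_length]             # truncation
    padding = max_length - len(kept)          # number of pad positions needed
    input_ids = kept + [pad_id] * padding     # right-padding
    attention_mask = [1] * len(kept) + [0] * padding
    return input_ids, attention_mask


# Made-up token ids, used only to exercise both branches.
short_ids, short_mask = pad_or_truncate([0, 100, 200, 2])
long_ids, long_mask = pad_or_truncate(list(range(300)))

print(len(short_ids), sum(short_mask))  # 128 4
print(len(long_ids), sum(long_mask))    # 128 128
```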

#### Training Hyperparameters

- **Training regime:** fp32
- **Epochs:** 20
- **Batch size:** 8 (training), 16 (evaluation)
- **Learning rate:** Default for the AdamW optimizer (exact value not specified)
- **Warmup steps:** Not specified
- **Weight decay:** Default (exact value not specified)
- **Optimizer:** AdamW
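As a rough sanity check on these settings, the number of optimizer steps can be estimated from the dataset size, batch size, and epoch count. The sketch assumes that all examples outside the 10% test split were used for training, that training ran on a single device without gradient accumulation, and that the last partial batch of each epoch was kept; none of this is confirmed by the card.

```python
import math

TOTAL_MESSAGES = 5000
TEST_EXAMPLES = 500      # 10% hold-out described in the Evaluation section
TRAIN_BATCH_SIZE = 8
EPOCHS = 20

# Assumption: everything that is not test data was used for training.
train_examples = TOTAL_MESSAGES - TEST_EXAMPLES

# Assumption: single device, no gradient accumulation, last partial batch kept.
steps_per_epoch = math.ceil(train_examples / TRAIN_BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS

print(train_examples, steps_per_epoch, total_steps)  # 4500 563 11260
```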

#### Speeds, Sizes, Times

- **Model size:** ~280M parameters
- **Training time:** Approximately 2-3 hours on GPU
- **Inference time:** ~50ms per message on GPU

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

- 10% of the custom dataset (500 examples)
- Balanced across all 4 classes
- Mix of MSA and Iraqi dialect
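A balanced 10% hold-out like the one described above can be produced with a stratified split. The following is a minimal sketch on placeholder data, not the author's actual splitting code; the function `stratified_split`, the seed, and the toy messages are all illustrative assumptions.

```python
import random
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

Example = Tuple[str, str]  # (text, label)


def stratified_split(examples: List[Example],
                     test_fraction: float = 0.1,
                     seed: int = 0) -> Tuple[List[Example], List[Example]]:
    """Hold out `test_fraction` of each class so the test set stays balanced."""
    by_label: Dict[str, List[Example]] = defaultdict(list)
    for example in examples:
        by_label[example[1]].append(example)

    rng = random.Random(seed)
    train: List[Example] = []
    test: List[Example] = []
    for label in sorted(by_label):
        items = by_label[label]
        rng.shuffle(items)
        n_test = round(len(items) * test_fraction)
        test.extend(items[:n_test])
        train.extend(items[n_test:])
    return train, test


# Placeholder dataset with the same shape as the card describes: 4 x 1,250.
CLASSES = ["greeting", "question", "complaint", "general"]
toy = [(f"msg-{label}-{i}", label) for label in CLASSES for i in range(1250)]

train, test = stratified_split(toy)
print(len(train), len(test))              # 4500 500
print(Counter(label for _, label in test))
```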

#### Factors

- **Dialects:** MSA vs Iraqi dialect
- **Message length:** Short to medium messages
- **Domain:** General conversational messages

#### Metrics

- **Accuracy:** Primary metric
- **Per-class performance:** Evaluated for each label
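For readers who want to reproduce these metrics on their own predictions, accuracy and per-class precision and recall can be computed as sketched below. The label lists are made-up toy values, not results from this model, and the function names are illustrative.

```python
from collections import Counter
from typing import Dict, List


def accuracy(y_true: List[str], y_pred: List[str]) -> float:
    """Fraction of predictions that match the reference label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_report(y_true: List[str], y_pred: List[str],
                     labels: List[str]) -> Dict[str, Dict[str, float]]:
    """Precision and recall for each label, with 0.0 when undefined."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted label gains a false positive
            fn[t] += 1  # true label gains a false negative
    report = {}
    for label in labels:
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        report[label] = {
            "precision": tp[label] / predicted if predicted else 0.0,
            "recall": tp[label] / actual if actual else 0.0,
        }
    return report


# Toy labels for illustration only.
LABELS = ["greeting", "question", "complaint", "general"]
y_true = ["greeting", "question", "complaint", "general", "question", "complaint"]
y_pred = ["greeting", "question", "general", "general", "question", "complaint"]

print(round(accuracy(y_true, y_pred), 3))  # 0.833
report = per_class_report(y_true, y_pred, LABELS)
print(report["complaint"])  # {'precision': 1.0, 'recall': 0.5}
```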

### Results

The model-index metadata for this card reports an accuracy of 0.95 on the custom Arabic Messages Dataset. Performance is described as good across all four classes:
- Greeting detection (both MSA and Iraqi)
- Question identification
- Complaint classification
- General statement recognition

## Environmental Impact

Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute).

## Technical Specifications

### Model Architecture and Objective

- **Architecture:** XLM-RoBERTa with classification head
- **Objective:** Multi-class text classification
- **Base model:** morit/arabic_xlm_xnli
- **Classification head:** 4 output classes
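A 4-class head emits one logit per class, and a softmax turns those logits into probabilities. The sketch below shows that last step in plain Python. The logit values are invented, and the order of labels in `ID2LABEL` is an assumption; the real mapping lives in the model's configuration.

```python
import math
from typing import List, Tuple

# Assumption: label ids follow the order in which the card lists the classes.
ID2LABEL = {0: "greeting", 1: "question", 2: "complaint", 3: "general"}


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(logits: List[float]) -> Tuple[str, float]:
    """Map 4 logits to the most probable label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[best], probs[best]


# Invented logits, standing in for the output of the classification head.
label, prob = predict([2.0, 0.5, -1.0, 0.1])
print(label)  # greeting
```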

### Compute Infrastructure

#### Hardware

- **GPU:** NVIDIA GPU (recommended)
- **Memory:** 8GB+ GPU memory recommended

#### Software

- **Framework:** PyTorch
- **Libraries:** Transformers, Datasets
- **Python version:** 3.8+

## Bias, Risks, and Limitations

### Bias

- The model may exhibit biases present in the training data
- Performance may vary between different Arabic dialects
- Regional variations in Iraqi dialect may not be fully captured

### Risks

- Misclassification of ambiguous messages
- Potential cultural bias in greeting/complaint detection
- Limited generalization to formal/informal register variations

### Limitations

- Only supports 4 predefined classes
- Optimized for MSA and Iraqi dialect specifically
- Maximum input length of 128 tokens (the sequence length used in training); longer messages are truncated
- May not generalize well to other Arabic dialects

## Additional Information

### Author

Ahmed Majid

### Licensing Information

This model is released under the MIT License.

### Citation Information

```bibtex
@misc{arabic-mi-classifier,
  title={Arabic Message Classification Model},
  author={Ahmed Majid},
  year={2025},
  howpublished={Hugging Face Model Hub},
  url={https://huggingface.co/ahmedmajid92/Arabic_MI_Classifier}
}
```

### Acknowledgments

- Based on the XLM-RoBERTa model by Facebook AI
- Fine-tuned from morit/arabic_xlm_xnli
- Trained on a custom Arabic message dataset