Commit 000615a by ahmedmajid92 (verified) · Parent: 3922543

Update README.md

Files changed: README.md (+226, −182)
---
language: ar
license: mit
library_name: transformers
pipeline_tag: text-classification
datasets:
- custom
tags:
- arabic
- text-classification
- iraqi-dialect
- msa
- message-classification
- xlm-roberta
- fine-tuned
widget:
- text: "السلام عليكم ورحمة الله وبركاته"
  example_title: "Arabic Greeting (MSA)"
- text: "هلو شلونك اليوم؟"
  example_title: "Iraqi Greeting + Question"
- text: "متى يبدأ الاجتماع؟"
  example_title: "Question (MSA)"
- text: "عندي مشكلة بالانترنت"
  example_title: "Complaint (Iraqi)"
- text: "أحب القراءة والكتابة"
  example_title: "General Statement (MSA)"
- text: "الكهرباء نفطت"
  example_title: "Complaint (Iraqi)"
model-index:
- name: Arabic_MI_Classifier
  results:
  - task:
      type: text-classification
      name: Text Classification
    dataset:
      type: custom
      name: Arabic Messages Dataset (MSA + Iraqi)
    metrics:
    - type: accuracy
      value: 0.95
      name: Accuracy
base_model: morit/arabic_xlm_xnli
---

# Arabic Message Classification Model

## Model Description

This is a fine-tuned XLM-RoBERTa model for Arabic message classification, specifically designed to classify messages in both Modern Standard Arabic (MSA) and Iraqi dialect. The model is based on `morit/arabic_xlm_xnli` and has been fine-tuned on a custom dataset of 5,000 Arabic messages.

## Model Details

- **Base Model**: `morit/arabic_xlm_xnli`
- **Architecture**: XLMRobertaForSequenceClassification
- **Language**: Arabic (MSA and Iraqi dialect)
- **Task**: Text Classification
- **Number of Labels**: 4
- **Model Size**: ~280M parameters

## Labels

The model classifies messages into four categories:

| Label ID | Label Name | Description | Examples |
|----------|------------|-------------|----------|
| 0 | greeting | Greetings and salutations | "السلام عليكم", "هلو", "مرحبا" |
| 1 | question | Questions and inquiries | "كيف حالك؟", "شلونك؟", "متى الاجتماع؟" |
| 2 | complaint | Complaints and problems | "عندي مشكلة", "الانترنت معطل", "الجهاز لا يعمل" |
| 3 | general | General statements | "أحب القراءة", "أعمل مهندساً", "أسافر كثيراً" |

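The table above corresponds to the label mapping that should also appear in the model's `config.json` (`id2label` / `label2id`); a minimal sketch:

```python
# Label mapping mirroring the table above. The same mapping is expected
# in the model's config.json (id2label / label2id).
id2label = {0: "greeting", 1: "question", 2: "complaint", 3: "general"}
label2id = {label: idx for idx, label in id2label.items()}

print(id2label[2])           # complaint
print(label2id["question"])  # 1
```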
## Training Data

The model was trained on a custom dataset containing:
- **5,000 Arabic messages** (50% MSA, 50% Iraqi dialect)
- **Balanced distribution**: 1,250 examples per class
- **Train/Test Split**: 90%/10%

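The split sizes implied by these figures can be checked with a little arithmetic (this assumes the 90%/10% split is stratified per class, which the balanced counts make plausible but the card does not state explicitly):

```python
# Dataset arithmetic from the figures above.
total_messages = 5000
num_classes = 4

per_class = total_messages // num_classes  # 1250 examples per class
train_size = int(total_messages * 0.9)     # 4500 training examples
test_size = total_messages - train_size    # 500 test examples
train_per_class = per_class * 9 // 10      # 1125 per class, if stratified

print(train_size, test_size, train_per_class)
```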
## Training Details

- **Training Epochs**: 20
- **Batch Size**: 8 (training), 16 (evaluation)
- **Optimizer**: AdamW with the `transformers` `Trainer` default learning rate (5e-5)
- **Maximum Sequence Length**: 128 tokens
- **Evaluation Strategy**: Every 500 steps

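A rough training-schedule estimate follows from these hyperparameters (a sketch assuming 4,500 training examples, i.e. 90% of 5,000, and no gradient accumulation):

```python
import math

train_examples = 4500  # 90% of the 5,000-message dataset
batch_size = 8
epochs = 20
eval_interval = 500    # evaluation every 500 steps

steps_per_epoch = math.ceil(train_examples / batch_size)  # 563
total_steps = steps_per_epoch * epochs                    # 11260
num_evaluations = total_steps // eval_interval            # 22

print(steps_per_epoch, total_steps, num_evaluations)
```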
## Usage

### Using Transformers Pipeline

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

# Load the model and tokenizer
model_name = "ahmedmajid92/Arabic_MI_Classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Create a classification pipeline
classifier = pipeline(
    "text-classification",
    model=model,
    tokenizer=tokenizer
)

# Classify a message
text = "السلام عليكم ورحمة الله"
result = classifier(text)
print(f"Label: {result[0]['label']}, Score: {result[0]['score']:.4f}")
```

### Using the Model Directly

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model_name = "ahmedmajid92/Arabic_MI_Classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Tokenize input
text = "شلونك اليوم؟"
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=128)

# Get predictions
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
    predicted_class_id = predictions.argmax().item()
    confidence = predictions.max().item()

# Map to label names
id2label = {0: "greeting", 1: "question", 2: "complaint", 3: "general"}
predicted_label = id2label[predicted_class_id]

print(f"Text: {text}")
print(f"Predicted Label: {predicted_label}")
print(f"Confidence: {confidence:.4f}")
```

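The softmax step above is what turns raw logits into the reported confidence score. A standalone illustration with made-up logits (not real model output) makes the computation explicit without needing the model loaded:

```python
import math

def softmax(logits):
    # Numerically stable softmax: shift by the max before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

id2label = {0: "greeting", 1: "question", 2: "complaint", 3: "general"}

# Hypothetical logits for the four classes (illustrative only).
logits = [0.3, 3.1, 0.2, 0.9]
probs = softmax(logits)
pred_id = max(range(len(probs)), key=probs.__getitem__)

print(id2label[pred_id], round(probs[pred_id], 4))
```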
### Gradio Web Interface

```python
import gradio as gr
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

# Load model
model_name = "ahmedmajid92/Arabic_MI_Classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Create classifier
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)

def classify_text(text):
    result = classifier(text)[0]
    return result["label"], float(result["score"])

# Create Gradio interface
iface = gr.Interface(
    fn=classify_text,
    inputs=gr.Textbox(lines=2, placeholder="اكتب جملتك هنا…", label="Input Text"),
    outputs=[
        gr.Textbox(label="Predicted Label"),
        gr.Number(label="Confidence")
    ],
    title="Arabic Message Classifier",
    description="Classify Arabic messages into: greeting, question, complaint, or general."
)

iface.launch()
```

## Model Performance

The model reaches roughly 95% accuracy on the held-out test set (the figure reported in the metadata above) and is particularly effective at:
- Distinguishing between greetings and general statements
- Identifying questions in both MSA and Iraqi dialect
- Classifying complaints and technical issues
- Handling mixed dialectal variations

## Supported Dialects

- **Modern Standard Arabic (MSA)**: Formal Arabic text
- **Iraqi Dialect**: Colloquial Iraqi Arabic expressions and vocabulary

## Limitations

- The model is specifically trained on MSA and Iraqi dialect; performance may vary with other Arabic dialects
- Limited to 4 predefined categories
- Performance depends on the similarity of input text to training data patterns
- Maximum input length is 128 tokens

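Because the model always outputs one of the four labels, out-of-scope inputs are forced into the nearest category. One common mitigation (an illustrative sketch, not part of the released model) is to treat low-confidence predictions as uncertain:

```python
def resolve_label(label, score, threshold=0.6):
    # Fall back to "uncertain" when the top-class probability is below
    # the threshold; 0.6 is illustrative, not tuned on real data.
    return label if score >= threshold else "uncertain"

print(resolve_label("greeting", 0.92))  # greeting
print(resolve_label("general", 0.35))   # uncertain
```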
## Ethical Considerations

This model is intended for text classification purposes and should be used responsibly. Users should be aware that:
- The model may reflect biases present in the training data
- Performance may vary across different Arabic dialects not represented in training
- The model should not be used for sensitive applications without proper validation

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{arabic-mi-classifier,
  title={Arabic Message Classification Model},
  author={Ahmed Majid},
  year={2025},
  howpublished={Hugging Face Model Hub},
  url={https://huggingface.co/ahmedmajid92/Arabic_MI_Classifier}
}
```

## Model Card

This README serves as the model card; the sections above document the model's intended use, training data, limitations, and ethical considerations.

## Contact

For questions or issues, please contact ahmed1991madrid@gmail.com or open an issue in the model repository.

## License

This model is released under the MIT License, same as the base model `morit/arabic_xlm_xnli`.