ahmedmajid92 committed on
Commit
4f31dd2
·
verified ·
1 Parent(s): 9dd6c1b

Upload README.md

  1. README.md +182 -3
README.md CHANGED
@@ -1,3 +1,182 @@
- ---
- license: mit
- ---
+ # Arabic Message Classification Model
+
+ ## Model Description
+
+ A fine-tuned XLM-RoBERTa model that classifies Arabic messages written in Modern Standard Arabic (MSA) or Iraqi dialect. It is based on `morit/arabic_xlm_xnli` and was fine-tuned on a custom dataset of 5,000 Arabic messages.
+
+ ## Model Details
+
+ - **Base Model**: `morit/arabic_xlm_xnli`
+ - **Architecture**: `XLMRobertaForSequenceClassification`
+ - **Language**: Arabic (MSA and Iraqi dialect)
+ - **Task**: Text Classification
+ - **Number of Labels**: 4
+ - **Model Size**: ~280M parameters
+
+ ## Labels
+
+ The model classifies messages into four categories:
+
+ | Label ID | Label Name | Description | Examples |
+ |----------|------------|-------------|----------|
+ | 0 | greeting | Greetings and salutations | "السلام عليكم", "هلو", "مرحبا" |
+ | 1 | question | Questions and inquiries | "كيف حالك؟", "شلونك؟", "متى الاجتماع؟" |
+ | 2 | complaint | Complaints and problems | "عندي مشكلة", "الانترنت معطل", "الجهاز لا يعمل" |
+ | 3 | general | General statements | "أحب القراءة", "أعمل مهندساً", "أسافر كثيراً" |
+
+ ## Training Data
+
+ The model was trained on a custom dataset containing:
+ - **5,000 Arabic messages** (50% MSA, 50% Iraqi dialect)
+ - **Balanced distribution**: 1,250 examples per class
+ - **Train/Test Split**: 90%/10% (4,500 training and 500 test messages)
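
The split described above can be sketched as follows. The README does not say whether the split was stratified by label, so the per-class stratification here is an assumption, and the `examples` list is placeholder data standing in for the real dataset. `stratified_split` is a hypothetical helper, not part of the model repository.

```python
import random
from collections import Counter

LABELS = ["greeting", "question", "complaint", "general"]


def stratified_split(examples, test_fraction=0.1, seed=0):
    """Split (text, label) pairs so every label keeps the same train/test ratio."""
    rng = random.Random(seed)

    # Group the examples by label.
    by_label = {}
    for text, label in examples:
        by_label.setdefault(label, []).append((text, label))

    train, test = [], []
    for items in by_label.values():
        rng.shuffle(items)
        n_test = round(len(items) * test_fraction)
        test.extend(items[:n_test])
        train.extend(items[n_test:])
    return train, test


# Placeholder data: 5,000 messages, 1,250 per class (not the real dataset).
examples = [(f"message {i}", LABELS[i % 4]) for i in range(5000)]
train, test = stratified_split(examples)

print(len(train), len(test))
print(Counter(label for _, label in test))
```

With a balanced dataset, stratifying keeps the test set balanced too (125 messages per class here), which makes per-class accuracy directly comparable.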
+
+ ## Training Details
+
+ - **Training Epochs**: 20
+ - **Batch Size**: 8 (training), 16 (evaluation)
+ - **Optimizer**: AdamW with the default learning rate
+ - **Maximum Sequence Length**: 128 tokens
+ - **Evaluation Strategy**: Every 500 steps
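
To get a feel for what these settings imply, the arithmetic below estimates the number of optimizer steps and evaluation runs. It assumes a single device, no gradient accumulation, and that the last partial batch of each epoch is kept; none of these are stated in the README.

```python
import math

train_examples = 4500  # 90% of the 5,000 messages
batch_size = 8         # training batch size
epochs = 20
eval_every = 500       # evaluation interval in steps

# One optimizer step per batch; the final partial batch still counts as a step.
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs
num_evaluations = total_steps // eval_every

print(steps_per_epoch)   # 563
print(total_steps)       # 11260
print(num_evaluations)   # 22
```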
+
+ ## Usage
+
+ ### Using Transformers Pipeline
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
+
+ # Load the model and tokenizer
+ model_name = "ahmedmajid92/Arabic_MI_Classifier"
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ model = AutoModelForSequenceClassification.from_pretrained(model_name)
+
+ # Create a classification pipeline
+ classifier = pipeline(
+     "text-classification",
+     model=model,
+     tokenizer=tokenizer,
+ )
+
+ # Classify a message (a greeting: "Peace be upon you and the mercy of God")
+ text = "السلام عليكم ورحمة الله"
+ result = classifier(text)
+ print(f"Label: {result[0]['label']}, Score: {result[0]['score']:.4f}")
+ ```
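
When a model's configuration does not define label names, Transformers falls back to generic names such as `LABEL_0`. The README does not say which form this model returns, so the sketch below handles both. `readable_label` is a hypothetical helper, and the mapping is taken from the Labels table.

```python
# Mapping from the Labels table in this README.
ID2LABEL = {0: "greeting", 1: "question", 2: "complaint", 3: "general"}


def readable_label(raw_label: str) -> str:
    """Turn a generic label like 'LABEL_2' into its name; pass other labels through."""
    prefix = "LABEL_"
    if raw_label.startswith(prefix):
        suffix = raw_label[len(prefix):]
        if suffix.isdigit() and int(suffix) in ID2LABEL:
            return ID2LABEL[int(suffix)]
    return raw_label


print(readable_label("LABEL_2"))   # complaint
print(readable_label("greeting"))  # greeting
```

It can be applied to the pipeline output as `readable_label(result[0]["label"])`.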
+
+ ### Using the Model Directly
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
+ import torch
+
+ # Load model and tokenizer
+ model_name = "ahmedmajid92/Arabic_MI_Classifier"
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ model = AutoModelForSequenceClassification.from_pretrained(model_name)
+
+ # Tokenize input (Iraqi dialect for "How are you today?")
+ text = "شلونك اليوم؟"
+ inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=128)
+
+ # Get predictions; gradients are not needed for inference
+ with torch.no_grad():
+     outputs = model(**inputs)
+     predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
+     predicted_class_id = predictions.argmax(dim=-1).item()
+     confidence = predictions.max().item()
+
+ # Map to label names
+ id2label = {0: "greeting", 1: "question", 2: "complaint", 3: "general"}
+ predicted_label = id2label[predicted_class_id]
+
+ print(f"Text: {text}")
+ print(f"Predicted Label: {predicted_label}")
+ print(f"Confidence: {confidence:.4f}")
+ ```
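
The softmax and argmax steps above can be illustrated without downloading the model. The sketch below uses plain Python and made-up logits (the values are hypothetical, not real model output) to show how four raw scores become a probability distribution and a predicted label.

```python
import math

ID2LABEL = {0: "greeting", 1: "question", 2: "complaint", 3: "general"}


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for one message, one score per label.
logits = [0.2, 3.1, -1.0, 0.5]
probs = softmax(logits)

# The predicted class is the index with the highest probability.
best = max(range(len(probs)), key=probs.__getitem__)
print(f"Predicted Label: {ID2LABEL[best]}")
print(f"Confidence: {probs[best]:.4f}")
```

The confidence is simply the largest softmax probability, so it reflects how far the top logit is from the others rather than a calibrated likelihood.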
+
+ ### Gradio Web Interface
+
+ ```python
+ import gradio as gr
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
+
+ # Load model
+ model_name = "ahmedmajid92/Arabic_MI_Classifier"
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ model = AutoModelForSequenceClassification.from_pretrained(model_name)
+
+ # Create classifier
+ classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
+
+ def classify_text(text):
+     # Return the top label and its score for a single message
+     result = classifier(text)[0]
+     return result["label"], float(result["score"])
+
+ # Create Gradio interface (the placeholder reads "Write your sentence here…")
+ iface = gr.Interface(
+     fn=classify_text,
+     inputs=gr.Textbox(lines=2, placeholder="اكتب جملتك هنا…", label="Input Text"),
+     outputs=[
+         gr.Textbox(label="Predicted Label"),
+         gr.Number(label="Confidence"),
+     ],
+     title="Arabic Message Classifier",
+     description="Classify Arabic messages into: greeting, question, complaint, or general.",
+ )
+
+ iface.launch()
+ ```
+
+ ## Model Performance
+
+ The model performs well on the test set and is particularly effective at:
+ - Distinguishing between greetings and general statements
+ - Identifying questions in both MSA and Iraqi dialect
+ - Classifying complaints and technical issues
+ - Handling mixed dialectal variations
+
+ ## Supported Dialects
+
+ - **Modern Standard Arabic (MSA)**: Formal Arabic text
+ - **Iraqi Dialect**: Colloquial Iraqi Arabic expressions and vocabulary
+
+ ## Limitations
+
+ - The model is trained only on MSA and Iraqi dialect; performance may vary with other Arabic dialects
+ - Limited to 4 predefined categories
+ - Performance depends on how closely the input text resembles the training data
+ - Maximum input length is 128 tokens
+
+ ## Ethical Considerations
+
+ This model is intended for text classification purposes and should be used responsibly. Users should be aware that:
+ - The model may reflect biases present in the training data
+ - Performance may vary across different Arabic dialects not represented in training
+ - The model should not be used for sensitive applications without proper validation
+
+ ## Citation
+
+ If you use this model in your research, please cite:
+
+ ```bibtex
+ @misc{arabic-mi-classifier,
+   title={Arabic Message Classification Model},
+   author={Ahmed Majid},
+   year={2025},
+   howpublished={Hugging Face Model Hub},
+   url={https://huggingface.co/ahmedmajid92/Arabic_MI_Classifier}
+ }
+ ```
+
+ ## Model Card
+
+ For more detailed information about the model's intended use, training data, and ethical considerations, please refer to the model card.
+
+ ## Contact
+
+ For questions or issues, please contact ahmed1991madrid@gmail.com or create an issue in the model repository.
+
+ ## License
+
+ This model is released under the MIT License, same as the base model `morit/arabic_xlm_xnli`.