+# Arabic Message Classification Model
+## Model Description
+This is a fine-tuned XLM-RoBERTa model that classifies Arabic messages written in Modern Standard Arabic (MSA) or Iraqi dialect. It is based on `morit/arabic_xlm_xnli` and was fine-tuned on a custom dataset of 5,000 Arabic messages.
+## Model Details
+- **Base Model**: `morit/arabic_xlm_xnli`
+- **Architecture**: XLMRobertaForSequenceClassification
+- **Language**: Arabic (MSA and Iraqi dialect)
+- **Task**: Text Classification
+- **Number of Labels**: 4
+- **Model Size**: ~280M parameters
+## Labels
+The model classifies messages into four categories:
+| Label ID | Label Name | Description | Examples |
+|----------|------------|-------------|----------|
+| 0 | greeting | Greetings and salutations | "السلام عليكم", "هلو", "مرحبا" |
+| 1 | question | Questions and inquiries | "كيف حالك؟", "شلونك؟", "متى الاجتماع؟" |
+| 2 | complaint | Complaints and problems | "عندي مشكلة", "الانترنت معطل", "الجهاز لا يعمل" |
+| 3 | general | General statements | "أحب القراءة", "أعمل مهندساً", "أسافر كثيراً" |
+## Training Data
+The model was trained on a custom dataset containing:
+- **5,000 Arabic messages** (50% MSA, 50% Iraqi dialect)
+- **Balanced distribution**: 1,250 examples per class
+- **Train/Test Split**: 90%/10%
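
The split sizes implied by the figures above can be worked out with a few lines of arithmetic. This is a minimal sketch that assumes the 90%/10% split was stratified by class; the card does not say whether it was, so the per-class numbers are illustrative only.

```python
# Figures stated in the model card.
TOTAL_MESSAGES = 5_000
NUM_CLASSES = 4
TEST_FRACTION = 0.10  # 90%/10% train/test split

per_class = TOTAL_MESSAGES // NUM_CLASSES           # 1,250 examples per class

# Assumption: the split is stratified, so each class keeps the same ratio.
test_per_class = round(per_class * TEST_FRACTION)
train_per_class = per_class - test_per_class

train_size = train_per_class * NUM_CLASSES
test_size = test_per_class * NUM_CLASSES

print(f"train: {train_size} ({train_per_class} per class)")  # train: 4500 (1125 per class)
print(f"test:  {test_size} ({test_per_class} per class)")    # test:  500 (125 per class)
```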
+## Training Details
+- **Training Epochs**: 20
+- **Batch Size**: 8 (training), 16 (evaluation)
+- **Optimizer**: AdamW with its default learning rate (no custom value was set)
+- **Maximum Sequence Length**: 128 tokens
+- **Evaluation Strategy**: Every 500 steps
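
To get a feel for what these settings mean in practice, the number of optimizer steps and evaluation runs can be estimated from the values listed above. The sketch below assumes a single device, no gradient accumulation, and that the last partial batch of each epoch is kept; none of these are stated in the card.

```python
import math

TRAIN_EXAMPLES = 4_500   # 90% of the 5,000 messages
TRAIN_BATCH_SIZE = 8
EPOCHS = 20
EVAL_EVERY = 500         # evaluation runs every 500 steps

# Assumption: one device, no gradient accumulation, partial batches kept.
steps_per_epoch = math.ceil(TRAIN_EXAMPLES / TRAIN_BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS
num_evaluations = total_steps // EVAL_EVERY

print(steps_per_epoch)   # 563
print(total_steps)       # 11260
print(num_evaluations)   # 22
```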
+## Usage
+### Using Transformers Pipeline
+```python
+from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
+# Load the model and tokenizer
+model_name = "ahmedmajid92/Arabic_MI_Classifier"
+tokenizer = AutoTokenizer.from_pretrained(model_name)
+model = AutoModelForSequenceClassification.from_pretrained(model_name)
+# Create a classification pipeline
+classifier = pipeline(
+    "text-classification",
+    model=model,
+    tokenizer=tokenizer
+)
+# Classify a message
+text = "السلام عليكم ورحمة الله"
+result = classifier(text)
+print(f"Label: {result[0]['label']}, Score: {result[0]['score']:.4f}")
+```
+### Using the Model Directly
+```python
+from transformers import AutoTokenizer, AutoModelForSequenceClassification
+import torch
+# Load model and tokenizer
+model_name = "ahmedmajid92/Arabic_MI_Classifier"
+tokenizer = AutoTokenizer.from_pretrained(model_name)
+model = AutoModelForSequenceClassification.from_pretrained(model_name)
+# Tokenize input
+text = "شلونك اليوم؟"
+inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=128)
+# Get predictions
+with torch.no_grad():
+    outputs = model(**inputs)
+    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
+    predicted_class_id = predictions.argmax().item()
+    confidence = predictions.max().item()
+# Map to label names
+id2label = {0: "greeting", 1: "question", 2: "complaint", 3: "general"}
+predicted_label = id2label[predicted_class_id]
+print(f"Text: {text}")
+print(f"Predicted Label: {predicted_label}")
+print(f"Confidence: {confidence:.4f}")
+```
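
The confidence value in the example above is the largest softmax probability over the four logits. The following dependency-free sketch shows that step on its own; `example_logits` is a made-up vector for illustration, not real model output.

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities (numerically stable form)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

id2label = {0: "greeting", 1: "question", 2: "complaint", 3: "general"}

# Hypothetical logits for a single message (one value per label).
example_logits = [0.2, 3.1, -0.5, 0.9]

probs = softmax(example_logits)
predicted_id = max(range(len(probs)), key=probs.__getitem__)
confidence = probs[predicted_id]

print(id2label[predicted_id])  # question
print(f"{confidence:.4f}")
```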
+### Gradio Web Interface
+```python
+import gradio as gr
+from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
+# Load model
+model_name = "ahmedmajid92/Arabic_MI_Classifier"
+tokenizer = AutoTokenizer.from_pretrained(model_name)
+model = AutoModelForSequenceClassification.from_pretrained(model_name)
+# Create classifier
+classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
+def classify_text(text):
+    result = classifier(text)[0]
+    return result["label"], float(result["score"])
+# Create Gradio interface
+iface = gr.Interface(
+    fn=classify_text,
+    inputs=gr.Textbox(lines=2, placeholder="اكتب جملتك هنا…", label="Input Text"),
+    outputs=[
+        gr.Textbox(label="Predicted Label"),
+        gr.Number(label="Confidence")
+    ],
+    title="Arabic Message Classifier",
+    description="Classify Arabic messages into: greeting, question, complaint, or general."
+)
+iface.launch()
+```
+## Model Performance
+The model performs well on the held-out test set (10% of the data) and is particularly effective at:
+- Distinguishing between greetings and general statements
+- Identifying questions in both MSA and Iraqi dialect
+- Classifying complaints and technical issues
+- Handling mixed dialectal variations
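
Readers who want to verify these observations on their own data could compute per-class precision and recall. The sketch below is one way to do so in plain Python; `per_class_metrics` is a hypothetical helper, and the `y_true`/`y_pred` lists are toy values, not results from this model.

```python
from collections import Counter

LABELS = ["greeting", "question", "complaint", "general"]

def per_class_metrics(y_true, y_pred, labels=LABELS):
    """Return precision and recall for each label."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted this label, but it was wrong
            fn[truth] += 1  # missed the true label
    metrics = {}
    for label in labels:
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        metrics[label] = {
            "precision": tp[label] / predicted if predicted else 0.0,
            "recall": tp[label] / actual if actual else 0.0,
        }
    return metrics

# Toy data for illustration only.
y_true = ["greeting", "question", "complaint", "general", "question", "greeting"]
y_pred = ["greeting", "question", "general", "general", "question", "general"]

metrics = per_class_metrics(y_true, y_pred)
for label, scores in metrics.items():
    print(f"{label}: precision={scores['precision']:.2f}, recall={scores['recall']:.2f}")
```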
+## Supported Dialects
+- **Modern Standard Arabic (MSA)**: Formal Arabic text
+- **Iraqi Dialect**: Colloquial Iraqi Arabic expressions and vocabulary
+## Limitations
+- The model is specifically trained on MSA and Iraqi dialect; performance may vary with other Arabic dialects
+- Limited to 4 predefined categories
+- Performance depends on the similarity of input text to training data patterns
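+- Maximum input length is 128 tokens, the sequence length used during fine-tuning; longer messages should be truncated (for example with `truncation=True, max_length=128`)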
+## Ethical Considerations
+This model is intended for text classification purposes and should be used responsibly. Users should be aware that:
+- The model may reflect biases present in the training data
+- Performance may vary across different Arabic dialects not represented in training
+- The model should not be used for sensitive applications without proper validation
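
One simple safeguard in line with these points is to send low-confidence predictions to a human instead of acting on them automatically. The sketch below assumes the list-of-dicts output shape returned by the `text-classification` pipeline; `route_prediction`, the `0.8` threshold, and the `"needs_review"` value are hypothetical choices that would need tuning on validation data.

```python
def route_prediction(result, threshold=0.8):
    """Return the predicted label, or 'needs_review' when confidence is low.

    `result` is expected to look like the pipeline output:
    [{"label": "...", "score": 0.97}]
    """
    top = result[0]
    if top["score"] >= threshold:
        return top["label"]
    return "needs_review"

# Hypothetical pipeline outputs for illustration.
confident = [{"label": "complaint", "score": 0.97}]
uncertain = [{"label": "general", "score": 0.41}]

print(route_prediction(confident))  # complaint
print(route_prediction(uncertain))  # needs_review
```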
+## Citation
+If you use this model in your research, please cite:
+```bibtex
+@misc{arabic-mi-classifier,
+  title={Arabic Message Classification Model},
+  author={Ahmed Majid},
+  year={2025},
+  howpublished={Hugging Face Model Hub},
+  url={https://huggingface.co/ahmedmajid92/Arabic_MI_Classifier}
+}
+```
+## Model Card
+For more detailed information about the model's intended use, training data, and ethical considerations, please refer to the model card.
+## Contact
+For questions or issues, please contact ahmed1991madrid@gmail.com or create an issue in the model repository.
+## License
+This model is released under the MIT License, the same license as the base model `morit/arabic_xlm_xnli`.