Commit 000615a by ahmedmajid92 (verified) · Parent: 3922543

Update README.md

Files changed: README.md (+226, −182)
---
language: ar
license: mit
library_name: transformers
pipeline_tag: text-classification
datasets:
- custom
tags:
- arabic
- text-classification
- iraqi-dialect
- msa
- message-classification
- xlm-roberta
- fine-tuned
widget:
- text: "السلام عليكم ورحمة الله وبركاته"
  example_title: "Arabic Greeting (MSA)"
- text: "هلو شلونك اليوم؟"
  example_title: "Iraqi Greeting + Question"
- text: "متى يبدأ الاجتماع؟"
  example_title: "Question (MSA)"
- text: "عندي مشكلة بالانترنت"
  example_title: "Complaint (Iraqi)"
- text: "أحب القراءة والكتابة"
  example_title: "General Statement (MSA)"
- text: "الكهرباء نفطت"
  example_title: "Complaint (Iraqi)"
model-index:
- name: Arabic_MI_Classifier
  results:
  - task:
      type: text-classification
      name: Text Classification
    dataset:
      type: custom
      name: Arabic Messages Dataset (MSA + Iraqi)
    metrics:
    - type: accuracy
      value: 0.95
      name: Accuracy
base_model: morit/arabic_xlm_xnli
---

# Arabic Message Classification Model

## Model Description

This is a fine-tuned XLM-RoBERTa model for Arabic message classification, specifically designed to classify messages in both Modern Standard Arabic (MSA) and Iraqi dialect. The model is based on `morit/arabic_xlm_xnli` and has been fine-tuned on a custom dataset of 5,000 Arabic messages.

## Model Details

- **Base Model**: `morit/arabic_xlm_xnli`
- **Architecture**: XLMRobertaForSequenceClassification
- **Language**: Arabic (MSA and Iraqi dialect)
- **Task**: Text Classification
- **Number of Labels**: 4
- **Model Size**: ~280M parameters

## Labels

The model classifies messages into four categories:

| Label ID | Label Name | Description | Examples |
|----------|------------|-------------|----------|
| 0 | greeting | Greetings and salutations | "السلام عليكم", "هلو", "مرحبا" |
| 1 | question | Questions and inquiries | "كيف حالك؟", "شلونك؟", "متى الاجتماع؟" |
| 2 | complaint | Complaints and problems | "عندي مشكلة", "الانترنت معطل", "الجهاز لا يعمل" |
| 3 | general | General statements | "أحب القراءة", "أعمل مهندساً", "أسافر كثيراً" |

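The table above corresponds to the label mapping that should also appear in the model's `config.json` (`id2label` / `label2id`); a minimal sketch:

```python
# Label mapping mirroring the table above. The same mapping is expected
# in the model's config.json (id2label / label2id).
id2label = {0: "greeting", 1: "question", 2: "complaint", 3: "general"}
label2id = {label: idx for idx, label in id2label.items()}

print(id2label[2])           # complaint
print(label2id["question"])  # 1
```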
## Training Data

The model was trained on a custom dataset containing:
- **5,000 Arabic messages** (50% MSA, 50% Iraqi dialect)
- **Balanced distribution**: 1,250 examples per class
- **Train/Test Split**: 90%/10%

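The split sizes implied by these figures can be checked with a little arithmetic (this assumes the 90%/10% split is stratified per class, which the balanced counts make plausible but the card does not state explicitly):

```python
# Dataset arithmetic from the figures above.
total_messages = 5000
num_classes = 4

per_class = total_messages // num_classes  # 1250 examples per class
train_size = int(total_messages * 0.9)     # 4500 training examples
test_size = total_messages - train_size    # 500 test examples
train_per_class = per_class * 9 // 10      # 1125 per class, if stratified

print(train_size, test_size, train_per_class)
```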
## Training Details

- **Training Epochs**: 20
- **Batch Size**: 8 (training), 16 (evaluation)
- **Optimizer**: AdamW with the `transformers` `Trainer` default learning rate (5e-5)
- **Maximum Sequence Length**: 128 tokens
- **Evaluation Strategy**: Every 500 steps

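A rough training-schedule estimate follows from these hyperparameters (a sketch assuming 4,500 training examples, i.e. 90% of 5,000, and no gradient accumulation):

```python
import math

train_examples = 4500  # 90% of the 5,000-message dataset
batch_size = 8
epochs = 20
eval_interval = 500    # evaluation every 500 steps

steps_per_epoch = math.ceil(train_examples / batch_size)  # 563
total_steps = steps_per_epoch * epochs                    # 11260
num_evaluations = total_steps // eval_interval            # 22

print(steps_per_epoch, total_steps, num_evaluations)
```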
## Usage

### Using Transformers Pipeline

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

# Load the model and tokenizer
model_name = "ahmedmajid92/Arabic_MI_Classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Create a classification pipeline
classifier = pipeline(
    "text-classification",
    model=model,
    tokenizer=tokenizer
)

# Classify a message
text = "السلام عليكم ورحمة الله"
result = classifier(text)
print(f"Label: {result[0]['label']}, Score: {result[0]['score']:.4f}")
```

### Using the Model Directly

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model_name = "ahmedmajid92/Arabic_MI_Classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Tokenize input
text = "شلونك اليوم؟"
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=128)

# Get predictions
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
    predicted_class_id = predictions.argmax().item()
    confidence = predictions.max().item()

# Map to label names
id2label = {0: "greeting", 1: "question", 2: "complaint", 3: "general"}
predicted_label = id2label[predicted_class_id]

print(f"Text: {text}")
print(f"Predicted Label: {predicted_label}")
print(f"Confidence: {confidence:.4f}")
```

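The softmax step above is what turns raw logits into the reported confidence score. A standalone illustration with made-up logits (not real model output) makes the computation explicit without needing the model loaded:

```python
import math

def softmax(logits):
    # Numerically stable softmax: shift by the max before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

id2label = {0: "greeting", 1: "question", 2: "complaint", 3: "general"}

# Hypothetical logits for the four classes (illustrative only).
logits = [0.3, 3.1, 0.2, 0.9]
probs = softmax(logits)
pred_id = max(range(len(probs)), key=probs.__getitem__)

print(id2label[pred_id], round(probs[pred_id], 4))
```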
### Gradio Web Interface

```python
import gradio as gr
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

# Load model
model_name = "ahmedmajid92/Arabic_MI_Classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Create classifier
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)

def classify_text(text):
    result = classifier(text)[0]
    return result["label"], float(result["score"])

# Create Gradio interface
iface = gr.Interface(
    fn=classify_text,
    inputs=gr.Textbox(lines=2, placeholder="اكتب جملتك هنا…", label="Input Text"),
    outputs=[
        gr.Textbox(label="Predicted Label"),
        gr.Number(label="Confidence")
    ],
    title="Arabic Message Classifier",
    description="Classify Arabic messages into: greeting, question, complaint, or general."
)

iface.launch()
```

## Model Performance

The model reaches roughly 95% accuracy on the held-out test set (the figure reported in the metadata above) and is particularly effective at:
- Distinguishing between greetings and general statements
- Identifying questions in both MSA and Iraqi dialect
- Classifying complaints and technical issues
- Handling mixed dialectal variations

## Supported Dialects

- **Modern Standard Arabic (MSA)**: Formal Arabic text
- **Iraqi Dialect**: Colloquial Iraqi Arabic expressions and vocabulary

## Limitations

- The model is specifically trained on MSA and Iraqi dialect; performance may vary with other Arabic dialects
- Limited to 4 predefined categories
- Performance depends on the similarity of input text to training data patterns
- Maximum input length is 128 tokens

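Because the model always outputs one of the four labels, out-of-scope inputs are forced into the nearest category. One common mitigation (an illustrative sketch, not part of the released model) is to treat low-confidence predictions as uncertain:

```python
def resolve_label(label, score, threshold=0.6):
    # Fall back to "uncertain" when the top-class probability is below
    # the threshold; 0.6 is illustrative, not tuned on real data.
    return label if score >= threshold else "uncertain"

print(resolve_label("greeting", 0.92))  # greeting
print(resolve_label("general", 0.35))   # uncertain
```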
## Ethical Considerations

This model is intended for text classification purposes and should be used responsibly. Users should be aware that:
- The model may reflect biases present in the training data
- Performance may vary across different Arabic dialects not represented in training
- The model should not be used for sensitive applications without proper validation

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{arabic-mi-classifier,
  title={Arabic Message Classification Model},
  author={Ahmed Majid},
  year={2025},
  howpublished={Hugging Face Model Hub},
  url={https://huggingface.co/ahmedmajid92/Arabic_MI_Classifier}
}
```

## Model Card

This README serves as the model card; the sections above document the model's intended use, training data, limitations, and ethical considerations.

## Contact

For questions or issues, please contact ahmed1991madrid@gmail.com or open an issue in the model repository.

## License

This model is released under the MIT License, same as the base model `morit/arabic_xlm_xnli`.