---
language: ar
license: mit
library_name: transformers
pipeline_tag: text-classification
datasets:
- custom
tags:
- arabic
- text-classification
- iraqi-dialect
- msa
- message-classification
- xlm-roberta
- fine-tuned
widget:
- text: "السلام عليكم ورحمة الله وبركاته"
  example_title: "Arabic Greeting (MSA)"
- text: "هلو شلونك اليوم؟"
  example_title: "Iraqi Greeting + Question"
- text: "متى يبدأ الاجتماع؟"
  example_title: "Question (MSA)"
- text: "عندي مشكلة بالانترنت"
  example_title: "Complaint (Iraqi)"
- text: "أحب القراءة والكتابة"
  example_title: "General Statement (MSA)"
- text: "الكهرباء نفطت"
  example_title: "Complaint (Iraqi)"
model-index:
- name: Arabic_MI_Classifier
  results:
  - task:
      type: text-classification
      name: Text Classification
    dataset:
      type: custom
      name: Arabic Messages Dataset (MSA + Iraqi)
    metrics:
    - type: accuracy
      value: 0.95
      name: Accuracy
base_model: morit/arabic_xlm_xnli
---
# Arabic Message Classification Model
## Model Description
This model is a fine-tuned XLM-RoBERTa classifier for Arabic messages, covering both Modern Standard Arabic (MSA) and the Iraqi dialect. It is based on `morit/arabic_xlm_xnli` and was fine-tuned on a custom dataset of 5,000 Arabic messages.
## Model Details
- **Base Model**: `morit/arabic_xlm_xnli`
- **Architecture**: XLMRobertaForSequenceClassification
- **Language**: Arabic (MSA and Iraqi dialect)
- **Task**: Text Classification
- **Number of Labels**: 4
- **Model Size**: ~280M parameters
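These details can be verified directly from the published configuration; a minimal sketch (the printed values depend on what is stored in the hub config):

```python
from transformers import AutoConfig

# Fetch only the configuration, not the ~280M-parameter weights
config = AutoConfig.from_pretrained("ahmedmajid92/Arabic_MI_Classifier")
print(config.architectures)  # expected: ['XLMRobertaForSequenceClassification']
print(config.num_labels)     # expected: 4
print(config.id2label)       # id -> label-name mapping, if stored in the config
```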
## Labels
The model classifies messages into four categories:

| Label ID | Label Name | Description | Examples |
|----------|------------|-------------|----------|
| 0 | greeting | Greetings and salutations | "السلام عليكم", "هلو", "مرحبا" |
| 1 | question | Questions and inquiries | "كيف حالك؟", "شلونك؟", "متى الاجتماع؟" |
| 2 | complaint | Complaints and problems | "عندي مشكلة", "الانترنت معطل", "الجهاز لا يعمل" |
| 3 | general | General statements | "أحب القراءة", "أعمل مهندساً", "أسافر كثيراً" |
## Training Data
The model was trained on a custom dataset containing:
- **5,000 Arabic messages** (50% MSA, 50% Iraqi dialect)
- **Balanced distribution**: 1,250 examples per class
- **Train/Test Split**: 90%/10%
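The dataset itself is custom and not published; the sketch below assumes a hypothetical CSV with `text` and `label` columns and reproduces the stated 90/10 split:

```python
from datasets import load_dataset

# Hypothetical file: "arabic_messages.csv" with "text" and "label" columns
ds = load_dataset("csv", data_files="arabic_messages.csv")["train"]

# 90% train / 10% test, as stated above
splits = ds.train_test_split(test_size=0.1, seed=42)
train_ds, test_ds = splits["train"], splits["test"]
```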
## Training Details
- **Training Epochs**: 20
- **Batch Size**: 8 (training), 16 (evaluation)
- **Optimizer**: AdamW with its default learning rate
- **Maximum Sequence Length**: 128 tokens
- **Evaluation Strategy**: Every 500 steps
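The exact training script is not published; the following is a minimal `Trainer` sketch that mirrors the hyperparameters above, reusing `train_ds`/`test_ds` from the data-loading sketch:

```python
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("morit/arabic_xlm_xnli")
# The XNLI base model ships a 3-label head, so the 4-label head is re-initialized
model = AutoModelForSequenceClassification.from_pretrained(
    "morit/arabic_xlm_xnli", num_labels=4, ignore_mismatched_sizes=True
)

def tokenize(batch):
    # Matches the stated 128-token maximum sequence length
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

args = TrainingArguments(
    output_dir="arabic_mi_classifier",
    num_train_epochs=20,             # stated training epochs
    per_device_train_batch_size=8,   # stated training batch size
    per_device_eval_batch_size=16,   # stated evaluation batch size
    eval_strategy="steps",           # "evaluation_strategy" on older transformers
    eval_steps=500,                  # evaluate every 500 steps
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds.map(tokenize, batched=True),
    eval_dataset=test_ds.map(tokenize, batched=True),
)
trainer.train()
```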
## Usage
### Using Transformers Pipeline
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
# Load the model and tokenizer
model_name = "ahmedmajid92/Arabic_MI_Classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
# Create a classification pipeline
classifier = pipeline(
    "text-classification",
    model=model,
    tokenizer=tokenizer
)
# Classify a message
text = "السلام عليكم ورحمة الله"
result = classifier(text)
print(f"Label: {result[0]['label']}, Score: {result[0]['score']:.4f}")
```
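The pipeline also accepts a list of messages, which is convenient for batch scoring:

```python
# Reuses the classifier created above
messages = ["هلو شلونك اليوم؟", "عندي مشكلة بالانترنت", "أحب القراءة والكتابة"]
for msg, pred in zip(messages, classifier(messages)):
    print(f"{msg} -> {pred['label']} ({pred['score']:.4f})")
```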
### Using the Model Directly
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
# Load model and tokenizer
model_name = "ahmedmajid92/Arabic_MI_Classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
# Tokenize input
text = "شلونك اليوم؟"
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=128)
# Get predictions
with torch.no_grad():
    outputs = model(**inputs)
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
predicted_class_id = predictions.argmax().item()
confidence = predictions.max().item()
# Map to label names
id2label = {0: "greeting", 1: "question", 2: "complaint", 3: "general"}
predicted_label = id2label[predicted_class_id]
print(f"Text: {text}")
print(f"Predicted Label: {predicted_label}")
print(f"Confidence: {confidence:.4f}")
```
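If a GPU is available, the same snippet runs there after moving the model and the tokenized inputs to the device:

```python
import torch

# Continues the snippet above: move model and inputs to the GPU if present
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
inputs = {k: v.to(device) for k, v in inputs.items()}
```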
### Gradio Web Interface
```python
import gradio as gr
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
# Load model
model_name = "ahmedmajid92/Arabic_MI_Classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
# Create classifier
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
def classify_text(text):
    result = classifier(text)[0]
    return result["label"], float(result["score"])

# Create Gradio interface
iface = gr.Interface(
    fn=classify_text,
    inputs=gr.Textbox(lines=2, placeholder="اكتب جملتك هنا…", label="Input Text"),
    outputs=[
        gr.Textbox(label="Predicted Label"),
        gr.Number(label="Confidence")
    ],
    title="Arabic Message Classifier",
    description="Classify Arabic messages into: greeting, question, complaint, or general."
)
iface.launch()
```
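Passing `share=True` to `iface.launch()` additionally creates a temporary public URL, which is handy for quick demos.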
## Model Performance
The model reaches 95% accuracy on the held-out test set (see the metrics above) and is particularly effective at:
- Distinguishing between greetings and general statements
- Identifying questions in both MSA and Iraqi dialect
- Classifying complaints and technical issues
- Handling mixed dialectal variations
## Supported Dialects
- **Modern Standard Arabic (MSA)**: Formal Arabic text
- **Iraqi Dialect**: Colloquial Iraqi Arabic expressions and vocabulary
## Limitations
- The model is specifically trained on MSA and Iraqi dialect; performance may vary with other Arabic dialects
- Limited to 4 predefined categories
- Performance depends on the similarity of input text to training data patterns
- Maximum input length is 128 tokens
## Ethical Considerations
This model is intended for text classification purposes and should be used responsibly. Users should be aware that:
- The model may reflect biases present in the training data
- Performance may vary across different Arabic dialects not represented in training
- The model should not be used for sensitive applications without proper validation
## Citation
If you use this model in your research, please cite:
```bibtex
@misc{arabic-mi-classifier,
title={Arabic Message Classification Model},
author={Ahmed Majid},
year={2025},
howpublished={Hugging Face Model Hub},
url={https://huggingface.co/ahmedmajid92/Arabic_MI_Classifier}
}
```
## Model Card
This README serves as the model card: the sections above cover the model's intended use, training data, limitations, and ethical considerations.
## Contact
For questions or issues, please contact ahmed1991madrid@gmail.com or create an issue in the model repository.
## License
This model is released under the MIT License, the same license as its base model `morit/arabic_xlm_xnli`.