---

language: ar
license: mit
library_name: transformers
pipeline_tag: text-classification
datasets:
- custom
tags:
- arabic
- text-classification
- iraqi-dialect
- msa
- message-classification
- xlm-roberta
widget:
- text: "السلام عليكم ورحمة الله وبركاته"
  example_title: "Arabic Greeting"
- text: "شلونك اليوم؟"
  example_title: "Iraqi Question"
- text: "عندي مشكلة بالانترنت"
  example_title: "Iraqi Complaint"
- text: "أحب القراءة كثيراً"
  example_title: "General Statement"
model-index:
- name: Arabic_MI_Classifier
  results:
  - task:
      type: text-classification
      name: Text Classification
    dataset:
      type: custom
      name: Arabic Messages Dataset
    metrics:
    - type: accuracy
      value: 0.95
      name: Accuracy
---


# Arabic Message Classification Model

This model is a fine-tuned XLM-RoBERTa classifier that sorts Arabic messages into four categories (greeting, question, complaint, general). It supports both Modern Standard Arabic (MSA) and Iraqi dialect.

## Model Description

- **Developed by:** Ahmed Majid
- **Model type:** XLM-RoBERTa for Sequence Classification
- **Language(s):** Arabic (MSA and Iraqi dialect)
- **License:** MIT
- **Finetuned from model:** morit/arabic_xlm_xnli

## Intended Uses

### Direct Use

This model can be used for:
- Classifying Arabic messages in customer service systems
- Organizing Arabic text messages by intent
- Building chatbots for Arabic-speaking users
- Content moderation for Arabic forums and social media

### Downstream Use

The model can be further fine-tuned for:
- Other Arabic dialects
- Domain-specific message classification
- Multi-label classification tasks

## How to Get Started with the Model

Use the code below to get started with the model.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

model_name = "ahmedmajid92/Arabic_MI_Classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
result = classifier("السلام عليكم")
print(result)
```
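In a routing setting such as customer service, the pipeline output usually needs a small amount of post-processing. The following is a minimal sketch, not part of the released model: it assumes the classifier returns one `{"label": ..., "score": ...}` dict per input (the usual shape for a text-classification pipeline called on a list), and it introduces a hypothetical confidence threshold of 0.7 and a hypothetical fallback label `needs_review`. The actual label strings depend on the model's configuration, which this card does not list.

```python
from typing import Callable, Dict, List

# Hypothetical values for illustration; tune them on your own data.
CONFIDENCE_THRESHOLD = 0.7
FALLBACK_LABEL = "needs_review"


def route_messages(
    classifier: Callable[[List[str]], List[Dict]],
    messages: List[str],
    threshold: float = CONFIDENCE_THRESHOLD,
) -> List[Dict]:
    """Attach a routing decision to each message.

    `classifier` is any callable that maps a list of strings to a list of
    {"label": str, "score": float} dicts, one per input string.
    """
    predictions = classifier(messages)
    routed = []
    for text, pred in zip(messages, predictions):
        confident = pred["score"] >= threshold
        routed.append(
            {
                "text": text,
                # Low-confidence predictions are sent to the fallback bucket.
                "label": pred["label"] if confident else FALLBACK_LABEL,
                "score": pred["score"],
            }
        )
    return routed
```

With the `classifier` object created in the snippet above, a call would look like `route_messages(classifier, ["شلونك اليوم؟", "عندي مشكلة بالانترنت"])`.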

## Training Details

### Training Data

The model was trained on a custom dataset of 5,000 Arabic messages:
- 50% Modern Standard Arabic (MSA)
- 50% Iraqi dialect
- 4 classes: greeting, question, complaint, general
- Balanced distribution: 1,250 examples per class
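The card does not describe how the data was split, so the following is only a sketch of one way to hold out a class-balanced test set from data of this shape. The stratified strategy, the seed, and the placeholder messages are all assumptions; only the class names and counts come from the list above.

```python
import random
from collections import Counter, defaultdict

LABELS = ["greeting", "question", "complaint", "general"]


def stratified_split(examples, test_fraction=0.1, seed=42):
    """Split (text, label) pairs so each label keeps the same test share."""
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))

    rng = random.Random(seed)  # fixed seed (an assumption) for repeatability
    train, test = [], []
    for label in sorted(by_label):
        items = by_label[label]
        rng.shuffle(items)
        n_test = round(len(items) * test_fraction)
        test.extend(items[:n_test])
        train.extend(items[n_test:])
    return train, test


# Placeholder data with the card's shape: 4 classes x 1,250 examples.
examples = [(f"message {i}", label) for label in LABELS for i in range(1250)]
train, test = stratified_split(examples)

print(len(train), len(test))  # 4500 500
print(Counter(label for _, label in test))
```

With a 10% test fraction this yields 125 test examples per class, which matches a balanced 500-example test set.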

### Training Procedure

#### Preprocessing

- Tokenization using XLM-RoBERTa tokenizer
- Maximum sequence length: 128 tokens
- Padding and truncation applied
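As a rough illustration of these preprocessing settings, the helper below passes the stated options to a Hugging Face style tokenizer. It is a sketch under assumptions: the card only says that padding was applied, so padding to the longest sequence in the batch (`padding=True`) is a guess, and `encode_batch` is a hypothetical helper name.

```python
MAX_LENGTH = 128  # maximum sequence length stated in the card


def encode_batch(tokenizer, texts, max_length=MAX_LENGTH):
    """Tokenize a batch with padding and truncation.

    `tokenizer` is expected to behave like a Hugging Face tokenizer, e.g.
    AutoTokenizer.from_pretrained("ahmedmajid92/Arabic_MI_Classifier").
    padding=True pads to the longest sequence in the batch; that strategy
    is an assumption, not something the card confirms.
    """
    return tokenizer(
        list(texts),
        padding=True,
        truncation=True,       # cut inputs longer than max_length
        max_length=max_length,
    )
```

Messages longer than 128 tokens are cut off by this call, so any content past that point never reaches the model.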

#### Training Hyperparameters

- **Training regime:** fp32
- **Epochs:** 20
- **Batch size:** 8 (training), 16 (evaluation)
- **Learning rate:** Not set explicitly (training framework default)
- **Warmup steps:** Not specified
- **Weight decay:** Not set explicitly (training framework default)
- **Optimizer:** AdamW
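To get a feel for the training budget these settings imply, here is a back-of-the-envelope calculation. It assumes 4,500 training examples (the 5,000-message dataset minus the 10% test split; the card does not mention a separate validation split), a single device, no gradient accumulation, and that the final partial batch is kept.

```python
import math

train_examples = 4500  # assumption: 5,000 total minus the 500-example test split
batch_size = 8         # training batch size from the card
epochs = 20            # from the card

steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs

print(steps_per_epoch)  # 563
print(total_steps)      # 11260
```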

#### Speeds, Sizes, Times

- **Model size:** ~280M parameters
- **Training time:** Approximately 2-3 hours on GPU
- **Inference time:** ~50ms per message on GPU

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

- 10% of the custom dataset (500 examples)
- Balanced across all 4 classes
- Mix of MSA and Iraqi dialect

#### Factors

- **Dialects:** MSA vs Iraqi dialect
- **Message length:** Short to medium messages
- **Domain:** General conversational messages

#### Metrics

- **Accuracy:** Primary metric
- **Per-class performance:** Evaluated for each label
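The card does not say which per-class measures were used. As a sketch, accuracy together with per-class precision and recall could be computed as follows; the toy labels are made up for illustration and are not results from this model.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def per_class_report(y_true, y_pred, labels):
    """Precision and recall for each label, 0.0 when undefined."""
    report = {}
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        predicted = sum(p == label for p in y_pred)
        actual = sum(t == label for t in y_true)
        report[label] = {
            "precision": tp / predicted if predicted else 0.0,
            "recall": tp / actual if actual else 0.0,
        }
    return report


labels = ["greeting", "question", "complaint", "general"]
# Toy data only: one "question" is mistaken for "general".
y_true = ["greeting", "greeting", "question", "question",
          "complaint", "complaint", "general", "general"]
y_pred = ["greeting", "greeting", "question", "general",
          "complaint", "complaint", "general", "general"]

print(accuracy(y_true, y_pred))  # 0.875
print(per_class_report(y_true, y_pred, labels)["question"])
```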

### Results

The model card metadata reports an overall accuracy of 0.95. Per-class figures are not published; the model is reported to perform well on all four classes:
- Greeting detection (both MSA and Iraqi)
- Question identification
- Complaint classification
- General statement recognition

## Environmental Impact

Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute).

## Technical Specifications

### Model Architecture and Objective

- **Architecture:** XLM-RoBERTa with classification head
- **Objective:** Multi-class text classification
- **Base model:** morit/arabic_xlm_xnli
- **Classification head:** 4 output classes
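To illustrate how a 4-way classification head turns into a prediction, the sketch below applies a softmax to four raw logits and picks the most probable class. The label order in `ID2LABEL` is an assumption; the real mapping is defined in the model's configuration, and the logits are invented values.

```python
import math

# Assumed label order; the real mapping lives in the model's config.
ID2LABEL = {0: "greeting", 1: "question", 2: "complaint", 3: "general"}


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(logits):
    """Return the most probable label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[best], probs[best]


# Invented logits for one message, one value per output class.
label, prob = predict([2.0, 0.5, -1.0, 0.1])
print(label, round(prob, 3))
```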

### Compute Infrastructure

#### Hardware

- **GPU:** NVIDIA GPU (recommended)
- **Memory:** 8GB+ GPU memory recommended

#### Software

- **Framework:** PyTorch
- **Libraries:** Transformers, Datasets
- **Python version:** 3.8+

## Bias, Risks, and Limitations

### Bias

- The model may exhibit biases present in the training data
- Performance may vary between different Arabic dialects
- Regional variations in Iraqi dialect may not be fully captured

### Risks

- Misclassification of ambiguous messages
- Potential cultural bias in greeting/complaint detection
- Limited generalization to formal/informal register variations

### Limitations

- Only supports 4 predefined classes
- Optimized for MSA and Iraqi dialect specifically
- Maximum input length of 128 tokens
- May not generalize well to other Arabic dialects

## Additional Information

### Author

Ahmed Majid

### Licensing Information

This model is released under the MIT License.

### Citation Information

```bibtex
@misc{arabic-mi-classifier,
  title={Arabic Message Classification Model},
  author={Ahmed Majid},
  year={2025},
  howpublished={Hugging Face Model Hub},
  url={https://huggingface.co/ahmedmajid92/Arabic_MI_Classifier}
}
```

### Acknowledgments

- Based on the XLM-RoBERTa model by Facebook AI
- Fine-tuned from morit/arabic_xlm_xnli
- Trained on custom Arabic message dataset