metadata
language: ar
license: mit
library_name: transformers
pipeline_tag: text-classification
datasets:
  - custom
tags:
  - arabic
  - text-classification
  - iraqi-dialect
  - msa
  - message-classification
  - xlm-roberta
widget:
  - text: السلام عليكم ورحمة الله وبركاته
    example_title: Arabic Greeting
  - text: شلونك اليوم؟
    example_title: Iraqi Question
  - text: عندي مشكلة بالانترنت
    example_title: Iraqi Complaint
  - text: أحب القراءة كثيراً
    example_title: General Statement
model-index:
  - name: Arabic_MI_Classifier
    results:
      - task:
          type: text-classification
          name: Text Classification
        dataset:
          type: custom
          name: Arabic Messages Dataset
        metrics:
          - type: accuracy
            value: 0.95
            name: Accuracy

Arabic Message Classification Model

This model is a fine-tuned XLM-RoBERTa for Arabic message classification, supporting both Modern Standard Arabic (MSA) and the Iraqi dialect.

Model Description

  • Developed by: Ahmed Majid
  • Model type: XLM-RoBERTa for Sequence Classification
  • Language(s): Arabic (MSA and Iraqi dialect)
  • License: MIT
  • Fine-tuned from model: morit/arabic_xlm_xnli

Intended Uses

Direct Use

This model can be used for the following tasks; a minimal routing sketch follows the list:

  • Classifying Arabic messages in customer service systems
  • Organizing Arabic text messages by intent
  • Building chatbots for Arabic-speaking users
  • Content moderation for Arabic forums and social media

Downstream Use

The model can be further fine-tuned for the following; a head-replacement sketch follows the list:

  • Other Arabic dialects
  • Domain-specific message classification
  • Multi-label classification tasks
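
As a hedged sketch of this downstream use, the snippet below reloads the checkpoint with a fresh classification head sized for a different label set, which is the usual starting point for further fine-tuning with the transformers Trainer. The six labels are purely illustrative.

from transformers import AutoTokenizer, AutoModelForSequenceClassification

base = "ahmedmajid92/Arabic_MI_Classifier"
tokenizer = AutoTokenizer.from_pretrained(base)

# Hypothetical new label set for a domain-specific task; chosen only for illustration.
new_labels = ["greeting", "question", "complaint", "billing", "technical", "general"]

# ignore_mismatched_sizes discards the original 4-way head and
# initializes a new classification head sized for the new label set.
model = AutoModelForSequenceClassification.from_pretrained(
    base,
    num_labels=len(new_labels),
    id2label={i: label for i, label in enumerate(new_labels)},
    label2id={label: i for i, label in enumerate(new_labels)},
    ignore_mismatched_sizes=True,
)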

How to Get Started with the Model

Use the code below to get started with the model.

from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

# Load the fine-tuned model and its tokenizer from the Hugging Face Hub
model_name = "ahmedmajid92/Arabic_MI_Classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Wrap them in a text-classification pipeline and classify a message
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
result = classifier("السلام عليكم")
print(result)
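
The pipeline returns a list with one dictionary per input, each holding a label and a confidence score between 0 and 1; the label names come from the id2label mapping stored in the model config. To classify many messages at once, pass a list of strings instead of a single string.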

Training Details

Training Data

The model was trained on a custom dataset of 5,000 Arabic messages:

  • 50% Modern Standard Arabic (MSA)
  • 50% Iraqi dialect
  • 4 classes: greeting, question, complaint, general
  • Balanced distribution: 1,250 examples per class

Training Procedure

Preprocessing

  • Tokenization using XLM-RoBERTa tokenizer
  • Maximum sequence length: 128 tokens
  • Padding and truncation applied (a tokenization sketch follows this list)
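
A minimal sketch of the preprocessing listed above, assuming the standard transformers tokenizer call; it only mirrors the stated settings (maximum length 128, padding, truncation) and is not the author's exact training script.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ahmedmajid92/Arabic_MI_Classifier")

# Mirror the stated settings: max length 128 with padding and truncation.
encoded = tokenizer(
    ["السلام عليكم ورحمة الله وبركاته", "شلونك اليوم؟"],
    max_length=128,
    padding="max_length",
    truncation=True,
    return_tensors="pt",
)
print(encoded["input_ids"].shape)  # torch.Size([2, 128])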

Training Hyperparameters

  • Training regime: fp32
  • Epochs: 20
  • Batch size: 8 (training), 16 (evaluation)
  • Learning rate: left at the transformers Trainer default
  • Warmup steps: not specified
  • Weight decay: left at the Trainer default
  • Optimizer: AdamW (a TrainingArguments sketch follows this list)
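
A minimal sketch of how these settings map to transformers TrainingArguments, assuming the standard Trainer workflow; output_dir is a placeholder and anything not listed in this card is left at its default.

from transformers import TrainingArguments

# Values mirror the hyperparameters listed above; output_dir is a placeholder.
training_args = TrainingArguments(
    output_dir="./arabic_mi_classifier",
    num_train_epochs=20,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=16,
    # learning_rate and weight_decay stay at the transformers defaults;
    # fp32 is used since no fp16/bf16 flag is set.
)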

Speeds, Sizes, Times

  • Model size: ~280M parameters
  • Training time: Approximately 2-3 hours on GPU
  • Inference time: ~50ms per message on GPU (a sketch for checking size and latency follows this list)
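
If these figures need to be verified on local hardware, a rough check might look like the sketch below; the latency depends on hardware and batching, and the snippet runs on CPU as written, so the GPU figure requires moving the model and inputs to a GPU.

import time
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "ahmedmajid92/Arabic_MI_Classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name).eval()

# Count parameters; the card lists roughly 280M.
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.0f}M parameters")

# Rough single-message latency.
inputs = tokenizer("شلونك اليوم؟", return_tensors="pt")
with torch.no_grad():
    start = time.perf_counter()
    model(**inputs)
print(f"{(time.perf_counter() - start) * 1000:.1f} ms")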

Evaluation

Testing Data, Factors & Metrics

Testing Data

  • 10% of the custom dataset (500 examples)
  • Balanced across all 4 classes
  • Mix of MSA and Iraqi dialect

Factors

  • Dialects: MSA vs Iraqi dialect
  • Message length: Short to medium messages
  • Domain: General conversational messages

Metrics

  • Accuracy: Primary metric
  • Per-class performance: evaluated for each label (an evaluation sketch follows this list)
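
A hedged sketch of how accuracy and per-class metrics can be computed for a labeled test set using scikit-learn; the messages and gold labels below are placeholders, not the author's evaluation data.

from sklearn.metrics import accuracy_score, classification_report
from transformers import pipeline

classifier = pipeline("text-classification", model="ahmedmajid92/Arabic_MI_Classifier")

# Placeholder test set; replace with the real held-out messages and gold labels.
texts = ["السلام عليكم", "شلونك اليوم؟", "عندي مشكلة بالانترنت"]
gold = ["greeting", "question", "complaint"]

predicted = [p["label"] for p in classifier(texts)]
print("Accuracy:", accuracy_score(gold, predicted))
print(classification_report(gold, predicted, zero_division=0))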

Results

The model reaches roughly 95% accuracy on the held-out test set and performs well across all classes, with particular strength in:

  • Greeting detection (both MSA and Iraqi)
  • Question identification
  • Complaint classification
  • General statement recognition

Environmental Impact

Carbon emissions can be estimated using the Machine Learning Impact calculator (https://mlco2.github.io/impact#compute).

Technical Specifications

Model Architecture and Objective

  • Architecture: XLM-RoBERTa with classification head
  • Objective: Multi-class text classification
  • Base model: morit/arabic_xlm_xnli
  • Classification head: 4 output classes (the label mapping can be read from the model config, as sketched below)
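
A short sketch for inspecting the classification head's label mapping from the published config, assuming the standard transformers AutoConfig API.

from transformers import AutoConfig

config = AutoConfig.from_pretrained("ahmedmajid92/Arabic_MI_Classifier")

# num_labels should be 4; id2label maps head indices to the class names used at inference time.
print(config.num_labels)
print(config.id2label)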

Compute Infrastructure

Hardware

  • GPU: NVIDIA GPU (recommended)
  • Memory: 8GB+ GPU memory recommended

Software

  • Framework: PyTorch
  • Libraries: Transformers, Datasets
  • Python version: 3.8+

Bias, Risks, and Limitations

Bias

  • The model may exhibit biases present in the training data
  • Performance may vary between different Arabic dialects
  • Regional variations in Iraqi dialect may not be fully captured

Risks

  • Misclassification of ambiguous messages
  • Potential cultural bias in greeting/complaint detection
  • Limited generalization to formal/informal register variations

Limitations

  • Only supports 4 predefined classes
  • Optimized for MSA and Iraqi dialect specifically
  • Maximum input length of 128 tokens
  • May not generalize well to other Arabic dialects

Additional Information

Author

Ahmed Majid

Licensing Information

This model is released under the MIT License.

Citation Information

@misc{arabic-mi-classifier,
  title={Arabic Message Classification Model},
  author={Ahmed Majid},
  year={2025},
  howpublished={Hugging Face Model Hub},
  url={https://huggingface.co/ahmedmajid92/Arabic_MI_Classifier}
}

Acknowledgments

  • Based on the XLM-RoBERTa model by Facebook AI
  • Fine-tuned from morit/arabic_xlm_xnli
  • Trained on custom Arabic message dataset