---
language: ar
license: mit
library_name: transformers
pipeline_tag: text-classification
datasets:
- custom
tags:
- arabic
- text-classification
- iraqi-dialect
- msa
- message-classification
- xlm-roberta
widget:
- text: السلام عليكم ورحمة الله وبركاته
  example_title: Arabic Greeting
- text: شلونك اليوم؟
  example_title: Iraqi Question
- text: عندي مشكلة بالانترنت
  example_title: Iraqi Complaint
- text: أحب القراءة كثيراً
  example_title: General Statement
model-index:
- name: Arabic_MI_Classifier
  results:
  - task:
      type: text-classification
      name: Text Classification
    dataset:
      type: custom
      name: Arabic Messages Dataset
    metrics:
    - type: accuracy
      value: 0.95
      name: Accuracy
---
# Arabic Message Classification Model

This model fine-tunes XLM-RoBERTa for Arabic message classification, supporting both Modern Standard Arabic (MSA) and the Iraqi dialect.

## Model Description

- Developed by: Ahmed Majid
- Model type: XLM-RoBERTa for Sequence Classification
- Language(s): Arabic (MSA and Iraqi dialect)
- License: MIT
- Finetuned from model: morit/arabic_xlm_xnli
## Intended Uses

### Direct Use

This model can be used for:
- Classifying Arabic messages in customer service systems
- Organizing Arabic text messages by intent
- Building chatbots for Arabic-speaking users
- Content moderation for Arabic forums and social media
### Downstream Use

The model can be further fine-tuned for:
- Other Arabic dialects
- Domain-specific message classification
- Multi-label classification tasks
## How to Get Started with the Model

Use the code below to get started with the model:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

# Load the fine-tuned model and tokenizer from the Hub
model_name = "ahmedmajid92/Arabic_MI_Classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Build a text-classification pipeline around them
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)

result = classifier("السلام عليكم")  # "Peace be upon you" (a greeting)
print(result)  # a list of {'label': ..., 'score': ...} dicts
```
## Training Details

### Training Data

The model was trained on a custom dataset of 5,000 Arabic messages:
- 50% Modern Standard Arabic (MSA)
- 50% Iraqi dialect
- 4 classes: greeting, question, complaint, general
- Balanced distribution: 1,250 examples per class
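A sketch of a label mapping for these four classes (the index order here is an assumption; the authoritative mapping is the `id2label` entry in the model's `config.json`):

```python
# Assumed index order; check the model's config.json (id2label / label2id)
# for the actual mapping used at training time.
label2id = {"greeting": 0, "question": 1, "complaint": 2, "general": 3}
id2label = {i: name for name, i in label2id.items()}

def label_name(class_index):
    """Map a predicted class index back to its label name."""
    return id2label[class_index]
```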
### Training Procedure

#### Preprocessing

- Tokenization using XLM-RoBERTa tokenizer
- Maximum sequence length: 128 tokens
- Padding and truncation applied
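The padding/truncation step fixes every example at 128 token ids. In practice the tokenizer handles this via `tokenizer(texts, padding="max_length", truncation=True, max_length=128)`; a plain-Python sketch of the idea (XLM-RoBERTa's pad token id is 1):

```python
def pad_or_truncate(token_ids, max_length=128, pad_id=1):
    """Truncate to max_length, then right-pad with pad_id (XLM-R uses id 1)."""
    token_ids = token_ids[:max_length]
    return token_ids + [pad_id] * (max_length - len(token_ids))
```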
#### Training Hyperparameters

- Training regime: fp32
- Epochs: 20
- Batch size: 8 (training), 16 (evaluation)
- Optimizer: AdamW
- Learning rate: Transformers default (5e-5)
- Warmup steps: not specified
- Weight decay: Transformers default (0.0)
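For reference, a sketch of these hyperparameters as Hugging Face `TrainingArguments` (the output path is a hypothetical placeholder, and the defaults are filled in explicitly; this mirrors the list above rather than the original training script):

```python
from transformers import TrainingArguments

# Sketch only: mirrors the hyperparameters listed above.
training_args = TrainingArguments(
    output_dir="./arabic-mi-classifier",  # hypothetical path
    num_train_epochs=20,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=16,
    learning_rate=5e-5,  # Transformers default
    weight_decay=0.0,    # Transformers default
)
```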
#### Speeds, Sizes, Times

- Model size: ~280M parameters
- Training time: Approximately 2-3 hours on GPU
- Inference time: ~50ms per message on GPU
## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

- 10% of the custom dataset (500 examples)
- Balanced across all 4 classes
- Mix of MSA and Iraqi dialect
#### Factors

- Dialects: MSA vs Iraqi dialect
- Message length: Short to medium messages
- Domain: General conversational messages
#### Metrics

- Accuracy: Primary metric
- Per-class performance: Evaluated for each label
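Per-class performance here means, for each label, the fraction of its test examples the model classified correctly (per-class recall). A minimal sketch on illustrative dummy labels (not the model's actual predictions):

```python
from collections import defaultdict

def per_class_accuracy(y_true, y_pred):
    """Fraction of examples of each true class that were predicted correctly."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    return {label: correct[label] / total[label] for label in total}

# Illustrative example using this model's four labels
y_true = ["greeting", "question", "complaint", "general", "greeting"]
y_pred = ["greeting", "question", "general", "general", "greeting"]
print(per_class_accuracy(y_true, y_pred))
# {'greeting': 1.0, 'question': 1.0, 'complaint': 0.0, 'general': 1.0}
```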
### Results

The model reaches 95% accuracy overall on the held-out test set, performing well across all four classes:
- Greeting detection (both MSA and Iraqi)
- Question identification
- Complaint classification
- General statement recognition
## Environmental Impact

Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute).
## Technical Specifications

### Model Architecture and Objective

- Architecture: XLM-RoBERTa with classification head
- Objective: Multi-class text classification
- Base model: morit/arabic_xlm_xnli
- Classification head: 4 output classes
### Compute Infrastructure

#### Hardware

- GPU: NVIDIA GPU (recommended)
- Memory: 8GB+ GPU memory recommended
#### Software

- Framework: PyTorch
- Libraries: Transformers, Datasets
- Python version: 3.8+
## Bias, Risks, and Limitations

### Bias

- The model may exhibit biases present in the training data
- Performance may vary between different Arabic dialects
- Regional variations in Iraqi dialect may not be fully captured
### Risks

- Misclassification of ambiguous messages
- Potential cultural bias in greeting/complaint detection
- Limited generalization to formal/informal register variations
### Limitations

- Only supports 4 predefined classes
- Optimized for MSA and Iraqi dialect specifically
- Maximum input length of 128 tokens
- May not generalize well to other Arabic dialects
## Additional Information

### Author

Ahmed Majid
### Licensing Information

This model is released under the MIT License.
### Citation Information

```bibtex
@misc{arabic-mi-classifier,
  title={Arabic Message Classification Model},
  author={Ahmed Majid},
  year={2025},
  howpublished={Hugging Face Model Hub},
  url={https://huggingface.co/ahmedmajid92/Arabic_MI_Classifier}
}
```
### Acknowledgments

- Based on the XLM-RoBERTa model by Facebook AI
- Fine-tuned from morit/arabic_xlm_xnli
- Trained on custom Arabic message dataset