---
language: ar
license: mit
library_name: transformers
pipeline_tag: text-classification
datasets:
- custom
tags:
- arabic
- text-classification
- iraqi-dialect
- msa
- message-classification
- xlm-roberta
widget:
- text: "السلام عليكم ورحمة الله وبركاته"
  example_title: "Arabic Greeting"
- text: "شلونك اليوم؟"
  example_title: "Iraqi Question"
- text: "عندي مشكلة بالانترنت"
  example_title: "Iraqi Complaint"
- text: "أحب القراءة كثيراً"
  example_title: "General Statement"
model-index:
- name: Arabic_MI_Classifier
  results:
  - task:
      type: text-classification
      name: Text Classification
    dataset:
      type: custom
      name: Arabic Messages Dataset
    metrics:
    - type: accuracy
      value: 0.95
      name: Accuracy
---

# Arabic Message Classification Model

This model fine-tunes XLM-RoBERTa for Arabic message classification, supporting both Modern Standard Arabic (MSA) and the Iraqi dialect.

## Model Description

- **Developed by:** Ahmed Majid
- **Model type:** XLM-RoBERTa for sequence classification
- **Language(s):** Arabic (MSA and Iraqi dialect)
- **License:** MIT
- **Finetuned from model:** morit/arabic_xlm_xnli

## Intended Uses

### Direct Use

This model can be used for:

- Classifying Arabic messages in customer-service systems
- Organizing Arabic text messages by intent
- Building chatbots for Arabic-speaking users
- Content moderation for Arabic forums and social media

### Downstream Use

The model can be further fine-tuned for:

- Other Arabic dialects
- Domain-specific message classification
- Multi-label classification tasks

## How to Get Started with the Model

Use the code below to get started with the model.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

model_name = "ahmedmajid92/Arabic_MI_Classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
result = classifier("السلام عليكم")
print(result)
```
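The pipeline call returns only the top label with its score. If the full probability distribution over classes is needed, a minimal sketch working from the raw logits (label names are read from the model's `config.id2label`, so nothing is hard-coded):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "ahmedmajid92/Arabic_MI_Classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Tokenize with the same settings used in training (max length 128).
inputs = tokenizer("شلونك اليوم؟", return_tensors="pt", truncation=True, max_length=128)
with torch.no_grad():
    logits = model(**inputs).logits

# Softmax converts logits into a probability distribution over the classes.
probs = torch.softmax(logits, dim=-1)[0]
for idx, p in enumerate(probs):
    print(model.config.id2label[idx], round(float(p), 3))
```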

## Training Details

### Training Data

The model was trained on a custom dataset of 5,000 Arabic messages:

- 50% Modern Standard Arabic (MSA)
- 50% Iraqi dialect
- 4 classes: greeting, question, complaint, general
- Balanced distribution: 1,250 examples per class

### Training Procedure

#### Preprocessing

- Tokenization using the XLM-RoBERTa tokenizer
- Maximum sequence length: 128 tokens
- Padding and truncation applied

#### Training Hyperparameters

- **Training regime:** fp32
- **Epochs:** 20
- **Batch size:** 8 (training), 16 (evaluation)
- **Optimizer:** AdamW
- **Learning rate:** `transformers` default (5e-5)
- **Warmup steps:** not specified
- **Weight decay:** `transformers` default (0.0)

#### Speeds, Sizes, Times

- **Model size:** ~280M parameters
- **Training time:** approximately 2-3 hours on a single GPU
- **Inference time:** ~50 ms per message on GPU

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

- 10% of the custom dataset (500 examples)
- Balanced across all 4 classes
- Mix of MSA and Iraqi dialect

#### Factors

- **Dialects:** MSA vs. Iraqi dialect
- **Message length:** short to medium messages
- **Domain:** general conversational messages

#### Metrics

- **Accuracy:** primary metric
- **Per-class performance:** evaluated for each label
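Overall and per-class accuracy can be computed without extra dependencies; a small sketch (the prediction and gold lists below are illustrative, not real evaluation output):

```python
from collections import defaultdict

def accuracy_report(predictions, references):
    """Overall accuracy plus per-label accuracy (recall per class)."""
    correct, total = defaultdict(int), defaultdict(int)
    for pred, gold in zip(predictions, references):
        total[gold] += 1
        correct[gold] += pred == gold
    overall = sum(correct.values()) / len(references)
    per_class = {label: correct[label] / total[label] for label in total}
    return overall, per_class

# Illustrative labels only.
preds = ["greeting", "question", "complaint", "general", "greeting"]
golds = ["greeting", "question", "question", "general", "greeting"]
overall, per_class = accuracy_report(preds, golds)
print(overall)                # 0.8
print(per_class["question"])  # 0.5
```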

### Results

The model achieves 95% accuracy on the held-out test set, performing well across all four classes:

- Greeting detection (both MSA and Iraqi)
- Question identification
- Complaint classification
- General statement recognition

## Environmental Impact

Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute).

## Technical Specifications

### Model Architecture and Objective

- **Architecture:** XLM-RoBERTa with a classification head
- **Objective:** multi-class text classification
- **Base model:** morit/arabic_xlm_xnli
- **Classification head:** 4 output classes
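The 4-way head corresponds to a config along these lines; the label names are taken from the Training Data section, but their index order is an assumption, not something the card specifies:

```python
from transformers import XLMRobertaConfig

# Assumed label order; the released model's config.id2label is authoritative.
id2label = {0: "greeting", 1: "question", 2: "complaint", 3: "general"}
config = XLMRobertaConfig(
    id2label=id2label,
    label2id={label: idx for idx, label in id2label.items()},
)
# num_labels is derived from the label map.
print(config.num_labels)   # 4
print(config.id2label[2])  # complaint
```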

### Compute Infrastructure

#### Hardware

- **GPU:** NVIDIA GPU (recommended)
- **Memory:** 8 GB+ GPU memory recommended

#### Software

- **Framework:** PyTorch
- **Libraries:** Transformers, Datasets
- **Python version:** 3.8+

## Bias, Risks, and Limitations

### Bias

- The model may exhibit biases present in the training data
- Performance may vary between different Arabic dialects
- Regional variations in Iraqi dialect may not be fully captured

### Risks

- Misclassification of ambiguous messages
- Potential cultural bias in greeting/complaint detection
- Limited generalization to formal/informal register variations

### Limitations

- Supports only 4 predefined classes
- Optimized specifically for MSA and the Iraqi dialect
- Maximum input length of 128 tokens
- May not generalize well to other Arabic dialects

## Additional Information

### Author

Ahmed Majid

### Licensing Information

This model is released under the MIT License.

### Citation Information

```bibtex
@misc{arabic-mi-classifier,
  title={Arabic Message Classification Model},
  author={Ahmed Majid},
  year={2025},
  howpublished={Hugging Face Model Hub},
  url={https://huggingface.co/ahmedmajid92/Arabic_MI_Classifier}
}
```

### Acknowledgments

- Based on the XLM-RoBERTa model by Facebook AI
- Fine-tuned from morit/arabic_xlm_xnli
- Trained on a custom Arabic message dataset