---
library_name: transformers
tags: []
---
# Model Card for MDABERT

## Model Details

### Model Description
This model card describes MDABERT, a Multi-Label country-level Dialect Identification (ML-DID) model built on CAMeLBERT. It classifies Arabic text into multiple country-level dialect categories at once, and is trained with pseudo-labeling and a curriculum-based training schedule.
- Developed by: Ali Mekky, Lara Hassan, Mohamed ElZeftawy
- Institution: MBZUAI
- Model type: Transformer-based multi-label classifier
- Language(s) (NLP): Arabic (Dialectal Variants)
- License: TBD
- Fine-tuned from models: CAMeLBERT, MARBERT
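Inference is not shown above; as a minimal sketch, multi-label predictions are obtained by applying a sigmoid to the classifier's logits and thresholding each country label independently. The label set and threshold below are illustrative placeholders, not the model's actual configuration:

```python
import torch

# Hypothetical label set for illustration; the real model's config
# (id2label) defines the actual country labels.
LABELS = ["Egypt", "Morocco", "Saudi_Arabia", "Syria"]

def logits_to_dialects(logits: torch.Tensor, threshold: float = 0.5) -> list[str]:
    """Map raw classifier logits to the set of predicted dialects.

    Unlike a softmax (single-label) head, each country is scored
    independently with a sigmoid, so one sentence can be assigned
    to several countries at once.
    """
    probs = torch.sigmoid(logits)
    return [label for label, p in zip(LABELS, probs.tolist()) if p >= threshold]

# In practice the logits come from the fine-tuned CAMeLBERT/MARBERT
# classifier; dummy values stand in here.
dummy_logits = torch.tensor([2.0, -1.5, 0.3, -0.2])
print(logits_to_dialects(dummy_logits))  # → ['Egypt', 'Saudi_Arabia']
```

Because each label is an independent binary decision, a sentence written in a form shared by several countries can legitimately receive more than one label.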
## Bias, Risks, and Limitations

### Biases
- Geographic bias in dataset annotation.
- Overlapping dialects may result in misclassification.
- Errors may arise from synthetic labels.
### Recommendations
Users should be aware of biases in dataset annotation and carefully validate outputs for high-stakes applications.
## Training Details

### Training Data
- Datasets: NADI 2020, 2021, 2023, and 2024 development set.
- Synthetic multi-label dataset created through pseudo-labeling.
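The card does not detail the pseudo-labeling procedure; one common recipe, sketched below under assumed settings, is to run a single-label teacher model over each sentence and keep every country whose probability clears a confidence threshold as a positive label:

```python
import torch

def pseudo_label(teacher_probs: torch.Tensor, keep_threshold: float = 0.4) -> torch.Tensor:
    """Convert a single-label teacher's per-country probabilities into a
    multi-hot training target.

    Every country whose probability clears `keep_threshold` becomes a
    positive label, and the argmax country is always kept so each
    sentence has at least one label. The threshold value here is
    illustrative, not the one used to build the ML-DID dataset.
    """
    multi_hot = (teacher_probs >= keep_threshold).float()
    multi_hot[teacher_probs.argmax()] = 1.0  # guarantee at least one positive
    return multi_hot

# A sentence the teacher finds ambiguous between two dialects
# becomes a two-label training example.
print(pseudo_label(torch.tensor([0.55, 0.42, 0.02, 0.01])))  # → tensor([1., 1., 0., 0.])
```

Targets produced this way are noisy by construction, which is one source of the synthetic-label errors noted under "Biases" above.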
## Evaluation

### Testing Data & Metrics
- Testing Data: NADI 2024 Test set
- Metrics: Macro F1-score, precision, recall
- Leaderboard: [NADI 2024 leaderboard](https://huggingface.co/spaces/AMR-KELEG/NADI2024-leaderboard)
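To make the headline metric concrete, the macro F1-score averages the per-label F1 over the country columns of the multi-hot prediction matrix, weighting every dialect equally regardless of how often it occurs:

```python
def macro_f1(y_true: list[list[int]], y_pred: list[list[int]]) -> float:
    """Macro-averaged F1 over the label columns of multi-hot matrices.

    Matches sklearn.metrics.f1_score(..., average="macro") on binary
    indicator inputs; written out here to make the metric explicit.
    """
    num_labels = len(y_true[0])
    f1_scores = []
    for j in range(num_labels):
        tp = sum(t[j] == 1 and p[j] == 1 for t, p in zip(y_true, y_pred))
        fp = sum(t[j] == 0 and p[j] == 1 for t, p in zip(y_true, y_pred))
        fn = sum(t[j] == 1 and p[j] == 0 for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        f1_scores.append(f1)
    return sum(f1_scores) / num_labels

# Three labels: the first two predicted perfectly, the third not at all.
print(macro_f1([[1, 0, 1], [0, 1, 0]], [[1, 0, 0], [0, 1, 1]]))  # ≈ 0.667 (per-label F1: 1, 1, 0)
```

Macro averaging is the usual choice for dialect identification because it prevents high-resource dialects from dominating the score.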
## Technical Specifications

### Model Architecture and Objective
- Transformer-based multi-label classifier for Arabic dialect identification.
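The card does not spell out the training objective; the standard choice for a multi-label transformer classifier, sketched below with stand-in tensors in place of the real encoder, is a linear head over the pooled [CLS] representation trained with per-label binary cross-entropy (`torch.nn.BCEWithLogitsLoss`). The label count and hidden size are assumptions:

```python
import torch
import torch.nn as nn

NUM_DIALECTS = 18   # illustrative; the actual model defines the label count
HIDDEN_SIZE = 768   # hidden size of CAMeLBERT/MARBERT base encoders

# Linear classification head applied to the encoder's pooled [CLS] vector.
head = nn.Linear(HIDDEN_SIZE, NUM_DIALECTS)
# One independent binary decision per country -> binary cross-entropy.
criterion = nn.BCEWithLogitsLoss()

# Stand-ins for a batch of 4 pooled encoder outputs and multi-hot targets;
# during training these come from the BERT encoder and the pseudo-labels.
pooled = torch.randn(4, HIDDEN_SIZE)
targets = torch.zeros(4, NUM_DIALECTS)
targets[:, 0] = 1.0

logits = head(pooled)               # shape: (4, NUM_DIALECTS)
loss = criterion(logits, targets)   # scalar multi-label loss
loss.backward()                     # gradients flow into the head
```

`BCEWithLogitsLoss` folds the sigmoid into the loss for numerical stability, which is why the head emits raw logits rather than probabilities.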
### Compute Infrastructure
- Hardware: NVIDIA RTX 6000 (24GB VRAM)
- Software: Python, PyTorch, Hugging Face Transformers