---
library_name: transformers
tags: []
---
# Model Card for MDABERT

## Model Details

### Model Description
This model card describes MDABERT, a Multi-Label country-level Dialect Identification (ML-DID) model built on CAMeLBERT. It classifies Arabic text into multiple country-level dialect categories at once, and is trained with pseudo-labeling and a curriculum-based training schedule.
- Developed by: Ali Mekky, Lara Hassan, Mohamed ElZeftawy
- Institution: MBZUAI
- Model type: Transformer-based multi-label classifier
- Language(s) (NLP): Arabic (Dialectal Variants)
- License: TBD
- Fine-tuned from models: CAMeLBERT, MARBERT
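Inference is not shown above; as a minimal sketch, multi-label predictions are obtained by applying a sigmoid to the classifier's logits and thresholding each country label independently. The label set and threshold below are illustrative placeholders, not the model's actual configuration:

```python
import torch

# Hypothetical label set for illustration; the real model's config
# (id2label) defines the actual country labels.
LABELS = ["Egypt", "Morocco", "Saudi_Arabia", "Syria"]

def logits_to_dialects(logits: torch.Tensor, threshold: float = 0.5) -> list[str]:
    """Map raw classifier logits to the set of predicted dialects.

    Unlike a softmax (single-label) head, each country is scored
    independently with a sigmoid, so one sentence can be assigned
    to several countries at once.
    """
    probs = torch.sigmoid(logits)
    return [label for label, p in zip(LABELS, probs.tolist()) if p >= threshold]

# In practice the logits come from the fine-tuned CAMeLBERT/MARBERT
# classifier; dummy values stand in here.
dummy_logits = torch.tensor([2.0, -1.5, 0.3, -0.2])
print(logits_to_dialects(dummy_logits))  # → ['Egypt', 'Saudi_Arabia']
```

Because each label is an independent binary decision, a sentence written in a form shared by several countries can legitimately receive more than one label.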
## Bias, Risks, and Limitations

### Biases
- Geographic bias in dataset annotation.
- Overlapping dialects may result in misclassification.
- Errors may arise from synthetic labels.
### Recommendations
Users should be aware of biases in dataset annotation and carefully validate outputs for high-stakes applications.
## Training Details

### Training Data
- Datasets: NADI 2020, 2021, 2023, and 2024 development set.
- Synthetic multi-label dataset created through pseudo-labeling.
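The card does not detail the pseudo-labeling procedure; one common recipe, sketched below under assumed settings, is to run a single-label teacher model over each sentence and keep every country whose probability clears a confidence threshold as a positive label:

```python
import torch

def pseudo_label(teacher_probs: torch.Tensor, keep_threshold: float = 0.4) -> torch.Tensor:
    """Convert a single-label teacher's per-country probabilities into a
    multi-hot training target.

    Every country whose probability clears `keep_threshold` becomes a
    positive label, and the argmax country is always kept so each
    sentence has at least one label. The threshold value here is
    illustrative, not the one used to build the ML-DID dataset.
    """
    multi_hot = (teacher_probs >= keep_threshold).float()
    multi_hot[teacher_probs.argmax()] = 1.0  # guarantee at least one positive
    return multi_hot

# A sentence the teacher finds ambiguous between two dialects
# becomes a two-label training example.
print(pseudo_label(torch.tensor([0.55, 0.42, 0.02, 0.01])))  # → tensor([1., 1., 0., 0.])
```

Targets produced this way are noisy by construction, which is one source of the synthetic-label errors noted under "Biases" above.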
## Evaluation

### Testing Data & Metrics
- Testing Data: NADI 2024 Test set
- Metrics: Macro F1-score, precision, recall
- Leaderboard: [NADI 2024 leaderboard](https://huggingface.co/spaces/AMR-KELEG/NADI2024-leaderboard)
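To make the headline metric concrete, the macro F1-score averages the per-label F1 over the country columns of the multi-hot prediction matrix, weighting every dialect equally regardless of how often it occurs:

```python
def macro_f1(y_true: list[list[int]], y_pred: list[list[int]]) -> float:
    """Macro-averaged F1 over the label columns of multi-hot matrices.

    Matches sklearn.metrics.f1_score(..., average="macro") on binary
    indicator inputs; written out here to make the metric explicit.
    """
    num_labels = len(y_true[0])
    f1_scores = []
    for j in range(num_labels):
        tp = sum(t[j] == 1 and p[j] == 1 for t, p in zip(y_true, y_pred))
        fp = sum(t[j] == 0 and p[j] == 1 for t, p in zip(y_true, y_pred))
        fn = sum(t[j] == 1 and p[j] == 0 for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        f1_scores.append(f1)
    return sum(f1_scores) / num_labels

# Three labels: the first two predicted perfectly, the third not at all.
print(macro_f1([[1, 0, 1], [0, 1, 0]], [[1, 0, 0], [0, 1, 1]]))  # ≈ 0.667 (per-label F1: 1, 1, 0)
```

Macro averaging is the usual choice for dialect identification because it prevents high-resource dialects from dominating the score.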
## Technical Specifications

### Model Architecture and Objective
- Transformer-based multi-label classifier for Arabic dialect identification.
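The card does not spell out the training objective; the standard choice for a multi-label transformer classifier, sketched below with stand-in tensors in place of the real encoder, is a linear head over the pooled [CLS] representation trained with per-label binary cross-entropy (`torch.nn.BCEWithLogitsLoss`). The label count and hidden size are assumptions:

```python
import torch
import torch.nn as nn

NUM_DIALECTS = 18   # illustrative; the actual model defines the label count
HIDDEN_SIZE = 768   # hidden size of CAMeLBERT/MARBERT base encoders

# Linear classification head applied to the encoder's pooled [CLS] vector.
head = nn.Linear(HIDDEN_SIZE, NUM_DIALECTS)
# One independent binary decision per country -> binary cross-entropy.
criterion = nn.BCEWithLogitsLoss()

# Stand-ins for a batch of 4 pooled encoder outputs and multi-hot targets;
# during training these come from the BERT encoder and the pseudo-labels.
pooled = torch.randn(4, HIDDEN_SIZE)
targets = torch.zeros(4, NUM_DIALECTS)
targets[:, 0] = 1.0

logits = head(pooled)               # shape: (4, NUM_DIALECTS)
loss = criterion(logits, targets)   # scalar multi-label loss
loss.backward()                     # gradients flow into the head
```

`BCEWithLogitsLoss` folds the sigmoid into the loss for numerical stability, which is why the head emits raw logits rather than probabilities.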
### Compute Infrastructure
- Hardware: NVIDIA RTX 6000 (24GB VRAM)
- Software: Python, PyTorch, Hugging Face Transformers