library_name: transformers
tags: []

Model Card for MDABERT

Model Details

Model Description

This is the model card for the Multi-Label country-level Dialect Identification (ML-DID) model built on CAMeLBERT. It assigns one or more country-level dialect labels to Arabic text and was trained with pseudo-labeling and curriculum-based training.

  • Developed by: Ali Mekky, Lara Hassan, Mohamed ElZeftawy
  • Institution: MBZUAI
  • Model type: Transformer-based multi-label classifier
  • Language(s) (NLP): Arabic (Dialectal Variants)
  • License: TBD
  • Finetuned from model: CAMeLBERT, MARBERT
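
Because the classifier is multi-label, each country is scored independently rather than through a single softmax over all countries. The following is a minimal sketch of how such outputs are typically decoded, assuming one logit per country, a sigmoid activation, and a 0.5 threshold. The label names, logit values, and threshold are illustrative assumptions and are not taken from the released model.

```python
import math

# Hypothetical label set and threshold; the released model's labels may differ.
LABELS = ["Egypt", "Jordan", "Morocco", "Saudi_Arabia"]
THRESHOLD = 0.5


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def decode_multilabel(logits, labels=LABELS, threshold=THRESHOLD):
    """Return every label whose independent sigmoid probability meets the threshold."""
    if len(logits) != len(labels):
        raise ValueError("expected one logit per label")
    probs = [sigmoid(z) for z in logits]
    return [label for label, p in zip(labels, probs) if p >= threshold]


# Made-up logits for a sentence that is plausible in more than one dialect.
predicted = decode_multilabel([2.1, 0.4, -3.0, -1.2])
print(predicted)
```

A sentence can therefore receive zero, one, or several country labels, which is the behaviour a single-label softmax classifier cannot express.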

Bias, Risks, and Limitations

Biases

  • Geographic bias in dataset annotation.
  • Overlapping dialects may result in misclassification.
  • The synthetic (pseudo-labeled) training labels may contain errors that carry over into predictions.

Recommendations

Users should be aware of biases in dataset annotation and carefully validate outputs for high-stakes applications.

Training Details

Training Data

  • Datasets: NADI 2020, 2021, 2023, and 2024 development set.
  • Synthetic multi-label dataset created through pseudo-labeling.
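
This card does not describe how the pseudo-labels were produced. One common recipe, sketched here purely as an assumption, is to score each single-label sentence against every country and keep each country whose score clears a confidence threshold, while always retaining the original gold label. The function `pseudo_label`, the scores, and the 0.8 threshold are hypothetical.

```python
def pseudo_label(gold_label, country_scores, threshold=0.8):
    """Expand one gold label into a multi-label set.

    country_scores maps country -> probability that the sentence is valid
    in that country's dialect (produced by some hypothetical scorer).
    """
    labels = {gold_label}  # the human annotation is always kept
    for country, score in country_scores.items():
        if score >= threshold:
            labels.add(country)
    return sorted(labels)


# Made-up scores for a sentence annotated as Egyptian.
scores = {"Egypt": 0.97, "Sudan": 0.85, "Morocco": 0.05}
expanded = pseudo_label("Egypt", scores)
print(expanded)
```

A higher threshold yields fewer but more reliable extra labels; a lower one yields more labels at the cost of more noise in the synthetic dataset.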

Evaluation

Testing Data & Metrics

Technical Specifications

Model Architecture and Objective

  • Transformer-based multi-label classifier for Arabic dialect identification.
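
The training loss is not stated in this card. Multi-label classifiers are commonly trained with binary cross-entropy applied to each label independently. The sketch below shows that objective for a single example, as an assumption and not as a description of the authors' setup. The logits and targets are made up.

```python
import math


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over labels, computed from raw logits.

    Uses the numerically stable form max(z, 0) - z*y + log(1 + exp(-|z|)).
    """
    if len(logits) != len(targets):
        raise ValueError("logits and targets must have the same length")
    total = 0.0
    for z, y in zip(logits, targets):
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)


# Made-up example: three countries, the sentence is valid in the first two.
loss = bce_with_logits([2.0, 0.5, -1.5], [1.0, 1.0, 0.0])
print(round(loss, 4))
```

Each label contributes its own term, so the loss does not force the predicted probabilities across countries to sum to one.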

Compute Infrastructure

  • Hardware: NVIDIA RTX 6000 (24 GB VRAM)
  • Software: Python, PyTorch, Hugging Face Transformers