---
library_name: transformers
tags: [Arabic, Dialect Identification, Multi-Label, BERT, MLADI, NLP]
---

# Model Card for B2BERT

### Model Description

B2BERT is a lightweight transformer-based model for **Multi-Label Arabic Dialect Identification (MLADI)**. Unlike traditional single-label approaches, MLADI captures the natural overlap between dialects, allowing a sentence to be associated with multiple dialects at once.

- **Model type:** Transformer-based multi-label classifier
- **Finetuned from model:** CAMeLBERT (~110M parameters)
- **Language(s) (NLP):** Arabic (dialectal variants, 18 dialects)
- **License:** TBD

**Key Innovations:**
- **Knowledge Distillation:** Multi-label annotations generated by GPT-4o, capturing real-world ambiguity.
- **Curriculum Learning:** Samples introduced progressively by label cardinality, mitigating imbalance and improving generalization.
- **Preprocessing:** Normalization (Alef variants), diacritics/emoji/punctuation removal, anonymization of mentions and URLs, mixed-language handling, and stopword removal (sketched below).
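
A rough illustration of these preprocessing steps (the regexes, placeholder tokens, and step ordering are assumptions for illustration; the exact pipeline used to build B2BERT's training data may differ):

```python
import re

# Compiled once: Arabic diacritics (tashkeel) and Alef variants (آ أ إ -> ا).
DIACRITICS = re.compile(r"[\u0610-\u061A\u064B-\u0652\u0670]")
ALEF_VARIANTS = re.compile(r"[\u0622\u0623\u0625]")

def preprocess(text: str) -> str:
    text = re.sub(r"https?://\S+", "[URL]", text)   # anonymize URLs
    text = re.sub(r"@\w+", "[USER]", text)          # anonymize mentions
    text = DIACRITICS.sub("", text)                 # strip diacritics
    text = ALEF_VARIANTS.sub("\u0627", text)        # normalize Alef variants
    text = re.sub(r"[^\w\s\[\]]", " ", text)        # drop punctuation and emojis
    # (stopword removal and mixed-language handling omitted for brevity)
    return re.sub(r"\s+", " ", text).strip()        # collapse whitespace

print(preprocess("مرحبًا @user شوف https://example.com !"))
```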

---

## Bias, Risks, and Limitations

### Biases
- Geographic bias in dataset annotation (labels are based on user location).
- Overlapping dialects may result in misclassification, especially in dense regions like the Maghreb and the Levant.
- Errors may arise from synthetic labels (pseudo-labeling from GPT-4o).

### Recommendations
Users should validate outputs before deploying the model in high-stakes or production settings, especially where dialect precision is critical.

---

## Training Details

### Training Data
- **Datasets:** NADI 2020, 2021, 2023, and the NADI 2024 development set.
- **Synthetic dataset:** Converted to multi-label using GPT-4o pseudo-labeling.
- **Dialects Covered (18):** Algeria, Bahrain, Egypt, Iraq, Jordan, Kuwait, Lebanon, Libya, Morocco, Oman, Palestine, Qatar, Saudi Arabia, Sudan, Syria, Tunisia, UAE, Yemen.

### Training Procedure
- **Knowledge Distillation:** Pseudo-labels generated by GPT-4o.
- **Curriculum Learning:** Training samples are organized by label cardinality (from single- to multi-label) to improve robustness without undersampling: the model is progressively exposed to examples with increasing label combinations, balancing cardinalities within each epoch while reinforcing simpler cases (see the sketch after the hyperparameters below).

**Hyperparameters:**
- Optimizer: AdamW
- Learning Rate: 1e-5
- Dropout: 0.3
- Batch Size: 11
- Epochs: 10
- Hardware: NVIDIA RTX 6000 (24GB VRAM)
- Training Time: ~27 minutes per run
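
A minimal sketch of the cardinality-based curriculum described above; bucketing by label count follows the description, while the one-bucket-per-epoch unlock schedule is an assumption, not the exact B2BERT recipe:

```python
from collections import defaultdict
from random import shuffle

def curriculum_epochs(dataset, num_epochs=10):
    """Yield per-epoch training pools, unlocking higher label cardinalities.

    dataset: iterable of (text, label_vector) pairs with 0/1 label vectors.
    """
    buckets = defaultdict(list)
    for text, labels in dataset:
        buckets[sum(labels)].append((text, labels))  # cardinality -> examples
    levels = sorted(buckets)                         # e.g. [1, 2, 3]

    for epoch in range(num_epochs):
        # Keep earlier (simpler) buckets in the mix while unlocking harder ones.
        unlocked = levels[: min(epoch + 1, len(levels))]
        pool = [ex for level in unlocked for ex in buckets[level]]
        shuffle(pool)  # downstream: batch (size 11) and train with AdamW, lr 1e-5
        yield epoch, pool

toy = [("a", [1, 0, 0]), ("b", [1, 1, 0]), ("c", [1, 1, 1])]
for epoch, pool in curriculum_epochs(toy, num_epochs=3):
    print(epoch, len(pool))  # pools grow: 1, 2, then 3 examples
```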

---

## Evaluation

### Testing Data & Metrics
- **Test Data:** Official MLADI test set (not public; NADI 2024 shared task).
- **Metrics:** Macro F1-score, precision, and recall (toy example below).
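
Scores are macro-averaged over the per-dialect binary decisions; a toy computation (the label matrices here are invented for illustration):

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

# Rows are sentences, columns are dialects (3 shown instead of 18 for brevity);
# entries are 0/1 multi-label indicators.
y_true = np.array([[1, 0, 1], [0, 1, 0]])
y_pred = np.array([[1, 0, 0], [0, 1, 1]])

print("Macro F1: ", f1_score(y_true, y_pred, average="macro", zero_division=0))
print("Precision:", precision_score(y_true, y_pred, average="macro", zero_division=0))
print("Recall:   ", recall_score(y_true, y_pred, average="macro", zero_division=0))
```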

**Leaderboard Results (MLADI, NADI 2024):**

| Model                | Macro F1   | Precision | Recall |
|----------------------|------------|-----------|--------|
| NADI 2024 Baseline   | 0.4698     | 0.6480    | 0.3986 |
| ELYADATA (best team) | 0.5240     | 0.5015    | 0.5687 |
| Aya Expanse 32B      | 0.5447     | 0.4945    | 0.6451 |
| ALLaM 7B Instruct    | 0.2506     | 0.5791    | 0.1639 |
| **B2BERT (ours)**    | **0.5963** | 0.5818    | 0.6976 |

- **Strengths:** Strong performance on Gulf, MSA, and Egyptian dialects (F1 > 0.70).
- **Weaknesses:** Lower performance on Maghrebi, Levantine, and Nile Valley dialects due to overlap.

**Link to MLADI Leaderboard:** [Hugging Face Space](https://huggingface.co/spaces/AMR-KELEG/MLADI)

---

## Technical Specifications

### Model Architecture and Objective
- Transformer-based multi-label classifier (BERT backbone).
- Outputs sigmoid activations per dialect, allowing multi-label predictions (sketched below).
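
A minimal sketch of this objective with the Transformers API; the CAMeLBERT checkpoint name is an assumption, and the released B2BERT weights may differ:

```python
import torch
from transformers import AutoModelForSequenceClassification

# problem_type="multi_label_classification" trains the 18-way head with
# BCEWithLogitsLoss, i.e. one independent binary decision per dialect.
model = AutoModelForSequenceClassification.from_pretrained(
    "CAMeL-Lab/bert-base-arabic-camelbert-mix",  # assumed CAMeLBERT backbone
    num_labels=18,
    problem_type="multi_label_classification",
)

dummy_ids = torch.randint(0, model.config.vocab_size, (1, 16))  # dummy tokens
probs = torch.sigmoid(model(input_ids=dummy_ids).logits)
print(probs.shape)  # torch.Size([1, 18]); threshold each entry independently
```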

### Compute Infrastructure
- **Hardware:** NVIDIA RTX 6000 (24GB VRAM)
- **Software:** Python, PyTorch, Hugging Face Transformers

## Using the Model
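
A minimal inference sketch; the checkpoint path, the 0.5 threshold, and the reliance on the config's `id2label` mapping are illustrative assumptions:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_ID = "path/to/b2bert"  # placeholder for the published checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)
model.eval()

text = "..."  # an Arabic sentence
inputs = tokenizer(text, return_tensors="pt", truncation=True)
with torch.no_grad():
    probs = torch.sigmoid(model(**inputs).logits)[0]

# Keep every dialect whose probability clears the threshold.
predicted_dialects = [
    model.config.id2label[i] for i, p in enumerate(probs) if p >= 0.5
]
print(f"Predicted Dialects: {predicted_dialects}")
```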

## Credits
- Ali Mekky: ali.mekky@mbzuai.ac.ae
- Mohamed ElZeftawy: mohamed.elzeftawy@mbzuai.ac.ae
- Lara Hassan: lara.hassan@mbzuai.ac.ae