AHAAM committed
Commit 18443bf · verified · 1 Parent(s): 8fb261b

Update README.md

Files changed (1)
  1. README.md +63 -24
README.md CHANGED
@@ -1,6 +1,6 @@
 ---
 library_name: transformers
-tags: []
 ---
 
 # Model Card for B2BERT
@@ -9,57 +9,91 @@ tags: []
 
 ### Model Description
 
-This is the model card for the Multi-Label country-level Dialect Identification (ML-DID) model using CAMeLBERT. It classifies Arabic text into multiple dialectal categories using pseudo-labeling and curriculum-based training.
 
-<!-- - **Developed by:** Ali Mekky, Lara Hassan, Mohamed ElZeftawy
-- **Institution:** MBZUAI -->
-- **Model type:** Transformer-based multi-label classifier
-- **Language(s) (NLP):** Arabic (Dialectal Variants)
-- **License:** TBD
-- **Finetuned from model:** CAMeLBERT
 
 ## Bias, Risks, and Limitations
 
 ### Biases
-
-- Geographic bias in dataset annotation.
-- Overlapping dialects may result in misclassification.
-- Errors may arise from synthetic labels.
 
 ### Recommendations
 
-Users should be aware of biases in dataset annotation and carefully validate outputs for high-stakes applications.
-
 ## Training Details
 
 ### Training Data
 
-- **Datasets:** NADI 2020, 2021, 2023, and 2024 development set.
-- **Synthetic multi-label dataset** created through pseudo-labeling.
 
 ## Evaluation
 
 ### Testing Data & Metrics
 
-- **Testing Data:** NADI 2024 Test set
-- **Metrics:** Macro F1-score, precision, recall
-- **Link to NADI2024 Leaderboard** https://huggingface.co/spaces/AMR-KELEG/MLADI
 
 ## Technical Specifications
 
 ### Model Architecture and Objective
-
-- Transformer-based multi-label classifier for Arabic dialect identification.
 
 ### Compute Infrastructure
-
-- **Hardware:** NVIDIA RTX 6000 (24GB VRAM)
-- **Software:** Python, PyTorch, Hugging Face Transformers
 
 ## Using the Model
 
@@ -123,3 +157,8 @@ print(f"Predicted Dialects: {predicted_dialects}")
 
 
 ```
README.md (updated):

---
library_name: transformers
tags: [Arabic, Dialect Identification, Multi-Label, BERT, MLADI, NLP]
---

# Model Card for B2BERT
 
### Model Description

B2BERT is a lightweight transformer-based model for **Multi-Label Arabic Dialect Identification (MLADI)**. Unlike traditional single-label approaches, MLADI captures the natural overlap between dialects, allowing a sentence to be associated with multiple dialects at once.
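
To make the multi-label setup concrete, here is a small illustrative snippet: a binary target vector over the 18 covered dialects (the country list comes from the Training Data section below; the gold labels are invented for illustration).

```python
# The 18 country-level dialects covered by the model (see Training Data below).
DIALECTS = ["Algeria", "Bahrain", "Egypt", "Iraq", "Jordan", "Kuwait",
            "Lebanon", "Libya", "Morocco", "Oman", "Palestine", "Qatar",
            "Saudi Arabia", "Sudan", "Syria", "Tunisia", "UAE", "Yemen"]

# Hypothetical example: one sentence judged plausible in two dialects at once.
gold = {"Egypt", "Sudan"}
target = [1.0 if d in gold else 0.0 for d in DIALECTS]  # 18-dim binary target
```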

- **Model type:** Transformer-based multi-label classifier
- **Finetuned from model:** CAMeLBERT (~110M parameters)
- **Language(s) (NLP):** Arabic (Dialectal Variants, 18 dialects)
- **License:** TBD
 
 
**Key Innovations:**
- **Knowledge Distillation:** Multi-label annotations generated by GPT-4o, capturing real-world ambiguity.
- **Curriculum Learning:** Samples introduced progressively by label cardinality, mitigating imbalance and improving generalization.
- **Preprocessing:** Normalization (Alef variants), diacritics/emoji/punctuation removal, anonymization of mentions and URLs, mixed-language handling, and stopword removal (sketched below).
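
A minimal sketch of this preprocessing pipeline, assuming simple regex rules and a toy stopword list (the exact rules and stopword list used for B2BERT are not specified here):

```python
import re

# Toy stopword list for illustration; the actual list is an assumption.
ARABIC_STOPWORDS = {"في", "من", "على", "عن", "إلى"}

def preprocess(text: str) -> str:
    text = re.sub(r"[\u0622\u0623\u0625]", "\u0627", text)  # normalize Alef variants
    text = re.sub(r"[\u064B-\u0652]", "", text)             # remove diacritics (tashkeel)
    text = re.sub(r"@\w+", "[USER]", text)                  # anonymize mentions
    text = re.sub(r"https?://\S+|www\.\S+", "[URL]", text)  # anonymize URLs
    # Keep Arabic letters, whitespace, and the placeholder tokens; drop
    # emojis, punctuation, and stray Latin characters (mixed-language handling).
    text = re.sub(r"[^\u0600-\u06FF\s\[\]A-Z]", " ", text)
    return " ".join(t for t in text.split() if t not in ARABIC_STOPWORDS)

print(preprocess("@user شاهد الفيديو على https://example.com الآن!"))
```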

---

## Bias, Risks, and Limitations

### Biases
- Geographic bias in dataset annotation (labels based on user location).
- Overlapping dialects may result in misclassification, especially in dense dialect regions such as the Maghreb and the Levant.
- Errors may arise from synthetic labels (pseudo-labels generated by GPT-4o).
 
### Recommendations
Users should validate outputs before deploying them in high-stakes or production settings, especially where dialect precision is critical.

---
 
## Training Details

### Training Data
- **Datasets:** NADI 2020, 2021, 2023, and the NADI 2024 development set.
- **Synthetic dataset:** Converted to multi-label using GPT-4o pseudo-labeling.
- **Dialects Covered (18):** Algeria, Bahrain, Egypt, Iraq, Jordan, Kuwait, Lebanon, Libya, Morocco, Oman, Palestine, Qatar, Saudi Arabia, Sudan, Syria, Tunisia, UAE, Yemen.

### Training Procedure
- **Knowledge Distillation:** Pseudo-labels generated by GPT-4o (sketched below).
- **Curriculum Learning:** Training samples are organized by label cardinality, from single-label to multi-label, to improve robustness without undersampling: the model is progressively exposed to examples with larger label combinations, balancing the cardinalities within each epoch while continuing to reinforce simpler cases (sketched below).
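
A sketch of how the GPT-4o pseudo-labeling step could look, assuming the OpenAI Python client and an invented prompt (the actual prompt and post-processing are not documented here):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SYSTEM_PROMPT = (
    "You are an expert in Arabic dialects. Given a sentence, list every "
    "country-level dialect (Algeria, Bahrain, Egypt, ..., Yemen) in which "
    "the sentence is plausible, as a comma-separated list of country names."
)

def pseudo_label(sentence: str) -> list[str]:
    # One teacher call per sentence; the student (B2BERT) trains on the result.
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        messages=[{"role": "system", "content": SYSTEM_PROMPT},
                  {"role": "user", "content": sentence}],
    )
    return [name.strip() for name in response.choices[0].message.content.split(",")]
```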
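
And a minimal sketch of the cardinality-based curriculum, assuming the training set is held as `(text, labels)` pairs; the exact schedule is an assumption:

```python
from collections import defaultdict

def curriculum_epoch(examples, epoch: int):
    """Yield the examples used in a given epoch: label cardinalities up to
    `epoch + 1` are included, so single-label cases are seen first and
    larger label combinations are introduced progressively, while the
    simpler cases keep being revisited in every later epoch."""
    by_cardinality = defaultdict(list)
    for text, labels in examples:
        by_cardinality[len(labels)].append((text, labels))
    max_cardinality = min(epoch + 1, max(by_cardinality))
    for cardinality in range(1, max_cardinality + 1):
        yield from by_cardinality.get(cardinality, [])
```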
 
**Hyperparameters:**
- Optimizer: AdamW
- Learning rate: 1e-5
- Dropout: 0.3
- Batch size: 11
- Epochs: 10
- Hardware: NVIDIA RTX 6000 (24GB VRAM)
- Training time: ~27 minutes per run
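
A minimal sketch wiring these hyperparameters together; the CAMeLBERT checkpoint id is an assumption, and `problem_type="multi_label_classification"` gives the `BCEWithLogitsLoss` objective that multi-label training needs:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_ID = "CAMeL-Lab/bert-base-arabic-camelbert-mix"  # assumed checkpoint id
NUM_DIALECTS = 18

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_ID,
    num_labels=NUM_DIALECTS,
    problem_type="multi_label_classification",  # BCEWithLogitsLoss under the hood
    hidden_dropout_prob=0.3,
)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
# Train for 10 epochs with batch size 11, per the hyperparameters above.
```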

---

## Evaluation

### Testing Data & Metrics
- **Test Data:** Official MLADI test set (not public; NADI 2024 shared task).
- **Metrics:** Macro F1-score, precision, recall.

**Leaderboard Results (MLADI, NADI 2024):**

| Model                | Macro F1   | Precision | Recall |
|----------------------|------------|-----------|--------|
| NADI 2024 Baseline   | 0.4698     | 0.6480    | 0.3986 |
| ELYADATA (best team) | 0.5240     | 0.5015    | 0.5687 |
| Aya Expanse 32B      | 0.5447     | 0.4945    | 0.6451 |
| ALLaM 7B Instruct    | 0.2506     | 0.5791    | 0.1639 |
| **B2BERT (ours)**    | **0.5963** | 0.5818    | 0.6976 |

- **Strengths:** Strong performance on Gulf, MSA, and Egyptian dialects (F1 > 0.70).
- **Weaknesses:** Lower performance on Maghrebi, Levantine, and Nile Valley dialects due to overlap.

**Link to MLADI Leaderboard:** [Hugging Face Space](https://huggingface.co/spaces/AMR-KELEG/MLADI)
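
For reference, the reported metrics correspond to macro-averaged scores over the per-dialect binary predictions; a sketch with scikit-learn (the official scoring script may differ):

```python
from sklearn.metrics import f1_score, precision_score, recall_score

def evaluate(y_true, y_pred):
    """y_true, y_pred: (n_samples, 18) binary indicator arrays."""
    return {
        "macro_f1": f1_score(y_true, y_pred, average="macro", zero_division=0),
        "precision": precision_score(y_true, y_pred, average="macro", zero_division=0),
        "recall": recall_score(y_true, y_pred, average="macro", zero_division=0),
    }
```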

---

## Technical Specifications

### Model Architecture and Objective
- Transformer-based multi-label classifier (BERT backbone).
- Outputs sigmoid activations per dialect, allowing multi-label predictions (decision rule sketched below).
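
A sketch of that decision rule, assuming a 0.5 threshold on the sigmoid outputs (the threshold actually used is not stated):

```python
import torch

def predict_dialects(logits: torch.Tensor, dialects: list[str], threshold: float = 0.5):
    """Map raw classifier logits (shape: [num_dialects]) to dialect names."""
    probs = torch.sigmoid(logits)  # independent probability per dialect
    return [d for d, p in zip(dialects, probs.tolist()) if p >= threshold]
```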

### Compute Infrastructure
- **Hardware:** NVIDIA RTX 6000 (24GB VRAM)
- **Software:** Python, PyTorch, Hugging Face Transformers

## Using the Model
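
The full usage example in the README is truncated in this capture; below is a minimal end-to-end sketch in the same spirit. The repo id, the 0.5 threshold, and the alphabetical label order are all assumptions:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

REPO_ID = "AHAAM/B2BERT"  # hypothetical id; substitute the model's actual repo id
DIALECTS = ["Algeria", "Bahrain", "Egypt", "Iraq", "Jordan", "Kuwait",
            "Lebanon", "Libya", "Morocco", "Oman", "Palestine", "Qatar",
            "Saudi Arabia", "Sudan", "Syria", "Tunisia", "UAE", "Yemen"]

tokenizer = AutoTokenizer.from_pretrained(REPO_ID)
model = AutoModelForSequenceClassification.from_pretrained(REPO_ID)
model.eval()

inputs = tokenizer("ازيك عامل ايه", return_tensors="pt", truncation=True)  # "How are you?" (Egyptian)
with torch.no_grad():
    logits = model(**inputs).logits.squeeze(0)

probs = torch.sigmoid(logits)
predicted_dialects = [d for d, p in zip(DIALECTS, probs.tolist()) if p >= 0.5]
print(f"Predicted Dialects: {predicted_dialects}")
```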

## Credits
- Ali Mekky: ali.mekky@mbzuai.ac.ae
- Mohamed ElZeftawy: mohamed.elzeftawy@mbzuai.ac.ae
- Lara Hassan: lara.hassan@mbzuai.ac.ae