# Alberta-MSA

## Model Description
Alberta-MSA is a parameter-efficient Arabic BERT variant based on the ALBERT architecture [Lan et al., 2019]. It is pre-trained on Modern Standard Arabic (MSA) for Masked Language Modeling (MLM).
- Architecture: ALBERT-base (shared encoder weights across 12 layers).
- Dimensions: Embedding size 128 (projected to hidden size 768), 12 attention heads, intermediate size 3072.
- Tokenizer: 100,000 token vocabulary from MARBERT's WordPiece tokenizer [Abdul-Mageed and Elmadany, 2021].
- Parameters: ~20.3M.
- Performance: Retains 90-100% of BERT-MSA's F1 performance on downstream tasks [Devlin et al., 2019; Rogers et al., 2020].
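The ~20.3M figure follows almost entirely from the dimensions listed above: ALBERT's factorized embedding (100k x 128 plus a 128 -> 768 projection) and a single transformer block shared across all 12 layers. A rough back-of-the-envelope check (weights only; biases, layer norms, position/type embeddings, and the pooler are omitted, so the exact total differs slightly):

```python
# Approximate parameter count from the dimensions stated in the model card.
# ALBERT shares one encoder block across all 12 layers, so it is counted once.
VOCAB, EMB, HIDDEN, FFN = 100_000, 128, 768, 3072

embedding = VOCAB * EMB          # factorized token embedding (100k x 128)
projection = EMB * HIDDEN        # 128 -> 768 projection matrix
attention = 4 * HIDDEN * HIDDEN  # Q, K, V, and output weights
ffn = 2 * HIDDEN * FFN           # up- and down-projection
shared_block = attention + ffn   # reused by every layer

total = embedding + projection + shared_block
print(f"~{total / 1e6:.1f}M parameters")  # prints "~20.0M parameters"
```

The small gap to 20.3M is the omitted biases, layer norms, and auxiliary embeddings. Compare a standard BERT-base embedding table at the same vocabulary: 100k x 768 alone would be ~76.8M parameters.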
## Intended Use
General-purpose Arabic NLP backbone for fine-tuning on tasks such as:
- Text Classification
- Named Entity Recognition (NER)
- Question Answering (QA)
- Semantic Similarity
Suitable for resource-constrained environments due to reduced parameter count [Howard and Ruder, 2018; Gururangan et al., 2020].
## Limitations
- Dialects: Optimized for MSA; may underperform on dialects without further adaptation [Kirkpatrick et al., 2017].
- Inference Latency: Slower on long sequences (>256 tokens) relative to BERT due to the embedding projection matrix operations [Lan et al., 2019].
- Objectives: No Sentence-Order Prediction (SOP) objective used during pre-training.
- Benchmarks: Tested on limited benchmarks; broader evaluation is required [Seelawi et al., 2021].
## Training Data
Approximately 2.2B tokens from:
- Arabic Wikipedia (Nov 2023 dump, 1.22M articles).
- Hindawi Books (~3,000 MSA books).
- El-Khair Corpus (5M news articles, 1.5B words) [El-Khair, 2016].
Preprocessing: Removal of English, diacritics, and extra spaces; split into 100-word chunks (>8 words kept) [Raffel et al., 2020].
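The preprocessing step can be sketched as follows. The exact regexes and edge-case handling used by the authors are not published, so the character ranges and thresholds below are assumptions based on the description (strip English letters and diacritics, collapse extra spaces, 100-word chunks, keep chunks with more than 8 words):

```python
import re

# Assumed patterns: Arabic diacritics (tanwin, harakat, shadda, sukun,
# dagger alif) and runs of English letters.
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")
ENGLISH = re.compile(r"[A-Za-z]+")

def preprocess(text, chunk_size=100, min_words=8):
    """Clean raw MSA text and split it into word chunks."""
    text = ENGLISH.sub(" ", text)    # drop English tokens
    text = DIACRITICS.sub("", text)  # strip diacritics
    words = text.split()             # str.split also collapses extra spaces
    chunks = [words[i:i + chunk_size] for i in range(0, len(words), chunk_size)]
    return [" ".join(c) for c in chunks if len(c) > min_words]
```

For example, a 120-word document yields two chunks (100 + 20 words), while a 5-word fragment is discarded entirely.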
## Training Procedure
- Objective: MLM (15% of tokens masked: 80% [MASK], 10% random, 10% unchanged) using cross-entropy loss [Devlin et al., 2019].
- Hardware: TPUv4-8 (4 cores, DDP) [Sabne, 2020].
- Frameworks: PyTorch, PyTorch XLA, Lightning [Paszke et al., 2019].
- Optimizer: AdamW ($\mathrm{lr}=1\times10^{-4}$, $\epsilon=1\times10^{-6}$, weight decay $=0.01$) [Liu et al., 2019].
- Scheduler: Linear warmup (1% steps) + decay.
- Configuration: Batch size 256; Sequence length 128; Steps 502,525 (25 epochs on 99% train split) [Hoffmann et al., 2022; Kaplan et al., 2020].
- Regularization: Dropout 0.1 [Srivastava et al., 2014].
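The 80/10/10 masking scheme from the objective above can be sketched in plain Python. The token ids here are placeholders (the real [MASK] id comes from the tokenizer); positions not selected for masking get the label -100, which PyTorch's cross-entropy loss ignores:

```python
import random

MASK_ID = 4           # assumed [MASK] id; the real id comes from the tokenizer
VOCAB_SIZE = 100_000  # matches the 100k WordPiece vocabulary
IGNORE_INDEX = -100   # label value ignored by PyTorch cross-entropy

def mlm_mask(token_ids, p=0.15, seed=0):
    """BERT-style masking: returns (corrupted inputs, MLM labels)."""
    rng = random.Random(seed)
    inputs, labels = list(token_ids), []
    for i, tok in enumerate(token_ids):
        if rng.random() < p:       # select ~15% of positions
            labels.append(tok)     # the model must recover the original token
            r = rng.random()
            if r < 0.8:
                inputs[i] = MASK_ID                    # 80%: replace with [MASK]
            elif r < 0.9:
                inputs[i] = rng.randrange(VOCAB_SIZE)  # 10%: random token
            # remaining 10%: leave the token unchanged
        else:
            labels.append(IGNORE_INDEX)  # excluded from the loss
    return inputs, labels
```

Keeping 10% of selected tokens unchanged forces the model to build useful representations even for positions that look intact.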
## Evaluation Results
Fine-tuned with PyTorch/Hugging Face using AdamW ($\mathrm{lr}=2.5\times10^{-5}$, batch size 32-384, 10 epochs, early stopping on macro-F1).
| Dataset | Task | F1 (Alberta-MSA) | F1 (BERT-MSA) | Retention (%) |
|---|---|---|---|---|
| AAFAQ [Essam et al., 2025] | Question Classification | 0.8835 | 0.8973 | 98.4 |
| ANERCorp [Benajiba et al., 2007] | NER (9 tags) | 0.6416 | 0.6739 | 95.23 |
| OSCAT [Seelawi et al., 2021] | Offensive Detection | 0.7459 | 0.8000 | 93.24 |
| IDAT [Seelawi et al., 2021] | Irony Detection | 0.7800 | 0.7947 | 98.13 |
| NSURL [Seelawi et al., 2021] | Semantic Similarity | 0.9385 | 0.9596 | 97.76 |
| ASERQA | QA (Span Extraction) | 0.5300 | 0.5289 | 100.19 |
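The Retention column is simply the Alberta-MSA F1 divided by the BERT-MSA F1; small discrepancies against the table come from the reported F1 scores themselves being rounded to four decimals:

```python
# Recompute the Retention (%) column from the (Alberta-MSA, BERT-MSA) F1 pairs.
scores = {
    "AAFAQ":    (0.8835, 0.8973),
    "ANERCorp": (0.6416, 0.6739),
    "OSCAT":    (0.7459, 0.8000),
    "IDAT":     (0.7800, 0.7947),
    "NSURL":    (0.9385, 0.9596),
    "ASERQA":   (0.5300, 0.5289),
}
for name, (albert_f1, bert_f1) in scores.items():
    print(f"{name}: {100 * albert_f1 / bert_f1:.2f}%")
```

Note that ASERQA retention exceeds 100% because the smaller model slightly outperforms BERT-MSA there.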
Inference Speed (ONNX, CPU, Float32): Faster on short sequences (128 tokens: 18% speedup vs. BERT) but slower on long sequences (512 tokens: 47% slowdown) [Lan et al., 2019].
## How to Use
```python
from transformers import AutoModelForMaskedLM, AutoTokenizer
import torch

model_name = "AbdallahhSaleh/Alberta-MSA"

# 1. Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)
model.eval()

# 2. Prepare input
text = "السلام [MASK] الجميع."
inputs = tokenizer(text, return_tensors="pt")
# Shape: (batch=1, seq_len=6) -> e.g., [CLS], السلام, [MASK], الجميع, ., [SEP]

# 3. Forward pass (data flow)
with torch.no_grad():
    outputs = model(**inputs)
# Hidden-state projection:
# Embedding (1, 6, 128) -> Projected (1, 6, 768) -> Layers -> Final hidden (1, 6, 768)

logits = outputs.logits
# Shape: (batch=1, seq_len=6, vocab_size=100000)

# 4. Decode the masked position
mask_token_index = (inputs.input_ids == tokenizer.mask_token_id)[0].nonzero(as_tuple=True)[0]
predicted_token_id = logits[0, mask_token_index].argmax(dim=-1)
print(tokenizer.decode(predicted_token_id))
# Output: "عليكم"
```
Base model: albert/albert-base-v2