Alberta-MSA

Model Description

Alberta-MSA is a parameter-efficient Arabic BERT variant based on the ALBERT architecture [Lan et al., 2019]. It is pre-trained on Modern Standard Arabic (MSA) for Masked Language Modeling (MLM).

  • Architecture: ALBERT-base (shared encoder weights across 12 layers).
  • Dimensions: Embedding size 128 (projected to hidden size 768), 12 attention heads, intermediate size 3072.
  • Tokenizer: 100,000 token vocabulary from MARBERT's WordPiece tokenizer [Abdul-Mageed and Elmadany, 2021].
  • Parameters: ~20.3M.
  • Performance: Retains 90-100% of BERT-MSA's F1 performance on downstream tasks [Devlin et al., 2019; Rogers et al., 2020].
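The dimensions above largely determine the parameter budget. A back-of-the-envelope count (a sketch: `max_positions` and the exact accounting of biases, layer norms, and the pooler are assumptions, so the total is approximate):

```python
# Rough parameter count for Alberta-MSA's ALBERT-base configuration.
vocab_size = 100_000
embed_dim = 128        # factorized embedding size
hidden = 768
intermediate = 3072
max_positions = 512    # assumed, as in BERT/ALBERT-base

token_embeddings = vocab_size * embed_dim          # 12,800,000
position_embeddings = max_positions * embed_dim    #     65,536
projection = embed_dim * hidden                    #     98,304  (128 -> 768)

# One transformer layer, shared across all 12 layers (ALBERT weight sharing):
attention = 4 * (hidden * hidden + hidden)         # Q, K, V, output projections
ffn = 2 * hidden * intermediate + intermediate + hidden
layer = attention + ffn                            # ~7.1M

total = token_embeddings + position_embeddings + projection + layer
print(f"~{total / 1e6:.1f}M parameters")           # close to the reported ~20.3M
```

The dominant terms are the factorized token embeddings (~12.8M) and the single shared encoder layer (~7.1M); without weight sharing, 12 independent layers alone would add ~85M parameters.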

Intended Use

General-purpose Arabic NLP backbone for fine-tuning on tasks such as:

  • Text Classification
  • Named Entity Recognition (NER)
  • Question Answering (QA)
  • Semantic Similarity

Suitable for resource-constrained environments due to reduced parameter count [Howard and Ruder, 2018; Gururangan et al., 2020].

Limitations

  • Dialects: Optimized for MSA; may underperform on dialects without further adaptation [Kirkpatrick et al., 2017].
  • Inference Latency: Slower on long sequences (>256 tokens) relative to BERT due to the embedding projection matrix operations [Lan et al., 2019].
  • Objectives: Pre-trained with MLM only; ALBERT's Sentence-Order Prediction (SOP) objective was not used.
  • Benchmarks: Tested on limited benchmarks; broader evaluation is required [Seelawi et al., 2021].

Training Data

Approximately 2.2B tokens from:

  1. Arabic Wikipedia (Nov 2023 dump, 1.22M articles) [El-Khair, 2016].
  2. Hindawi Books (~3,000 MSA books).
  3. El-Khair Corpus (5M news articles, 1.5B words) [El-Khair, 2016].

Preprocessing: Removal of English text, diacritics, and extra whitespace; the corpus was then split into 100-word chunks, discarding chunks of 8 words or fewer [Raffel et al., 2020].
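The preprocessing pipeline can be sketched as follows (a minimal illustration; the authors' exact regexes are not published, so the patterns below are assumptions):

```python
import re

def preprocess(text, chunk_size=100, min_words=8):
    """Strip English letters, Arabic diacritics (tashkeel), and extra
    whitespace, then split into fixed-size word chunks, keeping only
    chunks with more than `min_words` words."""
    text = re.sub(r"[A-Za-z]+", " ", text)       # remove English text
    text = re.sub(r"[\u064B-\u0652]", "", text)  # remove diacritics
    text = re.sub(r"\s+", " ", text).strip()     # collapse whitespace
    words = text.split()
    chunks = [words[i:i + chunk_size] for i in range(0, len(words), chunk_size)]
    return [" ".join(c) for c in chunks if len(c) > min_words]
```

For example, a 150-word document yields two chunks (100 and 50 words), while a document that shrinks to 8 words or fewer after cleaning is dropped entirely.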

Training Procedure

  • Objective: MLM (15% tokens masked: 80% [MASK], 10% random, 10% unchanged) using cross-entropy loss [Devlin et al., 2019].
  • Hardware: TPUv4-8 (4 cores, DDP) [Sabne, 2020].
  • Frameworks: PyTorch, PyTorch XLA, Lightning [Paszke et al., 2019].
  • Optimizer: AdamW (lr = 1e-4, ε = 1e-6, weight decay = 0.01) [Liu et al., 2019].
  • Scheduler: Linear warmup (1% steps) + decay.
  • Configuration: Batch size 256; Sequence length 128; Steps 502,525 (25 epochs on 99% train split) [Hoffmann et al., 2022; Kaplan et al., 2020].
  • Regularization: Dropout 0.1 [Srivastava et al., 2014].
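The 80/10/10 masking scheme above can be sketched in plain Python (illustrative only; the actual pre-training uses a standard batched implementation, and `MASK_ID` is a placeholder, not the tokenizer's real id):

```python
import random

MASK_ID = 4            # placeholder [MASK] token id (an assumption)
VOCAB_SIZE = 100_000   # matches the model's vocabulary size

def mlm_mask(token_ids, mask_prob=0.15, seed=0):
    """Return (corrupted_ids, labels): labels is -100 (ignored by the
    cross-entropy loss) except at the ~15% of selected positions."""
    rng = random.Random(seed)
    corrupted, labels = list(token_ids), [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() < mask_prob:      # select ~15% of positions
            labels[i] = tok               # target is the original token
            r = rng.random()
            if r < 0.8:                   # 80%: replace with [MASK]
                corrupted[i] = MASK_ID
            elif r < 0.9:                 # 10%: replace with a random token
                corrupted[i] = rng.randrange(VOCAB_SIZE)
            # remaining 10%: keep the token unchanged
    return corrupted, labels
```

The 10% random / 10% unchanged portions prevent the model from only ever seeing [MASK] at prediction positions, reducing the pre-training/fine-tuning mismatch.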

Evaluation Results

Fine-tuned with PyTorch/Hugging Face Transformers using AdamW (lr = 2.5e-5, batch size 32–384, up to 10 epochs, early stopping on macro-F1).
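Early stopping on macro-F1, as used in this protocol, amounts to the following logic (a generic sketch; the patience value is an assumption, not stated by the authors):

```python
def early_stop_epoch(epoch_f1_scores, patience=2):
    """Return the index of the epoch whose checkpoint is kept: training
    stops after `patience` consecutive epochs without macro-F1 improvement."""
    best_epoch, best_f1, waited = 0, float("-inf"), 0
    for epoch, f1 in enumerate(epoch_f1_scores):
        if f1 > best_f1:
            best_epoch, best_f1, waited = epoch, f1, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_epoch

# e.g., macro-F1 per epoch: improvement stops after epoch 2
print(early_stop_epoch([0.70, 0.74, 0.76, 0.75, 0.755]))  # -> 2
```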

| Dataset | Task | F1 (Alberta-MSA) | F1 (BERT-MSA) | Retention (%) |
| --- | --- | --- | --- | --- |
| AAFAQ [Essam et al., 2025] | Question Classification | 0.8835 | 0.8973 | 98.40 |
| ANERCorp [Benajiba et al., 2007] | NER (9 tags) | 0.6416 | 0.6739 | 95.23 |
| OSCAT [Seelawi et al., 2021] | Offensive Detection | 0.7459 | 0.8000 | 93.24 |
| IDAT [Seelawi et al., 2021] | Irony Detection | 0.7800 | 0.7947 | 98.13 |
| NSURL [Seelawi et al., 2021] | Semantic Similarity | 0.9385 | 0.9596 | 97.76 |
| ASERQA | QA (Span Extraction) | 0.5300 | 0.5289 | 100.19 |

Inference Speed (ONNX, CPU, Float32): Faster on short sequences (128 tokens: 18% speedup vs. BERT) but slower on long sequences (512 tokens: 47% slowdown) [Lan et al., 2019].

How to Use

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer
import torch

model_name = "AbdallahhSaleh/Alberta-MSA"

# 1. Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# 2. Prepare input
text = "السلام [MASK] الجميع."
inputs = tokenizer(text, return_tensors="pt")
# Shape: (batch=1, seq_len=6) -> e.g., [CLS], السلام, [MASK], الجميع, ., [SEP]

# 3. Forward pass (data flow)
with torch.no_grad():
    outputs = model(**inputs)

# Hidden-state projection:
# Embedding (1, 6, 128) -> projected (1, 6, 768) -> 12 shared layers -> final hidden (1, 6, 768)

logits = outputs.logits
# Shape: (batch=1, seq_len=6, vocab_size=100000)

# 4. Decode the masked position
mask_token_index = (inputs.input_ids == tokenizer.mask_token_id)[0].nonzero(as_tuple=True)[0]
predicted_token_id = logits[0, mask_token_index].argmax(dim=-1)
print(tokenizer.decode(predicted_token_id))
# Output: "عليكم"
```