# Alberta-MSA

## Model Description
Alberta-MSA is a parameter-efficient Arabic BERT variant based on the ALBERT architecture [Lan et al., 2019]. It is pre-trained on Modern Standard Arabic (MSA) for Masked Language Modeling (MLM).
- Architecture: ALBERT-base (shared encoder weights across 12 layers).
- Dimensions: Embedding size 128 (projected to hidden size 768), 12 attention heads, intermediate size 3072.
- Tokenizer: 100,000 token vocabulary from MARBERT's WordPiece tokenizer [Abdul-Mageed and Elmadany, 2021].
- Parameters: ~20.3M.
- Performance: Retains 90-100% of BERT-MSA's F1 performance on downstream tasks [Devlin et al., 2019; Rogers et al., 2020].
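The ~20.3M figure follows almost entirely from the dimensions listed above: ALBERT's factorized embedding (100k x 128 plus a 128 -> 768 projection) and a single transformer block shared across all 12 layers. A rough back-of-the-envelope check (weights only; biases, layer norms, position/type embeddings, and the pooler are omitted, so the exact total differs slightly):

```python
# Approximate parameter count from the dimensions stated in the model card.
# ALBERT shares one encoder block across all 12 layers, so it is counted once.
VOCAB, EMB, HIDDEN, FFN = 100_000, 128, 768, 3072

embedding = VOCAB * EMB          # factorized token embedding (100k x 128)
projection = EMB * HIDDEN        # 128 -> 768 projection matrix
attention = 4 * HIDDEN * HIDDEN  # Q, K, V, and output weights
ffn = 2 * HIDDEN * FFN           # up- and down-projection
shared_block = attention + ffn   # reused by every layer

total = embedding + projection + shared_block
print(f"~{total / 1e6:.1f}M parameters")  # prints "~20.0M parameters"
```

The small gap to 20.3M is the omitted biases, layer norms, and auxiliary embeddings. Compare a standard BERT-base embedding table at the same vocabulary: 100k x 768 alone would be ~76.8M parameters.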
## Intended Use
General-purpose Arabic NLP backbone for fine-tuning on tasks such as:
- Text Classification
- Named Entity Recognition (NER)
- Question Answering (QA)
- Semantic Similarity
Suitable for resource-constrained environments due to reduced parameter count [Howard and Ruder, 2018; Gururangan et al., 2020].
## Limitations
- Dialects: Optimized for MSA; may underperform on dialects without further adaptation [Kirkpatrick et al., 2017].
- Inference Latency: Slower on long sequences (>256 tokens) relative to BERT due to the embedding projection matrix operations [Lan et al., 2019].
- Objectives: No Sentence-Order Prediction (SOP) objective used during pre-training.
- Benchmarks: Tested on limited benchmarks; broader evaluation is required [Seelawi et al., 2021].
## Training Data
Approximately 2.2B tokens from:
- Arabic Wikipedia (Nov 2023 dump, 1.22M articles).
- Hindawi Books (~3,000 MSA books).
- El-Khair Corpus (5M news articles, 1.5B words) [El-Khair, 2016].
Preprocessing: Removal of English, diacritics, and extra spaces; split into 100-word chunks (>8 words kept) [Raffel et al., 2020].
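The preprocessing step can be sketched as follows. The exact regexes and edge-case handling used by the authors are not published, so the character ranges and thresholds below are assumptions based on the description (strip English letters and diacritics, collapse extra spaces, 100-word chunks, keep chunks with more than 8 words):

```python
import re

# Assumed patterns: Arabic diacritics (tanwin, harakat, shadda, sukun,
# dagger alif) and runs of English letters.
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")
ENGLISH = re.compile(r"[A-Za-z]+")

def preprocess(text, chunk_size=100, min_words=8):
    """Clean raw MSA text and split it into word chunks."""
    text = ENGLISH.sub(" ", text)    # drop English tokens
    text = DIACRITICS.sub("", text)  # strip diacritics
    words = text.split()             # str.split also collapses extra spaces
    chunks = [words[i:i + chunk_size] for i in range(0, len(words), chunk_size)]
    return [" ".join(c) for c in chunks if len(c) > min_words]
```

For example, a 120-word document yields two chunks (100 + 20 words), while a 5-word fragment is discarded entirely.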
## Training Procedure
- Objective: MLM (15% of tokens masked: 80% [MASK], 10% random, 10% unchanged) using cross-entropy loss [Devlin et al., 2019].
- Hardware: TPUv4-8 (4 cores, DDP) [Sabne, 2020].
- Frameworks: PyTorch, PyTorch XLA, Lightning [Paszke et al., 2019].
- Optimizer: AdamW ($\mathrm{lr}=1\times10^{-4}$, $\epsilon=1\times10^{-6}$, weight decay $=0.01$) [Liu et al., 2019].
- Scheduler: Linear warmup (1% steps) + decay.
- Configuration: Batch size 256; Sequence length 128; Steps 502,525 (25 epochs on 99% train split) [Hoffmann et al., 2022; Kaplan et al., 2020].
- Regularization: Dropout 0.1 [Srivastava et al., 2014].
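The 80/10/10 masking scheme from the objective above can be sketched in plain Python. The token ids here are placeholders (the real [MASK] id comes from the tokenizer); positions not selected for masking get the label -100, which PyTorch's cross-entropy loss ignores:

```python
import random

MASK_ID = 4           # assumed [MASK] id; the real id comes from the tokenizer
VOCAB_SIZE = 100_000  # matches the 100k WordPiece vocabulary
IGNORE_INDEX = -100   # label value ignored by PyTorch cross-entropy

def mlm_mask(token_ids, p=0.15, seed=0):
    """BERT-style masking: returns (corrupted inputs, MLM labels)."""
    rng = random.Random(seed)
    inputs, labels = list(token_ids), []
    for i, tok in enumerate(token_ids):
        if rng.random() < p:       # select ~15% of positions
            labels.append(tok)     # the model must recover the original token
            r = rng.random()
            if r < 0.8:
                inputs[i] = MASK_ID                    # 80%: replace with [MASK]
            elif r < 0.9:
                inputs[i] = rng.randrange(VOCAB_SIZE)  # 10%: random token
            # remaining 10%: leave the token unchanged
        else:
            labels.append(IGNORE_INDEX)  # excluded from the loss
    return inputs, labels
```

Keeping 10% of selected tokens unchanged forces the model to build useful representations even for positions that look intact.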
## Evaluation Results
Fine-tuned with PyTorch/Hugging Face using AdamW ($\mathrm{lr}=2.5\times10^{-5}$, batch size 32-384, 10 epochs, early stopping on macro-F1).
| Dataset | Task | F1 (Alberta-MSA) | F1 (BERT-MSA) | Retention (%) |
|---|---|---|---|---|
| AAFAQ [Essam et al., 2025] | Question Classification | 0.8835 | 0.8973 | 98.4 |
| ANERCorp [Benajiba et al., 2007] | NER (9 tags) | 0.6416 | 0.6739 | 95.23 |
| OSCAT [Seelawi et al., 2021] | Offensive Detection | 0.7459 | 0.8000 | 93.24 |
| IDAT [Seelawi et al., 2021] | Irony Detection | 0.7800 | 0.7947 | 98.13 |
| NSURL [Seelawi et al., 2021] | Semantic Similarity | 0.9385 | 0.9596 | 97.76 |
| ASERQA | QA (Span Extraction) | 0.5300 | 0.5289 | 100.19 |
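The Retention column is simply the Alberta-MSA F1 divided by the BERT-MSA F1; small discrepancies against the table come from the reported F1 scores themselves being rounded to four decimals:

```python
# Recompute the Retention (%) column from the (Alberta-MSA, BERT-MSA) F1 pairs.
scores = {
    "AAFAQ":    (0.8835, 0.8973),
    "ANERCorp": (0.6416, 0.6739),
    "OSCAT":    (0.7459, 0.8000),
    "IDAT":     (0.7800, 0.7947),
    "NSURL":    (0.9385, 0.9596),
    "ASERQA":   (0.5300, 0.5289),
}
for name, (albert_f1, bert_f1) in scores.items():
    print(f"{name}: {100 * albert_f1 / bert_f1:.2f}%")
```

Note that ASERQA retention exceeds 100% because the smaller model slightly outperforms BERT-MSA there.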
Inference Speed (ONNX, CPU, Float32): Faster on short sequences (128 tokens: 18% speedup vs. BERT) but slower on long sequences (512 tokens: 47% slowdown) [Lan et al., 2019].
## How to Use
```python
from transformers import AutoModelForMaskedLM, AutoTokenizer
import torch

model_name = "AbdallahhSaleh/Alberta-MSA"

# 1. Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)
model.eval()

# 2. Prepare input
text = "السلام [MASK] الجميع."
inputs = tokenizer(text, return_tensors="pt")
# Shape: (batch=1, seq_len=6) -> e.g., [CLS], السلام, [MASK], الجميع, ., [SEP]

# 3. Forward pass (data flow)
with torch.no_grad():
    outputs = model(**inputs)
# Hidden-state projection:
# Embedding (1, 6, 128) -> Projected (1, 6, 768) -> Layers -> Final hidden (1, 6, 768)

logits = outputs.logits
# Shape: (batch=1, seq_len=6, vocab_size=100000)

# 4. Decode the masked position
mask_token_index = (inputs.input_ids == tokenizer.mask_token_id)[0].nonzero(as_tuple=True)[0]
predicted_token_id = logits[0, mask_token_index].argmax(dim=-1)
print(tokenizer.decode(predicted_token_id))
# Output: "عليكم"
```
Base model: albert/albert-base-v2