---
language:
- ar
license: cc-by-nc-4.0
tags:
- bert
- Arabic BERT
- Levantine Dialect
- Shami
- Syrian Arabic
- Lebanese Arabic
- Jordanian Arabic
- Palestinian Arabic
- Masked Language Model
- Arabic NLP
datasets:
- QCRI/arabic_pos_dialect
- guymorlan/levanti
base_model: aubmindlab/bert-base-arabertv02-twitter
pipeline_tag: fill-mask
---
# ShamiBERT 🇸🇾🇱🇧🇯🇴🇵🇸
ShamiBERT is a BERT-based language model specialized for Levantine Arabic (الشامية) — the dialect spoken in Syria, Lebanon, Jordan, and Palestine.
## Model Description
ShamiBERT was created by performing continual pre-training on aubmindlab/bert-base-arabertv02-twitter using Masked Language Modeling (MLM) on Levantine Arabic text data.
### Architecture
- Base Model: AraBERTv0.2-Twitter (aubmindlab/bert-base-arabertv02-twitter)
- Architecture: BERT-base (12 layers, 12 attention heads, 768 hidden size)
- Task: Masked Language Modeling (MLM)
- Training: Continual pre-training on Levantine dialect data
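The MLM objective used for continual pre-training can be illustrated with a small, dependency-free sketch of BERT-style masking: 15% of token positions become prediction targets, and of those, 80% are replaced with `[MASK]`, 10% with a random vocabulary token, and 10% are left unchanged. This is a toy illustration of the recipe (matching the MLM probability listed under Training Details), not the actual training code:

```python
import random

def mask_tokens(tokens, vocab, mlm_prob=0.15, mask_token="[MASK]", rng=None):
    """BERT-style masking: select mlm_prob of positions as prediction targets,
    then replace 80% with [MASK], 10% with a random token, 10% unchanged."""
    rng = rng or random.Random()
    inputs, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < mlm_prob:
            labels[i] = tok                    # model must predict the original token
            roll = rng.random()
            if roll < 0.8:
                inputs[i] = mask_token         # 80%: replace with [MASK]
            elif roll < 0.9:
                inputs[i] = rng.choice(vocab)  # 10%: replace with a random token
            # else 10%: keep the original token (label still set)
    return inputs, labels

# Toy demo on a short Shami sentence, using the sentence itself as a stand-in vocab
tokens = "كيفك يا زلمة الحمدلله تمام شو الاخبار".split()
inputs, labels = mask_tokens(tokens, vocab=tokens, rng=random.Random(1))
print(inputs)
print(labels)
```

In practice this is what `transformers`' `DataCollatorForLanguageModeling(mlm_probability=0.15)` does at the token-ID level on each batch.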
### Why AraBERT-Twitter as base?
- Pre-trained on 77GB Arabic text + 60M Arabic tweets
- Already handles dialectal Arabic and social media text
- Supports emojis in vocabulary
- Strong foundation for dialect-specific adaptation
## Training Data
ShamiBERT was trained on a combination of Levantine Arabic datasets:
| Dataset | Source | Description |
|---|---|---|
| QCRI Arabic POS (LEV) | HuggingFace | Levantine tweets with POS tags |
| Levanti | HuggingFace | Palestinian/Syrian/Lebanese/Jordanian sentences |
| Curated Shami | Manual | Hand-curated Levantine expressions and phrases |
## Training Details
- Epochs: 5
- Learning Rate: 2e-05
- Batch Size: 128 (effective)
- Max Sequence Length: 128
- MLM Probability: 0.15
- Optimizer: AdamW (β1=0.9, β2=0.999, ε=1e-6)
- Weight Decay: 0.01
- Warmup: 10%
- Eval Perplexity: 5.04
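The 10% warmup with a peak learning rate of 2e-5 corresponds to the common linear warmup-then-linear-decay schedule (the exact scheduler used here is an assumption); a minimal sketch of the arithmetic:

```python
def lr_at_step(step, total_steps, peak_lr=2e-5, warmup_frac=0.10):
    """Linear warmup to peak_lr over the first warmup_frac of training steps,
    then linear decay to zero (as in transformers' get_linear_schedule_with_warmup)."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    return peak_lr * max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

total = 10_000
print(lr_at_step(500, total))     # halfway through warmup: 1e-05
print(lr_at_step(1_000, total))   # end of warmup: peak 2e-05
print(lr_at_step(10_000, total))  # end of training: 0.0
```

For reference, the reported eval perplexity of 5.04 is just the exponential of the mean cross-entropy loss, i.e. a held-out MLM loss of ln(5.04) ≈ 1.62.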
## Usage

### Fill-Mask (تعبئة القناع)

```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="mabahboh/ShamiBERT")

# Shami example: "How are you, [MASK], praise be to God"
results = fill_mask("كيفك [MASK] الحمدلله")
for r in results[:3]:
    print(f"{r['token_str']} ({r['score']:.4f})")
```
### Feature Extraction (for downstream tasks)

```python
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("mabahboh/ShamiBERT")
model = AutoModel.from_pretrained("mabahboh/ShamiBERT")

text = "شو أخبارك يا زلمة"  # "What's up, man?"
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)

# Use the [CLS] token embedding for classification
cls_embedding = outputs.last_hidden_state[:, 0, :]
```
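An alternative to the `[CLS]` vector is mean pooling over token embeddings, weighted by the attention mask so that padding positions are ignored. The arithmetic is shown here dependency-free for clarity (with batched torch tensors you would use masked sums instead):

```python
def masked_mean_pool(hidden_states, attention_mask):
    """Average the token vectors where attention_mask == 1, excluding padding.
    hidden_states: list of token vectors (lists of floats); attention_mask: 0/1 ints."""
    dim = len(hidden_states[0])
    sums, count = [0.0] * dim, 0
    for vec, keep in zip(hidden_states, attention_mask):
        if keep:
            count += 1
            for j, v in enumerate(vec):
                sums[j] += v
    return [s / max(1, count) for s in sums]

# Toy sequence: two real tokens and one padding position
hidden = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
print(masked_mean_pool(hidden, mask))  # -> [2.0, 3.0]
```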
### Preprocessing (recommended)

```python
from arabert.preprocess import ArabertPreprocessor

prep = ArabertPreprocessor(model_name="bert-base-arabertv02-twitter", keep_emojis=True)
text = prep.preprocess("كيفك يا خيي")  # "How are you, brother?"
```
## Intended Uses
ShamiBERT is designed for NLP tasks involving Levantine Arabic dialect:
- Sentiment Analysis of Levantine social media
- Text Classification (topic, dialect sub-identification)
- Named Entity Recognition in Shami text
- Feature Extraction for downstream tasks
- Dialect Identification (Levantine vs other Arabic dialects)
## Limitations

- Training data is limited compared to large-scale dialect models such as SaudiBERT (26.3 GB)
- Performance may vary across sub-dialects (Syrian vs. Lebanese vs. Jordanian vs. Palestinian)
- Based on AraBERT-Twitter, which was pre-trained with max_length=64
- Not suitable for MSA-heavy or non-Levantine dialect tasks
## Citation

```bibtex
@misc{shamibert2026,
  title={ShamiBERT: A BERT Model for Levantine Arabic Dialect},
  year={2026},
  note={Continual pre-training of AraBERT-Twitter on Levantine Arabic data}
}
```
## Acknowledgments
- AraBERT team (AUB MIND Lab) for the base model
- ArSyra team for Levantine dialect data
- QCRI for Arabic dialect resources
- Unsloth team for training optimizations