---
language:
- ar
license: cc-by-nc-4.0
tags:
- bert
- Arabic BERT
- Levantine Dialect
- Shami
- Syrian Arabic
- Lebanese Arabic
- Jordanian Arabic
- Palestinian Arabic
- Masked Language Model
- Arabic NLP
datasets:
- QCRI/arabic_pos_dialect
- guymorlan/levanti
base_model: aubmindlab/bert-base-arabertv02-twitter
pipeline_tag: fill-mask
---
# ShamiBERT 🇸🇾🇱🇧🇯🇴🇵🇸
**ShamiBERT** is a BERT-based language model specialized for **Levantine Arabic (الشامية)** — the dialect spoken in Syria, Lebanon, Jordan, and Palestine.
## Model Description
ShamiBERT was created by performing **continual pre-training** on `aubmindlab/bert-base-arabertv02-twitter` using Masked Language Modeling (MLM) on Levantine Arabic text data.
### Architecture
- **Base Model**: AraBERTv0.2-Twitter (aubmindlab/bert-base-arabertv02-twitter)
- **Architecture**: BERT-base (12 layers, 12 attention heads, 768 hidden size)
- **Task**: Masked Language Modeling (MLM)
- **Training**: Continual pre-training on Levantine dialect data
### Why AraBERT-Twitter as base?
1. Pre-trained on 77GB Arabic text + 60M Arabic tweets
2. Already handles dialectal Arabic and social media text
3. Supports emojis in vocabulary
4. Strong foundation for dialect-specific adaptation
## Training Data
ShamiBERT was trained on a combination of Levantine Arabic datasets:
| Dataset | Source | Description |
|---------|--------|-------------|
| QCRI Arabic POS (LEV) | HuggingFace | Levantine tweets with POS tags |
| Levanti | HuggingFace | Palestinian/Syrian/Lebanese/Jordanian sentences |
| Curated Shami | Manual | Hand-curated Levantine expressions and phrases |
### Training Details
- **Epochs**: 5
- **Learning Rate**: 2e-05
- **Batch Size**: 128 (effective)
- **Max Sequence Length**: 128
- **MLM Probability**: 0.15
- **Optimizer**: AdamW (β1=0.9, β2=0.999, ε=1e-6)
- **Weight Decay**: 0.01
- **Warmup**: 10%
- **Eval Perplexity**: 5.04
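For intuition about the **MLM Probability: 0.15** setting above, here is a minimal pure-Python sketch of BERT's standard masking rule (the `mlm_mask` helper is illustrative, not the actual training code; in practice 🤗 Transformers' `DataCollatorForLanguageModeling(mlm_probability=0.15)` does this):

```python
import random

def mlm_mask(token_ids, mask_id, vocab_size, mlm_prob=0.15, seed=0):
    """Each token is selected with probability mlm_prob. Of the selected
    tokens: 80% become [MASK], 10% become a random vocabulary id, and
    10% are left unchanged. Labels are -100 (ignored by the loss) at
    unselected positions."""
    rng = random.Random(seed)
    inputs, labels = list(token_ids), []
    for i, tok in enumerate(token_ids):
        if rng.random() < mlm_prob:
            labels.append(tok)  # model must predict the original token
            roll = rng.random()
            if roll < 0.8:
                inputs[i] = mask_id                    # 80%: [MASK]
            elif roll < 0.9:
                inputs[i] = rng.randrange(vocab_size)  # 10%: random token
            # else: 10% keep the original token
        else:
            labels.append(-100)  # position excluded from the loss
    return inputs, labels
```

The 80/10/10 split forces the model to build useful representations even for tokens that are not replaced by `[MASK]`.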
## Usage
### Fill-Mask (تعبئة القناع)
```python
from transformers import pipeline
fill_mask = pipeline("fill-mask", model="mabahboh/ShamiBERT")
# Shami examples
results = fill_mask("كيفك [MASK] الحمدلله")
for r in results[:3]:
    print(f"{r['token_str']} ({r['score']:.4f})")
```
### Feature Extraction (for downstream tasks)
```python
from transformers import AutoTokenizer, AutoModel
import torch
tokenizer = AutoTokenizer.from_pretrained("mabahboh/ShamiBERT")
model = AutoModel.from_pretrained("mabahboh/ShamiBERT")
text = "شو أخبارك يا زلمة"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
# Use the [CLS] token embedding for classification
cls_embedding = outputs.last_hidden_state[:, 0, :]
```
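The `[CLS]` embedding is one option; for sentence-level tasks, attention-mask-weighted mean pooling over all token states is a common alternative. A minimal sketch on dummy tensors (the `mean_pool` function is illustrative; with the real model you would pass `outputs.last_hidden_state` and `inputs["attention_mask"]`):

```python
import torch

def mean_pool(last_hidden_state, attention_mask):
    # Average token embeddings, counting only real (non-padding) tokens.
    mask = attention_mask.unsqueeze(-1).float()      # (batch, seq, 1)
    summed = (last_hidden_state * mask).sum(dim=1)   # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1e-9)         # (batch, 1)
    return summed / counts

# Dummy shapes matching BERT-base: batch=2, seq=8, hidden=768
hidden = torch.randn(2, 8, 768)
mask = torch.tensor([[1, 1, 1, 1, 0, 0, 0, 0],
                     [1, 1, 1, 1, 1, 1, 1, 1]])
sentence_emb = mean_pool(hidden, mask)  # shape: (2, 768)
```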
### Preprocessing (recommended)
```python
# Requires the `arabert` package: pip install arabert
from arabert.preprocess import ArabertPreprocessor
prep = ArabertPreprocessor(model_name="bert-base-arabertv02-twitter", keep_emojis=True)
text = prep.preprocess("كيفك يا خيي")
```
## Intended Uses
ShamiBERT is designed for NLP tasks involving Levantine Arabic dialect:
- **Sentiment Analysis** of Levantine social media
- **Text Classification** (topic, dialect sub-identification)
- **Named Entity Recognition** in Shami text
- **Feature Extraction** for downstream tasks
- **Dialect Identification** (Levantine vs other Arabic dialects)
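For the classification uses above, a common recipe is a linear head over the 768-dim `[CLS]` embedding (in practice `AutoModelForSequenceClassification` wires this up for you). A minimal sketch on dummy features, assuming three sentiment labels (the `SentimentHead` class and label count are illustrative):

```python
import torch
import torch.nn as nn

class SentimentHead(nn.Module):
    """Illustrative linear classifier over a [CLS] embedding."""
    def __init__(self, hidden_size=768, num_labels=3):
        super().__init__()
        self.dropout = nn.Dropout(0.1)
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, cls_embedding):
        return self.classifier(self.dropout(cls_embedding))

head = SentimentHead()
logits = head(torch.randn(4, 768))  # batch of 4 [CLS] embeddings
probs = logits.softmax(dim=-1)      # shape: (4, 3)
```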
## Limitations
- Training data is limited compared to large-scale models like SaudiBERT (26.3GB)
- Performance may vary across sub-dialects (Syrian vs Lebanese vs Jordanian vs Palestinian)
- Based on AraBERT-Twitter, which was trained with max_length=64
- Not suitable for MSA-heavy or non-Levantine dialect tasks
## Citation
```bibtex
@misc{shamibert2026,
  title={ShamiBERT: A BERT Model for Levantine Arabic Dialect},
  year={2026},
  note={Continual pre-training of AraBERT-Twitter on Levantine Arabic data}
}
```
## Acknowledgments
- **AraBERT** team (AUB MIND Lab) for the base model
- **ArSyra** team for Levantine dialect data
- **QCRI** for Arabic dialect resources
- **Unsloth** team for training optimizations