# Arabic Punctuation Restoration with AraBERT
Fine-tuned AraBERT v0.2 for automatic punctuation restoration in Arabic text.
The model takes unpunctuated Arabic text as input and predicts the punctuation mark that should follow each word.
## Labels
| ID | Symbol | Description |
|---|---|---|
| 0 | (none) | No punctuation |
| 1 | . | Period |
| 2 | ، | Arabic comma |
| 3 | ؟ | Arabic question mark |
| 4 | ! | Exclamation mark |
| 5 | ؛ | Arabic semicolon |
| 6 | : | Colon |
## Usage

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("makdadTaleb/arabic-punctuation-arabert")
model = AutoModelForTokenClassification.from_pretrained("makdadTaleb/arabic-punctuation-arabert")
model.eval()

text = "ذهب الولد إلى المدرسة وعاد في المساء"
words = text.split()

encoding = tokenizer(
    words,
    is_split_into_words=True,
    return_tensors="pt",
    truncation=True,
    padding=True,
)

with torch.no_grad():
    logits = model(**encoding).logits
preds = logits.argmax(dim=-1)[0]

word_ids = encoding.word_ids(batch_index=0)
id2label = {0: "", 1: ".", 2: "،", 3: "؟", 4: "!", 5: "؛", 6: ":"}

# Keep the prediction from each word's first sub-token only.
result = []
prev = None
for idx, word_idx in enumerate(word_ids):
    if word_idx is None or word_idx == prev:
        continue
    result.append(words[word_idx] + id2label[preds[idx].item()])
    prev = word_idx

print(" ".join(result))
```
## Training
Dataset: SSAC-UNPC, a large-scale Arabic corpus of UN proceedings.
The dataset was built with a custom pipeline to address severe class imbalance:
- Extracted all sentences containing rare punctuation (! and ؟)
- Extracted sentences rich in comma and semicolon combinations
- Final training set: ~400,000 sentences
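The rebalancing pipeline is described only at a high level. A minimal sketch of the sentence-selection step (the function name and the minimum-count threshold of 2 are illustrative assumptions, not the card's exact pipeline):

```python
RARE_MARKS = {"!", "؟"}   # rare punctuation classes targeted by the pipeline
RICH_MARKS = {"،", "؛"}   # comma / semicolon combinations

def select_sentences(sentences):
    """Keep sentences that contain a rare mark, or at least two
    comma/semicolon occurrences (threshold is an assumption)."""
    selected = []
    for s in sentences:
        if any(m in s for m in RARE_MARKS):
            selected.append(s)
        elif sum(s.count(m) for m in RICH_MARKS) >= 2:
            selected.append(s)
    return selected
```

Everything else in the corpus would be sampled normally; only the rare-class sentences are kept exhaustively.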
Training setup:
- Base model: aubmindlab/bert-base-arabertv02
- Optimizer: AdamW (lr = 2e-5)
- Scheduler: Linear warmup (10%) + linear decay
- Loss: Weighted CrossEntropyLoss
- Mixed precision: AMP (FP16)
- Early stopping: patience = 2
- Epochs: 3 | Batch size: 16 | Max length: 128
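The card does not specify how the class weights for the weighted CrossEntropyLoss were computed. A common choice, shown here purely as an illustration (the label counts are placeholders, not the real corpus statistics), is inverse-frequency weighting:

```python
import torch
import torch.nn as nn

# Hypothetical per-class token counts for (O, ., ،, ؟, !, ؛, :).
# Real values would come from the training corpus.
counts = torch.tensor(
    [9_000_000, 400_000, 300_000, 20_000, 5_000, 8_000, 30_000],
    dtype=torch.float,
)

# Inverse-frequency weights: rare classes (e.g. "!") get larger weights.
weights = counts.sum() / (len(counts) * counts)

# ignore_index=-100 skips padding and non-first sub-tokens, as is
# conventional for token classification in transformers.
criterion = nn.CrossEntropyLoss(weight=weights, ignore_index=-100)

logits = torch.randn(4, 7)                 # (tokens, num_labels)
labels = torch.tensor([0, 2, 4, -100])     # -100 = ignored position
loss = criterion(logits, labels)
```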
Two-Stage Decision for Arabic Comma:
The model uses a custom post-processing step for the comma class (،). If the comma probability exceeds a confidence threshold (τ = 0.70) and is higher than the no-punctuation probability, the model predicts comma. Otherwise, it selects the best non-comma alternative. This improved comma F1 from 0.685 to 0.749.
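The rule above can be sketched as a post-processing pass over the per-token probabilities (a minimal sketch: the threshold τ = 0.70 and label IDs follow the card, while the function name and tensor shapes are illustrative):

```python
import torch

O_ID = 0        # no punctuation
COMMA_ID = 2    # Arabic comma "،"
TAU = 0.70      # confidence threshold from the model card

def two_stage_decision(logits: torch.Tensor) -> torch.Tensor:
    """Apply the thresholded comma rule to logits of shape (seq_len, num_labels)."""
    probs = torch.softmax(logits, dim=-1)
    # Stage 1: predict comma only if its probability clears the threshold
    # AND beats the no-punctuation probability.
    comma_ok = (probs[:, COMMA_ID] > TAU) & (probs[:, COMMA_ID] > probs[:, O_ID])
    # Stage 2: otherwise fall back to the best non-comma label.
    masked = probs.clone()
    masked[:, COMMA_ID] = float("-inf")
    fallback = masked.argmax(dim=-1)
    return torch.where(comma_ok, torch.tensor(COMMA_ID), fallback)
```

This trades some comma precision for recall only when the model is genuinely confident, which is consistent with the reported F1 gain.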
## Results
Evaluated on a held-out validation set (~1.2M tokens):
| Class | Precision | Recall | F1 |
|---|---|---|---|
| O (no punct) | 0.995 | 0.964 | 0.979 |
| . Period | 0.993 | 0.999 | 0.996 |
| ، Comma | 0.646 | 0.891 | 0.749 |
| ؟ Question | 0.955 | 0.965 | 0.960 |
| ! Exclamation | 0.520 | 0.361 | 0.426 |
| ؛ Semicolon | 0.432 | 0.792 | 0.559 |
| : Colon | 0.705 | 0.919 | 0.798 |
| Weighted Avg | 0.970 | 0.960 | 0.963 |
## Repository
Full training code and data pipeline: github.com/MakdadTaleb/arabic-punctuation-restoration