Naqta (Arabic: نقطة, "dot/period") is a high-performance Arabic punctuation restoration model. Given plain unpunctuated Arabic text, it predicts the correct punctuation mark after each word using token-level sequence classification on top of XLM-RoBERTa Large.
Try it live on the Hugging Face Space

What Does It Restore?

| Symbol | Name | Arabic name |
|---|---|---|
| . | Period | نهاية الجملة |
| ، | Arabic comma | فاصلة عربية |
| ؟ | Arabic question mark | علامة استفهام |
| ! | Exclamation mark | علامة تعجب |
| : | Colon | نقطتان |
| ؛ | Arabic semicolon | فاصلة منقوطة |
| - | Dash | شرطة |
Results
Validation Metrics (v11d, Final)

| Metric | Score |
|---|---|
| Macro F1 | 0.8960 |
| Accuracy | 0.9714 |

Per-Class F1 Score

| Class | Symbol | F1 | Performance |
|---|---|---|---|
| Exclamation | ! | 0.8897 | Excellent |
| Arabic semicolon | ؛ | 0.8042 | Excellent |
| Question mark | ؟ | 0.9665 | Excellent |
| Dash | - | 0.9007 | Excellent |
| Arabic comma | ، | 0.8100 | Excellent |
| Period | . | 0.8968 | Excellent |
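
For readers who want to compute metrics of this kind on their own data, here is a minimal sketch of per-class and macro F1 over word-level labels. It is an illustration only, not the evaluation script used for Naqta: the function names and the toy labels are assumptions, and the card does not state whether the reported macro F1 includes the no-punctuation class `O`.

```python
from collections import Counter


def per_class_f1(gold, pred, ignore=()):
    """Return {label: F1} for every label seen in gold or pred, minus `ignore`."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label p where it did not belong
            fn[g] += 1  # missed the gold label g
    scores = {}
    for label in sorted(set(gold) | set(pred)):
        if label in ignore:
            continue
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(scores):
    """Unweighted mean of the per-class F1 values."""
    return sum(scores.values()) / len(scores)


# Toy example: one label per word, "O" meaning "no punctuation after this word".
gold = [".", "O", "،", "O", "؟"]
pred = [".", "O", "O", "O", "؟"]
scores = per_class_f1(gold, pred)
print(scores)
print(round(macro_f1(scores), 4))
```

Because macro F1 weights every class equally, a rare class such as the semicolon moves it far more than it moves accuracy, which is dominated by the frequent `O` label.
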
Training Data
The model was trained on a large multi-source Arabic corpus totaling over 1.4 million paragraphs from six diverse sources, covering a broad range of Arabic writing styles and domains.
Corpus Sources
| Source | Rows | Domain |
|---|---|---|
| ABC / UNPC | ~1,020,000 | News & formal Arabic (United Nations Parallel Corpus) |
| HF Tashkeel | ~151,000 | Vocalized Arabic text (diacritized corpus) |
| Hindawi E-Books | ~100,000 | Literary Arabic prose (novels & non-fiction) |
| Wikipedia (AR) | ~98,500 | Encyclopedia articles |
| CBT | ~69,000 | Classical Arabic books & religious texts |
| ARCD + XQuAD | ~2,050 | Arabic QA pairs (rich in question marks ؟) |
| Total (raw) | ~1,441,000 | - |

All paragraphs were filtered to contain at least one Arabic letter and one target punctuation mark. Non-target punctuation (e.g., «», …, parentheses) was removed before training.
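
The filtering rule above can be sketched in a few lines. This is a hedged illustration rather than the author's pipeline: the names `keep_paragraph` and `strip_non_target`, the Unicode range used to detect Arabic letters, and the exact set of removed characters are all assumptions.

```python
import re

# The seven target marks listed in this card.
TARGET_MARKS = ".،؟!:؛-"
# Assumption: U+0621..U+064A is used as "an Arabic letter" (basic letters plus tatweel).
ARABIC_LETTER = re.compile("[\u0621-\u064A]")
# Assumption: only the non-target marks named in the text are removed here.
NON_TARGET = re.compile("[«»…()]")


def keep_paragraph(text):
    """True if the paragraph has at least one Arabic letter and one target mark."""
    has_letter = ARABIC_LETTER.search(text) is not None
    has_mark = any(mark in text for mark in TARGET_MARKS)
    return has_letter and has_mark


def strip_non_target(text):
    """Replace non-target punctuation with spaces, then collapse whitespace."""
    return " ".join(NON_TARGET.sub(" ", text).split())


raw = ["مرحبا، كيف الحال؟", "hello, world.", "مرحبا بكم"]
kept = [strip_non_target(p) for p in raw if keep_paragraph(p)]
print(kept)
```

Only the first toy paragraph survives: the second has no Arabic letter and the third has no target mark.
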
Punctuation Coverage (raw corpus)
| Mark | Name | Paragraphs | Coverage |
|---|---|---|---|
| ، | Arabic comma | 922,721 | 64.0% |
| : | Colon | 230,150 | 16.0% |
| ؛ | Arabic semicolon | 128,744 | 8.9% |
| ؟ | Question mark | 50,282 | 3.5% |
| ! | Exclamation | 15,976 | 1.1% |
| - | Dash | ~1 | <0.1% |
Data Balance Strategy
To prevent the model from ignoring rare punctuation marks, a targeted oversampling strategy was applied:
| Strategy | Marks | Multiplier | Cap |
|---|---|---|---|
| Strong oversampling | ؟ ! | ×8 | 80,000 rows |
| Light oversampling | ؛ - | ×6 | 80,000 rows |
After oversampling, the combined training pool grew to ~2.4 million paragraphs.
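
One way to read the multiplier-plus-cap scheme is sketched below. The interpretation is an assumption: a multiplier of N is taken to mean each matching paragraph ends up N times in the pool, and the cap is taken to limit the number of extra rows added per strategy. The helper name `oversample` and the toy data are hypothetical.

```python
import random


def oversample(paragraphs, marks, multiplier, cap, seed=0):
    """Return the extra rows to add for paragraphs containing any of `marks`.

    Each matching paragraph is duplicated (multiplier - 1) extra times,
    then the extras are shuffled and truncated to at most `cap` rows.
    """
    rng = random.Random(seed)
    matches = [p for p in paragraphs if any(m in p for m in marks)]
    extras = matches * (multiplier - 1)
    rng.shuffle(extras)
    return extras[:cap]


# Toy corpus: two of the four paragraphs contain a "rare" mark.
corpus = ["a!", "b.", "c؟", "d،"]
extras = oversample(corpus, marks="؟!", multiplier=8, cap=80_000)
pool = corpus + extras
print(len(corpus), len(extras), len(pool))
```

Capping the extras keeps a very common "rare" mark from swamping the pool while still letting a genuinely scarce one be repeated in full.
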
Dataset Splits
| Split | Sequences | Share |
|---|---|---|
| Train (capped) | 1,000,000 | 85% |
| Validation | 40,000 | 10% |
| Test | - | 5% |
- Sliding-window context (window=3 sentences, stride=2) was applied to training data only
- Validation and test sets remain un-windowed for clean, unbiased evaluation
- Splits were stratified by the rarest punctuation mark in each sequence
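
The sliding-window step can be illustrated as follows. This is a sketch under assumptions: the card gives only the window and stride sizes, so sentence splitting is taken as already done, and the handling of a short final window and the function name `sliding_windows` are choices made here, not the author's.

```python
def sliding_windows(sentences, window=3, stride=2):
    """Group consecutive sentences into overlapping training sequences."""
    if not sentences:
        return []
    windows = []
    for start in range(0, len(sentences), stride):
        windows.append(" ".join(sentences[start:start + window]))
        # Stop once a window has reached the end of the paragraph.
        if start + window >= len(sentences):
            break
    return windows


sentences = ["s1.", "s2.", "s3.", "s4.", "s5."]
for seq in sliding_windows(sentences):
    print(seq)
```

With a window of 3 and a stride of 2, neighbouring sequences share one sentence, so a sentence near a window boundary is seen in more than one context.
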
Preprocessing
- Arabic normalization: alef variants → ا, ya variants → ي, diacritics stripped
- Label assigned per word = punctuation mark following that word
- Multi-subword words: only the first subword receives the label; others are masked (`-100`)
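
The three preprocessing steps above can be sketched without loading the tokenizer. Everything here is illustrative: the function names, the exact character sets used for normalization, and the `label2id` mapping are assumptions, and `word_ids` stands in for the list a fast tokenizer returns from `word_ids()`.

```python
import re

MARKS = ".،؟!:؛-"
# Assumption: tanween, short vowels, shadda, sukun, and the dagger alef.
DIACRITICS = re.compile("[\u064B-\u0652\u0670]")


def normalize(text):
    """Strip diacritics and unify alef and ya variants."""
    text = DIACRITICS.sub("", text)
    text = re.sub("[أإآٱ]", "ا", text)
    return text.replace("ى", "ي")


def words_and_labels(text):
    """Split punctuated text into words and the mark that follows each word."""
    words, labels = [], []
    for token in normalize(text).split():
        word = token.rstrip(MARKS)
        mark = token[len(word):]
        if not word:
            continue  # a free-standing mark is ignored in this sketch
        words.append(word)
        labels.append(mark[0] if mark else "O")
    return words, labels


def align_labels(word_ids, labels, label2id):
    """Label only the first subword of each word; mask the rest with -100."""
    aligned, previous = [], None
    for word_id in word_ids:
        if word_id is None or word_id == previous:
            aligned.append(-100)
        else:
            aligned.append(label2id[labels[word_id]])
        previous = word_id
    return aligned


words, labels = words_and_labels("أهلا، كيف حالك؟")
print(words, labels)
print(align_labels([None, 0, 0, 1, 2, 2, None], labels, {"O": 0, "،": 1, "؟": 2}))
```

The value `-100` is the default `ignore_index` of PyTorch's cross-entropy loss, so masked subwords contribute nothing to the gradient.
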
Model Architecture & Training
| Setting | Value |
|---|---|
| Base model | xlm-roberta-large (~560M params) |
| Task | Token classification (8 labels) |
| Max sequence length | 384 tokens |
| Training examples | 1,000,000 (capped) |
| Validation examples | 40,000 |
Two-Phase Training
| Phase | Epochs | LR | Loss | Notes |
|---|---|---|---|---|
| Phase 1 | 2 | 2e-5 | Cross-entropy + label smoothing | Full model fine-tuning |
| Phase 2 | 1 | 6e-6 | Focal loss (γ=2.0) + class weights | Bottom 12 layers frozen |
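
To make the Phase 2 loss concrete, here is a minimal pure-Python sketch of class-weighted focal loss for a single token, using the standard form FL = -w_t (1 - p_t)^γ log(p_t). It is not the training code: the function names and the example weights are assumptions, and a real implementation would operate on batched tensors.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def focal_loss(logits, target, class_weights=None, gamma=2.0):
    """Focal loss for one token: -w_t * (1 - p_t)**gamma * log(p_t)."""
    p_t = max(softmax(logits)[target], 1e-12)  # clamp to avoid log(0)
    weight = class_weights[target] if class_weights else 1.0
    return -weight * (1.0 - p_t) ** gamma * math.log(p_t)


# A confident, correct prediction is down-weighted far more than an uncertain one.
print(focal_loss([5.0, 0.0], target=0))
print(focal_loss([0.0, 0.0], target=0))
# With gamma=0 and no weights, the loss reduces to plain cross-entropy.
print(focal_loss([0.0, 0.0], target=0, gamma=0.0))
```

The (1 - p_t)^γ factor shrinks the loss on tokens the model already classifies confidently, which in this task are mostly the frequent `O` positions, leaving more of the gradient for rare marks.
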
Class Weights
Rare class weights were additionally boosted:
| Class | Boost |
|---|---|
| ؟ | ×1.2 |
| ! | ×3.0 |
| ؛ | ×2.0 |
| - | ×1.3 |
Quick Start

    from transformers import AutoTokenizer, AutoModelForTokenClassification
    import torch

    repo_id = "MostafaMaroof/Naqta"
    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    model = AutoModelForTokenClassification.from_pretrained(repo_id)
    model.eval()
    id2label = model.config.id2label

    text = "بلغت نسبة النمو الاقتصادي 4.7 بالمئة خلال الربع الثالث من عام 2024 وهو اعلى مستوى منذ خمس سنوات"
    words = text.split()

    # Tokenize pre-split words so each subword can be mapped back to its word.
    inputs = tokenizer(
        words,
        is_split_into_words=True,
        return_tensors="pt",
        truncation=True,
        max_length=384,
    )

    with torch.no_grad():
        logits = model(**inputs).logits

    pred_ids = logits.argmax(dim=-1)[0].tolist()
    word_ids = inputs.word_ids(batch_index=0)

    restored_words = []
    previous_word_id = None
    for token_id, word_id in zip(pred_ids, word_ids):
        # Skip special tokens and every subword after the first one of a word.
        if word_id is None or word_id == previous_word_id:
            continue
        word = words[word_id]
        label = id2label[token_id]
        if label != "O":
            word = word + label  # the label is the punctuation mark itself
        restored_words.append(word)
        previous_word_id = word_id

    restored_text = " ".join(restored_words)
    print(restored_text)
    # → بلغت نسبة النمو الاقتصادي 4.7 بالمئة خلال الربع الثالث من عام 2024، وهو اعلى مستوى منذ خمس سنوات.

Example

Input (unpunctuated):

اذا اردت ان تنجح في حياتك فعليك ان تحدد اهدافك واضحة وان تعمل بجد واستمرارية ولا تيأس عند اول عقبة تواجهها

Output (restored):

اذا اردت ان تنجح في حياتك، فعليك ان تحدد اهدافك واضحة، وان تعمل بجد واستمرارية، ولا تيأس عند اول عقبة تواجهها.

Question example:

من اخترع الهاتف وفي اي سنة تم ذلك وما هي اهمية هذا الاختراع

من اخترع الهاتف؟ وفي اي سنة تم ذلك؟ وما هي اهمية هذا الاختراع؟

Intended Use
Naqta is well-suited for:
- ASR post-processing: restoring punctuation in Arabic speech transcripts
- Readability enhancement: making raw Arabic text easier to read
- NLP preprocessing: improving text quality for downstream Arabic NLP tasks
- Research: Arabic punctuation restoration benchmark evaluation
Limitations
- Punctuation restoration is partly stylistic: multiple valid outputs may exist for a single input.
- Performance may degrade on highly dialectal, technical, or domain-specific text.
- The model does not predict quotation marks or dialogue markers («»).
- Very short or fragmented text (< 5 words) may produce less reliable results.
- The model predicts punctuation position only and does not perform grammar correction.
License
This model is released under the MIT License.
Citation
If you use Naqta in your work, please reference:
@misc{naqta2025,
title = {Naqta: Arabic Punctuation Restoration with XLM-RoBERTa},
author = {MostafaMaroof},
year = {2025},
publisher = {Hugging Face},
url = {https://huggingface.co/MostafaMaroof/Naqta}
}