---
language:
- ar
license: mit
tags:
- arabic
- punctuation-restoration
- token-classification
- xlm-roberta
- natural-language-processing
pipeline_tag: token-classification
base_model: xlm-roberta-large
model-index:
- name: Naqta
  results:
  - task:
      type: token-classification
      name: Arabic Punctuation Restoration
    dataset:
      name: Mixed Arabic punctuation restoration corpus
      type: custom
    metrics:
    - type: f1
      value: 0.8960
      name: Validation Macro F1
    - type: accuracy
      value: 0.9714
      name: Validation Accuracy
---
# 🔤 Naqta — نقطة
### Arabic Punctuation Restoration
---
**Naqta** (Arabic: نقطة, "dot/period") is an Arabic punctuation restoration model built on **XLM-RoBERTa Large**. Given plain unpunctuated Arabic text, it uses token classification to predict which punctuation mark, if any, follows each word, and reaches 0.8960 macro F1 on its validation set.
> 💡 **Try it live** on the [Hugging Face Space](https://huggingface.co/spaces/MostafaMaroof/Naqta)
---
## ✨ What Does It Restore?
| Symbol | Name | Arabic term |
|:---:|---|---|
| `.` | Period | نهاية الجملة |
| `،` | Arabic comma | فاصلة عربية |
| `؟` | Arabic question mark | علامة استفهام |
| `!` | Exclamation mark | علامة تعجب |
| `:` | Colon | نقطتان |
| `؛` | Arabic semicolon | فاصلة منقوطة |
| `-` | Dash | شرطة |
---
## 🏆 Results
### Validation Metrics (v11d — Final)
| Metric | Score |
|---|---:|
| 🎯 **Macro F1** | **0.8960** |
| ✅ Accuracy | 0.9714 |
### Per-Class F1 Score
| Class | Symbol | F1 | Performance |
|---|:---:|---:|---|
| Exclamation | `!` | 0.8897 | 🟢 Excellent |
| Arabic semicolon | `؛` | 0.8042 | 🟢 Excellent |
| Question mark | `؟` | 0.9665 | 🟢 Excellent |
| Dash | `-` | 0.9007 | 🟢 Excellent |
| Arabic comma | `،` | 0.8100 | 🟢 Excellent |
| Period | `.` | 0.8968 | 🟢 Excellent |
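
Macro F1 is the unweighted mean of the per-class F1 scores, so rare marks count as much as frequent ones. As a quick sanity check, the sketch below averages the six scores listed above. The result differs from the headline 0.8960 because the table does not list every label (the colon is absent, for example); which labels enter the reported average is not stated here, so treat this as illustrative arithmetic only.

```python
# Per-class F1 scores copied from the table above (six of the model's labels).
per_class_f1 = {
    "!": 0.8897,
    "؛": 0.8042,
    "؟": 0.9665,
    "-": 0.9007,
    "،": 0.8100,
    ".": 0.8968,
}

def macro_f1(scores):
    """Unweighted mean of per-class F1 values."""
    return sum(scores.values()) / len(scores)

# Mean over the six listed classes only, not the full label set.
listed_mean = macro_f1(per_class_f1)
print(f"{listed_mean:.4f}")  # → 0.8780
```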
---
## 🗂️ Training Data
The model was trained on a large multi-source Arabic corpus totaling over **1.4 million paragraphs** from six diverse sources, covering a broad range of Arabic writing styles and domains.
### Corpus Sources
| Source | Rows | Domain |
|---|---:|---|
| **ABC / UNPC** | ~1,020,000 | News & formal Arabic (United Nations Parallel Corpus) |
| **HF Tashkeel** | ~151,000 | Vocalized Arabic text (diacritized corpus) |
| **Hindawi E-Books** | ~100,000 | Literary Arabic prose (novels & non-fiction) |
| **Wikipedia (AR)** | ~98,500 | Encyclopedia articles |
| **CBT** | ~69,000 | Classical Arabic books & religious texts |
| **ARCD + XQuAD** | ~2,050 | Arabic QA pairs (rich in question marks `؟`) |
| **Total (raw)** | **~1,441,000** | — |
> All paragraphs were filtered to contain at least one Arabic letter and one target punctuation mark. Non-target punctuation (e.g., `«»`, `…`, parentheses) was removed before training.
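
The filter described above can be sketched in a few lines. This is an illustrative reconstruction, not the project's actual script: the Unicode range used for "Arabic letter", the `NON_TARGET` character set, and the function names are all assumptions.

```python
import re

# The seven target marks listed in this card.
TARGET_MARKS = set(".،؟!:؛-")
# Assumption: characters U+0621..U+064A count as Arabic letters.
ARABIC_LETTER = re.compile(r"[\u0621-\u064A]")
# Assumption: an illustrative set of non-target punctuation to strip.
NON_TARGET = "«»…()[]{}\"“”"
_STRIP_TABLE = str.maketrans("", "", NON_TARGET)

def keep_paragraph(paragraph: str) -> bool:
    """True if the paragraph has an Arabic letter and a target mark."""
    has_letter = ARABIC_LETTER.search(paragraph) is not None
    has_mark = any(ch in TARGET_MARKS for ch in paragraph)
    return has_letter and has_mark

def strip_non_target(paragraph: str) -> str:
    """Remove non-target punctuation and collapse leftover whitespace."""
    cleaned = paragraph.translate(_STRIP_TABLE)
    return " ".join(cleaned.split())
```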
### Punctuation Coverage (raw corpus)
| Mark | Name | Paragraphs | Coverage |
|:---:|---|---:|---:|
| `،` | Arabic comma | 922,721 | 64.0% |
| `:` | Colon | 230,150 | 16.0% |
| `؛` | Arabic semicolon | 128,744 | 8.9% |
| `؟` | Question mark | 50,282 | 3.5% |
| `!` | Exclamation | 15,976 | 1.1% |
| `-` | Dash | ~1 | <0.1% |
### Data Balance Strategy
To prevent the model from ignoring rare punctuation marks, a targeted oversampling strategy was applied:
| Strategy | Marks | Multiplier | Cap |
|---|:---:|:---:|---:|
| Strong oversampling | `؟` `!` | ×8 | 80,000 rows |
| Light oversampling | `؛` `-` | ×6 | 80,000 rows |
After oversampling, the combined training pool grew to **~2.4 million paragraphs**.
### Dataset Splits
| Split | Sequences | Share |
|---|---:|---:|
| Train (capped) | 1,000,000 | 85% |
| Validation | 40,000 | 10% |
| Test | — | 5% |
- **Sliding-window context** (window=3 sentences, stride=2) was applied to training data only
- Validation and test sets remain un-windowed for clean, unbiased evaluation
- Splits were stratified by the rarest punctuation mark in each sequence
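
A minimal sketch of the sliding-window step, assuming each training paragraph has already been split into sentences (the sentence splitter is not described here) and that a trailing window may be shorter than three sentences. The function name is hypothetical.

```python
def sliding_windows(sentences, window=3, stride=2):
    """Group consecutive sentences into overlapping context windows."""
    windows = []
    for start in range(0, len(sentences), stride):
        windows.append(" ".join(sentences[start:start + window]))
        # Stop once a window has reached the final sentence.
        if start + window >= len(sentences):
            break
    return windows

sentences = ["s1.", "s2.", "s3.", "s4.", "s5."]
print(sliding_windows(sentences))
# → ['s1. s2. s3.', 's3. s4. s5.']
```

With a window of 3 and a stride of 2, adjacent windows share exactly one sentence.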
### Preprocessing
- Arabic normalization: alef variants → `ا`, ya variants → `ي`, diacritics stripped
- Label assigned per word = punctuation mark **following** that word
- Multi-subword words: only the first subword receives the label; others are masked (`-100`)
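
The three preprocessing steps above can be sketched as follows. This is a reconstruction under stated assumptions, not the training code itself: the exact character sets for alef variants and diacritics, the `"O"` label for "no punctuation" (taken from the Quick Start snippet), the handling of free-standing or repeated marks, and all function names are illustrative.

```python
import re

TARGET_MARKS = ".،؟!:؛-"
# Assumption: Arabic diacritics U+064B..U+0652 plus superscript alef U+0670.
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")
# Assumption: alef with madda, hamza above, hamza below, and alef wasla.
ALEF_VARIANTS = re.compile(r"[\u0622\u0623\u0625\u0671]")

def normalize(text):
    """Strip diacritics and unify alef and ya variants."""
    text = DIACRITICS.sub("", text)
    text = ALEF_VARIANTS.sub("\u0627", text)   # any variant -> bare alef
    return text.replace("\u0649", "\u064A")    # alef maqsura -> ya

def words_and_labels(text):
    """Split punctuated text into words and the mark following each word."""
    words, labels = [], []
    for token in text.split():
        core = token.rstrip(TARGET_MARKS)
        trailing = token[len(core):]
        if not core:
            # Free-standing mark: attach it to the previous unlabeled word.
            if words and labels[-1] == "O":
                labels[-1] = trailing[0]
            continue
        words.append(core)
        # Assumption: keep only the first of several trailing marks.
        labels.append(trailing[0] if trailing else "O")
    return words, labels

def align_labels(word_ids, word_labels, ignore_index=-100):
    """Label the first subword of each word; mask everything else."""
    aligned, previous = [], None
    for word_id in word_ids:
        if word_id is None or word_id == previous:
            aligned.append(ignore_index)
        else:
            aligned.append(word_labels[word_id])
        previous = word_id
    return aligned

words, labels = words_and_labels(normalize("من اخترع الهاتف، وفي أي سنة؟"))
# Hand-written word_ids in the shape a fast tokenizer returns
# (None = special token). Real training would map labels to integer ids first.
print(align_labels([None, 0, 1, 1, 2, 3, 4, 5, 5, None], labels))
```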
---
## ⚙️ Model Architecture & Training
| Setting | Value |
|---|---|
| Base model | `xlm-roberta-large` (~560M params) |
| Task | Token classification (8 labels) |
| Max sequence length | 384 tokens |
| Training examples | 1,000,000 (capped) |
| Validation examples | 40,000 |
### Two-Phase Training
| Phase | Epochs | LR | Loss | Notes |
|---|:---:|---|---|---|
| Phase 1 | 2 | 2e-5 | Cross-entropy + label smoothing | Full model fine-tuning |
| Phase 2 | 1 | 6e-6 | Focal loss (γ=2.0) + class weights | Bottom 12 layers frozen |
### Class Weights
The class weights of the rare marks were further multiplied by the following boosts:
| Class | Boost |
|:---:|---|
| `؟` | ×1.2 |
| `!` | ×3.0 |
| `؛` | ×2.0 |
| `-` | ×1.3 |
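
To make the Phase 2 objective concrete, the sketch below computes a class-weighted focal loss for a single token in plain Python. It uses the standard focal-loss form, `-w * (1 - p)^γ * log(p)`, with γ=2.0 and the boosts from the table above. The base class weights are not given in this card, so the example assumes a base weight of 1.0 for every class, and the three-label set is purely illustrative. With γ=0 the expression reduces to weighted cross-entropy. A real training loop would use batched tensor operations instead.

```python
import math

GAMMA = 2.0
# Boost factors from the table above; base weights of 1.0 are an assumption.
CLASS_BOOST = {"؟": 1.2, "!": 3.0, "؛": 2.0, "-": 1.3}

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def focal_loss(logits, labels, target, gamma=GAMMA, base_weight=1.0):
    """Weighted focal loss for one token: -w * (1 - p)^gamma * log(p)."""
    probs = softmax(logits)
    p_true = probs[labels.index(target)]
    weight = base_weight * CLASS_BOOST.get(target, 1.0)
    return -weight * (1.0 - p_true) ** gamma * math.log(p_true)

labels = ["O", "،", "!"]  # illustrative subset of the label set
easy = focal_loss([0.0, 0.0, 4.0], labels, "!")  # confident and correct
hard = focal_loss([4.0, 0.0, 0.0], labels, "!")  # confident and wrong
print(easy < hard)  # → True
```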
---
## 🚀 Quick Start
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

repo_id = "MostafaMaroof/Naqta"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForTokenClassification.from_pretrained(repo_id)
model.eval()
id2label = model.config.id2label

text = "بلغت نسبة النمو الاقتصادي 4.7 بالمئة خلال الربع الثالث من عام 2024 وهو اعلى مستوى منذ خمس سنوات"
words = text.split()

# Tokenize pre-split words so each subword can be mapped back to its word.
# Words that fall beyond max_length are truncated and missing from the output.
inputs = tokenizer(
    words,
    is_split_into_words=True,
    return_tensors="pt",
    truncation=True,
    max_length=384,
)

with torch.no_grad():
    logits = model(**inputs).logits

pred_ids = logits.argmax(dim=-1)[0].tolist()
word_ids = inputs.word_ids(batch_index=0)

restored_words = []
previous_word_id = None
for label_id, word_id in zip(pred_ids, word_ids):
    # Skip special tokens and every subword after the first one of a word.
    if word_id is None or word_id == previous_word_id:
        continue
    word = words[word_id]
    label = id2label[label_id]
    # "O" means no punctuation follows this word.
    if label != "O":
        word = word + label
    restored_words.append(word)
    previous_word_id = word_id

restored_text = " ".join(restored_words)
print(restored_text)
# → بلغت نسبة النمو الاقتصادي 4.7 بالمئة خلال الربع الثالث من عام 2024، وهو اعلى مستوى منذ خمس سنوات.
```
---
## 📖 Example
**Input** (unpunctuated):
```
اذا اردت ان تنجح في حياتك فعليك ان تحدد اهدافك واضحة وان تعمل بجد واستمرارية ولا تيأس عند اول عقبة تواجهها
```
**Output** (restored):
```
اذا اردت ان تنجح في حياتك، فعليك ان تحدد اهدافك واضحة، وان تعمل بجد واستمرارية، ولا تيأس عند اول عقبة تواجهها.
```
**Question example** (input, then restored output):
```
من اخترع الهاتف وفي اي سنة تم ذلك وما هي اهمية هذا الاختراع
```
```
من اخترع الهاتف، وفي اي سنة تم ذلك، وما هي اهمية هذا الاختراع؟
```
---
## 🎯 Intended Use
Naqta is well-suited for:
- 🎙️ **ASR post-processing** — restoring punctuation in Arabic speech transcripts
- 📄 **Readability enhancement** — making raw Arabic text easier to read
- 🔧 **NLP preprocessing** — improving text quality for downstream Arabic NLP tasks
- 🔬 **Research** — Arabic punctuation restoration benchmark evaluation
---
## ⚠️ Limitations
- Punctuation restoration is partly stylistic — multiple valid outputs may exist for a single input.
- Performance may degrade on highly dialectal, technical, or domain-specific text.
- The model does not predict quotation marks or dialogue markers (`«»`).
- Very short or fragmented text (< 5 words) may produce less reliable results.
- The model only inserts punctuation marks; it does not correct grammar.
---
## 📜 License
This model is released under the **MIT License**.
---
## 🔗 Citation
If you use Naqta in your work, please cite it as:
```bibtex
@misc{naqta2025,
title = {Naqta: Arabic Punctuation Restoration with XLM-RoBERTa},
author = {MostafaMaroof},
year = {2025},
publisher = {Hugging Face},
url = {https://huggingface.co/MostafaMaroof/Naqta}
}
```