Update README.md
README.md
|
@@ -21,96 +21,162 @@ model-index:
|
|
| 21 |
type: custom
|
| 22 |
metrics:
|
| 23 |
- type: f1
|
| 24 |
-
value: 0.
|
| 25 |
name: Validation Macro F1
|
| 26 |
- type: accuracy
|
| 27 |
-
value: 0.
|
| 28 |
name: Validation Accuracy
|
| 29 |
---
|
| 30 |
|
| 31 |
-
|
| 32 |
|
| 33 |
-
|
| 34 |
|
| 35 |
-
|
| 36 |
|
| 37 |
-
|
| 38 |
-
|
| 39 |
-
|
| 40 |
-
|
| 41 |
-
|
| 42 |
-
| `؟` | Arabic question mark |
|
| 43 |
-
| `!` | Exclamation mark |
|
| 44 |
-
| `:` | Colon |
|
| 45 |
-
| `؛` | Arabic semicolon |
|
| 46 |
-
| `-` | Dash |
|
| 47 |
-
|
| 48 |
-
## Model Details
|
| 49 |
|
| 50 |
-
|
| 51 |
-
- **Task:** Arabic punctuation restoration
|
| 52 |
-
- **Architecture:** XLM-RoBERTa Large for token classification
|
| 53 |
-
- **Base model:** `xlm-roberta-large`
|
| 54 |
-
- **Maximum sequence length:** 384 tokens
|
| 55 |
-
- **Training objective:** token-level punctuation classification
|
| 56 |
-
- **Loss:** weighted focal loss during fine-tuning
|
| 57 |
-
- **Focal gamma:** 2.0
|
| 58 |
|
| 59 |
-
|
| 60 |
|
| 61 |
-
Naqta
|
| 62 |
|
| 63 |
-
|
| 64 |
|
| 65 |
-
|
| 66 |
-
|---|---|
|
| 67 |
-
| Phase 1 | General token-classification training for 2 epochs |
|
| 68 |
-
| Phase 2 | Focal-loss fine-tuning for 2 epochs with lower encoder layers frozen |
|
| 69 |
|
| 70 |
-
##
|
| 71 |
|
| 72 |
-
|
| 73 |
|
| 74 |
-
-
|
| 75 |
-
- Light rare marks: `؛`, `-`
|
| 76 |
-
- Sliding-window context was applied to training data only
|
| 77 |
-
- Validation and test data remained unwindowed to avoid leakage
|
| 78 |
|
| 79 |
-
##
|
| 80 |
|
| 81 |
-
|
| 82 |
|
| 83 |
| Metric | Score |
|
| 84 |
|---|---:|
|
| 85 |
-
| **Macro F1** | **0.
|
| 86 |
-
| Accuracy | 0.
|
| 87 |
|
| 88 |
-
### Per-Class
|
| 89 |
|
| 90 |
-
| Class | F1 |
|
| 91 |
-
|---|---:|
|
| 92 |
-
| `!` | 0.
|
| 93 |
-
| `؛` | 0.
|
| 94 |
-
| `؟` | 0.
|
| 95 |
-
| `-` | 0.
|
| 96 |
-
| `،` | 0.
|
| 97 |
-
| `.` | 0.
|
| 98 |
|
| 99 |
-
|
| 100 |
|
| 101 |
-
|
| 102 |
|
| 103 |
-
|
| 104 |
-
اذا اردت ان تنجح في حياتك فعليك ان تفتح اهدافك واضحة وان تعمل بجد واستمرارية ولا تيأس عند اول عقبة تواجهها
|
| 105 |
-
```
|
| 106 |
|
| 107 |
-
|
| 108 |
|
| 109 |
-
|
| 110 |
-
|
| 111 |
-
|
| 112 |
|
| 113 |
-
|
| 114 |
|
| 115 |
```python
|
| 116 |
from transformers import AutoTokenizer, AutoModelForTokenClassification
|
|
@@ -146,7 +212,6 @@ previous_word_id = None
|
|
| 146 |
for token_id, word_id in zip(pred_ids, word_ids):
|
| 147 |
if word_id is None or word_id == previous_word_id:
|
| 148 |
continue
|
| 149 |
-
|
| 150 |
word = words[word_id]
|
| 151 |
label = id2label[token_id]
|
| 152 |
if label != "O":
|
|
@@ -156,37 +221,70 @@ for token_id, word_id in zip(pred_ids, word_ids):
|
|
| 156 |
|
| 157 |
restored_text = " ".join(restored_words)
|
| 158 |
print(restored_text)
|
|
|
|
| 159 |
```
|
| 160 |
|
| 161 |
-
|
| 162 |
|
| 163 |
-
|
| 164 |
|
| 165 |
-
|
| 166 |
-
|
| 167 |
-
|
| 168 |
-
|
| 169 |
|
| 170 |
-
|
| 171 |
|
| 172 |
-
|
| 173 |
-
|
| 174 |
-
|
| 175 |
-
|
| 176 |
-
|
| 177 |
|
| 178 |
-
##
|
| 179 |
|
| 180 |
-
|
| 181 |
|
| 182 |
-
##
|
| 183 |
|
| 184 |
-
|
| 185 |
|
| 186 |
-
## Citation
|
| 187 |
|
| 188 |
-
If you use
|
| 189 |
|
| 190 |
-
```
|
| 191 |
-
|
| 192 |
```
|
|
|
|
| 21 |
type: custom
|
| 22 |
metrics:
|
| 23 |
- type: f1
|
| 24 |
+
value: 0.8960
|
| 25 |
name: Validation Macro F1
|
| 26 |
- type: accuracy
|
| 27 |
+
value: 0.9714
|
| 28 |
name: Validation Accuracy
|
| 29 |
---
|
| 30 |
|
| 31 |
+
<div align="center">
|
| 32 |
|
| 33 |
+
# 🔤 Naqta — نقطة
|
| 34 |
|
| 35 |
+
### Arabic Punctuation Restoration
|
| 36 |
|
|
| 42 |
|
| 43 |
+
</div>
|
| 44 |
|
| 45 |
+
---
|
| 46 |
|
| 47 |
+
**Naqta** (Arabic: نقطة, "dot/period") is an Arabic punctuation restoration model. Given unpunctuated Arabic text, it predicts which punctuation mark, if any, follows each word, using token classification on top of **XLM-RoBERTa Large**. It reaches a validation macro F1 of 0.8960.
|
| 48 |
|
| 49 |
+
> 💡 **Try it live** on the [Hugging Face Space](https://huggingface.co/spaces/MostafaMaroof/Naqta)
|
| 50 |
|
| 51 |
+
---
|
| 52 |
|
| 53 |
+
## ✨ What Does It Restore?
|
| 54 |
|
| 55 |
+
| Symbol | Name | Example |
|
| 56 |
+
|:---:|---|---|
|
| 57 |
+
| `.` | Period | نهاية الجملة |
|
| 58 |
+
| `،` | Arabic comma | فاصلة عربية |
|
| 59 |
+
| `؟` | Arabic question mark | علامة استفهام |
|
| 60 |
+
| `!` | Exclamation mark | علامة تعجب |
|
| 61 |
+
| `:` | Colon | نقطتان |
|
| 62 |
+
| `؛` | Arabic semicolon | فاصلة منقوطة |
|
| 63 |
+
| `-` | Dash | شرطة |
|
| 64 |
|
| 65 |
+
---
|
| 66 |
|
| 67 |
+
## 🏆 Results
|
| 68 |
|
| 69 |
+
### Validation Metrics (v11d — Final)
|
| 70 |
|
| 71 |
| Metric | Score |
|
| 72 |
|---|---:|
|
| 73 |
+
| 🎯 **Macro F1** | **0.8960** |
|
| 74 |
+
| ✅ Accuracy | 0.9714 |
|
| 75 |
|
| 76 |
+
### Per-Class F1 Score
|
| 77 |
|
| 78 |
+
| Class | Symbol | F1 | Performance |
|
| 79 |
+
|---|:---:|---:|---|
|
| 80 |
+
| Exclamation | `!` | 0.8897 | 🟢 Excellent |
|
| 81 |
+
| Arabic semicolon | `؛` | 0.8042 | 🟢 Excellent |
|
| 82 |
+
| Question mark | `؟` | 0.9665 | 🟢 Excellent |
|
| 83 |
+
| Dash | `-` | 0.9007 | 🟢 Excellent |
|
| 84 |
+
| Arabic comma | `،` | 0.8100 | 🟢 Excellent |
|
| 85 |
+
| Period | `.` | 0.8968 | 🟢 Excellent |
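
Macro F1 is the unweighted mean of the per-class F1 scores. The sketch below averages only the six classes listed in the table above. The model has 8 labels, so the reported 0.8960 presumably also covers classes that are not listed here; that reading is an assumption, since their scores are not shown.

```python
# Per-class F1 values copied from the table above (six of the model's 8 labels).
per_class_f1 = {
    "!": 0.8897,
    "؛": 0.8042,
    "؟": 0.9665,
    "-": 0.9007,
    "،": 0.8100,
    ".": 0.8968,
}


def macro_f1(scores):
    """Return the unweighted mean of the per-class F1 scores."""
    return sum(scores.values()) / len(scores)


listed_mean = macro_f1(per_class_f1)
print(f"{listed_mean:.4f}")  # mean over the six listed classes only
```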
|
| 86 |
|
| 87 |
+
---
|
| 88 |
|
| 89 |
+
## 🗂️ Training Data
|
| 90 |
|
| 91 |
+
The model was trained on a large multi-source Arabic corpus totaling over **1.4 million paragraphs** from six diverse sources, covering a broad range of Arabic writing styles and domains.
|
| 92 |
|
| 93 |
+
### Corpus Sources
|
| 94 |
|
| 95 |
+
| Source | Rows | Domain |
|
| 96 |
+
|---|---:|---|
|
| 97 |
+
| **ABC / UNPC** | ~1,020,000 | News & formal Arabic (United Nations Parallel Corpus) |
|
| 98 |
+
| **HF Tashkeel** | ~151,000 | Vocalized Arabic text (diacritized corpus) |
|
| 99 |
+
| **Hindawi E-Books** | ~100,000 | Literary Arabic prose (novels & non-fiction) |
|
| 100 |
+
| **Wikipedia (AR)** | ~98,500 | Encyclopedia articles |
|
| 101 |
+
| **CBT** | ~69,000 | Classical Arabic books & religious texts |
|
| 102 |
+
| **ARCD + XQuAD** | ~2,050 | Arabic QA pairs (rich in question marks `؟`) |
|
| 103 |
+
| **Total (raw)** | **~1,441,000** | — |
|
| 104 |
+
|
| 105 |
+
> All paragraphs were filtered to contain at least one Arabic letter and one target punctuation mark. Non-target punctuation (e.g., `«»`, `…`, parentheses) was removed before training.
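
The filter described above could look roughly like the sketch below. The function names, the Unicode range used for "Arabic letter", and the exact set of non-target characters are assumptions made for illustration; the original filtering script is not part of this card.

```python
import re

# Target marks, taken from the "What Does It Restore?" table.
TARGET_MARKS = ".،؟!:؛-"
# Assumption: "Arabic letter" means the basic letter range U+0621..U+064A.
ARABIC_LETTER = re.compile(r"[\u0621-\u064A]")
# Assumption: a small illustrative set of non-target punctuation.
NON_TARGET = re.compile(r"[«»…()\[\]\"]")


def keep_paragraph(text: str) -> bool:
    """Keep paragraphs with at least one Arabic letter and one target mark."""
    has_letter = ARABIC_LETTER.search(text) is not None
    has_mark = any(ch in TARGET_MARKS for ch in text)
    return has_letter and has_mark


def strip_non_target(text: str) -> str:
    """Remove non-target punctuation and collapse leftover whitespace."""
    return " ".join(NON_TARGET.sub(" ", text).split())
```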
|
| 106 |
+
|
| 107 |
+
### Punctuation Coverage (raw corpus)
|
| 108 |
+
|
| 109 |
+
| Mark | Name | Paragraphs | Coverage |
|
| 110 |
+
|:---:|---|---:|---:|
|
| 111 |
+
| `،` | Arabic comma | 922,721 | 64.0% |
|
| 112 |
+
| `:` | Colon | 230,150 | 16.0% |
|
| 113 |
+
| `؛` | Arabic semicolon | 128,744 | 8.9% |
|
| 114 |
+
| `؟` | Question mark | 50,282 | 3.5% |
|
| 115 |
+
| `!` | Exclamation | 15,976 | 1.1% |
|
| 116 |
+
| `-` | Dash | ~1 | <0.1% |
|
| 117 |
+
|
| 118 |
+
### Data Balance Strategy
|
| 119 |
+
|
| 120 |
+
To prevent the model from ignoring rare punctuation marks, a targeted oversampling strategy was applied:
|
| 121 |
+
|
| 122 |
+
| Strategy | Marks | Multiplier | Cap |
|
| 123 |
+
|---|:---:|:---:|---:|
|
| 124 |
+
| Strong oversampling | `؟` `!` | ×8 | 80,000 rows |
|
| 125 |
+
| Light oversampling | `؛` `-` | ×6 | 80,000 rows |
|
| 126 |
+
|
| 127 |
+
After oversampling, the combined training pool grew to **~2.4 million paragraphs**.
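
A minimal sketch of the oversampling step, under two assumptions that the card does not confirm: "×N" means N total copies of each matching paragraph, and the cap limits how many distinct source paragraphs are duplicated per mark. The helper name `oversample` is hypothetical.

```python
import random

# Multipliers and cap copied from the table above.
OVERSAMPLE = {"؟": 8, "!": 8, "؛": 6, "-": 6}
CAP = 80_000


def oversample(paragraphs, mark, seed=0):
    """Return the extra copies to add for paragraphs containing `mark`.

    Assumptions: the cap bounds the number of source paragraphs, and the
    multiplier counts the original, so N - 1 extra copies are returned.
    """
    matching = [p for p in paragraphs if mark in p]
    if len(matching) > CAP:
        matching = random.Random(seed).sample(matching, CAP)
    return matching * (OVERSAMPLE[mark] - 1)
```

The returned list is meant to be appended to the original pool, so a paragraph with `؟` ends up appearing eight times in total.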
|
| 128 |
+
|
| 129 |
+
### Dataset Splits
|
| 130 |
+
|
| 131 |
+
| Split | Sequences | Share |
|
| 132 |
+
|---|---:|---:|
|
| 133 |
+
| Train (capped) | 1,000,000 | 85% |
|
| 134 |
+
| Validation | 40,000 | 10% |
|
| 135 |
+
| Test | — | 5% |
|
| 136 |
+
|
| 137 |
+
- **Sliding-window context** (window=3 sentences, stride=2) was applied to training data only
|
| 138 |
+
- Validation and test sets remain un-windowed for clean, unbiased evaluation
|
| 139 |
+
- Splits were stratified by the rarest punctuation mark in each sequence
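
The windowing and stratification rules above can be sketched as follows. This is an illustration rather than the training script: the function names are hypothetical, and the rarity order is inferred from the coverage table, with the period placed last because it is not listed there.

```python
def sliding_windows(sentences, window=3, stride=2):
    """Group consecutive sentences into overlapping windows."""
    if len(sentences) <= window:
        return [" ".join(sentences)] if sentences else []
    windows = []
    for start in range(0, len(sentences), stride):
        windows.append(" ".join(sentences[start:start + window]))
        if start + window >= len(sentences):
            break  # the last sentence has been covered
    return windows


# Assumption: rarest mark first, inferred from the coverage table.
RARITY = ["-", "!", "؟", "؛", ":", "،", "."]


def stratum(text):
    """Return the rarest target mark in `text`, used as the stratification key."""
    for mark in RARITY:
        if mark in text:
            return mark
    return "O"  # no target mark present
```

With `window=3` and `stride=2`, neighbouring windows share one sentence, which gives the model some cross-sentence context during training.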
|
| 140 |
+
|
| 141 |
+
### Preprocessing
|
| 142 |
+
|
| 143 |
+
- Arabic normalization: alef variants → `ا`, ya variants → `ي`, diacritics stripped
|
| 144 |
+
- Each word's label is the punctuation mark that **follows** it (`O` when no mark follows)
|
| 145 |
+
- Multi-subword words: only the first subword receives the label; others are masked (`-100`)
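
The three preprocessing steps above can be sketched in plain Python. The helper names are hypothetical, the diacritics range (U+064B..U+0652) and the treatment of `ى` as the "ya variant" are assumptions, and `word_ids` stands in for the list a fast tokenizer returns from `word_ids()`.

```python
import re

TARGET_MARKS = ".،؟!:؛-"
DIACRITICS = re.compile(r"[\u064B-\u0652]")        # fathatan .. sukun (assumed range)
ALEF_VARIANTS = re.compile("[\u0623\u0625\u0622]")  # alef with hamza above/below, madda


def normalize(text):
    """Strip diacritics and unify alef and ya variants."""
    text = DIACRITICS.sub("", text)
    text = ALEF_VARIANTS.sub("\u0627", text)   # -> bare alef
    return text.replace("\u0649", "\u064A")    # alef maqsura -> ya


def words_and_labels(text):
    """Split punctuated text into words and the mark following each word."""
    words, labels = [], []
    for token in normalize(text).split():
        mark = "O"
        # A trailing target mark becomes the label; a lone mark stays a word.
        if len(token) > 1 and token[-1] in TARGET_MARKS:
            token, mark = token[:-1], token[-1]
        words.append(token)
        labels.append(mark)
    return words, labels


def align_labels(word_ids, label_ids):
    """Label only the first subword of each word; mask everything else with -100."""
    aligned, previous = [], None
    for wid in word_ids:
        if wid is None or wid == previous:
            aligned.append(-100)   # special token or continuation subword
        else:
            aligned.append(label_ids[wid])
        previous = wid
    return aligned
```

The value `-100` is the index that PyTorch's cross-entropy loss ignores by default, so masked subwords do not contribute to the loss.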
|
| 146 |
+
|
| 147 |
+
---
|
| 148 |
+
|
| 149 |
+
## ⚙️ Model Architecture & Training
|
| 150 |
+
|
| 151 |
+
| Setting | Value |
|
| 152 |
+
|---|---|
|
| 153 |
+
| Base model | `xlm-roberta-large` (~560M params) |
|
| 154 |
+
| Task | Token classification (8 labels) |
|
| 155 |
+
| Max sequence length | 384 tokens |
|
| 156 |
+
| Training examples | 1,000,000 (capped) |
|
| 157 |
+
| Validation examples | 40,000 |
|
| 158 |
+
|
| 159 |
+
### Two-Phase Training
|
| 160 |
|
| 161 |
+
| Phase | Epochs | LR | Loss | Notes |
|
| 162 |
+
|---|:---:|---|---|---|
|
| 163 |
+
| Phase 1 | 2 | 2e-5 | Cross-entropy + label smoothing | Full model fine-tuning |
|
| 164 |
+
| Phase 2 | 1 | 6e-6 | Focal loss (γ=2.0) + class weights | Bottom 12 layers frozen |
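
For reference, the Phase 2 loss can be illustrated with the standard weighted focal loss for a single token, `-w_t * (1 - p_t)^γ * log(p_t)`. This is a plain-Python sketch of the general formula, not the training code; the function names are hypothetical.

```python
import math


def cross_entropy(probs, target):
    """Plain cross-entropy for one token: -log(p_t)."""
    return -math.log(probs[target])


def focal_loss(probs, target, class_weights, gamma=2.0):
    """Weighted focal loss for one token: -w_t * (1 - p_t)^gamma * log(p_t)."""
    p_t = probs[target]
    return -class_weights[target] * (1.0 - p_t) ** gamma * math.log(p_t)


# A confident, correct prediction is down-weighted far more than an uncertain one.
easy = focal_loss([0.9, 0.1], target=0, class_weights=[1.0, 1.0])
hard = focal_loss([0.3, 0.7], target=0, class_weights=[1.0, 1.0])
```

With `gamma=0` and unit weights the expression reduces to cross-entropy; larger `gamma` shifts the training signal toward tokens the model still gets wrong, which tend to be the rare marks.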
|
| 165 |
+
|
| 166 |
+
### Class Weights
|
| 167 |
+
|
| 168 |
+
Class weights for the rare marks were additionally multiplied by the following factors:
|
| 169 |
+
|
| 170 |
+
| Class | Boost |
|
| 171 |
+
|:---:|---|
|
| 172 |
+
| `؟` | ×1.2 |
|
| 173 |
+
| `!` | ×3.0 |
|
| 174 |
+
| `؛` | ×2.0 |
|
| 175 |
+
| `-` | ×1.3 |
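
One way the boosts might be applied is sketched below. The card does not say how the base weights are computed, so the balanced inverse-frequency formula `total / (n_classes * count)` is an assumption, as is the helper name `boosted_weights`.

```python
# Boost factors copied from the table above.
BOOST = {"؟": 1.2, "!": 3.0, "؛": 2.0, "-": 1.3}


def boosted_weights(label_counts):
    """Return per-label weights: assumed inverse-frequency base times the boost."""
    total = sum(label_counts.values())
    n_classes = len(label_counts)
    base = {lab: total / (n_classes * n) for lab, n in label_counts.items()}
    # Labels without an entry in BOOST keep their base weight.
    return {lab: w * BOOST.get(lab, 1.0) for lab, w in base.items()}
```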
|
| 176 |
+
|
| 177 |
+
---
|
| 178 |
+
|
| 179 |
+
## 🚀 Quick Start
|
| 180 |
|
| 181 |
```python
|
| 182 |
from transformers import AutoTokenizer, AutoModelForTokenClassification
|
|
|
|
| 212 |
for token_id, word_id in zip(pred_ids, word_ids):
|
| 213 |
if word_id is None or word_id == previous_word_id:
|
| 214 |
continue
|
|
|
|
| 215 |
word = words[word_id]
|
| 216 |
label = id2label[token_id]
|
| 217 |
if label != "O":
|
|
|
|
| 221 |
|
| 222 |
restored_text = " ".join(restored_words)
|
| 223 |
print(restored_text)
|
| 224 |
+
# → بلغت نسبة النمو الاقتصادي 4.7 بالمئة خلال الربع الثالث من عام 2024، وهو اعلى مستوى منذ خمس سنوات.
|
| 225 |
```
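
The diff above shows only part of the Quick Start script. The word-level merge step can be sketched on its own as below, using hand-written predictions instead of a model call. It assumes that the label strings are the punctuation marks themselves, which the visible code suggests but does not confirm; `restore` is a hypothetical helper name.

```python
def restore(words, word_ids, pred_ids, id2label):
    """Attach the predicted mark to each word, using first-subword predictions only."""
    restored, previous = [], None
    for token_id, word_id in zip(pred_ids, word_ids):
        if word_id is None or word_id == previous:
            continue  # skip special tokens and continuation subwords
        previous = word_id
        label = id2label[token_id]
        restored.append(words[word_id] + (label if label != "O" else ""))
    return " ".join(restored)


# Toy inputs: the second word is split into two subwords.
words = ["من", "اخترع", "الهاتف"]
word_ids = [None, 0, 1, 1, 2, None]
pred_ids = [0, 0, 0, 2, 1, 0]
id2label = {0: "O", 1: "؟", 2: "،"}
print(restore(words, word_ids, pred_ids, id2label))
```

The prediction on the second subword of `اخترع` is ignored, which mirrors the first-subword labeling used during training.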
|
| 226 |
|
| 227 |
+
---
|
| 228 |
|
| 229 |
+
## 📖 Example
|
| 230 |
|
| 231 |
+
**Input** (unpunctuated):
|
| 232 |
+
```
|
| 233 |
+
اذا اردت ان تنجح في حياتك فعليك ان تحدد اهدافك واضحة وان تعمل بجد واستمرارية ولا تيأس عند اول عقبة تواجهها
|
| 234 |
+
```
|
| 235 |
|
| 236 |
+
**Output** (restored):
|
| 237 |
+
```
|
| 238 |
+
اذا اردت ان تنجح في حياتك، فعليك ان تحدد اهدافك واضحة، وان تعمل بجد واستمرارية، ولا تيأس عند اول عقبة تواجهها.
|
| 239 |
+
```
|
| 240 |
|
| 241 |
+
**Question example:**
|
| 242 |
+
```
|
| 243 |
+
من اخترع الهاتف وفي اي سنة تم ذلك وما هي اهمية هذا الاختراع
|
| 244 |
+
```
|
| 245 |
+
```
|
| 246 |
+
من اخترع الهاتف، وفي اي سنة تم ذلك، وما هي اهمية هذا الاختراع؟
|
| 247 |
+
```
|
| 248 |
+
|
| 249 |
+
---
|
| 250 |
|
| 251 |
+
## 🎯 Intended Use
|
| 252 |
|
| 253 |
+
Naqta is well-suited for:
|
| 254 |
+
|
| 255 |
+
- 🎙️ **ASR post-processing** — restoring punctuation in Arabic speech transcripts
|
| 256 |
+
- 📄 **Readability enhancement** — making raw Arabic text easier to read
|
| 257 |
+
- 🔧 **NLP preprocessing** — improving text quality for downstream Arabic NLP tasks
|
| 258 |
+
- 🔬 **Research** — Arabic punctuation restoration benchmark evaluation
|
| 259 |
+
|
| 260 |
+
---
|
| 261 |
|
| 262 |
+
## ⚠️ Limitations
|
| 263 |
|
| 264 |
+
- Punctuation restoration is partly stylistic — multiple valid outputs may exist for a single input.
|
| 265 |
+
- Performance may degrade on highly dialectal, technical, or domain-specific text.
|
| 266 |
+
- The model does not predict quotation marks or dialogue markers (`«»`).
|
| 267 |
+
- Very short or fragmented text (< 5 words) may produce less reliable results.
|
| 268 |
+
- The model only predicts which punctuation mark follows each word; it does not perform grammar correction.
|
| 269 |
+
|
| 270 |
+
---
|
| 271 |
+
|
| 272 |
+
## 📜 License
|
| 273 |
+
|
| 274 |
+
This model is released under the **MIT License**.
|
| 275 |
+
|
| 276 |
+
---
|
| 277 |
|
| 278 |
+
## 🔗 Citation
|
| 279 |
|
| 280 |
+
If you use Naqta in your work, please reference:
|
| 281 |
|
| 282 |
+
```bibtex
|
| 283 |
+
@misc{naqta2025,
|
| 284 |
+
title = {Naqta: Arabic Punctuation Restoration with XLM-RoBERTa},
|
| 285 |
+
author = {MostafaMaroof},
|
| 286 |
+
year = {2025},
|
| 287 |
+
publisher = {Hugging Face},
|
| 288 |
+
url = {https://huggingface.co/MostafaMaroof/Naqta}
|
| 289 |
+
}
|
| 290 |
```
|