MostafaMaroof committed on
Commit
3ce58ba
·
verified ·
1 Parent(s): 462bbda

Update README.md

Files changed (1)
  1. README.md +180 -82
README.md CHANGED
@@ -21,96 +21,162 @@ model-index:
21
  type: custom
22
  metrics:
23
  - type: f1
24
- value: 0.8176
25
  name: Validation Macro F1
26
  - type: accuracy
27
- value: 0.9589
28
  name: Validation Accuracy
29
  ---
30
 
31
- # Naqta
32
 
33
- **Naqta** is an Arabic punctuation restoration model. It predicts missing punctuation marks in unpunctuated Arabic text using token-level sequence classification.
34
 
35
- The model is designed to restore the following punctuation marks:
36
 
37
- | Label | Meaning |
38
- |---|---|
39
- | `O` | No punctuation |
40
- | `.` | Period |
41
- | `،` | Arabic comma |
42
- | `؟` | Arabic question mark |
43
- | `!` | Exclamation mark |
44
- | `:` | Colon |
45
- | `؛` | Arabic semicolon |
46
- | `-` | Dash |
47
-
48
- ## Model Details
49
 
50
- - **Model name:** Naqta
51
- - **Task:** Arabic punctuation restoration
52
- - **Architecture:** XLM-RoBERTa Large for token classification
53
- - **Base model:** `xlm-roberta-large`
54
- - **Maximum sequence length:** 384 tokens
55
- - **Training objective:** token-level punctuation classification
56
- - **Loss:** weighted focal loss during fine-tuning
57
- - **Focal gamma:** 2.0
58
 
59
- ## Training Summary
60
 
61
- Naqta was trained on a mixed Arabic corpus built from multiple sources, including books, Arabic corpora, Wikipedia-style text, and question-answering data. The training pipeline used sliding-window context, class balancing, rare punctuation oversampling, and a two-phase training strategy.
62
 
63
- ### Training Strategy
64
 
65
- | Phase | Description |
66
- |---|---|
67
- | Phase 1 | General token-classification training for 2 epochs |
68
- | Phase 2 | Focal-loss fine-tuning for 2 epochs with lower encoder layers frozen |
69
 
70
- ### Data Balancing
71
 
72
- The final training setup used stronger sampling for rare punctuation marks:
73
 
74
- - Strong rare marks: `؟`, `!`
75
- - Light rare marks: `؛`, `-`
76
- - Sliding-window context was applied to training data only
77
- - Validation and test data remained unwindowed to avoid leakage
78
 
79
- ## Validation Results
80
 
81
- Final best validation result:
82
 
83
  | Metric | Score |
84
  |---|---:|
85
- | **Macro F1** | **0.8176** |
86
- | Accuracy | 0.9589 |
87
 
88
- ### Per-Class Validation F1
89
 
90
- | Class | F1 |
91
- |---|---:|
92
- | `!` | 0.6512 |
93
- | `؛` | 0.7180 |
94
- | `؟` | 0.9066 |
95
- | `-` | 0.8562 |
96
- | `،` | 0.7422 |
97
- | `.` | 0.8030 |
98
 
99
- ## Example
100
 
101
- Input:
102
 
103
- ```text
104
- اذا اردت ان تنجح في حياتك فعليك ان تفتح اهدافك واضحة وان تعمل بجد واستمرارية ولا تيأس عند اول عقبة تواجهها
105
- ```
106
 
107
- Possible output:
108
 
109
- ```text
110
- اذا اردت ان تنجح في حياتك، فعليك ان تفتح اهدافك واضحة، وان تعمل بجد واستمرارية، ولا تيأس عند اول عقبة تواجهها.
111
- ```
112
 
113
- ## Usage
114
 
115
  ```python
116
  from transformers import AutoTokenizer, AutoModelForTokenClassification
@@ -146,7 +212,6 @@ previous_word_id = None
146
  for token_id, word_id in zip(pred_ids, word_ids):
147
  if word_id is None or word_id == previous_word_id:
148
  continue
149
-
150
  word = words[word_id]
151
  label = id2label[token_id]
152
  if label != "O":
@@ -156,37 +221,70 @@ for token_id, word_id in zip(pred_ids, word_ids):
156
 
157
  restored_text = " ".join(restored_words)
158
  print(restored_text)
 
159
  ```
160
 
161
- ## Intended Use
162
 
163
- Naqta can be used for:
164
 
165
- - Restoring punctuation in Arabic ASR transcripts
166
- - Improving readability of unpunctuated Arabic text
167
- - Preprocessing Arabic text for downstream NLP tasks
168
- - Educational or research applications involving Arabic punctuation
169
 
170
- ## Limitations
171
 
172
- - Punctuation restoration is partly stylistic, so multiple outputs may be valid.
173
- - The model may over-insert commas in long literary or formal sentences.
174
- - Very short or fragmented text may produce less reliable punctuation.
175
- - Domain-specific text, such as legal, medical, or highly dialectal content, may require additional fine-tuning.
176
- - The model predicts punctuation after words and does not perform full grammar correction.
177
 
178
- ## Training Notes
179
 
180
- The model was optimized to improve rare punctuation classes, especially `!`, `؟`, `؛`, and `-`. The final configuration achieved a validation Macro F1 above 0.81, with especially strong performance on question marks and dashes.
181
 
182
- ## License
183
 
184
- This model is released under the MIT License.
185
 
186
- ## Citation
187
 
188
- If you use this model, please cite or reference the Hugging Face repository:
189
 
190
- ```text
191
- MostafaMaroof/Naqta
192
  ```
 
21
  type: custom
22
  metrics:
23
  - type: f1
24
+ value: 0.8960
25
  name: Validation Macro F1
26
  - type: accuracy
27
+ value: 0.9714
28
  name: Validation Accuracy
29
  ---
30
 
31
+ <div align="center">
32
 
33
+ # 🔤 Naqta نقطة
34
 
35
+ ### Arabic Punctuation Restoration
36
 
37
+ [![Model](https://img.shields.io/badge/🤗%20Model-MostafaMaroof%2FNaqta-blue)](https://huggingface.co/MostafaMaroof/Naqta)
38
+ [![Language](https://img.shields.io/badge/Language-Arabic-green)](https://huggingface.co/MostafaMaroof/Naqta)
39
+ [![Task](https://img.shields.io/badge/Task-Token%20Classification-orange)](https://huggingface.co/MostafaMaroof/Naqta)
40
+ [![License](https://img.shields.io/badge/License-MIT-yellow)](https://opensource.org/licenses/MIT)
41
+ [![Macro F1](https://img.shields.io/badge/Macro%20F1-89.6%25-brightgreen)](https://huggingface.co/MostafaMaroof/Naqta)
42
 
43
+ </div>
44
 
45
+ ---
46
 
47
+ **Naqta** (Arabic: نقطة, "dot/period") is a high-performance Arabic punctuation restoration model. Given plain unpunctuated Arabic text, it predicts the correct punctuation mark after each word using token-level sequence classification on top of **XLM-RoBERTa Large**.
48
 
49
+ > 💡 **Try it live** on the [Hugging Face Space](https://huggingface.co/spaces/MostafaMaroof/Naqta)
50
 
51
+ ---
52
 
53
+ ## What Does It Restore?
54
 
55
+ | Symbol | Name | Example |
56
+ |:---:|---|---|
57
+ | `.` | Period | نهاية الجملة |
58
+ | `،` | Arabic comma | فاصلة عربية |
59
+ | `؟` | Arabic question mark | علامة استفهام |
60
+ | `!` | Exclamation mark | علامة تعجب |
61
+ | `:` | Colon | نقطتان |
62
+ | `؛` | Arabic semicolon | فاصلة منقوطة |
63
+ | `-` | Dash | شرطة |
64
 
65
+ ---
66
 
67
+ ## 🏆 Results
68
 
69
+ ### Validation Metrics (v11d — Final)
70
 
71
  | Metric | Score |
72
  |---|---:|
73
+ | 🎯 **Macro F1** | **0.8960** |
74
+ | Accuracy | 0.9714 |
75
 
76
+ ### Per-Class F1 Score
77
 
78
+ | Class | Symbol | F1 | Performance |
79
+ |---|:---:|---:|---|
80
+ | Exclamation | `!` | 0.8897 | 🟢 Excellent |
81
+ | Arabic semicolon | `؛` | 0.8042 | 🟢 Excellent |
82
+ | Question mark | `؟` | 0.9665 | 🟢 Excellent |
83
+ | Dash | `-` | 0.9007 | 🟢 Excellent |
84
+ | Arabic comma | `،` | 0.8100 | 🟢 Excellent |
85
+ | Period | `.` | 0.8968 | 🟢 Excellent |
86
 
87
+ ---
88
 
89
+ ## 🗂️ Training Data
90
 
91
+ The model was trained on a large multi-source Arabic corpus totaling over **1.4 million paragraphs** from six diverse sources, covering a broad range of Arabic writing styles and domains.
92
 
93
+ ### Corpus Sources
94
 
95
+ | Source | Rows | Domain |
96
+ |---|---:|---|
97
+ | **ABC / UNPC** | ~1,020,000 | News & formal Arabic (United Nations Parallel Corpus) |
98
+ | **HF Tashkeel** | ~151,000 | Vocalized Arabic text (diacritized corpus) |
99
+ | **Hindawi E-Books** | ~100,000 | Literary Arabic prose (novels & non-fiction) |
100
+ | **Wikipedia (AR)** | ~98,500 | Encyclopedia articles |
101
+ | **CBT** | ~69,000 | Classical Arabic books & religious texts |
102
+ | **ARCD + XQuAD** | ~2,050 | Arabic QA pairs (rich in question marks `؟`) |
103
+ | **Total (raw)** | **~1,441,000** | — |
104
+
105
+ > All paragraphs were filtered to contain at least one Arabic letter and one target punctuation mark. Non-target punctuation (e.g., `«»`, `…`, parentheses) was removed before training.
106
+
107
+ ### Punctuation Coverage (raw corpus)
108
+
109
+ | Mark | Name | Paragraphs | Coverage |
110
+ |:---:|---|---:|---:|
111
+ | `،` | Arabic comma | 922,721 | 64.0% |
112
+ | `:` | Colon | 230,150 | 16.0% |
113
+ | `؛` | Arabic semicolon | 128,744 | 8.9% |
114
+ | `؟` | Question mark | 50,282 | 3.5% |
115
+ | `!` | Exclamation | 15,976 | 1.1% |
116
+ | `-` | Dash | ~1 | <0.1% |
117
+
118
+ ### Data Balance Strategy
119
+
120
+ To prevent the model from ignoring rare punctuation marks, a targeted oversampling strategy was applied:
121
+
122
+ | Strategy | Marks | Multiplier | Cap |
123
+ |---|:---:|:---:|---:|
124
+ | Strong oversampling | `؟` `!` | ×8 | 80,000 rows |
125
+ | Light oversampling | `؛` `-` | ×6 | 80,000 rows |
126
+
127
+ After oversampling, the combined training pool grew to **~2.4 million paragraphs**.
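
The oversampling table above can be read as a duplication rule. The sketch below is one possible interpretation, not the author's pipeline: it assumes that a multiplier of ×8 means eight total copies of each matching paragraph, and that the cap limits the number of extra rows added per strategy. The function name `oversample` and both assumptions are illustrative only.

```python
STRONG_MARKS = {"؟", "!"}   # strong oversampling group from the table
LIGHT_MARKS = {"؛", "-"}    # light oversampling group from the table


def oversample(paragraphs, strong_mult=8, light_mult=6, cap=80_000):
    """Return the original paragraphs plus duplicated rare-mark paragraphs.

    Assumption: `mult` counts total copies (original included), and `cap`
    bounds the number of extra rows added by each strategy.
    """
    pool = list(paragraphs)
    for marks, mult in ((STRONG_MARKS, strong_mult), (LIGHT_MARKS, light_mult)):
        # Paragraphs containing at least one mark of this group.
        hits = [p for p in paragraphs if any(m in p for m in marks)]
        extra = hits * (mult - 1)      # the original is already in the pool
        pool.extend(extra[:cap])       # never add more than `cap` rows
    return pool


if __name__ == "__main__":
    sample = ["هل وصلت؟", "وصلت اليوم.", "تأخر القطار؛ فانتظرنا."]
    print(len(oversample(sample)))
```

Under these assumptions a paragraph containing marks from both groups is duplicated by both strategies, which may or may not match the real pipeline.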
128
+
129
+ ### Dataset Splits
130
+
131
+ | Split | Sequences | Share |
132
+ |---|---:|---:|
133
+ | Train (capped) | 1,000,000 | 85% |
134
+ | Validation | 40,000 | 10% |
135
+ | Test | — | 5% |
136
+
137
+ - **Sliding-window context** (window=3 sentences, stride=2) was applied to training data only
138
+ - Validation and test sets remain un-windowed for clean, unbiased evaluation
139
+ - Splits were stratified by the rarest punctuation mark in each sequence
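
The sliding-window setting above (window=3 sentences, stride=2) can be sketched as follows. This is a minimal illustration that assumes the text has already been split into sentences; how the original pipeline segments sentences is not stated, and the helper name `sliding_windows` is hypothetical.

```python
def sliding_windows(sentences, window=3, stride=2):
    """Group sentences into overlapping windows.

    With window=3 and stride=2, consecutive windows share one sentence,
    so each training sequence carries some context from its neighbour.
    """
    if not sentences:
        return []
    if len(sentences) <= window:
        return [list(sentences)]
    windows = []
    for start in range(0, len(sentences), stride):
        windows.append(list(sentences[start:start + window]))
        # Stop once the current window has reached the last sentence.
        if start + window >= len(sentences):
            break
    return windows


if __name__ == "__main__":
    for w in sliding_windows(["s1", "s2", "s3", "s4", "s5"]):
        print(w)
```

Because windows overlap, applying them to validation or test data would repeat sentences across sequences, which is consistent with the note that those splits remain un-windowed.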
140
+
141
+ ### Preprocessing
142
+
143
+ - Arabic normalization: alef variants → `ا`, ya variants → `ي`, diacritics stripped
144
+ - Label assigned per word = punctuation mark **following** that word
145
+ - Multi-subword words: only the first subword receives the label; others are masked (`-100`)
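
The three preprocessing steps above can be sketched in plain Python. This is an illustrative reconstruction, not the author's code: the exact character sets for alef variants, ya variants, and diacritics are assumptions, and `word_ids` stands in for the per-token word indices that a fast tokenizer returns from `BatchEncoding.word_ids()`.

```python
import re

MARKS = ".،؟!:؛-"                                         # the seven target marks
DIACRITICS = re.compile("[\u064B-\u0652\u0670]")           # assumed diacritic set
ALEF_VARIANTS = re.compile("[\u0622\u0623\u0625\u0671]")   # assumed alef variants
YA_VARIANTS = re.compile("[\u0649\u06CC]")                 # assumed ya variants


def normalize(text):
    """Strip diacritics and unify alef and ya variants."""
    text = DIACRITICS.sub("", text)
    text = ALEF_VARIANTS.sub("\u0627", text)
    return YA_VARIANTS.sub("\u064A", text)


def words_and_labels(text):
    """Split punctuated text into bare words and the mark following each word."""
    words, labels = [], []
    for token in normalize(text).split():
        word = token.rstrip(MARKS)
        mark = token[len(word)] if len(word) < len(token) else "O"
        if word:
            words.append(word)
            labels.append(mark)
        elif words and labels[-1] == "O":
            labels[-1] = mark   # free-standing mark attaches to the previous word
    return words, labels


def align_labels(labels, word_ids, label2id, ignore_index=-100):
    """Give the label to the first subword of each word and mask the rest."""
    aligned, previous = [], None
    for word_id in word_ids:
        if word_id is None or word_id == previous:
            aligned.append(ignore_index)   # special token or later subword
        else:
            aligned.append(label2id[labels[word_id]])
        previous = word_id
    return aligned


if __name__ == "__main__":
    print(words_and_labels("من اخترع الهاتف؟"))
```

The value `-100` matters because it is the default `ignore_index` of PyTorch's cross-entropy loss, so masked subwords do not contribute to the loss.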
146
+
147
+ ---
148
+
149
+ ## ⚙️ Model Architecture & Training
150
+
151
+ | Setting | Value |
152
+ |---|---|
153
+ | Base model | `xlm-roberta-large` (~560M params) |
154
+ | Task | Token classification (8 labels) |
155
+ | Max sequence length | 384 tokens |
156
+ | Training examples | 1,000,000 (capped) |
157
+ | Validation examples | 40,000 |
158
+
159
+ ### Two-Phase Training
160
 
161
+ | Phase | Epochs | LR | Loss | Notes |
162
+ |---|:---:|---|---|---|
163
+ | Phase 1 | 2 | 2e-5 | Cross-entropy + label smoothing | Full model fine-tuning |
164
+ | Phase 2 | 1 | 6e-6 | Focal loss (γ=2.0) + class weights | Bottom 12 layers frozen |
165
+
166
+ ### Class Weights
167
+
168
+ Rare class weights were additionally boosted:
169
+
170
+ | Class | Boost |
171
+ |:---:|---|
172
+ | `؟` | ×1.2 |
173
+ | `!` | ×3.0 |
174
+ | `؛` | ×2.0 |
175
+ | `-` | ×1.3 |
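
For reference, the Phase 2 objective combines focal loss with per-class weights. For a single token with true class $c$, predicted probability $p_c$, and class weight $w_c$, the standard weighted focal loss is $-w_c (1 - p_c)^{\gamma} \log p_c$. The sketch below computes it for one token in plain Python. It assumes a base weight of 1.0 for every class, multiplied by the boosts in the table; the actual base weights are not given, and a real training loop would use a vectorized version in the training framework.

```python
import math

# Assumption: base weight 1.0 per class, scaled by the boosts from the table.
CLASS_BOOST = {"؟": 1.2, "!": 3.0, "؛": 2.0, "-": 1.3}


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_focal_loss(logits, target, weight=1.0, gamma=2.0):
    """Focal loss for one token: -weight * (1 - p_t)^gamma * log(p_t)."""
    p_t = softmax(logits)[target]
    return -weight * (1.0 - p_t) ** gamma * math.log(p_t)


if __name__ == "__main__":
    confident = [4.0, 0.0, 0.0]   # the model is already sure of class 0
    uncertain = [0.5, 0.0, 0.0]   # the model is unsure
    print(weighted_focal_loss(confident, 0))
    print(weighted_focal_loss(uncertain, 0, weight=CLASS_BOOST["!"]))
```

With `gamma=0` and `weight=1.0` the function reduces to ordinary cross-entropy; larger `gamma` shrinks the loss on tokens the model already classifies confidently, which shifts the gradient toward hard and rare classes.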
176
+
177
+ ---
178
+
179
+ ## 🚀 Quick Start
180
 
181
  ```python
182
  from transformers import AutoTokenizer, AutoModelForTokenClassification
 
212
  for token_id, word_id in zip(pred_ids, word_ids):
213
  if word_id is None or word_id == previous_word_id:
214
  continue
 
215
  word = words[word_id]
216
  label = id2label[token_id]
217
  if label != "O":
 
221
 
222
  restored_text = " ".join(restored_words)
223
  print(restored_text)
224
+ # → بلغت نسبة النمو الاقتصادي 4.7 بالمئة خلال الربع الثالث من عام 2024، وهو اعلى مستوى منذ خمس سنوات.
225
  ```
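
The diff hunks above omit part of the Quick Start script, so the decoding loop is only partly visible. As a self-contained illustration of that step, the sketch below rebuilds punctuated text from per-token predictions using hand-written inputs instead of a real model. The function name `restore` and the toy `id2label` mapping are hypothetical; only the skip rule (ignore special tokens and non-first subwords) is taken from the visible lines.

```python
def restore(words, word_ids, pred_ids, id2label):
    """Append the predicted mark to each word, reading first subwords only."""
    restored, previous = [], None
    for pred_id, word_id in zip(pred_ids, word_ids):
        # Skip special tokens (None) and later subwords of the same word.
        if word_id is None or word_id == previous:
            continue
        label = id2label[pred_id]
        restored.append(words[word_id] + (label if label != "O" else ""))
        previous = word_id
    return " ".join(restored)


if __name__ == "__main__":
    words = ["من", "اخترع", "الهاتف"]
    word_ids = [None, 0, 1, 1, 2, None]       # second word split into two subwords
    pred_ids = [0, 0, 0, 3, 2, 0]             # prediction on the extra subword is ignored
    id2label = {0: "O", 2: "؟", 3: "،"}       # toy mapping, not the model's real one
    print(restore(words, word_ids, pred_ids, id2label))
```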
226
 
227
+ ---
228
 
229
+ ## 📖 Example
230
 
231
+ **Input** (unpunctuated):
232
+ ```
233
+ اذا اردت ان تنجح في حياتك فعليك ان تحدد اهدافك واضحة وان تعمل بجد واستمرارية ولا تيأس عند اول عقبة تواجهها
234
+ ```
235
 
236
+ **Output** (restored):
237
+ ```
238
+ اذا اردت ان تنجح في حياتك، فعليك ان تحدد اهدافك واضحة، وان تعمل بجد واستمرارية، ولا تيأس عند اول عقبة تواجهها.
239
+ ```
240
 
241
+ **Question example:**
242
+ ```
243
+ من اخترع الهاتف وفي اي سنة تم ذلك وما هي اهمية هذا الاختراع
244
+ ```
245
+ ```
246
+ من اخترع الهاتف، وفي اي سنة تم ذلك، وما هي اهمية هذا الاختراع؟
247
+ ```
248
+
249
+ ---
250
 
251
+ ## 🎯 Intended Use
252
 
253
+ Naqta is well-suited for:
254
+
255
+ - 🎙️ **ASR post-processing** — restoring punctuation in Arabic speech transcripts
256
+ - 📄 **Readability enhancement** — making raw Arabic text easier to read
257
+ - 🔧 **NLP preprocessing** — improving text quality for downstream Arabic NLP tasks
258
+ - 🔬 **Research** — Arabic punctuation restoration benchmark evaluation
259
+
260
+ ---
261
 
262
+ ## ⚠️ Limitations
263
 
264
+ - Punctuation restoration is partly stylistic; multiple valid outputs may exist for a single input.
265
+ - Performance may degrade on highly dialectal, technical, or domain-specific text.
266
+ - The model does not predict quotation marks or dialogue markers (`«»`).
267
+ - Very short or fragmented text (< 5 words) may produce less reliable results.
268
+ - The model only predicts which punctuation mark follows each word; it does not perform grammar correction.
269
+
270
+ ---
271
+
272
+ ## 📜 License
273
+
274
+ This model is released under the **MIT License**.
275
+
276
+ ---
277
 
278
+ ## 🔗 Citation
279
 
280
+ If you use Naqta in your work, please reference:
281
 
282
+ ```bibtex
283
+ @misc{naqta2025,
284
+ title = {Naqta: Arabic Punctuation Restoration with XLM-RoBERTa},
285
+ author = {MostafaMaroof},
286
+ year = {2025},
287
+ publisher = {Hugging Face},
288
+ url = {https://huggingface.co/MostafaMaroof/Naqta}
289
+ }
290
  ```