	---
	language:
	- ar
	license: mit
	tags:
	- arabic
	- punctuation-restoration
	- token-classification
	- xlm-roberta
	- natural-language-processing
	pipeline_tag: token-classification
	base_model: xlm-roberta-large
	model-index:
	- name: Naqta
	  results:
	  - task:
	      type: token-classification
	      name: Arabic Punctuation Restoration
	    dataset:
	      name: Mixed Arabic punctuation restoration corpus
	      type: custom
	    metrics:
	    - type: f1
	      value: 0.8960
	      name: Validation Macro F1
	    - type: accuracy
	      value: 0.9714
	      name: Validation Accuracy
	---

	<div align="center">

	# 🔤 Naqta — نقطة

	### Arabic Punctuation Restoration

	[![Model](https://img.shields.io/badge/🤗%20Model-MostafaMaroof%2FNaqta-blue)](https://huggingface.co/MostafaMaroof/Naqta)
	[![Language](https://img.shields.io/badge/Language-Arabic-green)](https://huggingface.co/MostafaMaroof/Naqta)
	[![Task](https://img.shields.io/badge/Task-Token%20Classification-orange)](https://huggingface.co/MostafaMaroof/Naqta)
	[![License](https://img.shields.io/badge/License-MIT-yellow)](https://opensource.org/licenses/MIT)
	[![Macro F1](https://img.shields.io/badge/Macro%20F1-89.6%25-brightgreen)](https://huggingface.co/MostafaMaroof/Naqta)

	</div>

	---

	Naqta (Arabic: نقطة, "dot/period") is an Arabic punctuation restoration model that reaches 0.8960 macro F1 and 0.9714 accuracy on its validation set. Given unpunctuated Arabic text, it predicts which punctuation mark, if any, follows each word, using a token-classification head on top of XLM-RoBERTa Large.

	> 💡 Try it live on the [Hugging Face Space](https://huggingface.co/spaces/MostafaMaroof/Naqta)

	---

	## ✨ What Does It Restore?

	| Symbol | Name | Arabic description |
	|:---:|---|---|
	| `.` | Period | نهاية الجملة |
	| `،` | Arabic comma | فاصلة عربية |
	| `؟` | Arabic question mark | علامة استفهام |
	| `!` | Exclamation mark | علامة تعجب |
	| `:` | Colon | نقطتان |
	| `؛` | Arabic semicolon | فاصلة منقوطة |
	| `-` | Dash | شرطة |

	---

	## 🏆 Results

	### Validation Metrics (v11d — Final)

	| Metric | Score |
	|---|---:|
	| 🎯 Macro F1 | 0.8960 |
	| ✅ Accuracy | 0.9714 |

	### Per-Class F1 Score

	| Class | Symbol | F1 | Performance |
	|---|:---:|---:|---|
	| Exclamation | `!` | 0.8897 | 🟢 Excellent |
	| Arabic semicolon | `؛` | 0.8042 | 🟢 Excellent |
	| Question mark | `؟` | 0.9665 | 🟢 Excellent |
	| Dash | `-` | 0.9007 | 🟢 Excellent |
	| Arabic comma | `،` | 0.8100 | 🟢 Excellent |
	| Period | `.` | 0.8968 | 🟢 Excellent |
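
Macro F1 is the unweighted mean of the per-class F1 scores. The sketch below averages only the six classes listed in the table above. The model has 8 labels, so this mean is not expected to reproduce the reported 0.8960, which presumably also covers the labels not listed here (an assumption; the README does not say which classes enter the average).

```python
# Per-class F1 values copied from the table above.
per_class_f1 = {
    "!": 0.8897,
    "؛": 0.8042,
    "؟": 0.9665,
    "-": 0.9007,
    "،": 0.8100,
    ".": 0.8968,
}


def macro_f1(scores):
    """Unweighted mean of per-class F1 scores."""
    values = list(scores.values())
    return sum(values) / len(values)


listed_mean = macro_f1(per_class_f1)
print(f"{listed_mean:.4f}")  # → 0.8780
```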

	---

	## 🗂️ Training Data

	The model was trained on about 1.44 million paragraphs drawn from six Arabic sources, spanning formal news text, vocalized text, literary prose, encyclopedia articles, classical books, and question-answer pairs.

	### Corpus Sources

	| Source | Rows | Domain |
	|---|---:|---|
	| ABC / UNPC | ~1,020,000 | News & formal Arabic (United Nations Parallel Corpus) |
	| HF Tashkeel | ~151,000 | Vocalized Arabic text (diacritized corpus) |
	| Hindawi E-Books | ~100,000 | Literary Arabic prose (novels & non-fiction) |
	| Wikipedia (AR) | ~98,500 | Encyclopedia articles |
	| CBT | ~69,000 | Classical Arabic books & religious texts |
	| ARCD + XQuAD | ~2,050 | Arabic QA pairs (rich in question marks `؟`) |
	| Total (raw) | ~1,441,000 | n/a |

	> All paragraphs were filtered to contain at least one Arabic letter and one target punctuation mark. Non-target punctuation (e.g., `«»`, `…`, parentheses) was removed before training.
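
The filtering rule above can be sketched roughly as follows. This is an illustration, not the author's pipeline: `clean_paragraph`, the Arabic letter range, and the exact list of removed characters are assumptions chosen for the example.

```python
import re

TARGET_MARKS = ".،؟!:؛-"                        # the seven marks listed in this README
ARABIC_LETTER = re.compile(r"[\u0621-\u064A]")  # approximate basic Arabic letter range
NON_TARGET = re.compile(r"[«»…()\[\]]")         # hypothetical list of characters to drop


def clean_paragraph(text):
    """Return the cleaned paragraph, or None if it should be filtered out."""
    text = NON_TARGET.sub("", text)
    text = re.sub(r"\s+", " ", text).strip()
    has_letter = ARABIC_LETTER.search(text) is not None
    has_mark = any(ch in TARGET_MARKS for ch in text)
    return text if has_letter and has_mark else None


kept = clean_paragraph("قال «مرحبا» ثم غادر.")   # guillemets removed, paragraph kept
dropped = clean_paragraph("نص بلا علامات")        # no target mark, so None
```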

	### Punctuation Coverage (raw corpus)

	| Mark | Name | Paragraphs | Coverage |
	|:---:|---|---:|---:|
	| `،` | Arabic comma | 922,721 | 64.0% |
	| `:` | Colon | 230,150 | 16.0% |
	| `؛` | Arabic semicolon | 128,744 | 8.9% |
	| `؟` | Question mark | 50,282 | 3.5% |
	| `!` | Exclamation | 15,976 | 1.1% |
	| `-` | Dash | ~1 | <0.1% |

	### Data Balance Strategy

	To prevent the model from ignoring rare punctuation marks, a targeted oversampling strategy was applied:

	| Strategy | Marks | Multiplier | Cap |
	|---|:---:|:---:|---:|
	| Strong oversampling | `؟` `!` | ×8 | 80,000 rows |
	| Light oversampling | `؛` `-` | ×6 | 80,000 rows |

	After oversampling, the combined training pool grew to ~2.4 million paragraphs.
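
A minimal sketch of multiplier-and-cap oversampling is shown below. The README does not say how the cap is applied, so this example assumes it limits the number of extra rows added per strategy; `oversample` and its parameters are hypothetical names.

```python
import random


def oversample(paragraphs, marks, multiplier, cap, seed=0):
    """Return extra copies of paragraphs containing any of `marks`.

    Assumption: `cap` bounds the number of extra rows added by one strategy.
    """
    rng = random.Random(seed)
    hits = [p for p in paragraphs if any(m in p for m in marks)]
    extra = hits * (multiplier - 1)  # the original copy is already in the corpus
    rng.shuffle(extra)
    return extra[:cap]


corpus = ["هل وصلت؟", "وصلت اليوم.", "يا للروعة!"]
pool = corpus + oversample(corpus, "؟!", multiplier=8, cap=80_000)
print(len(pool))  # → 17 (3 originals + 2 matching paragraphs × 7 extra copies)
```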

	### Dataset Splits

	| Split | Sequences | Share |
	|---|---:|---:|
	| Train (capped) | 1,000,000 | 85% |
	| Validation | 40,000 | 10% |
	| Test | not reported | 5% |

	- Sliding-window context (window=3 sentences, stride=2) was applied to training data only
	- Validation and test sets remain un-windowed for clean, unbiased evaluation
	- Splits were stratified by the rarest punctuation mark in each sequence
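
The sliding-window step can be illustrated with a short sketch. It assumes the text has already been split into sentences (the README does not describe the splitter) and that a window is formed by joining consecutive sentences with a space; `sliding_windows` is a hypothetical helper.

```python
def sliding_windows(sentences, window=3, stride=2):
    """Join consecutive sentences into overlapping training sequences."""
    if len(sentences) <= window:
        return [" ".join(sentences)] if sentences else []
    sequences = []
    for start in range(0, len(sentences), stride):
        sequences.append(" ".join(sentences[start:start + window]))
        if start + window >= len(sentences):
            break  # the last window already reaches the final sentence
    return sequences


windows = sliding_windows(["s1", "s2", "s3", "s4", "s5"])
print(windows)  # → ['s1 s2 s3', 's3 s4 s5']
```

With `window=3` and `stride=2`, neighbouring windows share one sentence, so each sentence boundary is seen with context on both sides at least once.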

	### Preprocessing

	- Arabic normalization: alef variants → `ا`, ya variants → `ي`, diacritics stripped
	- Label assigned per word = punctuation mark following that word
	- Multi-subword words: only the first subword receives the label; others are masked (`-100`)
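
The three preprocessing steps above can be sketched in plain Python. The character ranges, the helper names, and the `label2id` mapping are assumptions for illustration; the real mapping lives in the model config, and `word_ids` would normally come from the tokenizer's `word_ids()` output.

```python
import re

DIACRITICS = re.compile("[\u064B-\u0652\u0670]")  # harakat, tanween, shadda, sukun, dagger alef
MARKS = ".،؟!:؛-"
IGNORE_INDEX = -100


def normalize(text):
    """Strip diacritics and unify alef and ya variants."""
    text = DIACRITICS.sub("", text)
    text = re.sub("[\u0622\u0623\u0625\u0671]", "\u0627", text)  # alef variants -> ا
    return re.sub("[\u0649\u06CC]", "\u064A", text)              # ya variants -> ي


def words_and_labels(text):
    """Split punctuated text into words and the mark following each word ("O" if none)."""
    words, labels = [], []
    for token in normalize(text).split():
        stripped = token.rstrip(MARKS)
        # Keep the first trailing mark when several follow the same word.
        mark = token[len(stripped)] if len(stripped) < len(token) else "O"
        if stripped:
            words.append(stripped)
            labels.append(mark)
        elif words and labels[-1] == "O":
            labels[-1] = mark  # a standalone mark labels the previous word
    return words, labels


def align_labels(word_ids, labels, label2id):
    """Give each word's label to its first subword; mask everything else with -100."""
    aligned, previous = [], None
    for word_id in word_ids:
        if word_id is None or word_id == previous:
            aligned.append(IGNORE_INDEX)
        else:
            aligned.append(label2id[labels[word_id]])
        previous = word_id
    return aligned


words, labels = words_and_labels("ذهب أحمد إلى المدرسة، ثم عاد.")
print(labels)  # → ['O', 'O', 'O', '،', 'O', '.']
```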

	---

	## ⚙️ Model Architecture & Training

	| Setting | Value |
	|---|---|
	| Base model | `xlm-roberta-large` (~560M params) |
	| Task | Token classification (8 labels) |
	| Max sequence length | 384 tokens |
	| Training examples | 1,000,000 (capped) |
	| Validation examples | 40,000 |

	### Two-Phase Training

	| Phase | Epochs | LR | Loss | Notes |
	|---|:---:|---|---|---|
	| Phase 1 | 2 | 2e-5 | Cross-entropy + label smoothing | Full model fine-tuning |
	| Phase 2 | 1 | 6e-6 | Focal loss (γ=2.0) + class weights | Bottom 12 layers frozen |

	### Class Weights

	Class weights for the rare marks were boosted further:

	| Class | Boost |
	|:---:|---|
	| `؟` | ×1.2 |
	| `!` | ×3.0 |
	| `؛` | ×2.0 |
	| `-` | ×1.3 |
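
To show how a class weight and the focal term interact in Phase 2, here is a dependency-free sketch of focal loss for a single token. It is not the training code: the README gives only the boost factors, so the base weight of 1.0 and the way weights are combined with the focal term are assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def focal_loss(logits, target, class_weights, gamma=2.0):
    """Weighted focal loss for one token: -w_t * (1 - p_t)^gamma * log(p_t)."""
    p_t = softmax(logits)[target]
    return -class_weights[target] * (1.0 - p_t) ** gamma * math.log(p_t)


# Hypothetical two-class setup: base weight 1.0, and 3.0 mirroring the `!` boost.
weights = [1.0, 3.0]
uncertain = focal_loss([0.0, 0.0], target=0, class_weights=weights)
boosted = focal_loss([0.0, 0.0], target=1, class_weights=weights)
confident = focal_loss([4.0, 0.0], target=0, class_weights=weights)
```

The `(1 - p_t)^gamma` factor shrinks the loss on tokens the model already classifies confidently, while the class weight scales up the loss on rare marks.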

	---

	## 🚀 Quick Start

	```python
	from transformers import AutoTokenizer, AutoModelForTokenClassification
	import torch

	repo_id = "MostafaMaroof/Naqta"

	tokenizer = AutoTokenizer.from_pretrained(repo_id)
	model = AutoModelForTokenClassification.from_pretrained(repo_id)
	model.eval()

	id2label = model.config.id2label

	text = "بلغت نسبة النمو الاقتصادي 4.7 بالمئة خلال الربع الثالث من عام 2024 وهو اعلى مستوى منذ خمس سنوات"
	words = text.split()

	# Tokenize pre-split words so each subword can be mapped back to its word.
	# Words truncated beyond max_length are dropped from the output.
	inputs = tokenizer(
	    words,
	    is_split_into_words=True,
	    return_tensors="pt",
	    truncation=True,
	    max_length=384,
	)

	with torch.no_grad():
	    logits = model(**inputs).logits

	pred_ids = logits.argmax(dim=-1)[0].tolist()
	word_ids = inputs.word_ids(batch_index=0)

	# Use the prediction on each word's first subword;
	# skip special tokens and continuation subwords.
	restored_words = []
	previous_word_id = None
	for pred_id, word_id in zip(pred_ids, word_ids):
	    if word_id is None or word_id == previous_word_id:
	        continue
	    word = words[word_id]
	    label = id2label[pred_id]
	    if label != "O":
	        word = word + label
	    restored_words.append(word)
	    previous_word_id = word_id

	restored_text = " ".join(restored_words)
	print(restored_text)
	# → بلغت نسبة النمو الاقتصادي 4.7 بالمئة خلال الربع الثالث من عام 2024، وهو اعلى مستوى منذ خمس سنوات.
	```

	---

	## 📖 Example

	Input (unpunctuated):
	```
	اذا اردت ان تنجح في حياتك فعليك ان تحدد اهدافك واضحة وان تعمل بجد واستمرارية ولا تيأس عند اول عقبة تواجهها
	```

	Output (restored):
	```
	اذا اردت ان تنجح في حياتك، فعليك ان تحدد اهدافك واضحة، وان تعمل بجد واستمرارية، ولا تيأس عند اول عقبة تواجهها.
	```

	Question example (input, then restored output):
	```
	من اخترع الهاتف وفي اي سنة تم ذلك وما هي اهمية هذا الاختراع
	```
	```
	من اخترع الهاتف، وفي اي سنة تم ذلك، وما هي اهمية هذا الاختراع؟
	```

	---

	## 🎯 Intended Use

	Naqta is well-suited for:

	- 🎙️ ASR post-processing — restoring punctuation in Arabic speech transcripts
	- 📄 Readability enhancement — making raw Arabic text easier to read
	- 🔧 NLP preprocessing — improving text quality for downstream Arabic NLP tasks
	- 🔬 Research — Arabic punctuation restoration benchmark evaluation

	---

	## ⚠️ Limitations

	- Punctuation restoration is partly stylistic — multiple valid outputs may exist for a single input.
	- Performance may degrade on highly dialectal, technical, or domain-specific text.
	- The model does not predict quotation marks or dialogue markers (`«»`).
	- Very short or fragmented text (< 5 words) may produce less reliable results.
	- The model only predicts punctuation marks and their positions; it does not perform grammar correction.

	---

	## 📜 License

	This model is released under the MIT License.

	---

	## 🔗 Citation

	If you use Naqta in your work, please reference:

	```bibtex
	@misc{naqta2025,
	  title = {Naqta: Arabic Punctuation Restoration with XLM-RoBERTa},
	  author = {MostafaMaroof},
	  year = {2025},
	  publisher = {Hugging Face},
	  url = {https://huggingface.co/MostafaMaroof/Naqta}
	}
	```