---
license: apache-2.0
tags:
- xlm-roberta
- punctuation-restoration
- kyrgyz
- nlp
- onnx
- transformer
- low-resource-languages
- asr-postprocessing
- token-classification
language:
- ky
pipeline_tag: token-classification
datasets:
- custom
metrics:
- precision
- recall
- f1
---
# Kyrgyz Punctuation Restoration — XLM-RoBERTa
**The first punctuation restoration model for the Kyrgyz language**, achieving **94.1% precision** and **90.3% F1-score** — surpassing reported benchmarks for several other low-resource languages.
📄 **Published research:** *"AI-Based Punctuation Restoration using Transformer Model for Kyrgyz Language"* — Uvalieva Z., Muhametjanova G. (SCOPUS-indexed)
---
## Highlights
- πŸ† **F1-score: 90.3%** β€” outperforms comparable low-resource language models
- 🌍 **First-of-its-kind** for Kyrgyz (Turkic language family, ~7M speakers)
- ⚑ **ONNX format** β€” optimized for fast inference across frameworks
- πŸŽ™οΈ **ASR post-processing** β€” designed to restore punctuation in speech-to-text output
---
## Performance
| Metric | Score |
|--------|-------|
| **Precision** | 94.1% |
| **Recall** | 86.8% |
| **F1-Score** | 90.3% |
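The F1-score is the harmonic mean of precision and recall, which can be verified directly from the two numbers above:

```python
# F1 as the harmonic mean of precision and recall
p, r = 0.941, 0.868
f1 = 2 * p * r / (p + r)
print(round(f1, 3))  # 0.903
```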
### Cross-Lingual Comparison
| Model | Language | F1-Score |
|-------|----------|----------|
| **Ours (XLM-RoBERTa)** | **Kyrgyz** | **90.3%** |
| Alam et al. (2020) | English (clean) | 87.0% |
| Alam et al. (2020) | Bangla | 69.5% |
| Nagy et al. (2021) | Hungarian | ~82.0% |
The model demonstrates strong performance on frequent punctuation marks (periods, commas) with reduced accuracy on rare marks (question marks, exclamation points) due to class imbalance.
---
## Model Architecture
| Parameter | Value |
|-----------|-------|
| Base model | XLM-RoBERTa-base |
| Parameters | ~270M |
| Transformer layers | 12 |
| Hidden dimensions | 768 |
| Attention heads | 12 |
| Export format | ONNX |
---
## Training Details
### Dataset
A custom-built **200 MB** Kyrgyz text corpus, collected over 2 months:
| Source | Size | Description |
|--------|------|-------------|
| Kyrgyz-Turkish Manas University Library | 135 MB | Books (literature, math, physics) |
| Kyrgyz Wikipedia | 40 MB | Encyclopedia articles |
| News portals | 25 MB | Journalistic text |
**Preprocessing pipeline:** PDF → EasyOCR text extraction → manual cleaning → JSON formatting with punctuation labels.
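The final JSON labeling step can be sketched as follows (a simplified illustration, not the project's actual script): each word is stripped of trailing punctuation and paired with one of the model's labels.

```python
import json

# Punctuation marks mapped to the model's label set (O = no punctuation)
PUNCT_LABELS = {",": "COMMA", ".": "PERIOD", "?": "QUESTION", "!": "EXCLAMATION"}

def label_sentence(sentence):
    """Strip trailing punctuation from each word and record its label."""
    pairs = []
    for raw in sentence.split():
        label = "O"
        while raw and raw[-1] in PUNCT_LABELS:
            label = PUNCT_LABELS[raw[-1]]
            raw = raw[:-1]
        if raw:
            pairs.append({"token": raw, "label": label})
    return pairs

print(json.dumps(label_sentence("Саламатсызбы, кандайсыз?"), ensure_ascii=False))
```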
### Data Augmentation
Specialized augmentation techniques designed for Kyrgyz agglutinative morphology:
- **Back-translation:** Kyrgyz → English → Kyrgyz (simulating ASR-like errors)
- **Token-level modifications:** Random insertions, deletions, swaps
- **Morphological transformations:** Case form and morpheme modifications preserving grammatical correctness
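The token-level modifications can be illustrated with a small sketch (the exact operations and sampling scheme here are illustrative assumptions; the project's actual augmentation code is not included in this repository):

```python
import random

def augment(tokens, rng):
    """Apply one random token-level edit: deletion, duplicate-insertion, or swap."""
    tokens = list(tokens)
    op = rng.choice(["delete", "insert", "swap"])
    i = rng.randrange(len(tokens))
    if op == "delete" and len(tokens) > 1:
        del tokens[i]
    elif op == "insert":
        tokens.insert(i, tokens[i])  # duplicate a token as noise
    elif op == "swap" and len(tokens) > 1:
        j = (i + 1) % len(tokens)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens

print(augment(["бул", "кыргыз", "текст"], random.Random(0)))
```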
### Hyperparameters
| Parameter | Value |
|-----------|-------|
| Batch size | 32 |
| Epochs | 10 |
| Optimizer | Adam |
| Learning rate | 5e-5 |
| Regularization | Dropout |
| Hardware | Google Colab TPU |
| Training time | 42 hours |
---
## How to Use
```python
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

# Load the ONNX model
session = ort.InferenceSession("model.onnx")

# The model predicts one punctuation label per token:
# O (no punctuation), COMMA, PERIOD, QUESTION, EXCLAMATION

# Tokenize (base XLM-RoBERTa tokenizer assumed; see config.yaml for the exact settings)
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
input_text = "бул кыргыз тилиндеги текст"
encoded = tokenizer(input_text, return_tensors="np")

# Run inference and take the highest-scoring label per token
# (see main.py for the full pipeline, including label decoding)
inputs = {i.name: encoded[i.name] for i in session.get_inputs() if i.name in encoded}
logits = session.run(None, inputs)[0]
label_ids = np.argmax(logits, axis=-1)
```
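Once per-token labels are predicted, the punctuation can be reinserted into the text. A minimal sketch, assuming word-level labels have already been obtained from the model's output:

```python
# Map each label to the punctuation mark it restores (label set from this card)
PUNCT = {"O": "", "COMMA": ",", "PERIOD": ".", "QUESTION": "?", "EXCLAMATION": "!"}

def restore(words, labels):
    """Attach the predicted punctuation mark to each word."""
    return " ".join(w + PUNCT[l] for w, l in zip(words, labels))

words = ["бул", "кыргыз", "тилиндеги", "текст"]
labels = ["O", "O", "O", "PERIOD"]
print(restore(words, labels))  # бул кыргыз тилиндеги текст.
```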
### Repository Structure
```
├── model.onnx         # Trained model in ONNX format (1.11 GB)
├── main.py            # Inference pipeline
├── env.py             # Environment configuration
├── config.yaml        # Hyperparameters and model config
├── requirements.txt   # Python dependencies
└── Files/             # Additional model files
```
---
## Intended Use
| Use Case | Description |
|----------|-------------|
| **ASR post-processing** | Restore punctuation in speech-to-text output for Kyrgyz |
| **Text normalization** | Clean and format raw Kyrgyz text with proper punctuation |
| **NLP preprocessing** | Improve downstream task performance (NER, MT, summarization) |
| **Accessibility** | Enhance readability of automatically generated Kyrgyz content |
---
## Limitations
- **Rare punctuation marks:** Lower accuracy on question marks and exclamation points due to class imbalance in training data
- **Formal text bias:** Trained primarily on literary/formal text; performance on informal/conversational text (social media, chat) may be lower
- **Morpheme boundary errors:** Occasional difficulty placing punctuation in complex agglutinative constructions
- **Domain specificity:** Best performance on prose-style text; specialized domains may require additional fine-tuning
---
## Future Directions
- Joint training with related Turkic languages (Kazakh, Uzbek, Turkish) for improved cross-lingual transfer
- Morphology-aware tokenization to replace standard BPE
- Expanded dataset with informal and conversational Kyrgyz text
- Integration with Kyrgyz ASR systems for end-to-end speech processing
---
## Citation
```bibtex
@article{uvalieva2024punctuation,
  author      = {Uvalieva, Zarina and Muhametjanova, Gulshat},
  title       = {AI-Based Punctuation Restoration using Transformer Model for Kyrgyz Language},
  year        = {2024},
  institution = {Kyrgyz-Turkish Manas University}
}
```
---
## Author
**Zarina Uvalieva** — ML Engineer specializing in NLP and Speech Technologies for low-resource languages.
- 🤗 [HuggingFace](https://huggingface.co/Zarinaaa)
- 📧 zarina.uvalievaa@gmail.com