---
license: apache-2.0
tags:
- xlm-roberta
- punctuation-restoration
- kyrgyz
- nlp
- onnx
- transformer
- low-resource-languages
- asr-postprocessing
- token-classification
language:
- ky
pipeline_tag: token-classification
datasets:
- custom
metrics:
- precision
- recall
- f1
---

# Kyrgyz Punctuation Restoration with XLM-RoBERTa

**The first punctuation restoration model for the Kyrgyz language**, achieving **94.1% precision** and a **90.3% F1-score**, surpassing reported benchmarks for other low-resource languages.

📄 **Published research:** *"AI-Based Punctuation Restoration using Transformer Model for Kyrgyz Language"*, Uvalieva Z., Muhametjanova G. (Scopus-indexed)

---

## Highlights

- 🏆 **F1-score of 90.3%**, outperforming comparable low-resource language models
- 🌍 **First of its kind** for Kyrgyz (Turkic language family, ~7M speakers)
- ⚡ **ONNX format**, optimized for fast inference across frameworks
- 🎙️ **ASR post-processing**, designed to restore punctuation in speech-to-text output

---

## Performance

| Metric | Score |
|--------|-------|
| **Precision** | 94.1% |
| **Recall** | 86.8% |
| **F1-score** | 90.3% |

### Cross-Lingual Comparison

| Model | Language | F1-score |
|-------|----------|----------|
| **Ours (XLM-RoBERTa)** | **Kyrgyz** | **90.3%** |
| Alam et al. (2020) | English (clean) | 87.0% |
| Alam et al. (2020) | Bangla | 69.5% |
| Nagy et al. (2021) | Hungarian | ~82.0% |

The model performs strongly on frequent punctuation marks (periods, commas), with reduced accuracy on rarer marks (question marks, exclamation points) due to class imbalance.
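As a concrete illustration of how such scores are computed from token-level predictions, here is a minimal sketch in plain Python (toy label sequences, not the paper's actual evaluation code). It micro-averages over the punctuation classes, treating `O` (no punctuation) as the negative class:

```python
def punct_scores(gold, pred):
    """Micro-averaged precision/recall/F1 over punctuation labels, ignoring 'O'."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        if p != "O" and p == g:
            tp += 1                    # correct punctuation mark predicted
        elif p != "O":
            fp += 1                    # spurious or wrong mark predicted
            if g != "O":
                fn += 1                # ...and the gold mark was also missed
        elif g != "O":
            fn += 1                    # gold mark missed entirely
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy example: gold vs. predicted labels for eight tokens
gold = ["O", "COMMA", "O", "PERIOD", "O", "O", "QUESTION", "O"]
pred = ["O", "COMMA", "O", "PERIOD", "O", "O", "O", "O"]
p, r, f1 = punct_scores(gold, pred)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")  # → precision=1.00 recall=0.67 f1=0.80
```

The example shows the typical pattern behind the table above: missing a rare mark (here the question mark) costs recall while leaving precision untouched.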

---

## Model Architecture

| Parameter | Value |
|-----------|-------|
| Base model | XLM-RoBERTa-base |
| Parameters | ~270M |
| Transformer layers | 12 |
| Hidden size | 768 |
| Attention heads | 12 |
| Export format | ONNX |

---

## Training Details

### Dataset

A custom-built **200 MB** Kyrgyz text corpus, collected over two months:

| Source | Size | Description |
|--------|------|-------------|
| Kyrgyz-Turkish Manas University Library | 135 MB | Books (literature, math, physics) |
| Kyrgyz Wikipedia | 40 MB | Encyclopedia articles |
| News portals | 25 MB | Journalistic text |

**Preprocessing pipeline:** PDF → EasyOCR text extraction → manual cleaning → JSON formatting with punctuation labels.
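The final labeling step of that pipeline, turning cleaned text into per-word punctuation labels, can be sketched as follows. This is a simplified illustration with a made-up JSON layout, not the repository's exact format:

```python
import json

# Map trailing punctuation characters to label names
PUNCT_LABELS = {",": "COMMA", ".": "PERIOD", "?": "QUESTION", "!": "EXCLAMATION"}

def label_sentence(text):
    """Strip punctuation from `text` and record, per word, which mark followed it."""
    tokens, labels = [], []
    for word in text.split():
        label = "O"  # default: no punctuation after this word
        while word and word[-1] in PUNCT_LABELS:
            label = PUNCT_LABELS[word[-1]]
            word = word[:-1]
        if word:
            tokens.append(word)
            labels.append(label)
    return {"tokens": tokens, "labels": labels}

example = label_sentence("Салам, бул кыргыз тилиндеги текст.")
print(json.dumps(example, ensure_ascii=False))
```

Each JSON record pairs the unpunctuated tokens with their target labels, which is the standard supervision format for token-classification training.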

### Data Augmentation

Augmentation techniques tailored to Kyrgyz agglutinative morphology:

- **Back-translation:** Kyrgyz → English → Kyrgyz, simulating ASR-like errors
- **Token-level modifications:** random insertions, deletions, and swaps
- **Morphological transformations:** case-form and morpheme modifications that preserve grammatical correctness
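The token-level modifications can be sketched with plain Python. This is a minimal illustration only; the actual augmentation code is not published here, and the filler token and probability are hypothetical choices:

```python
import random

def augment_tokens(tokens, rng, p=0.1, filler="эмм"):
    """Randomly insert, delete, or swap tokens to simulate ASR-like noise.

    `filler` is a hypothetical hesitation token; `p` is the per-position
    probability of each operation.
    """
    out = list(tokens)
    i = 0
    while i < len(out):
        r = rng.random()
        if r < p:                             # insertion of a filler token
            out.insert(i, filler)
            i += 2
        elif r < 2 * p and len(out) > 1:      # deletion
            del out[i]
        elif r < 3 * p and i + 1 < len(out):  # swap with the next token
            out[i], out[i + 1] = out[i + 1], out[i]
            i += 2
        else:
            i += 1
    return out

rng = random.Random(0)
print(augment_tokens(["бул", "кыргыз", "тилиндеги", "текст"], rng))
```

Seeding the generator makes augmented corpora reproducible across training runs.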

### Hyperparameters

| Parameter | Value |
|-----------|-------|
| Batch size | 32 |
| Epochs | 10 |
| Optimizer | Adam |
| Learning rate | 5e-5 |
| Regularization | Dropout |
| Hardware | Google Colab TPU |
| Training time | 42 hours |
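For orientation, settings like these might appear in a `config.yaml` roughly as below. The field names here are hypothetical; consult the actual `config.yaml` in the repository for the real keys:

```yaml
# Hypothetical layout; see config.yaml in the repository for the real keys
model:
  base: xlm-roberta-base
  export: onnx
training:
  batch_size: 32
  epochs: 10
  optimizer: adam
  learning_rate: 5.0e-5
  dropout: true
labels: [O, COMMA, PERIOD, QUESTION, EXCLAMATION]
```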

---

## How to Use

```python
import onnxruntime as ort
import numpy as np

# Load the ONNX model
session = ort.InferenceSession("model.onnx")

# Prepare the input (see config.yaml for tokenizer settings).
# The model predicts a punctuation label for each token:
# O (no punctuation), COMMA, PERIOD, QUESTION, EXCLAMATION

# Example input
input_text = "бул кыргыз тилиндеги текст"  # "this is a text in the Kyrgyz language"
# Tokenize and run inference (see main.py for the full pipeline)
```
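Once per-token labels have been predicted (by whatever tokenizer and ONNX pipeline `main.py` uses), reattaching the punctuation is straightforward. A minimal word-level sketch (the real model operates on subword tokens, so this is an illustration of the decoding idea, not the repository's code):

```python
# Map label names back to punctuation characters
PUNCT = {"COMMA": ",", "PERIOD": ".", "QUESTION": "?", "EXCLAMATION": "!"}
SENTENCE_FINAL = {"PERIOD", "QUESTION", "EXCLAMATION"}

def restore(tokens, labels):
    """Attach predicted punctuation and capitalize sentence starts."""
    words, capitalize = [], True
    for token, label in zip(tokens, labels):
        word = token.capitalize() if capitalize else token
        word += PUNCT.get(label, "")  # "O" maps to no punctuation
        words.append(word)
        capitalize = label in SENTENCE_FINAL  # next word starts a new sentence
    return " ".join(words)

print(restore(["бул", "кыргыз", "тилиндеги", "текст"],
              ["O", "O", "O", "PERIOD"]))
# → Бул кыргыз тилиндеги текст.
```

This is the post-processing step that turns raw ASR output into readable punctuated text.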

### Repository Structure

```
├── model.onnx         # Trained model in ONNX format (1.11 GB)
├── main.py            # Inference pipeline
├── env.py             # Environment configuration
├── config.yaml        # Hyperparameters and model config
├── requirements.txt   # Python dependencies
└── Files/             # Additional model files
```

---

## Intended Use

| Use Case | Description |
|----------|-------------|
| **ASR post-processing** | Restore punctuation in Kyrgyz speech-to-text output |
| **Text normalization** | Clean and format raw Kyrgyz text with proper punctuation |
| **NLP preprocessing** | Improve downstream task performance (NER, MT, summarization) |
| **Accessibility** | Enhance readability of automatically generated Kyrgyz content |

---

## Limitations

- **Rare punctuation marks:** lower accuracy on question marks and exclamation points due to class imbalance in the training data
- **Formal-text bias:** trained primarily on literary and formal text; performance on informal, conversational text (social media, chat) may be lower
- **Morpheme-boundary errors:** occasional difficulty placing punctuation in complex agglutinative constructions
- **Domain specificity:** best performance on prose-style text; specialized domains may require additional fine-tuning

---

## Future Directions

- Joint training with related Turkic languages (Kazakh, Uzbek, Turkish) for improved cross-lingual transfer
- Morphology-aware tokenization to replace standard BPE
- An expanded dataset with informal and conversational Kyrgyz text
- Integration with Kyrgyz ASR systems for end-to-end speech processing

---

## Citation

```bibtex
@techreport{uvalieva2024punctuation,
  author      = {Uvalieva, Zarina and Muhametjanova, Gulshat},
  title       = {AI-Based Punctuation Restoration using Transformer Model for Kyrgyz Language},
  year        = {2024},
  institution = {Kyrgyz-Turkish Manas University}
}
```

---

## Author

**Zarina Uvalieva**, ML engineer specializing in NLP and speech technologies for low-resource languages.

- 🤗 [Hugging Face](https://huggingface.co/Zarinaaa)
- 📧 zarina.uvalievaa@gmail.com