---
license: apache-2.0
tags:
- xlm-roberta
- punctuation-restoration
- kyrgyz
- nlp
- onnx
- transformer
- low-resource-languages
- asr-postprocessing
- token-classification
language:
- ky
pipeline_tag: token-classification
datasets:
- custom
metrics:
- precision
- recall
- f1
---

# Kyrgyz Punctuation Restoration β€” XLM-RoBERTa

**The first punctuation restoration model for the Kyrgyz language**, achieving **94.1% precision** and a **90.3% F1-score** β€” surpassing reported benchmarks for several other low-resource languages.

πŸ“„ **Published research:** *"AI-Based Punctuation Restoration using Transformer Model for Kyrgyz Language"* β€” Uvalieva Z., Muhametjanova G. (Scopus-indexed)

---

## Highlights

- πŸ† **F1-score: 90.3%** β€” outperforms comparable low-resource language models
- 🌍 **First of its kind** for Kyrgyz (Turkic language family, ~7M speakers)
- ⚑ **ONNX format** β€” optimized for fast inference across frameworks
- πŸŽ™οΈ **ASR post-processing** β€” designed to restore punctuation in speech-to-text output

---

## Performance

| Metric | Score |
|--------|-------|
| **Precision** | 94.1% |
| **Recall** | 86.8% |
| **F1-Score** | 90.3% |

### Cross-Lingual Comparison

| Model | Language | F1-Score |
|-------|----------|----------|
| **Ours (XLM-RoBERTa)** | **Kyrgyz** | **90.3%** |
| Alam et al. (2020) | English (clean) | 87.0% |
| Alam et al. (2020) | Bangla | 69.5% |
| Nagy et al. (2021) | Hungarian | ~82.0% |

The model demonstrates strong performance on frequent punctuation marks (periods, commas), with reduced accuracy on rare marks (question marks, exclamation points) due to class imbalance.
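Because the task is framed as token classification, each word receives one punctuation label, and the punctuated sentence is rebuilt by re-attaching the corresponding mark during decoding. The sketch below illustrates that decoding step only; `attach_punctuation` and the label-to-mark mapping are hypothetical helpers for illustration, not code from this repository (the label names follow the set listed in the usage example).

```python
# Illustrative decoding step: attach predicted punctuation labels to tokens.
# NOTE: attach_punctuation and PUNCT are hypothetical helpers, not repository code.
PUNCT = {"COMMA": ",", "PERIOD": ".", "QUESTION": "?", "EXCLAMATION": "!"}

def attach_punctuation(tokens, labels):
    """Rebuild punctuated text from word tokens and per-token labels."""
    # "O" (no punctuation) maps to an empty string via dict.get's default.
    return " ".join(tok + PUNCT.get(lab, "") for tok, lab in zip(tokens, labels))

tokens = ["Π±ΡƒΠ»", "ΠΊΡ‹Ρ€Π³Ρ‹Π·", "Ρ‚ΠΈΠ»ΠΈΠ½Π΄Π΅Π³ΠΈ", "тСкст"]
labels = ["O", "O", "O", "PERIOD"]
print(attach_punctuation(tokens, labels))
# β†’ Π±ΡƒΠ» ΠΊΡ‹Ρ€Π³Ρ‹Π· Ρ‚ΠΈΠ»ΠΈΠ½Π΄Π΅Π³ΠΈ тСкст.
```

In practice the model predicts labels over subword pieces, so predictions are first aggregated back to word level (e.g. taking the label of a word's first subword) before this step.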
---

## Model Architecture

| Parameter | Value |
|-----------|-------|
| Base model | XLM-RoBERTa-base |
| Parameters | ~270M |
| Transformer layers | 12 |
| Hidden dimensions | 768 |
| Attention heads | 12 |
| Export format | ONNX |

---

## Training Details

### Dataset

A custom-built **200 MB** Kyrgyz text corpus, collected over 2 months:

| Source | Size | Description |
|--------|------|-------------|
| Kyrgyz-Turkish Manas University Library | 135 MB | Books (literature, math, physics) |
| Kyrgyz Wikipedia | 40 MB | Encyclopedia articles |
| News portals | 25 MB | Journalistic text |

**Preprocessing pipeline:** PDF β†’ EasyOCR text extraction β†’ manual cleaning β†’ JSON formatting with punctuation labels.

### Data Augmentation

Specialized augmentation techniques designed for Kyrgyz agglutinative morphology:

- **Back-translation:** Kyrgyz β†’ English β†’ Kyrgyz (simulating ASR-like errors)
- **Token-level modifications:** Random insertions, deletions, swaps
- **Morphological transformations:** Case form and morpheme modifications preserving grammatical correctness

### Hyperparameters

| Parameter | Value |
|-----------|-------|
| Batch size | 32 |
| Epochs | 10 |
| Optimizer | Adam |
| Learning rate | 5e-5 |
| Regularization | Dropout |
| Hardware | Google Colab TPU |
| Training time | 42 hours |

---

## How to Use

```python
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer  # XLM-RoBERTa tokenizer

# Load the ONNX model
session = ort.InferenceSession("model.onnx")

# The model predicts one punctuation label per token:
# O (no punctuation), COMMA, PERIOD, QUESTION, EXCLAMATION

# Example inference (exact input/output names depend on the export;
# see config.yaml for tokenizer settings and main.py for the full pipeline)
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
input_text = "Π±ΡƒΠ» ΠΊΡ‹Ρ€Π³Ρ‹Π· Ρ‚ΠΈΠ»ΠΈΠ½Π΄Π΅Π³ΠΈ тСкст"
encoded = tokenizer(input_text, return_tensors="np")
logits = session.run(
    None,
    {"input_ids": encoded["input_ids"], "attention_mask": encoded["attention_mask"]},
)[0]
label_ids = np.argmax(logits, axis=-1)  # one label id per subword token
```

### Repository Structure

```
β”œβ”€β”€ model.onnx         # Trained model in ONNX format (1.11 GB)
β”œβ”€β”€ main.py            # Inference pipeline
β”œβ”€β”€ env.py             # Environment configuration
β”œβ”€β”€ config.yaml        # Hyperparameters and model config
β”œβ”€β”€ requirements.txt   # Python dependencies
└── Files/             # Additional model files
```

---

## Intended Use

| Use Case | Description |
|----------|-------------|
| **ASR post-processing** | Restore punctuation in speech-to-text output for Kyrgyz |
| **Text normalization** | Clean and format raw Kyrgyz text with proper punctuation |
| **NLP preprocessing** | Improve downstream task performance (NER, MT, summarization) |
| **Accessibility** | Enhance readability of automatically generated Kyrgyz content |

---

## Limitations

- **Rare punctuation marks:** Lower accuracy on question marks and exclamation points due to class imbalance in the training data
- **Formal text bias:** Trained primarily on literary/formal text; performance on informal, conversational text (social media, chat) may be lower
- **Morpheme boundary errors:** Occasional difficulty placing punctuation in complex agglutinative constructions
- **Domain specificity:** Best performance on prose-style text; specialized domains may require additional fine-tuning

---

## Future Directions

- Joint training with related Turkic languages (Kazakh, Uzbek, Turkish) for improved cross-lingual transfer
- Morphology-aware tokenization to replace standard BPE
- An expanded dataset with informal and conversational Kyrgyz text
- Integration with Kyrgyz ASR systems for end-to-end speech processing

---

## Citation

```bibtex
@article{uvalieva2024punctuation,
  author      = {Uvalieva, Zarina and Muhametjanova, Gulshat},
  title       = {AI-Based Punctuation Restoration using Transformer Model for Kyrgyz Language},
  year        = {2024},
  institution = {Kyrgyz-Turkish Manas University}
}
```

---

## Author

**Zarina Uvalieva** β€” ML Engineer specializing in NLP and speech technologies for low-resource languages.

- πŸ€— [HuggingFace](https://huggingface.co/Zarinaaa)
- πŸ“§ zarina.uvalievaa@gmail.com