---
license: apache-2.0
tags:
- xlm-roberta
- punctuation-restoration
- kyrgyz
- nlp
- onnx
- transformer
- low-resource-languages
- asr-postprocessing
- token-classification
language:
- ky
pipeline_tag: token-classification
datasets:
- custom
metrics:
- precision
- recall
- f1
---
# Kyrgyz Punctuation Restoration with XLM-RoBERTa
**The first punctuation restoration model for the Kyrgyz language**, achieving **94.1% precision** and **90.3% F1-score**, surpassing benchmarks for other low-resource languages.

**Published research:** *"AI-Based Punctuation Restoration using Transformer Model for Kyrgyz Language"* by Uvalieva Z. and Muhametjanova G. (Scopus-indexed)
---
## Highlights
- **F1-score: 90.3%**, outperforming comparable low-resource language models
- **First of its kind** for Kyrgyz (Turkic language family, ~7M speakers)
- **ONNX format**, optimized for fast inference across frameworks
- **ASR post-processing:** designed to restore punctuation in speech-to-text output
---
## Performance
| Metric | Score |
|--------|-------|
| **Precision** | 94.1% |
| **Recall** | 86.8% |
| **F1-Score** | 90.3% |
### Cross-Lingual Comparison
| Model | Language | F1-Score |
|-------|----------|----------|
| **Ours (XLM-RoBERTa)** | **Kyrgyz** | **90.3%** |
| Alam et al. (2020) | English (clean) | 87.0% |
| Alam et al. (2020) | Bangla | 69.5% |
| Nagy et al. (2021) | Hungarian | ~82.0% |
The model demonstrates strong performance on frequent punctuation marks (periods, commas) with reduced accuracy on rare marks (question marks, exclamation points) due to class imbalance.
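This imbalance effect is easiest to see with per-class scores. A minimal, self-contained sketch of per-class precision/recall/F1 (the toy label sequences below are illustrative, not the model's actual outputs; label names follow the model's tag set):

```python
def per_class_f1(gold, pred, label):
    """Precision, recall, and F1 for a single punctuation class."""
    tp = sum(1 for g, p in zip(gold, pred) if g == p == label)
    fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = ["O", "COMMA", "O", "PERIOD", "O", "QUESTION"]
pred = ["O", "COMMA", "O", "PERIOD", "O", "O"]
print(per_class_f1(gold, pred, "COMMA"))     # (1.0, 1.0, 1.0)
print(per_class_f1(gold, pred, "QUESTION"))  # (0.0, 0.0, 0.0) -- rare class missed
```

A single missed question mark drives that class's F1 to zero, while the aggregate score stays high because frequent classes dominate.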
---
## Model Architecture
| Parameter | Value |
|-----------|-------|
| Base model | XLM-RoBERTa-base |
| Parameters | ~270M |
| Transformer layers | 12 |
| Hidden dimensions | 768 |
| Attention heads | 12 |
| Export format | ONNX |
---
## Training Details
### Dataset
A custom-built **200 MB** Kyrgyz text corpus, collected over 2 months:
| Source | Size | Description |
|--------|------|-------------|
| Kyrgyz-Turkish Manas University Library | 135 MB | Books (literature, math, physics) |
| Kyrgyz Wikipedia | 40 MB | Encyclopedia articles |
| News portals | 25 MB | Journalistic text |
**Preprocessing pipeline:** PDF → EasyOCR text extraction → manual cleaning → JSON formatting with punctuation labels.
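For illustration, the final labelling step might look like the sketch below: each word is tagged with the punctuation mark that follows it. The exact JSON schema used by the authors is not published, so the field names here are assumptions.

```python
import json

PUNCT_LABELS = {",": "COMMA", ".": "PERIOD", "?": "QUESTION", "!": "EXCLAMATION"}

def label_sentence(sentence):
    """Convert a punctuated sentence into (token, label) pairs."""
    pairs = []
    for word in sentence.split():
        label = "O"
        while word and word[-1] in PUNCT_LABELS:
            label = PUNCT_LABELS[word[-1]]  # keep the innermost trailing mark
            word = word[:-1]
        if word:
            pairs.append({"token": word, "label": label})
    return pairs

example = "Салам, бул текст."  # "Hello, this is a text."
print(json.dumps(label_sentence(example), ensure_ascii=False))
```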
### Data Augmentation
Specialized augmentation techniques designed for Kyrgyz agglutinative morphology:
- **Back-translation:** Kyrgyz → English → Kyrgyz (simulating ASR-like errors)
- **Token-level modifications:** Random insertions, deletions, swaps
- **Morphological transformations:** Case form and morpheme modifications preserving grammatical correctness
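The token-level corruptions can be sketched in a few lines. This is a simplified illustration, not the authors' implementation: the probabilities and the duplication-as-insertion shortcut are assumptions.

```python
import random

def augment_tokens(tokens, p=0.1, seed=0):
    """Noise a token sequence with random deletions, duplications
    (a stand-in for insertions), and one adjacent swap."""
    rng = random.Random(seed)
    out = []
    for tok in tokens:
        r = rng.random()
        if r < p:
            continue              # random deletion
        out.append(tok)
        if r > 1 - p:
            out.append(tok)       # duplication as a simple "insertion"
    if len(out) > 1 and rng.random() < p:
        i = rng.randrange(len(out) - 1)
        out[i], out[i + 1] = out[i + 1], out[i]  # adjacent swap
    return out

tokens = ["бул", "кыргыз", "тилиндеги", "текст"]
print(augment_tokens(tokens, p=0.2, seed=1))
```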
### Hyperparameters
| Parameter | Value |
|-----------|-------|
| Batch size | 32 |
| Epochs | 10 |
| Optimizer | Adam |
| Learning rate | 5e-5 |
| Regularization | Dropout |
| Hardware | Google Colab TPU |
| Training time | 42 hours |
---
## How to Use
```python
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

# Load the ONNX model
session = ort.InferenceSession("model.onnx")

# Tokenize with the XLM-RoBERTa tokenizer (see config.yaml for settings)
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
input_text = "бул кыргыз тилиндеги текст"  # "this is a text in the Kyrgyz language"
encoded = tokenizer(input_text, return_tensors="np")

# The model predicts one punctuation label per token:
# O (no punctuation), COMMA, PERIOD, QUESTION, EXCLAMATION
# Input names may differ; check session.get_inputs() if this fails
outputs = session.run(
    None,
    {"input_ids": encoded["input_ids"], "attention_mask": encoded["attention_mask"]},
)
label_ids = np.argmax(outputs[0], axis=-1)  # one label id per subword token
# See main.py for the full pipeline, including mapping ids back to labels
```
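Once per-token label ids are mapped back to label names, reattaching the marks to the word sequence is straightforward. A minimal sketch (the label-to-mark mapping is an assumption based on the tag set above):

```python
PUNCT = {"O": "", "COMMA": ",", "PERIOD": ".", "QUESTION": "?", "EXCLAMATION": "!"}

def restore(words, labels):
    """Re-attach predicted punctuation marks to the word sequence."""
    return " ".join(w + PUNCT[l] for w, l in zip(words, labels))

print(restore(["бул", "текст"], ["O", "PERIOD"]))  # бул текст.
```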
### Repository Structure
```
├── model.onnx          # Trained model in ONNX format (1.11 GB)
├── main.py             # Inference pipeline
├── env.py              # Environment configuration
├── config.yaml         # Hyperparameters and model config
├── requirements.txt    # Python dependencies
└── Files/              # Additional model files
```
---
## Intended Use
| Use Case | Description |
|----------|-------------|
| **ASR post-processing** | Restore punctuation in speech-to-text output for Kyrgyz |
| **Text normalization** | Clean and format raw Kyrgyz text with proper punctuation |
| **NLP preprocessing** | Improve downstream task performance (NER, MT, summarization) |
| **Accessibility** | Enhance readability of automatically generated Kyrgyz content |
---
## Limitations
- **Rare punctuation marks:** Lower accuracy on question marks and exclamation points due to class imbalance in training data
- **Formal text bias:** Trained primarily on literary/formal text; performance on informal/conversational text (social media, chat) may be lower
- **Morpheme boundary errors:** Occasional difficulty placing punctuation in complex agglutinative constructions
- **Domain specificity:** Best performance on prose-style text; specialized domains may require additional fine-tuning
---
## Future Directions
- Joint training with related Turkic languages (Kazakh, Uzbek, Turkish) for improved cross-lingual transfer
- Morphology-aware tokenization to replace standard BPE
- Expanded dataset with informal and conversational Kyrgyz text
- Integration with Kyrgyz ASR systems for end-to-end speech processing
---
## Citation
```bibtex
@article{uvalieva2024punctuation,
author = {Uvalieva, Zarina and Muhametjanova, Gulshat},
title = {AI-Based Punctuation Restoration using Transformer Model for Kyrgyz Language},
year = {2024},
institution = {Kyrgyz-Turkish Manas University}
}
```
---
## Author
**Zarina Uvalieva**, ML Engineer specializing in NLP and speech technologies for low-resource languages.
- [HuggingFace](https://huggingface.co/Zarinaaa)
- zarina.uvalievaa@gmail.com