---
license: apache-2.0
tags:
  - xlm-roberta
  - punctuation-restoration
  - kyrgyz
  - nlp
  - onnx
  - transformer
  - low-resource-languages
  - asr-postprocessing
  - token-classification
language:
  - ky
pipeline_tag: token-classification
datasets:
  - custom
metrics:
  - precision
  - recall
  - f1
---

# Kyrgyz Punctuation Restoration β€” XLM-RoBERTa

**The first punctuation restoration model for the Kyrgyz language**, achieving **94.1% precision** and **90.3% F1-score** β€” surpassing published results for comparable low-resource languages.

πŸ“„ **Published research:** *"AI-Based Punctuation Restoration using Transformer Model for Kyrgyz Language"* β€” Uvalieva Z., Muhametjanova G. (SCOPUS-indexed)

---

## Highlights

- πŸ† **F1-score: 90.3%** β€” outperforms comparable low-resource language models
- 🌍 **First-of-its-kind** for Kyrgyz (Turkic language family, ~7M speakers)
- ⚑ **ONNX format** β€” optimized for fast inference across frameworks
- πŸŽ™οΈ **ASR post-processing** β€” designed to restore punctuation in speech-to-text output

---

## Performance

| Metric | Score |
|--------|-------|
| **Precision** | 94.1% |
| **Recall** | 86.8% |
| **F1-Score** | 90.3% |

### Cross-Lingual Comparison

| Model | Language | F1-Score |
|-------|----------|----------|
| **Ours (XLM-RoBERTa)** | **Kyrgyz** | **90.3%** |
| Alam et al. (2020) | English (clean) | 87.0% |
| Alam et al. (2020) | Bangla | 69.5% |
| Nagy et al. (2021) | Hungarian | ~82.0% |

The model demonstrates strong performance on frequent punctuation marks (periods, commas) with reduced accuracy on rare marks (question marks, exclamation points) due to class imbalance.
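For reference, scores in this task are typically micro-averaged over the punctuation classes only, with the `O` (no punctuation) label excluded from true positives. A minimal sketch of that convention (illustrative, not the paper's exact evaluation script):

```python
def punct_scores(gold, pred, ignore="O"):
    """Micro-averaged precision/recall/F1 over punctuation labels.

    Tokens labeled `ignore` (no punctuation) contribute only when the
    model wrongly predicts a mark there (false positive) or misses a
    real one (false negative).
    """
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        if p != ignore and p == g:
            tp += 1
        elif p != ignore and p != g:
            fp += 1
            if g != ignore:
                fn += 1  # a real mark was replaced by the wrong one
        elif p == ignore and g != ignore:
            fn += 1      # a real mark was missed entirely
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```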

---

## Model Architecture

| Parameter | Value |
|-----------|-------|
| Base model | XLM-RoBERTa-base |
| Parameters | ~270M |
| Transformer layers | 12 |
| Hidden dimensions | 768 |
| Attention heads | 12 |
| Export format | ONNX |

---

## Training Details

### Dataset

A custom-built **200 MB** Kyrgyz text corpus, collected over 2 months:

| Source | Size | Description |
|--------|------|-------------|
| Kyrgyz-Turkish Manas University Library | 135 MB | Books (literature, math, physics) |
| Kyrgyz Wikipedia | 40 MB | Encyclopedia articles |
| News portals | 25 MB | Journalistic text |

**Preprocessing pipeline:** PDF β†’ EasyOCR text extraction β†’ manual cleaning β†’ JSON formatting with punctuation labels.
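The final labeling step of the pipeline above can be sketched as turning punctuated text into per-word labels; the label inventory matches the model's classes, but the exact on-disk JSON schema is an assumption:

```python
# Map trailing punctuation characters to the model's label names
PUNCT_LABELS = {",": "COMMA", ".": "PERIOD", "?": "QUESTION", "!": "EXCLAMATION"}

def label_tokens(text):
    """Turn punctuated text into (word, label) pairs for token classification."""
    pairs = []
    for word in text.split():
        label = "O"
        while word and word[-1] in PUNCT_LABELS:
            label = PUNCT_LABELS[word[-1]]
            word = word[:-1]
        if word:
            pairs.append((word, label))
    return pairs
```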

### Data Augmentation

Specialized augmentation techniques designed for Kyrgyz agglutinative morphology:

- **Back-translation:** Kyrgyz β†’ English β†’ Kyrgyz (simulating ASR-like errors)
- **Token-level modifications:** Random insertions, deletions, swaps
- **Morphological transformations:** Case form and morpheme modifications preserving grammatical correctness
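The token-level modifications can be sketched as a single perturbation pass; the rates and operations here are illustrative, not the exact augmentation used in training (and in the real pipeline the labels are perturbed in lockstep with the tokens):

```python
import random

def perturb_tokens(tokens, rng, p=0.1):
    """Randomly duplicate, delete, or swap tokens to simulate noisy ASR input."""
    out = list(tokens)
    i = 0
    while i < len(out):
        r = rng.random()
        if r < p and len(out) > 1:             # delete the current token
            out.pop(i)
            continue
        elif r < 2 * p:                        # insert (duplicate) a token
            out.insert(i, out[i])
            i += 2
            continue
        elif r < 3 * p and i + 1 < len(out):   # swap with the next token
            out[i], out[i + 1] = out[i + 1], out[i]
            i += 2
            continue
        i += 1
    return out
```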

### Hyperparameters

| Parameter | Value |
|-----------|-------|
| Batch size | 32 |
| Epochs | 10 |
| Optimizer | Adam |
| Learning rate | 5e-5 |
| Regularization | Dropout |
| Hardware | Google Colab TPU |
| Training time | 42 hours |

---

## How to Use

```python
import onnxruntime as ort
import numpy as np

# Load the ONNX model
session = ort.InferenceSession("model.onnx")

# Prepare input (see config.yaml for tokenizer settings)
# The model predicts punctuation labels for each token:
# O (no punctuation), COMMA, PERIOD, QUESTION, EXCLAMATION

# Example inference
input_text = "Π±ΡƒΠ» ΠΊΡ‹Ρ€Π³Ρ‹Π· Ρ‚ΠΈΠ»ΠΈΠ½Π΄Π΅Π³ΠΈ тСкст"
# Tokenize and run inference (see main.py for full pipeline)
```
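A fuller sketch of the pipeline, under stated assumptions: the ONNX input names (`input_ids`, `attention_mask`), the label index order, and the use of the stock `xlm-roberta-base` tokenizer are all guesses β€” check `config.yaml` and `main.py` for the actual values:

```python
ID2LABEL = ["O", "COMMA", "PERIOD", "QUESTION", "EXCLAMATION"]  # assumed order
MARKS = {"COMMA": ",", "PERIOD": ".", "QUESTION": "?", "EXCLAMATION": "!"}

def restore(words, labels):
    """Reattach predicted punctuation marks to the corresponding words."""
    return " ".join(w + MARKS.get(l, "") for w, l in zip(words, labels))

def punctuate(text, session, tokenizer):
    """Predict one label per word and reinsert the punctuation."""
    words = text.split()
    enc = tokenizer(words, is_split_into_words=True, return_tensors="np")
    logits = session.run(None, {"input_ids": enc["input_ids"],
                                "attention_mask": enc["attention_mask"]})[0]
    pred = logits.argmax(-1)[0]
    # keep the prediction of each word's first sub-token only
    labels, seen = [], set()
    for idx, wid in enumerate(enc.word_ids(0)):
        if wid is not None and wid not in seen:
            seen.add(wid)
            labels.append(ID2LABEL[pred[idx]])
    return restore(words, labels)

# Typical wiring (file and tokenizer names are assumptions):
# import onnxruntime as ort
# from transformers import AutoTokenizer
# session = ort.InferenceSession("model.onnx")
# tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
# print(punctuate("бул кыргыз тилиндеги текст", session, tokenizer))
```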

### Repository Structure

```
β”œβ”€β”€ model.onnx           # Trained model in ONNX format (1.11 GB)
β”œβ”€β”€ main.py              # Inference pipeline
β”œβ”€β”€ env.py               # Environment configuration
β”œβ”€β”€ config.yaml          # Hyperparameters and model config
β”œβ”€β”€ requirements.txt     # Python dependencies
└── Files/               # Additional model files
```

---

## Intended Use

| Use Case | Description |
|----------|-------------|
| **ASR post-processing** | Restore punctuation in speech-to-text output for Kyrgyz |
| **Text normalization** | Clean and format raw Kyrgyz text with proper punctuation |
| **NLP preprocessing** | Improve downstream task performance (NER, MT, summarization) |
| **Accessibility** | Enhance readability of automatically generated Kyrgyz content |

---

## Limitations

- **Rare punctuation marks:** Lower accuracy on question marks and exclamation points due to class imbalance in training data
- **Formal text bias:** Trained primarily on literary/formal text; performance on informal/conversational text (social media, chat) may be lower
- **Morpheme boundary errors:** Occasional difficulty placing punctuation in complex agglutinative constructions
- **Domain specificity:** Best performance on prose-style text; specialized domains may require additional fine-tuning

---

## Future Directions

- Joint training with related Turkic languages (Kazakh, Uzbek, Turkish) for improved cross-lingual transfer
- Morphology-aware tokenization to replace standard BPE
- Expanded dataset with informal and conversational Kyrgyz text
- Integration with Kyrgyz ASR systems for end-to-end speech processing

---

## Citation

```bibtex
@article{uvalieva2024punctuation,
  author    = {Uvalieva, Zarina and Muhametjanova, Gulshat},
  title     = {AI-Based Punctuation Restoration using Transformer Model for Kyrgyz Language},
  year      = {2024},
  institution = {Kyrgyz-Turkish Manas University}
}
```

---

## Author

**Zarina Uvalieva** β€” ML Engineer specializing in NLP and Speech Technologies for low-resource languages.

- πŸ€— [HuggingFace](https://huggingface.co/Zarinaaa)
- πŸ“§ zarina.uvalievaa@gmail.com