---
license: apache-2.0
tags:
- xlm-roberta
- punctuation-restoration
- kyrgyz
- nlp
- onnx
- transformer
- low-resource-languages
- asr-postprocessing
- token-classification
language:
- ky
pipeline_tag: token-classification
datasets:
- custom
metrics:
- precision
- recall
- f1
---

# Kyrgyz Punctuation Restoration – XLM-RoBERTa

**The first punctuation restoration model for the Kyrgyz language**, achieving **94.1% precision** and a **90.3% F1-score**, surpassing published benchmarks for several other low-resource languages.

**Published research:** *"AI-Based Punctuation Restoration using Transformer Model for Kyrgyz Language"*, Uvalieva Z., Muhametjanova G. (Scopus-indexed)

---

## Highlights

- **F1-score: 90.3%**, outperforming comparable low-resource language models
- **First-of-its-kind** for Kyrgyz (Turkic language family, ~7M speakers)
- **ONNX format**, optimized for fast inference across frameworks
- **ASR post-processing**: designed to restore punctuation in speech-to-text output

---

## Performance

| Metric | Score |
|--------|-------|
| **Precision** | 94.1% |
| **Recall** | 86.8% |
| **F1-Score** | 90.3% |
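
The reported F1 is consistent with the precision and recall above; as a quick sanity check of the harmonic-mean formula:

```python
def f1_score(precision: float, recall: float) -> float:
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

f1 = f1_score(0.941, 0.868)
print(round(f1, 3))  # 0.903
```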

### Cross-Lingual Comparison

| Model | Language | F1-Score |
|-------|----------|----------|
| **Ours (XLM-RoBERTa)** | **Kyrgyz** | **90.3%** |
| Alam et al. (2020) | English (clean) | 87.0% |
| Alam et al. (2020) | Bangla | 69.5% |
| Nagy et al. (2021) | Hungarian | ~82.0% |

The model performs strongly on frequent punctuation marks (periods and commas), with reduced accuracy on rarer marks (question marks and exclamation points) due to class imbalance.

---

## Model Architecture

| Parameter | Value |
|-----------|-------|
| Base model | XLM-RoBERTa-base |
| Parameters | ~270M |
| Transformer layers | 12 |
| Hidden dimensions | 768 |
| Attention heads | 12 |
| Export format | ONNX |

---

## Training Details

### Dataset

A custom-built **200 MB** Kyrgyz text corpus, collected over two months:

| Source | Size | Description |
|--------|------|-------------|
| Kyrgyz-Turkish Manas University Library | 135 MB | Books (literature, math, physics) |
| Kyrgyz Wikipedia | 40 MB | Encyclopedia articles |
| News portals | 25 MB | Journalistic text |

**Preprocessing pipeline:** PDF → EasyOCR text extraction → manual cleaning → JSON formatting with punctuation labels.
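
The labeling step can be sketched as follows. The exact JSON schema is not published here, so the helper below is an illustrative assumption that derives per-token labels from already-punctuated text, using the model's tag set (O, COMMA, PERIOD, QUESTION, EXCLAMATION):

```python
import re

# Map each punctuation character to its label name.
PUNCT_LABELS = {",": "COMMA", ".": "PERIOD", "?": "QUESTION", "!": "EXCLAMATION"}

def label_tokens(text: str) -> list[tuple[str, str]]:
    """Turn punctuated text into (word, label) pairs for training."""
    # Split into words and punctuation marks; \w matches Cyrillic by default.
    tokens = re.findall(r"\w+|[,.?!]", text)
    labeled = []
    for i, tok in enumerate(tokens):
        if tok in PUNCT_LABELS:
            continue  # punctuation becomes the previous word's label
        nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
        labeled.append((tok, PUNCT_LABELS.get(nxt, "O")))
    return labeled

print(label_tokens("Саламатсызбы, кандайсыз?"))
# [('Саламатсызбы', 'COMMA'), ('кандайсыз', 'QUESTION')]
```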

### Data Augmentation

Specialized augmentation techniques designed for Kyrgyz agglutinative morphology:

- **Back-translation:** Kyrgyz → English → Kyrgyz (simulating ASR-like errors)
- **Token-level modifications:** random insertions, deletions, and swaps
- **Morphological transformations:** case-form and morpheme modifications that preserve grammatical correctness
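
The token-level modifications can be sketched as below. The probabilities and the duplication-style insertion are illustrative assumptions, not the authors' exact recipe:

```python
import random

def augment(tokens, p=0.1, rng=None):
    """Apply random deletions, insertions, and an adjacent swap."""
    rng = rng or random.Random()
    out = []
    for tok in tokens:
        r = rng.random()
        if r < p:          # deletion: drop the token
            continue
        if r < 2 * p:      # insertion: duplicate the token
            out.append(tok)
        out.append(tok)
    if len(out) > 1 and rng.random() < p:  # swap two adjacent tokens
        j = rng.randrange(len(out) - 1)
        out[j], out[j + 1] = out[j + 1], out[j]
    return out
```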

### Hyperparameters

| Parameter | Value |
|-----------|-------|
| Batch size | 32 |
| Epochs | 10 |
| Optimizer | Adam |
| Learning rate | 5e-5 |
| Regularization | Dropout |
| Hardware | Google Colab TPU |
| Training time | 42 hours |

---

## How to Use

```python
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

# Load the ONNX model
session = ort.InferenceSession("model.onnx")

# The model predicts one punctuation label per token:
# O (no punctuation), COMMA, PERIOD, QUESTION, EXCLAMATION
# (see config.yaml for tokenizer and label settings)
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

# Example inference
input_text = "бул кыргыз тилиндеги текст"
encoded = tokenizer(input_text, return_tensors="np")

# Input names below follow the usual Hugging Face ONNX export;
# if they differ, inspect session.get_inputs() (see main.py for
# the full pipeline)
logits = session.run(None, {
    "input_ids": encoded["input_ids"].astype(np.int64),
    "attention_mask": encoded["attention_mask"].astype(np.int64),
})[0]
label_ids = logits.argmax(axis=-1)  # one label index per subword token
```
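
Mapping predicted labels back onto the token stream is then a simple join. A minimal sketch (the label set matches the comments above; the function itself is illustrative, not the pipeline in main.py):

```python
# Punctuation character for each non-O label.
LABEL_TO_PUNCT = {"COMMA": ",", "PERIOD": ".", "QUESTION": "?", "EXCLAMATION": "!"}

def restore(tokens, labels):
    """Attach the predicted punctuation mark after each word."""
    return " ".join(tok + LABEL_TO_PUNCT.get(lab, "") for tok, lab in zip(tokens, labels))

print(restore(["бул", "кыргыз", "тилиндеги", "текст"],
              ["O", "O", "O", "PERIOD"]))
# бул кыргыз тилиндеги текст.
```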

### Repository Structure

```
├── model.onnx        # Trained model in ONNX format (1.11 GB)
├── main.py           # Inference pipeline
├── env.py            # Environment configuration
├── config.yaml       # Hyperparameters and model config
├── requirements.txt  # Python dependencies
└── Files/            # Additional model files
```

---

## Intended Use

| Use Case | Description |
|----------|-------------|
| **ASR post-processing** | Restore punctuation in speech-to-text output for Kyrgyz |
| **Text normalization** | Clean and format raw Kyrgyz text with proper punctuation |
| **NLP preprocessing** | Improve downstream task performance (NER, MT, summarization) |
| **Accessibility** | Enhance readability of automatically generated Kyrgyz content |

---

## Limitations

- **Rare punctuation marks:** lower accuracy on question marks and exclamation points due to class imbalance in the training data
- **Formal text bias:** trained primarily on literary/formal text; performance on informal or conversational text (social media, chat) may be lower
- **Morpheme boundary errors:** occasional difficulty placing punctuation in complex agglutinative constructions
- **Domain specificity:** best performance on prose-style text; specialized domains may require additional fine-tuning

---

## Future Directions

- Joint training with related Turkic languages (Kazakh, Uzbek, Turkish) for improved cross-lingual transfer
- Morphology-aware tokenization to replace standard BPE
- Expanded dataset with informal and conversational Kyrgyz text
- Integration with Kyrgyz ASR systems for end-to-end speech processing

---

## Citation

```bibtex
@article{uvalieva2024punctuation,
  author      = {Uvalieva, Zarina and Muhametjanova, Gulshat},
  title       = {AI-Based Punctuation Restoration using Transformer Model for Kyrgyz Language},
  year        = {2024},
  institution = {Kyrgyz-Turkish Manas University}
}
```

---

## Author

**Zarina Uvalieva**, ML Engineer specializing in NLP and speech technologies for low-resource languages.

- [HuggingFace](https://huggingface.co/Zarinaaa)
- zarina.uvalievaa@gmail.com