---
license: apache-2.0
tags:
- xlm-roberta
- punctuation-restoration
- kyrgyz
- nlp
- onnx
- transformer
- low-resource-languages
- asr-postprocessing
- token-classification
language:
- ky
pipeline_tag: token-classification
datasets:
- custom
metrics:
- precision
- recall
- f1
---
# Kyrgyz Punctuation Restoration — XLM-RoBERTa
**The first punctuation restoration model for the Kyrgyz language**, achieving **94.1% precision** and **90.3% F1-score** — surpassing reported benchmarks for several other low-resource languages.
📄 **Published research:** *"AI-Based Punctuation Restoration using Transformer Model for Kyrgyz Language"* — Uvalieva Z., Muhametjanova G. (SCOPUS-indexed)
---
## Highlights
- πŸ† **F1-score: 90.3%** β€” outperforms comparable low-resource language models
- 🌍 **First-of-its-kind** for Kyrgyz (Turkic language family, ~7M speakers)
- ⚑ **ONNX format** β€” optimized for fast inference across frameworks
- πŸŽ™οΈ **ASR post-processing** β€” designed to restore punctuation in speech-to-text output
---
## Performance
| Metric | Score |
|--------|-------|
| **Precision** | 94.1% |
| **Recall** | 86.8% |
| **F1-Score** | 90.3% |
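The F1-score is the harmonic mean of precision and recall, which can be verified directly from the two numbers above:

```python
# F1 as the harmonic mean of precision and recall
p, r = 0.941, 0.868
f1 = 2 * p * r / (p + r)
print(round(f1, 3))  # 0.903
```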
### Cross-Lingual Comparison
| Model | Language | F1-Score |
|-------|----------|----------|
| **Ours (XLM-RoBERTa)** | **Kyrgyz** | **90.3%** |
| Alam et al. (2020) | English (clean) | 87.0% |
| Alam et al. (2020) | Bangla | 69.5% |
| Nagy et al. (2021) | Hungarian | ~82.0% |
The model demonstrates strong performance on frequent punctuation marks (periods, commas) with reduced accuracy on rare marks (question marks, exclamation points) due to class imbalance.
---
## Model Architecture
| Parameter | Value |
|-----------|-------|
| Base model | XLM-RoBERTa-base |
| Parameters | ~270M |
| Transformer layers | 12 |
| Hidden dimensions | 768 |
| Attention heads | 12 |
| Export format | ONNX |
---
## Training Details
### Dataset
A custom-built **200 MB** Kyrgyz text corpus, collected over 2 months:
| Source | Size | Description |
|--------|------|-------------|
| Kyrgyz-Turkish Manas University Library | 135 MB | Books (literature, math, physics) |
| Kyrgyz Wikipedia | 40 MB | Encyclopedia articles |
| News portals | 25 MB | Journalistic text |
**Preprocessing pipeline:** PDF → EasyOCR text extraction → manual cleaning → JSON formatting with punctuation labels.
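The final JSON labeling step can be sketched as follows (a simplified illustration, not the project's actual script): each word is stripped of trailing punctuation and paired with one of the model's labels.

```python
import json

# Punctuation marks mapped to the model's label set (O = no punctuation)
PUNCT_LABELS = {",": "COMMA", ".": "PERIOD", "?": "QUESTION", "!": "EXCLAMATION"}

def label_sentence(sentence):
    """Strip trailing punctuation from each word and record its label."""
    pairs = []
    for raw in sentence.split():
        label = "O"
        while raw and raw[-1] in PUNCT_LABELS:
            label = PUNCT_LABELS[raw[-1]]
            raw = raw[:-1]
        if raw:
            pairs.append({"token": raw, "label": label})
    return pairs

print(json.dumps(label_sentence("Саламатсызбы, кандайсыз?"), ensure_ascii=False))
```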
### Data Augmentation
Specialized augmentation techniques designed for Kyrgyz agglutinative morphology:
- **Back-translation:** Kyrgyz → English → Kyrgyz (simulating ASR-like errors)
- **Token-level modifications:** Random insertions, deletions, swaps
- **Morphological transformations:** Case form and morpheme modifications preserving grammatical correctness
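The token-level modifications can be illustrated with a small sketch (the exact operations and sampling scheme here are illustrative assumptions; the project's actual augmentation code is not included in this repository):

```python
import random

def augment(tokens, rng):
    """Apply one random token-level edit: deletion, duplicate-insertion, or swap."""
    tokens = list(tokens)
    op = rng.choice(["delete", "insert", "swap"])
    i = rng.randrange(len(tokens))
    if op == "delete" and len(tokens) > 1:
        del tokens[i]
    elif op == "insert":
        tokens.insert(i, tokens[i])  # duplicate a token as noise
    elif op == "swap" and len(tokens) > 1:
        j = (i + 1) % len(tokens)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens

print(augment(["бул", "кыргыз", "текст"], random.Random(0)))
```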
### Hyperparameters
| Parameter | Value |
|-----------|-------|
| Batch size | 32 |
| Epochs | 10 |
| Optimizer | Adam |
| Learning rate | 5e-5 |
| Regularization | Dropout |
| Hardware | Google Colab TPU |
| Training time | 42 hours |
---
## How to Use
```python
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

# Load the ONNX model
session = ort.InferenceSession("model.onnx")

# The model predicts one punctuation label per token:
# O (no punctuation), COMMA, PERIOD, QUESTION, EXCLAMATION

# Tokenize (base XLM-RoBERTa tokenizer assumed; see config.yaml for the exact settings)
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
input_text = "бул кыргыз тилиндеги текст"
encoded = tokenizer(input_text, return_tensors="np")

# Run inference and take the highest-scoring label per token
# (see main.py for the full pipeline, including label decoding)
inputs = {i.name: encoded[i.name] for i in session.get_inputs() if i.name in encoded}
logits = session.run(None, inputs)[0]
label_ids = np.argmax(logits, axis=-1)
```
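Once per-token labels are predicted, the punctuation can be reinserted into the text. A minimal sketch, assuming word-level labels have already been obtained from the model's output:

```python
# Map each label to the punctuation mark it restores (label set from this card)
PUNCT = {"O": "", "COMMA": ",", "PERIOD": ".", "QUESTION": "?", "EXCLAMATION": "!"}

def restore(words, labels):
    """Attach the predicted punctuation mark to each word."""
    return " ".join(w + PUNCT[l] for w, l in zip(words, labels))

words = ["бул", "кыргыз", "тилиндеги", "текст"]
labels = ["O", "O", "O", "PERIOD"]
print(restore(words, labels))  # бул кыргыз тилиндеги текст.
```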
### Repository Structure
```
├── model.onnx         # Trained model in ONNX format (1.11 GB)
├── main.py            # Inference pipeline
├── env.py             # Environment configuration
├── config.yaml        # Hyperparameters and model config
├── requirements.txt   # Python dependencies
└── Files/             # Additional model files
```
---
## Intended Use
| Use Case | Description |
|----------|-------------|
| **ASR post-processing** | Restore punctuation in speech-to-text output for Kyrgyz |
| **Text normalization** | Clean and format raw Kyrgyz text with proper punctuation |
| **NLP preprocessing** | Improve downstream task performance (NER, MT, summarization) |
| **Accessibility** | Enhance readability of automatically generated Kyrgyz content |
---
## Limitations
- **Rare punctuation marks:** Lower accuracy on question marks and exclamation points due to class imbalance in training data
- **Formal text bias:** Trained primarily on literary/formal text; performance on informal/conversational text (social media, chat) may be lower
- **Morpheme boundary errors:** Occasional difficulty placing punctuation in complex agglutinative constructions
- **Domain specificity:** Best performance on prose-style text; specialized domains may require additional fine-tuning
---
## Future Directions
- Joint training with related Turkic languages (Kazakh, Uzbek, Turkish) for improved cross-lingual transfer
- Morphology-aware tokenization to replace standard BPE
- Expanded dataset with informal and conversational Kyrgyz text
- Integration with Kyrgyz ASR systems for end-to-end speech processing
---
## Citation
```bibtex
@article{uvalieva2024punctuation,
  author      = {Uvalieva, Zarina and Muhametjanova, Gulshat},
  title       = {AI-Based Punctuation Restoration using Transformer Model for Kyrgyz Language},
  year        = {2024},
  institution = {Kyrgyz-Turkish Manas University}
}
```
---
## Author
**Zarina Uvalieva** — ML Engineer specializing in NLP and Speech Technologies for low-resource languages.
- 🤗 [HuggingFace](https://huggingface.co/Zarinaaa)
- 📧 zarina.uvalievaa@gmail.com