---
license: apache-2.0
tags:
  - xlm-roberta
  - punctuation-restoration
  - kyrgyz
  - nlp
  - onnx
  - transformer
  - low-resource-languages
  - asr-postprocessing
  - token-classification
language:
  - ky
pipeline_tag: token-classification
datasets:
  - custom
metrics:
  - precision
  - recall
  - f1
---

# Kyrgyz Punctuation Restoration β€” XLM-RoBERTa

**The first punctuation restoration model for the Kyrgyz language**, achieving **94.1% precision** and **90.3% F1-score** β€” surpassing published results for comparable low-resource languages.

πŸ“„ **Published research:** *"AI-Based Punctuation Restoration using Transformer Model for Kyrgyz Language"* β€” Uvalieva Z., Muhametjanova G. (SCOPUS-indexed)

---

## Highlights

- πŸ† **F1-score: 90.3%** β€” outperforms comparable low-resource language models
- 🌍 **First-of-its-kind** for Kyrgyz (Turkic language family, ~7M speakers)
- ⚑ **ONNX format** β€” optimized for fast inference across frameworks
- πŸŽ™οΈ **ASR post-processing** β€” designed to restore punctuation in speech-to-text output

---

## Performance

| Metric | Score |
|--------|-------|
| **Precision** | 94.1% |
| **Recall** | 86.8% |
| **F1-Score** | 90.3% |

### Cross-Lingual Comparison

| Model | Language | F1-Score |
|-------|----------|----------|
| **Ours (XLM-RoBERTa)** | **Kyrgyz** | **90.3%** |
| Alam et al. (2020) | English (clean) | 87.0% |
| Alam et al. (2020) | Bangla | 69.5% |
| Nagy et al. (2021) | Hungarian | ~82.0% |

The model demonstrates strong performance on frequent punctuation marks (periods, commas) with reduced accuracy on rare marks (question marks, exclamation points) due to class imbalance.
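For reference, scores in this task are typically micro-averaged over the punctuation classes only, with the `O` (no punctuation) label excluded from true positives. A minimal sketch of that convention (illustrative, not the paper's exact evaluation script):

```python
def punct_scores(gold, pred, ignore="O"):
    """Micro-averaged precision/recall/F1 over punctuation labels.

    Tokens labeled `ignore` (no punctuation) contribute only when the
    model wrongly predicts a mark there (false positive) or misses a
    real one (false negative).
    """
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        if p != ignore and p == g:
            tp += 1
        elif p != ignore and p != g:
            fp += 1
            if g != ignore:
                fn += 1  # a real mark was replaced by the wrong one
        elif p == ignore and g != ignore:
            fn += 1      # a real mark was missed entirely
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```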

---

## Model Architecture

| Parameter | Value |
|-----------|-------|
| Base model | XLM-RoBERTa-base |
| Parameters | ~270M |
| Transformer layers | 12 |
| Hidden dimensions | 768 |
| Attention heads | 12 |
| Export format | ONNX |

---

## Training Details

### Dataset

A custom-built **200 MB** Kyrgyz text corpus, collected over 2 months:

| Source | Size | Description |
|--------|------|-------------|
| Kyrgyz-Turkish Manas University Library | 135 MB | Books (literature, math, physics) |
| Kyrgyz Wikipedia | 40 MB | Encyclopedia articles |
| News portals | 25 MB | Journalistic text |

**Preprocessing pipeline:** PDF β†’ EasyOCR text extraction β†’ manual cleaning β†’ JSON formatting with punctuation labels.
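The final labeling step of the pipeline above can be sketched as turning punctuated text into per-word labels; the label inventory matches the model's classes, but the exact on-disk JSON schema is an assumption:

```python
# Map trailing punctuation characters to the model's label names
PUNCT_LABELS = {",": "COMMA", ".": "PERIOD", "?": "QUESTION", "!": "EXCLAMATION"}

def label_tokens(text):
    """Turn punctuated text into (word, label) pairs for token classification."""
    pairs = []
    for word in text.split():
        label = "O"
        while word and word[-1] in PUNCT_LABELS:
            label = PUNCT_LABELS[word[-1]]
            word = word[:-1]
        if word:
            pairs.append((word, label))
    return pairs
```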

### Data Augmentation

Specialized augmentation techniques designed for Kyrgyz agglutinative morphology:

- **Back-translation:** Kyrgyz β†’ English β†’ Kyrgyz (simulating ASR-like errors)
- **Token-level modifications:** Random insertions, deletions, swaps
- **Morphological transformations:** Case form and morpheme modifications preserving grammatical correctness
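The token-level modifications can be sketched as a single perturbation pass; the rates and operations here are illustrative, not the exact augmentation used in training (and in the real pipeline the labels are perturbed in lockstep with the tokens):

```python
import random

def perturb_tokens(tokens, rng, p=0.1):
    """Randomly duplicate, delete, or swap tokens to simulate noisy ASR input."""
    out = list(tokens)
    i = 0
    while i < len(out):
        r = rng.random()
        if r < p and len(out) > 1:             # delete the current token
            out.pop(i)
            continue
        elif r < 2 * p:                        # insert (duplicate) a token
            out.insert(i, out[i])
            i += 2
            continue
        elif r < 3 * p and i + 1 < len(out):   # swap with the next token
            out[i], out[i + 1] = out[i + 1], out[i]
            i += 2
            continue
        i += 1
    return out
```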

### Hyperparameters

| Parameter | Value |
|-----------|-------|
| Batch size | 32 |
| Epochs | 10 |
| Optimizer | Adam |
| Learning rate | 5e-5 |
| Regularization | Dropout |
| Hardware | Google Colab TPU |
| Training time | 42 hours |

---

## How to Use

```python
import onnxruntime as ort
import numpy as np

# Load the ONNX model
session = ort.InferenceSession("model.onnx")

# Prepare input (see config.yaml for tokenizer settings)
# The model predicts punctuation labels for each token:
# O (no punctuation), COMMA, PERIOD, QUESTION, EXCLAMATION

# Example inference
input_text = "Π±ΡƒΠ» ΠΊΡ‹Ρ€Π³Ρ‹Π· Ρ‚ΠΈΠ»ΠΈΠ½Π΄Π΅Π³ΠΈ тСкст"
# Tokenize and run inference (see main.py for full pipeline)
```
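A fuller sketch of the pipeline, under stated assumptions: the ONNX input names (`input_ids`, `attention_mask`), the label index order, and the use of the stock `xlm-roberta-base` tokenizer are all guesses β€” check `config.yaml` and `main.py` for the actual values:

```python
ID2LABEL = ["O", "COMMA", "PERIOD", "QUESTION", "EXCLAMATION"]  # assumed order
MARKS = {"COMMA": ",", "PERIOD": ".", "QUESTION": "?", "EXCLAMATION": "!"}

def restore(words, labels):
    """Reattach predicted punctuation marks to the corresponding words."""
    return " ".join(w + MARKS.get(l, "") for w, l in zip(words, labels))

def punctuate(text, session, tokenizer):
    """Predict one label per word and reinsert the punctuation."""
    words = text.split()
    enc = tokenizer(words, is_split_into_words=True, return_tensors="np")
    logits = session.run(None, {"input_ids": enc["input_ids"],
                                "attention_mask": enc["attention_mask"]})[0]
    pred = logits.argmax(-1)[0]
    # keep the prediction of each word's first sub-token only
    labels, seen = [], set()
    for idx, wid in enumerate(enc.word_ids(0)):
        if wid is not None and wid not in seen:
            seen.add(wid)
            labels.append(ID2LABEL[pred[idx]])
    return restore(words, labels)

# Typical wiring (file and tokenizer names are assumptions):
# import onnxruntime as ort
# from transformers import AutoTokenizer
# session = ort.InferenceSession("model.onnx")
# tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
# print(punctuate("бул кыргыз тилиндеги текст", session, tokenizer))
```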

### Repository Structure

```
β”œβ”€β”€ model.onnx           # Trained model in ONNX format (1.11 GB)
β”œβ”€β”€ main.py              # Inference pipeline
β”œβ”€β”€ env.py               # Environment configuration
β”œβ”€β”€ config.yaml          # Hyperparameters and model config
β”œβ”€β”€ requirements.txt     # Python dependencies
└── Files/               # Additional model files
```

---

## Intended Use

| Use Case | Description |
|----------|-------------|
| **ASR post-processing** | Restore punctuation in speech-to-text output for Kyrgyz |
| **Text normalization** | Clean and format raw Kyrgyz text with proper punctuation |
| **NLP preprocessing** | Improve downstream task performance (NER, MT, summarization) |
| **Accessibility** | Enhance readability of automatically generated Kyrgyz content |

---

## Limitations

- **Rare punctuation marks:** Lower accuracy on question marks and exclamation points due to class imbalance in training data
- **Formal text bias:** Trained primarily on literary/formal text; performance on informal/conversational text (social media, chat) may be lower
- **Morpheme boundary errors:** Occasional difficulty placing punctuation in complex agglutinative constructions
- **Domain specificity:** Best performance on prose-style text; specialized domains may require additional fine-tuning

---

## Future Directions

- Joint training with related Turkic languages (Kazakh, Uzbek, Turkish) for improved cross-lingual transfer
- Morphology-aware tokenization to replace standard BPE
- Expanded dataset with informal and conversational Kyrgyz text
- Integration with Kyrgyz ASR systems for end-to-end speech processing

---

## Citation

```bibtex
@article{uvalieva2024punctuation,
  author    = {Uvalieva, Zarina and Muhametjanova, Gulshat},
  title     = {AI-Based Punctuation Restoration using Transformer Model for Kyrgyz Language},
  year      = {2024},
  institution = {Kyrgyz-Turkish Manas University}
}
```

---

## Author

**Zarina Uvalieva** β€” ML Engineer specializing in NLP and Speech Technologies for low-resource languages.

- πŸ€— [HuggingFace](https://huggingface.co/Zarinaaa)
- πŸ“§ zarina.uvalievaa@gmail.com