---
license: apache-2.0
tags:
- xlm-roberta
- punctuation-restoration
- kyrgyz
- nlp
- onnx
- transformer
- low-resource-languages
- asr-postprocessing
- token-classification
language:
- ky
pipeline_tag: token-classification
datasets:
- custom
metrics:
- precision
- recall
- f1
---

# Kyrgyz Punctuation Restoration with XLM-RoBERTa

**The first punctuation restoration model for the Kyrgyz language**, achieving **94.1% precision** and a **90.3% F1-score**, surpassing reported benchmarks for other low-resource languages.

📄 **Published research:** *"AI-Based Punctuation Restoration using Transformer Model for Kyrgyz Language"*, Uvalieva Z., Muhametjanova G. (Scopus-indexed)

---

## Highlights

- 🏆 **F1-score of 90.3%**, outperforming comparable low-resource language models
- 🌍 **First of its kind** for Kyrgyz (Turkic language family, ~7M speakers)
- ⚡ **ONNX format**, optimized for fast inference across frameworks
- 🎙️ **ASR post-processing**, designed to restore punctuation in speech-to-text output

---

## Performance

| Metric | Score |
|--------|-------|
| **Precision** | 94.1% |
| **Recall** | 86.8% |
| **F1-score** | 90.3% |

### Cross-Lingual Comparison

| Model | Language | F1-score |
|-------|----------|----------|
| **Ours (XLM-RoBERTa)** | **Kyrgyz** | **90.3%** |
| Alam et al. (2020) | English (clean) | 87.0% |
| Alam et al. (2020) | Bangla | 69.5% |
| Nagy et al. (2021) | Hungarian | ~82.0% |

The model performs strongly on frequent punctuation marks (periods, commas), with reduced accuracy on rarer marks (question marks, exclamation points) due to class imbalance.
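As a concrete illustration of how such scores are computed from token-level predictions, here is a minimal sketch in plain Python (toy label sequences, not the paper's actual evaluation code). It micro-averages over the punctuation classes, treating `O` (no punctuation) as the negative class:

```python
def punct_scores(gold, pred):
    """Micro-averaged precision/recall/F1 over punctuation labels, ignoring 'O'."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        if p != "O" and p == g:
            tp += 1                    # correct punctuation mark predicted
        elif p != "O":
            fp += 1                    # spurious or wrong mark predicted
            if g != "O":
                fn += 1                # ...and the gold mark was also missed
        elif g != "O":
            fn += 1                    # gold mark missed entirely
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy example: gold vs. predicted labels for eight tokens
gold = ["O", "COMMA", "O", "PERIOD", "O", "O", "QUESTION", "O"]
pred = ["O", "COMMA", "O", "PERIOD", "O", "O", "O", "O"]
p, r, f1 = punct_scores(gold, pred)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")  # → precision=1.00 recall=0.67 f1=0.80
```

The example shows the typical pattern behind the table above: missing a rare mark (here the question mark) costs recall while leaving precision untouched.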

---

## Model Architecture

| Parameter | Value |
|-----------|-------|
| Base model | XLM-RoBERTa-base |
| Parameters | ~270M |
| Transformer layers | 12 |
| Hidden size | 768 |
| Attention heads | 12 |
| Export format | ONNX |

---

## Training Details

### Dataset

A custom-built **200 MB** Kyrgyz text corpus, collected over two months:

| Source | Size | Description |
|--------|------|-------------|
| Kyrgyz-Turkish Manas University Library | 135 MB | Books (literature, math, physics) |
| Kyrgyz Wikipedia | 40 MB | Encyclopedia articles |
| News portals | 25 MB | Journalistic text |

**Preprocessing pipeline:** PDF → EasyOCR text extraction → manual cleaning → JSON formatting with punctuation labels.
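The final labeling step of that pipeline, turning cleaned text into per-word punctuation labels, can be sketched as follows. This is a simplified illustration with a made-up JSON layout, not the repository's exact format:

```python
import json

# Map trailing punctuation characters to label names
PUNCT_LABELS = {",": "COMMA", ".": "PERIOD", "?": "QUESTION", "!": "EXCLAMATION"}

def label_sentence(text):
    """Strip punctuation from `text` and record, per word, which mark followed it."""
    tokens, labels = [], []
    for word in text.split():
        label = "O"  # default: no punctuation after this word
        while word and word[-1] in PUNCT_LABELS:
            label = PUNCT_LABELS[word[-1]]
            word = word[:-1]
        if word:
            tokens.append(word)
            labels.append(label)
    return {"tokens": tokens, "labels": labels}

example = label_sentence("Салам, бул кыргыз тилиндеги текст.")
print(json.dumps(example, ensure_ascii=False))
```

Each JSON record pairs the unpunctuated tokens with their target labels, which is the standard supervision format for token-classification training.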

### Data Augmentation

Augmentation techniques tailored to Kyrgyz agglutinative morphology:

- **Back-translation:** Kyrgyz → English → Kyrgyz, simulating ASR-like errors
- **Token-level modifications:** random insertions, deletions, and swaps
- **Morphological transformations:** case-form and morpheme modifications that preserve grammatical correctness
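The token-level modifications can be sketched with plain Python. This is a minimal illustration only; the actual augmentation code is not published here, and the filler token and probability are hypothetical choices:

```python
import random

def augment_tokens(tokens, rng, p=0.1, filler="эмм"):
    """Randomly insert, delete, or swap tokens to simulate ASR-like noise.

    `filler` is a hypothetical hesitation token; `p` is the per-position
    probability of each operation.
    """
    out = list(tokens)
    i = 0
    while i < len(out):
        r = rng.random()
        if r < p:                             # insertion of a filler token
            out.insert(i, filler)
            i += 2
        elif r < 2 * p and len(out) > 1:      # deletion
            del out[i]
        elif r < 3 * p and i + 1 < len(out):  # swap with the next token
            out[i], out[i + 1] = out[i + 1], out[i]
            i += 2
        else:
            i += 1
    return out

rng = random.Random(0)
print(augment_tokens(["бул", "кыргыз", "тилиндеги", "текст"], rng))
```

Seeding the generator makes augmented corpora reproducible across training runs.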

### Hyperparameters

| Parameter | Value |
|-----------|-------|
| Batch size | 32 |
| Epochs | 10 |
| Optimizer | Adam |
| Learning rate | 5e-5 |
| Regularization | Dropout |
| Hardware | Google Colab TPU |
| Training time | 42 hours |
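For orientation, settings like these might appear in a `config.yaml` roughly as below. The field names here are hypothetical; consult the actual `config.yaml` in the repository for the real keys:

```yaml
# Hypothetical layout; see config.yaml in the repository for the real keys
model:
  base: xlm-roberta-base
  export: onnx
training:
  batch_size: 32
  epochs: 10
  optimizer: adam
  learning_rate: 5.0e-5
  dropout: true
labels: [O, COMMA, PERIOD, QUESTION, EXCLAMATION]
```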

---

## How to Use

```python
import onnxruntime as ort
import numpy as np

# Load the ONNX model
session = ort.InferenceSession("model.onnx")

# Prepare the input (see config.yaml for tokenizer settings).
# The model predicts a punctuation label for each token:
# O (no punctuation), COMMA, PERIOD, QUESTION, EXCLAMATION

# Example input
input_text = "бул кыргыз тилиндеги текст"  # "this is a text in the Kyrgyz language"
# Tokenize and run inference (see main.py for the full pipeline)
```
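Once per-token labels have been predicted (by whatever tokenizer and ONNX pipeline `main.py` uses), reattaching the punctuation is straightforward. A minimal word-level sketch (the real model operates on subword tokens, so this is an illustration of the decoding idea, not the repository's code):

```python
# Map label names back to punctuation characters
PUNCT = {"COMMA": ",", "PERIOD": ".", "QUESTION": "?", "EXCLAMATION": "!"}
SENTENCE_FINAL = {"PERIOD", "QUESTION", "EXCLAMATION"}

def restore(tokens, labels):
    """Attach predicted punctuation and capitalize sentence starts."""
    words, capitalize = [], True
    for token, label in zip(tokens, labels):
        word = token.capitalize() if capitalize else token
        word += PUNCT.get(label, "")  # "O" maps to no punctuation
        words.append(word)
        capitalize = label in SENTENCE_FINAL  # next word starts a new sentence
    return " ".join(words)

print(restore(["бул", "кыргыз", "тилиндеги", "текст"],
              ["O", "O", "O", "PERIOD"]))
# → Бул кыргыз тилиндеги текст.
```

This is the post-processing step that turns raw ASR output into readable punctuated text.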

### Repository Structure

```
├── model.onnx         # Trained model in ONNX format (1.11 GB)
├── main.py            # Inference pipeline
├── env.py             # Environment configuration
├── config.yaml        # Hyperparameters and model config
├── requirements.txt   # Python dependencies
└── Files/             # Additional model files
```

---

## Intended Use

| Use Case | Description |
|----------|-------------|
| **ASR post-processing** | Restore punctuation in Kyrgyz speech-to-text output |
| **Text normalization** | Clean and format raw Kyrgyz text with proper punctuation |
| **NLP preprocessing** | Improve downstream task performance (NER, MT, summarization) |
| **Accessibility** | Enhance readability of automatically generated Kyrgyz content |

---

## Limitations

- **Rare punctuation marks:** lower accuracy on question marks and exclamation points due to class imbalance in the training data
- **Formal-text bias:** trained primarily on literary and formal text; performance on informal, conversational text (social media, chat) may be lower
- **Morpheme-boundary errors:** occasional difficulty placing punctuation in complex agglutinative constructions
- **Domain specificity:** best performance on prose-style text; specialized domains may require additional fine-tuning

---

## Future Directions

- Joint training with related Turkic languages (Kazakh, Uzbek, Turkish) for improved cross-lingual transfer
- Morphology-aware tokenization to replace standard BPE
- An expanded dataset with informal and conversational Kyrgyz text
- Integration with Kyrgyz ASR systems for end-to-end speech processing

---

## Citation

```bibtex
@techreport{uvalieva2024punctuation,
  author      = {Uvalieva, Zarina and Muhametjanova, Gulshat},
  title       = {AI-Based Punctuation Restoration using Transformer Model for Kyrgyz Language},
  year        = {2024},
  institution = {Kyrgyz-Turkish Manas University}
}
```

---

## Author

**Zarina Uvalieva**, ML engineer specializing in NLP and speech technologies for low-resource languages.

- 🤗 [Hugging Face](https://huggingface.co/Zarinaaa)
- 📧 zarina.uvalievaa@gmail.com