Kyrgyz Morphological Analysis: BERT
Model Description
A BERT-based morphological analyzer for the Kyrgyz language, a low-resource Turkic language spoken by ~5 million people. The model performs morphological tagging, predicting grammatical features (POS tags, case, number, tense, etc.) for each token in a sentence.
Kyrgyz is an agglutinative language with rich morphology, making morphological analysis particularly challenging and valuable for downstream NLP tasks.
Performance
| Model | Accuracy |
|---|---|
| BERT (fine-tuned) | ~80% |
| Logistic Regression (baseline) | not reported |
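The table does not say how accuracy is computed. A common convention for morphological tagging is token-level accuracy: the share of tokens whose predicted tag string exactly matches the gold tag. The sketch below assumes that convention; the function name and the tag strings are illustrative and are not taken from TAG.docx.

```python
def token_accuracy(gold, pred):
    """Fraction of tokens whose predicted tag exactly matches the gold tag.

    gold and pred are lists of sentences; each sentence is a list of tag strings.
    """
    correct = 0
    total = 0
    for gold_sent, pred_sent in zip(gold, pred):
        if len(gold_sent) != len(pred_sent):
            raise ValueError("gold and predicted sentences differ in length")
        for g, p in zip(gold_sent, pred_sent):
            correct += int(g == p)
            total += 1
    return correct / total if total else 0.0


# Illustrative tags only (hypothetical, not the TAG.docx inventory).
gold = [["NOUN|Case=Nom", "ADJ", "NOUN|Case=Nom"], ["VERB|Tense=Past", "PUNCT"]]
pred = [["NOUN|Case=Nom", "ADJ", "NOUN|Case=Loc"], ["VERB|Tense=Past", "PUNCT"]]
print(token_accuracy(gold, pred))
# 0.8
```

Under this convention a tag counts as correct only if every feature in it matches, so a token with the right POS but the wrong case is still an error.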
Intended Use
| Use Case | Description |
|---|---|
| Kyrgyz NLP pipeline | Morphological preprocessing for machine translation, text analysis |
| Linguistic research | Studying Kyrgyz grammar and morphological patterns |
| Education | Teaching Kyrgyz morphology with automated analysis |
| Downstream tasks | Improving NER, dependency parsing, and sentiment analysis for Kyrgyz |
Training Details
Dataset
- Format: CSV with morphological annotations
- Train set: train_fixed.csv
- Test set: test_fixed.csv
- Tag set: Defined in TAG.docx (morphological tag inventory)
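The column layout of the CSV files is not documented here, so a reasonable first step is to inspect the header and a few rows before writing any loading code. This is a minimal sketch using only the standard library; the helper name `peek_csv` is hypothetical, and UTF-8 encoding is an assumption.

```python
import csv


def peek_csv(path, n_rows=3, encoding="utf-8"):
    """Return the header and the first n_rows data rows of a CSV file."""
    with open(path, newline="", encoding=encoding) as f:
        reader = csv.reader(f)
        header = next(reader, [])
        # zip stops after n_rows without reading the rest of the file.
        rows = [row for _, row in zip(range(n_rows), reader)]
    return header, rows
```

Calling `peek_csv("train_fixed.csv")` from the repository root should return the column names and three sample rows, which is enough to see how tokens and tags are stored.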
Architecture
- Base model: BERT (fine-tuned for token classification)
- Custom variant: bert_model_variant.py
- Baseline: Logistic Regression (logistic_regression.ipynb)
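Fine-tuning BERT for token classification usually requires aligning word-level tags with the subword pieces the tokenizer produces. The sketch below shows one common convention, which is not necessarily the one used in train.py: the tag id goes on the first subword of each word, and every other position receives -100, the index that PyTorch's cross-entropy loss ignores by default. The `word_ids` list mimics what HuggingFace fast tokenizers return from `word_ids()`; the example values are made up.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def align_labels(word_ids, word_labels):
    """Map word-level label ids onto subword positions.

    word_ids: one entry per subword; None marks special tokens such as [CLS].
    word_labels: one label id per word.
    """
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None or word_id == previous:
            # Special token or continuation piece: excluded from the loss.
            aligned.append(IGNORE_INDEX)
        else:
            aligned.append(word_labels[word_id])
        previous = word_id
    return aligned


# Hypothetical sentence of 3 words split into 5 subwords plus [CLS] and [SEP].
word_ids = [None, 0, 0, 1, 2, 2, None]
word_labels = [4, 1, 7]
print(align_labels(word_ids, word_labels))
# [-100, 4, -100, 1, 7, -100, -100]
```

Labeling only the first subword keeps the loss at one term per word, which matters for an agglutinative language where long words can split into many pieces.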
Framework
- Python 3.10+
- PyTorch / Transformers (HuggingFace)
Repository Structure
├── bert_model_variant.py # Custom BERT model architecture
├── train.py # Training script
├── dev.py # Evaluation script
├── dev.ipynb # Development notebook
├── logistic_regression.ipynb # Baseline model
├── train_fixed.csv # Training data
├── test_fixed.csv # Test data
└── TAG.docx # Morphological tag definitions
How to Use
# Load and run inference
from bert_model_variant import MorphAnalyzer # adjust import as needed
# Example: Analyze Kyrgyz text
text = "Кыргызстан — кооз өлкө"  # "Kyrgyzstan is a beautiful country"
# See train.py and dev.py for full inference pipeline
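The snippet above stops before inference. As a rough illustration of the last step of a token-classification pipeline, the sketch below turns per-subword class scores into one tag per word by taking the highest-scoring class at each word's first subword. It is not the MorphAnalyzer API, which is not shown here; `decode_word_tags`, the scores, and the `id2label` mapping are all hypothetical.

```python
def decode_word_tags(scores, word_ids, id2label):
    """Pick one tag per word from per-subword class scores.

    scores: list of per-class score lists, one per subword.
    word_ids: word index for each subword; None for special tokens.
    id2label: mapping from class index to tag string.
    """
    tags = []
    seen = set()
    for subword_scores, word_id in zip(scores, word_ids):
        if word_id is None or word_id in seen:
            continue  # skip special tokens and continuation pieces
        seen.add(word_id)
        best = max(range(len(subword_scores)), key=subword_scores.__getitem__)
        tags.append(id2label[best])
    return tags


id2label = {0: "NOUN", 1: "ADJ", 2: "PUNCT"}  # made-up labels
word_ids = [None, 0, 0, 1, None]
scores = [
    [0.1, 0.1, 0.8],  # [CLS], ignored
    [0.7, 0.2, 0.1],  # first piece of word 0
    [0.2, 0.6, 0.2],  # continuation of word 0, ignored
    [0.1, 0.8, 0.1],  # word 1
    [0.3, 0.3, 0.4],  # [SEP], ignored
]
print(decode_word_tags(scores, word_ids, id2label))
# ['NOUN', 'ADJ']
```

In a real run the scores would come from the model's output logits; train.py and dev.py remain the reference for how this repository actually does it.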
Why This Matters
Kyrgyz is an underrepresented language in NLP. Most morphological analyzers exist only for high-resource languages. This model contributes to:
- Building foundational NLP tools for the Kyrgyz language
- Enabling more complex downstream applications (MT, QA, summarization)
- Preserving and digitizing Kyrgyz linguistic knowledge
Limitations
- Accuracy of ~80% means roughly 1 in 5 tokens may be mistagged
- Performance may vary across different text domains and registers
- Limited to the morphological tag set defined in TAG.docx
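To see what ~80% token accuracy implies at the sentence level, suppose tagging errors are independent across tokens. This is a simplifying assumption, since real errors tend to cluster on rare or ambiguous words. Under it, the chance that every token in an n-token sentence is tagged correctly is 0.8 raised to the power n. A short sketch of that arithmetic:

```python
def p_sentence_correct(token_accuracy, n_tokens):
    """Probability that all tokens in a sentence are tagged correctly,
    assuming independent errors."""
    return token_accuracy ** n_tokens


for n in (5, 10, 20):
    print(n, round(p_sentence_correct(0.8, n), 3))
# 5 0.328
# 10 0.107
# 20 0.012
```

Even short sentences are therefore likely to contain at least one mistagged token, which downstream components such as parsers should be prepared to tolerate.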
Citation
@misc{kyrgyz_morph_2023,
author = {Zarina},
title = {BERT-based Morphological Analyzer for Kyrgyz Language},
year = {2023},
url = {https://huggingface.co/Zarinaaa/morphological_analysis}
}
Author
Zarina, ML Engineer specializing in NLP and Speech Technologies for low-resource languages.
- 🤗 HuggingFace
- 💼 LinkedIn