---
license: mit
tags:
- bert
- morphological-analysis
- kyrgyz
- nlp
- pos-tagging
- low-resource-languages
- token-classification
language:
- ky
pipeline_tag: token-classification
---
# Kyrgyz Morphological Analysis: BERT
<p align="center">
<img src="image_2023-05-13_16-58-05.png" alt="Morphological analysis example" width="600"/>
</p>
## Model Description
A **BERT-based morphological analyzer** for the **Kyrgyz language**, a low-resource Turkic language spoken by ~5 million people. The model performs morphological tagging, predicting grammatical features (POS tags, case, number, tense, etc.) for each token in a sentence.
Kyrgyz is an agglutinative language with rich morphology, making morphological analysis particularly challenging and valuable for downstream NLP tasks.
## Performance
| Model | Accuracy |
|-------|----------|
| **BERT (fine-tuned)** | **~80%** |
| Logistic Regression (baseline) | not reported |
<!-- 🚧 TODO: Add baseline accuracy if available -->
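
The card does not say how accuracy is computed. As a minimal sketch, assuming it is token-level accuracy (the share of tokens whose predicted tag exactly matches the gold tag), the metric could be computed as below. The function name and the tag strings are illustrative assumptions, not taken from `dev.py` or `TAG.docx`.

```python
from typing import List


def token_accuracy(gold: List[List[str]], pred: List[List[str]]) -> float:
    """Fraction of tokens whose predicted tag equals the gold tag."""
    correct = 0
    total = 0
    for gold_sent, pred_sent in zip(gold, pred):
        if len(gold_sent) != len(pred_sent):
            raise ValueError("gold and predicted sentences must have equal length")
        for g, p in zip(gold_sent, pred_sent):
            correct += int(g == p)
            total += 1
    return correct / total if total else 0.0


# Hypothetical tags for illustration only; the real inventory lives in TAG.docx.
gold = [["NOUN", "ADJ", "NOUN"], ["VERB", "NOUN"]]
pred = [["NOUN", "ADJ", "VERB"], ["VERB", "NOUN"]]
print(token_accuracy(gold, pred))  # 4 of 5 tokens match
```

If full morphological tags are compared as single strings, a token counts as wrong even when only one feature (for example, case) differs, so this is a strict measure.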
## Intended Use
| Use Case | Description |
|----------|-------------|
| **Kyrgyz NLP pipeline** | Morphological preprocessing for machine translation, text analysis |
| **Linguistic research** | Studying Kyrgyz grammar and morphological patterns |
| **Education** | Teaching Kyrgyz morphology with automated analysis |
| **Downstream tasks** | Improving NER, dependency parsing, and sentiment analysis for Kyrgyz |
## Training Details
### Dataset
- **Format:** CSV with morphological annotations
- **Train set:** `train_fixed.csv`
- **Test set:** `test_fixed.csv`
- **Tag set:** Defined in `TAG.docx` (morphological tag inventory)
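
The column layout of the CSV files is not documented here, so a reasonable first step is to inspect the header and a few rows before writing any loading code. The sketch below uses only the standard library; `peek_csv` is a hypothetical helper, and the in-memory sample (with assumed `token` and `tag` columns) merely stands in for `train_fixed.csv`.

```python
import csv
import io
from typing import Dict, List, TextIO, Tuple


def peek_csv(handle: TextIO, n_rows: int = 3) -> Tuple[List[str], List[Dict[str, str]]]:
    """Return the header and the first n_rows records of a CSV file."""
    reader = csv.DictReader(handle)
    rows = []
    for i, row in enumerate(reader):
        if i >= n_rows:
            break
        rows.append(row)
    return list(reader.fieldnames or []), rows


# Hypothetical stand-in for train_fixed.csv; the real column names may differ.
sample = io.StringIO("token,tag\nкооз,ADJ\nөлкө,NOUN\n")
header, rows = peek_csv(sample)
print(header)
print(rows)
```

To inspect the real data, pass `open("train_fixed.csv", encoding="utf-8", newline="")` instead of the in-memory sample.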
### Architecture
- **Base model:** BERT (fine-tuned for token classification)
- **Custom variant:** `bert_model_variant.py`
- **Baseline:** Logistic Regression (`logistic_regression.ipynb`)
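
Fine-tuning BERT for token classification usually requires aligning word-level tags with subword pieces, because the tokenizer may split a long agglutinative word into several pieces. The sketch below shows one common convention (label the first piece, mask the rest); whether `train.py` follows it is an assumption. The `word_ids` list has the shape returned by `BatchEncoding.word_ids()` on HuggingFace fast tokenizers, and `align_labels` is a hypothetical helper.

```python
from typing import List, Optional

IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def align_labels(word_ids: List[Optional[int]], word_labels: List[int]) -> List[int]:
    """Expand word-level label ids to subword positions.

    The first subword of each word keeps the word's label; special tokens
    (word id None) and continuation subwords get IGNORE_INDEX.
    """
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None or word_id == previous:
            aligned.append(IGNORE_INDEX)
        else:
            aligned.append(word_labels[word_id])
        previous = word_id
    return aligned


# Hypothetical split: [CLS], word 0 (two pieces), word 1, [SEP]
word_ids = [None, 0, 0, 1, None]
print(align_labels(word_ids, [7, 3]))  # [-100, 7, -100, 3, -100]
```

Masked positions do not contribute to the loss, so each word is scored exactly once regardless of how many pieces it is split into.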
### Framework
- Python 3.10+
- PyTorch / Transformers (HuggingFace)
## Repository Structure
```
├── bert_model_variant.py # Custom BERT model architecture
├── train.py # Training script
├── dev.py # Evaluation script
├── dev.ipynb # Development notebook
├── logistic_regression.ipynb # Baseline model
├── train_fixed.csv # Training data
├── test_fixed.csv # Test data
└── TAG.docx # Morphological tag definitions
```
## How to Use
```python
# Load and run inference
from bert_model_variant import MorphAnalyzer # adjust import as needed
# Example: Analyze Kyrgyz text
text = "Кыргызстан — кооз өлкө"  # "Kyrgyzstan is a beautiful country"
# See train.py and dev.py for full inference pipeline
```
<!-- 🚧 TODO: Add a more complete inference example -->
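
Until a full inference example is available, the post-processing step can be sketched independently of the model. The code below assumes the model yields one predicted label id per subword (for example, an argmax over logits) and that the tokenizer exposes a word id per subword; it then keeps the prediction of each word's first subword. `decode_word_tags`, the label mapping, and the predicted ids are all hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def decode_word_tags(
    words: List[str],
    word_ids: List[Optional[int]],
    pred_ids: List[int],
    id2label: Dict[int, str],
) -> List[Tuple[str, str]]:
    """Pair each word with the tag predicted for its first subword."""
    tagged = []
    seen = set()
    for word_id, pred in zip(word_ids, pred_ids):
        # Skip special tokens and continuation subwords.
        if word_id is None or word_id in seen:
            continue
        seen.add(word_id)
        tagged.append((words[word_id], id2label[pred]))
    return tagged


# Illustrative values only; real ids and tags depend on the trained model.
words = ["Кыргызстан", "кооз", "өлкө"]
word_ids = [None, 0, 0, 1, 2, None]
pred_ids = [0, 1, 1, 2, 1, 0]
id2label = {0: "PAD", 1: "NOUN", 2: "ADJ"}
print(decode_word_tags(words, word_ids, pred_ids, id2label))
```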
## Why This Matters
Kyrgyz is an **underrepresented language** in NLP. Most morphological analyzers exist only for high-resource languages. This model contributes to:
- Building foundational NLP tools for the Kyrgyz language
- Enabling more complex downstream applications (MT, QA, summarization)
- Preserving and digitizing Kyrgyz linguistic knowledge
## Limitations
- Accuracy of ~80% means roughly 1 in 5 tokens may be mistagged
- Performance may vary across different text domains and registers
- Limited to the morphological tag set defined in `TAG.docx`
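
To get a rough feel for what ~80% token accuracy implies at the sentence level, one can assume (as a simplification that real errors generally violate) that tagging errors are independent across tokens. Under that assumption, the chance that every token in an n-token sentence is tagged correctly is 0.8 raised to the power n. The function name below is illustrative.

```python
def p_sentence_correct(token_accuracy: float, n_tokens: int) -> float:
    """Probability that all tokens are correct, assuming independent errors."""
    return token_accuracy ** n_tokens


for n in (1, 5, 10):
    print(n, round(p_sentence_correct(0.8, n), 3))
```

Under this estimate, about a third of 5-token sentences and about one in ten 10-token sentences would be tagged without any error, so downstream components should tolerate noisy tags.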
## Citation
```bibtex
@misc{kyrgyz_morph_2023,
author = {Zarina},
title = {BERT-based Morphological Analyzer for Kyrgyz Language},
year = {2023},
url = {https://huggingface.co/Zarinaaa/morphological_analysis}
}
```
## Author
**Zarina**, ML Engineer specializing in NLP and Speech Technologies for low-resource languages.
- 🤗 [HuggingFace](https://huggingface.co/Zarinaaa)
- 💼 [LinkedIn](https://linkedin.com/in/YOUR_LINKEDIN)