---
license: mit
tags:
  - bert
  - morphological-analysis
  - kyrgyz
  - nlp
  - pos-tagging
  - low-resource-languages
  - token-classification
language:
  - ky
pipeline_tag: token-classification
---

# Kyrgyz Morphological Analysis: BERT

<p align="center">
  <img src="image_2023-05-13_16-58-05.png" alt="Morphological analysis example" width="600"/>
</p>

## Model Description

A **BERT-based morphological analyzer** for **Kyrgyz**, a low-resource Turkic language spoken by ~5 million people. The model performs morphological tagging: for each token in a sentence, it predicts grammatical features such as POS tag, case, number, and tense.

Kyrgyz is an agglutinative language with rich morphology, making morphological analysis particularly challenging and valuable for downstream NLP tasks.

## Performance

| Model | Accuracy |
|-------|----------|
| **BERT (fine-tuned)** | **~80%** |
| Logistic Regression (baseline) | not reported |

<!-- 🔧 TODO: Add baseline accuracy if available -->
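
The card does not say how the accuracy figure is computed. A minimal sketch, assuming it is plain token-level accuracy (the share of tokens whose predicted tag matches the gold tag), could look like this. The function name and the tags are illustrative and do not come from `dev.py` or `TAG.docx`.

```python
def token_accuracy(gold_sentences, predicted_sentences):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    correct = 0
    total = 0
    for gold_tags, predicted_tags in zip(gold_sentences, predicted_sentences):
        if len(gold_tags) != len(predicted_tags):
            raise ValueError("gold and predicted sentences must have the same length")
        for gold_tag, predicted_tag in zip(gold_tags, predicted_tags):
            total += 1
            correct += gold_tag == predicted_tag
    # An empty evaluation set is reported as 0.0 rather than raising.
    return correct / total if total else 0.0


# Hypothetical tags for illustration only; the real inventory is in TAG.docx.
gold = [["NOUN", "ADJ", "NOUN"], ["VERB", "NOUN"]]
pred = [["NOUN", "ADJ", "VERB"], ["VERB", "NOUN"]]
accuracy = token_accuracy(gold, pred)
print(f"{accuracy:.2f}")  # 4 of 5 tags match -> 0.80
```

If the tags are composite (for example, POS plus case plus number in one label), this exact-match metric counts a token as wrong when any single feature is wrong, so it is stricter than per-feature accuracy.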

## Intended Use

| Use Case | Description |
|----------|-------------|
| **Kyrgyz NLP pipeline** | Morphological preprocessing for machine translation, text analysis |
| **Linguistic research** | Studying Kyrgyz grammar and morphological patterns |
| **Education** | Teaching Kyrgyz morphology with automated analysis |
| **Downstream tasks** | Improving NER, dependency parsing, and sentiment analysis for Kyrgyz |

## Training Details

### Dataset

- **Format:** CSV with morphological annotations
- **Train set:** `train_fixed.csv`
- **Test set:** `test_fixed.csv`
- **Tag set:** Defined in `TAG.docx` (morphological tag inventory)
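
The column layout of the CSV files is not documented here, so it helps to inspect them before training. The sketch below uses only the standard library and runs on an in-memory stand-in; the `token` and `tag` column names and the sample rows are assumptions, not the real schema of `train_fixed.csv`.

```python
import csv
import io
from collections import Counter


def summarize_annotations(csv_file, tag_column):
    """Return the column names, row count, and tag frequencies of an annotation CSV."""
    reader = csv.DictReader(csv_file)
    tag_counts = Counter()
    row_count = 0
    for row in reader:
        row_count += 1
        tag_counts[row[tag_column]] += 1
    return reader.fieldnames, row_count, tag_counts


# Stand-in for train_fixed.csv; the "token" and "tag" columns are assumptions.
sample = io.StringIO("token,tag\nкитеп,NOUN\nкооз,ADJ\nүй,NOUN\n")
columns, rows, counts = summarize_annotations(sample, tag_column="tag")
print(columns, rows, counts.most_common(1))
```

For the real data, pass `open("train_fixed.csv", newline="", encoding="utf-8")` in place of `sample` and set `tag_column` to whatever the header actually contains.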

### Architecture

- **Base model:** BERT (fine-tuned for token classification)
- **Custom variant:** `bert_model_variant.py`
- **Baseline:** Logistic Regression (`logistic_regression.ipynb`)
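
Token classification with BERT usually has to reconcile word-level tags with subword pieces, which matters for long agglutinative word forms. A common convention, sketched below, labels only the first piece of each word and masks the rest with `-100` so the loss ignores them. Whether `bert_model_variant.py` does this is not confirmed by this card, and `toy_subword_split` is a hypothetical stand-in for a real WordPiece tokenizer.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def toy_subword_split(word, piece_length=4):
    """Hypothetical stand-in for a WordPiece tokenizer: fixed-size chunks."""
    return [word[i:i + piece_length] for i in range(0, len(word), piece_length)]


def align_labels(words, label_ids, split=toy_subword_split):
    """Expand word-level labels to subword pieces, labeling only the first piece."""
    pieces, aligned = [], []
    for word, label_id in zip(words, label_ids):
        word_pieces = split(word)
        if not word_pieces:
            continue  # skip empty words so pieces and labels stay the same length
        pieces.extend(word_pieces)
        aligned.append(label_id)
        aligned.extend([IGNORE_INDEX] * (len(word_pieces) - 1))
    return pieces, aligned


words = ["китептерибизде", "кооз"]  # an inflected noun and an adjective
label_ids = [3, 7]                  # hypothetical tag ids
pieces, aligned = align_labels(words, label_ids)
print(pieces)
print(aligned)
```

At prediction time the same mapping is read in reverse: the tag predicted for the first piece of a word is taken as the tag for the whole word.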

### Framework

- Python 3.10+
- PyTorch / Transformers (HuggingFace)

## Repository Structure

```
├── bert_model_variant.py    # Custom BERT model architecture
├── train.py                 # Training script
├── dev.py                   # Evaluation script
├── dev.ipynb                # Development notebook
├── logistic_regression.ipynb # Baseline model
├── train_fixed.csv          # Training data
├── test_fixed.csv           # Test data
├── TAG.docx                 # Morphological tag definitions
```

## How to Use

```python
# Load and run inference
from bert_model_variant import MorphAnalyzer  # adjust import as needed

# Example: Analyze Kyrgyz text
text = "Кыргызстан — кооз өлкө"
# See train.py and dev.py for full inference pipeline
```

<!-- 🔧 TODO: Add a more complete inference example -->

## Why This Matters

Kyrgyz is an **underrepresented language** in NLP. Most morphological analyzers exist only for high-resource languages. This model contributes to:

- Building foundational NLP tools for the Kyrgyz language
- Enabling more complex downstream applications (MT, QA, summarization)
- Preserving and digitizing Kyrgyz linguistic knowledge

## Limitations

- Accuracy of ~80% means roughly 1 in 5 tokens may be mistagged
- Performance may vary across different text domains and registers
- Limited to the morphological tag set defined in `TAG.docx`
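
To make the first limitation concrete, the sketch below turns a token-level accuracy into sentence-level expectations. It assumes tagging errors are independent across tokens, which is a simplification (real errors tend to cluster), and the function name is illustrative.

```python
def sentence_error_estimates(token_accuracy, sentence_length):
    """Expected mistagged tokens and chance of a fully correct sentence.

    Assumes each token is tagged correctly independently with the same probability.
    """
    expected_errors = (1 - token_accuracy) * sentence_length
    p_all_correct = token_accuracy ** sentence_length
    return expected_errors, p_all_correct


for n in (5, 10, 20):
    errors, p = sentence_error_estimates(0.80, n)
    print(f"{n} tokens: {errors:.1f} expected errors, P(all correct) = {p:.3f}")
```

Under this assumption a 10-token sentence has about two mistagged tokens on average and roughly an 11% chance of being tagged entirely correctly, so downstream components should not treat the tags as gold.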

## Citation

```bibtex
@misc{kyrgyz_morph_2023,
  author = {Zarina},
  title  = {BERT-based Morphological Analyzer for Kyrgyz Language},
  year   = {2023},
  url    = {https://huggingface.co/Zarinaaa/morphological_analysis}
}
```

## Author

**Zarina**, ML Engineer specializing in NLP and Speech Technologies for low-resource languages.

- 🤗 [HuggingFace](https://huggingface.co/Zarinaaa)
- 💼 [LinkedIn](https://linkedin.com/in/YOUR_LINKEDIN)