---
license: mit
tags:
- bert
- morphological-analysis
- kyrgyz
- nlp
- pos-tagging
- low-resource-languages
- token-classification
language:
- ky
pipeline_tag: token-classification
---

# Kyrgyz Morphological Analysis: BERT

<p align="center">
<img src="image_2023-05-13_16-58-05.png" alt="Morphological analysis example" width="600"/>
</p>

## Model Description

A **BERT-based morphological analyzer** for the **Kyrgyz language**, a low-resource Turkic language spoken by ~5 million people. The model performs morphological tagging, predicting grammatical features (POS tags, case, number, tense, etc.) for each token in a sentence.

Kyrgyz is an agglutinative language with rich morphology, making morphological analysis particularly challenging and valuable for downstream NLP tasks.

## Performance

| Model | Accuracy |
|-------|----------|
| **BERT (fine-tuned)** | **~80%** |
| Logistic Regression (baseline) | not reported |

<!-- 🚧 TODO: Add baseline accuracy if available -->

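As a rough illustration of what the accuracy figure measures, the sketch below computes token-level accuracy from gold and predicted tag sequences. It assumes accuracy is counted per token, which this card does not state explicitly. The function name `token_accuracy` and the tag names are hypothetical and are not taken from `dev.py` or `TAG.docx`.

```python
from typing import Sequence


def token_accuracy(
    gold: Sequence[Sequence[str]], pred: Sequence[Sequence[str]]
) -> float:
    """Return the fraction of tokens whose predicted tag equals the gold tag."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must contain the same number of sentences")
    correct = 0
    total = 0
    for gold_sent, pred_sent in zip(gold, pred):
        if len(gold_sent) != len(pred_sent):
            raise ValueError("each sentence pair must have the same number of tokens")
        correct += sum(g == p for g, p in zip(gold_sent, pred_sent))
        total += len(gold_sent)
    # An empty evaluation set is reported as 0.0 rather than raising.
    return correct / total if total else 0.0


# Hypothetical tags for two short sentences (5 tokens, 1 mistagged).
gold = [["NOUN", "ADJ", "NOUN"], ["VERB", "NOUN"]]
pred = [["NOUN", "ADJ", "VERB"], ["VERB", "NOUN"]]
print(f"{token_accuracy(gold, pred):.2f}")  # prints 0.80
```

If the reported number was instead computed per sentence or per individual morphological feature, the counting loop would need to change accordingly.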
## Intended Use

| Use Case | Description |
|----------|-------------|
| **Kyrgyz NLP pipeline** | Morphological preprocessing for machine translation, text analysis |
| **Linguistic research** | Studying Kyrgyz grammar and morphological patterns |
| **Education** | Teaching Kyrgyz morphology with automated analysis |
| **Downstream tasks** | Improving NER, dependency parsing, and sentiment analysis for Kyrgyz |

## Training Details

### Dataset

- **Format:** CSV with morphological annotations
- **Train set:** `train_fixed.csv`
- **Test set:** `test_fixed.csv`
- **Tag set:** Defined in `TAG.docx` (morphological tag inventory)

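The column layout of the CSV files is not documented in this card, so a reasonable first step is to look at the header and a few rows before writing any loading code. The sketch below does that with the standard `csv` module. The helper `peek_csv`, the sample columns `word` and `tag`, and the sample tag values are all hypothetical; the real files may be organised differently.

```python
import csv
import io
from itertools import islice
from typing import Iterable, List, Tuple


def peek_csv(lines: Iterable[str], n_rows: int = 3) -> Tuple[List[str], List[List[str]]]:
    """Return the header row and the first n_rows data rows of a CSV stream."""
    reader = csv.reader(lines)
    header = next(reader, [])          # empty list if the stream has no rows
    rows = list(islice(reader, n_rows))
    return header, rows


# Hypothetical in-memory sample; column names and tags are assumptions.
sample = io.StringIO("word,tag\nкооз,ADJ\nөлкө,NOUN\n")
header, rows = peek_csv(sample)
print(header)  # prints ['word', 'tag']
for row in rows:
    print(row)
```

To inspect the real data, pass an open file instead of the sample, for example `open("train_fixed.csv", newline="", encoding="utf-8")`. The UTF-8 encoding is also an assumption.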
### Architecture

- **Base model:** BERT (fine-tuned for token classification)
- **Custom variant:** `bert_model_variant.py`
- **Baseline:** Logistic Regression (`logistic_regression.ipynb`)

### Framework

- Python 3.10+
- PyTorch / Transformers (HuggingFace)

## Repository Structure

```
├── bert_model_variant.py      # Custom BERT model architecture
├── train.py                   # Training script
├── dev.py                     # Evaluation script
├── dev.ipynb                  # Development notebook
├── logistic_regression.ipynb  # Baseline model
├── train_fixed.csv            # Training data
├── test_fixed.csv             # Test data
└── TAG.docx                   # Morphological tag definitions
```

## How to Use

```python
# Load and run inference
from bert_model_variant import MorphAnalyzer  # adjust import as needed

# Example: Analyze Kyrgyz text
text = "Кыргызстан — кооз өлкө"  # "Kyrgyzstan is a beautiful country"
# See train.py and dev.py for full inference pipeline
```

<!-- 🚧 TODO: Add a more complete inference example -->

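BERT tokenizers split words into subword pieces, so a token-classification model produces one prediction per piece rather than one per word. A common convention is to keep the prediction of the first piece of each word. The sketch below shows that post-processing step in plain Python. Whether `train.py` and `dev.py` use this convention is not confirmed by this card. The helper `first_subword_tags` and the tag values are hypothetical; `word_ids` follows the format returned by `BatchEncoding.word_ids()` for fast tokenizers in Transformers, where `None` marks special tokens.

```python
from typing import List, Optional


def first_subword_tags(
    word_ids: List[Optional[int]], subword_tags: List[str]
) -> List[str]:
    """Collapse per-subword tags to per-word tags using the first subword."""
    if len(word_ids) != len(subword_tags):
        raise ValueError("word_ids and subword_tags must have the same length")
    word_tags: List[str] = []
    previous: Optional[int] = None
    for word_id, tag in zip(word_ids, subword_tags):
        # Skip special tokens (None) and non-initial pieces of the same word.
        if word_id is not None and word_id != previous:
            word_tags.append(tag)
        previous = word_id
    return word_tags


# Hypothetical split of three words into six pieces plus [CLS] and [SEP].
words = ["Кыргызстан", "кооз", "өлкө"]
word_ids = [None, 0, 0, 0, 1, 2, 2, None]
subword_tags = ["O", "NOUN", "NOUN", "X", "ADJ", "NOUN", "NOUN", "O"]

for word, tag in zip(words, first_subword_tags(word_ids, subword_tags)):
    print(word, tag)
```

Other conventions exist, such as majority voting over a word's pieces; first-piece selection is used here only because it is simple to state.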
## Why This Matters

Kyrgyz is an **underrepresented language** in NLP. Most morphological analyzers exist only for high-resource languages. This model contributes to:

- Building foundational NLP tools for the Kyrgyz language
- Enabling more complex downstream applications (MT, QA, summarization)
- Preserving and digitizing Kyrgyz linguistic knowledge

## Limitations

- Accuracy of ~80% means roughly 1 in 5 tokens may be mistagged
- Performance may vary across different text domains and registers
- Limited to the morphological tag set defined in `TAG.docx`

## Citation

```bibtex
@misc{kyrgyz_morph_2023,
  author = {Zarina},
  title = {BERT-based Morphological Analyzer for Kyrgyz Language},
  year = {2023},
  url = {https://huggingface.co/Zarinaaa/morphological_analysis}
}
```

## Author

**Zarina**, ML Engineer specializing in NLP and Speech Technologies for low-resource languages.

- 🤗 [HuggingFace](https://huggingface.co/Zarinaaa)
- 💼 [LinkedIn](https://linkedin.com/in/YOUR_LINKEDIN)