license: mit
tags:
  - bert
  - morphological-analysis
  - kyrgyz
  - nlp
  - pos-tagging
  - low-resource-languages
  - token-classification
language:
  - ky
pipeline_tag: token-classification

Kyrgyz Morphological Analysis - BERT


Model Description

A BERT-based morphological analyzer for Kyrgyz, a low-resource Turkic language spoken by about 5 million people. The model performs morphological tagging: for each token in a sentence it predicts grammatical features (POS tag, case, number, tense, etc.).

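Kyrgyz is agglutinative: a single word can carry a chain of suffixes, as in китептеримде ("in my books": китеп "book" + plural -тер + first-person possessive -им + locative -де). This rich morphology makes analysis both challenging and valuable for downstream NLP tasks.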

Performance

| Model | Accuracy |
| --- | --- |
| BERT (fine-tuned) | ~80% |
| Logistic Regression (baseline) | not reported |
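
The card does not say how the ~80% figure is computed. A minimal sketch, assuming it is token-level accuracy (the share of tokens whose predicted tag exactly matches the gold tag); the function name and the tag strings are illustrative and not taken from dev.py:

```python
def token_accuracy(gold_sentences, pred_sentences):
    """Share of tokens whose predicted tag equals the gold tag.

    Both arguments are lists of sentences; each sentence is a list of tag strings.
    """
    correct = 0
    total = 0
    for gold, pred in zip(gold_sentences, pred_sentences):
        if len(gold) != len(pred):
            raise ValueError("gold and predicted sentences must have the same length")
        for g, p in zip(gold, pred):
            correct += int(g == p)
            total += 1
    return correct / total if total else 0.0


# Hypothetical tags, for illustration only.
gold = [["NOUN", "ADJ", "NOUN"], ["VERB", "NOUN"]]
pred = [["NOUN", "ADJ", "VERB"], ["VERB", "NOUN"]]
accuracy = token_accuracy(gold, pred)
print(f"{accuracy:.2f}")  # 4 of 5 tokens match
```

If the reported number is instead sentence-level or per-feature accuracy, the comparison inside the loop would need to change accordingly.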

Intended Use

| Use Case | Description |
| --- | --- |
| Kyrgyz NLP pipeline | Morphological preprocessing for machine translation and text analysis |
| Linguistic research | Studying Kyrgyz grammar and morphological patterns |
| Education | Teaching Kyrgyz morphology with automated analysis |
| Downstream tasks | Improving NER, dependency parsing, and sentiment analysis for Kyrgyz |

Training Details

Dataset

  • Format: CSV with morphological annotations
  • Train set: train_fixed.csv
  • Test set: test_fixed.csv
  • Tag set: Defined in TAG.docx (morphological tag inventory)
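
The card does not document the CSV schema. As a sketch only, assuming one row per token with columns named `sentence_id`, `token`, and `tag` (all three names are hypothetical), the rows could be grouped into sentences like this:

```python
import csv
import io
from itertools import groupby


def read_tagged_sentences(csv_file, id_col="sentence_id", token_col="token", tag_col="tag"):
    """Group token rows into (tokens, tags) pairs, one pair per sentence.

    The column names are assumptions; rows of one sentence must be contiguous.
    """
    reader = csv.DictReader(csv_file)
    sentences = []
    for _, rows in groupby(reader, key=lambda row: row[id_col]):
        rows = list(rows)
        tokens = [row[token_col] for row in rows]
        tags = [row[tag_col] for row in rows]
        sentences.append((tokens, tags))
    return sentences


# In-memory stand-in for a file such as train_fixed.csv; the tags are made up.
sample = io.StringIO(
    "sentence_id,token,tag\n"
    "1,Кыргызстан,NOUN\n"
    "1,кооз,ADJ\n"
    "1,өлкө,NOUN\n"
    "2,Бишкек,NOUN\n"
)
sentences = read_tagged_sentences(sample)
print(len(sentences))
```

For a real file, pass `open("train_fixed.csv", newline="", encoding="utf-8")` instead of the in-memory sample and adjust the column names to match the actual header.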

Architecture

  • Base model: BERT (fine-tuned for token classification)
  • Custom variant: bert_model_variant.py
  • Baseline: Logistic Regression (logistic_regression.ipynb)
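
Fine-tuning BERT for token classification usually means mapping word-level tags onto subword tokens. The sketch below shows one common convention: label the first subword of each word and mask the rest with -100 so that PyTorch's cross-entropy loss skips them. Whether train.py or bert_model_variant.py does this is not stated in the card. The `word_ids` list mimics what a Hugging Face fast tokenizer returns from `BatchEncoding.word_ids()`, and the label ids are made up.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def align_labels(word_ids, word_labels):
    """Expand word-level label ids to subword level.

    word_ids: one entry per subword token; None marks special tokens ([CLS], [SEP]).
    word_labels: one label id per word.
    """
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(IGNORE_INDEX)          # special token
        elif word_id != previous:
            aligned.append(word_labels[word_id])  # first subword of a word
        else:
            aligned.append(IGNORE_INDEX)          # continuation subword
        previous = word_id
    return aligned


# Three words; the first is split into three subwords (hypothetical split).
word_ids = [None, 0, 0, 0, 1, 2, None]
word_labels = [5, 2, 5]
aligned = align_labels(word_ids, word_labels)
print(aligned)
```

Because Kyrgyz words are long and suffix-heavy, a subword tokenizer is likely to split many of them, so some alignment step of this kind is hard to avoid.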

Framework

  • Python 3.10+
  • PyTorch / Transformers (HuggingFace)

Repository Structure

├── bert_model_variant.py     # Custom BERT model architecture
├── train.py                  # Training script
├── dev.py                    # Evaluation script
├── dev.ipynb                 # Development notebook
├── logistic_regression.ipynb # Baseline model
├── train_fixed.csv           # Training data
├── test_fixed.csv            # Test data
├── TAG.docx                  # Morphological tag definitions

How to Use

# Load and run inference
from bert_model_variant import MorphAnalyzer  # adjust import as needed

# Example: Analyze Kyrgyz text
text = "Кыргызстан — кооз өлкө"
# See train.py and dev.py for full inference pipeline
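
The snippet above stops before inference. As a sketch of the final step only, assuming the model emits one tag id per subword and a fast tokenizer supplies `word_ids`, predictions can be collapsed to one tag per word by keeping the prediction at each word's first subword. The `id2label` mapping and tag names are placeholders (the real inventory is in TAG.docx), and `first_subword_tags` is a hypothetical helper, not part of the repository.

```python
def first_subword_tags(words, word_ids, pred_ids, id2label):
    """Return (word, tag) pairs using the prediction at each word's first subword."""
    tags = {}
    for word_id, pred_id in zip(word_ids, pred_ids):
        # Skip special tokens (None) and continuation subwords (already seen).
        if word_id is not None and word_id not in tags:
            tags[word_id] = id2label[pred_id]
    return [(word, tags[i]) for i, word in enumerate(words)]


words = ["Кыргызстан", "кооз", "өлкө"]      # punctuation left out for brevity
word_ids = [None, 0, 0, 0, 1, 2, None]      # shaped like BatchEncoding.word_ids()
pred_ids = [0, 1, 1, 0, 2, 1, 0]            # e.g. argmax over logits for one sentence
id2label = {0: "O", 1: "NOUN", 2: "ADJ"}    # hypothetical labels
tagged = first_subword_tags(words, word_ids, pred_ids, id2label)
for word, tag in tagged:
    print(word, tag)
```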

Why This Matters

Kyrgyz is underrepresented in NLP, and most morphological analyzers target high-resource languages. This model contributes to:

  • Building foundational NLP tools for the Kyrgyz language
  • Enabling more complex downstream applications (MT, QA, summarization)
  • Preserving and digitizing Kyrgyz linguistic knowledge

Limitations

  • Accuracy of ~80% means roughly 1 in 5 tokens may be mistagged
  • Performance may vary across different text domains and registers
  • Limited to the morphological tag set defined in TAG.docx
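
To make the first limitation concrete: if per-token errors were independent (a simplifying assumption that real taggers do not satisfy exactly), the chance that every token in an n-token sentence is tagged correctly would be 0.8^n. The short calculation below shows how quickly that drops with sentence length.

```python
def p_sentence_fully_correct(token_accuracy, n_tokens):
    """Probability that all n tokens are correct, assuming independent errors."""
    return token_accuracy ** n_tokens


for n in (5, 10, 20):
    print(n, round(p_sentence_fully_correct(0.8, n), 3))
```

Under this assumption, roughly a third of 5-token sentences and about one in ten 10-token sentences would be tagged without any error, which matters for downstream tasks that need a fully correct analysis.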

Citation

@misc{kyrgyz_morph_2023,
  author = {Zarina},
  title  = {BERT-based Morphological Analyzer for Kyrgyz Language},
  year   = {2023},
  url    = {https://huggingface.co/Zarinaaa/morphological_analysis}
}

Author

Zarina, ML Engineer specializing in NLP and Speech Technologies for low-resource languages.