license: mit
tags:
  - bert
  - morphological-analysis
  - kyrgyz
  - nlp
  - pos-tagging
  - low-resource-languages
  - token-classification
language:
  - ky
pipeline_tag: token-classification

Kyrgyz Morphological Analysis — BERT

Model Description

A BERT-based morphological analyzer for Kyrgyz, a low-resource Turkic language spoken by roughly 5 million people. The model performs morphological tagging: for each token in a sentence, it predicts grammatical features such as part of speech, case, number, and tense.

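Kyrgyz is an agglutinative language with rich morphology: a single word can carry a chain of suffixes marking, for example, number, possession, and case. This makes morphological analysis particularly challenging, and valuable for downstream NLP tasks.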
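
To make the output format concrete, here is a minimal sketch of per-token analyses for the example sentence from the How to Use section. The `TokenAnalysis` class and the UD-style labels (`PROPN`, `Case=Nom`, and so on) are illustrative assumptions. The model's real labels come from the tag inventory in TAG.docx, and the analysis below is hand-written, not model output.

```python
from dataclasses import dataclass, field


@dataclass
class TokenAnalysis:
    """One token with its morphological tags (hypothetical container)."""
    token: str
    pos: str
    feats: dict = field(default_factory=dict)

    def tag_string(self) -> str:
        # Render as "POS|Feat=Val|..." with features sorted by name;
        # a token without features renders as the bare POS tag.
        parts = [self.pos] + [f"{k}={v}" for k, v in sorted(self.feats.items())]
        return "|".join(parts)


# Hand-written illustration; NOT actual model output and NOT the TAG.docx labels.
analysis = [
    TokenAnalysis("Кыргызстан", "PROPN", {"Case": "Nom", "Number": "Sing"}),
    TokenAnalysis("—", "PUNCT"),
    TokenAnalysis("кооз", "ADJ"),
    TokenAnalysis("өлкө", "NOUN", {"Case": "Nom", "Number": "Sing"}),
]

for item in analysis:
    print(f"{item.token}\t{item.tag_string()}")
```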

Performance

Model	Accuracy
BERT (fine-tuned)	~80%
Logistic Regression (baseline)	not reported
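
The README does not say how accuracy is computed. Assuming it is plain token-level accuracy (the share of tokens whose predicted tag matches the gold tag), a minimal sketch looks like this. The function name and the toy tags are hypothetical; dev.py holds the actual evaluation.

```python
def token_accuracy(gold, pred):
    """Fraction of tokens whose predicted tag equals the gold tag.

    gold, pred: lists of sentences, each sentence a list of tag strings.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred contain different numbers of sentences")
    correct = 0
    total = 0
    for gold_sent, pred_sent in zip(gold, pred):
        if len(gold_sent) != len(pred_sent):
            raise ValueError("gold and predicted sentences differ in length")
        correct += sum(g == p for g, p in zip(gold_sent, pred_sent))
        total += len(gold_sent)
    if total == 0:
        raise ValueError("no tokens to evaluate")
    return correct / total


# Toy data with made-up tags: 4 of 5 tokens match.
gold = [["NOUN", "ADJ", "NOUN"], ["VERB", "NOUN"]]
pred = [["NOUN", "ADJ", "VERB"], ["VERB", "NOUN"]]
print(token_accuracy(gold, pred))  # 0.8
```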

Intended Use

Use Case	Description
Kyrgyz NLP pipeline	Morphological preprocessing for machine translation, text analysis
Linguistic research	Studying Kyrgyz grammar and morphological patterns
Education	Teaching Kyrgyz morphology with automated analysis
Downstream tasks	Improving NER, dependency parsing, and sentiment analysis for Kyrgyz

Training Details

Dataset

Format: CSV with morphological annotations
Train set: train_fixed.csv
Test set: test_fixed.csv
Tag set: Defined in TAG.docx (morphological tag inventory)
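
The column layout of the CSV files is not documented here, so a reasonable first step is to inspect the header before writing any loading code. The sketch below assumes comma-separated, UTF-8 encoded files, which is a guess. `peek_csv` is a hypothetical helper, not part of the repository.

```python
import csv
from itertools import islice
from pathlib import Path


def peek_csv(path, n_rows=3):
    """Return (header, first n_rows data rows) without assuming column names."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader, [])          # empty list if the file is empty
        rows = list(islice(reader, n_rows))
    return header, rows


for name in ("train_fixed.csv", "test_fixed.csv"):
    if Path(name).exists():  # skipped when run outside the repository
        header, rows = peek_csv(name)
        print(name, "columns:", header)
        for row in rows:
            print("  ", row)
```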

Architecture

Base model: BERT (fine-tuned for token classification)
Custom variant: bert_model_variant.py
Baseline: Logistic Regression (logistic_regression.ipynb)
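
Fine-tuning BERT for token classification usually requires mapping word-level tags onto subword tokens. A common convention, which may or may not be what train.py does, is to label only the first subword of each word and mask the rest with -100 so the loss ignores them. The sketch below assumes a `word_ids` list in the shape returned by `BatchEncoding.word_ids()` on Hugging Face fast tokenizers. The subword split and label ids are made up.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def align_labels(word_ids, word_labels, ignore_index=IGNORE_INDEX):
    """Expand word-level label ids to subword level.

    word_ids: for each subword position, the index of the source word,
              or None for special tokens such as [CLS] and [SEP].
    word_labels: one label id per word.
    """
    aligned = []
    previous = None
    for wid in word_ids:
        if wid is None:
            aligned.append(ignore_index)      # special token
        elif wid != previous:
            aligned.append(word_labels[wid])  # first subword keeps the label
        else:
            aligned.append(ignore_index)      # continuation subword is masked
        previous = wid
    return aligned


# Hypothetical split: [CLS] Кыргыз ##стан — кооз өл ##кө [SEP]
word_ids = [None, 0, 0, 1, 2, 3, 3, None]
word_labels = [5, 9, 2, 7]  # made-up label ids, one per word
print(align_labels(word_ids, word_labels))
# [-100, 5, -100, 9, 2, 7, -100, -100]
```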

Framework

Python 3.10+
PyTorch / Transformers (HuggingFace)

Repository Structure

├── bert_model_variant.py      # Custom BERT model architecture
├── train.py                   # Training script
├── dev.py                     # Evaluation script
├── dev.ipynb                  # Development notebook
├── logistic_regression.ipynb  # Baseline model
├── train_fixed.csv            # Training data
├── test_fixed.csv             # Test data
└── TAG.docx                   # Morphological tag definitions

How to Use

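# Import the model class from the repository's custom module.
# The class name may differ; check bert_model_variant.py and adjust the import as needed.
from bert_model_variant import MorphAnalyzer

# Example Kyrgyz sentence ("Kyrgyzstan is a beautiful country")
text = "Кыргызстан — кооз өлкө"

# The full inference pipeline is in train.py and dev.py.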
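
Once the model has produced one predicted label id per subword, the predictions still need to be mapped back to whole words. The sketch below shows one common way to do that, taking the prediction of each word's first subword. It is not necessarily the approach used in dev.py. `decode_word_tags`, the `word_ids` layout, the predicted ids, and the `id2label` mapping are all hypothetical. In a real run `pred_ids` would come from an argmax over the model's logits.

```python
def decode_word_tags(words, word_ids, pred_ids, id2label):
    """Pair each word with the label predicted for its first subword.

    Words with no subword in word_ids (for example after truncation)
    are paired with None.
    """
    first_tag = {}
    for wid, pid in zip(word_ids, pred_ids):
        if wid is not None and wid not in first_tag:
            first_tag[wid] = id2label[pid]
    return [(word, first_tag.get(i)) for i, word in enumerate(words)]


text = "Кыргызстан — кооз өлкө"
words = text.split()

# Made-up values standing in for tokenizer and model output.
word_ids = [None, 0, 0, 1, 2, 3, 3, None]
pred_ids = [0, 1, 1, 3, 2, 0, 0, 0]
id2label = {0: "NOUN", 1: "PROPN", 2: "ADJ", 3: "PUNCT"}

for word, tag in decode_word_tags(words, word_ids, pred_ids, id2label):
    print(f"{word}\t{tag}")
```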

Why This Matters

Kyrgyz is an underrepresented language in NLP. Most morphological analyzers exist only for high-resource languages. This model contributes to:

Building foundational NLP tools for the Kyrgyz language
Enabling more complex downstream applications (MT, QA, summarization)
Preserving and digitizing Kyrgyz linguistic knowledge

Limitations

Accuracy of ~80% means roughly 1 in 5 tokens may be mistagged
Performance may vary across different text domains and registers
Limited to the morphological tag set defined in TAG.docx
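
The per-token figure understates the effect on whole sentences. As a rough illustration, if tagging errors were independent across tokens (a simplifying assumption that real errors are unlikely to satisfy), the chance of tagging every token in a sentence correctly drops quickly with sentence length. The function name is hypothetical.

```python
def p_sentence_fully_correct(token_accuracy, n_tokens):
    """Probability that all n_tokens are tagged correctly,
    under the simplifying assumption of independent errors."""
    return token_accuracy ** n_tokens


for n in (5, 10, 20):
    print(n, round(p_sentence_fully_correct(0.80, n), 3))
# 5 0.328
# 10 0.107
# 20 0.012
```

Under that assumption, only about one in ten 10-token sentences would be tagged without any error, so downstream components should be prepared for noisy tags.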

Citation

@misc{kyrgyz_morph_2023,
  author = {Zarina},
  title  = {BERT-based Morphological Analyzer for Kyrgyz Language},
  year   = {2023},
  url    = {https://huggingface.co/Zarinaaa/morphological_analysis}
}

Author

Zarina — ML Engineer specializing in NLP and Speech Technologies for low-resource languages.

🤗 HuggingFace
💼 LinkedIn