---
title: Tatar Morphological Analyzer
emoji: 🔤
colorFrom: blue
colorTo: green
sdk: streamlit
pinned: true
app_file: app.py
license: mit
short_description: Interactive demo for 5 state-of-the-art Tatar tagger models
sdk_version: 1.55.0
---

# 🔤 Tatar Morphological Analyzer

**Interactive exploration of five fine‑tuned models for Tatar morphological tagging.**

Compare mBERT, RuBERT, DistilBERT, XLM‑R, and Turkish BERT on your own sentences.


## 🌟 Overview

This Space provides a unified interface to five state‑of‑the‑art models for morphological analysis of the Tatar language. Each model predicts full morphological tags (part‑of‑speech, number, case, possession, etc.) at the token level — a fundamental task for Tatar NLP.

Choose a model, type a sentence, and instantly see the predicted tags, along with confidence scores and color‑coded visualisation.
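Under the hood this is token classification: BERT‑style tokenizers split words into subwords, so word‑level tags have to be recovered from subword predictions. A minimal sketch of one common convention (taking each word's tag and confidence from its first subword; the helper name is hypothetical and not from `app.py`):

```python
def collapse_to_words(word_ids, labels, scores):
    """Collapse subword predictions to word-level (tag, confidence) pairs.

    word_ids[i] is the word index of subword i, or None for special
    tokens such as [CLS]/[SEP] (the convention used by fast tokenizers).
    """
    tags = {}
    for wid, label, score in zip(word_ids, labels, scores):
        if wid is None or wid in tags:
            continue  # skip special tokens and non-initial subwords
        tags[wid] = (label, score)
    return [tags[w] for w in sorted(tags)]


# Example: "Мин сөйләшәм" where the second word splits into two subwords.
print(collapse_to_words(
    [None, 0, 1, 1, None],
    ["O", "Pron+Sg+Nom+Pers(1)", "V+Pres+1", "X", "O"],
    [0.1, 0.999, 0.997, 0.5, 0.2],
))
```

The same idea works for any of the five models, since all share the subword‑to‑word alignment problem.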

## 🚀 Key Features

### 🧩 Model Selection

Switch between 5 different transformer models:

- **mBERT** (best overall accuracy)
- **RuBERT** (excellent due to Russian proximity)
- **DistilBERT** (lightweight, fast)
- **XLM‑R** (powerful multilingual)
- **Turkish BERT** (good baseline)

### 🔍 Interactive Analysis

- **Real‑time tagging:** token‑level morphological tags with confidence scores.
- **Visual badges:** colour‑coded display of tags for quick scanning.
- **Example sentences:** pre‑loaded examples for instant testing.
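The colour coding can be driven by simple confidence bucketing; the thresholds below are illustrative, not taken from `app.py`:

```python
def badge_color(confidence: float) -> str:
    """Map a prediction confidence to a display colour (illustrative thresholds)."""
    if confidence >= 0.95:
        return "green"   # high confidence
    if confidence >= 0.80:
        return "orange"  # worth a second look
    return "red"         # low confidence


print(badge_color(0.999))  # high-confidence tag
```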

### 📊 Model Metrics

- Each model's token accuracy, F1‑micro, and F1‑macro are shown directly in the sidebar.

## 🎮 Quick Start Examples

### Try These Sentences

| Language | Sentence |
|---|---|
| Tatar | Мин татарча сөйләшәм. |
| Tatar | Кичә мин дусларым белән паркка бардым. |
| Tatar | Татарстан – Россия Федерациясе составындагы республика. |

Just paste any sentence into the text box and click Analyze!

### Expected Output (mBERT on the first sentence)

| Word | Morphological Tag | Confidence |
|---|---|---|
| Мин | Pron+Sg+Nom+Pers(1) | 0.999 |
| татарча | Adv | 0.998 |
| сөйләшәм | V+Pres+1 | 0.997 |
| . | PUNCT | 1.000 |

## 📈 Model Performance Comparison

| Model | Token Accuracy | F1‑micro | F1‑macro | Speed (sent./sec) |
|---|---|---|---|---|
| mBERT | 98.68% | 98.68% | 50.94% | 150 |
| RuBERT | 98.13% | 98.13% | 47.37% | 150 |
| DistilBERT | 97.98% | 97.98% | 44.02% | 250 |
| XLM‑R | 97.67% | 97.67% | 40.61% | 120 |
| Turkish BERT | 86.84% | 86.84% | 33.34% | 150 |
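A note on reading the table: because every token receives exactly one tag, F1‑micro pools counts over all tags and coincides with token accuracy, while F1‑macro is the unweighted mean of per‑tag F1 over all 1,181 tags, so rare tags pull it far below micro. A small self‑contained illustration of the two averages:

```python
from collections import Counter


def micro_macro_f1(gold, pred):
    """Micro and macro F1 for single-label token tagging."""
    labels = set(gold) | set(pred)
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # each mismatch is one false positive...
            fn[g] += 1  # ...and one false negative
    # Micro: pool counts over all labels. Since total fp == total fn,
    # micro-F1 reduces to plain token accuracy.
    t, f_p, f_n = sum(tp.values()), sum(fp.values()), sum(fn.values())
    micro = 2 * t / (2 * t + f_p + f_n)
    # Macro: unweighted mean of per-label F1; a rare tag counts as much
    # as a frequent one, which is why macro lags micro so badly here.
    per_label = []
    for lab in labels:
        denom = 2 * tp[lab] + fp[lab] + fn[lab]
        per_label.append(2 * tp[lab] / denom if denom else 0.0)
    macro = sum(per_label) / len(labels)
    return micro, macro


# One missed rare tag out of four tokens: micro stays high, macro halves.
print(micro_macro_f1(["N", "N", "N", "V"], ["N", "N", "N", "N"]))
```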

Metrics are computed on a held‑out test set of 6,000 sentences (47k+ tokens). Full per‑POS accuracies are available in the results/ folder of each model repository.

## 🏗️ Technical Architecture

### Models

All models are fine‑tuned from popular transformer checkpoints:

| Model | Base Checkpoint | Parameters |
|---|---|---|
| mBERT | bert-base-multilingual-cased | ~180M |
| RuBERT | DeepPavlov/rubert-base-cased | ~180M |
| DistilBERT | distilbert-base-multilingual-cased | ~134M |
| XLM‑R | xlm-roberta-base | ~270M |
| Turkish BERT | dbmdz/bert-base-turkish-cased | ~180M |

### Training Data

- **Dataset:** TatarNLPWorld/tatar-morphological-corpus
- **Training subset:** 60,000 sentences (shuffled with seed 42, filtered → 59,992)
- **Split:** Train 47,993 / Validation 5,999 / Test 6,000
- **Tagset:** 1,181 unique morphological tags (e.g., N+Sg+Nom, V+Past+3, PUNCT)
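The tags are composite strings that join a part of speech with feature values using `+`. A tiny helper to decompose them (the split convention is inferred from the examples above, not from the corpus documentation):

```python
def split_tag(tag: str):
    """Split a composite tag like 'N+Sg+Nom' into (POS, [features])."""
    pos, *feats = tag.split("+")
    return pos, feats


print(split_tag("V+Past+3"))  # part of speech first, then features
```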

### Hyperparameters (common to all models)

| Parameter | Value |
|---|---|
| Learning rate | 2e-5 |
| Optimizer | AdamW (weight decay 0.01) |
| Warmup steps | 500 |
| Number of epochs | 4 |
| Max sequence length | 128 |
| Mixed precision | FP16 |
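For reuse in your own fine‑tuning script, the table maps to a plain configuration like the following (batch size is not stated in this README, so it is deliberately left out rather than guessed):

```python
# Plain-Python mirror of the hyperparameter table above.
TRAINING_CONFIG = {
    "learning_rate": 2e-5,
    "optimizer": "AdamW",
    "weight_decay": 0.01,
    "warmup_steps": 500,
    "num_train_epochs": 4,
    "max_seq_length": 128,
    "fp16": True,  # mixed precision
}
```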

### Training Hardware

- **GPU:** NVIDIA Tesla V100 (32 GB)
- **Training time:** 4–8 hours per model
- **Inference speed:** varies by model (see table above)

## 📦 Repository Structure

```
.
├── app.py                 # Main Streamlit application
├── requirements.txt       # Python dependencies
├── .streamlit/
│   └── config.toml        # Streamlit server config (port 7860)
├── results/               # (optional) Additional metrics and plots
└── README.md              # This file
```
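For reference, Hugging Face Spaces serve Streamlit apps on port 7860, so the `config.toml` referenced above is presumably along these lines (headless mode is an assumption, commonly used for non-interactive deployments):

```toml
[server]
port = 7860
headless = true
```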

## 🚀 Local Deployment

```bash
# Clone the Space
git clone https://huggingface.co/spaces/TatarNLPWorld/tatar-morph-analyzer
cd tatar-morph-analyzer

# Install dependencies
pip install -r requirements.txt

# Run the app
streamlit run app.py --server.port 8501
```

The app will be available at http://localhost:8501.

## 📜 Citation

If you use this Space or any of the underlying models in your research, please cite the appropriate model (see each model card for BibTeX). For general attribution:

```bibtex
@misc{tatar-morph-analyzer,
  author = {Arabov, Mullosharaf Kurbonovich and TatarNLPWorld Contributors},
  title = {Tatar Morphological Analyzer – Interactive Demo},
  year = {2026},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/spaces/TatarNLPWorld/tatar-morph-analyzer}}
}
```

## 📄 License

The code in this Space is released under the MIT License. Each model retains its own license (all are Apache 2.0 or MIT).

## 🙏 Acknowledgments


*Explore and advance Tatar language technology.*

Brought to you by **TatarNLPWorld**.
