---
title: Tatar Morphological Analyzer
emoji: 🔤
colorFrom: blue
colorTo: green
sdk: streamlit
sdk_version: 1.55.0
app_file: app.py
pinned: true
license: mit
short_description: Interactive demo for 5 state-of-the-art Tatar tagger models
---
# 🔤 Tatar Morphological Analyzer

**Interactive exploration of five fine‑tuned models for Tatar morphological tagging.**
Compare mBERT, RuBERT, DistilBERT, XLM‑R, and Turkish BERT on your own sentences.
## 🌟 Overview
This Space provides a unified interface to five state‑of‑the‑art models for morphological analysis of the Tatar language. Each model predicts full morphological tags (part‑of‑speech, number, case, possession, etc.) at the token level — a fundamental task for Tatar NLP.
Choose a model, type a sentence, and instantly see the predicted tags, along with confidence scores and color‑coded visualisation.
## 🚀 Key Features
### 🧩 Model Selection
- Switch between five transformer models:
  - **mBERT** (best overall accuracy)
  - **RuBERT** (excellent due to Russian proximity)
  - **DistilBERT** (lightweight, fast)
  - **XLM‑R** (powerful multilingual)
  - **Turkish BERT** (good baseline)
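In `app.py`, the model picker can boil down to a simple name‑to‑checkpoint mapping. A minimal sketch, assuming the base checkpoint IDs from the table later in this README; the fine‑tuned repo IDs actually used by the app may differ, and the helper name is ours:

```python
# Hypothetical mapping from UI label to checkpoint ID. The base
# checkpoints come from this README; the app's fine-tuned repos may differ.
MODEL_CHECKPOINTS = {
    "mBERT": "bert-base-multilingual-cased",
    "RuBERT": "DeepPavlov/rubert-base-cased",
    "DistilBERT": "distilbert-base-multilingual-cased",
    "XLM-R": "xlm-roberta-base",
    "Turkish BERT": "dbmdz/bert-base-turkish-cased",
}

def resolve_checkpoint(ui_label: str) -> str:
    """Return the checkpoint ID for the model chosen in the sidebar."""
    try:
        return MODEL_CHECKPOINTS[ui_label]
    except KeyError:
        raise ValueError(f"Unknown model: {ui_label!r}") from None
```

In a Streamlit sidebar this would typically feed a `st.selectbox` over `MODEL_CHECKPOINTS.keys()`, with the resolved checkpoint loaded once and cached.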
### 🔍 Interactive Analysis
- Real‑time tagging: Get token‑level morphological tags with confidence scores.
- Visual badges: Colour‑coded display of tags for quick scanning.
- Example sentences: Pre‑loaded examples for instant testing.
### 📊 Model Metrics
- For each model, you can see its token accuracy, F1‑micro, and F1‑macro directly in the sidebar.
## 🎮 Quick Start Examples

### Try These Sentences
| Language | Sentence |
|---|---|
| Tatar | Мин татарча сөйләшәм. |
| Tatar | Кичә мин дусларым белән паркка бардым. |
| Tatar | Татарстан – Россия Федерациясе составындагы республика. |
Just paste any sentence into the text box and click Analyze!
### Expected Output (mBERT on the first sentence)
| Word | Morphological Tag | Confidence |
|---|---|---|
| Мин | Pron+Sg+Nom+Pers(1) | 0.999 |
| татарча | Adv | 0.998 |
| сөйләшәм | V+Pres+1 | 0.997 |
| . | PUNCT | 1.000 |
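Tags are plain `+`‑separated feature strings, so turning a prediction like `Pron+Sg+Nom+Pers(1)` into a part of speech plus features takes only string handling. A minimal sketch (the function name is ours, not part of the app):

```python
def parse_tag(tag: str) -> dict:
    """Split a morphological tag such as 'Pron+Sg+Nom+Pers(1)' into its
    part of speech (the first element) and the remaining features."""
    parts = tag.split("+")
    return {"pos": parts[0], "features": parts[1:]}

print(parse_tag("Pron+Sg+Nom+Pers(1)"))
# → {'pos': 'Pron', 'features': ['Sg', 'Nom', 'Pers(1)']}
```

A bare tag like `PUNCT` parses to a POS with an empty feature list, which matches how it appears in the table above.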
## 📈 Model Performance Comparison
| Model | Token Accuracy | F1‑micro | F1‑macro | Speed (sent./sec) |
|---|---|---|---|---|
| mBERT | 98.68% | 98.68% | 50.94% | 150 |
| RuBERT | 98.13% | 98.13% | 47.37% | 150 |
| DistilBERT | 97.98% | 97.98% | 44.02% | 250 |
| XLM‑R | 97.67% | 97.67% | 40.61% | 120 |
| Turkish BERT | 86.84% | 86.84% | 33.34% | 150 |
Metrics are computed on a held‑out test set of 6,000 sentences (47k+ tokens). Full per‑POS accuracies are available in the `results/` folder of each model repository.
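Note that token accuracy and F1‑micro coincide in every row: for single‑label, per‑token classification, micro‑averaged F1 reduces to plain accuracy, while F1‑macro averages per‑tag F1 over the full tagset and is therefore pulled down by the many rare tags. A minimal pure‑Python sketch of both metrics (helper names are ours):

```python
from collections import Counter

def token_accuracy(gold, pred):
    """Fraction of tokens whose predicted tag matches the gold tag.
    For single-label tagging this equals micro-averaged F1."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def macro_f1(gold, pred):
    """Unweighted mean of per-tag F1 over every tag seen in gold or pred."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1
            fn[g] += 1
    tags = set(tp) | set(fp) | set(fn)
    f1s = []
    for t in tags:
        prec = tp[t] / (tp[t] + fp[t]) if tp[t] + fp[t] else 0.0
        rec = tp[t] / (tp[t] + fn[t]) if tp[t] + fn[t] else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

gold = ["N", "V", "N", "PUNCT"]
pred = ["N", "V", "V", "PUNCT"]
print(token_accuracy(gold, pred))  # → 0.75
```

One mistake on a rare tag barely moves accuracy but can cost a whole F1 point in the macro average, which is why F1‑macro sits far below token accuracy in the table.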
## 🏗️ Technical Architecture

### Models
All models are fine‑tuned from popular transformer checkpoints:
| Model | Base Checkpoint | Parameters |
|---|---|---|
| mBERT | `bert-base-multilingual-cased` | ~180M |
| RuBERT | `DeepPavlov/rubert-base-cased` | ~180M |
| DistilBERT | `distilbert-base-multilingual-cased` | ~134M |
| XLM‑R | `xlm-roberta-base` | ~270M |
| Turkish BERT | `dbmdz/bert-base-turkish-cased` | ~180M |
### Training Data
- Dataset: `TatarNLPWorld/tatar-morphological-corpus`
- Training subset: 60,000 sentences (shuffled with seed 42, filtered → 59,992)
- Split: Train 47,993 / Validation 5,999 / Test 6,000
- Tagset: 1,181 unique morphological tags (e.g., `N+Sg+Nom`, `V+Past+3`, `PUNCT`)
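The split sizes above follow from an 80/10/10 division of the 59,992 filtered sentences, with the test split taking the rounding remainder. A sketch of the arithmetic (the actual splitting code lives in the training scripts, not in this Space):

```python
def split_sizes(total: int, train_frac: float = 0.8, val_frac: float = 0.1):
    """Split `total` items into train/val/test counts.

    Train and validation sizes are floored; test receives the remainder,
    so the three counts always sum exactly to `total`.
    """
    n_train = int(total * train_frac)
    n_val = int(total * val_frac)
    n_test = total - n_train - n_val
    return n_train, n_val, n_test

print(split_sizes(59_992))
# → (47993, 5999, 6000)
```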
### Hyperparameters (common to all models)
| Parameter | Value |
|---|---|
| Learning rate | 2e-5 |
| Optimizer | AdamW (wd=0.01) |
| Warmup steps | 500 |
| Number of epochs | 4 |
| Max sequence length | 128 |
| Mixed precision | FP16 |
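These settings map onto the Hugging Face `Trainer` API. Shown here as a plain dictionary so the sketch stays dependency‑free; the keys follow `transformers.TrainingArguments` naming, and the constant name is ours:

```python
# Shared fine-tuning configuration, mirroring the table above.
# Pass these as keyword arguments to transformers.TrainingArguments.
TRAINING_CONFIG = {
    "learning_rate": 2e-5,
    "weight_decay": 0.01,   # AdamW is the Trainer's default optimizer
    "warmup_steps": 500,
    "num_train_epochs": 4,
    "fp16": True,           # mixed precision
}

# Applied at tokenization time (truncation/padding), not in TrainingArguments.
MAX_SEQ_LENGTH = 128
```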
### Training Hardware
- GPU: NVIDIA Tesla V100 (32GB)
- Training time: 4–8 hours per model
- Inference speed: Varies by model (see table above)
## 📦 Repository Structure

```text
.
├── app.py              # Main Streamlit application
├── requirements.txt    # Python dependencies
├── .streamlit/
│   └── config.toml     # Streamlit server config (port 7860)
├── results/            # (optional) Additional metrics and plots
└── README.md           # This file
```
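The `.streamlit/config.toml` mentioned above needs little more than the server port that Spaces expects. A minimal sketch; `headless` is a common addition for containerized deployments, and your file may set further options:

```toml
[server]
port = 7860
headless = true
```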
## 🚀 Local Deployment

```bash
# Clone the Space
git clone https://huggingface.co/spaces/TatarNLPWorld/tatar-morph-analyzer
cd tatar-morph-analyzer

# Install dependencies
pip install -r requirements.txt

# Run the app
streamlit run app.py --server.port 8501
```

The app will be available at http://localhost:8501.
## 📜 Citation
If you use this Space or any of the underlying models in your research, please cite the appropriate model (see each model card for BibTeX). For general attribution:
```bibtex
@misc{tatar-morph-analyzer,
  author       = {Arabov, Mullosharaf Kurbonovich and TatarNLPWorld Contributors},
  title        = {Tatar Morphological Analyzer -- Interactive Demo},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/spaces/TatarNLPWorld/tatar-morph-analyzer}}
}
```
## 📄 License
The code in this Space is released under the MIT License. Each model retains its own license (all are Apache 2.0 or MIT).
## 🙏 Acknowledgments
- Dataset: TatarNLPWorld/tatar-morphological-corpus
- Model checkpoints: Hugging Face Hub
- Framework: Streamlit, Transformers, PyTorch
- Community: All contributors to TatarNLPWorld
*Explore and advance Tatar language technology.*

*Brought to you by TatarNLPWorld*