---
title: Tatar Morphological Analyzer
emoji: 🔤
colorFrom: blue
colorTo: green
sdk: streamlit
pinned: true
app_file: app.py
license: mit
short_description: Interactive demo for 5 state-of-the-art Tatar tagger models
sdk_version: 1.55.0
---

# 🔤 Tatar Morphological Analyzer

**Interactive exploration of five fine‑tuned models for Tatar morphological tagging.**

Compare mBERT, RuBERT, DistilBERT, XLM‑R, and Turkish BERT on your own sentences.


## 🌟 Overview

This Space provides a unified interface to five state‑of‑the‑art models for morphological analysis of the Tatar language. Each model predicts full morphological tags (part‑of‑speech, number, case, possession, etc.) at the token level — a fundamental task for Tatar NLP.

Choose a model, type a sentence, and instantly see the predicted tags, along with confidence scores and color‑coded visualisation.
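Under the hood this is token classification: BERT‑style tokenizers split words into subwords, so word‑level tags have to be recovered from subword predictions. A minimal sketch of one common convention (taking each word's tag and confidence from its first subword; the helper name is hypothetical and not from `app.py`):

```python
def collapse_to_words(word_ids, labels, scores):
    """Collapse subword predictions to word-level (tag, confidence) pairs.

    word_ids[i] is the word index of subword i, or None for special
    tokens such as [CLS]/[SEP] (the convention used by fast tokenizers).
    """
    tags = {}
    for wid, label, score in zip(word_ids, labels, scores):
        if wid is None or wid in tags:
            continue  # skip special tokens and non-initial subwords
        tags[wid] = (label, score)
    return [tags[w] for w in sorted(tags)]


# Example: "Мин сөйләшәм" where the second word splits into two subwords.
print(collapse_to_words(
    [None, 0, 1, 1, None],
    ["O", "Pron+Sg+Nom+Pers(1)", "V+Pres+1", "X", "O"],
    [0.1, 0.999, 0.997, 0.5, 0.2],
))
```

The same idea works for any of the five models, since all share the subword‑to‑word alignment problem.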

## 🚀 Key Features

### 🧩 Model Selection

Switch between 5 different transformer models:

- **mBERT** (best overall accuracy)
- **RuBERT** (excellent due to Russian proximity)
- **DistilBERT** (lightweight, fast)
- **XLM‑R** (powerful multilingual)
- **Turkish BERT** (good baseline)

### 🔍 Interactive Analysis

- **Real‑time tagging:** token‑level morphological tags with confidence scores.
- **Visual badges:** colour‑coded display of tags for quick scanning.
- **Example sentences:** pre‑loaded examples for instant testing.
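The colour coding can be driven by simple confidence bucketing; the thresholds below are illustrative, not taken from `app.py`:

```python
def badge_color(confidence: float) -> str:
    """Map a prediction confidence to a display colour (illustrative thresholds)."""
    if confidence >= 0.95:
        return "green"   # high confidence
    if confidence >= 0.80:
        return "orange"  # worth a second look
    return "red"         # low confidence


print(badge_color(0.999))  # high-confidence tag
```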

### 📊 Model Metrics

- Each model's token accuracy, F1‑micro, and F1‑macro are shown directly in the sidebar.

## 🎮 Quick Start Examples

### Try These Sentences

| Language | Sentence |
|---|---|
| Tatar | Мин татарча сөйләшәм. |
| Tatar | Кичә мин дусларым белән паркка бардым. |
| Tatar | Татарстан – Россия Федерациясе составындагы республика. |

Just paste any sentence into the text box and click Analyze!

### Expected Output (mBERT on the first sentence)

| Word | Morphological Tag | Confidence |
|---|---|---|
| Мин | Pron+Sg+Nom+Pers(1) | 0.999 |
| татарча | Adv | 0.998 |
| сөйләшәм | V+Pres+1 | 0.997 |
| . | PUNCT | 1.000 |

## 📈 Model Performance Comparison

| Model | Token Accuracy | F1‑micro | F1‑macro | Speed (sent./sec) |
|---|---|---|---|---|
| mBERT | 98.68% | 98.68% | 50.94% | 150 |
| RuBERT | 98.13% | 98.13% | 47.37% | 150 |
| DistilBERT | 97.98% | 97.98% | 44.02% | 250 |
| XLM‑R | 97.67% | 97.67% | 40.61% | 120 |
| Turkish BERT | 86.84% | 86.84% | 33.34% | 150 |
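A note on reading the table: because every token receives exactly one tag, F1‑micro pools counts over all tags and coincides with token accuracy, while F1‑macro is the unweighted mean of per‑tag F1 over all 1,181 tags, so rare tags pull it far below micro. A small self‑contained illustration of the two averages:

```python
from collections import Counter


def micro_macro_f1(gold, pred):
    """Micro and macro F1 for single-label token tagging."""
    labels = set(gold) | set(pred)
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # each mismatch is one false positive...
            fn[g] += 1  # ...and one false negative
    # Micro: pool counts over all labels. Since total fp == total fn,
    # micro-F1 reduces to plain token accuracy.
    t, f_p, f_n = sum(tp.values()), sum(fp.values()), sum(fn.values())
    micro = 2 * t / (2 * t + f_p + f_n)
    # Macro: unweighted mean of per-label F1; a rare tag counts as much
    # as a frequent one, which is why macro lags micro so badly here.
    per_label = []
    for lab in labels:
        denom = 2 * tp[lab] + fp[lab] + fn[lab]
        per_label.append(2 * tp[lab] / denom if denom else 0.0)
    macro = sum(per_label) / len(labels)
    return micro, macro


# One missed rare tag out of four tokens: micro stays high, macro halves.
print(micro_macro_f1(["N", "N", "N", "V"], ["N", "N", "N", "N"]))
```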

Metrics are computed on a held‑out test set of 6,000 sentences (47k+ tokens). Full per‑POS accuracies are available in the results/ folder of each model repository.

## 🏗️ Technical Architecture

### Models

All models are fine‑tuned from popular transformer checkpoints:

| Model | Base Checkpoint | Parameters |
|---|---|---|
| mBERT | bert-base-multilingual-cased | ~180M |
| RuBERT | DeepPavlov/rubert-base-cased | ~180M |
| DistilBERT | distilbert-base-multilingual-cased | ~134M |
| XLM‑R | xlm-roberta-base | ~270M |
| Turkish BERT | dbmdz/bert-base-turkish-cased | ~180M |

### Training Data

- **Dataset:** TatarNLPWorld/tatar-morphological-corpus
- **Training subset:** 60,000 sentences (shuffled with seed 42, filtered → 59,992)
- **Split:** Train 47,993 / Validation 5,999 / Test 6,000
- **Tagset:** 1,181 unique morphological tags (e.g., N+Sg+Nom, V+Past+3, PUNCT)
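The tags are composite strings that join a part of speech with feature values using `+`. A tiny helper to decompose them (the split convention is inferred from the examples above, not from the corpus documentation):

```python
def split_tag(tag: str):
    """Split a composite tag like 'N+Sg+Nom' into (POS, [features])."""
    pos, *feats = tag.split("+")
    return pos, feats


print(split_tag("V+Past+3"))  # part of speech first, then features
```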

### Hyperparameters (common to all models)

| Parameter | Value |
|---|---|
| Learning rate | 2e-5 |
| Optimizer | AdamW (weight decay 0.01) |
| Warmup steps | 500 |
| Number of epochs | 4 |
| Max sequence length | 128 |
| Mixed precision | FP16 |
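For reuse in your own fine‑tuning script, the table maps to a plain configuration like the following (batch size is not stated in this README, so it is deliberately left out rather than guessed):

```python
# Plain-Python mirror of the hyperparameter table above.
TRAINING_CONFIG = {
    "learning_rate": 2e-5,
    "optimizer": "AdamW",
    "weight_decay": 0.01,
    "warmup_steps": 500,
    "num_train_epochs": 4,
    "max_seq_length": 128,
    "fp16": True,  # mixed precision
}
```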

### Training Hardware

- **GPU:** NVIDIA Tesla V100 (32 GB)
- **Training time:** 4–8 hours per model
- **Inference speed:** varies by model (see table above)

## 📦 Repository Structure

```
.
├── app.py                 # Main Streamlit application
├── requirements.txt       # Python dependencies
├── .streamlit/
│   └── config.toml        # Streamlit server config (port 7860)
├── results/               # (optional) Additional metrics and plots
└── README.md              # This file
```
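For reference, Hugging Face Spaces serve Streamlit apps on port 7860, so the `config.toml` referenced above is presumably along these lines (headless mode is an assumption, commonly used for non-interactive deployments):

```toml
[server]
port = 7860
headless = true
```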

## 🚀 Local Deployment

```bash
# Clone the Space
git clone https://huggingface.co/spaces/TatarNLPWorld/tatar-morph-analyzer
cd tatar-morph-analyzer

# Install dependencies
pip install -r requirements.txt

# Run the app
streamlit run app.py --server.port 8501
```

The app will be available at http://localhost:8501.

## 📜 Citation

If you use this Space or any of the underlying models in your research, please cite the appropriate model (see each model card for BibTeX). For general attribution:

```bibtex
@misc{tatar-morph-analyzer,
  author = {Arabov, Mullosharaf Kurbonovich and TatarNLPWorld Contributors},
  title = {Tatar Morphological Analyzer – Interactive Demo},
  year = {2026},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/spaces/TatarNLPWorld/tatar-morph-analyzer}}
}
```

## 📄 License

The code in this Space is released under the MIT License. Each model retains its own license (all are Apache 2.0 or MIT).

## 🙏 Acknowledgments


*Explore and advance Tatar language technology.*

Brought to you by **TatarNLPWorld**.
