Update README.md
README.md
CHANGED
|
@@ -1,5 +1,5 @@
|
|
| 1 |
---
|
| 2 |
-
title:
|
| 3 |
emoji: 🔤
|
| 4 |
colorFrom: blue
|
| 5 |
colorTo: green
|
|
@@ -7,254 +7,183 @@ sdk: streamlit
|
|
| 7 |
pinned: true
|
| 8 |
app_file: app.py
|
| 9 |
license: mit
|
| 10 |
-
short_description:
|
| 11 |
sdk_version: 1.55.0
|
| 12 |
---
|
| 13 |
|
| 14 |
-
|
| 15 |
-
language:
|
| 16 |
-
- tt
|
| 17 |
-
- multilingual
|
| 18 |
-
license: mit
|
| 19 |
-
library_name: transformers
|
| 20 |
-
tags:
|
| 21 |
-
- tatar
|
| 22 |
-
- morphology
|
| 23 |
-
- token-classification
|
| 24 |
-
- bert
|
| 25 |
-
- multilingual
|
| 26 |
-
- turkic-languages
|
| 27 |
-
- seqeval
|
| 28 |
-
datasets:
|
| 29 |
-
- TatarNLPWorld/tatar-morphological-corpus
|
| 30 |
-
metrics:
|
| 31 |
-
- accuracy
|
| 32 |
-
- f1
|
| 33 |
-
- precision
|
| 34 |
-
- recall
|
| 35 |
-
widget:
|
| 36 |
-
- text: "Мин татарча сөйләшәм"
|
| 37 |
-
example_title: "Simple sentence"
|
| 38 |
-
- text: "Кичә мин дусларым белән паркка бардым"
|
| 39 |
-
example_title: "Complex sentence"
|
| 40 |
-
- text: "Татарстан – Россия Федерациясе составындагы республика"
|
| 41 |
-
example_title: "Definition"
|
| 42 |
-
model-index:
|
| 43 |
-
- name: tatar-morph-mbert
|
| 44 |
-
results:
|
| 45 |
-
- task:
|
| 46 |
-
type: token-classification
|
| 47 |
-
name: Morphological Analysis
|
| 48 |
-
dataset:
|
| 49 |
-
name: TatarNLPWorld/tatar-morphological-corpus
|
| 50 |
-
type: TatarNLPWorld/tatar-morphological-corpus
|
| 51 |
-
split: test
|
| 52 |
-
revision: main
|
| 53 |
-
metrics:
|
| 54 |
-
- type: accuracy
|
| 55 |
-
value: 0.9868
|
| 56 |
-
name: Token Accuracy
|
| 57 |
-
- type: f1
|
| 58 |
-
value: 0.9868
|
| 59 |
-
name: F1-micro
|
| 60 |
-
- type: f1
|
| 61 |
-
value: 0.5094
|
| 62 |
-
name: F1-macro
|
| 63 |
-
---
|
| 64 |
-
|
| 65 |
-
# 🔤 Tatar Morphological Analyzer (mBERT)
|
| 66 |
|
| 67 |
<div align="center">
|
| 68 |
|
| 69 |
-
**
|
| 70 |
|
| 71 |
-
*
|
| 72 |
|
| 73 |
-
[](LICENSE)
|
| 75 |
-
[![
|
| 76 |
-
[![
|
| 77 |
|
| 78 |
</div>
|
| 79 |
|
| 80 |
## 🌟 Overview
|
| 81 |
|
| 82 |
-
This
|
| 83 |
|
| 84 |
-
|
| 85 |
|
| 86 |
## 🚀 Key Features
|
| 87 |
|
| 88 |
-
###
|
| 89 |
-
-
|
| 90 |
-
-
|
| 91 |
-
|
| 92 |
-
|
| 93 |
-
-
|
| 94 |
-
|
| 95 |
-
### 📦 Easy Integration
|
| 96 |
-
- Ready‑to‑use via Hugging Face `transformers` pipeline.
|
| 97 |
-
- Compatible with `token-classification` and `ner` pipelines.
|
| 98 |
|
| 99 |
-
##
|
| 100 |
|
| 101 |
-
|
| 102 |
-
|
| 103 |
-
| **Token Accuracy** | **98.68%** | [98.58%, 98.78%] |
|
| 104 |
-
| **F1 (micro)** | **98.68%** | [98.58%, 98.78%] |
|
| 105 |
-
| **F1 (macro)** | **50.94%** | [48.73%, 53.15%] |
|
| 106 |
-
| Precision (micro) | 98.68% | — |
|
| 107 |
-
| Recall (micro) | 98.68% | — |
|
| 108 |
-
|
| 109 |
-
### Accuracy by Part‑of‑Speech (Top 5 Frequent POS)
|
| 110 |
-
|
| 111 |
-
| POS | Accuracy |
|
| 112 |
-
|-------|----------|
|
| 113 |
-
| PUNCT | 100.00% |
|
| 114 |
-
| NOUN | 98.75% |
|
| 115 |
-
| VERB | 98.12% |
|
| 116 |
-
| ADP | 99.65% |
|
| 117 |
-
| ADJ | 97.50% |
|
| 118 |
-
|
| 119 |
-
> *Full POS breakdown is available in the [`results/`](results/) folder of this repository.*
|
| 120 |
|
| 121 |
## 🎮 Quick Start Examples
|
| 122 |
|
| 123 |
-
###
|
| 124 |
|
| 125 |
-
|
| 126 |
-
|
| 127 |
|
| 128 |
-
|
| 129 |
-
"token-classification",
|
| 130 |
-
model="TatarNLPWorld/tatar-morph-mbert",
|
| 131 |
-
aggregation_strategy="simple"
|
| 132 |
-
)
|
| 133 |
|
| 134 |
-
|
| 135 |
-
results = pipe(sentence)
|
| 136 |
|
| 137 |
-
|
| 138 |
-
|
| 139 |
-
|
| 140 |
|
| 141 |
-
|
| 142 |
-
```
|
| 143 |
-
Мин: Pron+Sg+Nom+Pers(1)
|
| 144 |
-
татарча: Adv
|
| 145 |
-
сөйләшәм: V+Pres+1
|
| 146 |
-
.: PUNCT
|
| 147 |
-
```
|
| 148 |
|
| 149 |
-
|
| 150 |
|
| 151 |
-
``
|
| 152 |
-
from transformers import AutoTokenizer, AutoModelForTokenClassification
|
| 153 |
-
import torch
|
| 154 |
|
| 155 |
-
|
| 156 |
-
model = AutoModelForTokenClassification.from_pretrained("TatarNLPWorld/tatar-morph-mbert")
|
| 157 |
|
| 158 |
-
|
| 159 |
-
inputs = tokenizer(sentence, return_tensors="pt", is_split_into_words=False)
|
| 160 |
-
with torch.no_grad():
|
| 161 |
-
outputs = model(**inputs).logits
|
| 162 |
|
| 163 |
-
|
| 164 |
-
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
|
| 165 |
-
word_ids = inputs.word_ids()
|
| 166 |
|
| 167 |
-
|
| 168 |
-
|
| 169 |
-
|
| 170 |
-
|
| 171 |
-
|
| 172 |
-
|
| 173 |
-
``
|
| 174 |
|
| 175 |
-
##
|
| 176 |
|
| 177 |
-
|
| 178 |
|
| 179 |
-
|
| 180 |
-
- **Fine‑tuning task**: Token classification with a linear head
|
| 181 |
-
- **Tagset size**: 1,181 unique morphological tags (e.g., `N+Sg+Nom`, `V+Past+3`, `PUNCT`)
|
| 182 |
|
| 183 |
-
|
| 184 |
|
| 185 |
-
|
| 186 |
-
- **Training subset**: 60,000 sentences (shuffled with seed 42, filtered empty sentences → 59,992)
|
| 187 |
-
- **Split**: Train 47,993 / Validation 5,999 / Test 6,000 sentences
|
| 188 |
-
- **Tagset**: extracted from the dataset (all unique tag sequences)
|
| 189 |
|
| 190 |
-
|
| 191 |
|
| 192 |
-
##
|
| 193 |
|
| 194 |
-
|
| 195 |
-
|
| 196 |
-
|
| 197 |
-
|
| 198 |
-
|
| 199 |
-
|
| 200 |
-
|
| 201 |
-
|
| 202 |
-
|
| 203 |
|
| 204 |
-
##
|
| 205 |
|
| 206 |
-
|
| 207 |
-
|
| 208 |
-
|
| 209 |
-
|
| 210 |
|
| 211 |
-
#
|
|
|
|
| 212 |
|
| 213 |
-
|
| 214 |
|
| 215 |
-
|
| 216 |
-
|------------------------|------------|
|
| 217 |
-
| Total sentences | 59,992 |
|
| 218 |
-
| Unique tags | 1,181 |
|
| 219 |
-
| Avg. sentence length | 8.0 tokens |
|
| 220 |
-
| Median sentence length | 6 tokens |
|
| 221 |
-
| Language | Tatar (tt) |
|
| 222 |
|
| 223 |
## 📜 Citation
|
| 224 |
|
| 225 |
-
If you use this
|
| 226 |
|
| 227 |
```bibtex
|
| 228 |
-
@misc{tatar-morph-
|
| 229 |
author = {Arabov, Mullosharaf Kurbonovich and TatarNLPWorld Contributors},
|
| 230 |
-
title = {
|
| 231 |
year = {2026},
|
| 232 |
publisher = {Hugging Face},
|
| 233 |
-
howpublished = {\url{https://huggingface.co/TatarNLPWorld/tatar-morph-
|
| 234 |
}
|
| 235 |
```
|
| 236 |
|
| 237 |
## 📄 License
|
| 238 |
|
| 239 |
-
|
| 240 |
|
| 241 |
## 🙏 Acknowledgments
|
| 242 |
|
| 243 |
- **Dataset**: [TatarNLPWorld/tatar-morphological-corpus](https://huggingface.co/datasets/TatarNLPWorld/tatar-morphological-corpus)
|
| 244 |
-
- **
|
| 245 |
-
- **Framework**:
|
| 246 |
-
- **Community**:
|
| 247 |
|
| 248 |
---
|
| 249 |
|
| 250 |
<div align="center">
|
| 251 |
|
| 252 |
-
**
|
| 253 |
|
| 254 |
*Brought to you by [TatarNLPWorld](https://huggingface.co/TatarNLPWorld)*
|
| 255 |
|
| 256 |
-
[Report Issue](https://huggingface.co/TatarNLPWorld/tatar-morph-
|
| 257 |
-
[Request Feature](https://huggingface.co/TatarNLPWorld/tatar-morph-
|
| 258 |
[Contact](mailto:arabov.mk@gmail.com)
|
| 259 |
|
| 260 |
</div>
|
|
|
|
| 1 |
---
|
| 2 |
+
title: Tatar Morphological Analyzer
|
| 3 |
emoji: 🔤
|
| 4 |
colorFrom: blue
|
| 5 |
colorTo: green
|
|
|
|
| 7 |
pinned: true
|
| 8 |
app_file: app.py
|
| 9 |
license: mit
|
| 10 |
+
short_description: Interactive demo for 5 Tatar morphological taggers
|
| 11 |
sdk_version: 1.55.0
|
| 12 |
---
|
| 13 |
|
| 14 |
+
# 🔤 Tatar Morphological Analyzer
|
| 15 |
|
| 16 |
<div align="center">
|
| 17 |
|
| 18 |
+
**Interactive exploration of five fine‑tuned models for Tatar morphological tagging**
|
| 19 |
|
| 20 |
+
*Compare mBERT, RuBERT, DistilBERT, XLM‑R, and Turkish BERT on your own sentences*
|
| 21 |
|
| 22 |
+
[](https://huggingface.co/spaces/TatarNLPWorld/tatar-morph-analyzer)
|
| 23 |
[](LICENSE)
|
| 24 |
+
[](#-available-models)
|
| 25 |
+
[](https://streamlit.io)
|
| 26 |
|
| 27 |
</div>
|
| 28 |
|
| 29 |
## 🌟 Overview
|
| 30 |
|
| 31 |
+
This Space provides a single interface to **five fine‑tuned transformer models** for morphological analysis of the Tatar language. Each model predicts a **full morphological tag** (part of speech, number, case, possession, and so on) for every token, a fundamental task for Tatar NLP.
|
| 32 |
|
| 33 |
+
Choose a model, type a sentence, and see the predicted tags together with confidence scores and a colour‑coded visualisation.
|
| 34 |
|
| 35 |
## 🚀 Key Features
|
| 36 |
|
| 37 |
+
### 🧩 Model Selection
|
| 38 |
+
- Switch between **5 different transformer models**:
|
| 39 |
+
- **mBERT** (best overall accuracy, 98.68%)
|
| 40 |
+
- **RuBERT** (second‑best accuracy, 98.13%; likely helped by the shared Cyrillic script and Russian loanwords in Tatar)
|
| 41 |
+
- **DistilBERT** (lightweight, fast)
|
| 42 |
+
- **XLM‑R** (powerful multilingual)
|
| 43 |
+
- **Turkish BERT** (related‑language baseline; lowest accuracy, 86.84%)
|
| 44 |
|
| 45 |
+
### 🔍 Interactive Analysis
|
| 46 |
+
- **Real‑time tagging**: Get token‑level morphological tags with confidence scores.
|
| 47 |
+
- **Visual badges**: Colour‑coded display of tags for quick scanning.
|
| 48 |
+
- **Example sentences**: Pre‑loaded examples for instant testing.
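This README does not define how the confidence score is computed. A common choice, assumed here, is the largest softmax probability over a token's tag logits. The sketch below uses made-up logits and a three-tag label list purely for illustration; `softmax` and `top_tag` are hypothetical helper names, not functions from `app.py`.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_tag(logits, labels):
    """Return the most probable tag and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]

# Illustrative values only; the real models score 1,181 tags per token.
labels = ["Adv", "N+Sg+Nom", "PUNCT"]
tag, confidence = top_tag([4.0, 1.0, 0.5], labels)
print(f"{tag}: {confidence:.3f}")
```

A score near 1.0 means the model puts almost all probability mass on one tag; it is not a calibrated guarantee of correctness.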
|
| 49 |
|
| 50 |
+
### 📊 Model Metrics
|
| 51 |
+
- For each model, you can see its **token accuracy**, **F1‑micro**, and **F1‑macro** directly in the sidebar.
|
| 52 |
|
| 53 |
## 🎮 Quick Start Examples
|
| 54 |
|
| 55 |
+
### Try These Sentences
|
| 56 |
|
| 57 |
+
| Language | Sentence |
|
| 58 |
+
|----------|----------|
|
| 59 |
+
| Tatar | Мин татарча сөйләшәм. |
|
| 60 |
+
| Tatar | Кичә мин дусларым белән паркка бардым. |
|
| 61 |
+
| Tatar | Татарстан – Россия Федерациясе составындагы республика. |
|
| 62 |
|
| 63 |
+
Just paste any sentence into the text box and click **Analyze**!
|
| 64 |
|
| 65 |
+
### Expected Output (for mBERT on the first sentence)
|
|
|
|
| 66 |
|
| 67 |
+
| Word | Morphological Tag | Confidence |
|
| 68 |
+
|-----------|-------------------------|------------|
|
| 69 |
+
| Мин | Pron+Sg+Nom+Pers(1) | 0.999 |
|
| 70 |
+
| татарча | Adv | 0.998 |
|
| 71 |
+
| сөйләшәм | V+Pres+1 | 0.997 |
|
| 72 |
+
| . | PUNCT | 1.000 |
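Transformer tokenizers split words into subword pieces, so per-piece predictions have to be mapped back to whole words before a word-level table like this one can be displayed. The sketch below assumes the first-subword strategy (each word takes the tag predicted for its first piece) and a hand-written `word_ids` list in the shape returned by a Hugging Face fast tokenizer's `word_ids()` method. The segmentation, the `"O"` placeholder for special tokens, and the helper name `align_to_words` are illustrative and not taken from `app.py`.

```python
def align_to_words(words, word_ids, token_tags):
    """Pick one tag per word: the tag of the word's first subword piece.

    word_ids[i] is the index of the word that token i belongs to,
    or None for special tokens such as [CLS] and [SEP].
    """
    result = []
    seen = set()
    for word_id, tag in zip(word_ids, token_tags):
        if word_id is None or word_id in seen:
            continue  # skip special tokens and non-initial pieces
        seen.add(word_id)
        result.append((words[word_id], tag))
    return result

words = ["Мин", "татарча", "сөйләшәм", "."]
# Hypothetical segmentation: [CLS] Мин татар ##ча сөй ##ләш ##әм . [SEP]
word_ids = [None, 0, 1, 1, 2, 2, 2, 3, None]
token_tags = ["O", "Pron+Sg+Nom+Pers(1)", "Adv", "Adv",
              "V+Pres+1", "V+Pres+1", "V+Pres+1", "PUNCT", "O"]

for word, tag in align_to_words(words, word_ids, token_tags):
    print(f"{word}: {tag}")
```

Other strategies exist, such as averaging the probabilities of all pieces of a word; which one the Space uses is not stated here.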
|
| 73 |
|
| 74 |
+
## 📈 Model Performance Comparison
|
| 75 |
|
| 76 |
+
| Model | Token Accuracy | F1‑micro | F1‑macro | Speed (sent./sec) |
|
| 77 |
+
|----------------|----------------|----------|----------|-------------------|
|
| 78 |
+
| **mBERT** | 98.68% | 98.68% | 50.94% | 150 |
|
| 79 |
+
| **RuBERT** | 98.13% | 98.13% | 47.37% | 150 |
|
| 80 |
+
| **DistilBERT** | 97.98% | 97.98% | 44.02% | 250 |
|
| 81 |
+
| **XLM‑R** | 97.67% | 97.67% | 40.61% | 120 |
|
| 82 |
+
| **Turkish BERT**| 86.84% | 86.84% | 33.34% | 150 |
|
| 83 |
|
| 84 |
+
> *Metrics are computed on a held‑out test set of 6,000 sentences (47k+ tokens). Full per‑POS accuracies are available in the `results/` folder of each model repository.*
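The large gap between F1‑micro and F1‑macro in the table (98.68% versus 50.94% for mBERT) is what one would expect from a tagset with many rare tags: micro-averaging pools all tokens, while macro-averaging gives every tag equal weight regardless of frequency. The toy example below, with invented labels and counts, shows the effect. It is a sketch for intuition and not the evaluation script used for these models.

```python
from collections import Counter

def f1_scores(gold, pred):
    """Return (micro_f1, macro_f1) for single-label token tagging."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted tag gets a false positive
            fn[g] += 1  # gold tag gets a false negative

    def f1(t, p_count, n_count):
        denom = 2 * t + p_count + n_count
        return 2 * t / denom if denom else 0.0

    labels = set(gold) | set(pred)
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    macro = sum(f1(tp[lab], fp[lab], fn[lab]) for lab in labels) / len(labels)
    return micro, macro

# 98 tokens of a frequent tag, all correct; 2 tokens of rare tags, both missed.
gold = ["N+Sg+Nom"] * 98 + ["RareA", "RareB"]
pred = ["N+Sg+Nom"] * 100

micro, macro = f1_scores(gold, pred)
print(f"micro={micro:.4f} macro={macro:.4f}")
```

Here two missed rare tags barely move the micro score but pull the macro score down to about a third, because two of the three tags have an F1 of zero.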
|
| 85 |
|
| 86 |
+
## 🏗️ Technical Architecture
|
|
|
|
| 87 |
|
| 88 |
+
### Models
|
| 89 |
|
| 90 |
+
All models are fine‑tuned from publicly available transformer checkpoints:
|
| 91 |
|
| 92 |
+
| Model | Base Checkpoint | Parameters |
|
| 93 |
+
|------------|-------------------------------------------------------|------------|
|
| 94 |
+
| mBERT | `bert-base-multilingual-cased` | ~180M |
|
| 95 |
+
| RuBERT | `DeepPavlov/rubert-base-cased` | ~180M |
|
| 96 |
+
| DistilBERT | `distilbert-base-multilingual-cased` | ~134M |
|
| 97 |
+
| XLM‑R      | `xlm-roberta-base`                                    | ~278M      |
|
| 98 |
+
| Turkish BERT| `dbmdz/bert-base-turkish-cased`                       | ~110M      |
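As a rough illustration of how one of these base checkpoints could be prepared for fine-tuning, the sketch below maps each model name to its checkpoint and attaches a token-classification head with one output per tag. `BASE_CHECKPOINTS`, `NUM_TAGS`, and `load_for_fine_tuning` are hypothetical names; the tag count of 1,181 comes from the tagset size stated in this README, and the actual training script may differ.

```python
# Base checkpoints as listed in the table above.
BASE_CHECKPOINTS = {
    "mBERT": "bert-base-multilingual-cased",
    "RuBERT": "DeepPavlov/rubert-base-cased",
    "DistilBERT": "distilbert-base-multilingual-cased",
    "XLM-R": "xlm-roberta-base",
    "Turkish BERT": "dbmdz/bert-base-turkish-cased",
}

NUM_TAGS = 1181  # unique morphological tags reported in this README

def load_for_fine_tuning(model_name):
    """Load a tokenizer and a model with a fresh classification head.

    transformers is imported lazily so the mapping above can be
    inspected without the library installed. Calling this function
    downloads the checkpoint from the Hugging Face Hub.
    """
    from transformers import AutoModelForTokenClassification, AutoTokenizer

    checkpoint = BASE_CHECKPOINTS[model_name]
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForTokenClassification.from_pretrained(
        checkpoint, num_labels=NUM_TAGS
    )
    return tokenizer, model

for name, checkpoint in BASE_CHECKPOINTS.items():
    print(f"{name}: {checkpoint}")
```

The classification head is randomly initialised at this point; it only becomes useful after training on the tagged corpus.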
|
| 99 |
|
| 100 |
+
### Training Data
|
| 101 |
|
| 102 |
+
- **Dataset**: [TatarNLPWorld/tatar-morphological-corpus](https://huggingface.co/datasets/TatarNLPWorld/tatar-morphological-corpus)
|
| 103 |
+
- **Training subset**: 60,000 sentences, shuffled with seed 42; 59,992 remain after empty sentences are filtered out
|
| 104 |
+
- **Split**: Train 47,993 / Validation 5,999 / Test 6,000 sentences
|
| 105 |
+
- **Tagset**: 1,181 unique morphological tags (e.g., `N+Sg+Nom`, `V+Past+3`, `PUNCT`)
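The reported split sizes are consistent with an 80/10/10 partition of the 59,992 filtered sentences. The sketch below reproduces that arithmetic. The ratios, the rounding down, and the choice to give the remainder to the test set are assumptions inferred from the numbers, and `split_indices` is a hypothetical helper rather than the project's preprocessing code. Shuffling with Python's `random` module will not reproduce the project's exact sentence assignment.

```python
import random

def split_indices(n, seed=42):
    """Shuffle range(n) and cut it into train/validation/test index lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_train = n * 8 // 10   # 80% for training, rounded down
    n_val = n // 10         # 10% for validation, rounded down
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # remainder goes to the test set
    return train, val, test

train, val, test = split_indices(59_992)
print(len(train), len(val), len(test))
```

Running this prints `47993 5999 6000`, matching the split sizes listed above.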
|
| 106 |
|
| 107 |
+
### Hyperparameters (common to all models)
|
| 108 |
|
| 109 |
+
| Parameter | Value |
|
| 110 |
+
|---------------------|----------------|
|
| 111 |
+
| Learning rate | 2e-5 |
|
| 112 |
+
| Optimizer | AdamW (wd=0.01)|
|
| 113 |
+
| Warmup steps | 500 |
|
| 114 |
+
| Number of epochs | 4 |
|
| 115 |
+
| Max sequence length | 128 |
|
| 116 |
+
| Mixed precision | FP16 |
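To make the warmup setting concrete, the sketch below computes the learning rate at a given optimizer step for a linear warmup over 500 steps to a peak of 2e-5. The batch size of 16 and the linear decay to zero after warmup are assumptions, since neither is stated in the table. With 47,993 training sentences and 4 epochs, these assumptions give 12,000 optimizer steps.

```python
import math

PEAK_LR = 2e-5            # from the table
WARMUP_STEPS = 500        # from the table
EPOCHS = 4                # from the table
TRAIN_SENTENCES = 47_993  # training split size
BATCH_SIZE = 16           # assumption: not stated in this README

STEPS_PER_EPOCH = math.ceil(TRAIN_SENTENCES / BATCH_SIZE)
TOTAL_STEPS = STEPS_PER_EPOCH * EPOCHS

def learning_rate(step):
    """Linear warmup to PEAK_LR, then linear decay to zero (assumed)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    remaining = max(0, TOTAL_STEPS - step)
    return PEAK_LR * remaining / (TOTAL_STEPS - WARMUP_STEPS)

for step in (0, 250, 500, TOTAL_STEPS // 2, TOTAL_STEPS):
    print(step, f"{learning_rate(step):.2e}")
```

Under these assumptions, warmup covers roughly the first 4% of training; a different batch size changes the total step count and therefore that share.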
|
| 117 |
|
| 118 |
+
### Training Hardware
|
| 119 |
|
| 120 |
+
- **GPU**: NVIDIA Tesla V100 (32GB)
|
| 121 |
+
- **Training time**: 4–8 hours per model
|
| 122 |
+
- **Inference speed**: Varies by model (see table above)
|
| 123 |
|
| 124 |
+
## 📦 Repository Structure
|
| 125 |
|
| 126 |
+
```
|
| 127 |
+
.
|
| 128 |
+
├── app.py # Main Streamlit application
|
| 129 |
+
├── requirements.txt # Python dependencies
|
| 130 |
+
├── .streamlit/
|
| 131 |
+
│ └── config.toml # Streamlit server config (port 7860)
|
| 132 |
+
├── results/ # (optional) Additional metrics and plots
|
| 133 |
+
└── README.md # This file
|
| 134 |
+
```
|
| 135 |
|
| 136 |
+
## 🚀 Local Deployment
|
| 137 |
|
| 138 |
+
```bash
|
| 139 |
+
# Clone the Space
|
| 140 |
+
git clone https://huggingface.co/spaces/TatarNLPWorld/tatar-morph-analyzer
|
| 141 |
+
cd tatar-morph-analyzer
|
| 142 |
|
| 143 |
+
# Install dependencies
|
| 144 |
+
pip install -r requirements.txt
|
| 145 |
|
| 146 |
+
# Run the app
|
| 147 |
+
streamlit run app.py --server.port 8501
|
| 148 |
+
```
|
| 149 |
|
| 150 |
+
The app will be available at `http://localhost:8501`.
|
| 151 |
|
| 152 |
## 📜 Citation
|
| 153 |
|
| 154 |
+
If you use this Space or any of the underlying models in your research, please cite the appropriate model (see each model card for BibTeX). For general attribution:
|
| 155 |
|
| 156 |
```bibtex
|
| 157 |
+
@misc{tatar-morph-analyzer,
|
| 158 |
author = {Arabov, Mullosharaf Kurbonovich and TatarNLPWorld Contributors},
|
| 159 |
+
title = {Tatar Morphological Analyzer – Interactive Demo},
|
| 160 |
year = {2026},
|
| 161 |
publisher = {Hugging Face},
|
| 162 |
+
howpublished = {\url{https://huggingface.co/spaces/TatarNLPWorld/tatar-morph-analyzer}}
|
| 163 |
}
|
| 164 |
```
|
| 165 |
|
| 166 |
## 📄 License
|
| 167 |
|
| 168 |
+
The code in this Space is released under the **MIT License**. Each model retains its own license (all are Apache 2.0 or MIT).
|
| 169 |
|
| 170 |
## 🙏 Acknowledgments
|
| 171 |
|
| 172 |
- **Dataset**: [TatarNLPWorld/tatar-morphological-corpus](https://huggingface.co/datasets/TatarNLPWorld/tatar-morphological-corpus)
|
| 173 |
+
- **Model checkpoints**: Hugging Face Hub
|
| 174 |
+
- **Framework**: Streamlit, Transformers, PyTorch
|
| 175 |
+
- **Community**: All contributors to TatarNLPWorld
|
| 176 |
|
| 177 |
---
|
| 178 |
|
| 179 |
<div align="center">
|
| 180 |
|
| 181 |
+
**Explore and advance Tatar language technology**
|
| 182 |
|
| 183 |
*Brought to you by [TatarNLPWorld](https://huggingface.co/TatarNLPWorld)*
|
| 184 |
|
| 185 |
+
[Report Issue](https://huggingface.co/spaces/TatarNLPWorld/tatar-morph-analyzer/discussions) •
|
| 186 |
+
[Request Feature](https://huggingface.co/spaces/TatarNLPWorld/tatar-morph-analyzer/discussions) •
|
| 187 |
[Contact](mailto:arabov.mk@gmail.com)
|
| 188 |
|
| 189 |
</div>
|