ArabovMK committed on
Commit
da457d3
·
verified ·
1 Parent(s): 9103de1

Update README.md

Files changed (1)
  1. README.md +106 -177
README.md CHANGED
@@ -1,5 +1,5 @@
1
  ---
2
- title: TatarMorphAnalyzer
3
  emoji: 🔤
4
  colorFrom: blue
5
  colorTo: green
@@ -7,254 +7,183 @@ sdk: streamlit
7
  pinned: true
8
  app_file: app.py
9
  license: mit
10
- short_description: Tatar Morphological Analyzer
11
  sdk_version: 1.55.0
12
  ---
13
 
14
- ---
15
- language:
16
- - tt
17
- - multilingual
18
- license: mit
19
- library_name: transformers
20
- tags:
21
- - tatar
22
- - morphology
23
- - token-classification
24
- - bert
25
- - multilingual
26
- - turkic-languages
27
- - seqeval
28
- datasets:
29
- - TatarNLPWorld/tatar-morphological-corpus
30
- metrics:
31
- - accuracy
32
- - f1
33
- - precision
34
- - recall
35
- widget:
36
- - text: "Мин татарча сөйләшәм"
37
- example_title: "Simple sentence"
38
- - text: "Кичә мин дусларым белән паркка бардым"
39
- example_title: "Complex sentence"
40
- - text: "Татарстан – Россия Федерациясе составындагы республика"
41
- example_title: "Definition"
42
- model-index:
43
- - name: tatar-morph-mbert
44
- results:
45
- - task:
46
- type: token-classification
47
- name: Morphological Analysis
48
- dataset:
49
- name: TatarNLPWorld/tatar-morphological-corpus
50
- type: TatarNLPWorld/tatar-morphological-corpus
51
- split: test
52
- revision: main
53
- metrics:
54
- - type: accuracy
55
- value: 0.9868
56
- name: Token Accuracy
57
- - type: f1
58
- value: 0.9868
59
- name: F1-micro
60
- - type: f1
61
- value: 0.5094
62
- name: F1-macro
63
- ---
64
-
65
- # 🔤 Tatar Morphological Analyzer (mBERT)
66
 
67
  <div align="center">
68
 
69
- **State‑of‑the‑art morphological tagging for the Tatar language**
70
 
71
- *Fine‑tuned multilingual BERT for token‑level prediction of full morphological tags*
72
 
73
- [![🤗 Hugging Face](https://img.shields.io/badge/🤗-Model%20Hub-blue)](https://huggingface.co/TatarNLPWorld/tatar-morph-mbert)
74
  [![License](https://img.shields.io/badge/License-MIT-green.svg)](LICENSE)
75
- [![Transformers](https://img.shields.io/badge/🤗-Transformers-FF6F00)](https://github.com/huggingface/transformers)
76
- [![Paper](https://img.shields.io/badge/Paper-LREC%202026-red)](https://example.com)
77
 
78
  </div>
79
 
80
  ## 🌟 Overview
81
 
82
- This model is a fine‑tuned version of **Multilingual BERT (mBERT)** for **morphological analysis of the Tatar language**. It performs token‑level prediction of full morphological tags (including part‑of‑speech, number, case, possession, etc.) — a crucial step for many downstream NLP tasks.
83
 
84
- Part of the [TatarNLPWorld](https://huggingface.co/TatarNLPWorld) ecosystem, this model achieves **near‑perfect accuracy** on the test set and is the best performer among our series.
85
 
86
  ## 🚀 Key Features
87
 
88
- ### 🔍 High‑Accuracy Tagging
89
- - Predicts **complete morphological tags** (e.g., `N+Sg+Nom`, `V+Past+3`, `PUNCT`) for every token.
90
- - Handles a rich tagset of **1,181 unique morphological combinations**.
91
-
92
- ### 🌐 Multilingual Transfer
93
- - Leverages the power of **mBERT** (trained on 104 languages) to achieve excellent performance on Tatar with limited fine‑tuning data.
94
-
95
- ### 📦 Easy Integration
96
- - Ready‑to‑use via Hugging Face `transformers` pipeline.
97
- - Compatible with `token-classification` and `ner` pipelines.
98
 
99
- ## 📈 Performance Metrics
100
 
101
- | Metric | Value | 95% Confidence Interval |
102
- |----------------------|------------|-----------------------------|
103
- | **Token Accuracy** | **98.68%** | [98.58%, 98.78%] |
104
- | **F1 (micro)** | **98.68%** | [98.58%, 98.78%] |
105
- | **F1 (macro)** | **50.94%** | [48.73%, 53.15%] |
106
- | Precision (micro) | 98.68% | — |
107
- | Recall (micro) | 98.68% | — |
108
-
109
- ### Accuracy by Part‑of‑Speech (Top 5 Frequent POS)
110
-
111
- | POS | Accuracy |
112
- |-------|----------|
113
- | PUNCT | 100.00% |
114
- | NOUN | 98.75% |
115
- | VERB | 98.12% |
116
- | ADP | 99.65% |
117
- | ADJ | 97.50% |
118
-
119
- > *Full POS breakdown is available in the [`results/`](results/) folder of this repository.*
120
 
121
  ## 🎮 Quick Start Examples
122
 
123
- ### Using the Pipeline (Easiest)
124
 
125
- ```python
126
- from transformers import pipeline
127
 
128
- pipe = pipeline(
129
- "token-classification",
130
- model="TatarNLPWorld/tatar-morph-mbert",
131
- aggregation_strategy="simple"
132
- )
133
 
134
- sentence = "Мин татарча сөйләшәм."
135
- results = pipe(sentence)
136
 
137
- for r in results:
138
- print(f"{r['word']}: {r['entity']}")
139
- ```
140
 
141
- **Output:**
142
- ```
143
- Мин: Pron+Sg+Nom+Pers(1)
144
- татарча: Adv
145
- сөйләшәм: V+Pres+1
146
- .: PUNCT
147
- ```
148
 
149
- ### Manual Inference (with tokenizer)
150
 
151
- ```python
152
- from transformers import AutoTokenizer, AutoModelForTokenClassification
153
- import torch
154
 
155
- tokenizer = AutoTokenizer.from_pretrained("TatarNLPWorld/tatar-morph-mbert")
156
- model = AutoModelForTokenClassification.from_pretrained("TatarNLPWorld/tatar-morph-mbert")
157
 
158
- sentence = "Кичә мин дусларым белән паркка бардым."
159
- inputs = tokenizer(sentence, return_tensors="pt", is_split_into_words=False)
160
- with torch.no_grad():
161
- outputs = model(**inputs).logits
162
 
163
- predictions = torch.argmax(outputs, dim=2)
164
- tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
165
- word_ids = inputs.word_ids()
166
 
167
- prev_word = None
168
- for token, pred, word_id in zip(tokens, predictions[0], word_ids):
169
- if word_id is not None and word_id != prev_word:
170
- tag = model.config.id2label[pred.item()]
171
- print(f"{token}: {tag}")
172
- prev_word = word_id
173
- ```
174
 
175
- ## 🏗️ Technical Architecture
176
 
177
- ### Model Details
178
 
179
- - **Base model**: [`bert-base-multilingual-cased`](https://huggingface.co/bert-base-multilingual-cased) (12 layers, 768 hidden size, 12 heads, ~180M params)
180
- - **Fine‑tuning task**: Token classification with a linear head
181
- - **Tagset size**: 1,181 unique morphological tags (e.g., `N+Sg+Nom`, `V+Past+3`, `PUNCT`)
182
 
183
- ### Training Data
184
 
185
- - **Dataset**: [TatarNLPWorld/tatar-morphological-corpus](https://huggingface.co/datasets/TatarNLPWorld/tatar-morphological-corpus)
186
- - **Training subset**: 60,000 sentences (shuffled with seed 42, filtered empty sentences → 59,992)
187
- - **Split**: Train 47,993 / Validation 5,999 / Test 6,000 sentences
188
- - **Tagset**: extracted from the dataset (all unique tag sequences)
189
 
190
- ### Training Procedure
191
 
192
- #### Hyperparameters
193
 
194
- | Parameter | Value |
195
- |-------------------------|----------------|
196
- | Batch size (effective) | 32 |
197
- | Learning rate | 2e-5 |
198
- | Optimizer | AdamW (wd=0.01)|
199
- | Warmup steps | 500 |
200
- | Number of epochs | 4 |
201
- | Max sequence length | 128 |
202
- | Mixed precision | FP16 |
203
 
204
- #### Training Time & Resources
205
 
206
- - **Hardware**: 1× NVIDIA Tesla V100 (32GB)
207
- - **Training time**: ~6.5 hours
208
- - **Model size**: ~680 MB (PyTorch checkpoint)
209
- - **Inference speed**: ~150 sentences/sec on V100
210
 
211
- ## 📊 Dataset Details
212
 
213
- The model was trained on the [Tatar Morphological Corpus](https://huggingface.co/datasets/TatarNLPWorld/tatar-morphological-corpus), which contains manually annotated morphological analyses.
214
 
215
- | Property | Value |
216
- |------------------------|------------|
217
- | Total sentences | 59,992 |
218
- | Unique tags | 1,181 |
219
- | Avg. sentence length | 8.0 tokens |
220
- | Median sentence length | 6 tokens |
221
- | Language | Tatar (tt) |
222
 
223
  ## 📜 Citation
224
 
225
- If you use this model in your research, please cite:
226
 
227
  ```bibtex
228
- @misc{tatar-morph-mbert,
229
  author = {Arabov, Mullosharaf Kurbonovich and TatarNLPWorld Contributors},
230
- title = {Multilingual BERT for Tatar Morphological Analysis},
231
  year = {2026},
232
  publisher = {Hugging Face},
233
- howpublished = {\url{https://huggingface.co/TatarNLPWorld/tatar-morph-mbert}}
234
  }
235
  ```
236
 
237
  ## 📄 License
238
 
239
- This model is released under the **MIT License**. You are free to use, modify, and distribute it for any purpose, with proper attribution.
240
 
241
  ## 🙏 Acknowledgments
242
 
243
  - **Dataset**: [TatarNLPWorld/tatar-morphological-corpus](https://huggingface.co/datasets/TatarNLPWorld/tatar-morphological-corpus)
244
- - **Base model**: [bert-base-multilingual-cased](https://huggingface.co/bert-base-multilingual-cased)
245
- - **Framework**: Hugging Face Transformers
246
- - **Community**: Tatar language speakers and NLP researchers
247
 
248
  ---
249
 
250
  <div align="center">
251
 
252
- **Empowering Tatar Language Technology**
253
 
254
  *Brought to you by [TatarNLPWorld](https://huggingface.co/TatarNLPWorld)*
255
 
256
- [Report Issue](https://huggingface.co/TatarNLPWorld/tatar-morph-mbert/discussions) •
257
- [Request Feature](https://huggingface.co/TatarNLPWorld/tatar-morph-mbert/discussions) •
258
  [Contact](mailto:arabov.mk@gmail.com)
259
 
260
  </div>
 
1
  ---
2
+ title: Tatar Morphological Analyzer
3
  emoji: 🔤
4
  colorFrom: blue
5
  colorTo: green
 
7
  pinned: true
8
  app_file: app.py
9
  license: mit
10
+ short_description: Interactive demo for 5 state-of-the-art Tatar tagger models
11
  sdk_version: 1.55.0
12
  ---
13
 
14
+ # 🔤 Tatar Morphological Analyzer
15
 
16
  <div align="center">
17
 
18
+ **Interactive exploration of five fine‑tuned models for Tatar morphological tagging**
19
 
20
+ *Compare mBERT, RuBERT, DistilBERT, XLM‑R, and Turkish BERT on your own sentences*
21
 
22
+ [![🤗 Hugging Face](https://img.shields.io/badge/🤗-Open%20in%20Spaces-blue)](https://huggingface.co/spaces/TatarNLPWorld/tatar-morph-analyzer)
23
  [![License](https://img.shields.io/badge/License-MIT-green.svg)](LICENSE)
24
+ [![Models](https://img.shields.io/badge/5-Models-orange)](#-available-models)
25
+ [![Streamlit](https://img.shields.io/badge/Interface-Streamlit-FF4B4B)](https://streamlit.io)
26
 
27
  </div>
28
 
29
  ## 🌟 Overview
30
 
31
+ This Space provides a unified interface to **five state‑of‑the‑art models** for morphological analysis of the Tatar language. Each model predicts **full morphological tags** (part‑of‑speech, number, case, possession, etc.) at the token level — a fundamental task for Tatar NLP.
32
 
33
+ Choose a model, type a sentence, and instantly see the predicted tags, along with confidence scores and a colour‑coded visualisation.
34
 
35
  ## 🚀 Key Features
36
 
37
+ ### 🧩 Model Selection
38
+ - Switch between **5 different transformer models**:
39
+   - **mBERT** (best overall accuracy)
40
+   - **RuBERT** (excellent due to Russian proximity)
41
+   - **DistilBERT** (lightweight, fast)
42
+   - **XLM‑R** (powerful multilingual)
43
+   - **Turkish BERT** (good baseline)
44
 
45
+ ### 🔍 Interactive Analysis
46
+ - **Real‑time tagging**: Get token‑level morphological tags with confidence scores.
47
+ - **Visual badges**: Colour‑coded display of tags for quick scanning.
48
+ - **Example sentences**: Pre‑loaded examples for instant testing.
49
 
50
+ ### 📊 Model Metrics
51
+ - For each model, you can see its **token accuracy**, **F1‑micro**, and **F1‑macro** directly in the sidebar.
52
 
53
  ## 🎮 Quick Start Examples
54
 
55
+ ### Try These Sentences
56
 
57
+ | Language | Sentence |
58
+ |----------|----------|
59
+ | Tatar | Мин татарча сөйләшәм. |
60
+ | Tatar | Кичә мин дусларым белән паркка бардым. |
61
+ | Tatar | Татарстан – Россия Федерациясе составындагы республика. |
62
 
63
+ Just paste any sentence into the text box and click **Analyze**!
64
 
65
+ ### Expected Output (for mBERT on the first sentence)
66
 
67
+ | Word | Morphological Tag | Confidence |
68
+ |-----------|-------------------------|------------|
69
+ | Мин | Pron+Sg+Nom+Pers(1) | 0.999 |
70
+ | татарча | Adv | 0.998 |
71
+ | сөйләшәм | V+Pres+1 | 0.997 |
72
+ | . | PUNCT | 1.000 |
73
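The word‑level rows above are recovered from subword predictions. Below is a minimal sketch of that aggregation step, assuming WordPiece‑style subwords and a first‑subword tagging convention — the tokenizer split and the `aggregate_first_subword` helper are hypothetical, for illustration only:

```python
# Minimal sketch of word-level tag aggregation over subword predictions.
# The subword split and helper name below are hypothetical illustrations.
def aggregate_first_subword(tokens, tags, word_ids):
    """Keep the tag predicted for the first subword of each word."""
    words, word_tags = [], []
    prev = None
    for tok, tag, wid in zip(tokens, tags, word_ids):
        if wid is None:          # special tokens ([CLS], [SEP]) map to no word
            continue
        if wid != prev:          # first subword of a new word: keep its tag
            words.append(tok.lstrip("#"))
            word_tags.append(tag)
        else:                    # continuation subword: extend the surface form
            words[-1] += tok.lstrip("#")
        prev = wid
    return list(zip(words, word_tags))

# Hypothetical subword split of "Мин татарча сөйләшәм ."
tokens   = ["[CLS]", "Мин", "татар", "##ча", "сөйләшәм", ".", "[SEP]"]
tags     = ["O", "Pron+Sg+Nom+Pers(1)", "Adv", "Adv", "V+Pres+1", "PUNCT", "O"]
word_ids = [None, 0, 1, 1, 2, 3, None]

print(aggregate_first_subword(tokens, tags, word_ids))
# → [('Мин', 'Pron+Sg+Nom+Pers(1)'), ('татарча', 'Adv'), ('сөйләшәм', 'V+Pres+1'), ('.', 'PUNCT')]
```

Continuation pieces only extend the surface form; their predicted tags are discarded, which matches the usual first‑subword convention in token‑classification pipelines.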
 
74
+ ## 📈 Model Performance Comparison
75
 
76
+ | Model | Token Accuracy | F1‑micro | F1‑macro | Speed (sent./sec) |
77
+ |----------------|----------------|----------|----------|-------------------|
78
+ | **mBERT** | 98.68% | 98.68% | 50.94% | 150 |
79
+ | **RuBERT** | 98.13% | 98.13% | 47.37% | 150 |
80
+ | **DistilBERT** | 97.98% | 97.98% | 44.02% | 250 |
81
+ | **XLM‑R** | 97.67% | 97.67% | 40.61% | 120 |
82
+ | **Turkish BERT** | 86.84% | 86.84% | 33.34% | 150 |
83
 
84
+ > *Metrics are computed on a held‑out test set of 6,000 sentences (47k+ tokens). Full per‑POS accuracies are available in the `results/` folder of each model repository.*
85
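The wide gap between micro and macro F1 in the table is expected with a 1,181‑tag inventory: micro‑F1 equals token accuracy when every token receives exactly one tag, while macro‑F1 averages per‑tag F1 scores, so rare tags that are never predicted drag it down. A toy illustration (synthetic data, not the real corpus):

```python
from collections import Counter

def per_tag_f1(gold, pred):
    """Per-tag F1 from parallel gold/predicted tag sequences."""
    tags = set(gold) | set(pred)
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1
            fn[g] += 1
    f1 = {}
    for t in tags:
        prec = tp[t] / (tp[t] + fp[t]) if tp[t] + fp[t] else 0.0
        rec  = tp[t] / (tp[t] + fn[t]) if tp[t] + fn[t] else 0.0
        f1[t] = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return f1

# Toy data: one frequent tag predicted well, one rare tag always missed.
gold = ["N+Sg+Nom"] * 98 + ["V+Past+3"] * 2
pred = ["N+Sg+Nom"] * 100

f1 = per_tag_f1(gold, pred)
micro = sum(g == p for g, p in zip(gold, pred)) / len(gold)  # == token accuracy
macro = sum(f1.values()) / len(f1)
print(f"micro={micro:.2f} macro={macro:.2f}")
# → micro=0.98 macro=0.49
```

With one frequent tag predicted almost perfectly and one rare tag always missed, micro stays at 0.98 while macro collapses to ≈0.49 — the same mechanism behind the 98.68% vs 50.94% spread above.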
 
86
+ ## 🏗️ Technical Architecture
87
 
88
+ ### Models
89
 
90
+ All models are fine‑tuned from popular transformer checkpoints:
91
 
92
+ | Model | Base Checkpoint | Parameters |
93
+ |--------------|--------------------------------------|------------|
94
+ | mBERT | `bert-base-multilingual-cased` | ~180M |
95
+ | RuBERT | `DeepPavlov/rubert-base-cased` | ~180M |
96
+ | DistilBERT | `distilbert-base-multilingual-cased` | ~134M |
97
+ | XLM‑R | `xlm-roberta-base` | ~270M |
98
+ | Turkish BERT | `dbmdz/bert-base-turkish-cased` | ~110M |
99
 
100
+ ### Training Data
101
 
102
+ - **Dataset**: [TatarNLPWorld/tatar-morphological-corpus](https://huggingface.co/datasets/TatarNLPWorld/tatar-morphological-corpus)
103
+ - **Training subset**: 60,000 sentences (shuffled seed 42, filtered → 59,992)
104
+ - **Split**: Train 47,993 / Validation 5,999 / Test 6,000
105
+ - **Tagset**: 1,181 unique morphological tags (e.g., `N+Sg+Nom`, `V+Past+3`, `PUNCT`)
106
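The split sizes quoted above are consistent with a plain 80/10/10 split of the 59,992 filtered sentences. A quick arithmetic check — the truncating rounding scheme is an assumption, since the actual split code is not shown in this Space:

```python
# 80/10/10 split of the filtered corpus; int() truncation is assumed.
total = 59_992
train = int(total * 0.8)     # 47,993
val   = int(total * 0.1)     # 5,999
test  = total - train - val  # 6,000 (remainder)
print(train, val, test)
# → 47993 5999 6000
```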
 
107
+ ### Hyperparameters (common to all models)
108
 
109
+ | Parameter | Value |
110
+ |---------------------|----------------|
111
+ | Learning rate | 2e-5 |
112
+ | Optimizer | AdamW (wd=0.01)|
113
+ | Warmup steps | 500 |
114
+ | Number of epochs | 4 |
115
+ | Max sequence length | 128 |
116
+ | Mixed precision | FP16 |
117
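For reference, the table above maps onto Hugging Face `TrainingArguments` roughly as follows — a hypothetical config sketch, since the actual training script is not part of this Space (the output path is made up, and the 128‑token limit is applied at tokenization time rather than here):

```python
from transformers import TrainingArguments

# Hypothetical reconstruction of the hyperparameter table above.
args = TrainingArguments(
    output_dir="tatar-morph-checkpoints",  # hypothetical path
    per_device_train_batch_size=32,        # effective batch size 32 on one GPU
    learning_rate=2e-5,
    weight_decay=0.01,                     # AdamW is the default optimizer
    warmup_steps=500,
    num_train_epochs=4,
    fp16=True,                             # mixed precision
)

# The max sequence length (128) belongs in the tokenizer call, e.g.:
# tokenizer(sentences, truncation=True, max_length=128)
```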
 
118
+ ### Training Hardware
119
 
120
+ - **GPU**: NVIDIA Tesla V100 (32GB)
121
+ - **Training time**: 4–8 hours per model
122
+ - **Inference speed**: Varies by model (see table above)
123
 
124
+ ## 📦 Repository Structure
125
 
126
+ ```
127
+ .
128
+ ├── app.py # Main Streamlit application
129
+ ├── requirements.txt # Python dependencies
130
+ ├── .streamlit/
131
+ │   └── config.toml # Streamlit server config (port 7860)
132
+ ├── results/ # (optional) Additional metrics and plots
133
+ └── README.md # This file
134
+ ```
135
 
136
+ ## 🚀 Local Deployment
137
 
138
+ ```bash
139
+ # Clone the Space
140
+ git clone https://huggingface.co/spaces/TatarNLPWorld/tatar-morph-analyzer
141
+ cd tatar-morph-analyzer
142
 
143
+ # Install dependencies
144
+ pip install -r requirements.txt
145
 
146
+ # Run the app
147
+ streamlit run app.py --server.port 8501
148
+ ```
149
 
150
+ The app will be available at `http://localhost:8501`.
151
 
152
  ## 📜 Citation
153
 
154
+ If you use this Space or any of the underlying models in your research, please cite the appropriate model (see each model card for BibTeX). For general attribution:
155
 
156
  ```bibtex
157
+ @misc{tatar-morph-analyzer,
158
  author = {Arabov, Mullosharaf Kurbonovich and TatarNLPWorld Contributors},
159
+ title = {Tatar Morphological Analyzer Interactive Demo},
160
  year = {2026},
161
  publisher = {Hugging Face},
162
+ howpublished = {\url{https://huggingface.co/spaces/TatarNLPWorld/tatar-morph-analyzer}}
163
  }
164
  ```
165
 
166
  ## 📄 License
167
 
168
+ The code in this Space is released under the **MIT License**. Each model retains its own license (all are Apache 2.0 or MIT).
169
 
170
  ## 🙏 Acknowledgments
171
 
172
  - **Dataset**: [TatarNLPWorld/tatar-morphological-corpus](https://huggingface.co/datasets/TatarNLPWorld/tatar-morphological-corpus)
173
+ - **Model checkpoints**: Hugging Face Hub
174
+ - **Framework**: Streamlit, Transformers, PyTorch
175
+ - **Community**: All contributors to TatarNLPWorld
176
 
177
  ---
178
 
179
  <div align="center">
180
 
181
+ **Explore and advance Tatar language technology**
182
 
183
  *Brought to you by [TatarNLPWorld](https://huggingface.co/TatarNLPWorld)*
184
 
185
+ [Report Issue](https://huggingface.co/spaces/TatarNLPWorld/tatar-morph-analyzer/discussions) •
186
+ [Request Feature](https://huggingface.co/spaces/TatarNLPWorld/tatar-morph-analyzer/discussions) •
187
  [Contact](mailto:arabov.mk@gmail.com)
188
 
189
  </div>