ArabovMK committed on
Commit
da457d3
·
verified ·
1 Parent(s): 9103de1

Update README.md

Files changed (1)
  1. README.md +106 -177
README.md CHANGED
@@ -1,5 +1,5 @@
1
  ---
2
- title: TatarMorphAnalyzer
3
  emoji: 🔤
4
  colorFrom: blue
5
  colorTo: green
@@ -7,254 +7,183 @@ sdk: streamlit
7
  pinned: true
8
  app_file: app.py
9
  license: mit
10
- short_description: Tatar Morphological Analyzer
11
  sdk_version: 1.55.0
12
  ---
13
 
14
- ---
15
- language:
16
- - tt
17
- - multilingual
18
- license: mit
19
- library_name: transformers
20
- tags:
21
- - tatar
22
- - morphology
23
- - token-classification
24
- - bert
25
- - multilingual
26
- - turkic-languages
27
- - seqeval
28
- datasets:
29
- - TatarNLPWorld/tatar-morphological-corpus
30
- metrics:
31
- - accuracy
32
- - f1
33
- - precision
34
- - recall
35
- widget:
36
- - text: "Мин татарча сөйләшәм"
37
- example_title: "Simple sentence"
38
- - text: "Кичә мин дусларым белән паркка бардым"
39
- example_title: "Complex sentence"
40
- - text: "Татарстан – Россия Федерациясе составындагы республика"
41
- example_title: "Definition"
42
- model-index:
43
- - name: tatar-morph-mbert
44
- results:
45
- - task:
46
- type: token-classification
47
- name: Morphological Analysis
48
- dataset:
49
- name: TatarNLPWorld/tatar-morphological-corpus
50
- type: TatarNLPWorld/tatar-morphological-corpus
51
- split: test
52
- revision: main
53
- metrics:
54
- - type: accuracy
55
- value: 0.9868
56
- name: Token Accuracy
57
- - type: f1
58
- value: 0.9868
59
- name: F1-micro
60
- - type: f1
61
- value: 0.5094
62
- name: F1-macro
63
- ---
64
-
65
- # 🔤 Tatar Morphological Analyzer (mBERT)
66
 
67
  <div align="center">
68
 
69
- **State‑of‑the‑art morphological tagging for the Tatar language**
70
 
71
- *Fine‑tuned multilingual BERT for token‑level prediction of full morphological tags*
72
 
73
- [![🤗 Hugging Face](https://img.shields.io/badge/🤗-Model%20Hub-blue)](https://huggingface.co/TatarNLPWorld/tatar-morph-mbert)
74
  [![License](https://img.shields.io/badge/License-MIT-green.svg)](LICENSE)
75
- [![Transformers](https://img.shields.io/badge/🤗-Transformers-FF6F00)](https://github.com/huggingface/transformers)
76
- [![Paper](https://img.shields.io/badge/Paper-LREC%202026-red)](https://example.com)
77
 
78
  </div>
79
 
80
  ## 🌟 Overview
81
 
82
- This model is a fine‑tuned version of **Multilingual BERT (mBERT)** for **morphological analysis of the Tatar language**. It performs token‑level prediction of full morphological tags (including part‑of‑speech, number, case, possession, etc.) — a crucial step for many downstream NLP tasks.
83
 
84
- Part of the [TatarNLPWorld](https://huggingface.co/TatarNLPWorld) ecosystem, this model achieves **near‑perfect accuracy** on the test set and is the best performer among our series.
85
 
86
  ## 🚀 Key Features
87
 
88
- ### 🔍 High‑Accuracy Tagging
89
- - Predicts **complete morphological tags** (e.g., `N+Sg+Nom`, `V+Past+3`, `PUNCT`) for every token.
90
- - Handles a rich tagset of **1,181 unique morphological combinations**.
91
-
92
- ### 🌐 Multilingual Transfer
93
- - Leverages the power of **mBERT** (trained on 104 languages) to achieve excellent performance on Tatar with limited fine‑tuning data.
94
-
95
- ### 📦 Easy Integration
96
- - Ready‑to‑use via Hugging Face `transformers` pipeline.
97
- - Compatible with `token-classification` and `ner` pipelines.
98
 
99
- ## 📈 Performance Metrics
100
 
101
- | Metric | Value | 95% Confidence Interval |
102
- |----------------------|------------|-----------------------------|
103
- | **Token Accuracy** | **98.68%** | [98.58%, 98.78%] |
104
- | **F1 (micro)** | **98.68%** | [98.58%, 98.78%] |
105
- | **F1 (macro)** | **50.94%** | [48.73%, 53.15%] |
106
- | Precision (micro) | 98.68% | — |
107
- | Recall (micro) | 98.68% | — |
108
-
109
- ### Accuracy by Part‑of‑Speech (Top 5 Frequent POS)
110
-
111
- | POS | Accuracy |
112
- |-------|----------|
113
- | PUNCT | 100.00% |
114
- | NOUN | 98.75% |
115
- | VERB | 98.12% |
116
- | ADP | 99.65% |
117
- | ADJ | 97.50% |
118
-
119
- > *Full POS breakdown is available in the [`results/`](results/) folder of this repository.*
120
 
121
  ## 🎮 Quick Start Examples
122
 
123
- ### Using the Pipeline (Easiest)
124
 
125
- ```python
126
- from transformers import pipeline
127
 
128
- pipe = pipeline(
129
- "token-classification",
130
- model="TatarNLPWorld/tatar-morph-mbert",
131
- aggregation_strategy="simple"
132
- )
133
 
134
- sentence = "Мин татарча сөйләшәм."
135
- results = pipe(sentence)
136
 
137
- for r in results:
138
- print(f"{r['word']}: {r['entity']}")
139
- ```
140
 
141
- **Output:**
142
- ```
143
- Мин: Pron+Sg+Nom+Pers(1)
144
- татарча: Adv
145
- сөйләшәм: V+Pres+1
146
- .: PUNCT
147
- ```
148
 
149
- ### Manual Inference (with tokenizer)
150
 
151
- ```python
152
- from transformers import AutoTokenizer, AutoModelForTokenClassification
153
- import torch
154
 
155
- tokenizer = AutoTokenizer.from_pretrained("TatarNLPWorld/tatar-morph-mbert")
156
- model = AutoModelForTokenClassification.from_pretrained("TatarNLPWorld/tatar-morph-mbert")
157
 
158
- sentence = "Кичә мин дусларым белән паркка бардым."
159
- inputs = tokenizer(sentence, return_tensors="pt", is_split_into_words=False)
160
- with torch.no_grad():
161
- outputs = model(**inputs).logits
162
 
163
- predictions = torch.argmax(outputs, dim=2)
164
- tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
165
- word_ids = inputs.word_ids()
166
 
167
- prev_word = None
168
- for token, pred, word_id in zip(tokens, predictions[0], word_ids):
169
- if word_id is not None and word_id != prev_word:
170
- tag = model.config.id2label[pred.item()]
171
- print(f"{token}: {tag}")
172
- prev_word = word_id
173
- ```
174
 
175
- ## 🏗️ Technical Architecture
176
 
177
- ### Model Details
178
 
179
- - **Base model**: [`bert-base-multilingual-cased`](https://huggingface.co/bert-base-multilingual-cased) (12 layers, 768 hidden size, 12 heads, ~180M params)
180
- - **Fine‑tuning task**: Token classification with a linear head
181
- - **Tagset size**: 1,181 unique morphological tags (e.g., `N+Sg+Nom`, `V+Past+3`, `PUNCT`)
182
 
183
- ### Training Data
184
 
185
- - **Dataset**: [TatarNLPWorld/tatar-morphological-corpus](https://huggingface.co/datasets/TatarNLPWorld/tatar-morphological-corpus)
186
- - **Training subset**: 60,000 sentences (shuffled with seed 42, filtered empty sentences → 59,992)
187
- - **Split**: Train 47,993 / Validation 5,999 / Test 6,000 sentences
188
- - **Tagset**: extracted from the dataset (all unique tag sequences)
189
 
190
- ### Training Procedure
191
 
192
- #### Hyperparameters
193
 
194
- | Parameter | Value |
195
- |-------------------------|----------------|
196
- | Batch size (effective) | 32 |
197
- | Learning rate | 2e-5 |
198
- | Optimizer | AdamW (wd=0.01)|
199
- | Warmup steps | 500 |
200
- | Number of epochs | 4 |
201
- | Max sequence length | 128 |
202
- | Mixed precision | FP16 |
203
 
204
- #### Training Time & Resources
205
 
206
- - **Hardware**: 1× NVIDIA Tesla V100 (32GB)
207
- - **Training time**: ~6.5 hours
208
- - **Model size**: ~680 MB (PyTorch checkpoint)
209
- - **Inference speed**: ~150 sentences/sec on V100
210
 
211
- ## 📊 Dataset Details
212
 
213
- The model was trained on the [Tatar Morphological Corpus](https://huggingface.co/datasets/TatarNLPWorld/tatar-morphological-corpus), which contains manually annotated morphological analyses.
214
 
215
- | Property | Value |
216
- |------------------------|------------|
217
- | Total sentences | 59,992 |
218
- | Unique tags | 1,181 |
219
- | Avg. sentence length | 8.0 tokens |
220
- | Median sentence length | 6 tokens |
221
- | Language | Tatar (tt) |
222
 
223
  ## 📜 Citation
224
 
225
- If you use this model in your research, please cite:
226
 
227
  ```bibtex
228
- @misc{tatar-morph-mbert,
229
  author = {Arabov, Mullosharaf Kurbonovich and TatarNLPWorld Contributors},
230
- title = {Multilingual BERT for Tatar Morphological Analysis},
231
  year = {2026},
232
  publisher = {Hugging Face},
233
- howpublished = {\url{https://huggingface.co/TatarNLPWorld/tatar-morph-mbert}}
234
  }
235
  ```
236
 
237
  ## 📄 License
238
 
239
- This model is released under the **MIT License**. You are free to use, modify, and distribute it for any purpose, with proper attribution.
240
 
241
  ## 🙏 Acknowledgments
242
 
243
  - **Dataset**: [TatarNLPWorld/tatar-morphological-corpus](https://huggingface.co/datasets/TatarNLPWorld/tatar-morphological-corpus)
244
- - **Base model**: [bert-base-multilingual-cased](https://huggingface.co/bert-base-multilingual-cased)
245
- - **Framework**: Hugging Face Transformers
246
- - **Community**: Tatar language speakers and NLP researchers
247
 
248
  ---
249
 
250
  <div align="center">
251
 
252
- **Empowering Tatar Language Technology**
253
 
254
  *Brought to you by [TatarNLPWorld](https://huggingface.co/TatarNLPWorld)*
255
 
256
- [Report Issue](https://huggingface.co/TatarNLPWorld/tatar-morph-mbert/discussions) •
257
- [Request Feature](https://huggingface.co/TatarNLPWorld/tatar-morph-mbert/discussions) •
258
  [Contact](mailto:arabov.mk@gmail.com)
259
 
260
  </div>
 
1
  ---
2
+ title: Tatar Morphological Analyzer
3
  emoji: 🔤
4
  colorFrom: blue
5
  colorTo: green
 
7
  pinned: true
8
  app_file: app.py
9
  license: mit
10
+ short_description: Interactive demo for 5 state-of-the-art Tatar tagger models
11
  sdk_version: 1.55.0
12
  ---
13
 
14
+ # 🔤 Tatar Morphological Analyzer
15
 
16
  <div align="center">
17
 
18
+ **Interactive exploration of five fine‑tuned models for Tatar morphological tagging**
19
 
20
+ *Compare mBERT, RuBERT, DistilBERT, XLM‑R, and Turkish BERT on your own sentences*
21
 
22
+ [![🤗 Hugging Face](https://img.shields.io/badge/🤗-Open%20in%20Spaces-blue)](https://huggingface.co/spaces/TatarNLPWorld/tatar-morph-analyzer)
23
  [![License](https://img.shields.io/badge/License-MIT-green.svg)](LICENSE)
24
+ [![Models](https://img.shields.io/badge/5-Models-orange)](#-available-models)
25
+ [![Streamlit](https://img.shields.io/badge/Interface-Streamlit-FF4B4B)](https://streamlit.io)
26
 
27
  </div>
28
 
29
  ## 🌟 Overview
30
 
31
+ This Space provides a unified interface to **five state‑of‑the‑art models** for morphological analysis of the Tatar language. Each model predicts **full morphological tags** (part‑of‑speech, number, case, possession, etc.) at the token level — a fundamental task for Tatar NLP.
32
 
33
+ Choose a model, type a sentence, and instantly see the predicted tags, along with confidence scores and a colour‑coded visualisation.
34
 
35
  ## 🚀 Key Features
36
 
37
+ ### 🧩 Model Selection
38
+ - Switch between **5 different transformer models**:
39
+   - **mBERT** (best overall accuracy)
40
+   - **RuBERT** (excellent due to Russian proximity)
41
+   - **DistilBERT** (lightweight, fast)
42
+   - **XLM‑R** (powerful multilingual)
43
+   - **Turkish BERT** (good baseline)
44
 
45
+ ### 🔍 Interactive Analysis
46
+ - **Real‑time tagging**: Get token‑level morphological tags with confidence scores.
47
+ - **Visual badges**: Colour‑coded display of tags for quick scanning.
48
+ - **Example sentences**: Pre‑loaded examples for instant testing.
49
 
50
+ ### 📊 Model Metrics
51
+ - For each model, you can see its **token accuracy**, **F1‑micro**, and **F1‑macro** directly in the sidebar.
52
 
53
  ## 🎮 Quick Start Examples
54
 
55
+ ### Try These Sentences
56
 
57
+ | Language | Sentence |
58
+ |----------|----------|
59
+ | Tatar | Мин татарча сөйләшәм. |
60
+ | Tatar | Кичә мин дусларым белән паркка бардым. |
61
+ | Tatar | Татарстан – Россия Федерациясе составындагы республика. |
62
 
63
+ Just paste any sentence into the text box and click **Analyze**!
64
 
65
+ ### Expected Output (for mBERT on the first sentence)
66
 
67
+ | Word | Morphological Tag | Confidence |
68
+ |-----------|-------------------------|------------|
69
+ | Мин | Pron+Sg+Nom+Pers(1) | 0.999 |
70
+ | татарча | Adv | 0.998 |
71
+ | сөйләшәм | V+Pres+1 | 0.997 |
72
+ | . | PUNCT | 1.000 |
73
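The word‑level rows above are recovered from subword predictions. Below is a minimal sketch of that aggregation step, assuming WordPiece‑style subwords and a first‑subword tagging convention — the tokenizer split and the `aggregate_first_subword` helper are hypothetical, for illustration only:

```python
# Minimal sketch of word-level tag aggregation over subword predictions.
# The subword split and helper name below are hypothetical illustrations.
def aggregate_first_subword(tokens, tags, word_ids):
    """Keep the tag predicted for the first subword of each word."""
    words, word_tags = [], []
    prev = None
    for tok, tag, wid in zip(tokens, tags, word_ids):
        if wid is None:          # special tokens ([CLS], [SEP]) map to no word
            continue
        if wid != prev:          # first subword of a new word: keep its tag
            words.append(tok.lstrip("#"))
            word_tags.append(tag)
        else:                    # continuation subword: extend the surface form
            words[-1] += tok.lstrip("#")
        prev = wid
    return list(zip(words, word_tags))

# Hypothetical subword split of "Мин татарча сөйләшәм ."
tokens   = ["[CLS]", "Мин", "татар", "##ча", "сөйләшәм", ".", "[SEP]"]
tags     = ["O", "Pron+Sg+Nom+Pers(1)", "Adv", "Adv", "V+Pres+1", "PUNCT", "O"]
word_ids = [None, 0, 1, 1, 2, 3, None]

print(aggregate_first_subword(tokens, tags, word_ids))
# → [('Мин', 'Pron+Sg+Nom+Pers(1)'), ('татарча', 'Adv'), ('сөйләшәм', 'V+Pres+1'), ('.', 'PUNCT')]
```

Continuation pieces only extend the surface form; their predicted tags are discarded, which matches the usual first‑subword convention in token‑classification pipelines.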
 
74
+ ## 📈 Model Performance Comparison
75
 
76
+ | Model | Token Accuracy | F1‑micro | F1‑macro | Speed (sent./sec) |
77
+ |----------------|----------------|----------|----------|-------------------|
78
+ | **mBERT** | 98.68% | 98.68% | 50.94% | 150 |
79
+ | **RuBERT** | 98.13% | 98.13% | 47.37% | 150 |
80
+ | **DistilBERT** | 97.98% | 97.98% | 44.02% | 250 |
81
+ | **XLM‑R** | 97.67% | 97.67% | 40.61% | 120 |
82
+ | **Turkish BERT** | 86.84% | 86.84% | 33.34% | 150 |
83
 
84
+ > *Metrics are computed on a held‑out test set of 6,000 sentences (47k+ tokens). Full per‑POS accuracies are available in the `results/` folder of each model repository.*
85
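The wide gap between micro and macro F1 in the table is expected with a 1,181‑tag inventory: micro‑F1 equals token accuracy when every token receives exactly one tag, while macro‑F1 averages per‑tag F1 scores, so rare tags that are never predicted drag it down. A toy illustration (synthetic data, not the real corpus):

```python
from collections import Counter

def per_tag_f1(gold, pred):
    """Per-tag F1 from parallel gold/predicted tag sequences."""
    tags = set(gold) | set(pred)
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1
            fn[g] += 1
    f1 = {}
    for t in tags:
        prec = tp[t] / (tp[t] + fp[t]) if tp[t] + fp[t] else 0.0
        rec  = tp[t] / (tp[t] + fn[t]) if tp[t] + fn[t] else 0.0
        f1[t] = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return f1

# Toy data: one frequent tag predicted well, one rare tag always missed.
gold = ["N+Sg+Nom"] * 98 + ["V+Past+3"] * 2
pred = ["N+Sg+Nom"] * 100

f1 = per_tag_f1(gold, pred)
micro = sum(g == p for g, p in zip(gold, pred)) / len(gold)  # == token accuracy
macro = sum(f1.values()) / len(f1)
print(f"micro={micro:.2f} macro={macro:.2f}")
# → micro=0.98 macro=0.49
```

With one frequent tag predicted almost perfectly and one rare tag always missed, micro stays at 0.98 while macro collapses to ≈0.49 — the same mechanism behind the 98.68% vs 50.94% spread above.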
 
86
+ ## 🏗️ Technical Architecture
87
 
88
+ ### Models
89
 
90
+ All models are fine‑tuned from popular transformer checkpoints:
91
 
92
+ | Model | Base Checkpoint | Parameters |
93
+ |--------------|--------------------------------------|------------|
94
+ | mBERT | `bert-base-multilingual-cased` | ~180M |
95
+ | RuBERT | `DeepPavlov/rubert-base-cased` | ~180M |
96
+ | DistilBERT | `distilbert-base-multilingual-cased` | ~134M |
97
+ | XLM‑R | `xlm-roberta-base` | ~270M |
98
+ | Turkish BERT | `dbmdz/bert-base-turkish-cased` | ~110M |
99
 
100
+ ### Training Data
101
 
102
+ - **Dataset**: [TatarNLPWorld/tatar-morphological-corpus](https://huggingface.co/datasets/TatarNLPWorld/tatar-morphological-corpus)
103
+ - **Training subset**: 60,000 sentences (shuffled seed 42, filtered → 59,992)
104
+ - **Split**: Train 47,993 / Validation 5,999 / Test 6,000
105
+ - **Tagset**: 1,181 unique morphological tags (e.g., `N+Sg+Nom`, `V+Past+3`, `PUNCT`)
106
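The split sizes quoted above are consistent with a plain 80/10/10 split of the 59,992 filtered sentences. A quick arithmetic check — the truncating rounding scheme is an assumption, since the actual split code is not shown in this Space:

```python
# 80/10/10 split of the filtered corpus; int() truncation is assumed.
total = 59_992
train = int(total * 0.8)     # 47,993
val   = int(total * 0.1)     # 5,999
test  = total - train - val  # 6,000 (remainder)
print(train, val, test)
# → 47993 5999 6000
```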
 
107
+ ### Hyperparameters (common to all models)
108
 
109
+ | Parameter | Value |
110
+ |---------------------|----------------|
111
+ | Learning rate | 2e-5 |
112
+ | Optimizer | AdamW (wd=0.01)|
113
+ | Warmup steps | 500 |
114
+ | Number of epochs | 4 |
115
+ | Max sequence length | 128 |
116
+ | Mixed precision | FP16 |
117
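For reference, the table above maps onto Hugging Face `TrainingArguments` roughly as follows — a hypothetical config sketch, since the actual training script is not part of this Space (the output path is made up, and the 128‑token limit is applied at tokenization time rather than here):

```python
from transformers import TrainingArguments

# Hypothetical reconstruction of the hyperparameter table above.
args = TrainingArguments(
    output_dir="tatar-morph-checkpoints",  # hypothetical path
    per_device_train_batch_size=32,        # effective batch size 32 on one GPU
    learning_rate=2e-5,
    weight_decay=0.01,                     # AdamW is the default optimizer
    warmup_steps=500,
    num_train_epochs=4,
    fp16=True,                             # mixed precision
)

# The max sequence length (128) belongs in the tokenizer call, e.g.:
# tokenizer(sentences, truncation=True, max_length=128)
```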
 
118
+ ### Training Hardware
119
 
120
+ - **GPU**: NVIDIA Tesla V100 (32GB)
121
+ - **Training time**: 4–8 hours per model
122
+ - **Inference speed**: Varies by model (see table above)
123
 
124
+ ## 📦 Repository Structure
125
 
126
+ ```
127
+ .
128
+ ├── app.py # Main Streamlit application
129
+ ├── requirements.txt # Python dependencies
130
+ ├── .streamlit/
131
+ │   └── config.toml # Streamlit server config (port 7860)
132
+ ├── results/ # (optional) Additional metrics and plots
133
+ └── README.md # This file
134
+ ```
135
 
136
+ ## 🚀 Local Deployment
137
 
138
+ ```bash
139
+ # Clone the Space
140
+ git clone https://huggingface.co/spaces/TatarNLPWorld/tatar-morph-analyzer
141
+ cd tatar-morph-analyzer
142
 
143
+ # Install dependencies
144
+ pip install -r requirements.txt
145
 
146
+ # Run the app
147
+ streamlit run app.py --server.port 8501
148
+ ```
149
 
150
+ The app will be available at `http://localhost:8501`.
151
 
152
  ## 📜 Citation
153
 
154
+ If you use this Space or any of the underlying models in your research, please cite the appropriate model (see each model card for BibTeX). For general attribution:
155
 
156
  ```bibtex
157
+ @misc{tatar-morph-analyzer,
158
  author = {Arabov, Mullosharaf Kurbonovich and TatarNLPWorld Contributors},
159
+ title = {Tatar Morphological Analyzer Interactive Demo},
160
  year = {2026},
161
  publisher = {Hugging Face},
162
+ howpublished = {\url{https://huggingface.co/spaces/TatarNLPWorld/tatar-morph-analyzer}}
163
  }
164
  ```
165
 
166
  ## 📄 License
167
 
168
+ The code in this Space is released under the **MIT License**. Each model retains its own license (all are Apache 2.0 or MIT).
169
 
170
  ## 🙏 Acknowledgments
171
 
172
  - **Dataset**: [TatarNLPWorld/tatar-morphological-corpus](https://huggingface.co/datasets/TatarNLPWorld/tatar-morphological-corpus)
173
+ - **Model checkpoints**: Hugging Face Hub
174
+ - **Framework**: Streamlit, Transformers, PyTorch
175
+ - **Community**: All contributors to TatarNLPWorld
176
 
177
  ---
178
 
179
  <div align="center">
180
 
181
+ **Explore and advance Tatar language technology**
182
 
183
  *Brought to you by [TatarNLPWorld](https://huggingface.co/TatarNLPWorld)*
184
 
185
+ [Report Issue](https://huggingface.co/spaces/TatarNLPWorld/tatar-morph-analyzer/discussions) •
186
+ [Request Feature](https://huggingface.co/spaces/TatarNLPWorld/tatar-morph-analyzer/discussions) •
187
  [Contact](mailto:arabov.mk@gmail.com)
188
 
189
  </div>