---
language: tt
license: apache-2.0
datasets:
- TatarNLPWorld/tatar-morphological-corpus
pipeline_tag: token-classification
tags:
- tatar
- morphology
- token-classification
- lstm
- crf
---

# BiLSTM‑CRF for Tatar Morphological Analysis

This model is a **BiLSTM‑CRF** trained on 80,000 sentences from the [Tatar Morphological Corpus](https://huggingface.co/datasets/TatarNLPWorld/tatar-morphological-corpus). It predicts fine‑grained morphological tags (e.g., `N+Sg+Nom`, `V+PRES(Й)+3SG`).

## Performance on Test Set

| Metric | Value | 95% CI |
|--------|-------|--------|
| Token Accuracy | 0.9440 | [0.9421, 0.9458] |
| Micro F1 | 0.9440 | [0.9420, 0.9459] |
| Macro F1 | 0.5330 | [0.5149, 0.5519] |

### Accuracy by Part of Speech (Top 10)

| POS | Accuracy |
|-----|----------|
| PUNCT | 1.0000 |
| NOUN | 0.8913 |
| VERB | 0.8725 |
| ADJ | 0.9418 |
| PRON | 0.9900 |
| PART | 0.9982 |
| PROPN | 0.9248 |
| ADP | 1.0000 |
| CCONJ | 0.9992 |
| ADV | 0.9886 |

## Usage

Install the required packages:

```bash
pip install torch torchcrf transformers huggingface_hub
```

Then load and use the model:

```python
import json

import torch
from torch import nn
from torchcrf import CRF
from huggingface_hub import hf_hub_download


# Model definition (must match the architecture used during training)
class BiLSTMCRF(nn.Module):
    def __init__(self, vocab_size, emb_dim, hid_dim, num_tags, dropout=0.5):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.lstm = nn.LSTM(emb_dim, hid_dim // 2, bidirectional=True,
                            batch_first=True, dropout=dropout)
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(hid_dim, num_tags)
        self.crf = CRF(num_tags, batch_first=True)

    def forward(self, input_ids, mask, labels=None):
        embeds = self.embedding(input_ids)
        lstm_out, _ = self.lstm(embeds)
        lstm_out = self.dropout(lstm_out)
        emissions = self.classifier(lstm_out)
        if labels is not None:
            mask = mask.bool()
            # The CRF cannot handle the -100 ignore index; map it to tag 0
            # (those positions are masked out anyway)
            labels = torch.where(labels == -100,
                                 torch.tensor(0, device=labels.device), labels)
            # CRF returns the log-likelihood; negate it to get a loss
            return -self.crf(emissions, labels, mask=mask, reduction='mean')
        return self.crf.decode(emissions, mask=mask.bool())


# Download the required files from Hugging Face
repo_id = "TatarNLPWorld/lstm-tatar-morph"
config_path = hf_hub_download(repo_id, "config.json")
word2id_path = hf_hub_download(repo_id, "word2id.json")
weights_path = hf_hub_download(repo_id, "best_model.pt")
id2tag_path = hf_hub_download(repo_id, "id2tag.json")

# Load hyperparameters and vocabularies
with open(config_path) as f:
    config = json.load(f)
with open(word2id_path) as f:
    word2id = json.load(f)
with open(id2tag_path) as f:
    id2tag = {int(k): v for k, v in json.load(f).items()}

# Instantiate the model and load the trained weights
model = BiLSTMCRF(
    vocab_size=len(word2id),
    emb_dim=config['embedding_dim'],
    hid_dim=config['hidden_dim'],
    num_tags=config['num_labels'],
    dropout=config.get('dropout', 0.5),
)
model.load_state_dict(torch.load(weights_path, map_location='cpu'), strict=False)
model.eval()


def predict(tokens, max_len=128):
    # Out-of-vocabulary tokens fall back to the unknown-token id
    # (assumed to be stored under '<UNK>' in word2id)
    unk_id = word2id.get('<UNK>', 0)
    ids = [word2id.get(w, unk_id) for w in tokens]
    mask = [1] * len(ids)
    orig_len = min(len(ids), max_len)
    if len(ids) > max_len:
        ids = ids[:max_len]
        mask = mask[:max_len]
    else:
        ids += [0] * (max_len - len(ids))
        mask += [0] * (max_len - len(mask))
    input_ids = torch.tensor([ids], dtype=torch.long)
    mask_tensor = torch.tensor([mask], dtype=torch.long)
    with torch.no_grad():
        preds = model(input_ids, mask_tensor)[0]
    return [id2tag[p] for p in preds[:orig_len]]


# Example
tokens = ["Татар", "теле", "бик", "бай", "."]
tags = predict(tokens)
for token, tag in zip(tokens, tags):
    print(f"{token} -> {tag}")
```

Expected output:

```
Татар -> N+Sg+Nom
теле -> N+Sg+POSS_3(СЫ)+Nom
бик -> Adv
бай -> Adj
. -> PUNCT
```

## Citation

If you use this model, please cite it as:

```bibtex
@misc{arabov-lstm-tatar-morph-2026,
  title     = {BiLSTM‑CRF for Tatar Morphological Analysis},
  author    = {Arabov Mullosharaf Kurbonovich},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/TatarNLPWorld/lstm-tatar-morph}
}
```

## License

Apache 2.0
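## Appendix: What `crf.decode` Computes

For readers curious about the decoding step in the model above: `crf.decode` runs Viterbi decoding over the per-token emission scores from the linear classifier and the learned tag-to-tag transition scores, returning the highest-scoring tag sequence rather than the per-token argmax. A minimal pure-Python sketch (simplified: no start/end transitions, no masking — `torchcrf` handles both) illustrates the recurrence:

```python
def viterbi_decode(emissions, transitions):
    """Return the highest-scoring tag path.

    emissions:   T x K matrix, score of tag k at position t
    transitions: K x K matrix, score of moving from tag i to tag j
    """
    num_tags = len(transitions)
    scores = list(emissions[0])   # best score of any path ending in each tag
    backpointers = []
    for emission in emissions[1:]:
        step_scores, step_ptrs = [], []
        for j in range(num_tags):
            # Best previous tag to transition into tag j
            best_prev = max(range(num_tags),
                            key=lambda i: scores[i] + transitions[i][j])
            step_scores.append(scores[best_prev] + transitions[best_prev][j]
                               + emission[j])
            step_ptrs.append(best_prev)
        scores = step_scores
        backpointers.append(step_ptrs)
    # Follow backpointers from the best final tag
    best_tag = max(range(num_tags), key=lambda j: scores[j])
    path = [best_tag]
    for ptrs in reversed(backpointers):
        path.append(ptrs[path[-1]])
    return path[::-1]


# A strong transition score can override the per-token emission argmax:
# emissions alone favor tag 0 at both positions, but the 0 -> 1 transition wins.
print(viterbi_decode([[1.0, 0.0], [0.5, 0.0]],
                     [[-5.0, 5.0], [0.0, 0.0]]))  # [0, 1]
```

This is why a CRF layer helps with morphological tagging: impossible tag sequences get low transition scores and are avoided even when individual token emissions are ambiguous.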
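The fixed-length padding and masking that `predict` performs inline can also be isolated into a small standalone helper; this sketch mirrors that logic (the `pad_id=0` default matches the `padding_idx=0` of the embedding layer, and the 0/1 mask tells the CRF which positions to ignore):

```python
def pad_or_truncate(ids, max_len=128, pad_id=0):
    """Clip a token-id sequence to max_len and right-pad it with pad_id.

    Returns the padded ids and a 0/1 mask of the same length.
    """
    mask = [1] * min(len(ids), max_len)
    ids = ids[:max_len]
    ids = ids + [pad_id] * (max_len - len(ids))
    mask = mask + [0] * (max_len - len(mask))
    return ids, mask


ids, mask = pad_or_truncate([5, 9, 3], max_len=6)
# ids  == [5, 9, 3, 0, 0, 0]
# mask == [1, 1, 1, 0, 0, 0]
```

Slicing the decoded tags back to `min(len(tokens), max_len)`, as `predict` does, then drops the padding positions from the output.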