---
language: tt
license: apache-2.0
datasets:
- TatarNLPWorld/tatar-morphological-corpus
pipeline_tag: token-classification
tags:
- tatar
- morphology
- token-classification
- lstm
- crf
---

# BiLSTM‑CRF for Tatar Morphological Analysis

This model is a **BiLSTM‑CRF** trained on 80,000 sentences from the [Tatar Morphological Corpus](https://huggingface.co/datasets/TatarNLPWorld/tatar-morphological-corpus). It predicts fine‑grained morphological tags (e.g., `N+Sg+Nom`, `V+PRES(Й)+3SG`).

## Performance on Test Set

| Metric | Value | 95% CI |
|--------|-------|--------|
| Token Accuracy | 0.9440 | [0.9421, 0.9458] |
| Micro F1 | 0.9440 | [0.9420, 0.9459] |
| Macro F1 | 0.5330 | [0.5149, 0.5519] |

### Accuracy by Part of Speech (Top 10)

| POS | Accuracy |
|-----|----------|
| PUNCT | 1.0000 |
| NOUN | 0.8913 |
| VERB | 0.8725 |
| ADJ | 0.9418 |
| PRON | 0.9900 |
| PART | 0.9982 |
| PROPN | 0.9248 |
| ADP | 1.0000 |
| CCONJ | 0.9992 |
| ADV | 0.9886 |

## Usage

Install the required packages:

```bash
pip install torch torchcrf transformers huggingface_hub
```

Then load and use the model:

```python
import json

import torch
from torch import nn
from torchcrf import CRF
from huggingface_hub import hf_hub_download


# Model definition (must match the architecture used during training)
class BiLSTMCRF(nn.Module):
    def __init__(self, vocab_size, emb_dim, hid_dim, num_tags, dropout=0.5):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.lstm = nn.LSTM(emb_dim, hid_dim // 2, bidirectional=True,
                            batch_first=True, dropout=dropout)
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(hid_dim, num_tags)
        self.crf = CRF(num_tags, batch_first=True)

    def forward(self, input_ids, mask, labels=None):
        embeds = self.embedding(input_ids)
        lstm_out, _ = self.lstm(embeds)
        lstm_out = self.dropout(lstm_out)
        emissions = self.classifier(lstm_out)
        if labels is not None:
            mask = mask.bool()
            # The CRF cannot handle the -100 ignore index; map it to tag 0
            # (those positions are masked out anyway)
            labels = torch.where(labels == -100,
                                 torch.tensor(0, device=labels.device), labels)
            # CRF returns the log-likelihood; negate it to get a loss
            return -self.crf(emissions, labels, mask=mask, reduction='mean')
        return self.crf.decode(emissions, mask=mask.bool())


# Download the required files from Hugging Face
repo_id = "TatarNLPWorld/lstm-tatar-morph"
config_path = hf_hub_download(repo_id, "config.json")
word2id_path = hf_hub_download(repo_id, "word2id.json")
weights_path = hf_hub_download(repo_id, "best_model.pt")
id2tag_path = hf_hub_download(repo_id, "id2tag.json")

# Load hyperparameters and vocabularies
with open(config_path) as f:
    config = json.load(f)
with open(word2id_path) as f:
    word2id = json.load(f)
with open(id2tag_path) as f:
    id2tag = {int(k): v for k, v in json.load(f).items()}

# Instantiate the model and load the trained weights
model = BiLSTMCRF(
    vocab_size=len(word2id),
    emb_dim=config['embedding_dim'],
    hid_dim=config['hidden_dim'],
    num_tags=config['num_labels'],
    dropout=config.get('dropout', 0.5),
)
model.load_state_dict(torch.load(weights_path, map_location='cpu'), strict=False)
model.eval()


def predict(tokens, max_len=128):
    # Out-of-vocabulary tokens fall back to the unknown-token id
    # (assumed to be stored under '<UNK>' in word2id)
    unk_id = word2id.get('<UNK>', 0)
    ids = [word2id.get(w, unk_id) for w in tokens]
    mask = [1] * len(ids)
    orig_len = min(len(ids), max_len)
    if len(ids) > max_len:
        ids = ids[:max_len]
        mask = mask[:max_len]
    else:
        ids += [0] * (max_len - len(ids))
        mask += [0] * (max_len - len(mask))
    input_ids = torch.tensor([ids], dtype=torch.long)
    mask_tensor = torch.tensor([mask], dtype=torch.long)
    with torch.no_grad():
        preds = model(input_ids, mask_tensor)[0]
    return [id2tag[p] for p in preds[:orig_len]]


# Example
tokens = ["Татар", "теле", "бик", "бай", "."]
tags = predict(tokens)
for token, tag in zip(tokens, tags):
    print(f"{token} -> {tag}")
```

Expected output:

```
Татар -> N+Sg+Nom
теле -> N+Sg+POSS_3(СЫ)+Nom
бик -> Adv
бай -> Adj
. -> PUNCT
```

## Citation

If you use this model, please cite it as:

```bibtex
@misc{arabov-lstm-tatar-morph-2026,
  title     = {BiLSTM‑CRF for Tatar Morphological Analysis},
  author    = {Arabov Mullosharaf Kurbonovich},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/TatarNLPWorld/lstm-tatar-morph}
}
```

## License

Apache 2.0
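## Appendix: What `crf.decode` Computes

For readers curious about the decoding step in the model above: `crf.decode` runs Viterbi decoding over the per-token emission scores from the linear classifier and the learned tag-to-tag transition scores, returning the highest-scoring tag sequence rather than the per-token argmax. A minimal pure-Python sketch (simplified: no start/end transitions, no masking — `torchcrf` handles both) illustrates the recurrence:

```python
def viterbi_decode(emissions, transitions):
    """Return the highest-scoring tag path.

    emissions:   T x K matrix, score of tag k at position t
    transitions: K x K matrix, score of moving from tag i to tag j
    """
    num_tags = len(transitions)
    scores = list(emissions[0])   # best score of any path ending in each tag
    backpointers = []
    for emission in emissions[1:]:
        step_scores, step_ptrs = [], []
        for j in range(num_tags):
            # Best previous tag to transition into tag j
            best_prev = max(range(num_tags),
                            key=lambda i: scores[i] + transitions[i][j])
            step_scores.append(scores[best_prev] + transitions[best_prev][j]
                               + emission[j])
            step_ptrs.append(best_prev)
        scores = step_scores
        backpointers.append(step_ptrs)
    # Follow backpointers from the best final tag
    best_tag = max(range(num_tags), key=lambda j: scores[j])
    path = [best_tag]
    for ptrs in reversed(backpointers):
        path.append(ptrs[path[-1]])
    return path[::-1]


# A strong transition score can override the per-token emission argmax:
# emissions alone favor tag 0 at both positions, but the 0 -> 1 transition wins.
print(viterbi_decode([[1.0, 0.0], [0.5, 0.0]],
                     [[-5.0, 5.0], [0.0, 0.0]]))  # [0, 1]
```

This is why a CRF layer helps with morphological tagging: impossible tag sequences get low transition scores and are avoided even when individual token emissions are ambiguous.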
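The fixed-length padding and masking that `predict` performs inline can also be isolated into a small standalone helper; this sketch mirrors that logic (the `pad_id=0` default matches the `padding_idx=0` of the embedding layer, and the 0/1 mask tells the CRF which positions to ignore):

```python
def pad_or_truncate(ids, max_len=128, pad_id=0):
    """Clip a token-id sequence to max_len and right-pad it with pad_id.

    Returns the padded ids and a 0/1 mask of the same length.
    """
    mask = [1] * min(len(ids), max_len)
    ids = ids[:max_len]
    ids = ids + [pad_id] * (max_len - len(ids))
    mask = mask + [0] * (max_len - len(mask))
    return ids, mask


ids, mask = pad_or_truncate([5, 9, 3], max_len=6)
# ids  == [5, 9, 3, 0, 0, 0]
# mask == [1, 1, 1, 0, 0, 0]
```

Slicing the decoded tags back to `min(len(tokens), max_len)`, as `predict` does, then drops the padding positions from the output.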