Update README.md

f720d9c verified 17 days ago

4.42 kB

	---
	language: tt
	license: apache-2.0
	datasets:
	- TatarNLPWorld/tatar-morphological-corpus
	pipeline_tag: token-classification
	tags:
	- tatar
	- morphology
	- token-classification
	- lstm
	- crf
	---

	# BiLSTM‑CRF for Tatar Morphological Analysis

	This model is a BiLSTM‑CRF trained on 80,000 sentences from the [Tatar Morphological Corpus](https://huggingface.co/datasets/TatarNLPWorld/tatar-morphological-corpus). It predicts fine‑grained morphological tags (e.g., `N+Sg+Nom`, `V+PRES(Й)+3SG`).

	## Performance on Test Set

	\| Metric \| Value \| 95% CI \|
	\|--------\|-------\|--------\|
	\| Token Accuracy \| 0.9440 \| [0.9421, 0.9458] \|
	\| Micro F1 \| 0.9440 \| [0.9420, 0.9459] \|
	\| Macro F1 \| 0.5330 \| [0.5149, 0.5519] \|

	### Accuracy by Part of Speech (Top 10)

	\| POS \| Accuracy \|
	\|-----\|----------\|
	\| PUNCT \| 1.0000 \|
	\| NOUN \| 0.8913 \|
	\| VERB \| 0.8725 \|
	\| ADJ \| 0.9418 \|
	\| PRON \| 0.9900 \|
	\| PART \| 0.9982 \|
	\| PROPN \| 0.9248 \|
	\| ADP \| 1.0000 \|
	\| CCONJ \| 0.9992 \|
	\| ADV \| 0.9886 \|

	## Usage

	Install required packages:

	```bash
	pip install torch torchcrf transformers huggingface_hub
	```

	Then load and use the model:

	```python
	import torch
	import json
	from torch import nn
	from torchcrf import CRF
	from huggingface_hub import hf_hub_download

	# Define the model class (must match training)
	class BiLSTMCRF(nn.Module):
	def __init__(self, vocab_size, emb_dim, hid_dim, num_tags, dropout=0.5):
	super().__init__()
	self.embedding = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
	self.lstm = nn.LSTM(emb_dim, hid_dim // 2, bidirectional=True, batch_first=True, dropout=dropout)
	self.dropout = nn.Dropout(dropout)
	self.classifier = nn.Linear(hid_dim, num_tags)
	self.crf = CRF(num_tags, batch_first=True)

	def forward(self, input_ids, mask, labels=None):
	embeds = self.embedding(input_ids)
	lstm_out, _ = self.lstm(embeds)
	lstm_out = self.dropout(lstm_out)
	emissions = self.classifier(lstm_out)
	if labels is not None:
	mask = mask.bool()
	labels = torch.where(labels == -100, torch.tensor(0, device=labels.device), labels)
	return -self.crf(emissions, labels, mask=mask, reduction='mean')
	else:
	return self.crf.decode(emissions, mask=mask.bool())

	# Download required files from Hugging Face
	repo_id = "TatarNLPWorld/lstm-tatar-morph"
	config_path = hf_hub_download(repo_id, "config.json")
	word2id_path = hf_hub_download(repo_id, "word2id.json")
	weights_path = hf_hub_download(repo_id, "best_model.pt")
	id2tag_path = hf_hub_download(repo_id, "id2tag.json")

	# Load hyperparameters
	with open(config_path) as f:
	config = json.load(f)

	with open(word2id_path) as f:
	word2id = json.load(f)

	with open(id2tag_path) as f:
	id2tag = {int(k): v for k, v in json.load(f).items()}

	# Instantiate model and load weights
	model = BiLSTMCRF(
	vocab_size=len(word2id),
	emb_dim=config['embedding_dim'],
	hid_dim=config['hidden_dim'],
	num_tags=config['num_labels'],
	dropout=config.get('dropout', 0.5)
	)
	model.load_state_dict(torch.load(weights_path, map_location='cpu'), strict=False)
	model.eval()

	def predict(tokens, max_len=128):
	ids = [word2id.get(w, word2id['<UNK>']) for w in tokens]
	mask = [1] * len(ids)
	orig_len = len(ids)

	if len(ids) > max_len:
	ids = ids[:max_len]
	mask = mask[:max_len]
	tokens = tokens[:max_len]
	else:
	ids += [0] * (max_len - len(ids))
	mask += [0] * (max_len - len(mask))

	input_ids = torch.tensor([ids], dtype=torch.long)
	mask_tensor = torch.tensor([mask], dtype=torch.long)

	with torch.no_grad():
	preds = model(input_ids, mask_tensor)[0]

	preds = preds[:orig_len]
	return [id2tag[p] for p in preds]

	# Example
	tokens = ["Татар", "теле", "бик", "бай", "."]
	tags = predict(tokens)
	for token, tag in zip(tokens, tags):
	print(f"{token} -> {tag}")
	```

	Expected output:

	```
	Татар -> N+Sg+Nom
	теле -> N+Sg+POSS_3(СЫ)+Nom
	бик -> Adv
	бай -> Adj
	. -> PUNCT
	```

	## Citation

	If you use this model, please cite it as:

	```bibtex
	@misc{arabov-lstm-tatar-morph-2026,
	title = {BiLSTM‑CRF for Tatar Morphological Analysis},
	author = {Arabov Mullosharaf Kurbonovich},
	year = {2026},
	publisher = {Hugging Face},
	url = {https://huggingface.co/TatarNLPWorld/lstm-tatar-morph}
	}
	```

	## License

	Apache 2.0