---
license: mit
language:
- tt
metrics:
- accuracy
---

# Tatar2Vec: Word Embeddings for the Tatar Language

This repository contains a collection of pre-trained word embedding models for the Tatar language. The models are trained on a large Tatar corpus using two popular algorithms: **Word2Vec** and **FastText**, with different architectures and vector sizes.

All models are ready to use with the `gensim` library and can be downloaded via the Hugging Face Hub.

## 📦 Available Models

The following models are included:

| Model Name     | Type     | Architecture | Vector Size | #Vectors | Notes |
|----------------|----------|--------------|-------------|----------|-------|
| `w2v_cbow_100` | Word2Vec | CBOW         | 100         | 1.29M    | Best overall for semantic analogy tasks |
| `w2v_cbow_200` | Word2Vec | CBOW         | 200         | 1.29M    | Higher dimensionality, more expressive |
| `w2v_sg_100`   | Word2Vec | Skip-gram    | 100         | 1.29M    | Often better for rare words |
| `ft_cbow_100`  | FastText | CBOW         | 100         | 1.29M    | Handles subword information, good for morphology |
| `ft_cbow_200`  | FastText | CBOW         | 200         | 1.29M    | Larger FastText model |

All models share the same vocabulary of **1,293,992** unique tokens, achieving **100% coverage** of the training corpus.
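Coverage here is the fraction of corpus tokens that appear in the embedding vocabulary. A minimal sketch of the metric, using a toy token list and vocabulary as stand-ins (with a loaded model, the real vocabulary is `model.wv.key_to_index`):

```python
def coverage(tokens, vocab):
    """Fraction of corpus tokens found in the embedding vocabulary."""
    return sum(1 for t in tokens if t in vocab) / len(tokens)

# Toy example: 3 of the 4 tokens are in the vocabulary
tokens = ["мин", "татар", "телен", "яратам"]
vocab = {"мин", "татар", "яратам"}
print(coverage(tokens, vocab))  # 0.75
```

A coverage of 100% simply means every token of the training corpus received a vector, which follows from training the vocabulary on that same corpus.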

## 📁 Repository Structure

The files are organised in subdirectories for easy access:

```
Tatar2Vec/
├── word2vec/
│   ├── cbow100/   # w2v_cbow_100 model files
│   ├── cbow200/   # w2v_cbow_200 model files
│   └── sg100/     # w2v_sg_100 model files
└── fasttext/
    ├── cbow100/   # ft_cbow_100 model files
    └── cbow200/   # ft_cbow_200 model files
```

Each model folder contains the files saved by `gensim` (the `.model` file, `.npy` vector arrays, etc.).

## 🚀 Usage

### Installation

First, install the required libraries:

```bash
pip install huggingface_hub gensim
```

### Download a Model

Use `snapshot_download` to download all files of a specific model into a local directory:

```python
from huggingface_hub import snapshot_download
import gensim
import os

# Download only the Word2Vec CBOW 100 model (best on analogy tasks)
model_path = snapshot_download(
    repo_id="TatarNLPWorld/Tatar2Vec",
    allow_patterns="word2vec/cbow100/*",  # fetch only this model's files
    local_dir="./tatar2vec_cbow100",      # optional local folder
)

# Load the model with gensim
model_file = os.path.join(model_path, "word2vec/cbow100/w2v_cbow_100.model")
model = gensim.models.Word2Vec.load(model_file)

# Test it
print(model.wv.most_similar("татар"))
```

Alternatively, you can download the whole repository (call `snapshot_download` without `allow_patterns`) or fetch individual files with `hf_hub_download`.
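A sketch of the single-file route, assuming the file names follow the repository structure shown above. Note that `gensim` stores large arrays in sidecar `.npy` files next to the `.model` file, so any such files in the folder must be fetched into the same directory before loading:

```python
from huggingface_hub import hf_hub_download

# Single-file fetch; the file name is assumed from the repository layout.
model_file = hf_hub_download(
    repo_id="TatarNLPWorld/Tatar2Vec",
    filename="word2vec/sg100/w2v_sg_100.model",
    local_dir="./tatar2vec_sg100",
)
# If the folder also lists sidecar .npy files, download each of them with
# hf_hub_download into the same local_dir, then call
# gensim.models.Word2Vec.load(model_file).
```

For anything beyond a quick experiment, `snapshot_download` with `allow_patterns` (as above) is the simpler option because it fetches the sidecar files automatically.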

## 📊 Model Comparison

We evaluated all models on a set of intrinsic tasks:

- **Word analogies** (e.g., `Мәскәү:Россия = Казан:?`)
- **Semantic similarity** (cosine similarity of related word pairs)
- **Out-of-vocabulary (OOV) handling** (FastText only)
- **Nearest-neighbour inspection**
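The analogy task reduces to vector-offset arithmetic: the answer to `a:b = c:?` is the vocabulary word whose vector is closest, by cosine similarity, to `b − a + c`. A toy sketch with made-up 3-dimensional vectors (with a loaded model, gensim exposes the same operation as `model.wv.most_similar(positive=[b, c], negative=[a])`):

```python
import numpy as np

# Toy embeddings; real Tatar2Vec vectors are 100- or 200-dimensional,
# and these particular values are illustrative only.
vectors = {
    "Мәскәү":    np.array([0.9, 0.1, 0.0]),
    "Россия":    np.array([0.9, 0.9, 0.0]),
    "Казан":     np.array([0.1, 0.1, 0.8]),
    "Татарстан": np.array([0.1, 0.9, 0.8]),
}

def analogy(a, b, c, vocab):
    """Solve a:b = c:? by vector offset, as gensim's most_similar does."""
    target = vocab[b] - vocab[a] + vocab[c]
    best, best_sim = None, -1.0
    for word, vec in vocab.items():
        if word in (a, b, c):  # exclude the query words themselves
            continue
        sim = vec @ target / (np.linalg.norm(vec) * np.linalg.norm(target))
        if sim > best_sim:
            best, best_sim = word, sim
    return best

print(analogy("Мәскәү", "Россия", "Казан", vectors))  # Татарстан
```

Accuracy on the analogy benchmark is simply the fraction of such queries for which the top-ranked word matches the expected answer.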

The **Word2Vec CBOW (100-dim)** model performed best overall, especially on analogy tasks (60% accuracy vs. 0% for FastText). Below is a summary of the key metrics:

| Metric                   | Word2Vec (cbow100) | FastText (cbow100) |
|--------------------------|--------------------|--------------------|
| Analogy accuracy         | 60.0%              | 0.0%               |
| Avg. semantic similarity | 0.568              | 0.582              |
| OOV handling             | N/A                | Good (subword)     |
| Vocabulary coverage      | 100%               | 100%               |
| Training time            | 1760 s             | 3323 s             |

**Why Word2Vec?** It produces cleaner nearest neighbours (actual words without punctuation artifacts) and captures semantic relationships more accurately. FastText, while slightly better on raw similarity, tends to return noisy forms with attached punctuation.
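FastText's OOV handling comes from character n-grams: each word is decomposed into n-grams delimited by `<` and `>` boundary markers, and an unseen word's vector is built by averaging the vectors of its n-grams. A sketch of the decomposition (FastText's defaults are n = 3..6; gensim applies this automatically when you index `model.wv` with an unseen word):

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams with FastText-style boundary markers."""
    w = f"<{word}>"
    return [w[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(w) - n + 1)]

# An unseen inflected form shares most n-grams with seen forms of the
# same stem, so its composed vector lands near them.
print(char_ngrams("татарча", 3, 4))
```

This same mechanism explains the noisy neighbours noted above: a token with attached punctuation still shares most of its n-grams with the clean form, so the two end up close together in the vector space.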

For a detailed report, see the [model comparison results](model_comparison_report.md) included in the repository.

## 📝 License

All models are released under the **MIT License**. You are free to use, modify, and distribute them for any purpose, with proper attribution.

## 📜 Certificate

This software (Tatar2Vec) is registered with the Federal Service for Intellectual Property (Rospatent) under the following certificate:

- **Certificate number**: 2026610619
- **Title**: Tatar2Vec
- **Filing date**: December 23, 2025
- **Publication date**: January 14, 2026
- **Author**: Mullosharaf K. Arabov
- **Applicant**: Kazan Federal University

*Certificate of state registration of a computer program No. 2026610619, Russian Federation. Tatar2Vec: filed 23.12.2025, published 14.01.2026 / M. K. Arabov; applicant: Federal State Autonomous Educational Institution of Higher Education "Kazan Federal University".*

## 🤝 Citation

If you use these models in your research, please cite the software registration:

```bibtex
@software{tatar2vec_2026,
  title     = {Tatar2Vec},
  author    = {Arabov, Mullosharaf Kurbonovich},
  year      = {2026},
  publisher = {Kazan Federal University},
  note      = {Registered software, Certificate No. 2026610619},
  url       = {https://huggingface.co/TatarNLPWorld/Tatar2Vec}
}
```

## 🌐 Language

The models are trained on Tatar text and are intended for use with the Tatar language (language code `tt`).

## 🙌 Acknowledgements

These models were trained by [TatarNLPWorld](https://huggingface.co/TatarNLPWorld) as part of an effort to advance NLP resources for the Tatar language. We thank the open-source community for the tools and libraries that made this work possible.