---
license: mit
language:
- tt
metrics:
- accuracy
---

# Tatar2Vec: Word Embeddings for the Tatar Language

This repository contains a collection of pre-trained word embedding models for the Tatar language. The models are trained on a large Tatar corpus using two popular algorithms: **Word2Vec** and **FastText**, with different architectures and vector sizes.

All models are ready to use with the `gensim` library and can be downloaded via the Hugging Face Hub.

## 📦 Available Models

The following models are included:

| Model Name     | Type     | Architecture | Vector Size | #Vectors | Notes |
|----------------|----------|--------------|-------------|----------|-------|
| `w2v_cbow_100` | Word2Vec | CBOW         | 100         | 1.29M    | Best overall for semantic analogy tasks |
| `w2v_cbow_200` | Word2Vec | CBOW         | 200         | 1.29M    | Higher dimensionality, more expressive |
| `w2v_sg_100`   | Word2Vec | Skip-gram    | 100         | 1.29M    | Often better for rare words |
| `ft_cbow_100`  | FastText | CBOW         | 100         | 1.29M    | Handles subword information, good for morphology |
| `ft_cbow_200`  | FastText | CBOW         | 200         | 1.29M    | Larger FastText model |

All models share the same vocabulary of **1,293,992** unique tokens, achieving **100% coverage** of the training corpus.
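Coverage here is the fraction of corpus tokens that appear in the embedding vocabulary. A minimal sketch of the metric, using a toy token list and vocabulary as stand-ins (with a loaded model, the real vocabulary is `model.wv.key_to_index`):

```python
def coverage(tokens, vocab):
    """Fraction of corpus tokens found in the embedding vocabulary."""
    return sum(1 for t in tokens if t in vocab) / len(tokens)

# Toy example: 3 of the 4 tokens are in the vocabulary
tokens = ["мин", "татар", "телен", "яратам"]
vocab = {"мин", "татар", "яратам"}
print(coverage(tokens, vocab))  # 0.75
```

A coverage of 100% simply means every token of the training corpus received a vector, which follows from training the vocabulary on that same corpus.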

## 📁 Repository Structure

The files are organised in subdirectories for easy access:

```
Tatar2Vec/
├── word2vec/
│   ├── cbow100/   # w2v_cbow_100 model files
│   ├── cbow200/   # w2v_cbow_200 model files
│   └── sg100/     # w2v_sg_100 model files
└── fasttext/
    ├── cbow100/   # ft_cbow_100 model files
    └── cbow200/   # ft_cbow_200 model files
```

Each model folder contains the files saved by `gensim` (the `.model` file, `.npy` vector arrays, etc.).

## 🚀 Usage

### Installation

First, install the required libraries:

```bash
pip install huggingface_hub gensim
```

### Download a Model

Use `snapshot_download` to download all files of a specific model into a local directory:

```python
from huggingface_hub import snapshot_download
import gensim
import os

# Download only the Word2Vec CBOW 100 model (best on analogy tasks)
model_path = snapshot_download(
    repo_id="TatarNLPWorld/Tatar2Vec",
    allow_patterns="word2vec/cbow100/*",  # fetch only this model's files
    local_dir="./tatar2vec_cbow100",      # optional local folder
)

# Load the model with gensim
model_file = os.path.join(model_path, "word2vec/cbow100/w2v_cbow_100.model")
model = gensim.models.Word2Vec.load(model_file)

# Test it
print(model.wv.most_similar("татар"))
```

Alternatively, you can download the whole repository (call `snapshot_download` without `allow_patterns`) or fetch individual files with `hf_hub_download`.
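A sketch of the single-file route, assuming the file names follow the repository structure shown above. Note that `gensim` stores large arrays in sidecar `.npy` files next to the `.model` file, so any such files in the folder must be fetched into the same directory before loading:

```python
from huggingface_hub import hf_hub_download

# Single-file fetch; the file name is assumed from the repository layout.
model_file = hf_hub_download(
    repo_id="TatarNLPWorld/Tatar2Vec",
    filename="word2vec/sg100/w2v_sg_100.model",
    local_dir="./tatar2vec_sg100",
)
# If the folder also lists sidecar .npy files, download each of them with
# hf_hub_download into the same local_dir, then call
# gensim.models.Word2Vec.load(model_file).
```

For anything beyond a quick experiment, `snapshot_download` with `allow_patterns` (as above) is the simpler option because it fetches the sidecar files automatically.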

## 📊 Model Comparison

We evaluated all models on a set of intrinsic tasks:

- **Word analogies** (e.g., `Мәскәү:Россия = Казан:?`)
- **Semantic similarity** (cosine similarity of related word pairs)
- **Out-of-vocabulary (OOV) handling** (FastText only)
- **Nearest-neighbour inspection**
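The analogy task reduces to vector-offset arithmetic: the answer to `a:b = c:?` is the vocabulary word whose vector is closest, by cosine similarity, to `b − a + c`. A toy sketch with made-up 3-dimensional vectors (with a loaded model, gensim exposes the same operation as `model.wv.most_similar(positive=[b, c], negative=[a])`):

```python
import numpy as np

# Toy embeddings; real Tatar2Vec vectors are 100- or 200-dimensional,
# and these particular values are illustrative only.
vectors = {
    "Мәскәү":    np.array([0.9, 0.1, 0.0]),
    "Россия":    np.array([0.9, 0.9, 0.0]),
    "Казан":     np.array([0.1, 0.1, 0.8]),
    "Татарстан": np.array([0.1, 0.9, 0.8]),
}

def analogy(a, b, c, vocab):
    """Solve a:b = c:? by vector offset, as gensim's most_similar does."""
    target = vocab[b] - vocab[a] + vocab[c]
    best, best_sim = None, -1.0
    for word, vec in vocab.items():
        if word in (a, b, c):  # exclude the query words themselves
            continue
        sim = vec @ target / (np.linalg.norm(vec) * np.linalg.norm(target))
        if sim > best_sim:
            best, best_sim = word, sim
    return best

print(analogy("Мәскәү", "Россия", "Казан", vectors))  # Татарстан
```

Accuracy on the analogy benchmark is simply the fraction of such queries for which the top-ranked word matches the expected answer.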

The **Word2Vec CBOW (100-dim)** model performed best overall, especially on analogy tasks (60% accuracy vs. 0% for FastText). Below is a summary of the key metrics:

| Metric                   | Word2Vec (cbow100) | FastText (cbow100) |
|--------------------------|--------------------|--------------------|
| Analogy accuracy         | 60.0%              | 0.0%               |
| Avg. semantic similarity | 0.568              | 0.582              |
| OOV handling             | N/A                | Good (subword)     |
| Vocabulary coverage      | 100%               | 100%               |
| Training time            | 1760 s             | 3323 s             |

**Why Word2Vec?** It produces cleaner nearest neighbours (actual words without punctuation artifacts) and captures semantic relationships more accurately. FastText, while slightly better on raw similarity, tends to return noisy forms with attached punctuation.
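FastText's OOV handling comes from character n-grams: each word is decomposed into n-grams delimited by `<` and `>` boundary markers, and an unseen word's vector is built by averaging the vectors of its n-grams. A sketch of the decomposition (FastText's defaults are n = 3..6; gensim applies this automatically when you index `model.wv` with an unseen word):

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams with FastText-style boundary markers."""
    w = f"<{word}>"
    return [w[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(w) - n + 1)]

# An unseen inflected form shares most n-grams with seen forms of the
# same stem, so its composed vector lands near them.
print(char_ngrams("татарча", 3, 4))
```

This same mechanism explains the noisy neighbours noted above: a token with attached punctuation still shares most of its n-grams with the clean form, so the two end up close together in the vector space.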

For a detailed report, see the [model comparison results](model_comparison_report.md) included in the repository.

## 📝 License

All models are released under the **MIT License**. You are free to use, modify, and distribute them for any purpose, with proper attribution.

## 📜 Certificate

This software (Tatar2Vec) is registered with the Federal Service for Intellectual Property (Rospatent) under the following certificate:

- **Certificate number**: 2026610619
- **Title**: Tatar2Vec
- **Filing date**: December 23, 2025
- **Publication date**: January 14, 2026
- **Author**: Mullosharaf K. Arabov
- **Applicant**: Kazan Federal University

*Certificate of state registration of a computer program No. 2026610619, Russian Federation. Tatar2Vec: filed 23.12.2025, published 14.01.2026 / M. K. Arabov; applicant: Federal State Autonomous Educational Institution of Higher Education "Kazan Federal University".*

## 🤝 Citation

If you use these models in your research, please cite the software registration:

```bibtex
@software{tatar2vec_2026,
  title     = {Tatar2Vec},
  author    = {Arabov, Mullosharaf Kurbonovich},
  year      = {2026},
  publisher = {Kazan Federal University},
  note      = {Registered software, Certificate No. 2026610619},
  url       = {https://huggingface.co/TatarNLPWorld/Tatar2Vec}
}
```

## 🌐 Language

The models are trained on Tatar text and are intended for use with the Tatar language (language code `tt`).

## 🙌 Acknowledgements

These models were trained by [TatarNLPWorld](https://huggingface.co/TatarNLPWorld) as part of an effort to advance NLP resources for the Tatar language. We thank the open-source community for the tools and libraries that made this work possible.