# Tatar2Vec - Tatar Word Embeddings

High-quality pretrained word embeddings for the Tatar language.

This repository contains the best-performing FastText and Word2Vec models for Tatar, selected through a comprehensive evaluation of 57 model configurations. These embeddings significantly outperform Meta's pretrained models and are well suited to NLP tasks in Tatar.
## 🏆 Model Performance

### Performance Comparison
| Model | Type | Composite Score | Semantic Mean | Vocabulary |
|---|---|---|---|---|
| `ft_dim100_win5_min5_ngram3-6_sg.epoch1` | FastText | 0.7019 | 0.7368 | 637.7K |
| `ft_dim100_win5_min5_ngram3-6_sg.epoch3` | FastText | 0.6675 | 0.6894 | 637.7K |
| `w2v_dim200_win5_min5_sg.epoch4` | Word2Vec | 0.5685 | 0.4445 | 637.7K |
| `w2v_dim100_win5_min5_sg` | Word2Vec | 0.5566 | 0.5187 | 637.7K |
| `cc.tt.300.bin` | Meta | 0.2894 | - | 945K |
| `cc.tt.300` | Meta | 0.2000 | - | 945K |
| `cc.tt.300.vec` | Meta | 0.1339 | - | 945K |
### Key Findings

- FastText models outperform Word2Vec models in composite score
- Skip-gram beats CBOW for both model families
- 100-dimensional models perform better than 200- and 300-dimensional ones
- Our best model scores 0.7019, vs. 0.2000 for Meta's `cc.tt.300`
## 📊 Model Details

### FastText Models

FastText models with subword (character n-gram) support.
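The subword support is what lets these models embed words never seen in training: each word is split into character n-grams (3-6 characters here, per the `ngram3-6` in the model names), and the word's vector is assembled from the n-gram vectors. A minimal pure-Python sketch of the n-gram extraction, with FastText-style `<` `>` boundary markers (illustrative only, not the library's internals):

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Character n-grams of a word, with FastText-style boundary markers."""
    marked = f"<{word}>"  # markers distinguish prefixes/suffixes from word-internal n-grams
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(marked) - n + 1):
            grams.append(marked[i:i + n])
    return grams

# An unseen word still shares many n-grams with known words,
# which is why FastText can assemble a vector for it.
print(char_ngrams("мәктәп"))  # first gram: '<мә'
```

Because two morphologically related Tatar words share most of their n-grams, their composed vectors end up close, which helps in a highly agglutinative language.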
#### ft_dim100_win5_min5_ngram3-6_sg.epoch1

**Composite Score:** 0.7019 | **Dimensions:** 100 | **Vocabulary:** 637.7K

- Semantic Similarity: 0.7368
- Analogy Accuracy: 0.0476
- OOV Handling: 1.0000
- Neighbor Coherence: 0.9588

Top-performing FastText model with skip-gram architecture.
#### ft_dim100_win5_min5_ngram3-6_sg.epoch3

**Composite Score:** 0.6675 | **Dimensions:** 100 | **Vocabulary:** 637.7K

- Semantic Similarity: 0.6894
- Analogy Accuracy: 0.0476
- OOV Handling: 1.0000
- Neighbor Coherence: 0.9388

High-quality FastText alternative with skip-gram.
### Word2Vec Models

Classical Word2Vec embeddings optimized for Tatar.
#### w2v_dim200_win5_min5_sg.epoch4

**Composite Score:** 0.5685 | **Dimensions:** 200 | **Vocabulary:** 637.7K

- Semantic Similarity: 0.4445
- Analogy Accuracy: 0.3214
- OOV Handling: 0.3854
- Neighbor Coherence: 0.7307

Best-performing Word2Vec model, with 200 dimensions.
#### w2v_dim100_win5_min5_sg

**Composite Score:** 0.5566 | **Dimensions:** 100 | **Vocabulary:** 637.7K

- Semantic Similarity: 0.5187
- Analogy Accuracy: 0.2500
- OOV Handling: 0.3854
- Neighbor Coherence: 0.8051

Compact, efficient Word2Vec model.
## 📚 Training Corpus
- Total Tokens: 207.02M
- Unique Words: 2.1M
- Vocabulary: 637.7K words
- Models Analyzed: 57
### Corpus Domains
| Domain | Documents |
|---|---|
| belgech.ru | 46 |
| intertat.tatar | 19.5K |
| matbugat.ru | 44.9K |
| azatliq.org | 8.1K |
| tatar-inform.tatar | 1.5K |
| mamadysh-rt | 1.2K |
| vk.com | 6.5K |
| shahrikazan.ru | 2.4K |
| vatantat.ru | 119 |
| Wikipedia | 456.1K |
| Books | 876 |
## 🚀 Quick Start

### Installation

```bash
pip install gensim huggingface_hub
```
### Load FastText Model (Recommended)

```python
from huggingface_hub import snapshot_download
from gensim.models import FastText
import os

# Download and load the best model
model_dir = snapshot_download(
    repo_id="arabovs-ai-lab/Tatar2Vec",
    allow_patterns="fasttext/ft_dim100_win5_min5_ngram3-6_sg.epoch1/*"
)
model_path = os.path.join(
    model_dir,
    "fasttext/ft_dim100_win5_min5_ngram3-6_sg.epoch1/ft_dim100_win5_min5_ngram3-6_sg.epoch1.model"
)
model = FastText.load(model_path)

# Find similar words
similar_words = model.wv.most_similar('мәктәп', topn=5)  # school
print(similar_words)
```
### Load Word2Vec Model

```python
from huggingface_hub import snapshot_download
from gensim.models import Word2Vec
import os

# Download and load the Word2Vec model
model_dir = snapshot_download(
    repo_id="arabovs-ai-lab/Tatar2Vec",
    allow_patterns="word2vec/w2v_dim200_win5_min5_sg.epoch4/*"
)
model_path = os.path.join(
    model_dir,
    "word2vec/w2v_dim200_win5_min5_sg.epoch4/w2v_dim200_win5_min5_sg.epoch4.model"
)
model = Word2Vec.load(model_path)

# Find similar words
similar_words = model.wv.most_similar('китап', topn=5)  # book
print(similar_words)
```
## 💡 Usage Examples
### Education & Science

```python
# Education-related words
education_words = model.wv.most_similar('укыту', topn=5)  # teaching
science_words = model.wv.most_similar('фән', topn=5)      # science
student_words = model.wv.most_similar('студент', topn=5)  # student
print("Education:", education_words)
print("Science:", science_words)
print("Student:", student_words)
```
### Nature & Environment

```python
# Nature-related words
nature_words = model.wv.most_similar('табигать', topn=5)  # nature
water_words = model.wv.most_similar('су', topn=5)         # water
tree_words = model.wv.most_similar('агач', topn=5)        # tree
print("Nature:", nature_words)
print("Water:", water_words)
print("Tree:", tree_words)
```
### Culture & Arts

```python
# Culture and arts
culture_words = model.wv.most_similar('мәдәният', topn=5)  # culture
music_words = model.wv.most_similar('музыка', topn=5)      # music
art_words = model.wv.most_similar('сәнгать', topn=5)       # art
print("Culture:", culture_words)
print("Music:", music_words)
print("Art:", art_words)
```
### Daily Life

```python
# Everyday words
family_words = model.wv.most_similar('гаилә', topn=5)  # family
work_words = model.wv.most_similar('эш', topn=5)       # work
home_words = model.wv.most_similar('өй', topn=5)       # home
print("Family:", family_words)
print("Work:", work_words)
print("Home:", home_words)
```
### Food & Cuisine

```python
# Food-related words
food_words = model.wv.most_similar('ашамлык', topn=5)  # food
bread_words = model.wv.most_similar('икмәк', topn=5)   # bread
meal_words = model.wv.most_similar('аш', topn=5)       # meal
print("Food:", food_words)
print("Bread:", bread_words)
print("Meal:", meal_words)
```
### Technology

```python
# Technology words
tech_words = model.wv.most_similar('компьютер', topn=5)     # computer
internet_words = model.wv.most_similar('интернет', topn=5)  # internet
phone_words = model.wv.most_similar('телефон', topn=5)      # phone
print("Computer:", tech_words)
print("Internet:", internet_words)
print("Phone:", phone_words)
```
### Emotions & Feelings

```python
# Emotional words
happy_words = model.wv.most_similar('бәхетле', topn=5)   # happy
sad_words = model.wv.most_similar('кайгы', topn=5)       # sadness
love_words = model.wv.most_similar('мәхәббәт', topn=5)   # love
print("Happy:", happy_words)
print("Sadness:", sad_words)
print("Love:", love_words)
```
### Word Analogies

```python
# Word analogies (like king - man + woman = queen)

# doctor - man + woman = ?
analogy = model.wv.most_similar(
    positive=['табиб', 'хатын'],  # doctor, woman
    negative=['ир'],              # man
    topn=3
)
print("Doctor - man + woman =", analogy)

# Paris - France + Germany = Berlin?
analogy = model.wv.most_similar(
    positive=['Париж', 'Алмания'],  # Paris, Germany
    negative=['Франция'],           # France
    topn=3
)
print("Paris - France + Germany =", analogy)
```
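Under the hood, `most_similar` with `positive`/`negative` lists adds and subtracts the word vectors, then ranks candidates by cosine similarity to the result. A toy numpy sketch of that arithmetic (the 2-D vectors are hand-picked for illustration, not taken from the models):

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hand-picked toy vectors; real embeddings are 100-300 dimensional and learned.
vecs = {
    "король": np.array([1.0, 1.0]),    # king
    "ир": np.array([1.0, 0.0]),        # man
    "хатын": np.array([0.0, 1.0]),     # woman
    "королева": np.array([0.1, 2.0]),  # queen
    "китап": np.array([0.9, -0.5]),    # book (distractor)
}

# king - man + woman ≈ queen: subtract the "man" direction, add the "woman" one
target = vecs["король"] - vecs["ир"] + vecs["хатын"]
candidates = [w for w in vecs if w not in ("король", "ир", "хатын")]
best = max(candidates, key=lambda w: cosine(vecs[w], target))
print(best)  # королева
```

The query words themselves are excluded from the candidates, mirroring what gensim does, since they are trivially closest to the target vector.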
### FastText Out-of-Vocabulary (OOV) Handling

```python
# FastText can build vectors for words not seen during training
# by composing them from character n-grams.
oov_words = ['технологияләштерү', 'цифрлаштыру', 'виртуальлаштыру']
for word in oov_words:
    try:
        vector = model.wv[word]
        similar = model.wv.most_similar(word, topn=3)
        print(f"OOV word '{word}': {similar}")
    except Exception as e:
        print(f"Couldn't process '{word}': {e}")
```
### Universal Loader for All Models

```python
from huggingface_hub import snapshot_download
from gensim.models import FastText, Word2Vec
import os

def load_tatar2vec_model(model_name, model_type="fasttext"):
    """Load any Tatar2Vec model by name and type."""
    model_dir = snapshot_download(
        repo_id="arabovs-ai-lab/Tatar2Vec",
        allow_patterns=f"{model_type}/{model_name}/*"
    )
    model_path = os.path.join(model_dir, model_type, model_name, f"{model_name}.model")
    if model_type == "fasttext":
        return FastText.load(model_path)
    return Word2Vec.load(model_path)

# Test different models with various topics
test_words = ['тел', 'мәктәп', 'китап', 'фән', 'табигать']
models_to_test = [
    ("ft_dim100_win5_min5_ngram3-6_sg.epoch1", "fasttext", "🥇 Best FastText"),
    ("ft_dim100_win5_min5_ngram3-6_sg.epoch3", "fasttext", "🥈 Alternative FastText"),
    ("w2v_dim200_win5_min5_sg.epoch4", "word2vec", "🥇 Best Word2Vec"),
    ("w2v_dim100_win5_min5_sg", "word2vec", "🥈 Compact Word2Vec"),
]

for model_name, model_type, description in models_to_test:
    print(f"\n{description}: {model_name}")
    print("=" * 50)
    model = load_tatar2vec_model(model_name, model_type)
    for word in test_words:
        try:
            similar = model.wv.most_similar(word, topn=2)
            print(f"  '{word}': {[f'{w}({s:.3f})' for w, s in similar]}")
        except KeyError:
            print(f"  '{word}': not in vocabulary")
```
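Beyond single words, a common baseline for sentence- or document-level features is to average the word vectors of the tokens. A hypothetical sketch, with a toy dict standing in for `model.wv` (which supports the same `in` and `[]` lookups):

```python
import numpy as np

def sentence_vector(tokens, lookup, dim):
    """Mean of the available word vectors; a zero vector if none are found."""
    found = [lookup[t] for t in tokens if t in lookup]
    if not found:
        return np.zeros(dim)
    return np.mean(found, axis=0)

# Toy stand-in vocabulary; with a real model, pass model.wv instead.
toy = {
    "мин": np.array([1.0, 0.0]),    # I
    "китап": np.array([0.0, 1.0]),  # book
}
vec = sentence_vector(["мин", "китап", "укыйм"], toy, dim=2)
print(vec)  # average of the two in-vocabulary words
```

With a FastText model the `in` check could be dropped, since `model.wv[token]` also returns a composed vector for out-of-vocabulary tokens.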
## 🎯 Model Performance Summary

| Model | Type | Best For | Example Use Case |
|---|---|---|---|
| `ft_dim100_win5_min5_ngram3-6_sg.epoch1` | FastText | Overall performance, OOV words | `model.wv.most_similar('мәктәп')` |
| `ft_dim100_win5_min5_ngram3-6_sg.epoch3` | FastText | Alternative with strong results | `model.wv.most_similar('фән')` |
| `w2v_dim200_win5_min5_sg.epoch4` | Word2Vec | Semantic relationships | `model.wv.most_similar('тел')` |
| `w2v_dim100_win5_min5_sg` | Word2Vec | Lightweight applications | `model.wv.most_similar('китап')` |
These examples showcase the models' understanding of various domains in the Tatar language! 🌟
## 📜 Citation

```bibtex
@misc{Tatar2Vec_20251109,
  title = {Tatar2Vec: Tatar Word Embeddings},
  author = {Arabovs AI Lab},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/arabovs-ai-lab/Tatar2Vec}
}
```
## 📄 License
MIT License
Last updated: 2025-11-09
Best score: 0.7019
Corpus: 207.02M tokens