Tatar2Vec - Tatar Word Embeddings

High-quality pretrained word embeddings for the Tatar language

This repository contains the best-performing FastText and Word2Vec models for Tatar, selected through a comprehensive evaluation of 57 model configurations. These embeddings significantly outperform Meta's pretrained fastText models for Tatar and are well suited to Tatar NLP tasks.

🏆 Model Performance

Performance Comparison

| Model | Type | Composite Score | Semantic Mean | Vocabulary |
|-------|------|-----------------|---------------|------------|
| ft_dim100_win5_min5_ngram3-6_sg.epoch1 | FastText | 0.7019 | 0.7368 | 637.7K |
| ft_dim100_win5_min5_ngram3-6_sg.epoch3 | FastText | 0.6675 | 0.6894 | 637.7K |
| w2v_dim200_win5_min5_sg.epoch4 | Word2Vec | 0.5685 | 0.4445 | 637.7K |
| w2v_dim100_win5_min5_sg | Word2Vec | 0.5566 | 0.5187 | 637.7K |
| cc.tt.300.bin | Meta | 0.2894 | - | 945K |
| cc.tt.300 | Meta | 0.2000 | - | 945K |
| cc.tt.300.vec | Meta | 0.1339 | - | 945K |

Key Findings

  • FastText models outperform Word2Vec models on composite score
  • Skip-gram beats CBOW for both FastText and Word2Vec
  • 100-dimensional models outperform their 200- and 300-dimensional counterparts
  • Our best model scores 0.7019, versus 0.2000 for Meta's cc.tt.300

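The card reports four evaluation metrics per model (semantic similarity, analogy accuracy, OOV handling, neighbor coherence) alongside a single composite score, but does not document how they are combined. As a purely hypothetical sketch, a composite could be a weighted mean of the metrics; equal weights are assumed below for illustration and do not reproduce the reported 0.7019.

```python
# Hypothetical sketch of combining the four reported metrics into one
# composite score. The actual weighting behind this card's numbers is
# not documented; equal weights here are an assumption.
def composite_score(metrics, weights=None):
    """Weighted mean of metric values (equal weights by default)."""
    if weights is None:
        weights = {name: 1.0 for name in metrics}
    total = sum(weights.values())
    return sum(metrics[name] * weights[name] for name in metrics) / total

best_ft = {
    "semantic_similarity": 0.7368,
    "analogy_accuracy": 0.0476,
    "oov_handling": 1.0000,
    "neighbor_coherence": 0.9588,
}
print(round(composite_score(best_ft), 4))  # equal-weight mean; differs from the card's 0.7019
```

The gap between the equal-weight mean and the reported composite suggests the actual evaluation weights the metrics unevenly.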
📊 Model Details

FastText Models

FastText models with character n-gram (subword) support.

ft_dim100_win5_min5_ngram3-6_sg.epoch1

Composite Score: 0.7019 | Dimensions: 100 | Vocabulary: 637.7K

  • Semantic Similarity: 0.7368
  • Analogy Accuracy: 0.0476
  • OOV Handling: 1.0000
  • Neighbor Coherence: 0.9588

Top-performing FastText model with skip-gram architecture

ft_dim100_win5_min5_ngram3-6_sg.epoch3

Composite Score: 0.6675 | Dimensions: 100 | Vocabulary: 637.7K

  • Semantic Similarity: 0.6894
  • Analogy Accuracy: 0.0476
  • OOV Handling: 1.0000
  • Neighbor Coherence: 0.9388

High-quality FastText alternative with skip-gram

Word2Vec Models

Classical Word2Vec embeddings optimized for Tatar.

w2v_dim200_win5_min5_sg.epoch4

Composite Score: 0.5685 | Dimensions: 200 | Vocabulary: 637.7K

  • Semantic Similarity: 0.4445
  • Analogy Accuracy: 0.3214
  • OOV Handling: 0.3854
  • Neighbor Coherence: 0.7307

Best performing Word2Vec model with 200 dimensions

w2v_dim100_win5_min5_sg

Composite Score: 0.5566 | Dimensions: 100 | Vocabulary: 637.7K

  • Semantic Similarity: 0.5187
  • Analogy Accuracy: 0.2500
  • OOV Handling: 0.3854
  • Neighbor Coherence: 0.8051

Compact and efficient Word2Vec model

📚 Training Corpus

  • Total Tokens: 207.02M
  • Unique Words: 2.1M
  • Vocabulary: 637.7K words
  • Models Analyzed: 57

Corpus Domains

| Domain | Documents |
|--------|-----------|
| belgech.ru | 46 |
| intertat.tatar | 19.5K |
| matbugat.ru | 44.9K |
| azatliq.org | 8.1K |
| tatar-inform.tatar | 1.5K |
| mamadysh-rt | 1.2K |
| vk.com | 6.5K |
| shahrikazan.ru | 2.4K |
| vatantat.ru | 119 |
| Wikipedia | 456.1K |
| Books | 876 |

🚀 Quick Start

Installation

pip install gensim huggingface_hub

Load FastText Model (Recommended)

from huggingface_hub import snapshot_download
from gensim.models import FastText
import os

# Download and load the best model
model_dir = snapshot_download(
    repo_id="arabovs-ai-lab/Tatar2Vec",
    allow_patterns="fasttext/ft_dim100_win5_min5_ngram3-6_sg.epoch1/*"
)

model_path = os.path.join(model_dir, "fasttext/ft_dim100_win5_min5_ngram3-6_sg.epoch1/ft_dim100_win5_min5_ngram3-6_sg.epoch1.model")
model = FastText.load(model_path)

# Find similar words
similar_words = model.wv.most_similar('мәктәп', topn=5)  # school
print(similar_words)

Load Word2Vec Model

from huggingface_hub import snapshot_download
from gensim.models import Word2Vec
import os

# Download and load Word2Vec model
model_dir = snapshot_download(
    repo_id="arabovs-ai-lab/Tatar2Vec",
    allow_patterns="word2vec/w2v_dim200_win5_min5_sg.epoch4/*"
)

model_path = os.path.join(model_dir, "word2vec/w2v_dim200_win5_min5_sg.epoch4/w2v_dim200_win5_min5_sg.epoch4.model")
model = Word2Vec.load(model_path)

# Find similar words
similar_words = model.wv.most_similar('китап', topn=5)  # book
print(similar_words)

💡 Usage Examples

Education & Science

# Education related words
education_words = model.wv.most_similar('укыту', topn=5)  # teaching
science_words = model.wv.most_similar('фән', topn=5)      # science
student_words = model.wv.most_similar('студент', topn=5)  # student

print("Education:", education_words)
print("Science:", science_words)
print("Student:", student_words)

Nature & Environment

# Nature related words
nature_words = model.wv.most_similar('табигать', topn=5)  # nature
water_words = model.wv.most_similar('су', topn=5)         # water
tree_words = model.wv.most_similar('агач', topn=5)        # tree

print("Nature:", nature_words)
print("Water:", water_words)
print("Tree:", tree_words)

Culture & Arts

# Culture and arts
culture_words = model.wv.most_similar('мәдәният', topn=5)  # culture
music_words = model.wv.most_similar('музыка', topn=5)      # music
art_words = model.wv.most_similar('сәнгать', topn=5)       # art

print("Culture:", culture_words)
print("Music:", music_words)
print("Art:", art_words)

Daily Life

# Everyday words
family_words = model.wv.most_similar('гаилә', topn=5)    # family
work_words = model.wv.most_similar('эш', topn=5)         # work
home_words = model.wv.most_similar('өй', topn=5)         # home

print("Family:", family_words)
print("Work:", work_words)
print("Home:", home_words)

Food & Cuisine

# Food related words
food_words = model.wv.most_similar('ашамлык', topn=5)    # food
bread_words = model.wv.most_similar('икмәк', topn=5)     # bread
meal_words = model.wv.most_similar('аш', topn=5)         # meal

print("Food:", food_words)
print("Bread:", bread_words)
print("Meal:", meal_words)

Technology

# Technology words
tech_words = model.wv.most_similar('компьютер', topn=5)  # computer
internet_words = model.wv.most_similar('интернет', topn=5) # internet
phone_words = model.wv.most_similar('телефон', topn=5)   # phone

print("Computer:", tech_words)
print("Internet:", internet_words)
print("Phone:", phone_words)

Emotions & Feelings

# Emotional words
happy_words = model.wv.most_similar('бәхетле', topn=5)   # happy
sad_words = model.wv.most_similar('кайгы', topn=5)       # sadness
love_words = model.wv.most_similar('мәхәббәт', topn=5)   # love

print("Happy:", happy_words)
print("Sadness:", sad_words)
print("Love:", love_words)

Word Analogies

# Word analogies (like king - man + woman = queen)
# doctor - man + woman = ?
analogy = model.wv.most_similar(
    positive=['табиб', 'хатын'],  # doctor, woman
    negative=['ир'],              # man
    topn=3
)
print("Doctor - man + woman =", analogy)

# Paris - France + Germany = Berlin?
analogy = model.wv.most_similar(
    positive=['Париж', 'Алмания'],  # Paris, Germany
    negative=['Франция'],           # France
    topn=3
)
print("Paris - France + Germany =", analogy)
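
The arithmetic behind `most_similar(positive=..., negative=...)` can be made explicit: gensim ranks candidates by cosine similarity to the sum of the unit-normalized positive vectors minus the negative ones (the 3CosAdd method). A minimal sketch with toy 3-dimensional vectors, not real embeddings:

```python
import numpy as np

# Sketch of the vector-offset arithmetic behind analogy queries.
# Toy 3-d vectors stand in for real embeddings; with a loaded model
# the lookup would be model.wv[word].
def normalize(v):
    return v / np.linalg.norm(v)

vocab = {
    "париж":   np.array([1.0, 0.9, 0.1]),
    "франция": np.array([1.0, 0.1, 0.1]),
    "алмания": np.array([0.1, 0.1, 1.0]),
    "берлин":  np.array([0.2, 0.9, 1.0]),
    "су":      np.array([0.5, 0.0, 0.2]),
}

# Paris - France + Germany, using unit-normalized vectors
target = normalize(vocab["париж"]) - normalize(vocab["франция"]) + normalize(vocab["алмания"])

# Rank the remaining words by cosine similarity to the offset vector
scores = {
    w: float(normalize(v) @ normalize(target))
    for w, v in vocab.items()
    if w not in ("париж", "франция", "алмания")
}
best = max(scores, key=scores.get)
print(best)  # берлин
```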

FastText Out-of-Vocabulary (OOV) Handling

# FastText can handle words not in the training vocabulary
# FastText composes vectors for words absent from the training vocabulary
# from their character n-grams, so lookups on unseen words still work
oov_words = ['технологияләштерү', 'цифрлаштыру', 'виртуальлаштыру']
for word in oov_words:
    in_vocab = word in model.wv.key_to_index
    similar = model.wv.most_similar(word, topn=3)
    print(f"'{word}' (in vocabulary: {in_vocab}): {similar}")

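The `ngram3-6` part of the model names refers to the fastText subword scheme: each word is decomposed into character n-grams of length 3 to 6, with `<` and `>` marking word boundaries, and an OOV word's vector is composed from the vectors of those n-grams. A minimal sketch of the decomposition step:

```python
# Sketch of fastText-style character n-gram extraction (the "ngram3-6"
# setting in the model names). '<' and '>' mark word boundaries, so
# prefixes and suffixes get distinct n-grams.
def char_ngrams(word, min_n=3, max_n=6):
    padded = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(padded) - n + 1):
            grams.append(padded[i:i + n])
    return grams

print(char_ngrams("тел"))  # ['<те', 'тел', 'ел>', '<тел', 'тел>', '<тел>']
```

This is why agglutinative Tatar suffixes are handled well: an unseen inflected form shares most of its n-grams with the forms seen in training.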
Universal Loader for All Models

from huggingface_hub import snapshot_download
from gensim.models import FastText, Word2Vec
import os

def load_tatar2vec_model(model_name, model_type="fasttext"):
    """Load any Tatar2Vec model easily"""
    model_dir = snapshot_download(
        repo_id="arabovs-ai-lab/Tatar2Vec",
        allow_patterns=f"{model_type}/{model_name}/*"
    )
    
    model_path = os.path.join(model_dir, model_type, model_name, f"{model_name}.model")
    
    if model_type == "fasttext":
        return FastText.load(model_path)
    else:
        return Word2Vec.load(model_path)

# Test different models with various topics
test_words = ['тел', 'мәктәп', 'китап', 'фән', 'табигать']

models_to_test = [
    ("ft_dim100_win5_min5_ngram3-6_sg.epoch1", "fasttext", "🥇 Best FastText"),
    ("ft_dim100_win5_min5_ngram3-6_sg.epoch3", "fasttext", "🥈 Alternative FastText"),
    ("w2v_dim200_win5_min5_sg.epoch4", "word2vec", "🥇 Best Word2Vec"),
    ("w2v_dim100_win5_min5_sg", "word2vec", "🥈 Compact Word2Vec")
]

for model_name, model_type, description in models_to_test:
    print(f"\n{description}: {model_name}")
    print("=" * 50)
    
    model = load_tatar2vec_model(model_name, model_type)
    
    for word in test_words:
        try:
            similar = model.wv.most_similar(word, topn=2)
            print(f"  '{word}': {[f'{w}({s:.3f})' for w, s in similar]}")
        except KeyError:
            print(f"  '{word}': not in vocabulary")
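
Beyond word-level queries, a common baseline for downstream tasks is a sentence vector built by mean-pooling word vectors. A sketch with toy 4-dimensional vectors standing in for model lookups (with a loaded model, the lookup would be `model.wv[token]`):

```python
import numpy as np

# Sketch: bag-of-words sentence embedding by mean-pooling word vectors,
# a simple baseline for classification or retrieval. The toy vectors
# below are illustrative, not real Tatar2Vec embeddings.
toy_wv = {
    "мин": np.array([0.1, 0.2, 0.3, 0.4]),
    "китап": np.array([0.5, 0.1, 0.0, 0.2]),
    "укыйм": np.array([0.2, 0.4, 0.1, 0.0]),
}

def sentence_vector(tokens, wv, dim=4):
    """Mean of the vectors of known tokens; zero vector if none are known."""
    vectors = [wv[t] for t in tokens if t in wv]
    return np.mean(vectors, axis=0) if vectors else np.zeros(dim)

vec = sentence_vector("мин китап укыйм".split(), toy_wv)
print(vec.shape)  # (4,)
```

With the FastText models, the `t in wv` check can be dropped, since subword composition yields a vector for every token.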

🎯 Model Performance Summary

| Model | Type | Best For | Example Use Case |
|-------|------|----------|------------------|
| ft_dim100_win5_min5_ngram3-6_sg.epoch1 | FastText | Overall performance, OOV words | model.wv.most_similar('мәктәп') |
| ft_dim100_win5_min5_ngram3-6_sg.epoch3 | FastText | Alternative with great results | model.wv.most_similar('фән') |
| w2v_dim200_win5_min5_sg.epoch4 | Word2Vec | Semantic relationships | model.wv.most_similar('тел') |
| w2v_dim100_win5_min5_sg | Word2Vec | Lightweight applications | model.wv.most_similar('китап') |

These examples showcase the models' understanding of various domains in the Tatar language! 🌟

📜 Citation

@misc{Tatar2Vec_20251109,
  title = {Tatar2Vec: Tatar Word Embeddings},
  author = {Arabovs AI Lab},
  year = 2025,
  publisher = {Hugging Face},
  url = {https://huggingface.co/arabovs-ai-lab/Tatar2Vec}
}

📄 License

MIT License


Last updated: 2025-11-09
Best score: 0.7019
Corpus: 207.02M tokens
