Tatar2Vec - Tatar Word Embeddings

High-quality pretrained word embeddings for the Tatar language

This repository contains the best-performing FastText and Word2Vec models for Tatar, selected through a comprehensive evaluation of 57 model configurations. These embeddings significantly outperform Meta's pretrained fastText models for Tatar and are well suited to Tatar NLP tasks.

🏆 Model Performance

Performance Comparison

| Model | Type | Composite Score | Semantic Mean | Vocabulary |
|-------|------|-----------------|---------------|------------|
| ft_dim100_win5_min5_ngram3-6_sg.epoch1 | FastText | 0.7019 | 0.7368 | 637.7K |
| ft_dim100_win5_min5_ngram3-6_sg.epoch3 | FastText | 0.6675 | 0.6894 | 637.7K |
| w2v_dim200_win5_min5_sg.epoch4 | Word2Vec | 0.5685 | 0.4445 | 637.7K |
| w2v_dim100_win5_min5_sg | Word2Vec | 0.5566 | 0.5187 | 637.7K |
| cc.tt.300.bin | Meta | 0.2894 | - | 945K |
| cc.tt.300 | Meta | 0.2000 | - | 945K |
| cc.tt.300.vec | Meta | 0.1339 | - | 945K |

Key Findings

  • FastText models outperform Word2Vec models on composite score
  • Skip-gram beats CBOW for both FastText and Word2Vec
  • 100-dimensional models outperform their 200- and 300-dimensional counterparts
  • Our best model scores 0.7019, versus 0.2000 for Meta's cc.tt.300

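The card reports four evaluation metrics per model (semantic similarity, analogy accuracy, OOV handling, neighbor coherence) alongside a single composite score, but does not document how they are combined. As a purely hypothetical sketch, a composite could be a weighted mean of the metrics; equal weights are assumed below for illustration and do not reproduce the reported 0.7019.

```python
# Hypothetical sketch of combining the four reported metrics into one
# composite score. The actual weighting behind this card's numbers is
# not documented; equal weights here are an assumption.
def composite_score(metrics, weights=None):
    """Weighted mean of metric values (equal weights by default)."""
    if weights is None:
        weights = {name: 1.0 for name in metrics}
    total = sum(weights.values())
    return sum(metrics[name] * weights[name] for name in metrics) / total

best_ft = {
    "semantic_similarity": 0.7368,
    "analogy_accuracy": 0.0476,
    "oov_handling": 1.0000,
    "neighbor_coherence": 0.9588,
}
print(round(composite_score(best_ft), 4))  # equal-weight mean; differs from the card's 0.7019
```

The gap between the equal-weight mean and the reported composite suggests the actual evaluation weights the metrics unevenly.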
📊 Model Details

FastText Models

FastText models with character n-gram (subword) support.

ft_dim100_win5_min5_ngram3-6_sg.epoch1

Composite Score: 0.7019 | Dimensions: 100 | Vocabulary: 637.7K

  • Semantic Similarity: 0.7368
  • Analogy Accuracy: 0.0476
  • OOV Handling: 1.0000
  • Neighbor Coherence: 0.9588

Top-performing FastText model with skip-gram architecture

ft_dim100_win5_min5_ngram3-6_sg.epoch3

Composite Score: 0.6675 | Dimensions: 100 | Vocabulary: 637.7K

  • Semantic Similarity: 0.6894
  • Analogy Accuracy: 0.0476
  • OOV Handling: 1.0000
  • Neighbor Coherence: 0.9388

High-quality FastText alternative with skip-gram

Word2Vec Models

Classical Word2Vec embeddings optimized for Tatar.

w2v_dim200_win5_min5_sg.epoch4

Composite Score: 0.5685 | Dimensions: 200 | Vocabulary: 637.7K

  • Semantic Similarity: 0.4445
  • Analogy Accuracy: 0.3214
  • OOV Handling: 0.3854
  • Neighbor Coherence: 0.7307

Best performing Word2Vec model with 200 dimensions

w2v_dim100_win5_min5_sg

Composite Score: 0.5566 | Dimensions: 100 | Vocabulary: 637.7K

  • Semantic Similarity: 0.5187
  • Analogy Accuracy: 0.2500
  • OOV Handling: 0.3854
  • Neighbor Coherence: 0.8051

Compact and efficient Word2Vec model

📚 Training Corpus

  • Total Tokens: 207.02M
  • Unique Words: 2.1M
  • Vocabulary: 637.7K words
  • Models Analyzed: 57

Corpus Domains

| Domain | Documents |
|--------|-----------|
| belgech.ru | 46 |
| intertat.tatar | 19.5K |
| matbugat.ru | 44.9K |
| azatliq.org | 8.1K |
| tatar-inform.tatar | 1.5K |
| mamadysh-rt | 1.2K |
| vk.com | 6.5K |
| shahrikazan.ru | 2.4K |
| vatantat.ru | 119 |
| Wikipedia | 456.1K |
| Books | 876 |

🚀 Quick Start

Installation

pip install gensim huggingface_hub

Load FastText Model (Recommended)

from huggingface_hub import snapshot_download
from gensim.models import FastText
import os

# Download and load the best model
model_dir = snapshot_download(
    repo_id="arabovs-ai-lab/Tatar2Vec",
    allow_patterns="fasttext/ft_dim100_win5_min5_ngram3-6_sg.epoch1/*"
)

model_path = os.path.join(model_dir, "fasttext/ft_dim100_win5_min5_ngram3-6_sg.epoch1/ft_dim100_win5_min5_ngram3-6_sg.epoch1.model")
model = FastText.load(model_path)

# Find similar words
similar_words = model.wv.most_similar('мәктәп', topn=5)  # school
print(similar_words)

Load Word2Vec Model

from huggingface_hub import snapshot_download
from gensim.models import Word2Vec
import os

# Download and load Word2Vec model
model_dir = snapshot_download(
    repo_id="arabovs-ai-lab/Tatar2Vec",
    allow_patterns="word2vec/w2v_dim200_win5_min5_sg.epoch4/*"
)

model_path = os.path.join(model_dir, "word2vec/w2v_dim200_win5_min5_sg.epoch4/w2v_dim200_win5_min5_sg.epoch4.model")
model = Word2Vec.load(model_path)

# Find similar words
similar_words = model.wv.most_similar('китап', topn=5)  # book
print(similar_words)

💡 Usage Examples

Education & Science

# Education related words
education_words = model.wv.most_similar('укыту', topn=5)  # teaching
science_words = model.wv.most_similar('фән', topn=5)      # science
student_words = model.wv.most_similar('студент', topn=5)  # student

print("Education:", education_words)
print("Science:", science_words)
print("Student:", student_words)

Nature & Environment

# Nature related words
nature_words = model.wv.most_similar('табигать', topn=5)  # nature
water_words = model.wv.most_similar('су', topn=5)         # water
tree_words = model.wv.most_similar('агач', topn=5)        # tree

print("Nature:", nature_words)
print("Water:", water_words)
print("Tree:", tree_words)

Culture & Arts

# Culture and arts
culture_words = model.wv.most_similar('мәдәният', topn=5)  # culture
music_words = model.wv.most_similar('музыка', topn=5)      # music
art_words = model.wv.most_similar('сәнгать', topn=5)       # art

print("Culture:", culture_words)
print("Music:", music_words)
print("Art:", art_words)

Daily Life

# Everyday words
family_words = model.wv.most_similar('гаилә', topn=5)    # family
work_words = model.wv.most_similar('эш', topn=5)         # work
home_words = model.wv.most_similar('өй', topn=5)         # home

print("Family:", family_words)
print("Work:", work_words)
print("Home:", home_words)

Food & Cuisine

# Food related words
food_words = model.wv.most_similar('ашамлык', topn=5)    # food
bread_words = model.wv.most_similar('икмәк', topn=5)     # bread
meal_words = model.wv.most_similar('аш', topn=5)         # meal

print("Food:", food_words)
print("Bread:", bread_words)
print("Meal:", meal_words)

Technology

# Technology words
tech_words = model.wv.most_similar('компьютер', topn=5)  # computer
internet_words = model.wv.most_similar('интернет', topn=5) # internet
phone_words = model.wv.most_similar('телефон', topn=5)   # phone

print("Computer:", tech_words)
print("Internet:", internet_words)
print("Phone:", phone_words)

Emotions & Feelings

# Emotional words
happy_words = model.wv.most_similar('бәхетле', topn=5)   # happy
sad_words = model.wv.most_similar('кайгы', topn=5)       # sadness
love_words = model.wv.most_similar('мәхәббәт', topn=5)   # love

print("Happy:", happy_words)
print("Sadness:", sad_words)
print("Love:", love_words)

Word Analogies

# Word analogies (like king - man + woman = queen)
# doctor - man + woman = ?
analogy = model.wv.most_similar(
    positive=['табиб', 'хатын'],  # doctor, woman
    negative=['ир'],              # man
    topn=3
)
print("Doctor - man + woman =", analogy)

# Paris - France + Germany = Berlin?
analogy = model.wv.most_similar(
    positive=['Париж', 'Алмания'],  # Paris, Germany
    negative=['Франция'],           # France
    topn=3
)
print("Paris - France + Germany =", analogy)
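
The arithmetic behind `most_similar(positive=..., negative=...)` can be made explicit: gensim ranks candidates by cosine similarity to the sum of the unit-normalized positive vectors minus the negative ones (the 3CosAdd method). A minimal sketch with toy 3-dimensional vectors, not real embeddings:

```python
import numpy as np

# Sketch of the vector-offset arithmetic behind analogy queries.
# Toy 3-d vectors stand in for real embeddings; with a loaded model
# the lookup would be model.wv[word].
def normalize(v):
    return v / np.linalg.norm(v)

vocab = {
    "париж":   np.array([1.0, 0.9, 0.1]),
    "франция": np.array([1.0, 0.1, 0.1]),
    "алмания": np.array([0.1, 0.1, 1.0]),
    "берлин":  np.array([0.2, 0.9, 1.0]),
    "су":      np.array([0.5, 0.0, 0.2]),
}

# Paris - France + Germany, using unit-normalized vectors
target = normalize(vocab["париж"]) - normalize(vocab["франция"]) + normalize(vocab["алмания"])

# Rank the remaining words by cosine similarity to the offset vector
scores = {
    w: float(normalize(v) @ normalize(target))
    for w, v in vocab.items()
    if w not in ("париж", "франция", "алмания")
}
best = max(scores, key=scores.get)
print(best)  # берлин
```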

FastText Out-of-Vocabulary (OOV) Handling

# FastText can handle words not in the training vocabulary
# FastText composes vectors for words absent from the training vocabulary
# from their character n-grams, so lookups on unseen words still work
oov_words = ['технологияләштерү', 'цифрлаштыру', 'виртуальлаштыру']
for word in oov_words:
    in_vocab = word in model.wv.key_to_index
    similar = model.wv.most_similar(word, topn=3)
    print(f"'{word}' (in vocabulary: {in_vocab}): {similar}")

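The `ngram3-6` part of the model names refers to the fastText subword scheme: each word is decomposed into character n-grams of length 3 to 6, with `<` and `>` marking word boundaries, and an OOV word's vector is composed from the vectors of those n-grams. A minimal sketch of the decomposition step:

```python
# Sketch of fastText-style character n-gram extraction (the "ngram3-6"
# setting in the model names). '<' and '>' mark word boundaries, so
# prefixes and suffixes get distinct n-grams.
def char_ngrams(word, min_n=3, max_n=6):
    padded = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(padded) - n + 1):
            grams.append(padded[i:i + n])
    return grams

print(char_ngrams("тел"))  # ['<те', 'тел', 'ел>', '<тел', 'тел>', '<тел>']
```

This is why agglutinative Tatar suffixes are handled well: an unseen inflected form shares most of its n-grams with the forms seen in training.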
Universal Loader for All Models

from huggingface_hub import snapshot_download
from gensim.models import FastText, Word2Vec
import os

def load_tatar2vec_model(model_name, model_type="fasttext"):
    """Load any Tatar2Vec model easily"""
    model_dir = snapshot_download(
        repo_id="arabovs-ai-lab/Tatar2Vec",
        allow_patterns=f"{model_type}/{model_name}/*"
    )
    
    model_path = os.path.join(model_dir, model_type, model_name, f"{model_name}.model")
    
    if model_type == "fasttext":
        return FastText.load(model_path)
    else:
        return Word2Vec.load(model_path)

# Test different models with various topics
test_words = ['тел', 'мәктәп', 'китап', 'фән', 'табигать']

models_to_test = [
    ("ft_dim100_win5_min5_ngram3-6_sg.epoch1", "fasttext", "🥇 Best FastText"),
    ("ft_dim100_win5_min5_ngram3-6_sg.epoch3", "fasttext", "🥈 Alternative FastText"),
    ("w2v_dim200_win5_min5_sg.epoch4", "word2vec", "🥇 Best Word2Vec"),
    ("w2v_dim100_win5_min5_sg", "word2vec", "🥈 Compact Word2Vec")
]

for model_name, model_type, description in models_to_test:
    print(f"\n{description}: {model_name}")
    print("=" * 50)
    
    model = load_tatar2vec_model(model_name, model_type)
    
    for word in test_words:
        try:
            similar = model.wv.most_similar(word, topn=2)
            print(f"  '{word}': {[f'{w}({s:.3f})' for w, s in similar]}")
        except KeyError:
            print(f"  '{word}': not in vocabulary")
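
Beyond word-level queries, a common baseline for downstream tasks is a sentence vector built by mean-pooling word vectors. A sketch with toy 4-dimensional vectors standing in for model lookups (with a loaded model, the lookup would be `model.wv[token]`):

```python
import numpy as np

# Sketch: bag-of-words sentence embedding by mean-pooling word vectors,
# a simple baseline for classification or retrieval. The toy vectors
# below are illustrative, not real Tatar2Vec embeddings.
toy_wv = {
    "мин": np.array([0.1, 0.2, 0.3, 0.4]),
    "китап": np.array([0.5, 0.1, 0.0, 0.2]),
    "укыйм": np.array([0.2, 0.4, 0.1, 0.0]),
}

def sentence_vector(tokens, wv, dim=4):
    """Mean of the vectors of known tokens; zero vector if none are known."""
    vectors = [wv[t] for t in tokens if t in wv]
    return np.mean(vectors, axis=0) if vectors else np.zeros(dim)

vec = sentence_vector("мин китап укыйм".split(), toy_wv)
print(vec.shape)  # (4,)
```

With the FastText models, the `t in wv` check can be dropped, since subword composition yields a vector for every token.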

🎯 Model Performance Summary

| Model | Type | Best For | Example Use Case |
|-------|------|----------|------------------|
| ft_dim100_win5_min5_ngram3-6_sg.epoch1 | FastText | Overall performance, OOV words | model.wv.most_similar('мәктәп') |
| ft_dim100_win5_min5_ngram3-6_sg.epoch3 | FastText | Alternative with great results | model.wv.most_similar('фән') |
| w2v_dim200_win5_min5_sg.epoch4 | Word2Vec | Semantic relationships | model.wv.most_similar('тел') |
| w2v_dim100_win5_min5_sg | Word2Vec | Lightweight applications | model.wv.most_similar('китап') |

These examples showcase the models' understanding of various domains in the Tatar language! 🌟

📜 Citation

@misc{Tatar2Vec_20251109,
  title = {Tatar2Vec: Tatar Word Embeddings},
  author = {Arabovs AI Lab},
  year = 2025,
  publisher = {Hugging Face},
  url = {https://huggingface.co/arabovs-ai-lab/Tatar2Vec}
}

📄 License

MIT License


Last updated: 2025-11-09
Best score: 0.7019
Corpus: 207.02M tokens
