# Tatar2Vec Model Comparison Report

**Date:** 2026-03-04
**Author:** Mullosharaf K. Arabov
**Affiliation:** Kazan Federal University

## Executive Summary

This report presents a comprehensive evaluation of five word embedding models trained for the Tatar language. The **Word2Vec CBOW (100-dim)** model emerges as the overall winner, demonstrating superior performance on semantic analogy tasks (60% accuracy) and producing cleaner, more interpretable nearest neighbours than its FastText counterparts.

## 📊 Models Evaluated

| Model | Type | Architecture | Dimensions | Vocabulary |
|-------|------|--------------|------------|------------|
| w2v_cbow_100 | Word2Vec | CBOW | 100 | 1,293,992 |
| w2v_cbow_200 | Word2Vec | CBOW | 200 | 1,293,992 |
| w2v_sg_100 | Word2Vec | Skip-gram | 100 | 1,293,992 |
| ft_cbow_100 | FastText | CBOW | 100 | 1,293,992 |
| ft_cbow_200 | FastText | CBOW | 200 | 1,293,992 |

**Total trained models:** 13 (including intermediate checkpoints)

## 🧪 Evaluation Methodology

### Test 1: Word Analogies

Five semantic analogy pairs testing relational understanding:

1. Мәскәү:Россия = Казан:? (expected: Татарстан)
2. укытучы:мәктәп = табиб:? (expected: хастаханә)
3. әти:әни = бабай:? (expected: әби)
4. зур:кечкенә = озын:? (expected: кыска)
5. Казан:Татарстан = Мәскәү:? (expected: Россия)

### Test 2: Semantic Similarity

Cosine similarity on eight semantically related word pairs:

- Казан-Мәскәү (cities)
- татар-башкорт (ethnic groups)
- мәктәп-университет (educational institutions)
- укытучы-укучы (teacher-student)
- китап-газета (printed materials)
- якшы-начар (good-bad)
- йөгерү-бару (running-going)
- алма-груша (fruits)

### Test 3: Out-of-Vocabulary (OOV) Handling

Testing with morphologically complex forms:

- Казаннан (from Kazan)
- мәктәпләргә (to schools)
- укыткан (taught)
- татарчалаштыру (Tatarization)
- китапларыбызны (our books)
- йөгергәннәр (they ran)

### Test 4: Nearest Neighbours

Qualitative inspection of the top-5 most similar words for key terms: татар, Казан, мәктәп, укытучы, якшы.

### Test 5: PCA Visualization

Dimensionality reduction to 2D for embedding space analysis.

### Test 6: Intuitive Tests

Manual verification of semantic expectations:

- Expected neighbours for "татар" and "Казан"
- Dissimilarity check between "мәктәп" and "хастаханә"

## 📈 Detailed Results

### Test 1: Word Analogies

| Model | Accuracy | Details |
|-------|----------|---------|
| **Word2Vec** | **60.0%** | 3/5 analogies solved (see breakdown below) |
| **FastText** | **0.0%** | All analogies failed |

**Word2Vec per-analogy breakdown:**

- ✓ Мәскәү:Россия = Казан:Татарстан (rank 5)
- ✓ укытучы:мәктәп = табиб:хастаханә (rank 2)
- ✓ әти:әни = бабай:әби (rank 1)
- ✗ зур:кечкенә = озын:кыска
- ✗ Казан:Татарстан = Мәскәү:Россия

**Word2Vec correct predictions:**

- For "хастаханә": found ['табиблар', 'хастаханә', 'хастаханәнең'] (target at rank 2)
- For "әби": found ['әби', 'Бабай', 'бабайның'] (target at rank 1)
- For "Татарстан": found ['Федерациясе', 'Россиянең', 'Республикасы'] (target at rank 5)

**FastText typical errors:**

- For "Татарстан": ['.Россия', ')Россия', ';Россия'] (punctuation artifacts)
- For "Россия": ['МәскәүРусия', 'Мәскәү-Татарстан', 'Татарстанхөкүмәте'] (concatenated forms)

### Test 2: Semantic Similarity

| Word Pair | Word2Vec (cbow100) | FastText (cbow100) |
|-----------|--------------------|--------------------|
| Казан-Мәскәү | 0.777 | 0.736 |
| татар-башкорт | 0.793 | 0.823 |
| мәктәп-университет | 0.565 | 0.621 |
| укытучы-укучы | 0.742 | 0.771 |
| китап-газета | 0.645 | 0.596 |
| якшы-начар | -0.042 | 0.303 |
| йөгерү-бару | 0.367 | 0.545 |
| алма-груша | 0.693 | 0.263 |
| **Average** | **0.568** | **0.582** |

**Observations:**

- FastText shows a slightly higher average similarity (0.582 vs 0.568)
- Word2Vec better captures antonymy (якшы-начар: -0.042 vs 0.303)
- FastText struggles with the fruit pair (алма-груша: 0.263 vs 0.693)

### Test 3: OOV Handling

| Word | In Word2Vec | In FastText |
|------|-------------|-------------|
| Казаннан | ✓ | ✓ |
| мәктәпләргә | ✓ | ✓ |
| укыткан | ✓ | ✓ |
| татарчалаштыру | ✓ | ✓ |
| китапларыбызны | ✓ | ✓ |
| йөгергәннәр | ✓ | ✓ |

**Note:** Both models achieved 100% coverage on these morphologically complex forms, indicating that the vocabulary is comprehensive.

### Test 4: Nearest Neighbours Analysis

#### Word2Vec (cbow100) – Clean Semantic Neighbours

**татар:**
```
1. Татар (0.889)     # Capitalized form
2. башкорт (0.793)   # Bashkir (related ethnicity)
3. урыс (0.788)      # Russian
4. татарның (0.783)  # Genitive form
5. рус (0.755)       # Russian
```

**Казан:**
```
1. Мәскәү (0.777)   # Moscow
2. Чаллы (0.771)    # Naberezhnye Chelny (Tatarstan city)
3. Алабуга (0.733)  # Yelabuga (Tatarstan city)
4. Чистай (0.717)   # Chistopol (Tatarstan city)
5. Уфа (0.715)      # Ufa (Bashkortostan capital)
```

**мәктәп:**
```
1. Мәктәп (0.886)      # Capitalized
2. мәктәпнең (0.878)   # Genitive
3. гимназия (0.818)    # Gymnasium
4. мәктәптә (0.813)    # Locative
5. укытучылар (0.797)  # Teachers
```

**укытучы:**
```
1. Укытучы (0.821)     # Capitalized
2. мәктәптә (0.816)    # At school
3. тәрбияче (0.806)    # Educator
4. укытучылар (0.794)  # Teachers (plural)
5. укытучысы (0.788)   # His/her teacher
```

**якшы:**
```
1. фикер-ниятенә (0.758)  # Noisy
2. фильмыМарска (0.744)   # Noisy
3. 1418, (0.731)          # Number + punctuation
4. «мә-аа-ауу», (0.728)   # Onomatopoeia
5. (273 (0.723)           # Number in parentheses
```

#### FastText (cbow100) – Noisy Neighbours with Punctuation

**татар:**
```
1. милләттатар (0.944)  # Compound
2. дтатар (0.940)       # With prefix
3. —татар (0.938)       # Em-dash prefix
4. –татар (0.938)       # En-dash prefix
5. Ттатар (0.934)       # Capital T prefix
```

**Казан:**
```
1. »Казан (0.940)  # Attached closing quote
2. –Казан (0.937)  # Dash prefix
3. .Казан (0.936)  # Period prefix
4. )Казан (0.935)  # Attached closing parenthesis
5. -Казан (0.935)  # Hyphen prefix
```

**мәктәп:**
```
1. -мәктәп (0.966)  # Hyphen prefix
2. —мәктәп (0.964)  # Em-dash prefix
3. мәктәп— (0.956)  # Em-dash suffix
4. "мәктәп (0.956)  # Quote prefix
5. мәктәп… (0.954)  # Ellipsis suffix
```

**укытучы:**
```
1. укытучы- (0.951)          # Hyphen suffix
2. укытучылы (0.945)         # With suffix
3. укытучы-тәрбияче (0.945)  # Compound
4. укытучы-остаз (0.940)     # Compound
5. укытучы-хәлфә (0.935)     # Compound
```

**якшы:**
```
1. якш (0.788)     # Truncated
2. як— (0.779)     # With dash
3. ягы-ры (0.774)  # Noisy
4. якй (0.771)     # Noisy
5. якшмбе (0.768)  # Possibly "якшәмбе" (Sunday) misspelled
```

### Test 5: PCA Visualization

| Model | Explained Variance (PC1+PC2) |
|-------|------------------------------|
| Word2Vec | 38.4% |
| FastText | 41.2% |

FastText shows slightly better variance preservation in the 2D projection.
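The analogy, similarity, and nearest-neighbour tests above reduce to a few standard vector-space operations. The following is a minimal, self-contained sketch of those operations over a hand-built toy vocabulary (English stand-in tokens and made-up 3-dimensional vectors, not the trained Tatar models), intended only to illustrate the mechanics:

```python
import numpy as np

# Toy embedding table standing in for a trained model. The vectors are
# hand-picked so that the analogy below resolves correctly; they carry
# no linguistic meaning.
vocab = ["kazan", "tatarstan", "moscow", "russia", "school"]
E = np.array([
    [1.0, 0.0, 0.2],   # kazan
    [1.0, 1.0, 0.2],   # tatarstan
    [0.0, 0.1, 1.0],   # moscow
    [0.0, 1.1, 1.0],   # russia
    [0.5, 0.5, 0.5],   # school
])
idx = {w: i for i, w in enumerate(vocab)}

def cosine(u, v):
    # Test 2: cosine similarity between two word vectors.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def most_similar(query_vec, exclude=(), topn=5):
    # Test 4: rank all vocabulary words by cosine similarity to a query vector.
    scored = [(w, cosine(query_vec, E[i]))
              for w, i in idx.items() if w not in exclude]
    return sorted(scored, key=lambda p: -p[1])[:topn]

def analogy(a, b, c, topn=1):
    # Test 1: a:b = c:? solved as vec(b) - vec(a) + vec(c),
    # excluding the three input words from the candidates.
    q = E[idx[b]] - E[idx[a]] + E[idx[c]]
    return most_similar(q, exclude={a, b, c}, topn=topn)

print(analogy("moscow", "russia", "kazan"))       # "tatarstan" ranks first here
print(cosine(E[idx["kazan"]], E[idx["moscow"]]))  # a Test-2-style pair score
```

In the actual pipeline these operations would run over the trained `w2v_cbow_100` / `ft_cbow_100` vectors (e.g. via an embedding library's `most_similar` equivalent); the "rank N" figures in Test 1 correspond to the target word's position in the sorted candidate list.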
### Test 6: Intuitive Tests

#### Word2Vec

**Target: "татар"** (expected: башкорт, рус, милләт)
- Found: ['Татар', 'башкорт', 'урыс', 'татарның', 'рус']
- Matches: ['башкорт', 'рус'] ✓

**Target: "Казан"** (expected: Мәскәү, Уфа, шәһәр)
- Found: ['Мәскәү', 'Чаллы', 'Алабуга', 'Чистай', 'Уфа']
- Matches: ['Мәскәү', 'Уфа'] ✓

**Dissimilarity: мәктәп vs хастаханә**
- Similarity: 0.490 (appropriately low) ✓

#### FastText

**Target: "татар"** (expected: башкорт, рус, милләт)
- Found: ['милләттатар', 'дтатар', '—татар', '–татар', 'Ттатар']
- Matches: [] ✗

**Target: "Казан"** (expected: Мәскәү, Уфа, шәһәр)
- Found: ['»Казан', '–Казан', '.Казан', ')Казан', '-Казан']
- Matches: [] ✗

**Dissimilarity: мәктәп vs хастаханә**
- Similarity: 0.514 (borderline high) ✗

## 📊 Comparative Summary

| Metric | Word2Vec (cbow100) | FastText (cbow100) |
|--------|--------------------|--------------------|
| **Vocabulary Coverage** | 100.00% | 100.00% |
| **Analogy Accuracy** | **60.0%** | 0.0% |
| **Average Semantic Similarity** | 0.568 | 0.582 |
| **OOV Words Found** | 6/6 | 6/6 |
| **Vocabulary Size** | 1,293,992 | 1,293,992 |
| **Training Time (seconds)** | **1,760** | 3,323 |
| **Neighbour Quality** | Clean | Noisy (punctuation) |
| **PCA Variance Explained** | 38.4% | 41.2% |
| **Intuitive Test Pass Rate** | 3/3 | 0/3 |
| **Weighted Final Score** | **0.635** | 0.487 |

## 🔍 Key Findings

1. **Word2Vec significantly outperforms FastText on analogy tasks** (60% vs 0%), indicating better capture of semantic relationships.
2. **FastText produces noisier nearest neighbours**, dominated by punctuation-attached forms and compounds rather than semantically related words.
3. **Both models achieve 100% vocabulary coverage**, suggesting the training corpus is well represented.
4. **FastText trains nearly 2x slower** (3,323 s vs 1,760 s) with no clear benefit on this dataset.
5. **Semantic similarity scores are comparable**, with FastText slightly higher on average (0.582 vs 0.568), but this comes at the cost of interpretability.
6. **Word2Vec better captures antonymy** (якшы-начар: -0.042 vs 0.303 for FastText).
7. **FastText's subword modelling causes "semantic bleeding"**, where words with similar character sequences but different meanings cluster together.

## 🏆 Winner: Word2Vec CBOW (100 dimensions)

### Weighted Scoring Rationale

The final score (0.635 for Word2Vec vs 0.487 for FastText) is based on:

- **Analogy performance** (40% weight): Word2Vec 60% vs FastText 0%
- **Neighbour quality** (30% weight): Word2Vec clean vs FastText noisy
- **Training efficiency** (15% weight): Word2Vec nearly 2x faster
- **Semantic similarity** (15% weight): FastText slightly higher (0.582 vs 0.568)

## 💡 Recommendations

| Use Case | Recommended Model | Rationale |
|----------|-------------------|-----------|
| **Semantic search, analogies, word similarity** | **w2v_cbow_100** | Best semantic quality, clean neighbours |
| **Maximum precision (if resources allow)** | w2v_cbow_200 | Higher dimensionality captures more nuance |
| **Morphological analysis** | ft_cbow_100 | Subword information helps with word forms |
| **Handling truly rare words** | ft_cbow_100 | Would matter if vocabulary coverage were lower |
| **When training speed matters** | w2v_cbow_100 | Nearly 2x faster training |

## ⚠️ FastText Limitations Observed

1. **Punctuation contamination**: FastText embeddings are heavily influenced by character n-grams that include punctuation, causing words with identical punctuation patterns to cluster together.
2. **Compound-word over-generation**: FastText tends to surface compounds (e.g., "милләттатар" instead of "татар") as nearest neighbours.
3. **Poor analogy performance**: Despite subword information, FastText fails completely on the semantic analogies.
4. **Semantic vs. orthographic trade-off**: The model optimizes for character-level similarity at the expense of semantic relationships.

## 🔬 Conclusion

After comprehensive evaluation across multiple tasks, **Word2Vec CBOW with 100 dimensions** is recommended as the default choice for most Tatar NLP applications. It provides:

- ✅ **Superior semantic understanding** (evidenced by analogy performance)
- ✅ **Clean, interpretable nearest neighbours** (actual words, not punctuation artifacts)
- ✅ **Faster training and inference** (nearly 2x faster than FastText)
- ✅ **Good antonym capture** (negative similarity for opposites)
- ✅ **Appropriate dissimilarity** for unrelated concepts

FastText, despite its theoretical advantages for morphology, underperforms on this corpus due to:

- Noise from punctuation-attached forms
- Over-emphasis on character n-grams at the expense of semantics
- Poor analogy handling

**Final verdict: 🏆 w2v_cbow_100 is the champion model.**

---

*This report was automatically generated on 2026-03-04 as part of the Tatar2Vec model evaluation pipeline. For questions or feedback, please contact the author.*

**Certificate:** This software is registered with Rospatent under certificate No. 2026610619 (filed 2025-12-23, published 2026-01-14).