
Tatar2Vec Model Comparison Report

Date: 2026-03-04
Author: Mullosharaf K. Arabov
Affiliation: Kazan Federal University

Executive Summary

This report presents a comprehensive evaluation of five word embedding models trained for the Tatar language. The Word2Vec CBOW (100-dim) model emerges as the overall winner, demonstrating superior performance on semantic analogy tasks (60% accuracy) and producing cleaner, more interpretable nearest neighbours compared to FastText alternatives.

📊 Models Evaluated

| Model | Type | Architecture | Dimensions | Vocabulary |
|---|---|---|---|---|
| w2v_cbow_100 | Word2Vec | CBOW | 100 | 1,293,992 |
| w2v_cbow_200 | Word2Vec | CBOW | 200 | 1,293,992 |
| w2v_sg_100 | Word2Vec | Skip-gram | 100 | 1,293,992 |
| ft_cbow_100 | FastText | CBOW | 100 | 1,293,992 |
| ft_cbow_200 | FastText | CBOW | 200 | 1,293,992 |

Total trained models: 13 (including intermediate checkpoints)

🧪 Evaluation Methodology

Test 1: Word Analogies

Five semantic analogy pairs testing relational understanding:

  1. Мәскәү:Россия = Казан:? (expected: Татарстан)
  2. укытучы:мәктәп = табиб:? (expected: хастаханә)
  3. әти:әни = бабай:? (expected: әби)
  4. зур:кечкенә = озын:? (expected: кыска)
  5. Казан:Татарстан = Мәскәү:? (expected: Россия)
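Analogies of this form can be solved with the standard vector-offset (3CosAdd) method, equivalent to gensim's `most_similar(positive=[b, c], negative=[a])`. A minimal NumPy sketch, using a hypothetical 2-D toy vocabulary rather than the trained Tatar2Vec vectors:

```python
import numpy as np

def solve_analogy(vectors, a, b, c, topn=5):
    """Top-n candidates for a:b = c:? using 3CosAdd (b - a + c)."""
    query = vectors[b] - vectors[a] + vectors[c]
    query = query / np.linalg.norm(query)
    scores = {}
    for word, vec in vectors.items():
        if word in (a, b, c):  # exclude the three input words themselves
            continue
        scores[word] = float(query @ (vec / np.linalg.norm(vec)))
    return sorted(scores, key=scores.get, reverse=True)[:topn]

# Toy 2-D vectors for illustration only (NOT the real model's embeddings)
toy = {
    "Мәскәү":    np.array([1.0, 1.0]),
    "Россия":    np.array([1.0, 2.0]),
    "Казан":     np.array([2.0, 1.0]),
    "Татарстан": np.array([2.0, 2.0]),
    "мәктәп":    np.array([-1.0, 0.5]),
}
print(solve_analogy(toy, "Мәскәү", "Россия", "Казан", topn=1))  # → ['Татарстан']
```

A prediction counts as correct when the expected word appears anywhere in the top-n list, which is why the report cites ranks (e.g. "rank 5").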

Test 2: Semantic Similarity

Cosine similarity on eight semantically related word pairs:

  • Казан-Мәскәү (cities)
  • татар-башкорт (ethnic groups)
  • мәктәп-университет (educational institutions)
  • укытучы-укучы (teacher-student)
  • китап-газета (printed materials)
  • якшы-начар (good-bad)
  • йөгерү-бару (running-going)
  • алма-груша (fruits)
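The similarity metric is plain cosine similarity between the two word vectors; a self-contained sketch:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors, in [-1, 1]."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Orthogonal vectors score 0, parallel vectors score 1, opposites score -1,
# which is why an antonym pair such as якшы-начар can come out slightly negative.
print(cosine(np.array([1.0, 0.0]), np.array([0.0, 1.0])))           # → 0.0
print(round(cosine(np.array([1.0, 2.0]), np.array([2.0, 4.0])), 6))  # → 1.0
```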

Test 3: Out-of-Vocabulary (OOV) Handling

Testing with morphologically complex forms:

  • Казаннан (from Kazan)
  • мәктәпләргә (to schools)
  • укыткан (taught)
  • татарчалаштыру (tatarization)
  • китапларыбызны (our books)
  • йөгергәннәр (they ran)
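Because every test form turned out to be in both vocabularies, this test reduces to a membership check; FastText could additionally compose a vector for a genuinely unseen form from character n-grams. A sketch of FastText-style n-gram extraction (boundary markers `<`/`>` and the default `minn=3`, `maxn=6`; real FastText then hashes these n-grams into its trained bucket table, which is omitted here):

```python
def fasttext_ngrams(word, minn=3, maxn=6):
    """Character n-grams FastText would use to build a vector for an OOV word.
    FastText first wraps the word in boundary markers '<' and '>'."""
    token = f"<{word}>"
    return [token[i:i + n]
            for n in range(minn, maxn + 1)
            for i in range(len(token) - n + 1)]

grams = fasttext_ngrams("Казаннан")
print(grams[:3])  # → ['<Ка', 'Каз', 'аза']
```

This subword decomposition is also the root of the punctuation artifacts discussed later: a token like ".Казан" shares almost all of its n-grams with "Казан".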

Test 4: Nearest Neighbours

Qualitative inspection of top-5 most similar words for key terms: татар, Казан, мәктәп, укытучы, якшы
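Top-5 neighbours were taken by cosine similarity over the full vocabulary; a vectorized sketch over a toy (V, d) matrix (words and vectors below are placeholders, not the trained model's):

```python
import numpy as np

def nearest_neighbours(matrix, words, query_word, topn=5):
    """Top-n cosine neighbours of query_word in a (V, d) embedding matrix."""
    norms = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    idx = words.index(query_word)
    sims = norms @ norms[idx]
    sims[idx] = -np.inf                      # exclude the query word itself
    order = np.argsort(-sims)[:topn]
    return [(words[i], round(float(sims[i]), 3)) for i in order]

# Hypothetical toy data, for illustration only
words = ["Казан", "Чаллы", "мәктәп"]
vecs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
print(nearest_neighbours(vecs, words, "Казан", topn=2))  # → [('Чаллы', 0.994), ('мәктәп', 0.0)]
```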

Test 5: PCA Visualization

Dimensionality reduction to 2D for embedding space analysis
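The explained-variance figures reported below can be computed from the singular values of the mean-centred embedding matrix (scikit-learn's `PCA` reports the same quantity as `explained_variance_ratio_`); a NumPy-only sketch:

```python
import numpy as np

def explained_variance(X, k=2):
    """Fraction of total variance captured by the top-k principal components."""
    Xc = X - X.mean(axis=0)                  # centre each dimension
    s = np.linalg.svd(Xc, compute_uv=False)  # singular values
    var = s ** 2                             # proportional to component variances
    return float(var[:k].sum() / var.sum())

# Points on a line: the first component explains everything.
X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
print(round(explained_variance(X, k=1), 6))  # → 1.0
```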

Test 6: Intuitive Tests

Manual verification of semantic expectations:

  • Expected neighbours for "татар" and "Казан"
  • Dissimilarity check between "мәктәп" and "хастаханә"

📈 Detailed Results

Test 1: Word Analogies

| Model | Accuracy | Details |
|---|---|---|
| Word2Vec | 60.0% | ✓ Мәскәү:Россия = Казан:Татарстан (rank 5)<br>✓ укытучы:мәктәп = табиб:хастаханә (rank 2)<br>✓ әти:әни = бабай:әби (rank 1)<br>✗ зур:кечкенә = озын:кыска<br>✗ Казан:Татарстан = Мәскәү:Россия |
| FastText | 0.0% | ✗ All analogies failed |

Word2Vec correct predictions:

  • For "хастаханә": found ['табиблар', 'хастаханә', 'хастаханәнең'] (target at rank 2)
  • For "әби": found ['әби', 'Бабай', 'бабайның'] (target at rank 1)
  • For "Татарстан": found ['Федерациясе', 'Россиянең', 'Республикасы'] as the top 3, with Татарстан at rank 5

FastText typical errors:

  • For "Татарстан": ['.Россия', ')Россия', ';Россия'] (punctuation artifacts)
  • For "Россия": ['МәскәүРусия', 'Мәскәү-Татарстан', 'Татарстанхөкүмәте'] (concatenated forms)

Test 2: Semantic Similarity

| Word Pair | Word2Vec (cbow100) | FastText (cbow100) |
|---|---|---|
| Казан-Мәскәү | 0.777 | 0.736 |
| татар-башкорт | 0.793 | 0.823 |
| мәктәп-университет | 0.565 | 0.621 |
| укытучы-укучы | 0.742 | 0.771 |
| китап-газета | 0.645 | 0.596 |
| якшы-начар | -0.042 | 0.303 |
| йөгерү-бару | 0.367 | 0.545 |
| алма-груша | 0.693 | 0.263 |
| Average | 0.568 | 0.582 |

Observations:

  • FastText shows slightly higher average similarity (0.582 vs 0.568)
  • Word2Vec better captures antonymy (якшы-начар: -0.042 vs 0.303)
  • FastText struggles with fruit pairs (алма-груша: 0.263 vs 0.693)

Test 3: OOV Handling

| Word | In Word2Vec | In FastText |
|---|---|---|
| Казаннан | ✓ | ✓ |
| мәктәпләргә | ✓ | ✓ |
| укыткан | ✓ | ✓ |
| татарчалаштыру | ✓ | ✓ |
| китапларыбызны | ✓ | ✓ |
| йөгергәннәр | ✓ | ✓ |

Note: Both models achieved 100% coverage on these morphologically complex forms, indicating the vocabulary is comprehensive.

Test 4: Nearest Neighbours Analysis

Word2Vec (cbow100) – Clean Semantic Neighbours

татар:

```
1. Татар      (0.889)  # Capitalized form
2. башкорт    (0.793)  # Bashkir (related ethnicity)
3. урыс       (0.788)  # Russian
4. татарның   (0.783)  # Genitive form
5. рус        (0.755)  # Russian
```

Казан:

```
1. Мәскәү     (0.777)  # Moscow
2. Чаллы      (0.771)  # Naberezhnye Chelny (Tatarstan city)
3. Алабуга    (0.733)  # Yelabuga (Tatarstan city)
4. Чистай     (0.717)  # Chistopol (Tatarstan city)
5. Уфа        (0.715)  # Ufa (Bashkortostan capital)
```

мәктәп:

```
1. Мәктәп     (0.886)  # Capitalized
2. мәктәпнең  (0.878)  # Genitive
3. гимназия   (0.818)  # Gymnasium
4. мәктәптә   (0.813)  # Locative
5. укытучылар (0.797)  # Teachers
```

укытучы:

```
1. Укытучы    (0.821)  # Capitalized
2. мәктәптә   (0.816)  # At school
3. тәрбияче   (0.806)  # Educator
4. укытучылар (0.794)  # Teachers (plural)
5. укытучысы  (0.788)  # His/her teacher
```

якшы:

```
1. фикер-ниятенә (0.758)  # Noisy
2. фильмыМарска  (0.744)  # Noisy
3. 1418,         (0.731)  # Number + punctuation
4. «мә-аа-ауу»,  (0.728)  # Onomatopoeia
5. (273          (0.723)  # Number in parentheses
```

FastText (cbow100) – Noisy Neighbours with Punctuation

татар:

```
1. милләттатар   (0.944)  # Compound
2. дтатар        (0.940)  # With prefix
3. —татар        (0.938)  # Em dash prefix
4. –татар        (0.938)  # En dash prefix
5. Ттатар        (0.934)  # Capital T prefix
```

Казан:

```
1. »Казан        (0.940)  # Quote suffix
2. –Казан        (0.937)  # Dash prefix
3. .Казан        (0.936)  # Period prefix
4. )Казан        (0.935)  # Parenthesis suffix
5. -Казан        (0.935)  # Hyphen prefix
```

мәктәп:

```
1. -мәктәп       (0.966)  # Hyphen prefix
2. —мәктәп       (0.964)  # Em dash prefix
3. мәктәп—       (0.956)  # Em dash suffix
4. "мәктәп       (0.956)  # Quote prefix
5. мәктәп…       (0.954)  # Ellipsis suffix
```

укытучы:

```
1. укытучы-         (0.951)  # Hyphen suffix
2. укытучылы        (0.945)  # With suffix
3. укытучы-тәрбияче (0.945)  # Compound
4. укытучы-остаз    (0.940)  # Compound
5. укытучы-хәлфә    (0.935)  # Compound
```

якшы:

```
1. якш           (0.788)  # Truncated
2. як—           (0.779)  # With dash
3. ягы-ры        (0.774)  # Noisy
4. якй           (0.771)  # Noisy
5. якшмбе        (0.768)  # Possibly "якшәмбе" (Sunday) misspelled
```

Test 5: PCA Visualization

| Model | Explained Variance (PC1+PC2) |
|---|---|
| Word2Vec | 38.4% |
| FastText | 41.2% |

FastText shows slightly better variance preservation in 2D projection.

Test 6: Intuitive Tests

Word2Vec

  • Target "татар" (expected: башкорт, рус, милләт) → found ['Татар', 'башкорт', 'урыс', 'татарның', 'рус']; matches: ['башкорт', 'рус'] ✓
  • Target "Казан" (expected: Мәскәү, Уфа, шәһәр) → found ['Мәскәү', 'Чаллы', 'Алабуга', 'Чистай', 'Уфа']; matches: ['Мәскәү', 'Уфа'] ✓
  • Dissimilarity check мәктәп vs хастаханә: similarity 0.490 (appropriately low) ✓

FastText

  • Target "татар" (expected: башкорт, рус, милләт) → found ['милләттатар', 'дтатар', '—татар', '–татар', 'Ттатар']; matches: [] ✗
  • Target "Казан" (expected: Мәскәү, Уфа, шәһәр) → found ['»Казан', '–Казан', '.Казан', ')Казан', '-Казан']; matches: [] ✗
  • Dissimilarity check мәктәп vs хастаханә: similarity 0.514 (borderline high) ✗

📊 Comparative Summary

| Metric | Word2Vec (cbow100) | FastText (cbow100) |
|---|---|---|
| Vocabulary Coverage | 100.00% | 100.00% |
| Analogy Accuracy | 60.0% | 0.0% |
| Average Semantic Similarity | 0.568 | 0.582 |
| OOV Words Found | 6/6 | 6/6 |
| Vocabulary Size | 1,293,992 | 1,293,992 |
| Training Time (seconds) | 1,760 | 3,323 |
| Neighbour Quality | Clean | Noisy (punctuation) |
| PCA Variance Explained | 38.4% | 41.2% |
| Intuitive Test Pass Rate | 3/3 | 0/3 |
| Weighted Final Score | 0.635 | 0.487 |

🔍 Key Findings

  1. Word2Vec significantly outperforms FastText on analogy tasks (60% vs 0%), indicating better capture of semantic relationships.

  2. FastText produces noisier nearest neighbours, dominated by punctuation-attached forms and compounds rather than semantically related words.

  3. Both models achieve 100% vocabulary coverage, suggesting the training corpus is well-represented.

  4. FastText trains nearly 2x slower (3,323s vs 1,760s) with no clear benefit for this dataset.

  5. Semantic similarity scores are comparable, with FastText slightly higher on average (0.582 vs 0.568), but this comes at the cost of interpretability.

  6. Word2Vec better captures antonymy (якшы-начар: -0.042 vs 0.303 for FastText).

  7. FastText's subword nature causes "semantic bleeding" where words with similar character sequences but different meanings cluster together.

🏆 Winner: Word2Vec CBOW (100 dimensions)

Weighted Scoring Rationale

The final score (0.635 for Word2Vec vs 0.487 for FastText) is based on:

  • Analogy performance (40% weight): Word2Vec 60% vs FastText 0%
  • Neighbour quality (30% weight): Word2Vec clean vs FastText noisy
  • Training efficiency (15% weight): Word2Vec 2x faster
  • Semantic similarity (15% weight): FastText slightly higher (0.582 vs 0.568)
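The weighting above can be expressed as a simple weighted sum. The sub-scores used below are illustrative placeholders only; the report does not show how the pipeline normalises "neighbour quality" and "training efficiency" into [0, 1]:

```python
WEIGHTS = {
    "analogy":    0.40,
    "neighbours": 0.30,
    "efficiency": 0.15,
    "similarity": 0.15,
}

def weighted_score(components, weights=WEIGHTS):
    """Combine per-test sub-scores (each assumed to lie in [0, 1]) into one score."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(weights[name] * components[name] for name in weights)

# Hypothetical sub-scores (NOT the pipeline's actual intermediate values)
w2v = {"analogy": 0.60, "neighbours": 0.90, "efficiency": 1.00, "similarity": 0.568}
print(round(weighted_score(w2v), 3))  # → 0.745
```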

💡 Recommendations

| Use Case | Recommended Model | Rationale |
|---|---|---|
| Semantic search, analogies, word similarity | w2v_cbow_100 | Best semantic quality, clean neighbours |
| Maximum precision (if resources allow) | w2v_cbow_200 | Higher dimensionality captures more nuance |
| Morphological analysis | ft_cbow_100 | Subword information helps with word forms |
| Handling truly rare words | ft_cbow_100 | If vocabulary coverage were lower |
| When training speed matters | w2v_cbow_100 | 2x faster training |

⚠️ FastText Limitations Observed

  1. Punctuation contamination: FastText embeddings are heavily influenced by character n-grams that include punctuation, causing words with identical punctuation patterns to cluster together.

  2. Compound word over-generation: FastText tends to create and prioritize compounds (e.g., "милләттатар" instead of "татар") as nearest neighbours.

  3. Poor analogy performance: Despite subword information, FastText fails completely on semantic analogies.

  4. Semantic vs. orthographic trade-off: The model optimizes for character-level similarity at the expense of semantic relationships.

🔬 Conclusion

After comprehensive evaluation across multiple tasks, Word2Vec CBOW with 100 dimensions is recommended as the default choice for most Tatar NLP applications. It provides:

  • Superior semantic understanding (evidenced by analogy performance)
  • Clean, interpretable nearest neighbours (actual words, not punctuation artifacts)
  • Faster training and inference (2x faster than FastText)
  • Good antonym capture (negative similarity for opposites)
  • Appropriate dissimilarity for unrelated concepts

FastText, despite its theoretical advantages for morphology, underperforms on this corpus due to:

  • Noise from punctuation-attached forms
  • Over-emphasis on character n-grams at the expense of semantics
  • Poor analogy handling

Final verdict: 🏆 w2v_cbow_100 is the champion model.


This report was automatically generated on 2026-03-04 as part of the Tatar2Vec model evaluation pipeline. For questions or feedback, please contact the author.

Certificate: This software is registered with Rospatent under certificate No. 2026610619 (filed 2025-12-23, published 2026-01-14).