Tatar
Commit e24a130 by ArabovMK · 1 Parent(s): 16333d4

Update model_comparison_report.md

Files changed (1)
  1. model_comparison_report.md +97 -0
model_comparison_report.md CHANGED
@@ -0,0 +1,97 @@
+ # Tatar2Vec Model Comparison Report
+
+ **Date:** 2026-03-04
+ **Author:** Mullosharaf K. Arabov
+
+ ## Overview
+
+ This report compares five word embedding models trained for the Tatar language:
+ - Word2Vec CBOW (100 and 200 dimensions)
+ - Word2Vec Skip-gram (100 dimensions)
+ - FastText CBOW (100 and 200 dimensions)
+
+ ## Evaluation Methodology
+
+ ### Tasks
+ 1. **Word Analogies** – 5 semantic analogy pairs
+ 2. **Semantic Similarity** – Cosine similarity on 8 word pairs
+ 3. **OOV Handling** – Testing with 6 morphologically complex words
+ 4. **Nearest Neighbours** – Qualitative inspection
+ 5. **Dimensionality Reduction** – PCA visualization
+
+ ## Detailed Results
+
+ ### 1. Word Analogies
+
+ | Model | Accuracy | Details |
+ |-------|----------|---------|
+ | w2v_cbow_100 | 60% (3/5) | ✓ Moscow:Russia = Kazan:Tatarstan<br>✓ teacher:school = doctor:hospital<br>✓ father:mother = grandfather:grandmother |
+ | All FastText | 0% (0/5) | Failed on all five analogies |
+
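The analogy task above can be illustrated with the common vector-offset formulation (pick the word closest to `b - a + c`). This is a minimal sketch on hand-made toy vectors, not the evaluation script behind the table; `solve_analogy`, `analogy_accuracy`, and `toy_vectors` are hypothetical names introduced only for illustration.

```python
import numpy as np

def normalize(v):
    """Scale a vector to unit length."""
    return v / np.linalg.norm(v)

def solve_analogy(vectors, a, b, c):
    """Return the word closest (by cosine) to b - a + c, excluding a, b and c."""
    target = normalize(
        normalize(vectors[b]) - normalize(vectors[a]) + normalize(vectors[c])
    )
    best_word, best_score = None, -np.inf
    for word, vec in vectors.items():
        if word in (a, b, c):
            continue
        score = float(np.dot(normalize(vec), target))
        if score > best_score:
            best_word, best_score = word, score
    return best_word

def analogy_accuracy(vectors, analogies):
    """Fraction of (a, b, c, expected) tuples that are solved correctly."""
    hits = sum(solve_analogy(vectors, a, b, c) == d for a, b, c, d in analogies)
    return hits / len(analogies)

# Toy 3-dimensional vectors (assumed values, NOT taken from the trained models).
toy_vectors = {
    "Moscow": np.array([1.0, 0.0, 0.0]),
    "Russia": np.array([1.0, 0.0, 1.0]),
    "Kazan": np.array([0.0, 1.0, 0.0]),
    "Tatarstan": np.array([0.0, 1.0, 1.0]),
    "school": np.array([0.2, 0.2, -1.0]),
}

answer = solve_analogy(toy_vectors, "Moscow", "Russia", "Kazan")
accuracy = analogy_accuracy(
    toy_vectors, [("Moscow", "Russia", "Kazan", "Tatarstan")]
)
```

With five analogy pairs, as in this report, each correct answer contributes 20 percentage points, so 60% corresponds to three solved analogies.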
+ ### 2. Semantic Similarity
+
+ | Word Pair | Word2Vec (cbow100) | FastText (cbow100) |
+ |-----------|-------------------|-------------------|
+ | Казан-Мәскәү | 0.777 | 0.736 |
+ | татар-башкорт | 0.793 | 0.823 |
+ | мәктәп-университет | 0.565 | 0.621 |
+ | укытучы-укучы | 0.742 | 0.771 |
+ | **Average** (presumably over all 8 pairs; only 4 are listed) | **0.568** | **0.582** |
+
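For reference, cosine similarity between two embedding vectors can be computed as in the sketch below. The helper name `cosine_similarity` and the toy vectors are assumptions for illustration. The last lines average the four Word2Vec scores listed in the table; the result (about 0.719) differs from the reported 0.568, which suggests the reported average covers all eight evaluated pairs rather than only the four shown.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy vectors (assumed values, not taken from the trained models).
sim_diagonal = cosine_similarity([1.0, 0.0], [1.0, 1.0])    # 45 degrees apart
sim_orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])  # 90 degrees apart

# Mean of the four Word2Vec (cbow100) scores listed in the table above.
listed_w2v_scores = [0.777, 0.793, 0.565, 0.742]
mean_listed_w2v = sum(listed_w2v_scores) / len(listed_w2v_scores)
```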
+ ### 3. Nearest Neighbours Analysis
+
+ #### Word2Vec (cbow100) – Clean semantic neighbours
+ ```
+ татар → Татар, башкорт, урыс, татарның, рус
+ Казан → Мәскәү, Чаллы, Алабуга, Чистай, Уфа
+ ```
+
+ #### FastText (cbow100) – Noisy neighbours with punctuation
+ ```
+ татар → милләттатар, дтатар, —татар, –татар, Ттатар
+ Казан → »Казан, –Казан, .Казан, )Casan, -Казан
+ ```
+
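One way to quantify the punctuation noise in the FastText neighbour lists is to count neighbours whose first character belongs to a Unicode punctuation category. This is only a sketch of such a check, applied to the neighbour lists quoted above; `starts_with_punctuation` is a hypothetical helper, not part of the evaluation pipeline.

```python
import unicodedata

def starts_with_punctuation(token):
    """True if the first character is in a Unicode punctuation category (P*)."""
    return bool(token) and unicodedata.category(token[0]).startswith("P")

# FastText (cbow100) neighbours as listed in the report.
fasttext_neighbours = {
    "татар": ["милләттатар", "дтатар", "—татар", "–татар", "Ттатар"],
    "Казан": ["»Казан", "–Казан", ".Казан", ")Казан", "-Казан"],
}

# Number of neighbours per query word that begin with punctuation.
noisy_counts = {
    query: sum(starts_with_punctuation(n) for n in neighbours)
    for query, neighbours in fasttext_neighbours.items()
}
```

On these lists the check flags two of the five neighbours of "татар" and all five neighbours of "Казан".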
+ ### 4. Training Efficiency
+
+ | Model | Training Time | Vocabulary Size |
+ |-------|--------------|-----------------|
+ | w2v_cbow_100 | 1760 s (29 min) | 1,293,992 |
+ | ft_cbow_100 | 3323 s (55 min) | 1,293,992 |
+
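The relative training cost follows directly from the two timings in the table. A small worked calculation, with variable names chosen only for illustration:

```python
# Training times in seconds, as reported in the table above.
training_seconds = {"w2v_cbow_100": 1760, "ft_cbow_100": 3323}

# How many times longer FastText took than Word2Vec.
slowdown = training_seconds["ft_cbow_100"] / training_seconds["w2v_cbow_100"]

# The same timings converted to whole minutes.
training_minutes = {name: secs // 60 for name, secs in training_seconds.items()}
```

The ratio is roughly 1.89, so FastText training took a little under twice as long as Word2Vec on the same vocabulary.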
+ ## Key Findings
+
+ 1. **Word2Vec CBOW (100-dim)** is the best overall model and is particularly strong on semantic analogies (60% accuracy, versus 0% for FastText).
+
+ 2. **FastText models** score slightly higher on raw cosine similarity (average 0.582 vs. 0.568) but produce noisy nearest neighbours containing punctuation artifacts.
+
+ 3. **All models** achieve 100% vocabulary coverage.
+
+ 4. **Word2Vec trains about 1.9x faster** than FastText (1760 s vs. 3323 s for the 100-dimensional CBOW models).
+
+ ## Recommendations
+
+ | Use Case | Recommended Model |
+ |----------|-------------------|
+ | Semantic similarity, analogies | **w2v_cbow_100** |
+ | Morphological analysis | ft_cbow_100 |
+ | Maximum precision (if memory allows) | w2v_cbow_200 |
+ | Rare word handling | ft_cbow_100 |
+
+ ## Visualizations
+
+ ### PCA Projection (Word2Vec)
+ *[PCA plot not yet added]*
+
+ ### PCA Projection (FastText)
+ *[PCA plot not yet added]*
+
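The PCA plots themselves are not included here. As a sketch of how a 2-D projection of embedding vectors might be produced, the snippet below implements PCA with NumPy's SVD on random stand-in data; `pca_project` and `toy_embeddings` are hypothetical, and real word vectors would replace the random matrix.

```python
import numpy as np

def pca_project(matrix, n_components=2):
    """Project rows onto the top principal components via SVD of centered data."""
    X = np.asarray(matrix, dtype=float)
    centered = X - X.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

# Random stand-in for six 100-dimensional word vectors (assumed data).
rng = np.random.default_rng(0)
toy_embeddings = rng.normal(size=(6, 100))

coords = pca_project(toy_embeddings)  # one (x, y) point per word
```

Each row of `coords` can then be plotted as a labelled point with any plotting library.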
+ ## Conclusion
+
+ **Winner: Word2Vec CBOW (100 dimensions)**
+
+ Although FastText scores slightly higher on semantic similarity, Word2Vec produces cleaner, more interpretable embeddings and clearly outperforms FastText on the analogy task (60% vs. 0%). It is recommended as the default choice for most Tatar NLP applications.
+
+ ---
+
+ *This report was generated automatically during model evaluation.*