# Tatar2Vec Model Comparison Report

**Date:** 2026-03-04
**Author:** Mullosharaf K. Arabov

## Overview

This report presents a comprehensive comparison of five word embedding models trained for the Tatar language:

- Word2Vec CBOW (100 and 200 dimensions)
- Word2Vec Skip-gram (100 dimensions)
- FastText CBOW (100 and 200 dimensions)

## Evaluation Methodology

### Tasks

1. **Word Analogies** – 5 semantic analogy pairs
2. **Semantic Similarity** – cosine similarity on 8 word pairs
3. **OOV Handling** – testing with 6 morphologically complex words
4. **Nearest Neighbours** – qualitative inspection
5. **Dimensionality Reduction** – PCA visualization
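
The analogy and similarity tasks boil down to cosine arithmetic over word vectors. The snippet below is an illustrative toy, not code from the evaluation: the words and vectors are invented, and `analogy` implements the standard 3CosAdd scheme used for analogy scoring.

```python
import math

# Toy embedding table standing in for a trained model (invented vectors).
vecs = {
    "king":  [0.9, 0.1, 0.0],
    "queen": [0.8, 0.9, 0.1],
    "man":   [0.9, 0.0, 0.1],
    "woman": [0.8, 0.8, 0.2],
    "apple": [0.0, 0.1, 0.9],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def analogy(a, b, c, table):
    """3CosAdd: the word d maximizing cos(vec(d), vec(b) - vec(a) + vec(c))."""
    target = [tb - ta + tc for ta, tb, tc in zip(table[a], table[b], table[c])]
    candidates = {w: v for w, v in table.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(candidates[w], target))

print(analogy("man", "king", "woman", vecs))  # → queen on this toy data
```

An analogy counts as correct when the top-ranked candidate matches the expected word, and pairwise similarity is just `cosine` applied to the two words' vectors.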

## Detailed Results

### 1. Word Analogies

| Model | Accuracy | Correct analogies |
|-------|----------|-------------------|
| w2v_cbow_100 | 60% (3/5) | ✓ Moscow:Russia = Kazan:Tatarstan<br>✓ teacher:school = doctor:hospital<br>✓ father:mother = grandfather:grandmother |
| All FastText models | 0% (0/5) | Failed on all five analogy pairs |

### 2. Semantic Similarity

| Word Pair | Word2Vec (cbow100) | FastText (cbow100) |
|-----------|--------------------|--------------------|
| Казан – Мәскәү (Kazan – Moscow) | 0.777 | 0.736 |
| татар – башкорт (Tatar – Bashkir) | 0.793 | 0.823 |
| мәктәп – университет (school – university) | 0.565 | 0.621 |
| укытучы – укучы (teacher – pupil) | 0.742 | 0.771 |
| **Average (all 8 pairs)** | **0.568** | **0.582** |

Four representative pairs are shown; the averages are computed over all eight evaluated pairs.

### 3. Nearest Neighbours Analysis

#### Word2Vec (cbow100) – Clean semantic neighbours
```
татар → Татар, башкорт, урыс, татарның, рус
Казан → Мәскәү, Чаллы, Алабуга, Чистай, Уфа
```

#### FastText (cbow100) – Noisy neighbours with punctuation
```
татар → милләттатар, дтатар, —татар, –татар, Ттатар
Казан → »Казан, –Казан, .Казан, )Казан, -Казан
```
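
The punctuation-prefixed neighbours suggest that tokens like `»Казан` reached FastText with punctuation still attached, so its subword vocabulary learned contaminated n-grams. A minimal token-cleanup pass (an assumption about the pipeline, not the report's actual preprocessing) would strip such edges before training:

```python
import re

# Strip leading/trailing punctuation so forms like "»Казан" or "–Казан"
# never enter the vocabulary; interior characters are left untouched.
# Python 3 str patterns are Unicode-aware, so Cyrillic letters count as \w.
EDGE_PUNCT = re.compile(r"^[\W_]+|[\W_]+$")

def clean_tokens(tokens):
    stripped = (EDGE_PUNCT.sub("", t) for t in tokens)
    return [t for t in stripped if t]  # drop tokens that were pure punctuation

print(clean_tokens(["»Казан", "–татар", "мәктәп,", "—"]))
# → ['Казан', 'татар', 'мәктәп']
```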

### 4. Training Efficiency

| Model | Training Time | Vocabulary Size |
|-------|---------------|-----------------|
| w2v_cbow_100 | 1,760 s (≈29 min) | 1,293,992 |
| ft_cbow_100 | 3,323 s (≈55 min) | 1,293,992 |

## Key Findings

1. **Word2Vec CBOW (100-dim)** is the best overall model, particularly strong on semantic analogies (60% accuracy).

2. **FastText models** show slightly higher raw similarity scores but suffer from noisy representations caused by punctuation artifacts.

3. **All models** achieve 100% vocabulary coverage on the evaluation words.

4. **Word2Vec trains almost 2× faster** than FastText (1,760 s vs. 3,323 s).

## Recommendations

| Use Case | Recommended Model |
|----------|-------------------|
| Semantic similarity, analogies | **w2v_cbow_100** |
| Morphological analysis | ft_cbow_100 |
| Maximum precision (if memory allows) | w2v_cbow_200 |
| Rare word handling | ft_cbow_100 |
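
FastText's edge on rare and morphologically complex words comes from composing a word's vector out of character n-gram vectors, so even an unseen inflected form gets a representation. A sketch of the n-gram extraction (the 3–6 range matches FastText's usual defaults; exact settings for these models are not stated in the report):

```python
# FastText wraps each word in boundary markers, then slices it into all
# character n-grams in a fixed length range. An OOV word's vector is the
# sum/average of these subword vectors.
def char_ngrams(word, n_min=3, n_max=6):
    marked = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        grams += [marked[i:i + n] for i in range(len(marked) - n + 1)]
    return grams

print(char_ngrams("татар")[:5])  # → ['<та', 'тат', 'ата', 'тар', 'ар>']
```

This also explains the noisy neighbours above: a punctuation-contaminated token shares most of its n-grams with the clean word, pulling the two vectors together.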

## Visualizations

### PCA Projection (Word2Vec)
*(PCA plot screenshot to be added)*

### PCA Projection (FastText)
*(PCA plot screenshot to be added)*
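
The PCA projections themselves take only a few lines. This sketch uses random stand-in vectors (a real plot would take vectors for selected words from each trained model and feed the resulting 2-D coordinates to matplotlib):

```python
import numpy as np

def pca_2d(X):
    """Project row vectors onto their top two principal components."""
    Xc = X - X.mean(axis=0)                      # center the data
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T                         # (n_words, 2) coordinates

rng = np.random.default_rng(0)
emb = rng.normal(size=(10, 100))                 # 10 stand-in word vectors
coords = pca_2d(emb)
print(coords.shape)  # → (10, 2)
```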

## Conclusion

**Winner: Word2Vec CBOW (100 dimensions)**

Despite FastText's slightly higher semantic similarity scores, Word2Vec produces cleaner, more interpretable embeddings and significantly outperforms FastText on analogy tasks. It is recommended as the default choice for most Tatar NLP applications.

---

*This report was generated automatically during model evaluation.*