ArabovMK committed
Commit a0e97e3 · verified · 1 Parent(s): 0c4d0a5

Upload 6 files

Files changed (6)
  1. .gitattributes +35 -35
  2. .gitignore +2 -0
  3. README.md +192 -12
  4. app.py +590 -0
  5. dockerfile +21 -0
  6. requirements.txt +0 -0
.gitattributes CHANGED
@@ -1,35 +1,35 @@
1
- *.7z filter=lfs diff=lfs merge=lfs -text
2
- *.arrow filter=lfs diff=lfs merge=lfs -text
3
- *.bin filter=lfs diff=lfs merge=lfs -text
4
- *.bz2 filter=lfs diff=lfs merge=lfs -text
5
- *.ckpt filter=lfs diff=lfs merge=lfs -text
6
- *.ftz filter=lfs diff=lfs merge=lfs -text
7
- *.gz filter=lfs diff=lfs merge=lfs -text
8
- *.h5 filter=lfs diff=lfs merge=lfs -text
9
- *.joblib filter=lfs diff=lfs merge=lfs -text
10
- *.lfs.* filter=lfs diff=lfs merge=lfs -text
11
- *.mlmodel filter=lfs diff=lfs merge=lfs -text
12
- *.model filter=lfs diff=lfs merge=lfs -text
13
- *.msgpack filter=lfs diff=lfs merge=lfs -text
14
- *.npy filter=lfs diff=lfs merge=lfs -text
15
- *.npz filter=lfs diff=lfs merge=lfs -text
16
- *.onnx filter=lfs diff=lfs merge=lfs -text
17
- *.ot filter=lfs diff=lfs merge=lfs -text
18
- *.parquet filter=lfs diff=lfs merge=lfs -text
19
- *.pb filter=lfs diff=lfs merge=lfs -text
20
- *.pickle filter=lfs diff=lfs merge=lfs -text
21
- *.pkl filter=lfs diff=lfs merge=lfs -text
22
- *.pt filter=lfs diff=lfs merge=lfs -text
23
- *.pth filter=lfs diff=lfs merge=lfs -text
24
- *.rar filter=lfs diff=lfs merge=lfs -text
25
- *.safetensors filter=lfs diff=lfs merge=lfs -text
26
- saved_model/**/* filter=lfs diff=lfs merge=lfs -text
27
- *.tar.* filter=lfs diff=lfs merge=lfs -text
28
- *.tar filter=lfs diff=lfs merge=lfs -text
29
- *.tflite filter=lfs diff=lfs merge=lfs -text
30
- *.tgz filter=lfs diff=lfs merge=lfs -text
31
- *.wasm filter=lfs diff=lfs merge=lfs -text
32
- *.xz filter=lfs diff=lfs merge=lfs -text
33
- *.zip filter=lfs diff=lfs merge=lfs -text
34
- *.zst filter=lfs diff=lfs merge=lfs -text
35
- *tfevents* filter=lfs diff=lfs merge=lfs -text
 
1
+ *.7z filter=lfs diff=lfs merge=lfs -text
2
+ *.arrow filter=lfs diff=lfs merge=lfs -text
3
+ *.bin filter=lfs diff=lfs merge=lfs -text
4
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
5
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
6
+ *.ftz filter=lfs diff=lfs merge=lfs -text
7
+ *.gz filter=lfs diff=lfs merge=lfs -text
8
+ *.h5 filter=lfs diff=lfs merge=lfs -text
9
+ *.joblib filter=lfs diff=lfs merge=lfs -text
10
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
11
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
12
+ *.model filter=lfs diff=lfs merge=lfs -text
13
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
14
+ *.npy filter=lfs diff=lfs merge=lfs -text
15
+ *.npz filter=lfs diff=lfs merge=lfs -text
16
+ *.onnx filter=lfs diff=lfs merge=lfs -text
17
+ *.ot filter=lfs diff=lfs merge=lfs -text
18
+ *.parquet filter=lfs diff=lfs merge=lfs -text
19
+ *.pb filter=lfs diff=lfs merge=lfs -text
20
+ *.pickle filter=lfs diff=lfs merge=lfs -text
21
+ *.pkl filter=lfs diff=lfs merge=lfs -text
22
+ *.pt filter=lfs diff=lfs merge=lfs -text
23
+ *.pth filter=lfs diff=lfs merge=lfs -text
24
+ *.rar filter=lfs diff=lfs merge=lfs -text
25
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
26
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
27
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
28
+ *.tar filter=lfs diff=lfs merge=lfs -text
29
+ *.tflite filter=lfs diff=lfs merge=lfs -text
30
+ *.tgz filter=lfs diff=lfs merge=lfs -text
31
+ *.wasm filter=lfs diff=lfs merge=lfs -text
32
+ *.xz filter=lfs diff=lfs merge=lfs -text
33
+ *.zip filter=lfs diff=lfs merge=lfs -text
34
+ *.zst filter=lfs diff=lfs merge=lfs -text
35
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
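Note: the 35 removed and 35 added `.gitattributes` lines are textually identical in this view, so the change is most likely whitespace or line endings only (an inference; the diff does not show it). As a rough illustration of what these rules cover, the sketch below checks file names against a handful of the patterns. It only approximates Git's matching: patterns containing slashes, such as `saved_model/**/*`, follow different rules and are left out, and `LFS_PATTERNS` and `is_lfs_tracked` are hypothetical names, not part of the commit.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# A subset of the slash-free patterns listed in the .gitattributes above.
LFS_PATTERNS = ["*.bin", "*.model", "*.npy", "*.pkl", "*.safetensors", "*.tar.*", "*tfevents*"]


def is_lfs_tracked(path: str) -> bool:
    """Approximate check: match only the file's basename against each pattern."""
    name = PurePosixPath(path).name
    return any(fnmatchcase(name, pattern) for pattern in LFS_PATTERNS)


if __name__ == "__main__":
    for candidate in ["fasttext/demo/demo.model", "app.py", "runs/events.out.tfevents.1"]:
        print(candidate, is_lfs_tracked(candidate))
```

Slash-free patterns match at any directory depth, which is why only the basename is compared here.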
.gitignore ADDED
@@ -0,0 +1,2 @@
1
+ .venv/
2
+ .venv
README.md CHANGED
@@ -1,12 +1,192 @@
1
- ---
2
- title: Tatar2vec Demo
3
- emoji: 👁
4
- colorFrom: gray
5
- colorTo: gray
6
- sdk: docker
7
- pinned: false
8
- license: mit
9
- short_description: Tatar2Vec Demo - High-quality Tatar Word Embeddings Explorer
10
- ---
11
-
12
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
1
+ title: Tatar2Vec Explorer
2
+ emoji: 🏆
3
+ colorFrom: indigo
4
+ colorTo: purple
5
+ sdk: docker
6
+ pinned: true
7
+ app_file: app.py
8
+ ---
9
+
10
+ # 🏆 Tatar2Vec Explorer
11
+
12
+ <div align="center">
13
+
14
+ **Discover the Power of Tatar Language AI**
15
+
16
+ *High-quality word embeddings for the Tatar language*
17
+
18
+ [![Hugging Face](https://img.shields.io/badge/🤗-Hugging%20Face%20Space-blue)](https://huggingface.co/arabovs-ai-lab/Tatar2Vec)
19
+ [![License](https://img.shields.io/badge/License-MIT-green.svg)](LICENSE)
20
+ [![Streamlit](https://img.shields.io/badge/Interface-Streamlit-FF4B4B)](https://streamlit.io)
21
+
22
+ </div>
23
+
24
+ ## 🌟 Overview
25
+
26
+ Tatar2Vec provides FastText and Word2Vec word embeddings for the Tatar language. On the project's composite evaluation score, the best FastText model reaches 0.7019, compared with 0.2000 for Meta's cc.tt.300. This interactive demo lets you explore the models through similarity search, word analogies, and out-of-vocabulary tests.
27
+
28
+ ## 🚀 Features
29
+
30
+ ### 🔍 Semantic Search
31
+ - **Word Similarity**: Find semantically similar words
32
+ - **Vector Operations**: Perform complex word analogies
33
+ - **Interactive Visualizations**: Explore results with beautiful charts and word clouds
34
+
35
+ ### 🧠 Advanced Analytics
36
+ - **Model Comparison**: Compare FastText vs Word2Vec performance
37
+ - **OOV Handling**: Test out-of-vocabulary word capabilities
38
+ - **Performance Metrics**: Detailed model evaluation scores
39
+
40
+ ### 🎯 Model Variants
41
+ - **🥇 Best FastText**: `ft_dim100_win5_min5_ngram3-6_sg.epoch1` (Composite: 0.7019)
42
+ - **🥈 Alternative FastText**: `ft_dim100_win5_min5_ngram3-6_sg.epoch3`
43
+ - **🥇 Best Word2Vec**: `w2v_dim200_win5_min5_sg.epoch4`
44
+ - **🥈 Compact Word2Vec**: `w2v_dim100_win5_min5_sg`
45
+
46
+ ## 📊 Performance Highlights
47
+
48
+ | Model | Composite Score | Semantic Similarity | OOV Handling |
49
+ |-------|----------------|-------------------|-------------|
50
+ | **Best FastText** | 0.7019 | 0.7368 | 1.0000 |
51
+ | Meta cc.tt.300 | 0.2000 | - | - |
52
+ | **Improvement** | **3.5×** | **Significant** | **Perfect** |
53
+
54
+ ## 🎮 Quick Start
55
+
56
+ ### Try These Examples:
57
+
58
+ #### Word Similarity
59
+ ```python
60
+ # Find words similar to "мәктәп" (school)
61
+ similar_words = model.wv.most_similar('мәктәп', topn=10)
62
+ ```
63
+
64
+ #### Word Analogies
65
+ ```python
66
+ # Doctor - man + woman = ?
67
+ analogy = model.wv.most_similar(
68
+     positive=['табиб', 'хатын'],  # doctor, woman
69
+     negative=['ир']  # man
70
+ )
71
+ ```
72
+
73
+ #### OOV Testing (FastText Only)
74
+ ```python
75
+ # Handle unknown words
76
+ vector = model.wv['технологияләштерү']  # technology-related word
77
+ ```
78
+
79
+ ## 🏗️ Technical Details
80
+
81
+ ### Training Corpus
82
+ - **Total Tokens**: 203.2 million
83
+ - **Vocabulary Size**: 637.7K words
84
+ - **Unique Words**: 1.8 million
85
+ - **Domains**: Wikipedia, news, books, social media
86
+
87
+ ### Model Architecture
88
+ - **FastText**: Subword information support
89
+ - **Word2Vec**: Classical word embeddings
90
+ - **Optimized**: Skip-gram architecture; 100 dimensions for the FastText and Compact Word2Vec models, 200 for Best Word2Vec
91
+
92
+ ## 📚 Use Cases
93
+
94
+ ### 🎓 Education
95
+ - Language learning applications
96
+ - Educational content analysis
97
+ - Academic research
98
+
99
+ ### 💼 Business
100
+ - Content recommendation systems
101
+ - Search engine enhancement
102
+ - Customer feedback analysis
103
+
104
+ ### 🔬 Research
105
+ - Linguistic studies
106
+ - Cross-lingual comparisons
107
+ - AI model development
108
+
109
+ ## 🛠️ Installation
110
+
111
+ ### Local Development
112
+ ```bash
113
+ git clone https://huggingface.co/spaces/arabovs-ai-lab/tatar2vec-demo
114
+ cd tatar2vec-demo
115
+ pip install -r requirements.txt
116
+ streamlit run app.py
117
+ ```
118
+
119
+ ### Docker Deployment
120
+ ```bash
121
+ docker build -t tatar2vec-demo .
122
+ docker run -p 7860:7860 tatar2vec-demo
123
+ ```
124
+
125
+ ## 🌐 API Access
126
+
127
+ ```python
128
+ from huggingface_hub import snapshot_download
129
+ from gensim.models import FastText
130
+
131
+ # Download and load the best model
132
+ model_dir = snapshot_download(repo_id="arabovs-ai-lab/Tatar2Vec")
133
+ model = FastText.load(f"{model_dir}/fasttext/ft_dim100_win5_min5_ngram3-6_sg.epoch1/ft_dim100_win5_min5_ngram3-6_sg.epoch1.model")
134
+
135
+ # Use the model
136
+ similar_words = model.wv.most_similar('мәктәп')
137
+ ```
138
+
139
+ ## 📊 Evaluation Metrics
140
+
141
+ Our models were evaluated on multiple dimensions:
142
+ - **Semantic Similarity**: Human-judged word pairs
143
+ - **Analogy Accuracy**: Word relationship tasks
144
+ - **OOV Handling**: Unknown word processing
145
+ - **Neighbor Coherence**: Semantic consistency
146
+
147
+ ## 🤝 Contributing
148
+
149
+ We welcome contributions from the community! Areas of interest:
150
+ - Additional evaluation benchmarks
151
+ - New model architectures
152
+ - Expanded training data
153
+ - Multilingual applications
154
+
155
+ ## 📜 Citation
156
+
157
+ If you use Tatar2Vec in your research, please cite:
158
+
159
+ ```bibtex
160
+ @misc{tatar2vec2025,
161
+ title = {Tatar2Vec: High-Quality Tatar Word Embeddings},
162
+ author = {Arabovs AI Lab},
163
+ year = {2025},
164
+ publisher = {Hugging Face},
165
+ url = {https://huggingface.co/arabovs-ai-lab/Tatar2Vec},
166
+ note = {Version 1.0}
167
+ }
168
+ ```
169
+
170
+ ## 📄 License
171
+
172
+ This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
173
+
174
+ ## 🙏 Acknowledgments
175
+
176
+ - Tatar language speakers and contributors
177
+ - Hugging Face for platform support
178
+ - Open-source community for tools and libraries
179
+
180
+ ---
181
+
182
+ <div align="center">
183
+
184
+ **Empowering Tatar Language Technology**
185
+
186
+ *Brought to you by [Arabovs AI Lab](https://huggingface.co/arabovs-ai-lab)*
187
+
188
+ [Report Issues](https://github.com/arabovs-ai-lab/tatar2vec/issues) •
189
+ [Request Features](https://github.com/arabovs-ai-lab/tatar2vec/discussions) •
190
+ [Contact Team](mailto:contact@arabovs-ai-lab.com)
191
+
192
+ </div>
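Note on the README's "Word Analogies" example: the arithmetic behind a positive/negative query can be sketched without any model. The code below follows the general approach of gensim's `most_similar` (unit-normalize the inputs, add the positive vectors, subtract the negative ones, rank the remaining words by cosine similarity), not its exact implementation. The vectors are toy values invented for illustration, English placeholder words are used instead of Tatar ones to avoid implying real model output, and `analogy` is a hypothetical helper name.

```python
import math

# Toy 3-dimensional vectors, invented purely for illustration.
# Real Tatar2Vec vectors have 100 or 200 dimensions.
TOY_VECTORS = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man": [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.5, 0.5, 0.5],
}


def normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def analogy(vectors, positive, negative, topn=5):
    """Rank words by similarity to sum(positive) - sum(negative), skipping the inputs."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, normalize(vectors[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, normalize(vectors[word]))]
    excluded = set(positive) | set(negative)
    scored = [(w, cosine(query, v)) for w, v in vectors.items() if w not in excluded]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


if __name__ == "__main__":
    print(analogy(TOY_VECTORS, positive=["king", "woman"], negative=["man"]))
```

With these toy values the top-ranked word is "queen"; on a real model the result depends entirely on the training corpus.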
app.py ADDED
@@ -0,0 +1,590 @@
1
+ """
2
+ Tatar2Vec Demo - Interactive Word Embeddings Explorer
3
+ Run: streamlit run app.py
4
+ """
5
+
6
+ import streamlit as st
7
+ import pandas as pd
8
+ import numpy as np
9
+ import plotly.express as px
10
+ import plotly.graph_objects as go
11
+ from plotly.subplots import make_subplots
12
+ import tempfile
13
+ import os
14
+ import sys
15
+ from pathlib import Path
16
+ from typing import List, Dict, Tuple, Optional
17
+ import requests
18
+ import json
19
+
20
+ # Import for model loading from Hugging Face Hub
21
+ from huggingface_hub import snapshot_download
22
+ from gensim.models import FastText, Word2Vec
23
+ import gensim.downloader as api
24
+
25
+ # Page configuration
26
+ st.set_page_config(
27
+ page_title="Tatar2Vec Demo",
28
+ page_icon="🏆",
29
+ layout="wide",
30
+ initial_sidebar_state="expanded"
31
+ )
32
+
33
+ # Custom CSS for improved styling
34
+ st.markdown("""
35
+ <style>
36
+ .main-header {
37
+ font-size: 2.5rem;
38
+ color: #1f77b4;
39
+ text-align: center;
40
+ margin-bottom: 2rem;
41
+ }
42
+ .model-card {
43
+ background-color: #f0f2f6;
44
+ padding: 1.5rem;
45
+ border-radius: 10px;
46
+ border-left: 4px solid #1f77b4;
47
+ margin-bottom: 1rem;
48
+ }
49
+ .metric-card {
50
+ background-color: white;
51
+ padding: 1rem;
52
+ border-radius: 8px;
53
+ box-shadow: 0 2px 4px rgba(0,0,0,0.1);
54
+ text-align: center;
55
+ }
56
+ .word-cloud {
57
+ background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
58
+ color: white;
59
+ padding: 0.5rem 1rem;
60
+ border-radius: 20px;
61
+ display: inline-block;
62
+ margin: 0.2rem;
63
+ font-weight: 500;
64
+ }
65
+ </style>
66
+ """, unsafe_allow_html=True)
67
+
68
+ class Tatar2VecExplorer:
69
+ def __init__(self):
70
+ self.loaded_models = {}
71
+ self.available_models = {
72
+ "FastText": {
73
+ "best": "ft_dim100_win5_min5_ngram3-6_sg.epoch1",
74
+ "alternative": "ft_dim100_win5_min5_ngram3-6_sg.epoch3"
75
+ },
76
+ "Word2Vec": {
77
+ "best": "w2v_dim200_win5_min5_sg.epoch4",
78
+ "alternative": "w2v_dim100_win5_min5_sg"
79
+ }
80
+ }
81
+
82
+ @st.cache_resource(show_spinner="Loading Tatar2Vec model...")
83
+ def load_model(_self, model_name: str, model_type: str = "fasttext"):
84
+ """Load model with caching for better performance"""
85
+ try:
86
+ # Download model from Hugging Face Hub
87
+ model_dir = snapshot_download(
88
+ repo_id="arabovs-ai-lab/Tatar2Vec",
89
+ allow_patterns=f"{model_type}/{model_name}/*"
90
+ )
91
+
92
+ # Construct model path
93
+ model_path = os.path.join(model_dir, model_type, model_name, f"{model_name}.model")
94
+
95
+ # Load appropriate model type
96
+ if model_type == "fasttext":
97
+ model = FastText.load(model_path)
98
+ else:
99
+ model = Word2Vec.load(model_path)
100
+
101
+ return model
102
+ except Exception as e:
103
+ st.error(f"Error loading model: {e}")
104
+ return None
105
+
106
+ def get_model_display_name(self, model_key: str) -> str:
107
+ """Get human-readable model name"""
108
+ names = {
109
+ "ft_dim100_win5_min5_ngram3-6_sg.epoch1": "🥇 Best FastText",
110
+ "ft_dim100_win5_min5_ngram3-6_sg.epoch3": "🥈 Alternative FastText",
111
+ "w2v_dim200_win5_min5_sg.epoch4": "🥇 Best Word2Vec",
112
+ "w2v_dim100_win5_min5_sg": "🥈 Compact Word2Vec"
113
+ }
114
+ return names.get(model_key, model_key)
115
+
116
+ def get_model_performance(self, model_key: str) -> dict:
117
+ """Get model performance metrics"""
118
+ performance = {
119
+ "ft_dim100_win5_min5_ngram3-6_sg.epoch1": {
120
+ "composite": 0.7019, "semantic": 0.7368, "analogy": 0.0476,
121
+ "oov": 1.0000, "coherence": 0.9588
122
+ },
123
+ "ft_dim100_win5_min5_ngram3-6_sg.epoch3": {
124
+ "composite": 0.6675, "semantic": 0.6894, "analogy": 0.0476,
125
+ "oov": 1.0000, "coherence": 0.9388
126
+ },
127
+ "w2v_dim200_win5_min5_sg.epoch4": {
128
+ "composite": 0.5685, "semantic": 0.4445, "analogy": 0.3214,
129
+ "oov": 0.3854, "coherence": 0.7307
130
+ },
131
+ "w2v_dim100_win5_min5_sg": {
132
+ "composite": 0.5566, "semantic": 0.5187, "analogy": 0.2500,
133
+ "oov": 0.3854, "coherence": 0.8051
134
+ }
135
+ }
136
+ return performance.get(model_key, {})
137
+
138
+ def find_similar_words(self, model, word: str, topn: int = 10):
139
+ """Find semantically similar words"""
140
+ try:
141
+ if hasattr(model, 'wv'):
142
+ return model.wv.most_similar(word, topn=topn)
143
+ else:
144
+ return model.most_similar(word, topn=topn)
145
+ except KeyError:
146
+ return []
147
+ except Exception as e:
148
+ st.error(f"Error finding similar words: {e}")
149
+ return []
150
+
151
+ def word_analogy(self, model, positive: List[str], negative: List[str], topn: int = 5):
152
+ """Perform word analogy operation (king - man + woman = queen)"""
153
+ try:
154
+ if hasattr(model, 'wv'):
155
+ return model.wv.most_similar(positive=positive, negative=negative, topn=topn)
156
+ else:
157
+ return model.most_similar(positive=positive, negative=negative, topn=topn)
158
+ except Exception as e:
159
+ st.error(f"Error performing analogy: {e}")
160
+ return []
161
+
162
+ def get_word_vector(self, model, word: str):
163
+ """Get word vector representation"""
164
+ try:
165
+ if hasattr(model, 'wv'):
166
+ return model.wv[word]
167
+ else:
168
+ return model[word]
169
+ except KeyError:
170
+ return None
171
+
172
+ def handle_oov_words(self, model, words: List[str]):
173
+ """Handle Out-of-Vocabulary words (FastText only)"""
174
+ results = []
175
+ for word in words:
176
+ try:
177
+ vector = self.get_word_vector(model, word)
178
+ similar = self.find_similar_words(model, word, 3)
179
+ results.append({
180
+ 'word': word,
181
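+ 'in_vocab': word in getattr(model, 'wv', model).key_to_index, # a FastText vector lookup also succeeds for OOV words, so test vocabulary membership directly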
182
+ 'similar_words': similar
183
+ })
184
+ except Exception:
185
+ results.append({
186
+ 'word': word,
187
+ 'in_vocab': False,
188
+ 'similar_words': []
189
+ })
190
+ return results
191
+
192
+ def create_performance_comparison():
193
+ """Create model performance comparison charts"""
194
+ models = [
195
+ "ft_dim100_win5_min5_ngram3-6_sg.epoch1",
196
+ "ft_dim100_win5_min5_ngram3-6_sg.epoch3",
197
+ "w2v_dim200_win5_min5_sg.epoch4",
198
+ "w2v_dim100_win5_min5_sg",
199
+ "cc.tt.300 (Meta)"
200
+ ]
201
+
202
+ composite_scores = [0.7019, 0.6675, 0.5685, 0.5566, 0.2000]
203
+ semantic_scores = [0.7368, 0.6894, 0.4445, 0.5187, None]
204
+
205
+ # Create subplots for comparison
206
+ fig = make_subplots(
207
+ rows=1, cols=2,
208
+ subplot_titles=('Composite Score', 'Semantic Similarity'),
209
+ specs=[[{"type": "bar"}, {"type": "bar"}]]
210
+ )
211
+
212
+ # Composite scores
213
+ fig.add_trace(
214
+ go.Bar(name='Composite Score', x=models, y=composite_scores,
215
+ marker_color=['#1f77b4', '#1f77b4', '#ff7f0e', '#ff7f0e', '#d62728']),
216
+ row=1, col=1
217
+ )
218
+
219
+ # Filter out None values for semantic similarity
220
+ semantic_models = [models[i] for i in range(len(models)) if semantic_scores[i] is not None]
221
+ semantic_values = [score for score in semantic_scores if score is not None]
222
+
223
+ # Semantic similarity scores
224
+ fig.add_trace(
225
+ go.Bar(name='Semantic Similarity', x=semantic_models, y=semantic_values,
226
+ marker_color=['#1f77b4', '#1f77b4', '#ff7f0e', '#ff7f0e']),
227
+ row=1, col=2
228
+ )
229
+
230
+ fig.update_layout(
231
+ title_text="Model Performance Comparison",
232
+ showlegend=False,
233
+ height=400
234
+ )
235
+
236
+ return fig
237
+
238
+ def create_word_cloud(similar_words, title):
239
+ """Create word cloud visualization for similar words"""
240
+ if not similar_words:
241
+ return None
242
+
243
+ words = [word for word, score in similar_words]
244
+ scores = [score for word, score in similar_words]
245
+
246
+ # Normalize scores for font sizes
247
+ sizes = [30 + (score * 70) for score in scores]
248
+
249
+ fig = go.Figure()
250
+
251
+ # Add each word as annotation with random position
252
+ for i, (word, size) in enumerate(zip(words, sizes)):
253
+ fig.add_annotation(
254
+ text=word,
255
+ x=np.random.uniform(0.1, 0.9),
256
+ y=np.random.uniform(0.1, 0.9),
257
+ showarrow=False,
258
+ font=dict(size=size, color=f"hsl({i*40}, 70%, 50%)"),
259
+ bgcolor="rgba(255,255,255,0.7)",
260
+ bordercolor="rgba(0,0,0,0.1)",
261
+ borderwidth=1,
262
+ borderpad=2,
263
+ )
264
+
265
+ fig.update_layout(
266
+ title=title,
267
+ xaxis=dict(showticklabels=False, showgrid=False, zeroline=False),
268
+ yaxis=dict(showticklabels=False, showgrid=False, zeroline=False),
269
+ plot_bgcolor='rgba(0,0,0,0)',
270
+ height=300,
271
+ margin=dict(l=20, r=20, t=40, b=20)
272
+ )
273
+
274
+ return fig
275
+
276
+ def main():
277
+ # Application header
278
+ st.markdown('<h1 class="main-header">🏆 Tatar2Vec Demo - Tatar Word Embeddings</h1>', unsafe_allow_html=True)
279
+
280
+ # Initialize explorer
281
+ explorer = Tatar2VecExplorer()
282
+
283
+ # Sidebar configuration
284
+ with st.sidebar:
285
+ st.header("⚙️ Model Settings")
286
+
287
+ # Model type selection
288
+ model_type = st.selectbox(
289
+ "Model Type:",
290
+ ["FastText", "Word2Vec"],
291
+ index=0
292
+ )
293
+
294
+ # Model variant selection
295
+ model_variant = st.radio(
296
+ "Model Variant:",
297
+ ["best", "alternative"],
298
+ format_func=lambda x: "🥇 Best Model" if x == "best" else "🥈 Alternative Model"
299
+ )
300
+
301
+ model_key = explorer.available_models[model_type][model_variant]
302
+
303
+ # Model information section
304
+ st.markdown("---")
305
+ st.subheader("📊 Model Information")
306
+ performance = explorer.get_model_performance(model_key)
307
+
308
+ if performance:
309
+ col1, col2 = st.columns(2)
310
+ with col1:
311
+ st.metric("Composite Score", f"{performance['composite']:.4f}")
312
+ st.metric("Semantic Similarity", f"{performance['semantic']:.4f}")
313
+ with col2:
314
+ st.metric("Analogy Accuracy", f"{performance['analogy']:.4f}")
315
+ st.metric("OOV Handling", f"{performance['oov']:.4f}")
316
+
317
+ # Quick search examples
318
+ st.markdown("---")
319
+ st.subheader("🔍 Quick Search")
320
+ quick_words = ["мәктәп", "китап", "тел", "фән", "табигать"]
321
+ selected_quick = st.selectbox("Example words:", quick_words)
322
+
323
+ if st.button("Quick Similarity Search"):
324
+ st.session_state.quick_search = selected_quick
325
+
326
+ # Main content area with tabs
327
+ tab1, tab2, tab3, tab4 = st.tabs(["🔍 Word Search", "🧠 Analogies", "📊 Analysis", "ℹ️ About"])
328
+
329
+ with tab1:
330
+ st.header("Similar Word Search")
331
+
332
+ col1, col2 = st.columns([2, 1])
333
+
334
+ with col1:
335
+ search_word = st.text_input(
336
+ "Enter Tatar word:",
337
+ value=getattr(st.session_state, 'quick_search', 'мәктәп'),
338
+ placeholder="e.g., мәктәп, китап, тел..."
339
+ )
340
+
341
+ with col2:
342
+ top_n = st.slider("Number of similar words:", 5, 20, 10)
343
+
344
+ if st.button("Find Similar Words") or search_word:
345
+ with st.spinner(f"Loading model and finding words similar to '{search_word}'..."):
346
+ model = explorer.load_model(model_key, model_type.lower())
347
+
348
+ if model and search_word.strip():
349
+ similar_words = explorer.find_similar_words(model, search_word.strip(), top_n)
350
+
351
+ if similar_words:
352
+ # Display results in two columns
353
+ col1, col2 = st.columns([1, 1])
354
+
355
+ with col1:
356
+ st.subheader("📈 Similar Words")
357
+ df = pd.DataFrame(similar_words, columns=["Word", "Similarity"])
358
+ st.dataframe(df, use_container_width=True)
359
+
360
+ with col2:
361
+ fig = create_word_cloud(similar_words, f"Words similar to '{search_word}'")
362
+ if fig:
363
+ st.plotly_chart(fig, use_container_width=True)
364
+
365
+ # Additional information
366
+ st.subheader("📋 Details")
367
+ col1, col2, col3 = st.columns(3)
368
+
369
+ with col1:
370
+ try:
371
+ vector = explorer.get_word_vector(model, search_word.strip())
372
+ if vector is not None:
373
+ st.metric("Vector Dimension", len(vector))
374
+ except Exception:
375
+ pass
376
+
377
+ with col2:
378
+ st.metric("Similar Words Found", len(similar_words))
379
+
380
+ with col3:
381
+ if similar_words:
382
+ st.metric("Max Similarity", f"{similar_words[0][1]:.4f}")
383
+
384
+ else:
385
+ st.warning(f"Word '{search_word}' not found in model vocabulary.")
386
+
387
+ with tab2:
388
+ st.header("Word Analogies")
389
+
390
+ st.markdown("""
391
+ **Example:** табиб - ир + хатын = ? (doctor - man + woman = female doctor)
392
+ """)
393
+
394
+ col1, col2, col3 = st.columns(3)
395
+
396
+ with col1:
397
+ positive1 = st.text_input("Positive word 1:", st.session_state.get("analogy_p1", "табиб"), placeholder="doctor")
398
+ positive2 = st.text_input("Positive word 2:", st.session_state.get("analogy_p2", "хатын"), placeholder="woman")
399
+
400
+ with col2:
401
+ negative = st.text_input("Negative word:", st.session_state.get("analogy_n", "ир"), placeholder="man")
402
+
403
+ with col3:
404
+ analogy_topn = st.slider("Number of results:", 3, 10, 5)
405
+
406
+ if st.button("Perform Analogy"):
407
+ if positive1 and positive2 and negative:
408
+ with st.spinner("Performing analogy..."):
409
+ model = explorer.load_model(model_key, model_type.lower())
410
+
411
+ if model:
412
+ analogy_results = explorer.word_analogy(
413
+ model,
414
+ positive=[positive1, positive2],
415
+ negative=[negative],
416
+ topn=analogy_topn
417
+ )
418
+
419
+ if analogy_results:
420
+ st.subheader("🎯 Analogy Results")
421
+
422
+ df = pd.DataFrame(analogy_results, columns=["Word", "Similarity"])
423
+ st.dataframe(df, use_container_width=True)
424
+
425
+ # Visualization
426
+ fig = px.bar(
427
+ df,
428
+ x='Similarity',
429
+ y='Word',
430
+ orientation='h',
431
+ title=f"Analogy: {positive1} - {negative} + {positive2}",
432
+ color='Similarity',
433
+ color_continuous_scale='viridis'
434
+ )
435
+ fig.update_layout(yaxis={'categoryorder':'total ascending'})
436
+ st.plotly_chart(fig, use_container_width=True)
437
+ else:
438
+ st.error("Could not perform analogy. Please check the input words.")
439
+
440
+ # Predefined analogy examples
441
+ st.subheader("🎪 Example Analogies")
442
+
443
+ presets = {
444
+ "Education": ("укытучы", "мәктәп", "өй", "teacher - home + school"),
445
+ "Family": ("ата", "кыз", "ул", "father - son + daughter"),
446
+ "Professions": ("шеф", "аш", "ресторан", "chef - restaurant + food")
447
+ }
448
+
449
+ cols = st.columns(len(presets))
450
+ for idx, (name, (p1, p2, n, desc)) in enumerate(presets.items()):
451
+ with cols[idx]:
452
+ if st.button(f"🧩 {name}", key=f"preset_{idx}"):
453
+ st.session_state.analogy_p1 = p1
454
+ st.session_state.analogy_p2 = p2
455
+ st.session_state.analogy_n = n
456
+ st.rerun()
457
+
458
+ with tab3:
459
+ st.header("Model Analysis")
460
+
461
+ # Performance comparison
462
+ st.subheader("📊 Model Performance Comparison")
463
+ perf_fig = create_performance_comparison()
464
+ st.plotly_chart(perf_fig, use_container_width=True)
465
+
466
+ # OOV words testing
467
+ st.subheader("🔤 OOV (Out-of-Vocabulary) Testing")
468
+
469
+ st.markdown("""
470
+ **FastText models** can handle words not seen during training
471
+ thanks to subword information.
472
+ """)
473
+
474
+ oov_words = st.text_area(
475
+ "Enter words for OOV testing (one per line):",
476
+ "технологияләштерү\nцифрлаштыру\nвиртуальлаштыру\nмәктәпчә"
477
+ )
478
+
479
+ if st.button("Test OOV") and model_type == "FastText":
480
+ test_words = [word.strip() for word in oov_words.split('\n') if word.strip()]
481
+
482
+ with st.spinner("Testing OOV words..."):
483
+ model = explorer.load_model(model_key, "fasttext")
484
+
485
+ if model:
486
+ results = explorer.handle_oov_words(model, test_words)
487
+
488
+ st.subheader("OOV Testing Results")
489
+
490
+ for result in results:
491
+ col1, col2 = st.columns([1, 3])
492
+
493
+ with col1:
494
+ status = "✅ In Vocabulary" if result['in_vocab'] else "🆕 OOV Word"
495
+ st.write(f"**{result['word']}** - {status}")
496
+
497
+ with col2:
498
+ if result['similar_words']:
499
+ similar_str = ", ".join([f"{word}({score:.3f})" for word, score in result['similar_words']])
500
+ st.write(f"Similar: {similar_str}")
501
+ else:
502
+ st.write("No similar words found")
503
+
504
+ # Model comparison
505
+ st.subheader("🔄 Model Comparison")
506
+
507
+ compare_words = st.text_input("Words to compare across models (comma-separated):", "мәктәп, китап, тел, фән")
508
+
509
+ if st.button("Compare Models"):
510
+ words_to_compare = [word.strip() for word in compare_words.split(',')]
511
+
512
+ comparison_data = []
513
+
514
+ for model_type_comp in ["FastText", "Word2Vec"]:
515
+ for variant in ["best", "alternative"]:
516
+ model_key_comp = explorer.available_models[model_type_comp][variant]
517
+
518
+ with st.spinner(f"Testing {model_key_comp}..."):
519
+ model = explorer.load_model(model_key_comp, model_type_comp.lower())
520
+
521
+ if model:
522
+ for word in words_to_compare:
523
+ similar = explorer.find_similar_words(model, word, 3)
524
+ if similar:
525
+ for sim_word, score in similar:
526
+ comparison_data.append({
527
+ 'Model': explorer.get_model_display_name(model_key_comp),
528
+ 'Type': model_type_comp,
529
+ 'Source Word': word,
530
+ 'Similar Word': sim_word,
531
+ 'Similarity': score
532
+ })
533
+
534
+ if comparison_data:
535
+ df_compare = pd.DataFrame(comparison_data)
536
+ st.dataframe(df_compare, use_container_width=True)
537
+
538
+ with tab4:
539
+ st.header("ℹ️ About Tatar2Vec")
540
+
541
+ st.markdown("""
542
+ ## 🏆 Tatar2Vec - High-quality Tatar Word Embeddings
543
+
544
+ This repository contains the best performing FastText and Word2Vec models for Tatar,
545
+ selected through comprehensive evaluation of 57 different model configurations.
546
+
547
+ ### 🎯 Key Features:
548
+
549
+ - **High Quality**: Our models significantly outperform pre-trained Meta models
550
+ - **Large Vocabulary**: 637.7K words
551
+ - **Multiple Architectures**: FastText and Word2Vec
552
+ - **OOV Support**: FastText models handle out-of-vocabulary words
553
+
554
+ ### 📊 Key Results:
555
+
556
+ - **Best Model**: FastText with composite score 0.7019 (vs 0.2000 for Meta)
557
+ - **Best Architecture**: Skip-gram outperforms CBOW
558
+ - **Optimal Dimension**: 100-dimensional models perform better than 200/300-dimensional
559
+
560
+ ### 🎪 Use Cases:
561
+
562
+ - Semantic similarity search
563
+ - Word analogies
564
+ - Text classification
565
+ - Machine translation
566
+ - And much more!
567
+
568
+ ### 📚 Training Corpus:
569
+
570
+ - **Total Tokens**: 203.2 million
571
+ - **Unique Words**: 1.8 million
572
+ - **Sources**: Wikipedia, news, books, social media
573
+
574
+ ### 📜 Citation:
575
+
576
+ ```bibtex
577
+ @misc{Tatar2Vec_20251109,
578
+ title = {Tatar2Vec: Tatar Word Embeddings},
579
+ author = {Arabovs AI Lab},
580
+ year = 2025,
581
+ publisher = {Hugging Face},
582
+ url = {https://huggingface.co/arabovs-ai-lab/Tatar2Vec}
583
+ }
584
+ ```
585
+
586
+ ### 📄 License: MIT License
587
+ """)
588
+
589
+ if __name__ == "__main__":
590
+ main()
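Note on the OOV test in `app.py`: FastText can produce a vector for an unseen word because it composes vectors from character n-grams, so a vector lookup on a FastText model generally succeeds for unseen words and cannot by itself tell in-vocabulary words from OOV ones. The sketch below shows only the n-gram extraction step. It assumes that `ngram3-6` in the model names means n-gram lengths 3 to 6, and it omits the hashing of n-grams into buckets that the real implementation performs; `char_ngrams` and `shared_ngrams` are hypothetical helper names.

```python
def char_ngrams(word: str, min_n: int = 3, max_n: int = 6) -> list[str]:
    """Return character n-grams of the word wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    return [
        wrapped[start:start + n]
        for n in range(min_n, max_n + 1)
        for start in range(len(wrapped) - n + 1)
    ]


def shared_ngrams(known: str, unseen: str) -> set[str]:
    """N-grams that an unseen word has in common with a known word."""
    return set(char_ngrams(known)) & set(char_ngrams(unseen))


if __name__ == "__main__":
    # "мәктәпчә" is one of the sample OOV inputs in app.py; "мәктәп" is a quick-search word.
    print(char_ngrams("тел"))
    print(sorted(shared_ngrams("мәктәп", "мәктәпчә")))
```

The overlap between the two n-gram sets is what lets an unseen derived form land near its base word in vector space.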
dockerfile ADDED
@@ -0,0 +1,21 @@
1
+ FROM python:3.11-slim
2
+
3
+ WORKDIR /app
4
+
5
+ RUN apt-get update && apt-get install -y \
6
+ build-essential \
7
+ curl \
8
+ git \
9
+ && rm -rf /var/lib/apt/lists/*
10
+
11
+ COPY requirements.txt .
12
+
13
+ RUN pip3 install --no-cache-dir -r requirements.txt
14
+
15
+ COPY . .
16
+
17
+ EXPOSE 7860
18
+
19
+ HEALTHCHECK CMD curl --fail http://localhost:7860/_stcore/health
20
+
21
+ ENTRYPOINT ["streamlit", "run", "app.py", "--server.port=7860", "--server.address=0.0.0.0"]
requirements.txt ADDED
Binary file (1.96 kB). View file
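Note: the diff view reports `requirements.txt` as a binary file with `+0 -0`. One common cause, and only an assumption here since the file's contents are not shown, is that a text file was saved as UTF-16 (for example by a shell redirect on Windows), which Git then treats as binary. A minimal sketch for reading such a file regardless of encoding and rewriting it as UTF-8 could look like this; `decode_requirements` is a hypothetical helper and the package names in the demo are placeholders, not the file's actual contents.

```python
import codecs


def decode_requirements(raw: bytes) -> str:
    """Decode requirements bytes, honouring a UTF-16 or UTF-8 byte-order mark if present."""
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The "utf-16" codec consumes the BOM and uses the byte order it indicates.
        return raw.decode("utf-16")
    # "utf-8-sig" strips a UTF-8 BOM when there is one and is plain UTF-8 otherwise.
    return raw.decode("utf-8-sig")


if __name__ == "__main__":
    sample = "streamlit\ngensim\n".encode("utf-16")  # placeholder contents
    text = decode_requirements(sample)
    print(text.encode("utf-8"))  # bytes suitable for writing back as UTF-8
```

If the file turns out to be UTF-8 already, the function returns it unchanged apart from dropping any BOM.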